Smart ordering system with Phi-3 small models and Elastic

This article shows you how to combine Microsoft's efficient Phi-3 models with Elastic's semantic search capabilities to create a smart, conversational ordering system. We'll walk through deploying Phi-3 on Azure AI Studio, setting up Elastic, and building an application for an Italian restaurant.

In April 2024, Microsoft announced its state-of-the-art Phi-3 family of small models: cost-efficient models optimized to perform under restricted resource conditions with low latency, making them ideal for high-volume, real-time tasks.

The Phi-3-small family consists of:

  • Phi-3-small-8K-instruct
  • Phi-3-small-128K-instruct

Both models have 7B parameters, with maximum supported context lengths of 8K and 128K tokens, respectively. Note that the instruct suffix means these models, unlike chat models, are not trained for a conversational role-play flow, but to follow specific instructions. You are not meant to chat with them, but rather to give them tasks to achieve.

Phi-3 models are a good fit for local devices, offline environments, and scoped, specific tasks: for example, interacting with sensor systems using natural language in places like a mine or a farm, where internet access is not always available, or analyzing restaurant orders in real time, where a fast response is required.

L'asticco Italian Cuisine

L'asticco Italian Cuisine is a popular restaurant with a wide and customizable menu. The chefs are complaining because orders come in with errors, so they are pushing to put tablets on the tables so customers can directly select what they want to eat. The owners, however, want to keep the human touch and keep orders conversational between customers and waiters.

This is a good opportunity to take advantage of the Phi-3 small models, since we can ask the model to listen to the customers and extract the dish and customizations. The waiter can still give recommendations to the customer and confirm the order generated by our application, keeping everyone happy. Let's help the staff take orders!

An order is generally taken over many turns, and it may include pauses, so we will let the application fill the order across multiple turns by keeping the state of the order in memory. In addition, to keep the context window small, we will not pass the entire menu to the model; instead, we will store the dishes in Elastic and retrieve them using semantic search.

The flow of the application will look like this:

[Diagram: application flow, from customer request, to semantic search over the menu in Elastic, to a Phi-3 completion that updates the order state]

You can follow the notebook to reproduce this article's example here.

Steps

  1. Deploying on Azure AI Studio
  2. Creating embeddings endpoint
  3. Creating completion endpoint
  4. Creating index
  5. Indexing data
  6. Taking orders

Deploying on Azure AI Studio

To deploy a model on Azure AI Studio, you must create an account, find the model in the catalog (https://ai.azure.com/explore/models?selectedCollection=phi), and then deploy it:

[Video: deploying the model on Azure AI Studio]

Select Managed Compute as the deployment option, and then save the target URI and key; we will need them in the next steps.

If you want to read more about Azure AI Studio, we have an in-depth article you can read here.
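
In the rest of this article, we will show each step as a Kibana Dev Tools request and, where useful, as a small Python snippet. The Python snippets share the minimal setup below. This is a sketch: the environment variable names (ELASTIC_URL, ELASTIC_API_KEY) and the es_request helper are our own choices, not part of any product.

import os
import requests

# Connection details for your Elasticsearch deployment (names are illustrative).
ES_URL = os.environ["ELASTIC_URL"]
ES_API_KEY = os.environ["ELASTIC_API_KEY"]

HEADERS = {
    "Authorization": f"ApiKey {ES_API_KEY}",
    "Content-Type": "application/json",
}

def es_request(method: str, path: str, body: dict | None = None) -> dict:
    """Send a JSON request to Elasticsearch and return the parsed response."""
    response = requests.request(method, f"{ES_URL}{path}", headers=HEADERS, json=body)
    response.raise_for_status()
    return response.json()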

Creating embeddings endpoint

For the embeddings endpoint, we are going to use our ELSER model. This will help us find the right dish even if the customer does not use its exact name. Embeddings are also useful for recommending dishes based on the semantic similarity between what the customer says and the dish description.

To create the ELSER endpoint you can use the Kibana UI:

[Video: creating the ELSER endpoint in Kibana]

Or the _inference API:

PUT _inference/sparse_embedding/elser-embeddings
{
  "service": "elser",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  }
}
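
If you prefer Python, the same endpoint can be created with the es_request helper from the setup sketch above:

# Create the ELSER sparse-embedding inference endpoint (same body as the console request).
es_request("PUT", "/_inference/sparse_embedding/elser-embeddings", {
    "service": "elser",
    "service_settings": {
        "model_id": ".elser_model_2",
        "num_allocations": 1,
        "num_threads": 1,
    },
})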

Creating completion endpoint

We can easily connect to the Phi-3 model deployed on Azure AI Studio using the Elastic Open Inference Service. Select "realtime" as the endpoint type, and paste the target URI and key you saved in the first step.

PUT _inference/completion/phi3-completion
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "<API_KEY>",
    "target": "<TARGET_URL>",
    "provider": "microsoft_phi",
    "endpoint_type": "realtime"
  }
}
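
Before wiring the endpoint into the application, a quick smoke test from Python helps confirm that Elastic can reach the Azure deployment. This is a sketch using the helper above; the test prompt is arbitrary:

# Minimal check that the Phi-3 completion endpoint responds.
result = es_request("POST", "/_inference/completion/phi3-completion", {
    "input": "Reply with the single word OK."
})
print(result["completion"][0]["result"])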

Creating index

We are going to ingest the restaurant's dishes, one document per dish, applying the semantic_text mapping type to the description field. We will store the dish customizations as well.

PUT lasticco-menu
{
  "mappings": {
    "properties": {
      "code": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "semantic_text",
        "inference_id": "elser-embeddings"
      },
      "price": {
        "type": "double"
      },
      "customizations": {
        "type": "object"
      }
    }
  }
}
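
The equivalent Python call, a sketch using the helper from the setup above:

# Create the menu index; semantic_text routes the description field through ELSER.
es_request("PUT", "/lasticco-menu", {
    "mappings": {
        "properties": {
            "code": {"type": "keyword"},
            "title": {"type": "text"},
            "description": {"type": "semantic_text", "inference_id": "elser-embeddings"},
            "price": {"type": "double"},
            "customizations": {"type": "object"},
        }
    },
})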

Indexing data

The L'asticco menu is very large, so we will ingest only a few dishes:

POST lasticco-menu/_doc
{
  "code": "carbonara",
  "title": "Pasta Carbonara",
  "description": "Pasta Carbonara \n Perfectly al dente spaghetti enrobed in a velvety sauce of farm-fresh eggs, aged Pecorino Romano, and smoky guanciale. Finished with a kiss of cracked black pepper for a classic Roman indulgence.",
  "price": 14.99,
  "customizations": {
    "vegetarian": [
      true,
      false
    ],
    "cream": [
      true,
      false
    ],
    "extras": [
      "cheese",
      "garlic",
      "ham"
    ]
  }
}

POST lasticco-menu/_doc
{
  "code": "alfredo",
  "title": "Chicken Alfredo",
  "description": "Chicken Alfredo \n Recipe includes golden pan-fried seasoned chicken breasts and tender fettuccine, coated in the most dreamy cream sauce ever, coated with a velvety garlic and Parmesan cream sauce.",
  "price": 18.99,
  "customizations": {
    "vegetarian": [
      true,
      false
    ],
    "cream": [
      true,
      false
    ],
    "extras": [
      "cheese",
      "onions",
      "olives"
    ]
  }
}  

POST lasticco-menu/_doc  
{
  "code": "gnocchi",
  "title": "Four Cheese Gnocchi",
  "description": "Four Cheese Gnocchi \n soft pillowy potato gnocchi coated in a silken cheesy sauce made of four different cheeses: Gouda, Parmigiano, Brie, and the star, Gorgonzola. The combination of four different types of cheese will make your tastebuds dance for joy.",
  "price": 15.99,
  "customizations": {
    "vegetarian": [
      true,
      false
    ],
    "cream": [
      true,
      false
    ],
    "extras": [
      "cheese",
      "bacon",
      "mushrooms"
    ]
  }
}
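
From Python, you can keep the dishes in a list and index them in a short loop. A sketch follows; only the first document is spelled out here, and the other two follow the same shape as above:

# Dish documents as shown above (add the other dishes in the same shape).
menu_dishes = [
    {
        "code": "carbonara",
        "title": "Pasta Carbonara",
        "description": "Pasta Carbonara \n Perfectly al dente spaghetti enrobed in a velvety sauce of farm-fresh eggs, aged Pecorino Romano, and smoky guanciale. Finished with a kiss of cracked black pepper for a classic Roman indulgence.",
        "price": 14.99,
        "customizations": {
            "vegetarian": [True, False],
            "cream": [True, False],
            "extras": ["cheese", "garlic", "ham"],
        },
    },
]

# Indexing into the semantic_text mapping triggers ELSER inference automatically.
for dish in menu_dishes:
    es_request("POST", "/lasticco-menu/_doc", dish)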

Order schema

Before we start asking questions, we must define the order schema, which looks like this:

{
  "order": [
    {
      "code": "carbonara",
      "qty": 1,
      "customizations": [
        {
          "vegan": true
        }
      ]
    },
    {
      "code": "alfredo",
      "qty": 2,
      "customizations": [
        {
          "extras": ["cheese"]
        }
      ]
    },
    {
      "code": "gnocchi",
      "qty": 1,
      "customizations": [
        {
          "extras": ["mushrooms"]
        }
      ]
    }
  ]
}
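
In application code, it helps to pin this schema down with explicit types, so the state passed between turns stays consistent. A minimal sketch using TypedDict; the field names follow the schema above:

from typing import TypedDict

class OrderItem(TypedDict, total=False):
    code: str                    # dish code from the menu index
    qty: int
    customizations: list[dict]   # e.g. [{"cream": True}] or [{"extras": ["cheese"]}]

class Order(TypedDict):
    order: list[OrderItem]

# The order starts empty and is carried across conversation turns.
empty_order: Order = {"order": []}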

Retrieving documents

For each customer request, we must use semantic search to retrieve the dishes the customer is most likely referring to, and then hand that information to the model in the next step.

GET lasticco-menu/_search
{
  "query": {
    "semantic": {
      "field": "description",
      "query": "may I have a carbonara with cream and bacon?"
    }
  }
}

We get the following dish back:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 18.255896,
    "hits": [
      {
        "_index": "lasticco-menu",
        "_id": "YAzAs5ABcaO5Zio3qjNd",
        "_score": 18.255896,
        "_source": {
          "code": "carbonara",
          "price": 14.99,
          "description": {
            "text": """Pasta Carbonara 
 Perfectly al dente spaghetti enrobed in a velvety sauce of farm-fresh eggs, aged Pecorino Romano, and smoky pancetta. Finished with a kiss of cracked black pepper for a classic Roman indulgence.""",
            "inference": {
              "inference_id": "elser-embeddings",
              "model_settings": {
                "task_type": "sparse_embedding"
              },
              "chunks": [
                {
                  "text": """Pasta Carbonara 
 Perfectly al dente spaghetti enrobed in a velvety sauce of farm-fresh eggs, aged Pecorino Romano, and smoky pancetta. Finished with a kiss of cracked black pepper for a classic Roman indulgence.""",
                  "embeddings": {
                    "carbon": 2.2051733,
                    "pasta": 2.1578376,
                    "spaghetti": 1.9468538,
                    "##ara": 1.8116782,
                    "romano": 1.7329221,
                    "dent": 1.6995606,
                    "roman": 1.6205788,
                    "sauce": 1.6086094,
                    "al": 1.5249499,
                    "velvet": 1.5057548,
                    "eggs": 1.4711059,
                    "kiss": 1.4058907,
                    "cracked": 1.3383057,
                    ...
                  }
                }
              ]
            }
          },
          "title": "Pasta Carbonara",
          "customizations": {
            "extras": [
              "cheese",
              "garlic",
              "ham"
            ],
            "cream": [
              true,
              false
            ],
            "vegetarian": [
              true,
              false
            ]
          }
        }
      }
    ]
  }
}

For this example, we are using the dish description for the search. We can provide the rest of the fields to the model for additional context, or use them as query filters.
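
Wrapped as a Python function, the retrieval step might look like the sketch below. We return only the fields the model needs, which keeps the prompt, and therefore the context window, small:

def retrieve_dishes(request: str, size: int = 3) -> list[dict]:
    """Return the menu entries most relevant to the customer request."""
    # _search accepts POST as well as GET.
    response = es_request("POST", "/lasticco-menu/_search", {
        "size": size,
        "query": {"semantic": {"field": "description", "query": request}},
    })
    dishes = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        dishes.append({
            "code": source["code"],
            "title": source["title"],
            # semantic_text fields come back as an object; keep only the raw text.
            "description": source["description"]["text"],
            "price": source["price"],
            "customizations": source["customizations"],
        })
    return dishes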

Taking orders

Now we have all our relevant pieces:

  • Customer request
  • Relevant dishes to the request
  • Order schema

We can ask Phi-3 to help us extract the dishes from the current request while remembering the whole order.

We can try the flow using Kibana and the _inference API.

POST _inference/completion/phi3-completion  
{
  "input": """
  Your task is to manage an order based on the AVAILABLE DISHES in the MENU and the USER REQUEST. Follow these strict rules:

1. ONLY add dishes to the order that are explicitly listed in the MENU.
2. If the requested dish is not in the MENU, do not add anything to the order.
3. The response must always be a valid JSON object containing an "order" array, even if it's empty.
4. Do not invent or hallucinate any dishes that are not in the MENU.
5. Respond only with the updated order object, nothing else.

Example of an order object:
{
  "order": [
    {
      "code": "<dish_code>",
      "qty": 1
    },
    {
      "code": "<dish_code>",
      "qty": 2,
      "customizations": "<customizations>"
    }
  ]
}

MENU:
[
  {
    "code": "carbonara",
    "title": "Pasta Carbonara",
    "description": "Pasta Carbonara \n Perfectly al dente spaghetti enrobed in a velvety sauce of farm-fresh eggs, aged Pecorino Romano, and smoky pancetta. Finished with a kiss of cracked black pepper for a classic Roman indulgence.",
    "price": 14.99,
    "customizations": {
      "vegetarian": [true, false],
      "cream": [true, false],
      "extras": ["cheese",  "garlic", "ham"]
    }
  }
]

CURRENT ORDER:
{
  "order": []
}

USER REQUEST: may I have a carbonara with cream and extra cheese?

Remember:

If the requested dish is not in the MENU, return the current order unchanged.
Customizations should be added as an object with the same structure as in the MENU.
For boolean customizations, use true/false values.
For array customizations, use an array with the selected items.

"""
}

This will give us the updated order, which we must send in our next request as the CURRENT ORDER until the order is finished:

{
  "completion": [
    {
      "result": """{
  "order": [
    {
      "code": "carbonara",
      "qty": 1,
      "customizations": {
        "vegan": false,
        "cream": true,
        "extras": ["cheese"]
      }
    }
  ]
}"""
    }
  ]
}
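
In Python, the prompt assembly and the completion call can be wrapped as a single function. The sketch below abbreviates the instructions shown in the full console prompt above, and it falls back to the unchanged order if the model's reply cannot be parsed as JSON:

import json

# Abbreviated version of the instructions shown in the console example above.
PROMPT_TEMPLATE = """Your task is to manage an order based on the AVAILABLE DISHES in the MENU and the USER REQUEST.
Only add dishes that are explicitly listed in the MENU.
Respond only with the updated order object as valid JSON, nothing else.

MENU:
{menu}

CURRENT ORDER:
{order}

USER REQUEST: {request}
"""

def update_order(order: dict, request: str, dishes: list[dict]) -> dict:
    """Ask Phi-3 to fold the latest customer request into the running order."""
    prompt = PROMPT_TEMPLATE.format(
        menu=json.dumps(dishes, indent=2),
        order=json.dumps(order, indent=2),
        request=request,
    )
    result = es_request("POST", "/_inference/completion/phi3-completion", {"input": prompt})
    try:
        return json.loads(result["completion"][0]["result"])
    except (json.JSONDecodeError, KeyError, IndexError):
        return order  # keep the previous state if the reply is not valid JSON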

This captures the user's requests and keeps the order status updated, all with low latency and high efficiency. A possible improvement is to add recommendations or to take allergies into account, so the model can add or remove dishes from the order, or recommend dishes to the user. Do you think you can add that feature?

In the notebook you will find the full working example, capturing the user input in a loop.
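
With retrieve_dishes and update_order in place, that loop reduces to a few lines. A sketch, where typing "done" finishes the order:

def take_order() -> dict:
    """Interactive loop: one customer request per turn until the order is done."""
    order = {"order": []}
    while True:
        request = input("Customer: ").strip()
        if request.lower() == "done":
            return order
        dishes = retrieve_dishes(request)             # semantic search in Elastic
        order = update_order(order, request, dishes)  # Phi-3 completion
        print(json.dumps(order, indent=2))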

Conclusion

Phi-3 small models are a powerful solution for tasks that demand low cost and low latency while still handling complex requests like language understanding and data extraction. With Azure AI Studio, you can easily deploy the model and consume it from Elastic seamlessly using the Open Inference Service. To overcome the context length limitation of these small models when managing big data volumes, a RAG approach with Elastic as the vector database is the way to go.

If you are interested in reproducing the examples in this article, you can find the Python notebook with the requests here.
