Chat completion inference API


Streams a chat completion response.

The inference APIs enable you to use certain services, such as built-in machine learning models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face. For built-in models and models uploaded through Eland, the inference APIs offer an alternative way to use and manage trained models. However, if you do not plan to use the inference APIs to use these models or if you want to use non-NLP models, use the Machine learning trained model APIs.

Request

POST /_inference/<inference_id>/_unified

POST /_inference/chat_completion/<inference_id>/_unified

Prerequisites

  • Requires the monitor_inference cluster privilege (the built-in inference_admin and inference_user roles grant this privilege)
  • You must use a client that supports streaming.

Description

The chat completion inference API enables real-time responses for chat completion tasks by delivering answers incrementally, reducing response times during computation. It only works with the chat_completion task type for openai and elastic inference services.

The chat_completion task type is only available within the _unified API and only supports streaming.

Path parameters

<inference_id>
(Required, string) The unique identifier of the inference endpoint.
<task_type>
(Optional, string) The type of inference task that the model performs. If included, this must be set to the value chat_completion.

Request body

messages

(Required, array of objects) A list of objects representing the conversation. Requests should generally only add new messages from the user (role user). The other message roles (assistant, system, or tool) should generally only be copied from the response to a previous completion request, such that the messages array is built up throughout a conversation.
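As a minimal sketch of this convention, the snippet below shows a client building up the messages array across two turns: only new user messages are appended by the client, while the assistant reply is copied back from the previous response. The endpoint name and reply text are placeholders, not values from the API.

```python
import json

# Sketch of how the messages array grows across a conversation.
messages = []

# Turn 1: the client adds only the new user message.
messages.append({"role": "user", "content": "What is Elastic?"})
# ...POST {"messages": messages} to
# /_inference/chat_completion/<inference_id>/_unified and stream the reply...

# Turn 2: copy the assistant reply from the previous response back into
# the array, then append the next user message.
assistant_reply = "Elastic is a search company."  # placeholder, assembled from deltas
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "What is ELSER?"})

body = json.dumps({"messages": messages})
print(len(messages))  # 3
```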

Assistant message
content

(Required unless tool_calls is specified, string or array of objects) The contents of the message.

Examples

String example

{
    "content": "Some string"
}

Object example

{
    "content": [
        {
            "text": "Some text",
            "type": "text"
        }
    ]
}
String representation
(Required, string) The text content.
Object representation
text
(Required, string) The text content.
type
(Required, string) This must be set to the value text.
role
(Required, string) The role of the message author. This should be set to assistant for this type of message.
tool_calls

(Optional, array of objects) The tool calls generated by the model.

Examples
{
    "tool_calls": [
        {
            "id": "call_KcAjWtAww20AihPHphUh46Gd",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": "{\"location\":\"Boston, MA\"}"
            }
        }
    ]
}
id
(Required, string) The identifier of the tool call.
type
(Required, string) The type of tool call. This must be set to the value function.
function

(Required, object) The function that the model called.

name
(Required, string) The name of the function to call.
arguments
(Required, string) The arguments to call the function with in JSON format.
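Because arguments is a JSON-encoded string rather than a nested object, clients decode it explicitly before dispatching to their own function. A small sketch using the tool call from the example above:

```python
import json

# The "arguments" field is a string containing JSON, not a nested object.
tool_call = {
    "id": "call_KcAjWtAww20AihPHphUh46Gd",
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "arguments": "{\"location\":\"Boston, MA\"}",
    },
}

# Decode the string into a dict before calling the local function.
args = json.loads(tool_call["function"]["arguments"])
print(args["location"])  # Boston, MA
```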
System message
content

(Required, string or array of objects) The contents of the message.

Examples

String example

{
    "content": "Some string"
}

Object example

{
    "content": [
        {
            "text": "Some text",
            "type": "text"
        }
    ]
}
String representation
(Required, string) The text content.
Object representation
text
(Required, string) The text content.
type
(Required, string) This must be set to the value text.
role
(Required, string) The role of the message author. This should be set to system for this type of message.
Tool message
content

(Required, string or array of objects) The contents of the message.

Examples

String example

{
    "content": "Some string"
}

Object example

{
    "content": [
        {
            "text": "Some text",
            "type": "text"
        }
    ]
}
String representation
(Required, string) The text content.
Object representation
text
(Required, string) The text content.
type
(Required, string) This must be set to the value text.
role
(Required, string) The role of the message author. This should be set to tool for this type of message.
tool_call_id
(Required, string) The tool call that this message is responding to.
User message
content

(Required, string or array of objects) The contents of the message.

Examples

String example

{
    "content": "Some string"
}

Object example

{
    "content": [
        {
            "text": "Some text",
            "type": "text"
        }
    ]
}
String representation
(Required, string) The text content.
Object representation
text
(Required, string) The text content.
type
(Required, string) This must be set to the value text.
role
(Required, string) The role of the message author. This should be set to user for this type of message.
model
(Optional, string) The ID of the model to use. By default, the model ID is set to the value included when creating the inference endpoint.
max_completion_tokens
(Optional, integer) The upper bound limit for the number of tokens that can be generated for a completion request.
stop
(Optional, array of strings) A sequence of strings to control when the model should stop generating additional tokens.
temperature
(Optional, float) The sampling temperature to use.
tools

(Optional, array of objects) A list of tools that the model can call.

Structure
type
(Required, string) The type of tool, must be set to the value function.
function

(Required, object) The function definition.

description
(Optional, string) A description of what the function does. This is used by the model to choose when and how to call the function.
name
(Required, string) The name of the function.
parameters
(Optional, object) The parameters the function accepts. This should be formatted as a JSON object.
strict
(Optional, boolean) Whether to enable schema adherence when generating the function call.
Examples
{
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_price_of_item",
                "description": "Get the current price of an item",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "item": {
                            "id": "12345"
                        },
                        "unit": {
                            "type": "currency"
                        }
                    }
                }
            }
        }
    ]
}
tool_choice

(Optional, string or object) Controls which tool is called by the model.

String representation
One of auto, none, or required. auto allows the model to choose between calling tools and generating a message. none causes the model to not call any tools. required forces the model to call one or more tools.
Object representation
Structure
type
(Required, string) The type of the tool. This must be set to the value function.
function

(Required, object)

name
(Required, string) The name of the function to call.
Examples
{
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "get_current_weather"
        }
    }
}
top_p
(Optional, float) Nucleus sampling, an alternative to sampling with temperature.

Examples

The following example performs a chat completion on the example question with streaming.

POST _inference/chat_completion/openai-completion/_stream
{
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": "What is Elastic?"
        }
    ]
}

The following example performs a chat completion using an Assistant message with tool_calls.

POST _inference/chat_completion/openai-completion/_stream
{
    "messages": [
        {
            "role": "assistant",
            "content": "Let's find out what the weather is",
            "tool_calls": [ 
                {
                    "id": "call_KcAjWtAww20AihPHphUh46Gd",
                    "type": "function",
                    "function": {
                        "name": "get_current_weather",
                        "arguments": "{\"location\":\"Boston, MA\"}"
                    }
                }
            ]
        },
        { 
            "role": "tool",
            "content": "The weather is cold",
            "tool_call_id": "call_KcAjWtAww20AihPHphUh46Gd"
        }
    ]
}

Each tool call in the Assistant message needs a corresponding Tool message.

The corresponding Tool message, linked to the tool call through its tool_call_id.
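The round trip behind this example can be sketched client-side as follows: read the tool_calls from the assistant message, run a local function, and append a Tool message whose tool_call_id matches the call. The get_current_weather implementation is a stand-in; a real client would dispatch on function.name to its own handlers.

```python
import json

def get_current_weather(location):
    # Stand-in for a real weather lookup.
    return "The weather is cold"

assistant_message = {
    "role": "assistant",
    "content": "Let's find out what the weather is",
    "tool_calls": [
        {
            "id": "call_KcAjWtAww20AihPHphUh46Gd",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": "{\"location\":\"Boston, MA\"}",
            },
        }
    ],
}

messages = [assistant_message]
for call in assistant_message["tool_calls"]:
    args = json.loads(call["function"]["arguments"])
    result = get_current_weather(**args)
    # Each tool call gets a Tool message linked by tool_call_id.
    messages.append(
        {"role": "tool", "content": result, "tool_call_id": call["id"]}
    )

print(messages[-1]["tool_call_id"])  # call_KcAjWtAww20AihPHphUh46Gd
```

The resulting messages array matches the request body in the example above and can be posted back for the next completion.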

The following example performs a chat completion using a User message with tools and tool_choice.

POST _inference/chat_completion/openai-completion/_stream
{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's the price of a scarf?"
                }
            ]
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_price",
                "description": "Get the current price of a item",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "item": {
                            "id": "123"
                        }
                    }
                }
            }
        }
    ],
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "get_current_price"
        }
    }
}

The API returns the following response when a request is made to the OpenAI service:

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

(...)

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}}} 

event: message
data: [DONE]

The last object message of the stream contains the token usage information.
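A client consuming this stream concatenates the choices[].delta.content pieces into the full answer and reads the usage counts from the final chunk before [DONE]. The sketch below parses hard-coded event lines shaped like the response above (with a shortened id) rather than a live connection:

```python
import json

# Example "data:" lines in the shape of the streamed response above.
raw_events = [
    'data: {"chat_completion":{"id":"c1","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}',
    'data: {"chat_completion":{"id":"c1","choices":[{"delta":{"content":"Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}',
    'data: {"chat_completion":{"id":"c1","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}',
    'data: {"chat_completion":{"id":"c1","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}}}',
    "data: [DONE]",
]

answer, usage = "", None
for line in raw_events:
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)["chat_completion"]
    for choice in chunk.get("choices", []):
        # Concatenate incremental content deltas.
        answer += choice["delta"].get("content", "")
    # The usage object only appears on the final content chunk.
    usage = chunk.get("usage", usage)

print(answer)                  # Elastic is
print(usage["total_tokens"])   # 44
```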
