Elastic Inference Service (EIS)

Creates an inference endpoint to perform an inference task with the elastic service.

Request

PUT /_inference/<task_type>/<inference_id>

Path parameters

<inference_id>
(Required, string) The unique identifier of the inference endpoint.
<task_type>

(Required, string) The type of the inference task that the model will perform.

Available task types:

  • chat_completion
  • sparse_embedding

The chat_completion task type only supports streaming and only through the _unified API.

For more information on how to use the chat_completion task type, please refer to the chat completion documentation.
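For example, a streamed chat completion request against an existing chat_completion endpoint looks like the following sketch; the message content is illustrative and the response is returned as a stream of events:

POST /_inference/chat_completion/<inference_id>/_stream
{
    "messages": [
        {
            "role": "user",
            "content": "Say hello"
        }
    ]
}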

Request body

chunking_settings
(Optional, object) Chunking configuration object. The following settings are its sub-fields; a sketch showing them in a full request appears at the end of this section.
max_chunk_size
(Optional, integer) Specifies the maximum size of a chunk in words. Defaults to 250. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).
overlap
(Optional, integer) Only for the word chunking strategy. Specifies the number of overlapping words for chunks. Defaults to 100. This value cannot be higher than half of max_chunk_size.
sentence_overlap
(Optional, integer) Only for the sentence chunking strategy. Specifies the number of overlapping sentences for chunks. It can be either 1 or 0. Defaults to 1.
strategy
(Optional, string) Specifies the chunking strategy. It can be either sentence or word.

service
(Required, string) The type of service supported for the specified task type. In this case, elastic.
service_settings
(Required, object) Settings used to install the inference model.
model_id
(Required, string) The name of the model to use for the inference task.
rate_limit

(Optional, object) By default, the elastic service sets the number of requests allowed per minute to 1000 for sparse_embedding and 240 for chat_completion. This helps minimize the number of rate limit errors returned. To modify this limit, set the requests_per_minute setting of this object in your service settings:

"rate_limit": {
    "requests_per_minute": <<number_of_requests>>
}
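
As a sketch of how these settings fit together, the following request creates a hypothetical sparse_embedding endpoint (the name my-elser-endpoint and the chosen values are illustrative) with custom chunking settings and a lowered rate limit:

PUT /_inference/sparse_embedding/my-elser-endpoint
{
    "service": "elastic",
    "service_settings": {
        "model_id": "elser",
        "rate_limit": {
            "requests_per_minute": 500
        }
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 200,
        "sentence_overlap": 1
    }
}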

Elastic Inference Service example

The following example shows how to create an inference endpoint called elser-model-eis to perform a sparse_embedding task type.

PUT _inference/sparse_embedding/elser-model-eis
{
    "service": "elastic",
    "service_settings": {
        "model_name": "elser"
    }
}
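
Once created, the endpoint can be called through the perform inference API; the following is a minimal sketch with an illustrative input:

POST /_inference/sparse_embedding/elser-model-eis
{
    "input": "The quick brown fox jumps over the lazy dog"
}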

The following example shows how to create an inference endpoint called chat-completion-endpoint to perform a chat_completion task type.

PUT /_inference/chat_completion/chat-completion-endpoint
{
    "service": "elastic",
    "service_settings": {
        "model_id": "model-1"
    }
}