Elastic Inference Service (EIS)

Creates an inference endpoint to perform an inference task with the elastic service.

Request

PUT /_inference/<task_type>/<inference_id>

Path parameters

<inference_id>
(Required, string) The unique identifier of the inference endpoint.
<task_type>

(Required, string) The type of the inference task that the model will perform.

Available task types:

  • chat_completion
  • sparse_embedding

The chat_completion task type only supports streaming and only through the _unified API.

For more information on how to use the chat_completion task type, please refer to the chat completion documentation.
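For example, a streamed chat completion request against an existing chat_completion endpoint looks like the following sketch; the message content is illustrative and the response is returned as a stream of events:

POST /_inference/chat_completion/<inference_id>/_stream
{
    "messages": [
        {
            "role": "user",
            "content": "Say hello"
        }
    ]
}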

Request body

chunking_settings
(Optional, object) Chunking configuration object. The following settings are its sub-fields; a sketch showing them in a full request appears at the end of this section.
max_chunk_size
(Optional, integer) Specifies the maximum size of a chunk in words. Defaults to 250. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).
overlap
(Optional, integer) Only for the word chunking strategy. Specifies the number of overlapping words for chunks. Defaults to 100. This value cannot be higher than half of max_chunk_size.
sentence_overlap
(Optional, integer) Only for the sentence chunking strategy. Specifies the number of overlapping sentences for chunks. It can be either 1 or 0. Defaults to 1.
strategy
(Optional, string) Specifies the chunking strategy. It can be either sentence or word.

service
(Required, string) The type of service supported for the specified task type. In this case, elastic.
service_settings
(Required, object) Settings used to install the inference model.
model_id
(Required, string) The name of the model to use for the inference task.
rate_limit

(Optional, object) By default, the elastic service sets the number of requests allowed per minute to 1000 for sparse_embedding and 240 for chat_completion. This helps minimize the number of rate limit errors returned. To modify this limit, set the requests_per_minute setting of this object in your service settings:

"rate_limit": {
    "requests_per_minute": <<number_of_requests>>
}
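
As a sketch of how these settings fit together, the following request creates a hypothetical sparse_embedding endpoint (the name my-elser-endpoint and the chosen values are illustrative) with custom chunking settings and a lowered rate limit:

PUT /_inference/sparse_embedding/my-elser-endpoint
{
    "service": "elastic",
    "service_settings": {
        "model_id": "elser",
        "rate_limit": {
            "requests_per_minute": 500
        }
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 200,
        "sentence_overlap": 1
    }
}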

Elastic Inference Service example

The following example shows how to create an inference endpoint called elser-model-eis to perform a sparse_embedding task type.

PUT _inference/sparse_embedding/elser-model-eis
{
    "service": "elastic",
    "service_settings": {
        "model_name": "elser"
    }
}
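
Once created, the endpoint can be called through the perform inference API; the following is a minimal sketch with an illustrative input:

POST /_inference/sparse_embedding/elser-model-eis
{
    "input": "The quick brown fox jumps over the lazy dog"
}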

The following example shows how to create an inference endpoint called chat-completion-endpoint to perform a chat_completion task type.

PUT /_inference/chat_completion/chat-completion-endpoint
{
    "service": "elastic",
    "service_settings": {
        "model_id": "model-1"
    }
}