Create an ELSER inference endpoint Deprecated Added in 8.11.0

PUT /_inference/{task_type}/{elser_inference_id}

Create an inference endpoint to perform an inference task with the elser service. You can also deploy ELSER by using the Elasticsearch inference integration.


Your Elasticsearch deployment contains a preconfigured ELSER inference endpoint, you only need to create the enpoint using the API if you want to customize the settings.

The API request will automatically download and deploy the ELSER model if it isn't already downloaded.


You might see a 502 bad gateway error in the response when using the Kibana Console. This error usually just reflects a timeout, while the model downloads in the background. You can check the download progress in the Machine Learning UI. If using the Python client, you can set the timeout parameter to a higher value.

After creating the endpoint, wait for the model deployment to complete before using it. To verify the deployment status, use the get trained model statistics API. Look for "state": "fully_allocated" in the response and ensure that the "allocation_count" matches the "target_allocation_count". Avoid creating multiple endpoints for the same model unless required, as each endpoint consumes significant resources.

Path parameters

  • task_type string Required

    The type of the inference task that the model will perform.

    Value is sparse_embedding.

  • elser_inference_id string Required

    The unique identifier of the inference endpoint.

application/json

Body

  • Hide chunking_settings attributes Show chunking_settings attributes object
    • service string Required

      The service type

    • service_settings object Required
    • The maximum size of a chunk in words. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).

    • overlap number

      The number of overlapping words for chunks. It is applicable only to a word chunking strategy. This value cannot be higher than half the max_chunk_size value.

    • The number of overlapping sentences for chunks. It is applicable only for a sentence chunking strategy. It can be either 1 or 0.

    • strategy string

      The chunking strategy: sentence or word.

  • service string Required

    Value is elser.

  • service_settings object Required
    Hide service_settings attributes Show service_settings attributes object
    • Hide adaptive_allocations attributes Show adaptive_allocations attributes object
      • enabled boolean

        Turn on adaptive_allocations.

      • The maximum number of allocations to scale to. If set, it must be greater than or equal to min_number_of_allocations.

      • The minimum number of allocations to scale to. If set, it must be greater than or equal to 0. If not defined, the deployment scales to 0.

    • num_allocations number Required

      The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput. If adaptive allocations is enabled, do not set this value because it's automatically set.

    • num_threads number Required

      The number of threads used by each model allocation during inference. Increasing this value generally increases the speed per inference request. The inference process is a compute-bound process; threads_per_allocations must not exceed the number of available allocated processors per node. The value must be a power of 2. The maximum value is 32.


      If you want to optimize your ELSER endpoint for ingest, set the number of threads to 1. If you want to optimize your ELSER endpoint for search, set the number of threads to greater than 1.

Responses

  • 200 application/json
    Hide response attributes Show response attributes object
    • Hide attributes Show attributes object
      • The maximum size of a chunk in words. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).

      • overlap number

        The number of overlapping words for chunks. It is applicable only to a word chunking strategy. This value cannot be higher than half the max_chunk_size value.

      • The number of overlapping sentences for chunks. It is applicable only for a sentence chunking strategy. It can be either 1 or 0.

      • strategy string

        The chunking strategy: sentence or word.

    • service string Required

      The service type

    • service_settings object Required
    • inference_id string Required

      The inference Id

    • task_type string Required

      Values are sparse_embedding, text_embedding, rerank, completion, or chat_completion.

PUT /_inference/{task_type}/{elser_inference_id}
curl \
 --request PUT 'http://api.example.com/_inference/{task_type}/{elser_inference_id}' \
 --header "Authorization: $API_KEY" \
 --header "Content-Type: application/json" \
 --data '"{\n    \"service\": \"elser\",\n    \"service_settings\": {\n        \"num_allocations\": 1,\n        \"num_threads\": 1\n    }\n}"'
Request examples
Run `PUT _inference/sparse_embedding/my-elser-model` to create an inference endpoint that performs a `sparse_embedding` task. The request will automatically download the ELSER model if it isn't already downloaded and then deploy the model.
{
    "service": "elser",
    "service_settings": {
        "num_allocations": 1,
        "num_threads": 1
    }
}
Run `PUT _inference/sparse_embedding/my-elser-model` to create an inference endpoint that performs a `sparse_embedding` task with adaptive allocations. When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
{
    "service": "elser",
    "service_settings": {
        "adaptive_allocations": {
            "enabled": true,
            "min_number_of_allocations": 3,
            "max_number_of_allocations": 10
        },
        "num_threads": 1
    }
}
Response examples (200)
A successful response when creating an ELSER inference endpoint.
{
  "inference_id": "my-elser-model",
  "task_type": "sparse_embedding",
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}