Elasticsearch inference service

Creates an inference endpoint to perform an inference task with the elasticsearch service.

If you use the E5 model through the elasticsearch service, the API request will automatically download and deploy the model if it isn’t downloaded yet.

Request

PUT /_inference/<task_type>/<inference_id>

Path parameters

<inference_id>
(Required, string) The unique identifier of the inference endpoint.
<task_type>

(Required, string) The type of the inference task that the model will perform.

Available task types:

  • rerank,
  • sparse_embedding,
  • text_embedding.
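
As a minimal sketch of how the two path parameters combine, the helper below builds the request path and rejects task types that are not in the list above. The function name `inference_endpoint_path` is hypothetical and not part of any Elasticsearch client; the check only mirrors the task types this page lists.

```python
from urllib.parse import quote

# Task types listed above for the elasticsearch service.
TASK_TYPES = ("rerank", "sparse_embedding", "text_embedding")


def inference_endpoint_path(task_type: str, inference_id: str) -> str:
    """Return the request path for PUT /_inference/<task_type>/<inference_id>."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"unsupported task_type: {task_type!r}")
    if not inference_id:
        raise ValueError("inference_id is required")
    # Percent-encode the identifier so it stays a single path segment.
    return f"/_inference/{task_type}/{quote(inference_id, safe='')}"


print(inference_endpoint_path("text_embedding", "my-e5-model"))
# prints: /_inference/text_embedding/my-e5-model
```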

Request body

service
(Required, string) The type of service supported for the specified task type. In this case, elasticsearch.
service_settings

(Required, object) Settings used to install the inference model.

These settings are specific to the elasticsearch service.

adaptive_allocations

(Optional, object) Adaptive allocations configuration object. If enabled, the number of model allocations is adjusted automatically based on the current load. When the load is high, a new model allocation is created (respecting the value of max_number_of_allocations if it is set). When the load is low, a model allocation is removed (respecting the value of min_number_of_allocations if it is set). The number of model allocations cannot be scaled down to less than 1 this way. If adaptive_allocations is enabled, do not set num_allocations manually.

enabled
(Optional, Boolean) If true, adaptive_allocations is enabled. Defaults to false.
max_number_of_allocations
(Optional, integer) Specifies the maximum number of allocations to scale to. If set, it must be greater than or equal to min_number_of_allocations.
min_number_of_allocations
(Optional, integer) Specifies the minimum number of allocations to scale to. If set, it must be greater than or equal to 1.
model_id
(Required, string) The name of the model to use for the inference task. It can be the ID of either a built-in model (for example, .multilingual-e5-small for E5) or a text embedding model already uploaded through Eland.
num_allocations
(Required, integer) The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases throughput. The exception is when adaptive_allocations is enabled: in that case, omit this value, because it is set automatically.
num_threads
(Required, integer) Sets the number of threads used by each model allocation during inference. Increasing this value generally increases the speed of each inference request. Inference is compute-bound, so num_threads must not exceed the number of allocated processors available per node. The value must be a power of 2. The maximum allowed value is 32.
task_settings

(Optional, object) Settings to configure the inference task. These settings are specific to the <task_type> you specified.

task_settings for the rerank task type
return_documents
(Optional, Boolean) If true, returns the document rather than only its index. Defaults to true.
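
The constraints on service_settings described above can be checked before a request is sent. The following is a sketch under stated assumptions, not the validation Elasticsearch itself performs: the function name `validate_service_settings` is hypothetical, and it covers only the rules stated on this page (model_id present, the adaptive_allocations bounds, no manual num_allocations when adaptive allocations are enabled, and num_threads being a power of 2 no greater than 32).

```python
def validate_service_settings(settings: dict) -> list[str]:
    """Return a list of problems found in a service_settings object (empty if none)."""
    problems = []
    if not settings.get("model_id"):
        problems.append("model_id is required")

    adaptive = settings.get("adaptive_allocations") or {}
    if adaptive.get("enabled", False):
        if "num_allocations" in settings:
            problems.append("do not set num_allocations when adaptive_allocations is enabled")
        low = adaptive.get("min_number_of_allocations")
        high = adaptive.get("max_number_of_allocations")
        if low is not None and low < 1:
            problems.append("min_number_of_allocations must be >= 1")
        if low is not None and high is not None and high < low:
            problems.append("max_number_of_allocations must be >= min_number_of_allocations")

    threads = settings.get("num_threads")
    if threads is not None:
        # A positive integer is a power of 2 when exactly one bit is set.
        is_power_of_two = threads > 0 and (threads & (threads - 1)) == 0
        if not is_power_of_two or threads > 32:
            problems.append("num_threads must be a power of 2 and at most 32")
    return problems


print(validate_service_settings(
    {"model_id": ".multilingual-e5-small", "num_allocations": 1, "num_threads": 1}
))
# prints: []
```

Returning a list of messages rather than raising on the first problem makes it easier to report every issue in a settings object at once.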

E5 via the elasticsearch service

The following example shows how to create an inference endpoint called my-e5-model to perform the text_embedding task.

The API request below will automatically download the E5 model if it isn’t already downloaded and then deploy the model.

Python:

resp = client.inference.put(
    task_type="text_embedding",
    inference_id="my-e5-model",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": ".multilingual-e5-small"
        }
    },
)
print(resp)

JavaScript:

const response = await client.inference.put({
  task_type: "text_embedding",
  inference_id: "my-e5-model",
  inference_config: {
    service: "elasticsearch",
    service_settings: {
      num_allocations: 1,
      num_threads: 1,
      model_id: ".multilingual-e5-small",
    },
  },
});
console.log(response);

Console:

PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}

In this example, the model_id must be the ID of one of the built-in E5 models. Valid values are .multilingual-e5-small and .multilingual-e5-small_linux-x86_64. For further details, refer to the E5 model documentation.

You might see a 502 bad gateway error in the response when using the Kibana Console. This error usually just reflects a timeout while the model downloads in the background. You can check the download progress in the Machine Learning UI. If you are using the Python client, you can set the timeout parameter to a higher value.
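
One way to raise the timeout from the Python client is sketched below. It assumes an 8.x client, where `client.options(request_timeout=...)` returns a copy of the client with a per-request timeout. The wrapper name `put_e5_endpoint` and the 300-second default are illustrative choices, not recommended values.

```python
def put_e5_endpoint(client, inference_id="my-e5-model", timeout_seconds=300):
    """Create the E5 endpoint, waiting up to timeout_seconds for the response."""
    # options() returns a copy of the client with per-request settings applied
    # (assumption: 8.x Python client).
    patient_client = client.options(request_timeout=timeout_seconds)
    return patient_client.inference.put(
        task_type="text_embedding",
        inference_id=inference_id,
        inference_config={
            "service": "elasticsearch",
            "service_settings": {
                "num_allocations": 1,
                "num_threads": 1,
                "model_id": ".multilingual-e5-small",
            },
        },
    )
```

Call it with an already configured client, for example `resp = put_e5_endpoint(client, timeout_seconds=600)`. If the request still times out, the download may simply be continuing in the background, as noted above.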

Models uploaded by Eland via the elasticsearch service

The following example shows how to create an inference endpoint called my-msmarco-minilm-model to perform the text_embedding task.

Python:

resp = client.inference.put(
    task_type="text_embedding",
    inference_id="my-msmarco-minilm-model",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": "msmarco-MiniLM-L12-cos-v5"
        }
    },
)
print(resp)

JavaScript:

const response = await client.inference.put({
  task_type: "text_embedding",
  inference_id: "my-msmarco-minilm-model",
  inference_config: {
    service: "elasticsearch",
    service_settings: {
      num_allocations: 1,
      num_threads: 1,
      model_id: "msmarco-MiniLM-L12-cos-v5",
    },
  },
});
console.log(response);

Console:

PUT _inference/text_embedding/my-msmarco-minilm-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": "msmarco-MiniLM-L12-cos-v5"
  }
}

In the request path, provide a unique identifier for the inference endpoint (here, my-msmarco-minilm-model). The inference_id must be unique and must not match the model_id.

The model_id must be the ID of a text embedding model which has already been uploaded through Eland.
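
The rule that inference_id must not match model_id is easy to enforce when the request is assembled. The helper below is a hypothetical sketch (`eland_endpoint_request` is not a client function). It returns keyword arguments in the shape the Python example above passes to `client.inference.put`. It cannot check that the identifier is unique in the cluster or that the model has actually been uploaded.

```python
def eland_endpoint_request(inference_id: str, model_id: str) -> dict:
    """Build keyword arguments for a text_embedding endpoint backed by an Eland-uploaded model."""
    if inference_id == model_id:
        raise ValueError("inference_id must not match model_id")
    return {
        "task_type": "text_embedding",
        "inference_id": inference_id,
        "inference_config": {
            "service": "elasticsearch",
            "service_settings": {
                "num_allocations": 1,
                "num_threads": 1,
                "model_id": model_id,
            },
        },
    }


request = eland_endpoint_request("my-msmarco-minilm-model", "msmarco-MiniLM-L12-cos-v5")
print(request["inference_config"]["service_settings"]["model_id"])
# prints: msmarco-MiniLM-L12-cos-v5
```

The result can then be unpacked into the client call, for example `client.inference.put(**request)`.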

Setting adaptive allocation for E5 via the elasticsearch service

The following example shows how to create an inference endpoint called my-e5-model to perform the text_embedding task and configure adaptive allocations.

The API request below will automatically download the E5 model if it isn’t already downloaded and then deploy the model.

Python:

resp = client.inference.put(
    task_type="text_embedding",
    inference_id="my-e5-model",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "adaptive_allocations": {
                "enabled": True,
                "min_number_of_allocations": 3,
                "max_number_of_allocations": 10
            },
            "model_id": ".multilingual-e5-small"
        }
    },
)
print(resp)

JavaScript:

const response = await client.inference.put({
  task_type: "text_embedding",
  inference_id: "my-e5-model",
  inference_config: {
    service: "elasticsearch",
    service_settings: {
      adaptive_allocations: {
        enabled: true,
        min_number_of_allocations: 3,
        max_number_of_allocations: 10,
      },
      model_id: ".multilingual-e5-small",
    },
  },
});
console.log(response);

Console:

PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 3,
      "max_number_of_allocations": 10
    },
    "model_id": ".multilingual-e5-small"
  }
}
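
To make the effect of the bounds in this example concrete, the toy model below applies the scaling rules described for adaptive_allocations: one allocation added under high load, one removed under low load, never above max_number_of_allocations, never below min_number_of_allocations or 1. It is an illustration only. The function `next_allocation_count` is hypothetical, and how Elasticsearch measures load or times its scaling decisions is not described on this page.

```python
def next_allocation_count(current, load, min_allocations=1, max_allocations=None):
    """Return the allocation count after one scaling step.

    load is "high", "low", or any other value for no change.
    """
    floor = max(1, min_allocations)  # the count never drops below 1
    if load == "high":
        current += 1
    elif load == "low":
        current -= 1
    if max_allocations is not None:
        current = min(current, max_allocations)
    return max(current, floor)


count = 3
for load in ["high", "high", "low", "low", "low", "low"]:
    count = next_allocation_count(count, load, min_allocations=3, max_allocations=10)
print(count)
# prints: 3
```

With the bounds from the example (3 and 10), repeated low load stops removing allocations once the count reaches 3.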