ELSER inference service
Creates an inference endpoint to perform an inference task with the elser service.
You can also deploy ELSER by using the Elasticsearch inference service.
The API request will automatically download and deploy the ELSER model if it isn’t already downloaded.
Request
PUT /_inference/<task_type>/<inference_id>
Path parameters
<inference_id>
- (Required, string) The unique identifier of the inference endpoint.
<task_type>
- (Required, string) The type of the inference task that the model will perform.
  Available task types:
  - sparse_embedding
Request body
chunking_settings
- (Optional, object) Chunking configuration object. Refer to Configuring chunking to learn more about chunking.
  max_chunk_size
  - (Optional, integer) Specifies the maximum size of a chunk in words. Defaults to 250. This value cannot be higher than 300 or lower than 20 (for the sentence strategy) or 10 (for the word strategy).
  overlap
  - (Optional, integer) Only for the word chunking strategy. Specifies the number of overlapping words for chunks. Defaults to 100. This value cannot be higher than half of max_chunk_size.
  sentence_overlap
  - (Optional, integer) Only for the sentence chunking strategy. Specifies the number of overlapping sentences for chunks. It can be either 1 or 0. Defaults to 1.
  strategy
  - (Optional, string) Specifies the chunking strategy. It can be either sentence or word.
service
- (Required, string) The type of service supported for the specified task type. In this case, elser.
service_settings
- (Required, object) Settings used to install the inference model. These settings are specific to the elser service.
  adaptive_allocations
  - (Optional, object) Adaptive allocations configuration object. If enabled, the number of allocations of the model is set based on the current load the process gets. When the load is high, a new model allocation is automatically created (respecting the value of max_number_of_allocations if it’s set). When the load is low, a model allocation is automatically removed (respecting the value of min_number_of_allocations if it’s set). If adaptive_allocations is enabled, do not set the number of allocations manually.
    enabled
    - (Optional, Boolean) If true, adaptive_allocations is enabled. Defaults to false.
    max_number_of_allocations
    - (Optional, integer) Specifies the maximum number of allocations to scale to. If set, it must be greater than or equal to min_number_of_allocations.
    min_number_of_allocations
    - (Optional, integer) Specifies the minimum number of allocations to scale to. If set, it must be greater than or equal to 1.
  num_allocations
  - (Required, integer) The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput. If adaptive_allocations is enabled, do not set this value, because it’s set automatically.
  num_threads
  - (Required, integer) Sets the number of threads used by each model allocation during inference. Increasing this value generally increases the speed per inference request. The inference process is compute-bound; num_threads must not exceed the number of available allocated processors per node. Must be a power of 2. The maximum allowed value is 32.
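The numeric limits listed for the request body can be checked on the client before a request is sent. The following is a minimal sketch of such a check, not part of the API: the function names `check_chunking` and `check_service_settings` and the argument name `max_size` are hypothetical, and the server remains the authority on what it accepts.

```python
from typing import Optional


def check_chunking(strategy: str, max_size: int = 250,
                   overlap: Optional[int] = None,
                   sentence_overlap: Optional[int] = None) -> list[str]:
    """Return the documented chunking limits that the given values violate."""
    if strategy not in ("sentence", "word"):
        return [f"unknown strategy: {strategy!r}"]
    problems = []
    # The lower bound depends on the strategy: 20 words for sentence, 10 for word.
    floor = 20 if strategy == "sentence" else 10
    if not floor <= max_size <= 300:
        problems.append(f"chunk size must be between {floor} and 300 words")
    if overlap is not None:
        if strategy != "word":
            problems.append("overlap applies only to the word strategy")
        elif overlap * 2 > max_size:
            problems.append("overlap cannot exceed half of the chunk size")
    if sentence_overlap is not None:
        if strategy != "sentence":
            problems.append("sentence_overlap applies only to the sentence strategy")
        elif sentence_overlap not in (0, 1):
            problems.append("sentence_overlap must be 0 or 1")
    return problems


def check_service_settings(settings: dict) -> list[str]:
    """Return the documented service_settings limits that the dict violates."""
    problems = []
    threads = settings.get("num_threads")
    # A power of 2 has exactly one bit set, so n & (n - 1) is 0.
    if (not isinstance(threads, int) or threads < 1 or threads > 32
            or threads & (threads - 1)):
        problems.append("num_threads must be a power of 2 no greater than 32")
    adaptive = settings.get("adaptive_allocations") or {}
    if adaptive.get("enabled", False):
        if "num_allocations" in settings:
            problems.append("do not set num_allocations with adaptive_allocations enabled")
        low = adaptive.get("min_number_of_allocations")
        high = adaptive.get("max_number_of_allocations")
        if low is not None and low < 1:
            problems.append("min_number_of_allocations must be at least 1")
        if low is not None and high is not None and high < low:
            problems.append("max_number_of_allocations must not be below the minimum")
    elif "num_allocations" not in settings:
        problems.append("num_allocations is required without adaptive_allocations")
    return problems
```

Both functions return an empty list when nothing is wrong, so a caller can print the problems or raise an error as it sees fit.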
ELSER service example
The following example shows how to create an inference endpoint called my-elser-model to perform a sparse_embedding task type.
Refer to the ELSER model documentation for more info.
If you want to optimize your ELSER endpoint for ingest, set the number of threads to 1 ("num_threads": 1).
If you want to optimize your ELSER endpoint for search, set the number of threads to a value greater than 1.
The request below will automatically download the ELSER model if it isn’t already downloaded and then deploy the model.
resp = client.inference.put(
    task_type="sparse_embedding",
    inference_id="my-elser-model",
    inference_config={
        "service": "elser",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1
        }
    },
)
print(resp)

const response = await client.inference.put({
  task_type: "sparse_embedding",
  inference_id: "my-elser-model",
  inference_config: {
    service: "elser",
    service_settings: {
      num_allocations: 1,
      num_threads: 1,
    },
  },
});
console.log(response);

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
Example response:
{
  "inference_id": "my-elser-model",
  "task_type": "sparse_embedding",
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}
You might see a 502 bad gateway error in the response when using the Kibana Console.
This error usually just reflects a timeout, while the model downloads in the background.
You can check the download progress in the Machine Learning UI.
If using the Python client, you can set the timeout parameter to a higher value.
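One way to allow more time from Python is a per-request timeout. The sketch below assumes the 8.x `elasticsearch` package, where `client.options(request_timeout=...)` returns a copy of the client with that transport timeout. The helper name `put_elser_with_timeout` and the 300-second default are illustrative choices, not values recommended by this page.

```python
def put_elser_with_timeout(client, inference_id: str, timeout_seconds: float = 300):
    """Create an ELSER endpoint, waiting longer than the default for a response."""
    # options() returns a copy; the original client keeps its default timeout.
    patient_client = client.options(request_timeout=timeout_seconds)
    return patient_client.inference.put(
        task_type="sparse_embedding",
        inference_id=inference_id,
        inference_config={
            "service": "elser",
            "service_settings": {"num_allocations": 1, "num_threads": 1},
        },
    )


# Usage against a real cluster (the address is a placeholder):
#   from elasticsearch import Elasticsearch
#   client = Elasticsearch("http://localhost:9200")
#   print(put_elser_with_timeout(client, "my-elser-model"))
```

A longer timeout only changes how long the client waits. The model download proceeds in the background either way.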
Setting adaptive allocations for the ELSER service
For more information on how to optimize your ELSER endpoints, refer to the ELSER recommendations section in the model documentation. To learn more about model autoscaling, refer to the trained model autoscaling page.
The following example shows how to create an inference endpoint called my-elser-model to perform a sparse_embedding task type and configure adaptive allocations.
The request below will automatically download the ELSER model if it isn’t already downloaded and then deploy the model.
const response = await client.inference.put({
  task_type: "sparse_embedding",
  inference_id: "my-elser-model",
  inference_config: {
    service: "elser",
    service_settings: {
      adaptive_allocations: {
        enabled: true,
        min_number_of_allocations: 3,
        max_number_of_allocations: 10,
      },
      num_threads: 1,
    },
  },
});
console.log(response);

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 3,
      "max_number_of_allocations": 10
    },
    "num_threads": 1
  }
}
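A Python client version of the same request might look like the sketch below. It mirrors the `client.inference.put` call shape used in the first example on this page. The wrapper name `put_adaptive_elser` and its keyword arguments are hypothetical conveniences.

```python
def put_adaptive_elser(client, inference_id: str,
                       min_allocations: int = 3, max_allocations: int = 10):
    """Create an ELSER endpoint that scales between the two allocation bounds."""
    return client.inference.put(
        task_type="sparse_embedding",
        inference_id=inference_id,
        inference_config={
            "service": "elser",
            "service_settings": {
                "adaptive_allocations": {
                    "enabled": True,
                    "min_number_of_allocations": min_allocations,
                    "max_number_of_allocations": max_allocations,
                },
                # num_allocations is omitted on purpose: with adaptive
                # allocations enabled it is set automatically.
                "num_threads": 1,
            },
        },
    )


# Usage against a real cluster (the address is a placeholder):
#   from elasticsearch import Elasticsearch
#   client = Elasticsearch("http://localhost:9200")
#   print(put_adaptive_elser(client, "my-elser-model"))
```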