Create a Hugging Face inference endpoint
Added in 8.12.0
Create an inference endpoint to perform an inference task with the hugging_face service.
You must first create an inference endpoint on the Hugging Face endpoint page to get an endpoint URL.
On the new endpoint creation page, select the model you want to use (for example, intfloat/e5-small-v2), then select the sentence embeddings task under the advanced configuration section.
Create the endpoint, wait for it to finish initializing, and then copy its URL.
The following models are recommended for the Hugging Face service:
all-MiniLM-L6-v2
all-MiniLM-L12-v2
all-mpnet-base-v2
e5-base-v2
e5-small-v2
multilingual-e5-base
multilingual-e5-small
When you create an inference endpoint, the associated machine learning model is automatically deployed if it is not already running.
After creating the endpoint, wait for the model deployment to complete before using it.
To verify the deployment status, use the get trained model statistics API.
Look for "state": "fully_allocated" in the response and ensure that the "allocation_count" matches the "target_allocation_count".
Avoid creating multiple endpoints for the same model unless required, as each endpoint consumes significant resources.
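The readiness check described above can be sketched in Python. This is a minimal illustration, not an official client: the helper name is_deployment_ready is hypothetical, and the nesting of the fields under trained_model_stats, deployment_stats, and allocation_status is an assumption about the shape of the get trained model statistics response. Only the three field names and the "fully_allocated" value come from the text above.

```python
def is_deployment_ready(stats: dict) -> bool:
    """Return True only if every listed model deployment is fully allocated.

    Assumption: the stats response nests the fields as
    trained_model_stats[].deployment_stats.allocation_status.
    """
    models = stats.get("trained_model_stats", [])
    if not models:
        # No statistics returned: treat as not ready.
        return False
    for model in models:
        status = model.get("deployment_stats", {}).get("allocation_status", {})
        if status.get("state") != "fully_allocated":
            return False
        count = status.get("allocation_count")
        target = status.get("target_allocation_count")
        # Both counts must be present and equal.
        if count is None or count != target:
            return False
    return True


# Hand-written sample payload for illustration only (not real API output).
sample = {
    "trained_model_stats": [
        {
            "deployment_stats": {
                "allocation_status": {
                    "state": "fully_allocated",
                    "allocation_count": 1,
                    "target_allocation_count": 1,
                }
            }
        }
    ]
}

print(is_deployment_ready(sample))
```

A caller would typically poll the statistics API and pass each parsed JSON response to this helper until it returns True.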
Path parameters
- task_type (string, Required): The type of the inference task that the model will perform. Value is text_embedding.
- huggingface_inference_id (string, Required): The unique identifier of the inference endpoint.
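As a rough illustration of how the two path parameters combine into the request path, the sketch below builds /_inference/{task_type}/{huggingface_inference_id}. The function name build_endpoint_path and the identifier my-hf-embeddings are hypothetical; percent-encoding the identifier is a defensive choice made here, not something this page requires.

```python
from urllib.parse import quote

# The only task type this page lists for the hugging_face service.
SUPPORTED_TASK_TYPES = {"text_embedding"}


def build_endpoint_path(task_type: str, inference_id: str) -> str:
    """Build the request path from the two required path parameters."""
    if task_type not in SUPPORTED_TASK_TYPES:
        raise ValueError(f"unsupported task_type: {task_type!r}")
    if not inference_id:
        raise ValueError("huggingface_inference_id is required")
    # Percent-encode the identifier so it stays a single path segment.
    return f"/_inference/{task_type}/{quote(inference_id, safe='')}"


print(build_endpoint_path("text_embedding", "my-hf-embeddings"))
```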
Body
- chunking_settings (object)
- service (string, Required): Value is hugging_face.
- service_settings (object, Required)
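The body fields listed above might be assembled as in the following sketch. The helper build_request_body is hypothetical. The api_key and url keys inside service_settings are taken from the request example on this page, and chunking_settings is passed through untouched because its contents are not described here.

```python
import json
from typing import Optional


def build_request_body(api_key: str, url: str,
                       chunking_settings: Optional[dict] = None) -> dict:
    """Assemble the request body for the hugging_face service."""
    if not api_key or not url:
        raise ValueError("service_settings needs both api_key and url")
    body = {
        "service": "hugging_face",  # required, fixed value
        "service_settings": {"api_key": api_key, "url": url},  # required
    }
    if chunking_settings is not None:
        # Optional object; included only when the caller supplies it.
        body["chunking_settings"] = chunking_settings
    return body


print(json.dumps(build_request_body("hugging-face-access-token", "url-endpoint"), indent=2))
```

Omitting chunking_settings when it is not supplied keeps the payload identical to the minimal request example.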
curl \
--request PUT 'http://api.example.com/_inference/{task_type}/{huggingface_inference_id}' \
--header "Authorization: $API_KEY" \
--header "Content-Type: application/json" \
--data '{
  "service": "hugging_face",
  "service_settings": {
    "api_key": "hugging-face-access-token",
    "url": "url-endpoint"
  }
}'
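For readers who prefer Python to curl, the same PUT request could be prepared with the standard library roughly as follows. This is a sketch under stated assumptions: build_create_request and the identifier my-hf-embeddings are hypothetical, http://api.example.com is the placeholder host from the curl example, and the Authorization value is passed through verbatim just as the curl example passes $API_KEY. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_create_request(base_url: str, inference_id: str, authorization: str,
                         hf_token: str, hf_endpoint_url: str) -> urllib.request.Request:
    """Prepare (but do not send) the PUT request that creates the endpoint."""
    body = {
        "service": "hugging_face",
        "service_settings": {"api_key": hf_token, "url": hf_endpoint_url},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_inference/text_embedding/{inference_id}",
        data=json.dumps(body).encode("utf-8"),  # JSON object, not a quoted string
        method="PUT",
        headers={
            "Authorization": authorization,
            "Content-Type": "application/json",
        },
    )


request = build_create_request(
    "http://api.example.com",
    "my-hf-embeddings",
    "placeholder-authorization-value",
    "hugging-face-access-token",
    "url-endpoint",
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```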
Request example
{
  "service": "hugging_face",
  "service_settings": {
    "api_key": "hugging-face-access-token",
    "url": "url-endpoint"
  }
}
Response example
{
  "chunking_settings": {
    "max_chunk_size": 42.0,
    "overlap": 42.0,
    "sentence_overlap": 42.0,
    "strategy": "string"
  },
  "service": "string",
  "service_settings": {},
  "task_settings": {},
  "inference_id": "string",
  "task_type": "sparse_embedding"
}