Azure OpenAI inference service

Creates an inference endpoint to perform an inference task with the azureopenai service.

Request

PUT /_inference/<task_type>/<inference_id>

Path parameters

<inference_id>
(Required, string) The unique identifier of the inference endpoint.
<task_type>

(Required, string) The type of the inference task that the model will perform.

Available task types:

  • completion
  • text_embedding

Request body

service
(Required, string) The type of service supported for the specified task type. In this case, azureopenai.
service_settings

(Required, object) Settings used to configure the inference model.

These settings are specific to the azureopenai service.

api_key or entra_id

(Required, string) You must provide either an API key or an Entra ID. If you do not provide either, or provide both, you will receive an error when trying to create your model. See the Azure OpenAI Authentication documentation for more details on these authentication types.

You need to provide the API key only once, during the inference model creation. The Get inference API does not retrieve your API key. After creating the inference model, you cannot change the associated API key. If you want to use a different API key, delete the inference model and recreate it with the same name and the updated API key.
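
For example, rotating a key means deleting the endpoint and recreating it under the same name. The following is a minimal sketch, assuming an Elasticsearch Python client on which the inference.delete and inference.put helpers are available; the endpoint name and key value are illustrative:

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # adjust for your cluster

# Delete the endpoint that holds the old key ...
client.inference.delete(inference_id="azure_openai_embeddings")

# ... and recreate it with the same name and the new key.
# All other service settings must be supplied again.
client.inference.put(
    task_type="text_embedding",
    inference_id="azure_openai_embeddings",
    inference_config={
        "service": "azureopenai",
        "service_settings": {
            "api_key": "<new_api_key>",
            "resource_name": "<resource_name>",
            "deployment_id": "<deployment_id>",
            "api_version": "2024-02-01",
        },
    },
)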

resource_name
(Required, string) The name of your Azure OpenAI resource. You can find it in the list of resources for your subscription in the Azure Portal.
deployment_id
(Required, string) The deployment name of your deployed models. You can find your Azure OpenAI deployments through the Azure OpenAI Studio portal that is linked to your subscription.
api_version
(Required, string) The Azure API version ID to use. We recommend using the latest supported non-preview version.
rate_limit

(Optional, object) The azureopenai service sets a default number of requests allowed per minute depending on the task type. For text_embedding it is set to 1440. For completion it is set to 120. This helps to minimize the number of rate limit errors returned from Azure. To modify this, set the requests_per_minute setting of this object in your service settings:

"rate_limit": {
    "requests_per_minute": <<number_of_requests>>
}

More information about the rate limits for Azure can be found in the Quota limits docs and How to change the quotas.
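
For illustration, the request below creates an endpoint whose embedding traffic is capped below the default; a minimal sketch in the same style as the Python examples later in this page, with all placeholder values illustrative:

resp = client.inference.put(
    task_type="text_embedding",
    inference_id="azure_openai_embeddings_throttled",
    inference_config={
        "service": "azureopenai",
        "service_settings": {
            "api_key": "<api_key>",
            "resource_name": "<resource_name>",
            "deployment_id": "<deployment_id>",
            "api_version": "2024-02-01",
            "rate_limit": {
                # Lower than the text_embedding default of 1440
                "requests_per_minute": 600
            },
        },
    },
)
print(resp)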

task_settings

(Optional, object) Settings to configure the inference task. These settings are specific to the <task_type> you specified.

task_settings for the completion task type
user
(Optional, string) Specifies the user issuing the request, which can be used for abuse detection.
task_settings for the text_embedding task type
user
(Optional, string) Specifies the user issuing the request, which can be used for abuse detection.
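
For instance, the user setting can be passed when creating a completion endpoint; a minimal sketch mirroring the Python examples below, where the endpoint name and user value are illustrative:

resp = client.inference.put(
    task_type="completion",
    inference_id="azure_openai_completion_with_user",
    inference_config={
        "service": "azureopenai",
        "service_settings": {
            "api_key": "<api_key>",
            "resource_name": "<resource_name>",
            "deployment_id": "<deployment_id>",
            "api_version": "2024-02-01",
        },
        "task_settings": {
            # Forwarded to Azure OpenAI for abuse detection
            "user": "user-1234"
        },
    },
)
print(resp)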

Azure OpenAI service example

The following example shows how to create an inference endpoint called azure_openai_embeddings to perform a text_embedding task type. Note that we do not specify a model here, as it is already defined by the Azure OpenAI deployment.

The list of embeddings models that you can choose from in your deployment can be found in the Azure models documentation.

resp = client.inference.put(
    task_type="text_embedding",
    inference_id="azure_openai_embeddings",
    inference_config={
        "service": "azureopenai",
        "service_settings": {
            "api_key": "<api_key>",
            "resource_name": "<resource_name>",
            "deployment_id": "<deployment_id>",
            "api_version": "2024-02-01"
        }
    },
)
print(resp)

const response = await client.inference.put({
  task_type: "text_embedding",
  inference_id: "azure_openai_embeddings",
  inference_config: {
    service: "azureopenai",
    service_settings: {
      api_key: "<api_key>",
      resource_name: "<resource_name>",
      deployment_id: "<deployment_id>",
      api_version: "2024-02-01",
    },
  },
});
console.log(response);

PUT _inference/text_embedding/azure_openai_embeddings
{
    "service": "azureopenai",
    "service_settings": {
        "api_key": "<api_key>",
        "resource_name": "<resource_name>",
        "deployment_id": "<deployment_id>",
        "api_version": "2024-02-01"
    }
}
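
After the endpoint is created, you can send text to it through the inference API. A minimal sketch, assuming a Python client version that exposes the inference helper:

resp = client.inference.inference(
    inference_id="azure_openai_embeddings",
    input="The quick brown fox jumps over the lazy dog",
)
print(resp)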

The next example shows how to create an inference endpoint called azure_openai_completion to perform a completion task type.

resp = client.inference.put(
    task_type="completion",
    inference_id="azure_openai_completion",
    inference_config={
        "service": "azureopenai",
        "service_settings": {
            "api_key": "<api_key>",
            "resource_name": "<resource_name>",
            "deployment_id": "<deployment_id>",
            "api_version": "2024-02-01"
        }
    },
)
print(resp)

const response = await client.inference.put({
  task_type: "completion",
  inference_id: "azure_openai_completion",
  inference_config: {
    service: "azureopenai",
    service_settings: {
      api_key: "<api_key>",
      resource_name: "<resource_name>",
      deployment_id: "<deployment_id>",
      api_version: "2024-02-01",
    },
  },
});
console.log(response);

PUT _inference/completion/azure_openai_completion
{
    "service": "azureopenai",
    "service_settings": {
        "api_key": "<api_key>",
        "resource_name": "<resource_name>",
        "deployment_id": "<deployment_id>",
        "api_version": "2024-02-01"
    }
}
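
As with the embeddings endpoint, you can then run a completion through the inference API; a minimal sketch under the same client assumptions:

resp = client.inference.inference(
    inference_id="azure_openai_completion",
    input="Summarize the Elastic inference API in one sentence.",
)
print(resp)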

The list of chat completion models that you can choose from in your Azure OpenAI deployment can be found at the following places: