LLM Observability with Elastic: Azure OpenAI Part 2

We have added further capabilities to the Azure OpenAI GA package, which now offers prompt and response monitoring, PTU deployment performance tracking, and billing insights!


We recently announced the general availability (GA) of the Azure OpenAI integration. You can find the details in our previous blog, LLM Observability: Azure OpenAI.

Since then, we have added further capabilities to the Azure OpenAI GA package, which now offers prompt and response monitoring, PTU deployment performance tracking, and billing insights. Read on to learn more!

Advanced Logging and Monitoring

The initial GA release of the integration focused mainly on native logs, using Cognitive Services logging to track the telemetry of the service. This version of the Azure OpenAI integration lets you process advanced logs, which give a more holistic view of your OpenAI resource usage.

To achieve this, you have to set up an API Management service in Azure. The API Management service is a centralized place where you can put all of your OpenAI service endpoints and manage them end to end. Enable the API Management service and configure an Azure Event Hub to stream the logs.

To learn more about setting up the API Management service to access Azure OpenAI, please refer to the Azure documentation.
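
If you prefer to script this step, here is a minimal sketch of wiring the API Management gateway logs to an Event Hub via a diagnostic setting, using the azure-mgmt-monitor Python SDK. All resource IDs and names below are placeholders for your own resources; treat this as an outline of the setup rather than a definitive recipe.

```python
# Sketch: stream API Management "GatewayLogs" to an Event Hub through a
# diagnostic setting. All IDs and names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import DiagnosticSettingsResource, LogSettings

SUBSCRIPTION_ID = "<subscription-id>"
APIM_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.ApiManagement/service/<apim-name>"
)
EVENT_HUB_AUTH_RULE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.EventHub/namespaces/<eventhub-namespace>"
    "/authorizationRules/RootManageSharedAccessKey"
)

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# "GatewayLogs" is the API Management log category that carries the
# request/response details for calls proxied to Azure OpenAI.
client.diagnostic_settings.create_or_update(
    resource_uri=APIM_RESOURCE_ID,
    name="stream-to-eventhub",
    parameters=DiagnosticSettingsResource(
        event_hub_authorization_rule_id=EVENT_HUB_AUTH_RULE_ID,
        event_hub_name="apim-logs",
        logs=[LogSettings(category="GatewayLogs", enabled=True)],
    ),
)
```

With the diagnostic setting in place, the integration's Event Hub-based log collection can pick the logs up from there.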

By using advanced logging, you can collect the following log data:

  • Request input text
  • Response output text
  • Content filter results
  • Usage information
    • Input prompt tokens
    • Output completion tokens
    • Total tokens

The Azure OpenAI integration now collects the API Management gateway logs. When a user's question goes through API Management, both the question and the response from the GPT model are logged.

Here’s what a sample log looks like.
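
As an illustrative sketch only, with assumed, simplified field names rather than the integration's actual mappings, a gateway log document pairs the prompt and completion text with the usage counts:

```python
# Illustrative only: a simplified picture of a gateway log document.
# Real field names follow the integration's mappings and will differ.
sample_log = {
    "request": {"prompt": "What is Elastic observability?"},
    "response": {"completion": "Elastic observability unifies logs, metrics, and traces ..."},
    "usage": {
        "prompt_tokens": 12,      # input prompt tokens
        "completion_tokens": 87,  # output completion tokens
        "total_tokens": 99,
    },
    "content_filter": {"filtered": False},
    "model": "gpt-4",
}
```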

Content filter results

Azure OpenAI’s content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. With Azure OpenAI model deployments, you can use the default content filter or create your own content filter.

The integration now also collects the content filter result logs. In this example, let's create a custom filter in Azure OpenAI Studio that generates an error log.

By leveraging Azure content filters, you can create your own custom lists of terms or phrases to block or flag.
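
To see what generates such an error log, here is a sketch of a request that trips a filter, using the openai Python SDK (v1+); the endpoint, key, and deployment name are placeholders. A blocked prompt is rejected with an HTTP 400 error carrying the code content_filter, which is the error that shows up in the logs.

```python
# Sketch: send a prompt that a custom content filter blocks, and handle
# the resulting error. Endpoint, key, and deployment are placeholders.
from openai import AzureOpenAI, BadRequestError

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

try:
    client.chat.completions.create(
        model="<deployment-name>",
        messages=[{"role": "user", "content": "a phrase on your blocklist"}],
    )
except BadRequestError as err:
    # Filtered prompts come back as HTTP 400 with code "content_filter".
    if err.code == "content_filter":
        print("Blocked by content filter:", err.message)
```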

The document ingested into Elastic provides insight into the content-filtered request.

PTU Deployment Monitoring

Provisioned throughput units (PTU) are units of model processing capacity that you can reserve and deploy for processing prompts and generating completions.

The curated dashboard for PTU Deployment gives comprehensive visibility into metrics such as request latency, active token usage, PTU utilization, and fine-tuning activities, offering a quick snapshot of your deployment's health and performance.

Here are the essential PTU metrics captured by default:

  • Time to Response: Time taken for the first response to appear after a user sends a prompt (see the sketch after this list).
  • Active Tokens: Use this metric to understand your tokens-per-second (TPS) or tokens-per-minute (TPM) utilization for PTUs and compare it to the benchmarks for your target TPS or TPM scenarios.
  • Provisioned-Managed Utilization V2: Provides insight into utilization percentages, helping prevent overuse and ensuring efficient resource allocation.
  • Prompt Token Cache Match Rate: The prompt token cache hit ratio expressed as a percentage.
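
Time to Response is reported by Azure on the service side, but you can sanity-check it from the client by timing the first streamed chunk of a completion. Here is a minimal sketch with the openai Python SDK, assuming placeholder credentials and PTU deployment name:

```python
# Sketch: approximate "Time to Response" client-side by timing the first
# streamed chunk of a chat completion. Credentials are placeholders.
import time

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

start = time.monotonic()
stream = client.chat.completions.create(
    model="<ptu-deployment-name>",
    messages=[{"role": "user", "content": "Summarize our SLOs."}],
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks the time to first token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"First token after {time.monotonic() - start:.2f}s")
        break
```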

Using billing for cost insights

Using the curated overview dashboard, you can now monitor the actual usage cost of your AI applications. You are just one step away from processing the billing information.

You need to configure and install the Azure Billing Metrics integration. Once the installation is complete, the usage cost for the Cognitive Services is visualized in the Azure OpenAI overview dashboard.
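
For intuition on how token usage turns into cost for pay-as-you-go (non-PTU) deployments, here is a back-of-the-envelope sketch. The per-1,000-token rates below are made-up placeholders, not actual Azure OpenAI prices; the billing integration reports the real figures.

```python
# Sketch: rough cost from token counts. The rates are hypothetical
# placeholders, not real Azure OpenAI prices.
PROMPT_RATE_PER_1K = 0.005      # hypothetical $ per 1K prompt tokens
COMPLETION_RATE_PER_1K = 0.015  # hypothetical $ per 1K completion tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate a single request's cost in dollars from its token usage."""
    return (
        prompt_tokens / 1000 * PROMPT_RATE_PER_1K
        + completion_tokens / 1000 * COMPLETION_RATE_PER_1K
    )

# e.g. the 12 prompt + 87 completion tokens from the sample log earlier
print(f"${estimate_cost(12, 87):.5f}")  # roughly $0.0014
```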

Try it out today

Deploy a cluster on our Elasticsearch Service or download the stack, spin up the new Azure OpenAI integration, open the curated dashboards in Kibana and start monitoring your Azure OpenAI service!
