David Hope

How to power Observability AI Assistant with local LLMs for on-prem deployments

Learn how to configure a local LLM for private or on-prem deployments

Large language models (LLMs) have now been around for a little while, and one of the first things some customers did was block access to them after a number of serious data exfiltration incidents, like this one:

Sadly, this also means that those businesses are missing out on the amazing things LLMs can do, especially in the observability space. The Elastic Observability AI Assistant, powered by LLMs, can help with a number of use cases:

  • Faster and more accurate root cause analysis. 

    • Rather than relying on heuristics that leave you hanging at 'CPU is busy' as the root cause, the assistant helps you get to the actual root cause.
  • Democratization of observability. 

    • LLMs let novice and junior users work in natural language, perform data analytics, and learn and upskill as they go. Junior BCG consultants saw a 43% boost in work quality when using GPT-4.
  • No swivel-chairing. 

    • Being able to have a conversation about a problem all from the same pane of glass means little to no context-switching.
  • Up-to-date expertise always on hand. 

    • With Elastic's Search AI-based RAG, you can pull up-to-the-minute information from your ticketing systems and include it in your troubleshooting workflow.
  • Freedom from mundane tasks.

    • The Elastic AI Assistant can automate a lot of workflows, calling Elastic APIs automatically to generate visualizations and do remedial actions.

With these benefits left on the table, it is clear that businesses need a way to protect their data while still being able to use LLMs.

One approach that is becoming popular is to use a locally deployable LLM. There are now many options available, with Mistral and Llama 3 currently among the most popular.

In this blog, we will show you how to connect the Elastic AI Assistant (release 8.14 or later) to these privately deployable LLMs.

Prerequisites and configuration

If you plan on following this blog, here are some of the components and details we used to set up the configuration:

  • Ensure you have an account on Elastic Cloud and a deployed stack (see instructions here).

  • A g5.xlarge instance in AWS (or equivalent from another cloud provider) to run the LLM on. These instances come with NVIDIA A10G GPUs, offering a balance between cost and performance. Make sure you use the default Amazon Linux OS.

  • I am also using the OpenTelemetry demo in my environment so I have some data to play with. You can do the same by cloning the repository and following the instructions here (a quick sketch follows this list).
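
As a minimal sketch, assuming you use Elastic's fork of the OpenTelemetry demo, the setup looks roughly like the following (the linked instructions remain the authoritative source, and the demo still needs your Elastic endpoint details configured before data flows into your deployment):

    # Clone the demo repository and start it with Docker Compose
    git clone https://github.com/elastic/opentelemetry-demo.git
    cd opentelemetry-demo
    docker compose up -d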

Setting up Ollama

Launch a G5 instance with Amazon Linux as shown below. Models tend to be fairly large, so you may wish to increase the disk space to at least 200GB:
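
If you prefer the command line, a rough AWS CLI equivalent is sketched below; the AMI ID, key pair, and security group are placeholders you need to substitute with your own values:

    # Launch a g5.xlarge running Amazon Linux with a 200GB gp3 root volume
    aws ec2 run-instances \
        --image-id ami-xxxxxxxxxxxxxxxxx \
        --instance-type g5.xlarge \
        --key-name my-key-pair \
        --security-group-ids sg-xxxxxxxxxxxxxxxxx \
        --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":200,"VolumeType":"gp3"}}]'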

Once the G5 instance is launched, log in and first run the command:

    sudo dnf install kernel-modules-extra.x86_64

This installs the additional kernel modules needed to get GPU support working on Amazon Linux.

Next, install Ollama:

    curl -fsSL https://ollama.com/install.sh | sh

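Before running a model, a quick sanity check is worthwhile. Assuming the install script registered the systemd service and was able to set up the NVIDIA driver (which is why we installed kernel-modules-extra above), the following should show the service running and the GPU visible:

    # Confirm the ollama service started and the GPU is detected
    systemctl status ollama
    nvidia-smi
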
Run Ollama:


    ollama run llama3
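
The first run downloads the model weights, which can take a few minutes. Once it completes, you can confirm the model is available locally:

    # List the models downloaded to this host
    ollama list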

Then edit the Ollama service config so it listens on all interfaces:

    sudo mkdir /etc/systemd/system/ollama.service.d
    sudo vi /etc/systemd/system/ollama.service.d/override.conf

Adding the following lines:

    [Service]
    Environment="OLLAMA_HOST=0.0.0.0"

After doing this, restart the service:


    sudo systemctl daemon-reload
    sudo systemctl restart ollama
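
Before opening the firewall, you can verify that Ollama is now listening on all interfaces; the second command queries Ollama's local API to list the available models:

    # Check that the service is bound to 0.0.0.0:11434
    sudo ss -tlnp | grep 11434
    # Query the local Ollama API
    curl http://localhost:11434/api/tags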

For the purposes of configuring Elastic, we are using the setup described in the Ollama blog post on OpenAI compatibility:

As mentioned there, we need to open up port 11434. To open port 11434 on your AWS host, log in to the AWS Management Console and navigate to the EC2 Dashboard. Locate your EC2 instance by going to "Instances" in the left-hand menu and selecting your instance. In the "Description" tab, find the "Security groups" section and click on the security group ID. In the security group details, go to the "Inbound rules" tab and click "Edit inbound rules." Add a new rule with "Custom TCP Rule" as the type, "TCP" as the protocol, and "11434" as the port range. Set the source to your desired IP range, then click "Save rules" to apply the changes.

And run a test:

    curl http://[YOUR EC2 PUBLIC DNS]:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "llama3",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant."
                },
                {
                    "role": "user",
                    "content": "Hello!"
                }
            ]
        }'

The result should contain something like: “Hello there! It's nice to meet you”

So now you are all up and running with your own private LLM. 

Setting up Elastic

First, let's access the AI Assistant from the icon in the top right corner.

Next, click the “Set up GenAI connector” button.

Use OpenAI and set up the connector in the following way. NOTE: the API key is simply ollama, as documented here:
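
If you would rather create the connector programmatically, a sketch against the Kibana connector API is shown below; the exact field names can vary by stack version, and the Kibana URL, credentials, and EC2 DNS are placeholders:

    curl -X POST "https://[YOUR KIBANA URL]/api/actions/connector" \
        -u elastic:[YOUR PASSWORD] \
        -H "kbn-xsrf: true" \
        -H "Content-Type: application/json" \
        -d '{
            "name": "Ollama (llama3)",
            "connector_type_id": ".gen-ai",
            "config": {
                "apiProvider": "OpenAI",
                "apiUrl": "http://[YOUR EC2 PUBLIC DNS]:11434/v1/chat/completions",
                "defaultModel": "llama3"
            },
            "secrets": {
                "apiKey": "ollama"
            }
        }'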

Next, the AI Assistant will set up the knowledge base for you:

Once you have set up the AI Assistant, you will need to turn on "Simulated Function Calling". First, click on "AI Assistant Settings" as shown below.

Next, tick the "Simulated Function Calling" box:

When that is finished, we can put the AI Assistant to work like so:

In summary

We set up the Elastic Observability AI Assistant to use other models, specifically private or on-prem models; in our examples we used llama3. You can configure Ollama to use other models such as Mistral and test to see which ones give you the best results.

Be aware that your results may vary depending on the models you use, and Elastic cannot support every model. Our goal is to support the ones that give the highest-quality results so that customers have the best experience.
