LocalAI for GPU-powered text embeddings in air-gapped systems

Introduction

Do you want to build a RAG application on top of the Elasticsearch vector database? Do you need to use semantic search on a large amount of data? Do you need to run on-premises in an air-gapped environment? This article will show you how.

Elasticsearch offers a number of ways to create embeddings for your data for semantic search. One of the most popular ways is to use the Elasticsearch open inference API with OpenAI, Cohere, or Hugging Face models. These platforms support a number of large, powerful embedding models that can run on GPUs. However, third-party embedding services are not available in air-gapped systems, and they are off-limits to customers with privacy concerns or regulatory requirements.

Alternatively, you can use ELSER and E5 to compute embeddings locally. These embedding models run on the CPU and are optimized for speed and memory usage. They are also available for air-gapped systems and can be used in the cloud. However, the performance of these models is not as good as that of the models that run on GPUs.

Wouldn't it be great if you could compute embeddings for your data locally? With LocalAI you can do just that. LocalAI is a free and open-source inference server compatible with the OpenAI API. It supports model inference using multiple backends, including Sentence Transformers for embedding and llama.cpp for text generation. LocalAI also supports GPU acceleration, so you can compute embeddings faster.

This article will show you how to use LocalAI to compute embeddings for your data. We'll walk you through the process of setting up LocalAI, configuring it to compute embeddings for your data, and running it to generate embeddings. You can run it on your laptop, on your air-gapped system, or wherever you need to compute embeddings.

Have I piqued your interest? Let's get started!

How to set up LocalAI to compute embeddings for your data

Step 1: Set up LocalAI with docker-compose

To get started with LocalAI, you need to have Docker and docker-compose installed on your machine. Depending on your operating system, you may also need to install the NVIDIA Container Toolkit for GPU support inside the Docker containers.

Older versions of docker-compose do not support the NVIDIA runtime directives, so check the version of docker-compose on your machine and make sure you have a recent release installed.
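The version check can also be scripted. The Python sketch below parses the text printed by docker-compose and compares it with a minimum version. The 1.28.0 threshold is an assumption taken from the Docker documentation on GPU device reservations, not from this article, and the helper names are hypothetical.

```python
import re
import subprocess

# Assumption: Compose 1.28.0 is the first release that understands GPU device
# reservations (per the Docker documentation); adjust if your setup differs.
MIN_COMPOSE_VERSION = (1, 28, 0)


def parse_version(text):
    """Extract the first X.Y.Z version number found in the given text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_gpu_reservations(version_output, minimum=MIN_COMPOSE_VERSION):
    """Return True if the reported Compose version is at least `minimum`."""
    return parse_version(version_output) >= minimum


def installed_compose_is_recent_enough():
    """Run `docker-compose version --short` and check the reported version."""
    result = subprocess.run(
        ["docker-compose", "version", "--short"],
        capture_output=True, text=True, check=True,
    )
    return supports_gpu_reservations(result.stdout)
```

Calling installed_compose_is_recent_enough() requires docker-compose on the PATH; the two parsing helpers work on any version string.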

Create a docker-compose.yaml configuration file for the LocalAI service.

Notes:

We mount the $HOME/models directory to the /models directory inside the container. This is where the models will be stored. You need to adjust the path to the directory where you want to store the models.
We have specified the number of threads to use for inference and the number of GPUs to use. You can adjust these values according to your hardware configuration.
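To make the notes above concrete, here is a minimal Python sketch that renders a docker-compose.yaml with the models directory, thread count, and GPU count as parameters. It is not the author's original file: the service layout, the MODELS_PATH and THREADS environment variables, and port 8080 are assumptions based on the LocalAI and Docker Compose documentation, and the image name is deliberately left for you to supply.

```python
# Assumed layout of a Compose file for LocalAI with an NVIDIA GPU reservation.
COMPOSE_TEMPLATE = """\
services:
  localai:
    image: {image}
    container_name: localai
    environment:
      - MODELS_PATH=/models
      - THREADS={threads}
    ports:
      - "8080:8080"
    volumes:
      - {models_dir}:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: {gpu_count}
              capabilities: [gpu]
"""


def render_compose(image, models_dir, threads=8, gpu_count=1):
    """Fill the template with the image name and hardware-specific values."""
    return COMPOSE_TEMPLATE.format(
        image=image,
        models_dir=models_dir,
        threads=threads,
        gpu_count=gpu_count,
    )


# Example usage (the image name is a placeholder, pick a GPU-enabled LocalAI tag):
# from pathlib import Path
# text = render_compose("localai/localai:YOUR_GPU_TAG", Path.home() / "models")
# (Path.home() / "docker-compose.yaml").write_text(text)
```

Keeping the hardware-dependent values as function arguments mirrors the advice above: adjust the models path, threads, and GPU count to your own machine.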

Step 2: Configure LocalAI to use Sentence Transformers models

In this tutorial, we'll use the mixedbread-ai/mxbai-embed-large-v1 model, which was ranked 4th on the MTEB Leaderboard at the time of writing. However, any embedding model that can be loaded by the sentence-transformers library will work in the same way.

Create directory $HOME/models and a configuration file $HOME/models/mxbai-embed-large-v1.yaml with the following content:
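As an illustration, the Python sketch below creates the directory and writes a model configuration. The field names (name, backend, embeddings, parameters.model) follow the format shown in the LocalAI documentation for Sentence Transformers embeddings; treat the exact content as an assumption rather than the author's original file.

```python
from pathlib import Path

# Assumed LocalAI model definition for a Sentence Transformers embedding model.
MODEL_CONFIG = """\
name: mxbai-embed-large-v1
backend: sentencetransformers
embeddings: true
parameters:
  model: mixedbread-ai/mxbai-embed-large-v1
"""


def write_model_config(models_dir):
    """Create the models directory if needed and write the model config file."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    config_path = models_dir / "mxbai-embed-large-v1.yaml"
    config_path.write_text(MODEL_CONFIG)
    return config_path


# Example usage:
# write_model_config(Path.home() / "models")
```

The value of name is the model identifier that clients, including Elasticsearch, later use to refer to this model.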

Step 3: Start the LocalAI server

Start the Docker container in detached mode by running docker-compose up -d from your $HOME directory.

Verify that the container has started correctly by running docker-compose ps and checking that the localai container is in the Up state.

If something went wrong, check the logs. You can also use the logs to verify that LocalAI can see the GPU: the container logs (for example, via docker-compose logs) should include information about the detected GPU device.

Finally, you can verify that the inference server is working by querying the list of installed models. The response should include the model configured in Step 2.
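Querying the model list can be sketched in Python as follows. The sketch assumes LocalAI listens on http://localhost:8080 and exposes the OpenAI-compatible /v1/models endpoint, whose response carries the models in a data array; both the base URL and the helper names are assumptions for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: LocalAI is published on port 8080 of the local machine.
LOCALAI_URL = "http://localhost:8080"


def model_ids(models_response):
    """Collect the model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_models(base_url=LOCALAI_URL, timeout=10):
    """Fetch the model list from the server and return the model ids."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
        return model_ids(json.load(response))


# Example usage (requires a running LocalAI server):
# print(list_models())
```

If the model configured earlier does not appear in the returned list, the container logs are the first place to look.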

Step 4: Create the Elasticsearch inference service

We have created and configured the LocalAI inference server. Since it is a drop-in replacement for the OpenAI inference server, we can create a new openai inference service in Elasticsearch. Support for this functionality was implemented in Elasticsearch 8.14.

To create a new inference service, open Dev Tools in Kibana and run the following command:

Notes:

The api_key parameter is required for the openai service and must be set, but the specific value is not important for our LocalAI service.
For large models, the PUT request may initially time out if the model takes a long time to download to the LocalAI server for the first time. Just retry the PUT request after a short while.
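The two notes above can be sketched in Python. The first helper builds the body of a PUT _inference/text_embedding/<inference_id> request for the openai service, with the url setting pointed at LocalAI instead of OpenAI; the second retries a request that times out while the model is still downloading. The LocalAI URL, the helper names, and the send callable are hypothetical; send stands for whatever HTTP client call you use and is assumed to raise TimeoutError on a timeout.

```python
import time


def inference_endpoint_body(model_id, localai_embeddings_url, api_key="ignored"):
    """Build the request body for an openai inference service backed by LocalAI."""
    return {
        "service": "openai",
        "service_settings": {
            # Required by the openai service; LocalAI does not check the value.
            "api_key": api_key,
            "model_id": model_id,
            # Must be reachable from the Elasticsearch nodes.
            "url": localai_embeddings_url,
        },
    }


def put_with_retry(send, attempts=5, delay_seconds=30, sleep=time.sleep):
    """Call send() until it succeeds, waiting between attempts on TimeoutError."""
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except TimeoutError:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay_seconds)


# Example body (the URL is an assumption about where LocalAI is reachable):
# body = inference_endpoint_body(
#     "mxbai-embed-large-v1", "http://localhost:8080/v1/embeddings"
# )
```

Passing sleep as a parameter keeps the retry loop easy to exercise without actually waiting.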

Finally, you can verify that the inference service is working correctly by sending it a test inference request. The response should contain the embedding computed for the input text.
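A test request of this kind can be sketched in Python. The sketch assumes the request is a POST to _inference/text_embedding/<inference_id> with an input field, and that the response nests vectors under text_embedding, as described in the Elasticsearch inference API documentation; the helper names and the sample response are hypothetical.

```python
def embedding_request_body(text):
    """Build the body of a test inference request for a single input text."""
    return {"input": text}


def first_embedding(inference_response):
    """Return the first embedding vector from a text_embedding response."""
    return inference_response["text_embedding"][0]["embedding"]


# Hypothetical, heavily shortened response used only to show the shape;
# a real embedding has many more dimensions.
sample_response = {"text_embedding": [{"embedding": [0.1, -0.2, 0.3]}]}
vector = first_embedding(sample_response)
```

Checking the length of the returned vector against the dimension documented for the model is a quick sanity test that the expected model answered.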

Conclusion

By following the steps in this article, you can set up LocalAI to compute embeddings for your data using GPU acceleration without having to rely on third-party inference services. With LocalAI, users of Elasticsearch in air-gapped environments or with privacy concerns can leverage the world-class vector database for their RAG applications without sacrificing computational performance or the ability to select the best AI model for their needs.

Try building your own RAG application with the Elastic Stack today: in the cloud, in an air-gapped environment, or on your laptop!

