Save space with byte-sized vectors

Elasticsearch is introducing a new type of vector in 8.6! This vector has 8-bit integer dimensions, where each dimension has a range of [-128, 127]. This is 4x smaller than current vectors with 32-bit float dimensions, which can result in substantial space savings.

You can start indexing these smaller, 8-bit vectors right now by adding the element_type parameter with the byte value to your vector mappings, similar to the example below.

{
    "mappings": {
        "properties": {
            "my_vector": {
                "type": "dense_vector",
                "element_type": "byte",
                "dims": 3,
                "index": true,
                "similarity": "dot_product"
            }
        }
    }
}
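
Once the mapping exists, byte vectors are indexed and searched the same way as float vectors. As a quick, hedged sketch using the 8.x Python client (the endpoint, index name, and vector values below are placeholders, not part of the original example):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Index a document whose vector dimensions are already 8-bit integers.
# Note: with dot_product similarity, vector magnitudes matter, so treat these
# values purely as placeholders.
es.index(index="my-index", id="1", document={"my_vector": [12, -34, 56]})

# Run an approximate kNN search against the byte vector field.
resp = es.search(
    index="my-index",
    knn={
        "field": "my_vector",
        "query_vector": [10, -30, 50],  # query vector must also use byte values
        "k": 10,
        "num_candidates": 100,
    },
)
print(resp["hits"]["hits"])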

But what if your existing vectors' dimensions don't fit into this smaller type? Then we can use the process of quantization to make them fit, often with only a small loss of precision!

Let's quantize

Let's start by defining quantization. Quantization is the process of taking a larger set of values and mapping them to a smaller set of values. More specifically, in our case this means taking the range of a 32-bit float and mapping it to the range of an 8-bit integer for each dimension in a vector. (This should not be confused with dimensionality reduction, which is a different topic. We are only reducing the range of values for the existing dimensions.)

This leads to two further questions. What is the actual range of our 32-bit float vectors? And what function should we use to do the mapping? The answers vary significantly based on use-case.

As an example, one of the simplest forms of quantization is taking the dimensions of normalized 32-bit vectors and linearly mapping them to the full range of the dimensions of 8-bit vectors. Using Python, this would look something like the following:

import numpy as np
import typing as t

def quantize_embeddings(text_and_embeddings: t.List[t.Mapping[str, t.Any]]) -> t.List[t.Mapping[str, t.Any]]:
    # Collect the normalized float embeddings into a single matrix.
    quantized_embeddings = np.array([x['embedding'] for x in text_and_embeddings])
    # Linearly map each dimension from roughly [-1, 1] into the 8-bit integer range.
    quantized_embeddings = (quantized_embeddings * 128)
    # Clip to [-128, 127] and convert to plain Python ints.
    quantized_embeddings = quantized_embeddings.clip(-128, 127).astype(int).tolist()
    # Return the original items with their embeddings replaced by the quantized versions.
    return [dict(item, **{'embedding': embedding}) for (item, embedding) in zip(text_and_embeddings, quantized_embeddings)]
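
For reference, calling the function might look like this; the texts and embedding values below are made up purely for illustration:

# Hypothetical input: items with normalized 32-bit float embeddings.
docs = [
    {'text': 'hello world', 'embedding': [0.12, -0.45, 0.89]},
    {'text': 'goodbye world', 'embedding': [-0.31, 0.77, -0.02]},
]

quantized_docs = quantize_embeddings(docs)
print(quantized_docs[0]['embedding'])  # [15, -57, 113]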

This is only a single example, though. There are many other useful quantization functions. For your specific use case, it's important to evaluate what method of quantization will give you the best results relative to the trade-off between space reduction, relevance, and recall.
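
For instance, if your vectors aren't normalized, one possible alternative (shown here only as a sketch, not a recommendation for every dataset) is to scale by the largest absolute dimension value observed in your data before mapping into the byte range:

import numpy as np

def quantize_by_max_abs(embeddings: np.ndarray) -> np.ndarray:
    # Hypothetical alternative: scale by the dataset's largest absolute value
    # so non-normalized vectors still use most of the [-128, 127] range.
    scale = 127.0 / np.abs(embeddings).max()
    return np.clip(np.round(embeddings * scale), -128, 127).astype(np.int8)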

Some real-world numbers

8-bit vectors and quantization are great and all, but do they really reduce space in a real-world use case? The answer is unequivocally YES! And substantially. This is all while they continue to give good results without hurting relevance and recall. Elasticsearch even has all the tools you need to do that evaluation yourself with our rank evaluation API.
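
As a rough sketch of what that evaluation could look like with the Python client (the endpoint, index name, document IDs, and ratings below are placeholders, and the rated request uses an exact script-based vector query; a real evaluation would cover many rated queries):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

resp = es.rank_eval(
    index="my-byte-vector-index",  # placeholder index name
    requests=[
        {
            "id": "example_query",
            "request": {
                "query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
                            "params": {"query_vector": [12, -34, 56]}  # placeholder byte vector
                        }
                    }
                }
            },
            # Placeholder relevance judgments for this query.
            "ratings": [
                {"_index": "my-byte-vector-index", "_id": "doc-1", "rating": 1}
            ]
        }
    ],
    metric={"recall": {"k": 100, "relevant_rating_threshold": 1}}
)
print(resp["metric_score"])

Swapping the metric for "dcg" with "normalize": true gives a normalized DCG score instead of recall.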

Now, let's look at some numbers generated from a real-world example with the following setup:

  1. All data was gathered using Elasticsearch on Elastic Cloud with two gcp.data.highcpu.1 64GB nodes
  2. Data was collected from the NQ (Natural Questions) dataset, built by Google and used in BEIR
  3. The embedding model was sentence-transformers/all-MiniLM-L6-v2
  4. Quantization to generate 8-bit integer vectors was applied to the 32-bit float vectors collected from the data using the previous example Python snippet (a rough end-to-end sketch follows this list)
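
As a rough end-to-end sketch of that pipeline (endpoint, index name, and texts are placeholders; the target index is assumed to have a 384-dimension byte dense_vector field):

from elasticsearch import Elasticsearch, helpers
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = ["what is quantization?", "how do vectors work?"]  # stand-ins for NQ passages

# all-MiniLM-L6-v2 produces 384-dimension embeddings; normalize them so the
# linear quantization above maps them into the byte range.
float_embeddings = model.encode(texts, normalize_embeddings=True)

docs = [{"text": t, "embedding": e.tolist()} for t, e in zip(texts, float_embeddings)]
byte_docs = quantize_embeddings(docs)  # the function from the earlier snippet

# Bulk index the quantized vectors into the byte vector field.
helpers.bulk(es, (
    {"_index": "my-byte-vector-index", "text": d["text"], "my_vector": d["embedding"]}
    for d in byte_docs
))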

Then we make some magic happen and collect results based on this setup:

Category       Median kNN Response Time   Median Exact Response Time   Recall@100   NDCG@10   Total Index Size (1p, 1r)
byte           32ms                       1072ms                       0.79         0.38      5.8gb
float          36ms                       1530ms                       0.79         0.38      16.4gb
% Reduction    11%                        30%                          0%           0%        64%

And our results look fantastic. Let's break down each one.

  • Median kNN Response Time: This response time is collected using approximate kNN search against our example data set. This type of search uses Lucene's HNSW graph as its backing data structure. We see an 11% reduction in median response time for byte versus float.
  • Median Exact Response Time: This response time is collected using exact kNN search against our example data set. This type of search uses a script to iterate through every vector in the data set and returns the best possible results. Here we see a large improvement: a 30% reduction in response time!
  • Recall@100: This shows us whether the most relevant results are included in the top 100, which is important for judging whether our quantization function worked well. The numbers are identical for byte and float, which means that, even after quantizing, relevance is just as good for byte as it is for float.
  • NDCG@10: This shows us how good the quality of our first 10 results is. This is another important metric for evaluating whether our quantization function worked well. Once again, the numbers are identical between byte and float, so we can rest assured that our results are still just as good even after quantization.
  • Total Index Size (1p, 1r): This is the total index size used for our vectors' index with a single primary shard and a single replica. For this metric, we disabled _source, which we recommend for all vector fields in which the ingested vector data is unmodified so it's not stored twice (see the sketch after this list). And we see a massive 64% reduction in total index size! This doesn't quite reach the 4x difference between byte and float because of additional overhead for the HNSW data structure, including graph connections, but it's still quite a substantial size reduction.
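
On that last point, one way to avoid storing the raw vectors twice is to exclude the vector field from _source when creating the index. A sketch with the Python client follows; the exact settings used in the benchmark above aren't shown here, so treat this as illustrative:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

es.indices.create(
    index="my-byte-vector-index",  # placeholder index name
    mappings={
        # Exclude the unmodified vector field from _source so it isn't stored twice.
        "_source": {"excludes": ["my_vector"]},
        "properties": {
            "my_vector": {
                "type": "dense_vector",
                "element_type": "byte",
                "dims": 384,  # all-MiniLM-L6-v2 output size
                "index": True,
                "similarity": "dot_product"
            }
        }
    }
)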

Byte vectors are all ready to go as part of 8.6, and we encourage you to fire up a cluster in Elastic Cloud and give them a try!

