Evaluating scalar quantization in Elasticsearch

Learn how scalar quantization can be used to reduce the memory footprint of vector embeddings in Elasticsearch through an experiment.

Understanding scalar quantization in Elasticsearch

In 8.13 we introduced scalar quantization to Elasticsearch. With this feature, an end user can provide float vectors that are internally indexed as byte vectors, while the original float vectors are retained in the index for optional re-scoring. This reduces the index memory requirement, which is its dominant cost, by a factor of four. At the moment this is an opt-in feature, but we believe it constitutes a better trade-off than indexing vectors as floats, and in 8.14 we will make it our default. Before doing this, however, we wanted a systematic evaluation of the quality impact.
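To make this concrete, here is a minimal sketch of a mapping that opts a dense_vector field into scalar quantization. The index and field names are illustrative; dims: 384 matches the E5-small model used below, and m and ef_construction are shown at the default values used in the experiments that follow (a full-precision index would use "type": "hnsw" instead):

```json
PUT quantized-index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "int8_hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}
```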

Experimentation: Evaluating scalar quantization

The multilingual E5-small is a small, high-quality multilingual passage embedding model that we offer out of the box in Elasticsearch. It has two versions: a cross-platform version which runs on any hardware, and a version which is optimized for CPU inference in the Elastic Stack (see here). E5 represents a challenging case for automatic quantization because the vectors it produces have low angular variation and are relatively low-dimensional compared to state-of-the-art models. If we can enable int8 quantization for this model with little to no damage, we can be confident that it will work reliably in general.
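As a quick illustration, once the model has been downloaded and deployed in Elasticsearch, embeddings can be generated with the trained models inference API. This is a sketch: the model ID corresponds to the built-in cross-platform version, and the input text (with E5's "query: " prefix) is arbitrary:

```json
POST _ml/trained_models/.multilingual-e5-small/_infer
{
  "docs": [
    { "text_field": "query: how does scalar quantization work?" }
  ]
}
```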

The purpose of this experiment is to estimate the effect of scalar-quantized kNN search, as described here, across a broad range of retrieval tasks using this model. More specifically, our aim is to assess the performance degradation (if any) incurred by switching from a full-precision to a quantized index.
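Each evaluation query is a standard approximate kNN search, which takes the same form against both the full-precision and the quantized index. A sketch, reusing the illustrative index and field names from the mapping above, with the query vector abbreviated (in practice it is the 384-dimensional E5 embedding of the query text):

```json
POST quantized-index/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.012, -0.031, ...],
    "k": 10,
    "num_candidates": 100
  }
}
```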

Overview of methodology

For the evaluation we relied on BEIR. For each dataset we considered, we built a full-precision index and an int8-quantized index using the default hyperparameters (m: 16, ef_construction: 100). First, we experimented with the quantized (weights only) version of the multilingual E5-small model provided by Elastic here. Table 1 presents a summary of the nDCG@10 scores (k: 10, num_candidates: 100):

| Dataset | Full precision | Int8 quantization | Absolute difference | Relative difference |
|---|---|---|---|---|
| Arguana | 0.37 | 0.362 | -0.008 | -2.16% |
| FiQA-2018 | 0.309 | 0.304 | -0.005 | -1.62% |
| NFCorpus | 0.302 | 0.297 | -0.005 | -1.66% |
| Quora | 0.876 | 0.875 | -0.001 | -0.11% |
| SCIDOCS | 0.135 | 0.132 | -0.003 | -2.22% |
| Scifact | 0.649 | 0.644 | -0.005 | -0.77% |
| TREC-COVID | 0.683 | 0.672 | -0.011 | -1.61% |
| Average | | | -0.005 | -1.05% |

Table 1: nDCG@10 scores for the full precision and int8 quantization indices across a selection of BEIR datasets

Overall, we observe a slight average relative decrease of 1.05%.

Next, we repeated the same evaluation process using the unquantized version of multilingual E5-small (see the model card here); Table 2 shows the respective results.

| Dataset | Full precision | Int8 quantization | Absolute difference | Relative difference |
|---|---|---|---|---|
| Arguana | 0.384 | 0.379 | -0.005 | -1.3% |
| Climate-FEVER | 0.214 | 0.222 | +0.008 | +3.74% |
| FEVER | 0.718 | 0.715 | -0.003 | -0.42% |
| FiQA-2018 | 0.328 | 0.324 | -0.004 | -1.22% |
| NFCorpus | 0.31 | 0.306 | -0.004 | -1.29% |
| NQ | 0.548 | 0.537 | -0.011 | -2.01% |
| Quora | 0.882 | 0.881 | -0.001 | -0.11% |
| Robust04 | 0.418 | 0.415 | -0.003 | -0.72% |
| SCIDOCS | 0.134 | 0.132 | -0.002 | -1.49% |
| Scifact | 0.67 | 0.666 | -0.004 | -0.6% |
| TREC-COVID | 0.709 | 0.693 | -0.016 | -2.26% |
| Average | | | -0.004 | -0.83% |

Table 2: nDCG@10 scores of multilingual-E5-small on a selection of BEIR datasets

Again, we observe a slight relative decrease in performance, equal to 0.83% on average. Finally, we repeated the exercise for multilingual E5-base, where the performance decrease was even smaller (0.59%).

But this is not the whole story: the increased efficiency of the quantized HNSW indices and the fact that the original float vectors are still retained in the index allow us to recover a significant portion of the lost performance through rescoring. More specifically, we can retrieve a larger pool of candidates through approximate kNN search in the quantized index, which is quite fast, and then compute the similarity function against the original float vectors and re-score accordingly.

As a proof of concept, we consider the NQ dataset, which exhibited a large performance decrease (2.01%) with multilingual E5-small. By setting k=15, num_candidates=100, and window_size=10 (as we are interested in nDCG@10), we get an improved score of 0.539, recovering about 20% of the lost performance. If we further increase the num_candidates parameter to 200, we get a score that matches the performance of the full-precision index, but with faster response times. The same setup on Arguana leads to an increase from 0.379 to 0.382, limiting the relative performance drop from 1.3% to only 0.52%.
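For reference, such a query can be expressed by combining the kNN search with a rescore block that re-scores the top candidates against the retained float vectors via a script_score query. This is a sketch under the same illustrative names as above, with the query vector abbreviated; the +1.0 offset keeps the cosine-based score non-negative, and query_weight: 0 makes the final score come entirely from the rescorer:

```json
POST quantized-index/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.012, -0.031, ...],
    "k": 15,
    "num_candidates": 100
  },
  "rescore": {
    "window_size": 10,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
            "params": {
              "query_vector": [0.012, -0.031, ...]
            }
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}
```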

Results

The results of our evaluation suggest that scalar quantization can be used to reduce the memory footprint of vector embeddings in Elasticsearch without significant loss in retrieval performance. The performance decrease is more pronounced for lower-dimensional vectors (multilingual E5-small produces 384-dimensional embeddings, while E5-base gives 768-dimensional ones), but it can be mitigated through rescoring. We are confident that scalar quantization will be beneficial for most users, and we plan to make it the default in 8.14.

Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.

Ready to build state of the art search experiences?

Sufficiently advanced search isn’t achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as you are. Let’s connect and work together to build the magical search experience that will get you the results you want.

Try it yourself