Evaluating scalar quantization in Elasticsearch

In 8.13 we introduced scalar quantization to Elasticsearch. With this feature, an end user can provide float vectors that are internally indexed as byte vectors, while the original float vectors are retained in the index for optional re-scoring. This reduces the index's memory requirement, which is its dominant cost, by a factor of four. At the moment this is an opt-in feature, but we believe it constitutes a better trade-off than indexing vectors as floats, and in 8.14 we will make it our default. Before doing so, however, we wanted a systematic evaluation of the quality impact.
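For reference, enabling scalar quantization amounts to choosing the int8_hnsw index type on a dense_vector field. The following sketch, using the Python client, shows roughly what such a mapping looks like; the index name, field name and client setup are illustrative, not part of the evaluation itself.

```python
from elasticsearch import Elasticsearch

# Illustrative client setup; adjust endpoint/credentials for your cluster.
es = Elasticsearch("http://localhost:9200")

# Create an index whose dense_vector field is stored as an int8-quantized HNSW
# graph, while the original float vectors are kept for optional re-scoring.
es.indices.create(
    index="beir-passages",  # hypothetical index name
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 384,              # multilingual E5-small embedding size
                "index": True,
                "similarity": "cosine",
                "index_options": {
                    "type": "int8_hnsw",    # scalar-quantized HNSW
                    "m": 16,                # default HNSW hyperparameters
                    "ef_construction": 100,
                },
            }
        }
    },
)
```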

The multilingual E5-small is a small, high-quality multilingual passage embedding model that we offer out-of-the-box in Elasticsearch. It has two versions: a cross-platform version which runs on any hardware and a version which is optimized for CPU inference in the Elastic Stack (see here). E5 represents a challenging case for automatic quantization because the vectors it produces have low angular variation and are relatively low-dimensional compared to state-of-the-art models. If int8 quantization causes little to no damage for this model, we can be confident that it will work reliably.

The purpose of this experimentation is to estimate the effect of scalar-quantized kNN search, as described here, across a broad range of retrieval tasks using this model. More specifically, our aim is to assess the performance degradation (if any) incurred by switching from a full-precision to a quantized index.

Overview of methodology

For the evaluation we relied upon BEIR. For each dataset we considered, we built a full-precision and an int8-quantized index using the default HNSW hyperparameters (m: 16, ef_construction: 100). First, we experimented with the quantized (weights only) version of the multilingual E5-small model provided by Elastic here. Table 1 presents a summary of the nDCG@10 scores (k: 10, num_candidates: 100):

| Dataset | Full precision | Int8 quantization | Absolute difference | Relative difference |
|---|---|---|---|---|
| Arguana | 0.37 | 0.362 | -0.008 | -2.16% |
| FiQA-2018 | 0.309 | 0.304 | -0.005 | -1.62% |
| NFCorpus | 0.302 | 0.297 | -0.005 | -1.66% |
| Quora | 0.876 | 0.875 | -0.001 | -0.11% |
| SCIDOCS | 0.135 | 0.132 | -0.003 | -2.22% |
| Scifact | 0.649 | 0.644 | -0.005 | -0.77% |
| TREC-COVID | 0.683 | 0.672 | -0.011 | -1.61% |
| Average | | | -0.005 | -1.05% |

Table 1: nDCG@10 scores of the quantized (weights only) multilingual E5-small model on full-precision and int8-quantized indices across a selection of BEIR datasets

Overall, we observe only a slight relative decrease of 1.05% on average.
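For context, the per-query retrieval in this evaluation corresponds to a standard approximate kNN search. A minimal sketch of such a request with the Python client is shown below, assuming the hypothetical index and field names from the earlier mapping and a pre-computed query embedding.

```python
# Approximate kNN search against the quantized index; k and num_candidates
# mirror the settings used in the evaluation (k: 10, num_candidates: 100).
query_embedding = [0.0] * 384  # placeholder; use the real E5-small query embedding

response = es.search(
    index="beir-passages",  # hypothetical index name from the mapping sketch
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 10,
        "num_candidates": 100,
    },
    size=10,
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```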

Next, we repeated the same evaluation process using the unquantized version of multilingual E5-small (see the model card here); Table 2 shows the corresponding results.

| Dataset | Full precision | Int8 quantization | Absolute difference | Relative difference |
|---|---|---|---|---|
| Arguana | 0.384 | 0.379 | -0.005 | -1.3% |
| Climate-FEVER | 0.214 | 0.222 | +0.008 | +3.74% |
| FEVER | 0.718 | 0.715 | -0.003 | -0.42% |
| FiQA-2018 | 0.328 | 0.324 | -0.004 | -1.22% |
| NFCorpus | 0.31 | 0.306 | -0.004 | -1.29% |
| NQ | 0.548 | 0.537 | -0.011 | -2.01% |
| Quora | 0.882 | 0.881 | -0.001 | -0.11% |
| Robust04 | 0.418 | 0.415 | -0.003 | -0.72% |
| SCIDOCS | 0.134 | 0.132 | -0.003 | -1.49% |
| Scifact | 0.67 | 0.666 | -0.004 | -0.6% |
| TREC-COVID | 0.709 | 0.693 | -0.016 | -2.26% |
| Average | | | -0.004 | -0.83% |

Table 2: nDCG@10 scores of the unquantized multilingual E5-small model on a selection of BEIR datasets

Again, we observe a slight relative decrease in performance of 0.83%. Finally, we repeated the exercise for multilingual E5-base, and the performance decrease was even smaller (0.59%).

But this is not the whole story: the increased efficiency of the quantized HNSW indices, together with the fact that the original float vectors are still retained in the index, allows us to recover a significant portion of the lost performance through rescoring. More specifically, we can retrieve a larger pool of candidates through approximate kNN search in the quantized index, which is quite fast, and then compute the similarity function on the original float vectors and re-score accordingly.

As a proof of concept, we consider the NQ dataset, which exhibited a large performance decrease (2.01%) with multilingual E5-small. By setting k=15, num_candidates=100 and window_size=10 (as we are interested in nDCG@10), we get an improved score of 0.539, recovering about 20% of the lost performance. If we further increase num_candidates to 200, we get a score that matches the performance of the full-precision index but with faster response times. The same setup on Arguana raises the score from 0.379 to 0.382, limiting the relative performance drop from 1.3% to only 0.52%.
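One way to express this two-stage approach in a single request is to combine the kNN retrieval with a rescore section that re-computes similarity on the stored float vectors via a script. The sketch below is illustrative rather than the exact queries used in the experiments: it assumes the same hypothetical index and field names as above and a cosine similarity model, with parameter values mirroring the NQ setup (k=15, num_candidates=100, window_size=10).

```python
query_embedding = [0.0] * 384  # placeholder; use the real query embedding

response = es.search(
    index="beir-passages",
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 15,                # retrieve a slightly larger candidate pool
        "num_candidates": 100,  # raising this to 200 matched full precision in our test
    },
    rescore={
        "window_size": 10,  # we only evaluate nDCG@10, so re-score the top 10
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        # Exact cosine similarity on the retained float vectors.
                        "source": "cosineSimilarity(params.qv, 'embedding') + 1.0",
                        "params": {"qv": query_embedding},
                    },
                }
            },
            "query_weight": 0.0,          # discard the approximate kNN score
            "rescore_query_weight": 1.0,  # rank purely by the exact similarity
        },
    },
    size=10,
)
```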

Conclusion

The results of our evaluation suggest that scalar quantization can be used to reduce the memory footprint of vector embeddings in Elasticsearch without significant loss in retrieval performance. The performance decrease is more pronounced for lower-dimensional vectors (multilingual E5-small produces 384-dimensional embeddings, while E5-base gives 768-dimensional ones), but this can be mitigated through rescoring. We are confident that scalar quantization will be beneficial for most users and we plan to make it the default in 8.14.
