Better binary quantization 101: How it works & the big benefits

Understand what binary quantization is, how it works and its benefits. This guide also covers the math behind the quantization and examples.

Introduction

As we discussed previously in Scalar Quantization 101, most embedding models output float32 vector values, which is often excessive for representing the vector space. Scalar quantization techniques greatly reduce the space needed to represent these vectors. We've also previously talked about Bit Vectors In Elasticsearch and how binary quantization can often be unacceptably lossy.

With binary quantization techniques such as those presented in the RaBitQ paper, we can address the problems associated with naively quantizing into a bit vector and maintain the quality associated with scalar quantization by more thoughtfully subdividing the space and retaining residuals of the transformation. These newer techniques allow for better optimizations in distance calculations and generally better results than similar techniques like product quantization (PQ), along with 32x compression that typically is not possible with scalar quantization.

Here we'll walk through some of the core aspects of binary quantization and leave the mathematical details to the RaBitQ paper.

Building the Bit Vectors

Because we can more efficiently pre-compute some aspects of the distance computation, we treat the indexing and query construction separately.

To start with, let's walk through indexing three very simple 2-dimensional vectors, v_1, v_2, and v_3, to see how they are transformed and stored for efficient distance computations at query time.

v_1 = [0.56, 0.82]
v_2 = [1.23, 0.71]
v_3 = [-3.28, 2.13]

Our objective is to transform these vectors into much smaller representations that allow:

  • A reasonable proxy for estimating distance rapidly
  • Some guarantees about how vectors are distributed in the space for better control over the total number of data vectors needed to recall the true nearest neighbors

We can achieve this by:

  • Shifting each vector to within a hyper-sphere, in our case a 2d circle, the unit circle
  • Snapping each vector to a single representative point within each region of the circle
  • Retaining corrective factors to better approximate the distance between each vector and the query vector

Let's unpack that step by step.

Find a Representative Centroid

In order to partition each dimension we need to pick a pivot point. For simplicity, we'll select one point to use to transform all of our data vectors.

Let's continue with our vectors, v_1, v_2, and v_3. We pick their centroid as the pivot point.

v_1 = [0.56, 0.82]
v_2 = [1.23, 0.71]
v_3 = [-3.28, 2.13]

c = [(0.56 + 1.23 + -3.28) / 3, (0.82 + 0.71 + 2.13) / 3]
c = [-0.49, 1.22]

Here's all of those points graphed together:

Each vector's residual from the centroid is then normalized. We'll call these normalized residuals v_{c1}, v_{c2}, and v_{c3}.

v_{c1} = (v_1 - c) / \|v_1 - c\|
v_{c2} = (v_2 - c) / \|v_2 - c\|
v_{c3} = (v_3 - c) / \|v_3 - c\|

Let's do the math for one of the vectors together:

v_{c1} = (v_1 - c) / \|v_1 - c\|

v_1 - c = [0.56, 0.82] - [-0.49, 1.22] = [1.05, -0.39]

\|v_1 - c\| = 1.13

v_{c1} = [1.05, -0.39] / 1.13 = [0.94, -0.35]

And for each of the remaining vectors:

v_{c2} = [0.96, -0.28]
v_{c3} = [-0.95, 0.31]
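The centering and normalization steps above can be sketched in a few lines of NumPy (variable names are ours, purely for illustration):

```python
import numpy as np

# The three example data vectors from the walkthrough
vectors = np.array([[0.56, 0.82], [1.23, 0.71], [-3.28, 2.13]])

# Pick the centroid of all data vectors as the pivot point
c = vectors.mean(axis=0)

# Subtract the centroid and normalize each residual to unit length
residuals = vectors - c
norms = np.linalg.norm(residuals, axis=1, keepdims=True)
normalized = residuals / norms
```

Using the exact centroid (rather than the rounded values shown in the walkthrough), `normalized` rounds to the same values as v_{c1}, v_{c2}, and v_{c3} above.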

Let's see that transformation and normalization all together:

As you may be able to see, our points now sit on the unit circle around the centroid, as if we had placed the centroid at the origin.

1 Bit, 1 Bit Only Please

With our data vectors centered and normalized, we can apply standard binary quantization, encoding each transformed vector component with a 0 if it is negative and a 1 if it is positive. In our 2-dimensional example this splits our unit circle into four quadrants, and the binary vectors corresponding to v_{c1}, v_{c2}, and v_{c3} become r_1 = [1, 0], r_2 = [1, 0], r_3 = [0, 1], respectively.

We finish quantizing each data vector by snapping it to a representative point within each region, specifically picking a point equidistant from each axis on the unit circle: \pm\frac{1}{\sqrt{d}}

We'll denote each quantized vector as \overline{v}_1, \overline{v}_2, and \overline{v}_3.

So, for instance, if we snap v_{c1} to its representative point within its region r_1 we get:

\overline{v}_1 = \frac{1}{\sqrt{d}}(2 r_1 - 1) = \frac{1}{\sqrt{2}}[1, -1] = [\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}]

And here are the quantized forms for the other data vectors, v_2 and v_3:

\overline{v}_2 = [\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}]
\overline{v}_3 = [-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}]
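The bit encoding and snapping step above can be sketched like so, starting from the normalized residuals (rounded values from the walkthrough; names are illustrative):

```python
import numpy as np

d = 2  # dimensionality
# Normalized residuals from the previous step (rounded walkthrough values)
normalized = np.array([[0.94, -0.35], [0.96, -0.28], [-0.95, 0.31]])

# 1-bit quantization: 1 for non-negative components, 0 for negative
bits = (normalized >= 0).astype(np.int8)

# Snap each vector to its region's representative point, +-1/sqrt(d)
quantized = (2 * bits - 1) / np.sqrt(d)
```

In a real index only `bits` would be stored per vector; the representative points are implied by the formula.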

Picking these representative points has some nice mathematical properties, as outlined in the RaBitQ paper, and they are not unlike the codebooks seen in product quantization (PQ).

At this point we have a 1-bit approximation, albeit a somewhat fuzzy one, that we can use to do distance comparisons. Clearly, \overline{v}_1 and \overline{v}_2 are now identical in this quantized state, which is not ideal and is a similar problem to the one we experienced when we discussed encoding float vectors as Bit Vectors In Elasticsearch.

The elegance of this is that at query time we can use something akin to a dot product to compare each data vector and each query vector rapidly for an approximation of distance. We'll see that in more detail when we discuss handling the query.

The Catch

As we saw above, a lot of information is lost when converting to bit vectors. We'll need some additional information to help compensate for the loss and correct our distance estimations.

In order to recover fidelity we'll store the distance from each vector to the centroid and the projection (dot product) of each vector (e.g. v_{c1}) with its quantized form (e.g. \overline{v}_1) as two float32 values.

The Euclidean distance to the centroid is straightforward, and we already computed it when quantizing each vector:

\|v_1 - c\| = 1.13
\|v_2 - c\| = 1.79
\|v_3 - c\| = 2.92

Precomputing the distance from each data vector to the centroid undoes the centering transformation. Similarly, we'll compute the distance from the query to the centroid. Intuitively, the centroid acts as a go-between rather than directly computing the distance between the query and data vector.

The dot product of the vector and the quantized vector is then:

v_{c1} \cdot \overline{v}_1 = v_{c1} \cdot \frac{1}{\sqrt{2}}(2 r_1 - 1) = [0.94, -0.35] \cdot [\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}] = 0.90

And for our other two vectors:

v_{c2} \cdot \overline{v}_2 = 0.95
v_{c3} \cdot \overline{v}_3 = 0.89
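Both corrective factors are computed once at index time. A sketch for v_1, using the exact centroid rather than the rounded values above (so the last decimal can differ slightly from the walkthrough):

```python
import numpy as np

vectors = np.array([[0.56, 0.82], [1.23, 0.71], [-3.28, 2.13]])
c = vectors.mean(axis=0)

v1 = vectors[0]
v1_bar = np.array([1, -1]) / np.sqrt(2)  # quantized form of v1

# Corrective factor 1: Euclidean distance from v1 to the centroid
dist_to_centroid = float(np.linalg.norm(v1 - c))

# Corrective factor 2: projection of the normalized residual onto
# its quantized form
vc1 = (v1 - c) / dist_to_centroid
projection = float(vc1 @ v1_bar)
```

These two floats are all that is stored alongside the bit vector for each indexed vector.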

The dot product between the quantized vector and the original vector, our second corrective factor, captures how far the quantized vector sits from its original position. In section 3.2, the RaBitQ paper shows that naively computing the dot product between the quantized data and query vectors introduces a bias; this factor exactly compensates for it.

Keep in mind that we are doing this transformation to reduce the total size of the data vectors and reduce the cost of vector comparisons. These corrective factors, while seemingly large in our 2d example, become insignificant as vector dimensionality increases. For example, a 1024-dimensional vector stored as float32 requires 4096 bytes. Stored with this bit compression and corrective factors, only 136 bytes are required.

To better understand why we use these factors refer to the RaBitQ paper. It gives an in-depth treatment of the math involved.

The Query

q = [0.68, -1.72]

To be able to compare our quantized data vectors to a query vector, we must first get the query vector into a quantized form and shift it relative to the unit circle. We'll refer to the query vector as q, the transformed vector as q_c, and the scalar quantized vector as \overline{q}.

q - c = [0.68 - -0.49, -1.72 - 1.22] = [1.17, -2.95]

q_c = (q - c) / \|q - c\| = [1.17, -2.95] / 3.17 = [0.37, -0.92]

Next we perform scalar quantization on the query vector down to 4 bits; we'll call this vector \overline{q}. It's worth noting that we do not quantize down to a bit representation but instead maintain an int4 scalar quantization, storing \overline{q} as an int4 byte array, for estimating the distance. We can take advantage of this asymmetric quantization to retain more information without additional storage.

lower = -0.92
upper = 0.37

width = (upper - lower) / (2^4 - 1) = (0.37 - -0.92) / 15 = 0.086

\overline{q} = \lfloor (q_c - lower) / width \rceil = \lfloor ([0.37, -0.92] - [-0.92, -0.92]) / 0.086 \rceil = [15, 0]

As you can see, because we have only 2 dimensions our quantized query vector now consists of two values at the ceiling and floor of the int4 range. With longer vectors you would see a variety of int4 values, with one of them being the ceiling and one of them being the floor.
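The query-side scalar quantization can be sketched as follows (variable names are illustrative, not from any particular library):

```python
import numpy as np

qc = np.array([0.37, -0.92])  # normalized query residual from the walkthrough

lower, upper = qc.min(), qc.max()
width = (upper - lower) / (2 ** 4 - 1)  # 15 steps for the int4 range

# Shift, scale, and round each component to its nearest int4 bucket
q_bar = np.round((qc - lower) / width).astype(np.int8)
```

Note that `lower` and `width` must be kept alongside `q_bar`, since they are needed later to undo this transformation.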

Now we are ready to perform a distance calculation comparing each indexed data vector with this query vector. We do this by summing the quantized query components wherever the corresponding data-vector bit is 1. Basically, a plain old dot product, but with bits and bytes.

\overline{q} \cdot r_1 = [15, 0] \cdot [1, 0] = 15
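This quantized dot product only needs the query components where the data vector's bit is 1, which is what makes it so cheap. A minimal sketch:

```python
import numpy as np

q_bar = np.array([15, 0])  # int4-quantized query
r1 = np.array([1, 0])      # 1-bit code for v1

# Sum the quantized query components wherever the data bit is 1 --
# equivalent to a dot product between the int4 array and the bit vector
dot = int(q_bar[r1 == 1].sum())
```

In a production implementation the bits are packed into bytes and this sum is done with bitwise masking and popcount-style tricks rather than array indexing.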

We can now apply corrective factors to unroll the quantization and get a more accurate reflection of the estimated distance.

To achieve this we'll collect the upper and lower bound from the quantized query, which we derived when doing the scalar quantization of the query. Additionally we need the distance from the query to the centroid. Since we computed the distance between a vector and a centroid previously we'll just include that distance here for reference:

\|q - c\| = 3.17

Estimated Distance

Alright! We have quantized our vectors and collected corrective factors. Now we are ready to compute the estimated distance between v_1 and q. Let's transform our Euclidean distance into an equation that has much more computationally friendly terms:

dist(v_1, q) = \|v_1 - q\|
= \sqrt{\|(v_1 - c) - (q - c)\|^2}
= \sqrt{\|v_1 - c\|^2 + \|q - c\|^2 - 2 \times \|v_1 - c\| \times \|q - c\| \times (q_c \cdot v_{c1})}

In this form, most of these factors, such as \|v_1 - c\|, were derived previously and can be pre-computed prior to the query, or do not involve direct comparisons between the query vector and any given data vector such as v_1.

However, we still need to compute q_c \cdot v_{c1}. We can utilize our corrective factors and our quantized binary distance metric \overline{q} \cdot r_1 to estimate this value reasonably and quickly. Let's walk through that.

q_c \cdot v_{c1} \approx (q_c \cdot \overline{v}_1) / (v_{c1} \cdot \overline{v}_1)

Let's start by estimating q_c \cdot \overline{v}_1, which requires this equation that essentially unrolls our transformations using the representative points we defined earlier:

q_c \cdot \overline{v}_1 \approx (lower + width \cdot \overline{q}) \cdot \frac{1}{\sqrt{d}}(2 r_1 - 1)


Specifically, \frac{1}{\sqrt{d}}(2 r_1 - 1) maps the binary values back to our representative point, and lower + width \cdot \overline{q} undoes the shift and scale used to compute the scalar quantized query components.

We can rewrite this to make it more computationally friendly like this.

First, though, let's define a couple of helper variables:

  • the number of 1 bits in r_1 as \overline{v}_{b1}, which is 1 in this case
  • the sum of all quantized values in \overline{q} as \overline{q}_b, which is 15 in this case

q_c \cdot \overline{v}_1 \approx \frac{2 \times width}{\sqrt{d}} \times (\overline{q} \cdot r_1) + \frac{2 \times lower}{\sqrt{d}} \times \overline{v}_{b1} - \frac{width}{\sqrt{d}} \times \overline{q}_b - \sqrt{d} \times lower
\approx \frac{2 \times 0.086}{\sqrt{2}} \times 15 + \frac{2 \times -0.92}{\sqrt{2}} \times 1 - \frac{0.086}{\sqrt{2}} \times 15 - \sqrt{2} \times -0.92
\approx 0.91

With this value and v_{c1} \cdot \overline{v}_1, which we precomputed when indexing our data vector, we can plug those values in to compute an approximation of q_c \cdot v_{c1}:

q_c \cdot v_{c1} \approx (q_c \cdot \overline{v}_1) / (v_{c1} \cdot \overline{v}_1) \approx 0.91 / 0.90 \approx 1.01

Finally, let's plug this into our larger distance equation, noting that we are using an estimate for q_c \cdot v_{c1}:

dist(v_1, q) = \sqrt{\|v_1 - c\|^2 + \|q - c\|^2 - 2 \times \|v_1 - c\| \times \|q - c\| \times (q_c \cdot v_{c1})}
est\_dist(v_1, q) = \sqrt{1.13^2 + 3.17^2 - 2 \times 1.13 \times 3.17 \times 1.01}
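Putting the whole estimate together for v_1 and q, using the precomputed factors from the walkthrough (the inputs are the rounded values above, so treat the outputs as approximate):

```python
import numpy as np

d = 2
dist_v1_c = 1.13             # ||v1 - c||, stored at index time
dist_q_c = 3.17              # ||q - c||, computed at query time
proj = 0.90                  # vc1 . v1_bar, stored at index time
lower, width = -0.92, 0.086  # query quantization bounds
q_dot_r1 = 15                # quantized dot product
v_b1 = 1                     # number of 1 bits in r1
q_b = 15                     # sum of quantized query components

# Estimate qc . v1_bar by unrolling the quantization shifts
sd = np.sqrt(d)
est = (2 * width / sd) * q_dot_r1 + (2 * lower / sd) * v_b1 \
      - (width / sd) * q_b - sd * lower

# Correct for the quantization bias, then plug into the expanded
# Euclidean distance
qc_dot_vc1 = est / proj
est_dist = np.sqrt(dist_v1_c ** 2 + dist_q_c ** 2
                   - 2 * dist_v1_c * dist_q_c * qc_dot_vc1)
```

Everything here except `q_dot_r1` is either stored per vector or computed once per query, which is why the per-vector work at search time reduces to the cheap bitwise dot product plus a handful of float operations.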

With all of the corrections applied we are left with a reasonable estimate of the distance between two vectors. For instance, the estimated distances between each of our original data vectors, v_1, v_2, and v_3, and q compared to the true distances are:

est\_dist(v_1, q) = 2.02
est\_dist(v_2, q) = 1.15
est\_dist(v_3, q) = 6.15

eucl\_dist(v_1, q) = 2.55
eucl\_dist(v_2, q) = 2.50
eucl\_dist(v_3, q) = 5.52

For details on how the linear algebra is derived or simplified when applied refer to the RaBitQ paper.

Re-ranking

As you can see from the results in the prior section, these estimated distances are indeed estimates. Binary quantization produces vectors whose distance calculations are, even with the extra corrective factors, only an approximation of the distance between vectors. In our experiments we were able to achieve high recall by using a multi-stage process, which confirms the findings in the RaBitQ paper.

Therefore, to achieve high-quality results, a reasonable sample of the vectors returned from binary quantization must be re-ranked with a more exact distance computation. In practice this subset of candidates can be small, typically achieving high (> 95%) recall with 100 or fewer candidates for large datasets (> 1M vectors).

With RaBitQ, results are re-ranked continually as part of the search operation. In practice, to achieve a more scalable binary quantization, we decouple the re-ranking step. While RaBitQ is able to maintain a better list of top N candidates by re-ranking while searching, it comes at the cost of constantly paging in full float32 vectors, which is untenable at large scale.

Conclusion

Whew! You made it! This blog is indeed a big one. We are extremely excited about this new algorithm as it can alleviate many of the pain points of Product Quantization (e.g. code-book building cost, distance estimation slowness, etc.) and provide excellent recall and speed.

In following articles we'll dive deeper into how we integrated this with HNSW, performance tricks, how we compared it to Product Quantization, and how this performs on real data in Elasticsearch.

Ready to try this out on your own? Start a free trial.

Elasticsearch and Lucene offer strong vector database and search capabilities. Dive into our sample notebooks to learn more.

Ready to build state of the art search experiences?

Sufficiently advanced search isn't achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as you are. Let's connect and work together to build the magical search experience that will get you the results you want.
