Store compression in Lucene and Elasticsearch

Back in 2011, if you asked Lucene to index and store some text content, odds were high that your inverted index would take about 30% of the size of the original data while the document store (also called “stored fields”) would take a bit more than 100%. Why more than 100%? Because the document store simply wrote your documents sequentially to disk without any compression, and even added some overhead, for instance separators between fields. If you wanted compression, your only option was to compress field values yourself before sending them to Lucene.

Unfortunately, compression algorithms don’t cope well with short content: they need longer inputs in which they can find repeated patterns to take advantage of. In fact, if you compress a short string on its own, the result will likely be larger than the original due to the overhead of the compression container. So if you were indexing structured content made of many short field values, you had practically no options for store compression.

Lucene 4 and codecs

But then, in 2012, perspectives changed with the release of Lucene 4.0. One of the major highlights of Lucene 4.0 was the new codec API, which gives developers a framework that makes experimenting with file formats and maintaining backward compatibility easier. The latter point in particular is important: if you want to change the index format, all you need to do is build a new codec and make it the new default. Since segments record the codec that was used to write them, old segments keep working through the read API of the previous codec while new segments are written using the new codec. This was a very important change, since it allowed us to make drastic changes to the index format in minor releases, in a backward-compatible way. By the way, out of the eleven 4.x Lucene releases, six changed the index format!

In particular, the default codec changed in Lucene 4.1 in order to automatically compress the document store. It works by grouping documents into blocks of 16KB and compressing each block with LZ4, a lightweight compression algorithm. The benefit of this approach is that it also helps compress short documents, since several documents end up compressed together in a single block. The drawback is that in order to read a single document, you need to decompress the whole block it belongs to. In practice this rarely matters: decompressing 16KB blocks with LZ4 to fetch, say, 100 documents is still faster than running a non-trivial query, or even than seeking on a spinning disk for those 100 documents.

Better compression with DEFLATE

The good news is that we brought even more improvements to the document store in Lucene 5.0. More and more users are indexing huge amounts of data, and in such cases the bottleneck is often I/O, which heavier compression helps relieve by shrinking the amount of data that needs to be read and written. Lucene 5.0 still has the same default codec as Lucene 4.1, but now also lets you use DEFLATE (the compression algorithm behind zip, gzip and png) instead of LZ4 if you want better compression. We know this has been long awaited, especially by our logging users.

Opting in to better compression is just a matter of setting index.codec to best_compression. For instance, the following API call would create an index called my_index that trades stored field performance for better compression:

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "index.codec": "best_compression"
  }
}'
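
If you want to double-check which codec an index uses, you can read the setting back with the get settings API (my_index being the example index created above):

curl -XGET 'localhost:9200/my_index/_settings?pretty'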

Hot and cold nodes

This new option opens up new perspectives for managing hot and cold data: with time-based indices, it is common for indices to be queried less and less as they get older, since new data tends to be more interesting than old data. In order to remain cost-effective in such cases, a good practice is to leverage Elasticsearch’s shard allocation filtering to assign new indices to beefy machines with fast CPUs and disks, and old indices to cheaper machines that have plenty of disk space.
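
As a rough sketch of such a setup (the tag attribute and the index name below are made-up examples, any attribute name would do): give each node a tag in its config/elasticsearch.yml, then pin new indices to the hot nodes at creation time.

# config/elasticsearch.yml on the beefy machines
node.tag: hot

# config/elasticsearch.yml on the cheap machines with lots of disk
node.tag: cold

# create a new time-based index on the hot nodes (hypothetical index name)
curl -XPUT 'localhost:9200/logs-2015-04-27' -d '{
  "settings": {
    "index.routing.allocation.include.tag": "hot"
  }
}'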

Now you can also enable better compression on the cold nodes by setting index.codec: best_compression in their config/elasticsearch.yml file, so that you can archive more data with the same amount of disk space. The sequence of actions would be the following:

  • change the include/exclude tags of an index so that it moves from hot nodes to some cold nodes
  • call the _optimize API in order to merge segments: not only will it save memory, disk space and file handles by having fewer segments, but it will also have the side effect of rewriting the index using the new best_compression codec
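
Continuing the hypothetical tag-based setup sketched above, these two steps could look as follows (the index name and the max_num_segments value are just examples):

# 1. move the index from the hot nodes to the cold nodes
curl -XPUT 'localhost:9200/logs-2015-03-27/_settings' -d '{
  "index.routing.allocation.include.tag": "cold"
}'

# 2. once relocation has finished, merge down to a single segment;
#    on the cold nodes this rewrites the stored fields with best_compression
curl -XPOST 'localhost:9200/logs-2015-03-27/_optimize?max_num_segments=1'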

Improved merging of stored fields

Another change that we brought to Lucene 5 is improved merging of stored fields. When merging segments in Lucene 4, we would decompress them entirely and then recompress them into the new segment. While this was not much of an issue with LZ4, it is with DEFLATE, as merges can become CPU-bound because of (de)compression. We fixed this by copying compressed data directly at merge time. This is not as easy as it may sound: the last block of a segment is most of the time incomplete (less than 16KB), so naively copying compressed data would let incomplete blocks accumulate in the index and end up hurting the compression ratio. Instead, Lucene keeps track of the number of incomplete blocks and only falls back to recompressing when this number exceeds a certain threshold.

The Lucene 5 changes discussed here will be available in the upcoming Elasticsearch 2.0 release. I hope you enjoyed this retrospective on the state of store compression in Lucene. See you next time for another article about Lucene and Elasticsearch internals!