Sparse versus dense document values with Apache Lucene
Recently we've made some big changes to how Apache Lucene indexes and searches doc values, and we've added new nightly benchmarks, based on the New York City taxi ride data corpus, to measure our progress. These changes fix doc values so you only pay for what you use, just like all other parts of the Lucene index.
These changes will be in Lucene's next major release (7.0) and will likely not be back-ported to any 6.x release, so it will be some time before Elasticsearch exposes them.
What are doc values?
Doc values are Lucene's column-stride field value storage, letting you store numerics (single- or multi-valued), sorted keywords (single- or multi-valued) and binary data blobs per document. These values are quite fast to access at search time, since they are stored column-stride such that only the value for that one field needs to be decoded per hit. This is in contrast to Lucene's stored document fields, which store all field values for one document together in a row-stride fashion, and are therefore relatively slow to access.
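To make that concrete, here is a minimal sketch (with made-up field names) of how each of those doc values types is added to a document; all of these field classes live in org.apache.lucene.document:

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.util.BytesRef;

public class DocValuesFieldsExample {
  static Document buildDocument() {
    Document doc = new Document();
    // Single-valued numeric (e.g. a fare amount, in cents):
    doc.add(new NumericDocValuesField("fare_amount", 1250L));
    // Multi-valued numeric:
    doc.add(new SortedNumericDocValuesField("surcharge", 50L));
    doc.add(new SortedNumericDocValuesField("surcharge", 30L));
    // Single-valued sorted keyword:
    doc.add(new SortedDocValuesField("cab_color", new BytesRef("yellow")));
    // Multi-valued sorted keywords:
    doc.add(new SortedSetDocValuesField("tag", new BytesRef("airport")));
    doc.add(new SortedSetDocValuesField("tag", new BytesRef("night")));
    // Arbitrary per-document binary blob:
    doc.add(new BinaryDocValuesField("payload", new BytesRef(new byte[] {1, 2, 3})));
    return doc;
  }
}
```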
Doc values can be used to hold scoring signals, e.g. using expressions to combine multiple signals into a score, or for sorting, grouping, aggregations, possibly for block joins in the future, etc. Indeed, Lucene's default scoring signal, norms, a single byte per text field per document holding index-time scoring signals (the field's length and boost, quantized) used to implement our BM25 scoring, is also just a doc values field.
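As a quick example of the sorting use case, here is a minimal sketch that sorts hits by a hypothetical fare_amount field, assuming it was indexed as a NumericDocValuesField as above:

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class SortByDocValuesExample {
  static TopDocs topFares(Directory dir) throws IOException {
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Sort all hits by the fare_amount doc values field, highest fare first:
      Sort byFare = new Sort(new SortField("fare_amount", SortField.Type.LONG, true));
      return searcher.search(new MatchAllDocsQuery(), 10, byFare);
    }
  }
}
```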
Plumbing first
The changes began with this issue, which was "simply" a low-level raw plumbing change switching out how doc values are accessed at search time from the previous random-access API to an iterator API instead. You can see it as annotation H in the new nightly benchmarks. Because an iterator API is a much more restrictive access pattern than an arbitrary random access API, this change gives codecs more freedom to use aggressive compression and other optimizations, like our postings implementations do.
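To see what the switch means for a consumer, here is a minimal sketch of the iterator API from the search side: instead of asking for the value of an arbitrary document, you step through the documents that have a value in increasing docID order, just like postings (fare_amount is again a made-up field):

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

public class IterateDocValuesExample {
  static long sumFares(LeafReader reader) throws IOException {
    NumericDocValues fares = DocValues.getNumeric(reader, "fare_amount");
    long sum = 0;
    // Walk only the documents that actually have a value, in docID order:
    for (int docID = fares.nextDoc();
         docID != DocIdSetIterator.NO_MORE_DOCS;
         docID = fares.nextDoc()) {
      sum += fares.longValue();
    }
    return sum;
  }
}
```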
That change was already massive enough that we decided to break out all such codec improvements to future issues. Since this was a plumbing change only, we lost some search-time performance (see annotation BU) at first as the existing Lucene codec had to use temporary silly wrapper classes to translate its random-access API into an iterator API.
Sparse cases, where not all documents have a value for each doc values field, should especially benefit from this change. In the past, Lucene's non-sparse encoding of such fields was particularly unnerving, especially when you called forceMerge and saw the size of your index suddenly grow 10-fold! But even dense cases, where most documents have a value for the field, should see performance gains as well, since the more restrictive API gives codecs more compression freedom.
Codec Improvements
Fortunately, after the initial plumbing change, Adrien Grand worked hard to improve our default codec to take advantage of the more restrictive iterator APIs:
- Create a new 7.0 codec so we are free to make major changes to the index format
- Remove layers of abstraction for the common single-valued numeric and binary cases, for norms, and for when GCD compression is used (e.g. date/time fields)
- Implement sparse cases directly in the 7.0 codec, including norms
- Switch to a sparse encoding in IndexWriter to buffer doc values in memory before writing to disk
- Add a new advanceExact API for faster skipping to a target document, without an implicit nextDoc (sketched just below)
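Here is a minimal sketch of what that last API looks like to a consumer: advanceExact positions the iterator exactly on the requested document (which must not be behind the iterator's current position) and reports whether that document has a value at all, with no implicit hop to a later document (the field name is made up):

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

public class AdvanceExactExample {
  static long fareOrDefault(LeafReader reader, int docID, long defaultValue) throws IOException {
    NumericDocValues fares = DocValues.getNumeric(reader, "fare_amount");
    // advanceExact moves to exactly docID and returns true only if that
    // document has a value for the field:
    if (fares.advanceExact(docID)) {
      return fares.longValue();
    }
    return defaultValue;
  }
}
```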
These changes brought back much of our search performance on the dense use cases tested by Lucene's existing nightly benchmarks, and in some cases improved performance beyond where we were previously.
With all these improvements, and I'm sure many more to come, we finally get Lucene's doc values to a point where you only pay for what you use, just like the rest of Lucene's index parts.
Sparse benchmarks
Along with these Lucene improvements, we've added a new set of nightly Lucene benchmarks on sparse and dense documents to track our progress and catch any accidental performance regressions. As usual, the sources to run this benchmark are available here, and pull requests are welcome!
The benchmarks index a 20M document subset of the New York City taxi ride data corpus. Each document represents a single (anonymous!) taxi ride in New York City, in either a green or yellow cab. Green taxi rides make up about 11.5% of the corpus and yellow taxi rides around 88.5%, making a good test of mostly sparse and mostly dense fields versus 100% dense fields.
We index the same set of documents in three different ways:
- sparse, where green and yellow taxi rides have their own fields (e.g., green_fare_amount and yellow_fare_amount)
- dense, where all rides share a single set of fields and all documents have exactly the same fields (100% of fields are set)
- sparse-sorted, with index-time sorting by cab color (configuration sketched after this list)
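The sparse-sorted index relies on Lucene's index-time sorting; a minimal sketch of that configuration, assuming cab color is indexed as a SortedDocValuesField, looks like this:

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SortedIndexExample {
  static IndexWriter openSortedWriter() throws IOException {
    Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Sort every segment (and merged segment) by cab color at index time;
    // cab_color must be indexed as a SortedDocValuesField:
    iwc.setIndexSort(new Sort(new SortField("cab_color", SortField.Type.STRING)));
    return new IndexWriter(dir, iwc);
  }
}
```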
We also run a few basic searches over those three indices, including a TermQuery on cab color (both sorting by a numeric field and sorting by (degenerate) score), a two-clause BooleanQuery (cab color green or yellow) which matches all documents, and a numeric points range filter.
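For reference, those searches look roughly like this with Lucene's query APIs; the field names are made up, and the fare field is assumed to be indexed both as a LongPoint (for the range filter) and as a NumericDocValuesField (for sorting):

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;

public class BenchmarkQueriesSketch {
  // TermQuery on cab color:
  static final Query YELLOW = new TermQuery(new Term("cab_color", "yellow"));

  // Sort for the TermQuery hits: by a numeric doc values field:
  static final Sort BY_FARE = new Sort(new SortField("fare_amount", SortField.Type.LONG));

  // Two-clause BooleanQuery (green or yellow) which matches all documents:
  static final Query ALL_CABS = new BooleanQuery.Builder()
      .add(new TermQuery(new Term("cab_color", "green")), BooleanClause.Occur.SHOULD)
      .add(new TermQuery(new Term("cab_color", "yellow")), BooleanClause.Occur.SHOULD)
      .build();

  // Numeric points range filter (fares between $5.00 and $50.00, in cents):
  static final Query FARE_RANGE = LongPoint.newRangeQuery("fare_amount", 500L, 5000L);
}
```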
Back testing, where we check out the master sources as of a specific point in time in the past and then run the benchmark, was particularly tricky since it required linearizing the git commit history to determine which git commit hash the master branch pointed to at different times in the past, something git (amazingly) seems not to preserve.
Benchmark results
As typically happens when one creates a new benchmark, there are many curious WTF ("wow, that's funny") jumps and drops that need explaining, but here are some clear initial observations:
- The indexing metrics improve nicely over time: index size drops; indexing rate increases; docs per MB increases; flush and merge times drop
- Index-time sorting makes indexing ~50% slower, but gives impressive search speedups for queries like BooleanQuery, a worthwhile tradeoff for many users; even CheckIndex time is substantially faster with index sorting
- Index-time sorting causes points to use more search-time heap because it prevents this optimization; maybe we can fix the optimization to apply to sorted indices as well (pull requests welcome!)
- This change, which improved and simplified how MMapDirectory tries to prevent access to a closed file, made TermQuery and BooleanQuery faster since it reduced the cost of cloning IndexInput, which is done many times for a single query
- Annotations H and I on the "TermQuery, sorted by longitude" chart make it clear how much faster MMapDirectory is than NIOFSDirectory
Those last two annotations were important! Here we see an accidental performance regression, caused by a trivial bug fix that otherwise passed Lucene's numerous unit tests, but inadvertently caused Lucene to pick NIOFSDirectory instead of MMapDirectory. This is a stark reminder of why automated benchmarks are so important! We fixed it (annotation I) just in time for Lucene's 6.3.0 release.