Indexing IPv6 addresses in Elasticsearch
Starting with Elasticsearch 5.0, the ip field will support indexing IPv6 addresses.
Why only now?
The ability to index IPv6 addresses has been requested for a very long time. We pushed back until now because we had no way to index them efficiently. Our two options were basically to index them as sortable strings, or to index them with multiple levels of precision like we did for 32-bit and 64-bit numerics. The issue with indexing as a sortable string is terrible range performance: a range query typically has to visit every single matching value, so querying a large IPv6 subnet would have been very slow.
It would then be tempting to index IPv6 addresses with multiple levels of precision, but this raises another issue. The more levels of precision, the fewer terms you need to visit at search time and the faster the range queries. However, these additional terms also have a cost in terms of index size and indexing speed. With only 32 or 64 bits of data, we managed to find trade-offs that provided good search performance with a reasonable indexing slowdown and increase in index size. The trade-off is much harder with 128 bits of data: you get either terrible indexing performance or terrible search performance.
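To put rough numbers on this trade-off, here is a minimal sketch of the arithmetic. It assumes one indexed term per precision level and uses the worst-case bound quoted in the documentation of Lucene's legacy numeric range queries. Applying it to 128 bits is a thought experiment only, and the function names and precision steps are illustrative assumptions, not Elasticsearch settings.

```python
import math


def terms_per_value(bits: int, precision_step: int) -> int:
    """Terms indexed for each value: one per precision level."""
    return math.ceil(bits / precision_step)


def worst_case_range_terms(bits: int, precision_step: int) -> int:
    """Rough upper bound on the terms a single range query may visit.

    Every level except the coarsest can contribute a run of terms on both
    edges of the range; the coarsest level contributes one run.
    """
    levels = terms_per_value(bits, precision_step)
    per_level = 2 ** precision_step - 1
    return (levels - 1) * per_level * 2 + per_level


# Illustrative (bits, precision step) combinations; the steps are assumptions.
for bits, step in [(64, 4), (64, 16), (128, 4), (128, 16)]:
    print(bits, step, terms_per_value(bits, step), worst_case_range_terms(bits, step))
```

In this model, a small step on 128 bits means 32 indexed terms for every single address, while a large step keeps that at 8 terms but lets a range query visit close to a million terms in the worst case.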
What changed?
Lucene 6 introduced a new index structure called multi-dimensional points. In particular, this structure can be used to index numerics of up to 128 bits by configuring a number of dimensions equal to 1. Conceptually, it is not that different from how Lucene indexed numerics with multiple levels of precision, in the sense that it groups together values that are likely to match the same ranges. The difference is that points compute these ranges dynamically based on the data being indexed, rather than following a static scheme. This is a huge difference since it means we do not index ranges that will not be useful at search time, which is exactly what was adding bloat to the index and making indexing slow in previous versions.
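The idea of ranges derived from the data can be illustrated with a toy one-dimensional structure. This is only a sketch and not Lucene's implementation: the leaf size, the function names, and the linear scan over blocks (where a real structure would use a tree) are all simplifying assumptions. Values are sorted and cut into leaf blocks, so the block boundaries depend on what was indexed. A range query skips blocks outside the range, accepts blocks fully inside it, and only inspects the blocks that straddle an edge.

```python
import ipaddress
from bisect import bisect_left, bisect_right


def build_blocks(values, leaf_size=4):
    """Sort the values and cut them into leaf blocks of at most leaf_size."""
    ordered = sorted(values)
    return [ordered[i:i + leaf_size] for i in range(0, len(ordered), leaf_size)]


def range_query(blocks, low, high):
    """Return (matches, inspected) for the inclusive range [low, high].

    `inspected` counts the blocks whose individual values had to be examined.
    """
    matches = []
    inspected = 0
    for block in blocks:
        block_min, block_max = block[0], block[-1]
        if block_max < low or block_min > high:
            continue  # entirely outside the range: skip
        if low <= block_min and block_max <= high:
            matches.extend(block)  # entirely inside: accept wholesale
            continue
        inspected += 1  # straddles an edge: look at the values
        matches.extend(block[bisect_left(block, low):bisect_right(block, high)])
    return matches, inspected


# Sixteen example addresses from the 2001:db8::/32 documentation prefix,
# handled as 128-bit integers.
values = [int(ipaddress.IPv6Address(f"2001:db8::{i:x}")) for i in range(1, 17)]
blocks = build_blocks(values, leaf_size=4)
low = int(ipaddress.IPv6Address("2001:db8::3"))
high = int(ipaddress.IPv6Address("2001:db8::a"))
matches, inspected = range_query(blocks, low, high)
```

However wide the queried range is, at most two blocks straddle its edges in this one-dimensional toy, so the per-value work stays bounded by the leaf size.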
How does it work?
IPv6 addresses will be supported on all indexes that are created after the upgrade to 5.x. There will be no way to add IPv6 addresses to indexes that were created on Elasticsearch 2.x without reindexing. Internally, all IP addresses are now represented as 128-bit IPv6 addresses. If you index an IPv4 address, it will be automatically translated to an IPv4-mapped IPv6 address at index time, and then converted back to an IPv4 address when returning sort values or aggregations. For instance, the IPv4 address 1.2.3.4 would internally be indexed as 0:0:0:0:0:ffff:1.2.3.4.
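The translation to and from IPv4-mapped addresses can be sketched with Python's standard ipaddress module. This only illustrates the mapping itself, not the code Elasticsearch runs, and the helper names to_ipv6_bytes and from_ipv6_bytes are hypothetical.

```python
import ipaddress


def to_ipv6_bytes(address: str) -> bytes:
    """Encode any IP address as 16 bytes, mapping IPv4 to ::ffff:a.b.c.d."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        # IPv4-mapped layout: 10 zero bytes, then 0xffff, then the 4 IPv4 bytes.
        ip = ipaddress.IPv6Address(b"\x00" * 10 + b"\xff\xff" + ip.packed)
    return ip.packed


def from_ipv6_bytes(raw: bytes) -> str:
    """Decode 16 bytes, turning IPv4-mapped addresses back into IPv4 form."""
    ip = ipaddress.IPv6Address(raw)
    mapped = ip.ipv4_mapped  # None unless the address is IPv4-mapped
    return str(mapped if mapped is not None else ip)


encoded = to_ipv6_bytes("1.2.3.4")
decoded = from_ipv6_bytes(encoded)
```

Because every address ends up as the same 16-byte big-endian form, IPv4 and IPv6 values sort and compare consistently within a single field.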
You might be worried that indexes will be larger if you only need to index IPv4 addresses, since they only need 32 bits of data but are indexed as IPv6 addresses that need 128 bits. However, IPv4-mapped IPv6 addresses all start with the same 12 bytes, which makes compression easy. And since the new data structure that we use for indexing numerics is more space-efficient than the one we were using previously, you could actually expect a reduction in disk usage.
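The shared prefix is easy to verify. The sketch below does not reproduce Lucene's on-disk encoding. It only measures how many leading bytes a few example IPv4-mapped addresses have in common, which is the redundancy a prefix-based compression scheme can exploit. The helper name and sample addresses are illustrative.

```python
import ipaddress


def common_prefix_length(packed_values):
    """Number of leading bytes shared by every value in the list."""
    first = packed_values[0]
    length = len(first)
    for value in packed_values[1:]:
        i = 0
        while i < length and value[i] == first[i]:
            i += 1
        length = i
    return length


# Arbitrary example IPv4 addresses, written in IPv4-mapped IPv6 form.
ipv4_addresses = ["1.2.3.4", "10.0.0.1", "192.168.1.20"]
packed = [ipaddress.IPv6Address("::ffff:" + a).packed for a in ipv4_addresses]
shared = common_prefix_length(packed)
unique_bytes_per_value = len(packed[0]) - shared
```

For these addresses only the last 4 of the 16 bytes differ from value to value, which is the same amount of information as the original 32-bit IPv4 addresses.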