Resizing Elasticsearch shards for fun and profit
Quite some time ago, with the GA release of Elasticsearch 5.0 we released an API that allows to shrink an index into a new index with fewer shards than the original index. The reasoning behind adding this functionality was to provide a tool to keep the number of shards in a cluster at bay. Indices are commonly created with a large number of shards to maximise indexing throughput but later once theses indices are rolled over into a daily or hourly index, the number of shards should be reduced again to maximise resource utilization of the cluster. While this was a step in the right direction, we were still missing the ability to go in reverse: to increase the number of shards in an index if the index has become too small for the number of documents in it.
We had a story for this, we thought.
For the entire history of Elasticsearch the answer to the question of how to split an index was to just add another index and grow your shard count that way. Under the hood Elasticsearch just operates on a set of shards which allows you to add indices after the fact and search across them. This sounds like a great plan for use-cases where documents are largely immutable like metrics and logging, but not as great when you have to update documents, as is the case for use cases like full-text search. For this you had to go and figure out in what index the document was added and update it there or you delete the document everywhere and add it to the latest index. All of these options have downsides and can cause temporary duplicates or missing documents which are not always acceptable.
After having some deep technical conversations with some users on a trip to Munich I was convinced we have to do something about this. But some of the options scared me a fair bit. For instance splitting an index or a shard while the index is actively indexing sounds very appealing in the first place but when you start thinking about the implications and how it would work under the hood, the approach doesn’t sound all that great anymore. Aside from all the internal changes of how we do consistent hashing, how the number of shards are immutable in an index and how we can make this all work in a backwards compatible way, the performance implications of resharding while a cluster is rebalancing or loses a node had so many edge cases that this feature would likely take many engineering iterations to get right.
Luckily we already had implemented the shrink API that was basically doing the same thing in reverse with a lot of interesting properties. Shrink works by creating a copy of the index that recovers from the original index but uses fewer shards. Wait, a copy? That sounds very expensive copying all the data. To prevent this we make use of the properties lucene provides us by never touching a file more than once. Given that we can take point-in-time snapshots of the source index and use hard-links on the filesystem level, the shrink operation is almost instant and doesn’t require any additional space. All properties of Elasticsearch indices (such as the fixed number of shards) are maintained during this time. If any issues occur during the shrink, the shrink can be aborted and the target index deleted.
This sounds great, let’s see if we can do the same thing for splitting an index as well. And in fact it worked like a charm. The split API uses the same mechanics as the shrink API. Once an index is split, Elasticsearch allocates the shards of the target index alongside of the source index and recovers from the point-in-time snapshot of the underlying Lucene index. The only significant difference is that we need to do a single pass through the index on disk to mark all documents as deleted that don’t belong in the target shard. This means we have an additional space requirement of 1 bit per document while all other data files are reused via hard-links. Just like the shrink process this is a reasonably fast operation that typically happens within a couple of seconds and can be monitored with the recovery API.
Unfortunately you can’t rename an index in Elasticsearch today. To make efficient use of this API, the usage of index aliases is highly recommended. If an index needs to be resized, users should split or shrink the index, flip an alias to point to the new index and delete the old index in a single atomic step, using the update-aliases API.
It’s all about trade-offs.
That is likely the story of every software engineers life. In this case we made the trade-off that indexing has to stop for a short amount of time until the index is resized since source indices must be read-only. But as mentioned above these operations are considered fast, and given that this provides transactional behavior with the least amount of surprise to the user we are following our consistent strategy to go with the safe option rather than the fancy one. For details on split factors and how to pick the number of target shards, please see the official documentation for the specific version of Elasticsearch you are using.