Overview
The team at Elasticsearch is committed to continuously improving both Elasticsearch and Apache Lucene to protect your data. As with any distributed system, Elasticsearch is complex and has many moving parts, each of which can encounter edge cases that require proper handling. Our resiliency project is an ongoing effort to find and fix these edge cases. If you want to keep up with this project on GitHub, see our issues list under the tag resiliency.
While GitHub is great for sharing our work, it can be difficult to get an overview of the current state of affairs and the previous work that has been done from an issues list. This page provides an overview of all the resiliency-related issues that we are aware of, improvements that have already been made and current in-progress work. We’ve also listed some historical improvements throughout this page to provide the full context.
For more on how we approach ensuring resiliency in Elasticsearch, see Igor Motov’s talk Improving Elasticsearch Resiliency.
You may also be interested in our blog post Resiliency in Elasticsearch, which details our thought processes when addressing resiliency in both Elasticsearch and the work our developers do upstream in Apache Lucene.
Data Store Recommendations
Some customers use Elasticsearch as a primary datastore, some set up comprehensive backup solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for many different use cases, which is why we have created this page: to make sure you are fully informed when you are architecting your system.
Work in Progress
Known Unknowns (STATUS: ONGOING)
We consider this topic to be the most important in our quest for resiliency. We put a tremendous amount of effort into testing Elasticsearch to simulate failures and randomize configuration to produce extreme conditions. In addition, our users are an important source of information on unexpected edge cases and your bug reports help us make fixes that ensure that our system continues to be resilient.
If you encounter an issue, please report it!
We are committed to tracking down and fixing all the issues that are posted.
Jepsen Tests
The Jepsen platform is specifically designed to test distributed systems. It is not a single test and is regularly adapted to create new scenarios. We have ported all published Jepsen scenarios that deal with loss of acknowledged writes to our testing framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating all new scenarios and will report issues that we find on this page and in our GitHub repository.
Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)
If the node holding a primary shard is disconnected for whatever reason, the coordinating node retries the request on the same or a new primary shard. In certain rare conditions, where the node disconnects and immediately reconnects, it is possible that the original request has already been successfully applied but has not been reported, resulting in duplicate requests. This is particularly true when retrying bulk requests, where some actions may have completed and some may not have.
An optimization which disabled the existence check for documents indexed with auto-generated IDs could result in the creation of duplicate documents. This optimization has been removed. #9468 (STATUS: DONE, v1.5.0)
Further issues remain with the retry mechanism:
- Unversioned index requests could increment the _version twice, obscuring a created status.
- Versioned index requests could return a conflict exception, even though they were applied correctly.
- Update requests could be applied twice.
See #9967. (STATUS: ONGOING)
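As a rough illustration of the first issue in the list above, the toy model below shows how a retried, unversioned index request can increment _version twice and hide the created status. ToyPrimary and index_with_retry are hypothetical names invented for this sketch; they are not Elasticsearch classes or APIs.

```python
class ToyPrimary:
    """Hypothetical stand-in for a primary shard; not an Elasticsearch class."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source):
        # An unversioned index request always succeeds and bumps the version.
        version = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (version, source)
        return {"_version": version, "created": version == 1}


def index_with_retry(primary, doc_id, source, lose_first_ack):
    """Model a coordinating node that retries when the first response is lost."""
    response = primary.index(doc_id, source)
    if lose_first_ack:
        # The first request was applied, but its response never arrived,
        # so the coordinating node sends the same request again.
        response = primary.index(doc_id, source)
    return response
```

In this model, a lost acknowledgement makes the client see `_version` 2 with `created` false for a document it has only tried to index once.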
OOM resiliency (STATUS: ONGOING)
The family of circuit breakers has greatly reduced the occurrence of OOM exceptions, but it is still possible to cause a node to run out of heap space. The following issues have been identified:
- Set a hard limit on from/size parameters #9311. (STATUS: DONE, v2.1.0)
- Prevent combinatorial explosion in aggregations from causing OOM #8081. (STATUS: DONE, v5.0.0)
- Add the byte size of each hit to the request circuit breaker #9310. (STATUS: ONGOING)
- Limit the size of individual requests and also add a circuit breaker for the total memory used by in-flight request objects #16011. (STATUS: DONE, v5.0.0)
Other safeguards are tracked in the meta-issue #11511.
Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)
Indices stats and indices segments requests reach out to all nodes that have shards of that index. If a shard relocates away from a node while the stats request is arriving, that part of the request fails and the shard is simply ignored in the overall stats result. #13719
Documentation of guarantees and handling of failures (STATUS: ONGOING)
This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch and what happens in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test will document the expected results, the associated test code, and an explicit PASS or FAIL status for each simulated case.
Run Jepsen (STATUS: ONGOING)
We have ported the known scenarios in the Jepsen blogs that check loss of acknowledged writes to our testing infrastructure. The new tests are run continuously in our testing farm and are passing. We are also working on running Jepsen independently to verify that no failures are found.
Completed
Documents indexed during a network partition cannot be uniquely identified (STATUS: DONE, v7.0.0)
When a primary has been partitioned away from the cluster there is a short
period of time until it detects this. During that time it will continue
indexing writes locally, thereby updating document versions. When it tries
to replicate the operation, however, it will discover that it is partitioned
away. It won’t acknowledge the write and will wait until the partition is
resolved to negotiate with the master on how to proceed. The master will
decide to either fail any replicas which failed to index the operations on
the primary or tell the primary that it has to step down because a new primary
has been chosen in the meantime. Since the old primary has already written
documents, clients may already have read from the old primary before it shuts
itself down. The _version field of these reads may not uniquely identify the
document’s version if the new primary has already accepted writes for the same
document (see #19269).
The Sequence numbers infrastructure #10708 has introduced more
precise ways for tracking primary changes. This new infrastructure therefore
provides a way for uniquely identifying documents using their primary term
and sequence number fields, even in the presence of network partitions, and
has been used to replace the _version field in operations that require
uniquely identifying the document, such as optimistic concurrency control.
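The difference between the two identification schemes can be sketched with a small example. The DocCopy record and the concrete numbers are made up for illustration; only the field names _version, primary term and sequence number come from the text above.

```python
from typing import NamedTuple


class DocCopy(NamedTuple):
    """Hypothetical view of one write to a document, as a client might read it."""
    source: str
    version: int
    primary_term: int
    seq_no: int


# The isolated old primary and the newly elected primary both accept a second
# write to the same document, so both writes end up with _version 2.
old_primary_write = DocCopy("written on isolated primary", version=2, primary_term=1, seq_no=7)
new_primary_write = DocCopy("written on new primary", version=2, primary_term=2, seq_no=7)


def same_by_version(a, b):
    # Ambiguous: different writes can share a version number.
    return a.version == b.version


def same_by_term_and_seq_no(a, b):
    # The primary term changes on every primary promotion, so the pair
    # (primary_term, seq_no) distinguishes writes from different primaries.
    return (a.primary_term, a.seq_no) == (b.primary_term, b.seq_no)
```

Here the two writes look identical when compared by version alone, but differ once the primary term is taken into account.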
Replicas can fall out of sync when a primary shard fails (STATUS: DONE, v7.0.0)
When a primary shard fails, a replica shard will be promoted to be the primary shard. If there is more than one replica shard, it is possible for the remaining replicas to be out of sync with the new primary shard. This is caused by operations that were in-flight when the primary shard failed and may not have been processed on all replica shards. These discrepancies are not repaired on primary promotion but instead delayed until replica shards are relocated (e.g., from hot to cold nodes); this means that the length of time in which replicas can be out of sync with the primary shard is unbounded.
Sequence numbers #10708 provide a mechanism for identifying the discrepancies between shard copies at the document level, which allows the remaining replicas to be efficiently synced with the newly-promoted primary shard.
Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0)
During a network partition, cluster state updates (like mapping changes or shard assignments) are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the previously missed changes. However, if a second partition happens while the cluster is still recovering from the previous one and the old master falls on the minority side, it may be that a new master is elected which has not yet caught up. If that happens, cluster state updates can be lost.
This problem is mostly fixed by #20384 (v5.0.0), which takes committed cluster state updates into account during master election. This considerably reduces the chance of this rare problem occurring but does not fully mitigate it. If the second partition happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be that the in-flight update will be lost. If the now-isolated master can still acknowledge the cluster state update to the client, this will amount to the loss of an acknowledged change.
Fixing this last scenario was one of the goals of #32006 and its sub-issues. See particularly #32171 and the TLA+ formal model used to verify these changes.
Divergence between primary and replica shard copies when documents deleted (STATUS: DONE, v6.3.0)
Certain combinations of delays in performing activities related to the deletion of a document could result in the operations on that document being interpreted differently on different shard copies. This could lead to a divergence in the number of documents held in each copy.
Deleting an unacknowledged document that was concurrently being inserted using an auto-generated ID was erroneously sensitive to the order in which those operations were processed on each shard copy. Thanks to the introduction of sequence numbers (#10708) it is now possible to detect these out-of-order operations, and this issue was fixed in #28787.
Re-creating a document a specific interval after it was deleted could result in
that document’s tombstone having been cleaned up on some, but not all, copies
when processing the indexing operation that re-creates it. This resulted in
varying behaviour across the shard copies. The problematic interval was set by
the index.gc_deletes setting, which is 60 seconds by default. Again, sequence
numbers (#10708) give us the machinery to detect these conflicting
activities, and this issue was fixed in #28790.
Under certain rare circumstances a replica might erroneously interpret a stale tombstone for a document as fresh, resulting in a concurrent indexing operation for that same document behaving differently on this replica than on the primary. This is fixed in #29619. Triggering this issue required the following activities all to occur in a short time window, in a specific order on the primary and a different specific order on the replica:
- a document is deleted twice
- another document is indexed with the same ID as this first document
- another document is indexed with a completely different, auto-generated, ID
- two refreshes
We found the first two of these issues by empirical testing, and then we built a formal model of the replica’s behaviour using TLA+. Running the TLC model checker on this model found all three issues. We then applied the proposed fixes to the model and validated that the fixed design behaved as expected.
Port Jepsen tests dealing with loss of acknowledged writes to our testing framework (STATUS: DONE, v5.0.0)
We have increased our test coverage to include scenarios tested by Jepsen that demonstrate loss of acknowledged writes, as described in
the Elasticsearch-related blogs. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce
new error conditions.
You can view these changes on the 5.0 branch of the DiscoveryWithServiceDisruptionsIT class,
where the testAckedIndexing test was specifically added to check that we don’t lose acknowledged writes in various failure scenarios.
Loss of documents during network partition (STATUS: DONE, v5.0.0)
If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window is dependent on the type of the partition. This window is extremely small if a socket is broken. More adversarial partitions (for example, silently dropping requests without breaking the socket) can take longer to detect (up to 3x30s using current defaults).
If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in split-brain before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master (#7572). To prevent this situation, the primary needs to wait for the master to acknowledge replica shard failures before acknowledging the write to the client. #14252
Safe primary relocations (STATUS: DONE, v5.0.0)
When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the cluster state and believes itself to be the active primary and the relocation source has not yet processed the cluster state update and still believes itself to be the active primary. This means that an index request that gets routed to the new primary does not get replicated to the old primary (as it has been deactivated from the point of view of the new primary). If a subsequent read request gets routed to the old primary, it cannot see the indexed document. #15900
In the reverse situation where a cluster state update that completes primary relocation is first applied on the relocation source and then on the relocation target, each of the nodes believes the other to be the active primary. This leads to the issue of indexing requests chasing the primary being quickly sent back and forth between the nodes, potentially making them both go OOM. #12573
Do not allow stale shards to automatically be promoted to primary (STATUS: DONE, v5.0.0)
In some scenarios, after the loss of all valid copies, a stale replica shard can be automatically assigned as a primary, preferring old data to no data at all (#14671). This can lead to a loss of acknowledged writes if the valid copies are not lost but are rather temporarily unavailable. Allocation IDs (#14739) solve this issue by tracking non-stale shard copies in the cluster and using this tracking information to allocate primary shards. When all shard copies are lost or only stale ones are available, Elasticsearch will wait for one of the good shard copies to reappear. In the case where all good copies are lost, a manual override command can be used to allocate a stale shard copy.
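The allocation decision described above can be sketched as follows. This is a simplified model under stated assumptions: choose_primary, the in_sync_allocation_ids set and the force_stale flag are hypothetical names for this example and do not mirror the actual allocator code or the manual override command.

```python
def choose_primary(available_copies, in_sync_allocation_ids, force_stale=False):
    """Pick a shard copy to promote to primary, or None to keep waiting.

    available_copies: allocation IDs of the copies currently present in the cluster.
    in_sync_allocation_ids: allocation IDs tracked as non-stale.
    force_stale: models an explicit, manual operator override (assumption).
    """
    for allocation_id in available_copies:
        if allocation_id in in_sync_allocation_ids:
            return allocation_id  # a good copy is available: use it
    if force_stale and available_copies:
        # Only an explicit override accepts the possible loss of acknowledged writes.
        return available_copies[0]
    return None  # wait for a good copy to reappear
```

With only a stale copy present, the function returns None until a tracked copy comes back or the override is requested.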
Make index creation resilient to index closing and full cluster crashes (STATUS: DONE, v5.0.0)
Recovering an index requires a quorum (with an exception for 2) of shard copies to be available to allocate a primary. This means that a primary cannot be assigned if the cluster dies before enough shards have been allocated (#9126). The same happens if an index is closed before enough shard copies were started, making it impossible to reopen the index (#15281). Allocation IDs (#14739) solve this issue by tracking allocated shard copies in the cluster. This makes it possible to safely recover an index in the presence of a single shard copy. Allocation IDs can also distinguish the situation where an index has been created but none of the shards have been started. If such an index was inadvertently closed before at least one shard could be started, a fresh shard will be allocated upon reopening the index.
Use two phase commit for Cluster State publishing (STATUS: DONE, v5.0.0)
A master node in Elasticsearch continuously monitors the cluster nodes and removes any node from the cluster that doesn’t respond to its pings in a timely fashion. If the master is left with too few nodes, it will step down and a new master election will start.
When a network partition causes a master node to lose many followers, there is a short window in time until the node loss is detected and the master steps down. During that window, the master may erroneously accept and acknowledge cluster state changes. To avoid this, we introduce a new phase to cluster state publishing where the proposed cluster state is sent to all nodes but is not yet committed. Only once enough nodes have actively acknowledged the change is it committed and commit messages sent to the nodes. See #13062.
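The two phases can be sketched with a toy model. The Node class, publish_cluster_state and the required_acks parameter are hypothetical names chosen for this example; the real publishing code is considerably more involved.

```python
class Node:
    """Hypothetical cluster node that can hold a pending and a committed state."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.pending = None
        self.committed = None

    def receive_publish(self, state):
        # Phase 1: store the proposed state without applying it.
        if not self.reachable:
            return False
        self.pending = state
        return True

    def receive_commit(self):
        # Phase 2: apply the previously received state.
        if self.reachable and self.pending is not None:
            self.committed, self.pending = self.pending, None


def publish_cluster_state(state, nodes, required_acks):
    """Return True only if the state was committed."""
    acked = [node for node in nodes if node.receive_publish(state)]
    if len(acked) < required_acks:
        return False  # too few acknowledgements: nothing is committed
    for node in acked:
        node.receive_commit()
    return True
```

A master that has lost most of its followers cannot gather enough acknowledgements in the first phase, so in this model the change is never committed or reported as successful.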
Wait on incoming joins before electing local node as master (STATUS: DONE, v2.0.0)
During master election each node pings in order to discover other nodes and validate the liveness of existing
nodes. Based on this information the node either discovers an existing master or, if enough nodes are found, a new master will be elected. Currently, the node that is
elected as master will update the cluster state to indicate the result of the election. Other nodes will submit
a join request to the newly elected master node. Instead of immediately processing the election result, the elected master
node should wait for the incoming joins from other nodes, thus validating that the result of the election is properly applied. As soon as enough
nodes have sent their join requests (based on the minimum_master_nodes setting) the cluster state is updated.
#12161
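The waiting rule can be sketched in a few lines. This sketch assumes that the local node counts towards minimum_master_nodes, which the text above does not state; may_become_master is a hypothetical helper name.

```python
def may_become_master(joined_node_ids, minimum_master_nodes):
    """Decide whether the locally elected node may publish itself as master.

    joined_node_ids: IDs of other master-eligible nodes whose join requests
    have arrived so far.
    Assumption: the local node itself counts as one of the required nodes.
    """
    votes = 1 + len(set(joined_node_ids))
    return votes >= minimum_master_nodes
```

With minimum_master_nodes set to 2, for example, the elected node in this model keeps waiting until at least one other node has joined.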
Mapping changes should be applied synchronously (STATUS: DONE, v2.0.0)
When introducing new fields using dynamic mapping, it is possible that the same field can be added to different shards with different data types. Each shard will operate with its local data type but, if the shard is relocated, the data type from the cluster state will be applied to the new shard, which can result in a corrupt shard. To prevent this, new fields should not be added to a shard’s mapping until confirmed by the master. #8688 (STATUS: DONE)
Add per-segment and per-commit ID to help replication (STATUS: DONE, v2.0.0)
LUCENE-5895 adds a unique ID for each segment and each commit point. File-based replication (as performed by snapshot/restore) can use this ID to know whether the segment/commit on the source and destination machines are the same. Fixed in Lucene 5.0.
Write index metadata on data nodes where shards allocated (STATUS: DONE, v2.0.0)
Today, index metadata is written only on nodes that are master-eligible, not on data-only nodes. This is not a problem when running with multiple master nodes, as recommended, as the loss of all but one master node is still recoverable. However, users running with a single master node are at risk of losing their index metadata if the master fails. Instead, this metadata should also be written on any node where a shard is allocated. #8823, #9952
Better file distribution with multiple data paths (STATUS: DONE, v2.0.0)
Today, a node configured with multiple data paths distributes writes across all paths by writing one file to each path in turn. This can mean that the failure of a single disk corrupts many shards at once. Instead, by allocating an entire shard to a single data path, the extent of the damage can be limited to just the shards on that disk. #9498
Lucene checksums phase 3 (STATUS: DONE, v2.0.0)
Almost all files in Elasticsearch now have checksums which are validated before use. A few changes remain:
- #7586 adds checksums for cluster and index state files. (STATUS: DONE, Fixed in v1.5.0)
- #9183 supports validating the checksums on all files when starting a node. (STATUS: DONE, Fixed in v2.0.0)
- LUCENE-5894 lays the groundwork for extending more efficient checksum validation to all files during optimized bulk merges. (STATUS: DONE, Fixed in v2.0.0)
- #8403 adds validation of checksums on Lucene segments_N files. (STATUS: DONE, v2.0.0)
Report shard-level statuses on write operations (STATUS: DONE, v2.0.0)
Make write calls return the number of total/successful/missing shards in the same way that we do in search, which ensures transparency in the consistency of write operations. #7994. (STATUS: DONE, v2.0.0)
Take filter cache key size into account (STATUS: DONE, v2.0.0)
Commonly used filters are cached in Elasticsearch. That cache is limited in size (10% of the node’s memory by default) and entries are evicted based on a least recently used policy. The amount of memory used by the cache depends on two primary components: the values it stores and the keys associated with them. Calculating the memory footprint of the values is easy enough, but accounting for the keys is trickier as they are, by default, raw Lucene objects. This is largely not a problem as the keys are dominated by the values. However, recent optimizations in Lucene have changed the balance, causing the filter cache to grow beyond its size limit.
As a temporary solution, we introduced a minimum weight of 1k for each cache entry. This puts an effective limit on the number of entries in the cache. See #8304 (STATUS: DONE, fixed in v1.4.0)
The issue has been completely solved by the move to Lucene’s query cache. See #10897
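The effect of the temporary minimum-weight fix described above can be sketched with a small least-recently-used cache. WeightedLruCache is a hypothetical class written for this example; only the 1k minimum weight comes from the text, and the real cache implementation differs.

```python
from collections import OrderedDict

MIN_ENTRY_WEIGHT = 1024  # the 1k minimum weight per cache entry


class WeightedLruCache:
    """Toy LRU cache bounded by total weight rather than by entry count."""

    def __init__(self, max_weight):
        self.max_weight = max_weight
        self.total_weight = 0
        self.entries = OrderedDict()  # key -> (value, weight), oldest first

    def put(self, key, value, value_bytes):
        if key in self.entries:
            self.total_weight -= self.entries.pop(key)[1]
        # Key size is hard to measure, so every entry is charged at least 1k.
        weight = max(MIN_ENTRY_WEIGHT, value_bytes)
        self.entries[key] = (value, weight)
        self.total_weight += weight
        while self.total_weight > self.max_weight and self.entries:
            _, (_, evicted_weight) = self.entries.popitem(last=False)
            self.total_weight -= evicted_weight

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key][0]
```

Because each entry weighs at least 1k, a cache with a 10k budget holds at most ten entries in this model, however small the cached values are.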
Ensure shard state ID is incremental (STATUS: DONE, v1.5.1)
In very extreme cases during a complicated full cluster restart, it is possible that the current shard state ID is reset or even goes backwards. Elasticsearch now ensures that the state ID always moves forwards, and throws an exception when a legacy ID is higher than the current ID. See #10316 (STATUS: DONE, v1.5.1)
Verification of index UUIDs (STATUS: DONE, v1.5.0)
When deleting and recreating indices rapidly, it is possible that cluster state updates can arrive out of sync and old states can be merged incorrectly. Instead, Elasticsearch now checks the index UUID to ensure that cluster state updates refer to the same index version that is present on the local node. See #9541 and #10200 (STATUS: DONE, Fixed in v1.5.0)
Disable recovery from known buggy versions (STATUS: DONE, v1.5.0)
Corruptions have been known to occur when doing a rolling restart from older, buggy versions. Now, shards from versions before v1.4.0 are copied over in full and recovery from versions before v1.3.2 is disabled entirely. See #9925 (STATUS: DONE, Fixed in v1.5.0)
Upgrade 3.x segments metadata on engine startup (STATUS: DONE, v1.5.0)
Upgrading the metadata of old 3.x segments on node upgrade can be error prone and can result in corruption when merges are being run concurrently. Instead, Elasticsearch will now upgrade the metadata of 3.x segments before the engine starts. See #9899 (STATUS: DONE, fixed in v1.5.0)
Prevent setting minimum_master_nodes to more than the current node count (STATUS: DONE, v1.5.0)
Setting discovery.zen.minimum_master_nodes to a value higher than the current node count
effectively leaves the cluster without a master and unable to process requests. The only
way to fix this is to add more master-eligible nodes. #8321 adds a mechanism
to validate settings before applying them, and #9051 extends this validation
support to settings applied during a cluster restore. (STATUS: DONE, Fixed in v1.5.0)
Simplify and harden shard recovery and allocation (STATUS: DONE, v1.5.0)
Randomized testing combined with chaotic failures has revealed corner cases where the recovery and allocation of shards in a concurrent manner can result in shard corruption. There is an ongoing effort to reduce the complexity of these operations in order to make them more deterministic. These include:
- Introduce shard level locks to prevent concurrent shard modifications #8436. (STATUS: DONE, Fixed in v1.5.0)
- Delete shard contents under a lock #9083. (STATUS: DONE, Fixed in v1.5.0)
- Delete shard under a lock #8579. (STATUS: DONE, Fixed in v1.5.0)
- Refactor RecoveryTarget state management #8092. (STATUS: DONE, Fixed in v1.5.0)
- Cancelling a recovery may leave temporary files behind #7893. (STATUS: DONE, Fixed in v1.5.0)
- Quick cluster state processing can result in both shard copies being deleted #9503. (STATUS: DONE, Fixed in v1.5.0)
- Rapid creation and deletion of an index can cause reuse of old index metadata #9489. (STATUS: DONE, Fixed in v1.5.0)
- Flush immediately after the last concurrent recovery finishes to clear out the translog before a new recovery starts #9439. (STATUS: DONE, Fixed in v1.5.0)
Prevent use of known-bad Java versions (STATUS: DONE, v1.5.0)
Certain versions of the JVM are known to have bugs which can cause index corruption. #7580 prevents Elasticsearch startup if known bad versions are in use.
Make recovery be more resilient to partial network partitions (STATUS: DONE, v1.5.0)
When a node experiences network issues, the master detects it and removes the node from the cluster. That causes all ongoing recoveries from and to that node to be stopped and a new location is found for the relevant shards. However, in the case of a partial network partition, where there are connectivity issues between the source and target nodes of a recovery but not between those nodes and the current master, things may go wrong. While the nodes successfully restore the connection, the ongoing recoveries may have encountered issues. In #8720, we added test simulations for these and solved several issues that were flagged by them.
Improving Zen Discovery (STATUS: DONE, v1.4.0.Beta1)
Recovery from failure is a complicated process, especially in an asynchronous distributed system like Elasticsearch. With several processes happening in parallel, it is important to ensure that recovery proceeds swiftly and safely. While fixing the split-brain issue we have been hunting down corner cases that were not handled optimally, adding tests to demonstrate the issues, and working on fixes:
- Faster & better detection of master & node failures, including not trying to reconnect upon disconnect, fail on disconnect error on ping, verify cluster names in pings. Previously, Elasticsearch had to wait a bit for the node to complete the process required to join the cluster. Recent changes guarantee that a node has fully joined the cluster before we start the fault detection process. Therefore we can do an immediate check causing faster detection of errors and validation of cluster state after a minimum master node breach. #6706, #7399 (STATUS: DONE, v1.4.0.Beta1)
- Broaden Unicast pinging when master fails: When a node loses its current master it will start pinging to find a new one. Previously, when using unicast based pinging, the node would ping a set of predefined nodes asking them whether the master had really disappeared or whether there was a network hiccup. Now, we ping all nodes in the cluster to increase coverage. In the case that all unicast hosts are disconnected from the current master during a network failure, this improvement is essential to allow the cluster to reform once the partition is healed. #7336 (STATUS: DONE, v1.4.0.Beta1)
- After joining a cluster, validate that the join was successful and that the master has been set in the local cluster state. #6969. (STATUS: DONE, v1.4.0.Beta1)
- Write additional tests that use the test infrastructure to verify proper behavior during network disconnections and garbage collections. #7082 (STATUS: DONE, v1.4.0.Beta1)
Lucene checksums phase 2 (STATUS: DONE, v1.4.0.Beta1)
When Lucene opens a segment for reading, it validates the checksum on the smaller segment files (those which it reads entirely into memory) but not the large files like term frequencies and positions, as this would be very expensive. During merges, term vectors and stored fields are validated, as long as the segments being merged come from the same version of Lucene. Checksumming for term vectors and stored fields is important because merging consists of performing optimized byte copies. Term frequencies, term positions, payloads, doc values, and norms are currently not checked during merges, although Lucene provides the option to do so. These files are less prone to silent corruption as they are actively decoded during merge, and so are more likely to throw exceptions if there is any corruption.
The following changes have been made:
- #7360 validates checksums on all segment files during merges. (STATUS: DONE, fixed in v1.4.0.Beta1)
- LUCENE-5842 validates the structure of the checksum footer of the postings lists, doc values, stored fields and term vectors when opening a new segment, to ensure that these files have not been truncated. (STATUS: DONE, Fixed in Lucene 4.10 and v1.4.0.Beta1)
- #8407 validates Lucene checksums for legacy files. (STATUS: DONE; Fixed in v1.3.6)
Don’t allow unsupported codecs (STATUS: DONE, v1.4.0.Beta1)
Lucene 4 added a number of alternative codecs for experimentation purposes, and Elasticsearch exposed the ability to change codecs. Since then, Lucene has settled on the best choice of codec and provides backwards compatibility only for the default codec. #7566 removes the ability to set alternate codecs.
Use checksums to identify entire segments (STATUS: DONE, v1.4.0.Beta1)
A hash collision makes it possible for two different files to have the same length and the same checksum. Instead, a segment’s identity should rely on checksums from all of the files in a single segment, which greatly reduces the chance of a collision. This change has been merged (#7351).
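One way to picture this is to derive a single identity from the name, length and checksum of every file in the segment. The sketch below is only an illustration: segment_identity is a hypothetical helper and the use of SHA-256 to combine the values is an assumption, not the method used in #7351.

```python
import hashlib


def segment_identity(files):
    """Combine per-file metadata into one identity for a whole segment.

    files: dict mapping file name -> (length_in_bytes, checksum_string).
    Files are processed in sorted order so the result does not depend on
    the order in which they were listed.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        length, checksum = files[name]
        digest.update(f"{name}:{length}:{checksum}\n".encode("utf-8"))
    return digest.hexdigest()
```

Two segments that happen to share one file with the same length and checksum still get different identities in this model, as long as any other file differs.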
Fix “Split Brain can occur even with minimum_master_nodes” (STATUS: DONE, v1.4.0.Beta1)
Even when minimum master nodes is set, split brain can still occur under certain conditions, e.g. disconnection between master eligible nodes, which can lead to data loss. The scenario is described in detail in issue 2488:
- Introduce a new testing infrastructure to simulate different types of node disconnections, including loss of network connection, lost messages, message delays, etc. See MockTransportService support and service disruption for more details. (STATUS: DONE, v1.4.0.Beta1).
- Added tests that simulated the bug described in issue 2488. You can take a look at the original commit of a reproduction on master. (STATUS: DONE, v1.2.0)
- The bug described in issue 2488 is caused by an issue in our zen discovery gossip protocol. This specific issue has been fixed, and work has been done to make the algorithm more resilient. (STATUS: DONE, v1.4.0.Beta1)
Translog Entry Checksum (STATUS: DONE, v1.4.0.Beta1)
Each translog entry in Elasticsearch should have its own checksum, and potentially additional information, so that we can properly detect corrupted translog entries and act accordingly. You can find more detail in issue #6554.
To start, we will begin by adding checksums to the translog to detect corrupt entries. Once this work has been completed, we will add translog entry markers so that corrupt entries can be skipped in the translog if/when desired.
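A minimal sketch of per-entry checksums is shown below. The on-disk layout (4-byte length, payload, 4-byte CRC32) and the function names are assumptions made for this example and are not the actual translog format. Skipping a corrupt entry here relies on its length prefix being intact, which is a simplification compared with dedicated entry markers.

```python
import struct
import zlib


def encode_entry(payload: bytes) -> bytes:
    """Frame one entry as: length (4 bytes) + payload + CRC32 (4 bytes)."""
    return (
        struct.pack(">I", len(payload))
        + payload
        + struct.pack(">I", zlib.crc32(payload))
    )


def decode_entries(data: bytes, skip_corrupt: bool = False):
    """Read all entries, verifying each checksum independently."""
    entries = []
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        payload = data[offset + 4 : offset + 4 + length]
        (stored_crc,) = struct.unpack_from(">I", data, offset + 4 + length)
        entry_offset = offset
        offset += 8 + length
        if zlib.crc32(payload) != stored_crc:
            if skip_corrupt:
                continue  # drop only the damaged entry
            raise ValueError(f"corrupt translog entry at offset {entry_offset}")
        entries.append(payload)
    return entries
```

Because every entry carries its own checksum, a single damaged entry can be detected (and, in this sketch, optionally skipped) without discarding the entries around it.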
Request-Level Memory Circuit Breaker (STATUS: DONE, v1.4.0.Beta1)
We are in the process of introducing multiple circuit breakers in Elasticsearch, which can “borrow” space from each other in the event that one runs out of memory. This architecture will allow limits for certain parts of memory, but still allow flexibility in the event that another reserve like field data is not being used. This change includes adding a breaker for the BigArrays internal object used for some aggregations. See issue #6739 for more details.
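One possible reading of this design is a set of child breakers with their own limits under a shared parent limit, where the child limits may add up to more than the parent total. The sketch below models that interpretation only; ParentBreaker, ChildBreaker and CircuitBreakingError are hypothetical names for this example, and the limits are arbitrary.

```python
class CircuitBreakingError(Exception):
    """Raised when a reservation would exceed a breaker limit."""


class ParentBreaker:
    def __init__(self, total_limit):
        self.total_limit = total_limit
        self.children = []

    def used(self):
        return sum(child.used for child in self.children)


class ChildBreaker:
    def __init__(self, name, parent, limit):
        self.name = name
        self.parent = parent
        self.limit = limit
        self.used = 0
        parent.children.append(self)

    def reserve(self, num_bytes):
        # Check the breaker's own limit first, then the shared total.
        if self.used + num_bytes > self.limit:
            raise CircuitBreakingError(f"{self.name} breaker limit exceeded")
        if self.parent.used() + num_bytes > self.parent.total_limit:
            raise CircuitBreakingError("parent breaker limit exceeded")
        self.used += num_bytes

    def release(self, num_bytes):
        self.used = max(0, self.used - num_bytes)
```

In this model a request breaker can use memory that the field data breaker is not currently using, but the combined usage can never exceed the parent limit.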
Doc Values (STATUS: DONE, v1.4.0.Beta1)
Fielddata is one of the largest consumers of heap memory, and thus one of the primary reasons for running out of memory and causing node instability. Elasticsearch has had the “doc values” option for a while, which allows you to build these structures at index time so that they live on disk instead of in memory. Up until recently, doc values were significantly slower than in-memory fielddata.
By benchmarking and profiling both Lucene and Elasticsearch, we identified the bottlenecks and made a series of changes that improve the performance of doc values. They are now almost as fast as the in-memory option.
See #6967, #6908, #4548, #3829, #4518, #5669, LUCENE-5748, LUCENE-5703, LUCENE-5750, LUCENE-5721, LUCENE-5799.
Index corruption when upgrading Lucene 3.x indices (STATUS: DONE, v1.4.0.Beta1)
Upgrading indices created with Lucene 3.x (Elasticsearch v0.20 and before) to Lucene 4.7 - 4.9 (Elasticsearch v1.1.0 to v1.3.x) could result in index corruption. LUCENE-5907 fixes this issue in Lucene 4.10.
Improve error handling when deleting files (STATUS: DONE, v1.4.0.Beta1)
Lucene uses reference counting to prevent files that are still in use from being deleted. Lucene testing discovered a bug (LUCENE-5919) when decrementing the ref count on a batch of files. If deleting some of the files resulted in an exception (e.g. due to interference from a virus scanner), the files that had their ref counts decremented successfully could later have their ref counts decremented again, incorrectly, resulting in files being physically deleted before their time. This is fixed in Lucene 4.10.
Using Lucene Checksums to verify shards during snapshot/restore (STATUS: DONE, v1.3.3)
The snapshot process should verify checksums for each file that is being snapshotted to make sure that the created snapshot doesn’t contain corrupted files. If a corrupted file is detected, the snapshot should fail with an error. In order to implement this feature we need to have correct and verifiable checksums stored with segment files, which is only possible for files that were written by the officially supported append-only codecs. See #7159.
Rare compression corruption during shard recovery (STATUS: DONE, v1.3.2)
During recovery, the primary shard is copied over the network to become a new replica shard. In rare cases, a hash collision could trigger a bug in the compression library we use, producing corruption in the replica shard. This bug was exposed by the change to validate checksums during recovery. We tracked down the bug in the compression library and submitted a patch, which was accepted and merged by the upstream project. See #7210.
Safer recovery of replica shards (STATUS: DONE, v1.3.0)
If a primary shard fails or is closed while a replica is using it for recovery, we need to ensure that the replica is properly failed as well, and allow recovery to start from the new primary. We also check that an active copy of a shard is available on another node before physically removing an inactive shard from disk. See #6825, #6645, #6995.
Using Lucene Checksums to verify shards during recovery (STATUS: DONE, v1.3.0)
Elasticsearch can use Lucene checksums to validate files while recovering a replica shard from a primary.
This issue exposed a bug in Elasticsearch’s handling of primary shard failure when having more than 2 replicas, causing the second replica to not be properly unassigned if it is in the middle of recovery. It was fixed with the merge of issue #6808.
In order to verify the checksumming mechanism, we added functionality to our testing infrastructure that can corrupt an arbitrary index file at any point, such as while it’s traveling over the wire or residing on disk. The tests utilizing this feature expect full or partial recovery from the failure while neither losing data nor spreading the corruption.
Detect File Corruption (STATUS: DONE, v1.3.0)
If a checksum failure is detected during merging or refresh, Elasticsearch will fail the shard. You can read the full details in pull request #6776.
Network disconnect events could be lost, causing a zombie node to stay in the cluster state (STATUS: DONE, v1.3.0)
Previously, there was a very short window in which we could lose a node disconnect event. To prevent this from occurring, we added extra handling of connection errors to our nodes & master fault detection pinging to make sure the node disconnect event is detected. See issue #6686.
Other fixes to Lucene to address resiliency (STATUS: DONE, v1.3.0)
- NativeLock is released if Lock is closed after failing on obtain LUCENE-5738.
- NRT Reader close can wipe an index it doesn’t own. LUCENE-5574
- FSDirectory’s fsync() is lenient, now throws exceptions when errors occur LUCENE-5570
- fsync() directory when committing LUCENE-5588
Backwards Compatibility Testing (STATUS: DONE, v1.3.0)
Since founding Elasticsearch Inc, we grew our test base from ~1k tests to about 4k in just over a year. We invested massively into our testing infrastructure, running our tests continuously on different operating systems, bare metal hardware and cloud environments, all while randomizing JVMs and their settings.
Yet, backwards compatibility testing was a very manual thing until we released a pretty insane bug with Elasticsearch 1.2. We tried to fix places where the absolute value of a number was negative (a documented behavior of Math.abs(int) in Java) and missed that the fix for this also changed the result of our routing function. No matter how much randomization we applied to the tests, we didn’t catch this particular failure. We always had backwards compatibility tests on our list of things to do, but didn’t have them in place back then.
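The Java behavior behind this bug is easy to reproduce. The sketch below uses two hypothetical shard-selection formulas (they are not Elasticsearch's actual routing code) to show why replacing a Math.abs-based formula can move existing documents to a different shard.

```java
// Illustrative only: hypothetical shard-selection formulas, not Elasticsearch's routing code.
final class RoutingPitfall {

    // Can return a negative shard number, because Math.abs(Integer.MIN_VALUE)
    // is still Integer.MIN_VALUE.
    static int shardWithAbs(int hash, int numShards) {
        return Math.abs(hash) % numShards;
    }

    // Always returns a value in [0, numShards), but maps negative hashes
    // to different shards than the formula above.
    static int shardWithFloorMod(int hash, int numShards) {
        return Math.floorMod(hash, numShards);
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(Integer.MIN_VALUE));        // still negative
        System.out.println(shardWithAbs(Integer.MIN_VALUE, 5)); // -3, not a valid shard
        System.out.println(shardWithAbs(-7, 5));                // 2
        System.out.println(shardWithFloorMod(-7, 5));           // 3: same hash, different shard
    }
}
```

Both formulas agree for non-negative hashes, so a test only notices the difference if it indexes with one version and looks documents up with the other.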
We recently tweaked our testing infrastructure to be able to run tests against a hybrid cluster composed of a released version of Elasticsearch and our current stable branch. This test pattern allowed us to mimic typical upgrade scenarios like rolling upgrades, index backwards compatibility and recovering from old to new nodes.
Now, even the simplest test that relies on routing fails against 1.2.0, which is exactly what we were aiming for. The test would not have caught the aforementioned routing bug before releasing 1.2.0, but it immediately saved us from another problem in the stable branch.
The work on our testing infrastructure is more than just issue prevention: it allows us to develop and test upgrade paths, introduce new features and evolve indexing over time. It isn’t enough to introduce more resilient implementations; we also have to ensure that users take advantage of them when they upgrade.
You can read more about backwards compatibility tests in issue #6497.
Full Translog Writes on all Platforms (STATUS: DONE, v1.2.2 and v1.3.0)
We have recently received bug reports of transaction log corruption that can occur when indexing very large documents (in the area of 300 KB). Although some Linux users reported this behavior, it appears the problem occurs more frequently when running Windows. We traced the source of the problem to the fact that when serializing documents to the transaction log, the operating system can actually write only part of the document before returning from the write call. We can now detect this situation and make sure that the entire document is properly written. You can read the full details in pull request #6576.
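The general pattern for handling partial writes in Java NIO looks roughly like the sketch below. The class and method names are hypothetical, and this is not the code from the pull request; it only shows that the return value of write cannot be ignored.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Illustrative helper; the name is hypothetical.
final class FullWrites {

    // WritableByteChannel.write may write fewer bytes than the buffer holds,
    // so keep writing until the buffer has been drained.
    static void writeFully(WritableByteChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }
}
```

A single call to channel.write(buffer) would silently drop the tail of a large document whenever the channel accepts only part of it.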
Lucene Checksums (STATUS: DONE, v1.2.0)
Before Apache Lucene version 4.8, checksums were not computed on generated index files. The result was that it was difficult to identify when or if a Lucene index got corrupted, whether by hardware failure, JVM bug or for an entirely different reason.
For an idea of the checksum efforts in progress in Apache Lucene, see issues LUCENE-2446, LUCENE-5580 and LUCENE-5602. The gist is that Lucene 4.8+ now computes full checksums on all index files and it verifies them when opening metadata or other smaller files as well as other files during merges.
Detect errors faster by locally failing a shard upon an indexing error (STATUS: DONE, v1.2.0)
Previously, Elasticsearch notified the master of the shard failure and waited for the master to close the local copy of the shard, thus assigning it to other nodes. This architecture caused delays in failure detection, potentially causing unneeded failures of other incoming requests. In rare cases, such as race conditions or certain network partition configurations, we could lose these failure notifications. We solved this issue by locally failing shards upon indexing errors. See issue #5847.
Snapshot/Restore API (STATUS: DONE, v1.0.0)
In Elasticsearch version 1.0, we significantly improved the backup process by introducing the Snapshot/Restore API. While it was always possible to make backups of Elasticsearch, the Snapshot/Restore API made the backup process much easier.
The backup process is incremental, making it very efficient since only files changed since the last backup are copied. Even with this efficiency introduced, each snapshot contains a full picture of the cluster at the moment when backup started. The restore API allows speedy recovery of a full cluster as well as selected indices.
Since that first release in version 1.0, the API has continued to evolve. In version 1.1.0, we added a new snapshot status API that allows users to monitor the snapshot process. In 1.3.0 we added the ability to restore indices without their aliases and in 1.4 we are planning to add the ability to restore partial snapshots.
The Snapshot/Restore API supports a number of different repository types for storing backups. Currently, it’s possible to make backups to a shared file system, Amazon S3, HDFS, and Azure storage. We are continuing to work on adding other types of storage systems, as well as improving the robustness of the snapshot/restore process.
Circuit Breaker: Fielddata (STATUS: DONE, v1.0.0)
Currently, the circuit breaker protects against loading too much field data by estimating how much memory the field data will take to load, then aborting the request if the memory requirements are too high. This feature was added in Elasticsearch version 1.0.0.
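The estimate-then-abort idea can be sketched as follows. This is a simplified illustration with hypothetical names and a fixed byte limit, not the Elasticsearch implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch; class and method names are hypothetical.
final class MemoryCircuitBreaker {
    private final long limitBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    MemoryCircuitBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Reserve the estimated bytes before loading; abort if the limit would be exceeded.
    void reserve(long estimatedBytes, String label) {
        long newUsed = usedBytes.addAndGet(estimatedBytes);
        if (newUsed > limitBytes) {
            usedBytes.addAndGet(-estimatedBytes); // roll back the failed reservation
            throw new IllegalStateException("[" + label + "] would use " + newUsed
                    + " bytes, which exceeds the limit of " + limitBytes);
        }
    }

    // Release a reservation once the data has been unloaded.
    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    long used() {
        return usedBytes.get();
    }
}
```

The important property is that the check happens before the data is loaded, so a request that is too expensive fails on its own instead of taking the node down with an out-of-memory error.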
Use of Paginated Data Structures to Ease Garbage Collection (STATUS: DONE, v1.0.0 & v1.2.0)
Elasticsearch has moved from an object-based cache to a page-based cache recycler as described in issue #4557. This change makes garbage collection easier by limiting fragmentation, since all pages have the same size and are recycled. It also allows managing the size of the cache not based on the number of objects it contains, but on the memory that it uses.
These pages are used for two main purposes: implementing higher level data structures such as hash tables that are used internally by aggregations to e.g. map terms to counts, as well as reusing memory in the translog/transport layer as detailed in issue #5691.
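As a rough illustration of a page-based recycler, the sketch below pools fixed-size byte pages and bounds the pool by bytes rather than by object count. The names, page size and budget are hypothetical; the real recycler is described in issue #4557.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Illustrative sketch; names and sizes are hypothetical.
final class PageRecycler {
    private final int pageSize;
    private final long maxCachedBytes;
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    PageRecycler(int pageSize, long maxCachedBytes) {
        this.pageSize = pageSize;
        this.maxCachedBytes = maxCachedBytes;
    }

    // Reuse a cached page if one is available, otherwise allocate a new one.
    synchronized byte[] obtain() {
        byte[] page = free.pollFirst();
        return page != null ? page : new byte[pageSize];
    }

    // Return a page to the cache unless that would exceed the byte budget.
    synchronized void recycle(byte[] page) {
        if (page.length != pageSize) {
            throw new IllegalArgumentException("unexpected page size");
        }
        if ((long) (free.size() + 1) * pageSize <= maxCachedBytes) {
            Arrays.fill(page, (byte) 0); // hand out clean pages
            free.addLast(page);
        }
    }

    synchronized int cachedPages() {
        return free.size();
    }
}
```

Because every page has the same size, any cached page can satisfy any request, and the cache limit translates directly into a memory limit.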
Dedicated Master Nodes Resiliency (STATUS: DONE, v1.0.0)
In order to run a more resilient cluster, we recommend running dedicated master nodes to ensure master nodes are not affected by resources consumed by data nodes. We also have made master nodes more resilient to heavy resource usage, mainly associated with large clusters / cluster states.
These changes include:
- Improve the balancing algorithm to execute faster across large clusters / many indices. (See issues #4458 and #4459.)
- Improve cluster state publishing to not create an additional network buffer per node. (More in this commit.)
- Improve master handling of large scale mapping updates from data nodes by batching them into a single cluster event. (See issue #4373.)
- Add an ack mechanism where next phase cluster updates are processed only when nodes acknowledged they received the previous cluster state. (See issues #3736, #3786, #4114, #4169, #4228 and #4421, which also include enhancements to the ack mechanism implementation.)
Multi Data Paths May Falsely Report Corrupt Index (STATUS: DONE, v1.0.0)
When using multiple data paths, an index could be falsely reported as corrupted. This has been fixed with pull request #4674.
Randomized Testing (STATUS: DONE, v1.0.0)
In order to best validate for resiliency in Elasticsearch, we rewrote the Elasticsearch test infrastructure to introduce the concept of randomized testing. Randomized testing allows us to easily enhance the Elasticsearch testing infrastructure with predictably irrational conditions, making the resulting code base more resilient.
Each of our integration tests runs against a cluster with a random number of nodes, and indices have a random number of shards and replicas. Merge settings change for every run, indexing is done in serial or async fashion or even wrapped in a bulk operation and thread pool sizes vary to ensure that we don’t produce a deadlock no matter what happens. The list of places we use this randomization infrastructure is long, and growing every day, and has saved us headaches several times before we shipped a particular feature.
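One common way to keep randomized runs reproducible is to derive every random choice from a single seed that can be reported with a failure. The sketch below assumes that approach; the class name and parameter ranges are hypothetical and do not describe our actual test framework.

```java
import java.util.Random;

// Illustrative sketch; the class name and parameter ranges are hypothetical.
final class RandomClusterConfig {
    final int nodes;
    final int shards;
    final int replicas;
    final boolean bulkIndexing;

    // All choices derive from one seed, so a failing run can be replayed exactly.
    RandomClusterConfig(long seed) {
        Random random = new Random(seed);
        this.nodes = 1 + random.nextInt(5);     // 1 to 5 nodes
        this.shards = 1 + random.nextInt(10);   // 1 to 10 shards
        this.replicas = random.nextInt(nodes);  // 0 to nodes - 1 replicas
        this.bulkIndexing = random.nextBoolean();
    }

    @Override
    public String toString() {
        return "nodes=" + nodes + ", shards=" + shards
                + ", replicas=" + replicas + ", bulkIndexing=" + bulkIndexing;
    }
}
```

Constructing the configuration twice with the same seed yields the same cluster shape, which turns a rare random failure into a repeatable test case.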
At Elasticsearch, we live the philosophy that we can miss a bug once, but never a second time. We make our tests more evil as we go, introducing randomness in all the areas where we discovered bugs. We figure if our tests don’t fail, we are not trying hard enough! If you are interested in how we have evolved our test infrastructure over time, check out issues labeled with “test” on GitHub.
Lucene Loses Data On File Descriptors Failure (STATUS: DONE, v0.90.0)
When a process runs out of file descriptors, Lucene can cause an index to be completely deleted. This issue was fixed in Lucene (version 4.2.1) and fixed in an early version of Elasticsearch. See issue #2812.