Elasticsearch Internals - Tracking in-sync shard copies

Elasticsearch provides failover capabilities by keeping multiple copies of your data in the cluster. In the presence of network disruptions or failing nodes, changes might not make it to all the copies. This blog post showcases one of the internal mechanisms that Elasticsearch uses to identify shard copies that are missing changes, providing an in-depth view of how two central components, the consensus module and the data replication layer, integrate to keep your data safe.

Data replication in Elasticsearch is based on the primary-backup model. This model assumes a single authoritative copy of the data, called the primary. All indexing operations first go to the primary, which is then in charge of replicating changes to active backup copies, called replica shards. Elasticsearch uses replica shards to provide failover capabilities as well as to scale out reads. In cases where the current primary copy becomes either temporarily or permanently unavailable, for example due to a server maintenance window or a damaged disk drive, another shard copy is selected as primary. Because the primary is the authoritative copy, it is critical in this model that only shard copies which contain the most recent data are selected as primary. If, for example, an older shard copy were selected as primary after being on a node that had been isolated from the cluster, that old copy would become the definitive copy of the shard, leading to the loss of all changes that it had missed. The following sections present how Elasticsearch v5+ keeps track of the shard copies that can safely be selected as primary, also called the in-sync shard copies.

Safely allocating primary shards

Shard allocation is the process of deciding which node should host an active copy of the shard. Allocation decisions are made by the master node and recorded in the cluster state, a data structure that also contains other metadata such as index settings and mappings. Allocation decisions contain information about which shards should be allocated to which nodes, and whether they should play the role of primary or replica shard. The master broadcasts cluster state changes to all nodes that are part of the cluster. Having the cluster state available on each node enables smart routing of requests as each node is aware of where primary and replica shards are allocated.

Each node inspects the cluster state to determine the shards that it should make available. If the shard is assigned as a primary to a node that already holds a copy of the shard data, the node just needs to load up the local shard copy and make it available for search. If a shard is allocated as replica, the node first copies over missing data from the node that has the primary. When there are not enough replica copies of the shard available in the cluster (as determined by the index setting index.number_of_replicas), the master can also allocate replica shards to nodes that don’t have any data for this shard, thereby instructing these nodes to create full copies of the primary shard.
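
For illustration, an index with one primary and one replica shard, matching the example used later in this post, could be created as follows (the index name is illustrative):

PUT /my_index
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}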

When a new index is created the master has a lot of flexibility to select nodes in the cluster to allocate the primary shards to, taking cluster balancing and other constraints such as allocation awareness and filtering into account. Allocating existing shard copies as primaries happens on rarer occasions. Examples are full cluster restarts, where none of the shards are initially allocated when the cluster is first formed, or multiple failures over a short time span that cause all active copies to become unavailable. In these situations the master first reaches out to all nodes to find out what on-disk shard copies exist in the cluster. Based on the copies that are found, it decides whether to allocate one of them as primary. In order to guarantee safety, the master has to make sure to only select shard copies as primary that contain the most recent data. To this effect, Elasticsearch uses the concept of allocation IDs, which are unique identifiers (UUIDs) used to distinguish between shard copies.
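
As an aside, allocation filtering is configured through index or cluster settings; a minimal sketch that keeps the shards of the example index off a particular node (the node name is illustrative):

PUT /my_index/_settings
{
  "index.routing.allocation.exclude._name": "node_to_drain"
}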

Allocation IDs are assigned by the master during shard allocation and are stored on disk by the data nodes, right next to the actual shard data. The master is responsible for tracking the subset of copies that contain the most recent data. This set of copies, known as in-sync allocation IDs, is stored in the cluster state, which is persisted on all master and data nodes. Changes to the cluster state are backed by Elasticsearch’s consensus implementation, called zen discovery. This ensures that there is a shared understanding in the cluster as to which shard copies are considered as in-sync, implicitly marking those shard copies that are not in the in-sync set as stale.

When allocating a primary shard, the master checks if the allocation ID of the on-disk copy that it has found on a node in the cluster matches one of the IDs in the set of in-sync allocations that is present in the cluster state. The shard is then only allocated as primary to the node if its ID matches one of the IDs in the in-sync set. If a primary fails while there are existing active replicas, one of the active replicas, which is also in the in-sync set, is simply promoted to primary, ensuring continued write availability in the cluster.

Marking shards as stale

Elasticsearch uses a different mechanism to replicate data than it uses for replicating cluster state changes. Whereas metadata changes require consensus on the cluster level, data replication uses the simpler primary-backup approach. This two-layered system allows data replication to be simple and fast, only requiring interaction with the cluster consensus layer in exceptional situations. The basic flow for handling document write requests is as follows:

  • Based on the current cluster state, the request gets routed to the node that has the primary shard.
  • The operation is performed locally on the primary shard, for example by indexing, updating or deleting a document.
  • After successful execution of the operation, it is forwarded to the replica shards. If there are multiple replication targets, forwarding the operation to the replica shards is done concurrently.
  • When all replicas have successfully performed the operation and responded to the primary, the primary acknowledges the successful completion of the request to the client.
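
To make this concrete, here is what a single document write and the shard-level summary in its response could look like (a sketch; the document ID and field are illustrative, and the exact document endpoint varies across versions):

PUT /my_index/_doc/1
{
  "user": "alice"
}

{
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  ...
}

With one primary and one replica shard, “total” is 2, and “successful” reflects the number of copies on which the operation was performed.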

In the case of network partitions, node failures, or general shard unavailability (for example when the node hosting the shard copy is not up), the forwarded operation might not have been successfully performed on one or more of the replica shard copies. This means that the primary will contain changes that have not been propagated to all shard copies. If these copies were to continue to be considered as in-sync, they could become primary at a later point in time even though they are missing some of the changes, resulting in data loss.

There are two solutions to this: a) fail the write request and undo the changes on the available copies, or b) ensure that the divergent shard copies are not considered as in-sync anymore. Elasticsearch chooses write availability in this case: the primary instructs the active master to remove the IDs of the divergent shard copies from the in-sync set. The primary then only acknowledges the write request to the client after it has received confirmation from the master that the in-sync set has been successfully updated by the consensus layer. This ensures that only shard copies that contain all acknowledged writes can be selected as primary by the master.

Examples

To see this in action, let’s consider a small cluster with one master and two data nodes. To keep the example simple, we’ll use an index with 1 primary and 1 replica shard. Initially, one of the data nodes has the primary and the other the replica shard. We use the cluster state API to inspect the in-sync shard information that is tracked by the cluster state, filtering the resulting output down to the relevant bits using the “filter_path” query parameter:

GET /_cluster/state?filter_path=metadata.indices.my_index.in_sync_allocations.*,routing_table.indices.my_index.*

This yields the following excerpt of the cluster state:

{
  "metadata": {
    "indices": {
      "my_index": {
        "in_sync_allocations": {
          "0": [
            "HNeGpt5aS3W9it3a7tJusg",
            "wP-Z5fuGSM-HbADjMNpSIQ"
          ]
        }
      }
    }
  },
  "routing_table": {
    "indices": {
      "my_index": {
        "shards": {
          "0": [
            {
              "primary": true,
              "state": "STARTED",
              "allocation_id": { "id": "HNeGpt5aS3W9it3a7tJusg" },
              "node": "CX-rFmoPQF21tgt3MYGSQA",
              ...
            },
            {
              "primary": false,
              "state": "STARTED",
              "allocation_id": { "id": "wP-Z5fuGSM-HbADjMNpSIQ" },
              "node": "AzYoyzzSSwG6v_ypdRXYkw",
              ...
            }
          ]
        }
      }
    }
  }
}

The cluster state shows that both the primary and replica shard are started. The primary is allocated to the data node “CX-rFmo” and the replica is allocated to the data node “AzYoyz”. Both have a unique allocation ID, which is also present in the in_sync_allocations set.

Let’s see what happens when we shut down the node that currently has the primary shard. As this does not actually change any of the shard data, both shard copies should stay in sync. In the absence of the primary shard, the replica shard is promoted to primary, which is properly reflected in the resulting cluster state:

{
  "metadata": {
    "indices": {
      "my_index": {
        "in_sync_allocations": {
          "0": [
            "HNeGpt5aS3W9it3a7tJusg",
            "wP-Z5fuGSM-HbADjMNpSIQ"
          ]
        }
      }
    }
  },
  "routing_table": {
    "indices": {
      "my_index": {
        "shards": {
          "0": [
            {
              "primary": true,
              "state": "STARTED",
              "allocation_id": { "id": "wP-Z5fuGSM-HbADjMNpSIQ" },
              "node": "AzYoyzzSSwG6v_ypdRXYkw",
              ...
            },
            {
              "primary": false,
              "state": "UNASSIGNED",
              "node": null,
              "unassigned_info": {
                "details": "node_left[CX-rFmoPQF21tgt3MYGSQA]",
                ...
              }
            }
          ]
        }
      }
    }
  }
}

The replica shard stays unassigned as we only have one data node up. If we start the second node again, the replica shard will automatically be allocated to that node again. To make this scenario more interesting, we won’t start up the second node. Instead we are going to index a document into the newly promoted primary shard. As the shard copies are now diverging, the unavailable shard copy becomes stale, and its ID is thus removed by the master from the in-sync set:

{
  "metadata": {
    "indices": {
      "my_index": {
        "in_sync_allocations": {
          "0": [
            "wP-Z5fuGSM-HbADjMNpSIQ"
          ]
        }
      }
    }
  },
  "routing_table": {
    ... // same as in previous step
  }
}

With only one in-sync shard copy left, let’s see how the system reacts if that copy becomes unavailable. To that effect, let’s shut down the only data node that’s currently up and then bring up the previous data node with the stale shard copy. After doing so, the cluster health API shows the cluster health as red.
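A quick way to confirm this, filtering the health response down to just the status field:

GET /_cluster/health?filter_path=status

{
  "status": "red"
}

The cluster state also shows that the primary shard is not getting allocated: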

{
  "metadata": {
    "indices": {
      "my_index": {
        "in_sync_allocations": {
          "0": [
            "wP-Z5fuGSM-HbADjMNpSIQ"
          ]
        }
      }
    }
  },
  "routing_table": {
    "indices": {
      "my_index": {
        "shards": {
          "0": [
            {
              "primary": true,
              "state": "UNASSIGNED",
              "recovery_source": { "type": "EXISTING_STORE" },
              "unassigned_info": {
                "allocation_status": "no_valid_shard_copy",
                "at": "2017-01-26T09:20:24.054Z",
                "details": "node_left[AzYoyzzSSwG6v_ypdRXYkw]"
              },
              ...
            },
            {
              "primary": false,
              "state": "UNASSIGNED",
              "recovery_source": { "type": "PEER" },
              "unassigned_info": {
                "allocation_status": "no_attempt",
                "at": "2017-01-26T09:14:47.689Z",
                "details": "node_left[CX-rFmoPQF21tgt3MYGSQA]"
              },
              ...
            }
          ]
        }
      }
    }
  }
}

Let’s also have a look at the cluster allocation explain API, which is a great tool for debugging allocation issues. Running the explain command with no parameters will provide an explanation for the first unassigned shard that the system finds:

GET /_cluster/allocation/explain
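
The explain request can also target a specific shard by passing a request body (using the index and shard from this example):

GET /_cluster/allocation/explain
{
  "index": "my_index",
  "shard": 0,
  "primary": true
}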

The explain API tells us why the primary shard is unassigned and also provides more detailed allocation information on a per-node basis. In this case the master cannot find any in-sync shard copies on the nodes that are currently available in the cluster:

{
  "index" : "my_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2017-01-26T09:20:24.054Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "CX-rFmoPQF21tgt3MYGSQA",
      "node_name" : "CX-rFmo",
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "HNeGpt5aS3W9it3a7tJusg"
      }
    }
  ]
}

The API also shows that the shard copy that’s available on node “CX-rFmo” is stale (store.in_sync = false). Starting up the other node that has the in-sync shard copy brings the cluster back to green again. What would happen though if that node had gone up in flames?

Not all is lost - APIs for the not-so-faint-hearted

When a major disaster strikes there may be situations where only stale shard copies are available in the cluster. Elasticsearch will not automatically allocate such shard copies as primary shards, and the cluster will stay red. In the case where all in-sync shard copies are gone for good, however, there is still a possibility for the cluster to revert to using stale copies, but this requires manual intervention from the cluster administrator.

As we’ve seen in the previous example, the first step in understanding allocation issues is to use the cluster allocation explain API. It shows whether nodes have a copy of the shard and whether the respective copy is in the in-sync set or not. In the case of file-system corruptions, it also shows the exception that was thrown when accessing the on-disk information. In the previous example, the allocation explain output showed that an existing shard copy was found on node “CX-rFmo” in the cluster but that this copy did not contain the most recent data (store.in_sync: false).

The reroute API provides a sub-command allocate_stale_primary, which is used to allocate a stale shard copy as primary. Using this command means losing the data that is missing in the given shard copy. Using the command in a situation where an in-sync shard copy is just temporarily unavailable effectively means destroying the more recent data that’s available on the in-sync copy. It should be seen as a measure of last resort to get the cluster running with at least some of the data. In the case where all shard copies are gone, there is also the possibility to force Elasticsearch to allocate a primary using an empty shard copy, which effectively means losing all previous data associated with that shard. It goes without saying that the command allocate_empty_primary should only be used in the direst of situations and that its implications should be well understood.
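
For reference, an allocate_stale_primary command could look like this (a sketch; the index, shard and node are taken from the example above, and accept_data_loss must be set explicitly to acknowledge the consequences):

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "my_index",
        "shard": 0,
        "node": "CX-rFmo",
        "accept_data_loss": true
      }
    }
  ]
}

The allocate_empty_primary command takes the same parameters.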