RED Elasticsearch Cluster? Panic no longer

Beep… beep… beep…  It’s PagerDuty telling you that a node has gone down, or a rack has gone down, or your whole cluster has just rebooted.  Either way, your cluster is RED: some shards are not assigned, which means that your data is not fully available.

Pop quiz, hot shot: What do you do?  What do you do?!

In earlier versions of Elasticsearch, figuring out why shards were not being allocated required the analytical skills of a bomb defusal expert.  You’d look through the cluster state API, the cat-shards API, the cat-allocation API, the cat-indices API, the indices-recovery API, the indices-shard-stores API… and wonder what it all meant.  

Now, we have added the cluster-allocation-explain API: your one-stop shop for understanding shard allocation.  

The cluster allocation explain API (henceforth referred to as the explain API) was introduced as an experimental API in v5.0 and reworked into its current form in v5.2.  The explain API was designed to answer two fundamental questions:

  1. For unassigned shards: “Why are my shards unassigned?”
  2. For assigned shards: “Why are my shards assigned to this particular node?”

It should be noted that shard allocation problems should be rare in a cluster; they are typically the result of node and/or cluster configuration problems (e.g. incorrectly setting up filter allocation settings), all nodes holding a valid shard copy having disconnected from the cluster, disk problems, and the like.  But when these problems do occur, a cluster administrator needs the right tools to identify the problem and nurse the cluster back to good health, and that is what the explain API is for.

The goal of this blog post is to highlight the power of the explain API by going through some examples of how to use it to diagnose shard allocation problems.  

What is shard allocation?

Shard allocation is the process of assigning a shard to a node in the cluster.  In order to scale to huge document sets and provide high availability in the face of node failure, Elasticsearch splits an index’s documents into shards, each shard residing on a node in the cluster.  If a primary shard cannot be allocated, the index will be missing data and/or no new documents can be written to the index.  If replica shards cannot be allocated, the cluster is at risk of losing data in the case of permanent failure (e.g. corrupt disks) of the node holding the primary shard.  If shards are allocated but to slower-than-desired nodes, then the shards of high traffic indices will suffer from being located on slower machines and the performance of the cluster will degrade.  Therefore, allocating shards and assigning them to the best node possible is of fundamental importance within Elasticsearch.  

The shard allocation process differs for newly created indices and existing indices.  In both cases, Elasticsearch has two main components at work: allocators and deciders.  Allocators try to find the best (defined below) nodes to hold the shard, while deciders determine whether allocating the shard to a given node is allowed.

For newly created indices, the allocator looks for the nodes with the fewest shards on them and returns a list of nodes sorted by shard weight, with those having the lowest shard weight appearing first.  Thus, the allocator’s goal for a newly created index is to assign the index’s shards to nodes in a manner that results in the most balanced cluster.  The deciders then take each node in order and decide if the shard is allowed to be allocated to that node.  For example, if filter allocation rules are set up to prevent node <code>A</code> from holding any shards of index <code>idx</code>, then the <code>filter</code> decider will prevent allocation of <code>idx</code>’s shards to node <code>A</code>, even though it may be ranked first by the allocator as the best node from a cluster balancing perspective.  Note that the allocators only take into account the number of shards per node, not the size of each shard.  It is the job of one of the deciders to prevent allocation to nodes that exceed a certain disk occupancy threshold.  
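
That disk occupancy threshold is controlled by the disk-based allocation decider’s watermark settings, which can be adjusted dynamically.  As a sketch (the values shown below are simply the defaults):

PUT /_cluster/settings
{ 
   "transient": { 
      "cluster.routing.allocation.disk.watermark.low": "85%", 
      "cluster.routing.allocation.disk.watermark.high": "90%" 
   } 
}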

For existing indices, we must further distinguish between primary and replica shards.  For a primary shard, the allocator will *only* allow allocation to a node that already holds a known good copy of the shard.  If the allocator did not take such a step, then allocating a primary shard to a node that does not already have an up-to-date copy of the shard will result in data loss!  In the case of replica shards, the allocator first looks to see if there are already copies of the shard (even stale copies) on other nodes.  If so, the allocator will prioritize assigning the shard to one of the nodes holding a copy, because the replica needs to get in-sync with the primary once it is allocated, and the fact that a node already has some of the shard data means (hopefully) a lot less data has to be copied over to the replica from the primary.  This can speed up the recovery process for the replica shard significantly.  
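
If you want to see for yourself which nodes hold an on-disk copy of a shard (and hence which copies a primary or replica could be recovered from), the indices-shard-stores API mentioned earlier can be queried directly.  A minimal sketch, where my_idx is just a placeholder index name:

GET /my_idx/_shard_stores?status=all

It lists, for each shard, the nodes that have a copy of the shard’s data on disk.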

Diagnosing Unassigned Primary Shards

Having an unassigned primary shard is about the worst thing that can happen in a cluster.  If the unassigned primary is on a newly created index, no documents can be written to that index.  If the unassigned primary is on an existing index, then not only can the index not be written to, but all data previously indexed is unavailable for searching!  

Let’s start by creating a new index named test_idx, with 1 shard and 0 replicas, in a two-node cluster (with node names A and B), but assigning filter allocation rules at index creation time so that shards for the index cannot be assigned to nodes A and B.  The following request accomplishes this:

PUT /test_idx?wait_for_active_shards=0 
{ 
   "settings": 
   { 
      "number_of_shards": 1, 
      "number_of_replicas": 0,
      "index.routing.allocation.exclude._name": "A,B" 
   } 
}

The index will be created, but none of its shards can be allocated, because the filter allocation rules in the index’s settings prevent allocation to the only two nodes in the cluster.  While this example is contrived and may sound unrealistic, it is entirely possible that a misconfiguration of the filter settings (perhaps a misspelled node name) leaves the shards of an index unassigned.  
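
Before reaching for the explain API, a quick sanity check with the cluster health and cat-shards APIs (output omitted here) will confirm that the shard is unassigned:

GET /_cluster/health

GET /_cat/shards/test_idx?v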

At this point, our cluster is in the RED state.  We can get an explanation for the first unassigned shard found in the cluster (in this case, there is only one shard in the cluster which is unassigned) by invoking the explain API with an empty request body:

GET /_cluster/allocation/explain

Which produces the following output:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED", 
    "at" : "2017-01-16T18:12:39.401Z",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",   
  "node_allocation_decisions" : [ 
    {
      "node_id" : "tn3qdPdnQWuumLxVVjJJYQ",
      "node_name" : "A", 
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "filter",  
          "decision" : "NO", 
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A OR B\"]" 
        }
      ]
    },
    {
      "node_id" : "qNgMCvaCSPi3th0mTcyvKQ",
      "node_name" : "B", 
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "weight_ranking" : 2,
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A OR B\"]"
        }
      ]
    }
  ]
}

The explain API found the primary shard 0 of test_idx to explain, which is in the unassigned state (see "current_state") due to the index having just been created (see "unassigned_info").  The shard cannot be allocated (see "can_allocate") due to none of the nodes permitting allocation of the shard (see "allocate_explanation").  When drilling down to each node’s decision (see "node_allocation_decisions"), we observe that node A received a decision not to allocate (see "node_decision") due to the filter decider (see "decider") preventing allocation with the reason that the filter allocation settings excluded nodes A and B from holding a copy of the shard (see "explanation" inside the "deciders" section).  The explanation also contains the exact setting to change to allow the shard to be allocated in the cluster.

Updating the filter allocation settings via:

PUT /test_idx/_settings 
{ 
   "index.routing.allocation.exclude._name": null 
}

And re-running the explain API results in an error message saying 

   unable to find any unassigned shards to explain

This is because the only shard for test_idx has now been assigned!  Running the explain API on the primary shard: 

GET /_cluster/allocation/explain 
{ 
   "index": "test_idx", 
   "shard": 0, 
   "primary": true
}

Allows us to see which node the shard was assigned to (output abbreviated):

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "tn3qdPdnQWuumLxVVjJJYQ",
    "name" : "A",
    "transport_address" : "127.0.0.1:9300",
    "weight_ranking" : 1
  },
  …
}

We can see that the shard is now in the started state and assigned to node A (the node that the shard is assigned to may differ when you run it).

Now, let’s index some data into test_idx, so that our primary shard holds documents that we would never want to lose (at least, not without explicitly deleting them).
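
A minimal indexing request might look like the following (the type name doc and the document body are illustrative, not part of the original example):

POST /test_idx/doc
{ 
   "message": "a document we would rather not lose" 
}

Next, we will stop node A so that the primary shard is no longer in the cluster.  Since there are no replica copies at the moment, the shard will remain unassigned and the cluster health will be RED.  Rerunning the above explain API command on primary shard 0 will return: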

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {             
    "reason" : "NODE_LEFT",    
    "at" : "2017-01-16T17:24:21.157Z",
    "details" : "node_left[qU98BvbtQu2crqXF2ATFdA]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy", 
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster" 
}

The output tells us that the primary shard is currently unassigned (see "current_state") because the node holding the primary left the cluster (see "unassigned_info").  The reason given to us by the explain API for why the shard cannot be allocated is that there is no longer any valid copy for shard 0 of test_idx in the cluster (see "can_allocate") along with the explanation of why the shard cannot be allocated (see "allocate_explanation").  

Since the explain API tells us that there is no longer a valid shard copy for our primary shard, we know we have lost all of the nodes that held a valid copy.  At this point, the only recourse is to wait for those nodes to come back to life and rejoin the cluster.  In the unlikely event that all nodes holding copies of this particular shard are permanently dead, the only recourse is to use the reroute commands to allocate an empty or stale primary shard and accept the fact that data has been lost.
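
A sketch of what that last resort looks like with the reroute API (the target node here is whichever node should receive the stale copy, and accept_data_loss must be set explicitly to acknowledge the consequences):

POST /_cluster/reroute
{ 
   "commands": [ 
      { 
         "allocate_stale_primary": { 
            "index": "test_idx", 
            "shard": 0, 
            "node": "A", 
            "accept_data_loss": true 
         } 
      } 
   ] 
}

The allocate_empty_primary command works the same way, but discards whatever data was in the shard instead of promoting a stale copy.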

Diagnosing Unassigned Replica Shards

With node A back in the cluster and the primary shard assigned again, let’s take our existing test_idx and increase the number of replicas to 1:

PUT /test_idx/_settings 
{ 
   "number_of_replicas": 1 
}

This will give us a total of 2 shards for test_idx - the primary for shard 0 and the replica for shard 0.  Since node A already holds the primary, the replica should be allocated to node B, to form a balanced cluster (plus, node A already has a copy of the shard).  Running the explain API on the replica shard confirms this:

GET /_cluster/allocation/explain 
{ 
   "index": "test_idx", 
   "shard": 0, 
   "primary": false 
}

Results in the response:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : false,
  "current_state" : "started",
  "current_node" : {
    "id" : "qNgMCvaCSPi3th0mTcyvKQ",
    "name" : "B",
    "transport_address" : "127.0.0.1:9301",
    "weight_ranking" : 1
  },
  …
}

The output shows that the shard is in the started state and assigned to node B.

Next, we will again set the filter allocation settings on the index, but this time, we will only prevent shard allocation to node B:

PUT /test_idx/_settings
{ 
   "index.routing.allocation.exclude._name": "B" 
}

Now, restart node B and re-run the explain API command for the replica shard:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2017-01-16T19:10:34.478Z",
    "details" : "node_left[qNgMCvaCSPi3th0mTcyvKQ]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no", 
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "qNgMCvaCSPi3th0mTcyvKQ",
      "node_name" : "B",
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",  
          "decision" : "NO",
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"B\"]" 
        }
      ]
    },
    {
      "node_id" : "tn3qdPdnQWuumLxVVjJJYQ",
      "node_name" : "A",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",  
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test_idx][0], node[tn3qdPdnQWuumLxVVjJJYQ], [P], s[STARTED], a[id=JNODiTgYTrSp8N2s0Q7MrQ]]" 
        }
      ]
    }
  ]
}

We learn from this output that the replica shard cannot be allocated (see "can_allocate") because the filter allocation rules prevent allocation to node B (see "explanation" for the filter decider).  Since node A already holds the primary shard, another copy of the same shard cannot be assigned to it (see "explanation" for the same shard decider) - Elasticsearch prevents this because there is no point in having two copies of the same data live on the same machine.

Diagnosing Assigned Shards

If a shard is assigned, why might you care about its allocation explanation?  One common reason would be that a shard (primary or replica) of an index is allocated to a node, and you just set up filtering rules to move the shard from its current node to another node (perhaps you are trying to make use of the “hot-warm” architecture) but for some reason, the shard remains on its current node.  This is another situation where the explain API can shed light on the shard allocation process.
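
As a sketch of such a rule in a hot-warm setup, one might require an index’s shards to live on nodes tagged with a custom attribute (the attribute name box_type and the value warm below are illustrative; they have to be defined on your nodes, Elasticsearch does not provide them):

PUT /test_idx/_settings
{ 
   "index.routing.allocation.require.box_type": "warm" 
}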

Let’s again clear the filter allocation rules so that both primary and replica in our test_idx are assigned:

PUT /test_idx/_settings 
{ 
    "index.routing.allocation.exclude._name": null 
}

Now, let’s set the filter allocation rules so that the primary shard cannot remain on its current node (in my case, node A):

PUT /test_idx/_settings 
{ 
    "index.routing.allocation.exclude._name": "A"  
}

One might expect at this point that the filter allocation rules will cause the primary to move away from its current node to another node, but in fact it does not.  We can run the explain API on the primary shard to see why:

GET /_cluster/allocation/explain 
{ 
    "index": "test_idx", 
    "shard": 0, 
    "primary": true 
}

Which results in the response:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "tn3qdPdnQWuumLxVVjJJYQ",
    "name" : "A",  
    "transport_address" : "127.0.0.1:9300"
  },
  "can_remain_on_current_node" : "no", 
  "can_remain_decisions" : [   
    {
      "decider" : "filter",
      "decision" : "NO",
      "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A\"]"   
    }
  ],
  "can_move_to_other_node" : "no", 
  "move_explanation" : "cannot move shard to another node, even though it is not allowed to remain on its current node",
  "node_allocation_decisions" : [
    {
      "node_id" : "qNgMCvaCSPi3th0mTcyvKQ",
      "node_name" : "B",
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard", 
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test_idx][0], node[qNgMCvaCSPi3th0mTcyvKQ], [R], s[STARTED], a[id=dNgHLTKwRH-Dp-rIX4Hkqg]]" 
        }
      ]
    }
  ]
}

Examining the output, we can see that the primary shard is still assigned to node A (see "current_node").  The cluster correctly acknowledges that the shard can no longer remain on its current node (see "can_remain_on_current_node"), with the reason given that the current node matches the filter allocation exclude rules (see "can_remain_decisions").  Despite not being allowed to remain on its current node, the explain API tells us that the shard cannot be moved to a different node either (see "can_move_to_other_node"), because the only other node in the cluster (node B) already contains an active copy of the shard, and a node may not hold more than one copy of the same shard (see the explanation for the same shard decider in "node_allocation_decisions").

Conclusion

In this blog post, we’ve tried to highlight some of the scenarios where the cluster allocation explain API is very useful in helping an administrator understand the shard allocation process in an Elasticsearch cluster.  There are numerous scenarios that the explain API covers, including showing the node weights to explain why an assigned shard remains on its current node instead of rebalancing to another node.  The explain API is an invaluable tool for diagnosing issues with one’s cluster, and we have already seen big benefits and time savings from using it internally during development as well as when diagnosing cluster instabilities for our customers.
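
As a final sketch, the explain API also accepts the include_yes_decisions and include_disk_info query parameters to produce more verbose output, including the decisions and weights for nodes that would accept the shard and the disk usage information the disk-based decider works from:

GET /_cluster/allocation/explain?include_yes_decisions=true&include_disk_info=true
{ 
   "index": "test_idx", 
   "shard": 0, 
   "primary": true 
}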