Cluster reroute API

edit

Changes the allocation of shards in a cluster.

Request

edit

POST /_cluster/reroute

Prerequisites

edit
  • If the Elasticsearch security features are enabled, you must have the manage cluster privilege to use this API.

Description

edit

The reroute command allows for manual changes to the allocation of individual shards in the cluster. For example, a shard can be moved from one node to another explicitly, an allocation can be cancelled, and an unassigned shard can be explicitly allocated to a specific node.

It is important to note that after processing any reroute commands Elasticsearch will perform rebalancing as normal (respecting the values of settings such as cluster.routing.rebalance.enable) in order to remain in a balanced state. For example, if the requested allocation includes moving a shard from node1 to node2 then this may cause a shard to be moved from node2 back to node1 to even things out.

The cluster can be set to disable allocations using the cluster.routing.allocation.enable setting. If allocations are disabled then the only allocations that will be performed are explicit ones given using the reroute command, and consequent allocations due to rebalancing.

It is possible to run reroute commands in "dry run" mode by using the ?dry_run URI query parameter, or by passing "dry_run": true in the request body. This will calculate the result of applying the commands to the current cluster state, and return the resulting cluster state after the commands (and re-balancing) has been applied, but will not actually perform the requested changes.

If the ?explain URI query parameter is included then a detailed explanation of why the commands could or could not be executed is included in the response.

The cluster will attempt to allocate a shard a maximum of index.allocation.max_retries times in a row (defaults to 5), before giving up and leaving the shard unallocated. This scenario can be caused by structural problems such as having an analyzer which refers to a stopwords file which doesn’t exist on all nodes.

Once the problem has been corrected, allocation can be manually retried by calling the reroute API with the ?retry_failed URI query parameter, which will attempt a single retry round for these shards.

Query parameters

edit
dry_run
(Optional, Boolean) If true, then the request simulates the operation only and returns the resulting state.
explain
(Optional, Boolean) If true, then the response contains an explanation of why the commands can or cannot be executed.
metric

(Optional, string) Limits the information returned to the specified metrics. All options except none are deprecated and should be avoided for this parameter. Defaults to all but metadata. The following options are available:

Options for metric
_all
Shows all metrics.
blocks
Shows the blocks part of the response.
master_node
Shows the elected master_node part of the response.
metadata
Shows the metadata part of the response. If you supply a comma separated list of indices, the returned output will only contain metadata for these indices.
nodes
Shows the nodes part of the response.
none
Excludes the entire state field from the response.
routing_table
Shows the routing_table part of the response.
version
Shows the cluster state version.
retry_failed
(Optional, Boolean) If true, then retries allocation of shards that are blocked due to too many subsequent allocation failures.
master_timeout
(Optional, time units) Period to wait for the master node. If the master node is not available before the timeout expires, the request fails and returns an error. Defaults to 30s. Can also be set to -1 to indicate that the request should never timeout.
timeout
(Optional, time units) Period to wait for a response from all relevant nodes in the cluster after updating the cluster metadata. If no response is received before the timeout expires, the cluster metadata update still applies but the response will indicate that it was not completely acknowledged. Defaults to 30s. Can also be set to -1 to indicate that the request should never timeout.

Request body

edit
commands

(Required, array of objects) Defines the commands to perform. Supported commands are:

Properties of commands
move
Move a started shard from one node to another node. Accepts index and shard for index name and shard number, from_node for the node to move the shard from, and to_node for the node to move the shard to.
cancel
Cancel allocation of a shard (or recovery). Accepts index and shard for index name and shard number, and node for the node to cancel the shard allocation on. This can be used to force resynchronization of existing replicas from the primary shard by cancelling them and allowing them to be reinitialized through the standard recovery process. By default only replica shard allocations can be cancelled. If it is necessary to cancel the allocation of a primary shard then the allow_primary flag must also be included in the request.
allocate_replica
Allocate an unassigned replica shard to a node. Accepts index and shard for index name and shard number, and node to allocate the shard to. Takes allocation deciders into account.

Two more commands are available that allow the allocation of a primary shard to a node. These commands should however be used with extreme care, as primary shard allocation is usually fully automatically handled by Elasticsearch. Reasons why a primary shard cannot be automatically allocated include the following:

  • A new index was created but there is no node which satisfies the allocation deciders.
  • An up-to-date shard copy of the data cannot be found on the current data nodes in the cluster. To prevent data loss, the system does not automatically promote a stale shard copy to primary.

The following two commands are dangerous and may result in data loss. They are meant to be used in cases where the original data can not be recovered and the cluster administrator accepts the loss. If you have suffered a temporary issue that can be fixed, please see the retry_failed flag described above. To emphasise: if these commands are performed and then a node joins the cluster that holds a copy of the affected shard then the copy on the newly-joined node will be deleted or overwritten.

allocate_stale_primary
Allocate a primary shard to a node that holds a stale copy. Accepts the index and shard for index name and shard number, and node to allocate the shard to. Using this command may lead to data loss for the provided shard id. If a node which has the good copy of the data rejoins the cluster later on, that data will be deleted or overwritten with the data of the stale copy that was forcefully allocated with this command. To ensure that these implications are well-understood, this command requires the flag accept_data_loss to be explicitly set to true.
allocate_empty_primary
Allocate an empty primary shard to a node. Accepts the index and shard for index name and shard number, and node to allocate the shard to. Using this command leads to a complete loss of all data that was indexed into this shard, if it was previously started. If a node which has a copy of the data rejoins the cluster later on, that data will be deleted. To ensure that these implications are well-understood, this command requires the flag accept_data_loss to be explicitly set to true.

Examples

edit

This is a short example of a simple reroute API call:

resp = client.cluster.reroute(
    commands=[
        {
            "move": {
                "index": "test",
                "shard": 0,
                "from_node": "node1",
                "to_node": "node2"
            }
        },
        {
            "allocate_replica": {
                "index": "test",
                "shard": 1,
                "node": "node3"
            }
        }
    ],
)
print(resp)
const response = await client.cluster.reroute({
  commands: [
    {
      move: {
        index: "test",
        shard: 0,
        from_node: "node1",
        to_node: "node2",
      },
    },
    {
      allocate_replica: {
        index: "test",
        shard: 1,
        node: "node3",
      },
    },
  ],
});
console.log(response);
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "test", "shard": 0,
        "from_node": "node1", "to_node": "node2"
      }
    },
    {
      "allocate_replica": {
        "index": "test", "shard": 1,
        "node": "node3"
      }
    }
  ]
}