Cluster-level shard allocation and routing settings
editCluster-level shard allocation and routing settings
editShard allocation is the process of allocating shards to nodes. This can happen during initial recovery, replica allocation, rebalancing, or when nodes are added or removed.
One of the main roles of the master is to decide which shards to allocate to which nodes, and when to move shards between nodes in order to rebalance the cluster.
There are a number of settings available to control the shard allocation process:
- Cluster-level shard allocation settings control allocation and rebalancing operations.
- Disk-based shard allocation settings explains how Elasticsearch takes available disk space into account, and the related settings.
- Shard allocation awareness and Forced awareness control how shards can be distributed across different racks or availability zones.
- Cluster-level shard allocation filtering allows certain nodes or groups of nodes excluded from allocation so that they can be decommissioned.
Besides these, there are a few other miscellaneous cluster-level settings.
Cluster-level shard allocation settings
editYou can use the following settings to control shard allocation and recovery:
-
cluster.routing.allocation.enable
-
(Dynamic) Enable or disable allocation for specific kinds of shards:
-
all
- (default) Allows shard allocation for all kinds of shards. -
primaries
- Allows shard allocation only for primary shards. -
new_primaries
- Allows shard allocation only for primary shards for new indices. -
none
- No shard allocations of any kind are allowed for any indices.
This setting does not affect the recovery of local primary shards when restarting a node. A restarted node that has a copy of an unassigned primary shard will recover that primary immediately, assuming that its allocation id matches one of the active allocation ids in the cluster state.
-
-
cluster.routing.allocation.node_concurrent_incoming_recoveries
-
(Dynamic)
How many concurrent incoming shard recoveries are allowed to happen on a node. Incoming recoveries are the recoveries
where the target shard (most likely the replica unless a shard is relocating) is allocated on the node. Defaults to
2
. -
cluster.routing.allocation.node_concurrent_outgoing_recoveries
-
(Dynamic)
How many concurrent outgoing shard recoveries are allowed to happen on a node. Outgoing recoveries are the recoveries
where the source shard (most likely the primary unless a shard is relocating) is allocated on the node. Defaults to
2
. -
cluster.routing.allocation.node_concurrent_recoveries
-
(Dynamic)
A shortcut to set both
cluster.routing.allocation.node_concurrent_incoming_recoveries
andcluster.routing.allocation.node_concurrent_outgoing_recoveries
. Defaults to 2. -
cluster.routing.allocation.node_initial_primaries_recoveries
-
(Dynamic)
While the recovery of replicas happens over the network, the recovery of
an unassigned primary after node restart uses data from the local disk.
These should be fast so more initial primary recoveries can happen in
parallel on the same node. Defaults to
4
.
-
cluster.routing.allocation.same_shard.host
-
(Dynamic)
If
true
, forbids multiple copies of a shard from being allocated to distinct nodes on the same host, i.e. which have the same network address. Defaults tofalse
, meaning that copies of a shard may sometimes be allocated to nodes on the same host. This setting is only relevant if you run multiple nodes on each host.
Shard rebalancing settings
editA cluster is balanced when it has an equal number of shards on each node, with all nodes needing equal resources, without having a concentration of shards from any index on any node. Elasticsearch runs an automatic process called rebalancing which moves shards between the nodes in your cluster to improve its balance. Rebalancing obeys all other shard allocation rules such as allocation filtering and forced awareness which may prevent it from completely balancing the cluster. In that case, rebalancing strives to achieve the most balanced cluster possible within the rules you have configured. If you are using data tiers then Elasticsearch automatically applies allocation filtering rules to place each shard within the appropriate tier. These rules mean that the balancer works independently within each tier.
You can use the following settings to control the rebalancing of shards across the cluster:
-
cluster.routing.rebalance.enable
-
(Dynamic) Enable or disable rebalancing for specific kinds of shards:
-
all
- (default) Allows shard balancing for all kinds of shards. -
primaries
- Allows shard balancing only for primary shards. -
replicas
- Allows shard balancing only for replica shards. -
none
- No shard balancing of any kind are allowed for any indices.
-
-
cluster.routing.allocation.allow_rebalance
-
(Dynamic) Specify when shard rebalancing is allowed:
-
always
- Always allow rebalancing. -
indices_primaries_active
- Only when all primaries in the cluster are allocated. -
indices_all_active
- (default) Only when all shards (primaries and replicas) in the cluster are allocated.
-
-
cluster.routing.allocation.cluster_concurrent_rebalance
-
(Dynamic)
Defines the number of concurrent shard rebalances are allowed across the whole
cluster. Defaults to
2
. Note that this setting only controls the number of concurrent shard relocations due to imbalances in the cluster. This setting does not limit shard relocations due to allocation filtering or forced awareness. -
cluster.routing.allocation.type
-
Selects the algorithm used for computing the cluster balance. Defaults to
desired_balance
which selects the desired balance allocator. This allocator runs a background task which computes the desired balance of shards in the cluster. Once this background task completes, Elasticsearch moves shards to their desired locations.May also be set to
balanced
to select the legacy balanced allocator. This allocator was the default allocator in versions of Elasticsearch before 8.6.0. It runs in the foreground, preventing the master from doing other work in parallel. It works by selecting a small number of shard movements which immediately improve the balance of the cluster, and when those shard movements complete it runs again and selects another few shards to move. Since this allocator makes its decisions based only on the current state of the cluster, it will sometimes move a shard several times while balancing the cluster.
Shard balancing heuristics settings
editRebalancing works by computing a weight for each node based on its allocation of shards, and then moving shards between nodes to reduce the weight of the heavier nodes and increase the weight of the lighter ones. The cluster is balanced when there is no possible shard movement that can bring the weight of any node closer to the weight of any other node by more than a configurable threshold.
The weight of a node depends on the number of shards it holds and on the total estimated resource usage of those shards expressed in terms of the size of the shard on disk and the number of threads needed to support write traffic to the shard. Elasticsearch estimates the resource usage of shards belonging to data streams when they are created by a rollover. The estimated disk size of the new shard is the mean size of the other shards in the data stream. The estimated write load of the new shard is a weighted average of the actual write loads of recent shards in the data stream. Shards that do not belong to the write index of a data stream have an estimated write load of zero.
The following settings control how Elasticsearch combines these values into an overall measure of each node’s weight.
-
cluster.routing.allocation.balance.shard
-
(float, Dynamic)
Defines the weight factor for the total number of shards allocated to each node.
Defaults to
0.45f
. Raising this value increases the tendency of Elasticsearch to equalize the total number of shards across nodes ahead of the other balancing variables. -
cluster.routing.allocation.balance.index
-
(float, Dynamic)
Defines the weight factor for the number of shards per index allocated to each
node. Defaults to
0.55f
. Raising this value increases the tendency of Elasticsearch to equalize the number of shards of each index across nodes ahead of the other balancing variables. -
cluster.routing.allocation.balance.disk_usage
-
(float, Dynamic)
Defines the weight factor for balancing shards according to their predicted disk
size in bytes. Defaults to
2e-11f
. Raising this value increases the tendency of Elasticsearch to equalize the total disk usage across nodes ahead of the other balancing variables. -
cluster.routing.allocation.balance.write_load
-
(float, Dynamic)
Defines the weight factor for the write load of each shard, in terms of the
estimated number of indexing threads needed by the shard. Defaults to
10.0f
. Raising this value increases the tendency of Elasticsearch to equalize the total write load across nodes ahead of the other balancing variables. -
cluster.routing.allocation.balance.threshold
-
(float, Dynamic)
The minimum improvement in weight which triggers a rebalancing shard movement.
Defaults to
1.0f
. Raising this value will cause Elasticsearch to stop rebalancing shards sooner, leaving the cluster in a more unbalanced state.
Regardless of the result of the balancing algorithm, rebalancing might not be allowed due to allocation rules such as forced awareness and allocation filtering.
Disk-based shard allocation settings
editThe disk-based shard allocator ensures that all nodes have enough disk space without performing more shard movements than necessary. It allocates shards based on a pair of thresholds known as the low watermark and the high watermark. Its primary goal is to ensure that no node exceeds the high watermark, or at least that any such overage is only temporary. If a node exceeds the high watermark then Elasticsearch will solve this by moving some of its shards onto other nodes in the cluster.
It is normal for nodes to temporarily exceed the high watermark from time to time.
The allocator also tries to keep nodes clear of the high watermark by forbidding the allocation of more shards to a node that exceeds the low watermark. Importantly, if all of your nodes have exceeded the low watermark then no new shards can be allocated and Elasticsearch will not be able to move any shards between nodes in order to keep the disk usage below the high watermark. You must ensure that your cluster has enough disk space in total and that there are always some nodes below the low watermark.
Shard movements triggered by the disk-based shard allocator must also satisfy all other shard allocation rules such as allocation filtering and forced awareness. If these rules are too strict then they can also prevent the shard movements needed to keep the nodes' disk usage under control. If you are using data tiers then Elasticsearch automatically configures allocation filtering rules to place shards within the appropriate tier, which means that the disk-based shard allocator works independently within each tier.
If a node is filling up its disk faster than Elasticsearch can move shards elsewhere then there is a risk that the disk will completely fill up. To prevent this, as a last resort, once the disk usage reaches the flood-stage watermark Elasticsearch will block writes to indices with a shard on the affected node. It will also continue to move shards onto the other nodes in the cluster. When disk usage on the affected node drops below the high watermark, Elasticsearch automatically removes the write block. Refer to Fix watermark errors to resolve persistent watermark errors.
It is normal for the nodes in your cluster to be using very different amounts of disk space. The balance of the cluster depends only on the number of shards on each node and the indices to which those shards belong. It considers neither the sizes of these shards nor the available disk space on each node, for the following reasons:
- Disk usage changes over time. Balancing the disk usage of individual nodes would require a lot more shard movements, perhaps even wastefully undoing earlier movements. Moving a shard consumes resources such as I/O and network bandwidth and may evict data from the filesystem cache. These resources are better spent handling your searches and indexing where possible.
- A cluster with equal disk usage on every node typically performs no better than one that has unequal disk usage, as long as no disk is too full.
You can use the following settings to control disk-based allocation:
-
cluster.routing.allocation.disk.threshold_enabled
-
(Dynamic)
Defaults to
true
. Set tofalse
to disable the disk allocation decider. Upon disabling, it will also remove any existingindex.blocks.read_only_allow_delete
index blocks.
-
cluster.routing.allocation.disk.watermark.low
-
(Dynamic)
Controls the low watermark for disk usage. It defaults to
85%
, meaning that Elasticsearch will not allocate shards to nodes that have more than 85% disk used. It can alternatively be set to a ratio value, e.g.,0.85
. It can also be set to an absolute byte value (like500mb
) to prevent Elasticsearch from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated. -
cluster.routing.allocation.disk.watermark.low.max_headroom
-
(Dynamic) Controls the max headroom for the low watermark (in case of a percentage/ratio value).
Defaults to 200GB when
cluster.routing.allocation.disk.watermark.low
is not explicitly set. This caps the amount of free space required.
-
cluster.routing.allocation.disk.watermark.high
-
(Dynamic)
Controls the high watermark. It defaults to
90%
, meaning that Elasticsearch will attempt to relocate shards away from a node whose disk usage is above 90%. It can alternatively be set to a ratio value, e.g.,0.9
. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not. -
cluster.routing.allocation.disk.watermark.high.max_headroom
-
(Dynamic) Controls the max headroom for the high watermark (in case of a percentage/ratio value).
Defaults to 150GB when
cluster.routing.allocation.disk.watermark.high
is not explicitly set. This caps the amount of free space required. -
cluster.routing.allocation.disk.watermark.enable_for_single_data_node
-
(Static)
In earlier releases, the default behaviour was to disregard disk watermarks for a single
data node cluster when making an allocation decision. This is deprecated behavior
since 7.14 and has been removed in 8.0. The only valid value for this setting
is now
true
. The setting will be removed in a future release.
-
cluster.routing.allocation.disk.watermark.flood_stage
-
(Dynamic) Controls the flood stage watermark, which defaults to 95%. Elasticsearch enforces a read-only index block (
index.blocks.read_only_allow_delete
) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark. Similarly to the low and high watermark values, it can alternatively be set to a ratio value, e.g.,0.95
, or an absolute byte value.An example of resetting the read-only index block on the
my-index-000001
index:response = client.indices.put_settings( index: 'my-index-000001', body: { "index.blocks.read_only_allow_delete": nil } ) puts response
PUT /my-index-000001/_settings { "index.blocks.read_only_allow_delete": null }
-
cluster.routing.allocation.disk.watermark.flood_stage.max_headroom
-
(Dynamic) Controls the max headroom for the flood stage watermark (in case of a percentage/ratio value).
Defaults to 100GB when
cluster.routing.allocation.disk.watermark.flood_stage
is not explicitly set. This caps the amount of free space required.
You cannot mix the usage of percentage/ratio values and byte values across
the cluster.routing.allocation.disk.watermark.low
, cluster.routing.allocation.disk.watermark.high
,
and cluster.routing.allocation.disk.watermark.flood_stage
settings. Either all values
are set to percentage/ratio values, or all are set to byte values. This enforcement is
so that Elasticsearch can validate that the settings are internally consistent, ensuring that the
low disk threshold is less than the high disk threshold, and the high disk threshold is
less than the flood stage threshold. A similar comparison check is done for the max
headroom values.
-
cluster.routing.allocation.disk.watermark.flood_stage.frozen
- (Dynamic) Controls the flood stage watermark for dedicated frozen nodes, which defaults to 95%.
-
cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom
-
(Dynamic)
Controls the max headroom for the flood stage watermark (in case of a
percentage/ratio value) for dedicated frozen nodes. Defaults to 20GB when
cluster.routing.allocation.disk.watermark.flood_stage.frozen
is not explicitly set. This caps the amount of free space required on dedicated frozen nodes. -
cluster.info.update.interval
-
(Dynamic)
How often Elasticsearch should check on disk usage for each node in the
cluster. Defaults to
30s
.
Percentage values refer to used disk space, while byte values refer to free disk space. This can be confusing, since it flips the meaning of high and low. For example, it makes sense to set the low watermark to 10gb and the high watermark to 5gb, but not the other way around.
An example of updating the low watermark to at least 100 gigabytes free, a high watermark of at least 50 gigabytes free, and a flood stage watermark of 10 gigabytes free, and updating the information about the cluster every minute:
response = client.cluster.put_settings( body: { persistent: { "cluster.routing.allocation.disk.watermark.low": '100gb', "cluster.routing.allocation.disk.watermark.high": '50gb', "cluster.routing.allocation.disk.watermark.flood_stage": '10gb', "cluster.info.update.interval": '1m' } } ) puts response
PUT _cluster/settings { "persistent": { "cluster.routing.allocation.disk.watermark.low": "100gb", "cluster.routing.allocation.disk.watermark.high": "50gb", "cluster.routing.allocation.disk.watermark.flood_stage": "10gb", "cluster.info.update.interval": "1m" } }
Concerning the max headroom settings for the watermarks, please note that these apply only in the case that the watermark settings are percentages/ratios. The aim of a max headroom value is to cap the required free disk space before hitting the respective watermark. This is especially useful for servers with larger disks, where a percentage/ratio watermark could translate to a big free disk space requirement, and the max headroom can be used to cap the required free disk space amount. As an example, let us take the default settings for the flood watermark. It has a 95% default value, and the flood max headroom setting has a default value of 100GB. This means that:
- For a smaller disk, e.g., of 100GB, the flood watermark will hit at 95%, meaning at 5GB of free space, since 5GB is smaller than the 100GB max headroom value.
- For a larger disk, e.g., of 100TB, the flood watermark will hit at 100GB of free space. That is because the 95% flood watermark alone would require 5TB of free disk space, but that is capped by the max headroom setting to 100GB.
Finally, the max headroom settings have their default values only if their respective watermark settings are not explicitly set (thus, they have their default percentage values). If watermarks are explicitly set, then the max headroom settings do not have their default values, and would need to be explicitly set if they are desired.
Shard allocation awareness
editYou can use custom node attributes as awareness attributes to enable Elasticsearch to take your physical hardware configuration into account when allocating shards. If Elasticsearch knows which nodes are on the same physical server, in the same rack, or in the same zone, it can distribute the primary shard and its replica shards to minimise the risk of losing all shard copies in the event of a failure.
When shard allocation awareness is enabled with the
dynamic
cluster.routing.allocation.awareness.attributes
setting, shards are only
allocated to nodes that have values set for the specified awareness attributes.
If you use multiple awareness attributes, Elasticsearch considers each attribute
separately when allocating shards.
The number of attribute values determines how many shard copies are allocated in each location. If the number of nodes in each location is unbalanced and there are a lot of replicas, replica shards might be left unassigned.
Enabling shard allocation awareness
editTo enable shard allocation awareness:
-
Specify the location of each node with a custom node attribute. For example, if you want Elasticsearch to distribute shards across different racks, you might set an awareness attribute called
rack_id
in each node’selasticsearch.yml
config file.node.attr.rack_id: rack_one
You can also set custom attributes when you start a node:
./bin/elasticsearch -Enode.attr.rack_id=rack_one
-
Tell Elasticsearch to take one or more awareness attributes into account when allocating shards by setting
cluster.routing.allocation.awareness.attributes
in every master-eligible node’selasticsearch.yml
config file.You can also use the cluster-update-settings API to set or update a cluster’s awareness attributes.
With this example configuration, if you start two nodes with
node.attr.rack_id
set to rack_one
and create an index with 5 primary
shards and 1 replica of each primary, all primaries and replicas are
allocated across the two nodes.
If you add two nodes with node.attr.rack_id
set to rack_two
,
Elasticsearch moves shards to the new nodes, ensuring (if possible)
that no two copies of the same shard are in the same rack.
If rack_two
fails and takes down both its nodes, by default Elasticsearch
allocates the lost shard copies to nodes in rack_one
. To prevent multiple
copies of a particular shard from being allocated in the same location, you can
enable forced awareness.
Forced awareness
editBy default, if one location fails, Elasticsearch assigns all of the missing replica shards to the remaining locations. While you might have sufficient resources across all locations to host your primary and replica shards, a single location might be unable to host ALL of the shards.
To prevent a single location from being overloaded in the event of a failure,
you can set cluster.routing.allocation.awareness.force
so no replicas are
allocated until nodes are available in another location.
For example, if you have an awareness attribute called zone
and configure nodes
in zone1
and zone2
, you can use forced awareness to prevent Elasticsearch
from allocating replicas if only one zone is available:
cluster.routing.allocation.awareness.attributes: zone cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
With this example configuration, if you start two nodes with node.attr.zone
set
to zone1
and create an index with 5 shards and 1 replica, Elasticsearch creates
the index and allocates the 5 primary shards but no replicas. Replicas are
only allocated once nodes with node.attr.zone
set to zone2
are available.
Cluster-level shard allocation filtering
editYou can use cluster-level shard allocation filters to control where Elasticsearch allocates shards from any index. These cluster wide filters are applied in conjunction with per-index allocation filtering and allocation awareness.
Shard allocation filters can be based on custom node attributes or the built-in
_name
, _host_ip
, _publish_ip
, _ip
, _host
, _id
and _tier
attributes.
The cluster.routing.allocation
settings are dynamic, enabling live indices to
be moved from one set of nodes to another. Shards are only relocated if it is
possible to do so without breaking another routing constraint, such as never
allocating a primary and replica shard on the same node.
The most common use case for cluster-level shard allocation filtering is when you want to decommission a node. To move shards off of a node prior to shutting it down, you could create a filter that excludes the node by its IP address:
response = client.cluster.put_settings( body: { persistent: { "cluster.routing.allocation.exclude._ip": '10.0.0.1' } } ) puts response
PUT _cluster/settings { "persistent" : { "cluster.routing.allocation.exclude._ip" : "10.0.0.1" } }
Cluster routing settings
edit-
cluster.routing.allocation.include.{attribute}
-
(Dynamic)
Allocate shards to a node whose
{attribute}
has at least one of the comma-separated values. -
cluster.routing.allocation.require.{attribute}
-
(Dynamic)
Only allocate shards to a node whose
{attribute}
has all of the comma-separated values. -
cluster.routing.allocation.exclude.{attribute}
-
(Dynamic)
Do not allocate shards to a node whose
{attribute}
has any of the comma-separated values.
The cluster allocation settings support the following built-in attributes:
|
Match nodes by node name |
|
Match nodes by host IP address (IP associated with hostname) |
|
Match nodes by publish IP address |
|
Match either |
|
Match nodes by hostname |
|
Match nodes by node id |
|
Match nodes by the node’s data tier role |
You can use wildcards when specifying attribute values, for example:
response = client.cluster.put_settings( body: { persistent: { "cluster.routing.allocation.exclude._ip": '192.168.2.*' } } ) puts response
PUT _cluster/settings { "persistent": { "cluster.routing.allocation.exclude._ip": "192.168.2.*" } }