Data tiers
editData tiers
editA data tier is a collection of nodes with the same data role that typically share the same hardware profile:
- Content tier nodes handle the indexing and query load for content such as a product catalog.
- Hot tier nodes handle the indexing load for time series data such as logs or metrics and hold your most recent, most-frequently-accessed data.
- Warm tier nodes hold time series data that is accessed less-frequently and rarely needs to be updated.
- Cold tier nodes hold time series data that is accessed infrequently and not normally updated.
- Frozen tier nodes hold time series data that is accessed rarely and never updated, kept in searchable snapshots.
When you index documents directly to a specific index, they remain on content tier nodes indefinitely.
When you index documents to a data stream, they initially reside on hot tier nodes. You can configure index lifecycle management (ILM) policies to automatically transition your time series data through the hot, warm, and cold tiers according to your performance, resiliency and data retention requirements.
A node’s data role is configured in elasticsearch.yml
.
For example, the highest-performance nodes in a cluster might be assigned to both the hot and content tiers:
node.roles: ["data_hot", "data_content"]
Content tier
editData stored in the content tier is generally a collection of items such as a product catalog or article archive. Unlike time series data, the value of the content remains relatively constant over time, so it doesn’t make sense to move it to a tier with different performance characteristics as it ages. Content data typically has long data retention requirements, and you want to be able to retrieve items quickly regardless of how old they are.
Content tier nodes are usually optimized for query performance—they prioritize processing power over IO throughput so they can process complex searches and aggregations and return results quickly. While they are also responsible for indexing, content data is generally not ingested at as high a rate as time series data such as logs and metrics. From a resiliency perspective the indices in this tier should be configured to use one or more replicas.
New indices are automatically allocated to the Content tier unless they are part of a data stream.
Hot tier
editThe hot tier is the Elasticsearch entry point for time series data and holds your most-recent, most-frequently-searched time series data. Nodes in the hot tier need to be fast for both reads and writes, which requires more hardware resources and faster storage (SSDs). For resiliency, indices in the hot tier should be configured to use one or more replicas.
New indices that are part of a data stream are automatically allocated to the hot tier.
Warm tier
editTime series data can move to the warm tier once it is being queried less frequently than the recently-indexed data in the hot tier. The warm tier typically holds data from recent weeks. Updates are still allowed, but likely infrequent. Nodes in the warm tier generally don’t need to be as fast as those in the hot tier. For resiliency, indices in the warm tier should be configured to use one or more replicas.
Cold tier
editOnce data is no longer being updated, it can move from the warm tier to the cold tier where it stays while being queried infrequently. The cold tier is still a responsive query tier, but data in the cold tier is not normally updated. As data transitions into the cold tier it can be compressed and shrunken. For resiliency, indices in the cold tier can rely on searchable snapshots, eliminating the need for replicas.
Frozen tier
editThis functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
Once data is no longer being queried, or being queried rarely, it may move from the cold tier to the frozen tier where it stays for the rest of its life.
The frozen tier uses searchable snapshots to store and load data from a snapshot repository. Instead of using a full local copy of your data, these searchable snapshots use smaller local caches containing only recently searched data. If a search requires data that is not in a cache, Elasticsearch fetches the data as needed from the snapshot repository. This decouples compute and storage, letting you run searches over very large data sets with minimal compute resources, which significantly reduces your storage and operating costs.
The frozen phase automatically converts data transitioning into the frozen tier into a shared-cache searchable snapshot.
Search is typically slower on the frozen tier than the cold tier, because Elasticsearch must sometimes fetch data from the snapshot repository.
Data tier index allocation
editWhen you create an index, by default Elasticsearch sets
index.routing.allocation.include._tier_preference
to data_content
to automatically allocate the index shards to the content tier.
When Elasticsearch creates an index as part of a data stream,
by default Elasticsearch sets
index.routing.allocation.include._tier_preference
to data_hot
to automatically allocate the index shards to the hot tier.
You can override the automatic tier-based allocation by specifying shard allocation filtering settings in the create index request or index template that matches the new index.
You can also explicitly set index.routing.allocation.include._tier_preference
to opt out of the default tier-based allocation.
If you set the tier preference to null
, Elasticsearch ignores the data tier roles during allocation.
Automatic data tier migration
editILM automatically transitions managed indices through the available data tiers using the migrate action. By default, this action is automatically injected in every phase. You can explicitly specify the migrate action to override the default behavior, or use the allocate action to manually specify allocation rules.