WARNING: This documentation covers Elasticsearch 2.x. The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Time-Based Data
One of the most common use cases for Elasticsearch is logging. It is so common, in fact, that Elasticsearch provides an integrated logging platform called the ELK stack (Elasticsearch, Logstash, and Kibana) to make the process easy.
Logstash collects, parses, and enriches logs before indexing them into Elasticsearch. Elasticsearch acts as a centralized logging server, and Kibana is a graphical frontend that makes it easy to query and visualize what is happening across your network in near real time.
Most traditional use cases for search engines involve a relatively static collection of documents that grows slowly. Searches look for the most relevant documents, regardless of when they were created.
Logging—and other time-based data streams such as social-network activity—are very different in nature. The number of documents in the index grows rapidly, often accelerating with time. Documents are almost never updated, and searches mostly target the most recent documents. As documents age, they lose value.
We need to adapt our index design to function with the flow of time-based data.
Index per Time Frame
If we were to have one big index for documents of this type, we would soon run
out of space. Logging events just keep on coming, without pause or
interruption. We could delete the old events with a scroll
query and bulk delete, but this approach is very inefficient. When you delete a
document, it is only marked as deleted (see Deletes and Updates). It won’t
be physically deleted until the segment containing it is merged away.
Instead, use an index per time frame. You could start out with an index per
year (logs_2014) or per month (logs_2014-10). Perhaps, when your website gets
really busy, you need to switch to an index per day (logs_2014-10-24). Purging
old data is easy: just delete old indices.
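The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the helper names (index_name, expired_daily_indices), the PATTERNS table, and the default logs prefix are assumptions made for this example, not part of Elasticsearch or any client library. The sketch only computes index names; actually deleting the expired indices is left to whatever client you use.

```python
from datetime import date, datetime

# Hypothetical mapping from time frame to a strftime pattern that
# reproduces the names used in the text (logs_2014, logs_2014-10, ...).
PATTERNS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
}


def index_name(when: date, granularity: str, prefix: str = "logs") -> str:
    """Return the name of the index covering the time frame that contains `when`."""
    return f"{prefix}_{when.strftime(PATTERNS[granularity])}"


def expired_daily_indices(names, cutoff: date, prefix: str = "logs"):
    """Return the daily indices whose date is strictly before `cutoff`."""
    expired = []
    for name in names:
        try:
            day = datetime.strptime(name, f"{prefix}_%Y-%m-%d").date()
        except ValueError:
            continue  # not a daily index for this prefix, so leave it alone
        if day < cutoff:
            expired.append(name)
    return expired
```

For example, index_name(date(2014, 10, 24), "day") yields logs_2014-10-24, and passing a list of existing index names to expired_daily_indices returns only the daily indices older than the cutoff, which are the candidates for deletion.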
This approach has the advantage of allowing you to scale as and when you need to. You don’t have to make any difficult decisions up front. Every day is a new opportunity to change your indexing time frames to suit the current demand. Apply the same logic to how big you make each index. Perhaps all you need is one primary shard per week initially. Later, maybe you need five primary shards per day. It doesn’t matter—you can adjust to new circumstances at any time.
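Because the number of primary shards is fixed when an index is created, each new time-frame index is a chance to pick a different value. The following is a minimal sketch that only builds the request for creating such an index; the helper name create_index_request is hypothetical, and sending the request is left to your HTTP client of choice.

```python
import json


def create_index_request(index: str, primary_shards: int):
    """Return (method, path, body) for creating `index` with its own shard count."""
    # number_of_shards is an index-level setting supplied at creation time.
    body = {"settings": {"number_of_shards": primary_shards}}
    return "PUT", f"/{index}", json.dumps(body)


# A busy day gets five primary shards; earlier, quieter indices
# could have been created with a single shard.
method, path, body = create_index_request("logs_2014-10-24", 5)
print(method, path, body)
```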
Aliases can help make switching indices more transparent. For indexing,
you can point logs_current to the index currently accepting new log events,
and for searching, update last_3_months to point to all indices for the
previous three months: