Elasticsearch as a Time Series Data Store
As the project manager of stagemonitor, an open source performance monitoring tool, I've recently been looking for a database to replace the cool-but-aging Graphite Time Series DataBase (TSDB) as the backend. TSDBs are specialised packages for storing (performance) metric data, like the response time of your app or the CPU utilisation of a server. Ultimately, we were looking for a datastore that is easy to install, scalable, supports a variety of functions, and has great support for visualizing the metrics.
We've previously worked with Elasticsearch, so we know it is easy to install, scalable, offers many aggregations, and has a great visualisation tool in Kibana. But we didn't know if Elasticsearch was suited for time series data, and we were not the only ones asking that question. In fact, CERN (you know, those folks who shoot protons in circles) did a performance comparison between Elasticsearch, InfluxDB, and OpenTSDB and declared Elasticsearch the winner.
The Decision Process
Elasticsearch is a fantastic tool for storing, searching, and analyzing structured and unstructured data — including free text, system logs, database records, and more. With the right tweaking, you also get a great platform to store your time series metrics from tools like collectd or statsd.
It also scales very easily as you add more metrics. Elasticsearch has built-in redundancy thanks to shard replicas and allows simple backups with Snapshot & Restore, which makes managing your cluster and your data a lot easier.
Elasticsearch is also highly API driven, and with integration tools like Logstash, it's simple to build data processing pipelines that can handle very large amounts of data with great efficiency. Once you add Kibana to the mix you have a platform that allows you to ingest and analyse multiple datasets and draw correlations from metric and other data side-by-side.
Another benefit that isn't immediately obvious: instead of storing metrics that have already been transformed into the final value that gets graphed, you store the raw values and then run the suite of powerful aggregations built into Elasticsearch over those values. This means that if you change your mind after a few months of monitoring a metric and you want to calculate or display the metric differently, it's as simple as changing the aggregation over the dataset, for both historic and current data. Or to put it another way: you have the ability to ask and answer questions that you didn't think about when the data was stored!
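To make this concrete, here is a minimal sketch of such a query. It assumes a local node on localhost:9200 and uses field and index names from the mapping and sample document shown later in this post; it buckets the raw values per minute and averages them, and swapping the avg aggregation for, say, a percentiles aggregation later works just as well on all historic data.
# Average response time per minute for one timer, computed at query time from the raw values
$ curl -XPOST 'http://localhost:9200/stagemonitor-metrics-*/_search' -d '
{
  "size": 0,
  "query": { "term": { "name": "timer_1" } },
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "interval": "1m" },
      "aggs": {
        "mean_response_time": { "avg": { "field": "mean" } }
      }
    }
  }
}'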
With all that in mind, the obvious question we wanted to answer is: What's the best method for setting up Elasticsearch as a time series database?
First Stop: Mappings
The most important part to get right from the start is your mapping. Defining your mapping ahead of time means that the analysis and storage of data in Elasticsearch is as optimal as possible.
Here's an example of how we do the mappings at stagemonitor. You can find the original over in our GitHub repo:
{ "template": "stagemonitor-metrics-*", "settings": { "index": { "refresh_interval": "5s" } }, "mappings": { "_default_": { "dynamic_templates": [ { "strings": { "match": "*", "match_mapping_type": "string", "mapping": { "type": "string", "doc_values": true, "index": "not_analyzed" } } } ], "_all": { "enabled": false }, "_source": { "enabled": false }, "properties": { "@timestamp": { "type": "date", "doc_values": true }, "count": { "type": "integer", "doc_values": true, "index": "no" }, "m1_rate": { "type": "float", "doc_values": true, "index": "no" }, "m5_rate": { "type": "float", "doc_values": true, "index": "no" }, "m15_rate": { "type": "float", "doc_values": true, "index": "no" }, "max": { "type": "integer", "doc_values": true, "index": "no" }, "mean": { "type": "integer", "doc_values": true, "index": "no" }, "mean_rate": { "type": "float", "doc_values": true, "index": "no" }, "median": { "type": "float", "doc_values": true, "index": "no" }, "min": { "type": "float", "doc_values": true, "index": "no" }, "p25": { "type": "float", "doc_values": true, "index": "no" }, "p75": { "type": "float", "doc_values": true, "index": "no" }, "p95": { "type": "float", "doc_values": true, "index": "no" }, "p98": { "type": "float", "doc_values": true, "index": "no" }, "p99": { "type": "float", "doc_values": true, "index": "no" }, "p999": { "type": "float", "doc_values": true, "index": "no" }, "std": { "type": "float", "doc_values": true, "index": "no" }, "value": { "type": "float", "doc_values": true, "index": "no" }, "value_boolean": { "type": "boolean", "doc_values": true, "index": "no" }, "value_string": { "type": "string", "doc_values": true, "index": "no" } } } } }
You can see in the template above that we have disabled _source and _all, as we are only ever going to build aggregations, so we save disk space because the stored documents will be smaller. The downside is that we won't be able to see the actual JSON documents or reindex to a new mapping or index structure (see the documentation on disabling source for more information), but for our metrics-based use case this isn't a major worry for us.
Just to reiterate: For most use cases you do not want to disable source!
We are also not analyzing string values, as we won't perform full text searches on the metric documents. In this case, we only want to filter by exact names or perform terms aggregations on fields like metricName, host, or application, so that we can narrow our metrics down to certain hosts or get a list of all hosts. It's also better to use doc_values as much as possible to reduce heap usage.
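As an illustration of what this enables, a terms aggregation over the not_analyzed host field returns the list of hosts that have reported metrics, without retrieving any documents (again assuming a local node; the field name matches the sample document further down):
# List all hosts that have sent metrics
$ curl -XPOST 'http://localhost:9200/stagemonitor-metrics-*/_search' -d '
{
  "size": 0,
  "aggs": {
    "hosts": { "terms": { "field": "host" } }
  }
}'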
There are two more quite aggressive optimisations which may not be suitable for all use cases. The first is to use "index": "no" for all metric values. This reduces the index size, but it also means we can't search the values, which is fine if we want to show all values in a graph rather than only a subset, like values between 2.7182 and 3.1415. The second is to use the smallest numeric type that fits the data (for us that was float) to optimise the index further. If the values in your case are out of the range of a float, you could use doubles.
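For example, if float precision is not sufficient for one of your metrics, the corresponding entry in the template above could simply be widened (the field name is just the one from our template):
"value": { "type": "double", "doc_values": true, "index": "no" }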
Next Up: Optimising for Long Term Storage
The next important step in optimising this data for long term storage is to force merge (previously known as optimize) the indices after all the data has been indexed into them. This involves merging the existing shards into just a few and removing any deleted documents in the same step. “Optimize”, as it was known, is a bit of a misleading term in Elasticsearch — the process does improve resource use, but may require a lot of CPU and disk resources as the system purges any deleted documents and then merges all the underlying Lucene segments. This is why we recommend force merging during off peak periods or running it on nodes with more CPU and disk resources.
The merging process does happen automatically in the background, but only while data is being written to the index. We want to explicitly call it once we are sure all the events have been sent to Elasticsearch and the index is no longer being modified by additions, updates, or deletions.
An optimize is usually best left until 24-48 hours after the newest index has been created (be it hourly, daily, weekly, etc) to allow any late events to reach Elasticsearch. After that period we can easily use Curator to handle the optimize call:
$ curator optimize --delay 2 --max_num_segments 1 indices --older-than 1 --time-unit days --timestring %Y.%m.%d --prefix stagemonitor-metrics-
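If you don't want to use Curator, the same thing can be done with a direct call to the optimize API. The index name below is just an example of a daily index; on Elasticsearch 2.1 and later the same operation is exposed as the _forcemerge endpoint instead:
# Merge the segments of a finished daily index down to a single segment
$ curl -XPOST 'http://localhost:9200/stagemonitor-metrics-2015.09.13/_optimize?max_num_segments=1'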
Another great benefit of running this optimise after all the data has been written is that we automatically get a synced flush, which assists with cluster recovery speed and node restarts.
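A synced flush can also be triggered explicitly on indices that are no longer being written to, for example (again assuming a local node):
# Ask Elasticsearch to write a sync id to the idle shards of the metrics indices
$ curl -XPOST 'http://localhost:9200/stagemonitor-metrics-*/_flush/synced'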
If you are using stagemonitor, the optimize process is triggered automatically every night, so you don't even need to use Curator in that case.
The Outcome
To test this, we sent a randomised set of just over 23 million data points from our platform to Elasticsearch, equal to roughly a week's worth. This is a sample of what the data looks like:
{ "@timestamp": 1442165810, "name": "timer_1", "application": "Metrics Store Benchmark", "host": "my_hostname", "instance": "Local", "count": 21, "mean": 714.86, "min": 248.00, "max": 979.00, "stddev": 216.63, "p50": 741.00, "p75": 925.00, "p95": 977.00, "p98": 979.00, "p99": 979.00, "p999": 979.00, "mean_rate": 2.03, "m1_rate": 2.18, "m5_rate": 2.20, "m15_rate": 2.20 }
After running a few indexing and optimising cycles we saw the following figures:
             | Initial Size | Post Optimize
Sample Run 1 | 2.2G         | 508.6M
Sample Run 2 |              | 514.1M
Sample Run 3 |              | 510.9M
Sample Run 4 |              | 510.9M
Sample Run 5 |              | 510.9M
You can see how important the optimize process is: even though Elasticsearch does this work in the background, explicitly running it is well worth it for long term storage alone.
Now, with all this data in Elasticsearch, what can we discover? Here are a few examples of what we have built with the system.
If you'd like to replicate the testing that we did here, you can find the code in the stagemonitor GitHub repo.
The Future
With Elasticsearch 2.0 there are a lot of features that make it even more flexible and suitable for time series data.
Pipeline aggregations open up a whole new level of analyzing and transforming data points. For example, it is possible to smooth graphs with moving averages, use a Holt-Winters forecast to check whether the data matches historic patterns, or even calculate derivatives.
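As an illustration (not taken from stagemonitor itself), a moving_avg pipeline aggregation can smooth the per-minute averages from the earlier query; setting its model to holt_winters would let it follow seasonal patterns instead of a simple average. The host, index, and field names are the same assumptions as before:
# Smooth the per-minute mean response time with a moving average over the last 10 buckets
$ curl -XPOST 'http://localhost:9200/stagemonitor-metrics-*/_search' -d '
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "interval": "1m" },
      "aggs": {
        "mean_response_time": { "avg": { "field": "mean" } },
        "smoothed": { "moving_avg": { "buckets_path": "mean_response_time", "window": 10 } }
      }
    }
  }
}'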
And finally, in the mapping described above we had to manually enable doc_values to improve heap efficiency. In 2.0, doc_values are enabled by default for any not_analyzed field, which means less work for you!
About the Author — Felix Barnsteiner
Felix Barnsteiner is the developer of the open source performance monitoring project stagemonitor. During the day he works on eCommerce solutions at iSYS Software GmbH in Munich, Germany.