Aggregating data for faster performance

When you aggregate data, Elasticsearch automatically distributes the calculations across your cluster. You can then feed this aggregated data into the machine learning features instead of raw results, which reduces the volume of data that must be analyzed.
Requirements

There are a number of requirements for using aggregations in datafeeds.
Aggregations

- Your aggregation must include a date_histogram aggregation or a top level composite aggregation, which in turn must contain a max aggregation on the time field. This ensures that the aggregated data is a time series and the timestamp of each bucket is the time of the last record in the bucket.
- The time_zone parameter in the date histogram aggregation must be set to UTC, which is the default value.
- The name of the aggregation and the name of the field that it operates on must match. For example, if you use a max aggregation on a time field called responsetime, the name of the aggregation must also be responsetime, as shown in the fragment after this list.
- For composite aggregation support, there must be exactly one date_histogram value source. That value source must not be sorted in descending order. Additional composite aggregation value sources are allowed, such as terms.
- The size parameter of the non-composite aggregations must match the cardinality of your data. A greater value of the size parameter increases the memory requirement of the aggregation.
- If you set the summary_count_field_name property to a non-null value, the anomaly detection job expects to receive aggregated input. The property must be set to the name of the field that contains the count of raw data points that have been aggregated. It applies to all detectors in the job.
- The influencers or the partition fields must be included in the aggregation of your datafeed; otherwise they are not included in the job analysis. For more information on influencers, refer to Influencers.
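A minimal fragment that satisfies the naming requirement might look like this (illustrative only, not a complete datafeed configuration):

"responsetime": {
  "max": {
    "field": "responsetime"
  }
}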
Intervals

- The bucket span of your anomaly detection job must be divisible by the value of the calendar_interval or fixed_interval in your aggregation (with no remainder).
- If you specify a frequency for your datafeed, it must be divisible by the calendar_interval or the fixed_interval.
- Anomaly detection jobs cannot use date_histogram or composite aggregations with an interval measured in months because the length of the month is not fixed; they can use weeks or smaller units.
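For example, in the following sketch (the job name, index name, and field names are hypothetical), the 30m bucket span and the 10m datafeed frequency are both divisible by the 5m fixed_interval of the date histogram:

PUT _ml/anomaly_detectors/example-interval-job
{
  "analysis_config": {
    "bucket_span": "30m",
    "detectors": [{ "function": "count" }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": { "time_field": "time" },
  "datafeed_config": {
    "indices": ["my-index"],
    "frequency": "10m",
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "time",
          "fixed_interval": "5m",
          "time_zone": "UTC"
        },
        "aggregations": {
          "time": { "max": { "field": "time" } }
        }
      }
    }
  }
}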
Limitations

- If your datafeed uses aggregations with nested terms aggs and model plot is not enabled for the anomaly detection job, neither the Single Metric Viewer nor the Anomaly Explorer can plot and display an anomaly chart. In these cases, an explanatory message is shown instead of the chart.
- Your datafeed can contain multiple aggregations, but only the ones with names that match values in the job configuration are fed to the job.
Recommendations

- When your detectors use metric or sum analytical functions, set the date_histogram or composite aggregation interval to a tenth of the bucket span. This creates finer, more granular time buckets, which are ideal for this type of analysis.
- When your detectors use count or rare functions, set the interval to the same value as the bucket span.
- If you have multiple influencers or partition fields, or if your field cardinality is more than 1000, use composite aggregations.

To determine the cardinality of your data, you can run searches such as:

GET .../_search
{
  "aggs": {
    "service_cardinality": {
      "cardinality": {
        "field": "service"
      }
    }
  }
}
Including aggregations in anomaly detection jobs

When you create or update an anomaly detection job, you can include aggregated fields in the analysis configuration. In the datafeed configuration object, you can define the aggregations.
PUT _ml/anomaly_detectors/kibana-sample-data-flights
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field": "time"
  },
  "datafeed_config": {
    "indices": ["kibana-sample-data-flights"],
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "time",
          "fixed_interval": "360s",
          "time_zone": "UTC"
        },
        "aggregations": {
          "time": {
            "max": {"field": "time"}
          },
          "airline": {
            "terms": {
              "field": "airline",
              "size": 100
            },
            "aggregations": {
              "responsetime": {
                "avg": {
                  "field": "responsetime"
                }
              }
            }
          }
        }
      }
    }
  }
}
In this example:

- The summary_count_field_name property is set to doc_count, the count field that the date histogram produces for each bucket.
- The aggregations have names that match the fields that they operate on: the max aggregation is named time and operates on the time field, the terms aggregation is named airline and operates on the airline field, and the avg aggregation is named responsetime and operates on the responsetime field.
- The fixed_interval of 360s is a tenth of the 60m bucket span, following the recommendation for metric functions.
Use the following format to define a date_histogram aggregation to bucket by time in your datafeed (square brackets denote optional parts):

"aggregations": {
  ["bucketing_aggregation": {
    "bucket_agg": {
      ...
    },
    "aggregations": {]
      "date_histogram_aggregation": {
        "date_histogram": {
          "field": "time"
        },
        "aggregations": {
          "timestamp": {
            "max": {
              "field": "time"
            }
          }
          [,"<first_term>": {
            "terms": {
              ...
            }
            [,"aggregations": {
              [<sub_aggregation>]+
            }]
          }]
        }
      }
    }
  }
}
Composite aggregations

Composite aggregations are optimized for queries that are either match_all or range filters. Use composite aggregations in your datafeeds for these cases. Other types of queries may cause the composite aggregation to be inefficient.
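For instance, a datafeed query that works well with a composite aggregation restricts the input with a range filter, as in this fragment (the time field name is an assumption):

"query": {
  "range": {
    "time": {
      "gte": "now-1d"
    }
  }
}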
The following is an example of a job with a datafeed that uses a composite
aggregation to bucket the metrics based on time and terms:
PUT _ml/anomaly_detectors/kibana-sample-data-flights-composite
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field": "time"
  },
  "datafeed_config": {
    "indices": ["kibana-sample-data-flights"],
    "aggregations": {
      "buckets": {
        "composite": {
          "size": 1000,
          "sources": [
            {
              "time_bucket": {
                "date_histogram": {
                  "field": "time",
                  "fixed_interval": "360s",
                  "time_zone": "UTC"
                }
              }
            },
            {
              "airline": {
                "terms": {
                  "field": "airline"
                }
              }
            }
          ]
        },
        "aggregations": {
          "time": {
            "max": {
              "field": "time"
            }
          },
          "responsetime": {
            "avg": {
              "field": "responsetime"
            }
          }
        }
      }
    }
  }
}
In this example:

- The size sets the number of composite buckets to process at a time when aggregating the data; a larger size increases the memory requirement of the aggregation.
- The required date_histogram value source, time_bucket, buckets the data by time and is not sorted in descending order.
- Instead of using a regular terms aggregation, the airline value source of the composite aggregation provides the airline buckets.
- The required max aggregation is named time, matching the time field that it operates on.
- The avg aggregation is named responsetime, matching the responsetime field that it operates on.
Use the following format to define a composite aggregation in your datafeed:

"aggregations": {
  "composite_agg": {
    "sources": [
      {
        "date_histogram_agg": {
          "field": "time",
          ...settings...
        }
      },
      ...other valid sources...
    ],
    ...composite agg settings...,
    "aggregations": {
      "timestamp": {
        "max": {
          "field": "time"
        }
      },
      ...other aggregations...
      [,"aggregations": {
        [<sub_aggregation>]+
      }]
    }
  }
}
Nested aggregations

You can also use complex nested aggregations in datafeeds. The next example uses the derivative pipeline aggregation to find the first order derivative of the counter system.network.out.bytes for each value of the field beat.name.

Note that derivative or other pipeline aggregations may not work within composite aggregations. See composite aggregations and pipeline aggregations.
"aggregations": { "beat.name": { "terms": { "field": "beat.name" }, "aggregations": { "buckets": { "date_histogram": { "field": "@timestamp", "fixed_interval": "5m" }, "aggregations": { "@timestamp": { "max": { "field": "@timestamp" } }, "bytes_out_average": { "avg": { "field": "system.network.out.bytes" } }, "bytes_out_derivative": { "derivative": { "buckets_path": "bytes_out_average" } } } } } } }
Single bucket aggregations

You can also use single bucket aggregations in datafeeds. The following example shows two filter aggregations, each counting the number of entries for the error field on a different server.
{ "job_id":"servers-unique-errors", "indices": ["logs-*"], "aggregations": { "buckets": { "date_histogram": { "field": "time", "interval": "360s", "time_zone": "UTC" }, "aggregations": { "time": { "max": {"field": "time"} } "server1": { "filter": {"term": {"source": "server-name-1"}}, "aggregations": { "server1_error_count": { "value_count": { "field": "error" } } } }, "server2": { "filter": {"term": {"source": "server-name-2"}}, "aggregations": { "server2_error_count": { "value_count": { "field": "error" } } } } } } } }
Using aggregate_metric_double field type in datafeeds

It is not currently possible to use aggregate_metric_double type fields in datafeeds without aggregations.

You can use fields with the aggregate_metric_double field type in a datafeed with aggregations. You must retrieve the value_count of the aggregate_metric_double field in an aggregation and then use it as the summary_count_field_name to provide the correct count that represents the aggregation value.
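For reference, a field of this type might be mapped as follows (a hypothetical mapping; the index and field names are illustrative):

PUT my_index
{
  "mappings": {
    "properties": {
      "presum": {
        "type": "aggregate_metric_double",
        "metrics": [ "min", "max", "sum", "value_count" ],
        "default_metric": "max"
      }
    }
  }
}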
In the following example, presum is an aggregate_metric_double type field that has all the possible metrics: [ min, max, sum, value_count ]. To use an avg aggregation on this field, you need to perform a value_count aggregation on presum and then set the field that contains the aggregated values, my_count, as the summary_count_field_name:
{ "analysis_config": { "bucket_span": "1h", "detectors": [ { "function": "avg", "field_name": "my_avg" } ], "summary_count_field_name": "my_count" }, "data_description": { "time_field": "timestamp" }, "datafeed_config": { "indices": [ "my_index" ], "datafeed_id": "datafeed-id", "aggregations": { "buckets": { "date_histogram": { "field": "time", "fixed_interval": "360s", "time_zone": "UTC" }, "aggregations": { "timestamp": { "max": {"field": "timestamp"} }, "my_avg": { "avg": { "field": "presum" } }, "my_count": { "value_count": { "field": "presum" } } } } } } }
In this example:

- The field presum has the aggregate_metric_double type.
- The avg aggregation is named my_avg, matching the field_name of the detector, and operates on the presum field.
- The value_count aggregation is named my_count, matching the summary_count_field_name, and provides the count that represents each aggregated bucket.
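To check what the datafeed delivers to the job, you can preview it (assuming the datafeed-id defined above):

GET _ml/datafeeds/datafeed-id/_preview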