Implementing a Statistical Anomaly Detector in Elasticsearch - Part 3

Welcome to the third and final installment of this series on building a statistical anomaly detector in Elasticsearch.  As a quick recap, here is what we’ve built so far: in Part 1, we used pipeline aggregations to compute an hourly “surprise” value for each metric (how far the latest hour deviates from its recent moving average), and in Part 2, we used Timelion to build a threshold over those surprise values and visually spot anomalies.


Today, we’ll take what we built in Parts 1 and 2 and automate it completely using Watcher, Elastic’s real-time alerting and notification plugin for Elasticsearch.


With Watcher’s ability to use Mustache templating and Groovy scripting, it is a remarkably powerful alerting engine.  We can encode the entire Atlas system in just two watches.  The first watch will generate all of the surprise data (just like Part 1), while the second watch will create the threshold and check for anomalies (like Timelion in Part 2).


Let’s get started!

Data Collection Watch

The first watch’s job is to collect the 90th percentile surprise value for each metric on an hourly basis, emulating the data collection process we built in Part 1.  This means we can leverage most of the hard work from that section (e.g. the pipeline aggregation).


First, here is the entire watch (then we’ll break it down piece-by-piece):


PUT _watcher/watch/atlas
{
   "trigger":{
      "schedule":{
         "hourly" : { "minute" : 0 }
      }
   },
   "input":{
      "search":{
         "request":{
            "indices":"data",
            "types": "data",
            "body":{
               "query":{
                  "filtered":{
                     "filter":{
                        "range":{
                           "hour":{
                              "gte":"now-24h"
                           }
                        }
                     }
                  }
               },
               "size":0,
               "aggs":{
                  "metrics":{
                     "terms":{
                        "field":"metric"
                     },
                     "aggs":{
                        "queries":{
                           "terms":{
                              "field":"query"
                           },
                           "aggs":{
                              "series":{
                                 "date_histogram":{
                                    "field":"hour",
                                    "interval":"hour"
                                 },
                                 "aggs":{
                                    "avg":{
                                       "avg":{
                                          "field":"value"
                                       }
                                    },
                                    "movavg":{
                                       "moving_avg":{
                                          "buckets_path":"avg",
                                          "window":24,
                                          "model":"simple"
                                       }
                                    },
                                    "surprise":{
                                       "bucket_script":{
                                          "buckets_path":{
                                             "avg":"avg",
                                             "movavg":"movavg"
                                          },
                                          "script":"(avg - movavg).abs()"
                                       }
                                    }
                                 }
                              },
                              "largest_surprise":{
                                 "max_bucket":{
                                    "buckets_path":"series.surprise"
                                 }
                              }
                           }
                        },
                        "ninetieth_surprise":{
                           "percentiles_bucket":{
                              "buckets_path":"queries>largest_surprise",
                              "percents":[
                                 90.0
                              ]
                           }
                        }
                     }
                  }
               }
            }
         },
         "extract":[
           "aggregations.metrics.buckets.ninetieth_surprise",
           "aggregations.metrics.buckets.key"
        ]
      }
   },
   "actions":{
      "index_payload":{
         "transform":{
            "script": {
                "file": "hourly"
            }
         },
         "index" : {
              "index" : "atlas",
              "doc_type" : "data"
            }
      }
   }
}


It’s long, but don’t panic!  A lot of it is repeated code from Part 1.  Let’s start looking at the individual components:

PUT _watcher/watch/atlas
{
   "trigger":{
      "schedule":{
         "hourly" : { "minute" : 0 }
      }
   },


The first thing in our request is the HTTP command.  Watches are stored inside your cluster, so we execute a PUT to the _watcher endpoint to add a new watch called “atlas”.  Next, we schedule the watch to run with a “trigger”.  Triggers allow watches to run on a schedule, much like a cron job.  We are using an hourly trigger, which fires every hour on the hour.
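
As an aside, hourly is not the only schedule type Watcher supports.  Here is a rough sketch of two alternatives (not used by Atlas, shown purely for illustration): an interval trigger that fires every fixed period, and a cron trigger, where the expression below should be equivalent to our hourly-on-the-hour schedule:


   "trigger":{
      "schedule":{
         "interval" : "30m"
      }
   }

   "trigger":{
      "schedule":{
         "cron" : "0 0 * * * ?"
      }
   }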


After our trigger, we define the "input" to the watch:


   "input":{
      "search":{
         "request":{
            "indices":"data",
            "types": "data",
            "body":{...},
         "extract":[
           "aggregations.metrics.buckets.ninetieth_surprise",
           "aggregations.metrics.buckets.key"
        ]
      }
   },


Inputs provide the data that a watch uses to make decisions.  There are a variety of inputs available,  but we’ll use a search input.  This input executes an arbitrary Elasticsearch query and allows a watch to use the response for later processing.  The “request” parameter defines the details about the request: the indices/types to query and the request body (which is the pipeline aggregation we built in Part 1).  Combined with the trigger, our watch will execute the large pipeline agg against the raw data every hour.


The “extract” parameter lets us pull out just the details we are interested in, to simplify further processing in the watch.  It is conceptually very similar to filter_path: merely a filtering mechanism to reduce response verbosity.  Here we are using it to extract the top 90th percentile surprise for each of the five metrics, along with the metric keys.
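
To make that concrete, the extracted payload handed to the rest of the watch looks roughly like this (the metric name and number are invented for illustration):


{
   "aggregations":{
      "metrics":{
         "buckets":[
            {
               "key":"requests",
               "ninetieth_surprise":{
                  "values":{ "90.0": 82.3 }
               }
            },
            ...
         ]
      }
   }
}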


Finally we define an "action":


   "actions":{
      "index_payload":{
         "transform":{
            "script": {
                "file": "hourly"
            }
         },
         "index" : {
              "index" : "atlas",
              "doc_type" : "data"
            }
      }
   }
}

The action is executed after the query has run, and defines the "output" of a watch.  Actions can send emails, send messages to Slack, post to custom webhooks, etc.  For our purposes, we actually want to put data back inside Elasticsearch: we need to index the results of the pipeline aggregation so we can alert on it.  To do that, we set up an index_payload action which will index documents back into Elasticsearch for us.


But before we can index anything, we need to convert the JSON aggregation response into a set of indexable documents.  That is done via the transform script hourly.groovy which resides on our node (in the config/scripts/ directory).  It looks like this:


// Iterate over each metric bucket returned by the terms aggregation
def docs = [];
for(item in ctx.payload.aggregations.metrics.buckets) {
  // One document per metric: its name, its 90th percentile surprise,
  // and the time this watch executed
  def doc = [
    metric : item.key,
    value : item.ninetieth_surprise.values["90.0"],
    execution_time: ctx.execution_time
  ];
  docs << doc;
}
// Returning the documents under "_doc" tells Watcher to index each one separately
return [ _doc : docs ];

Its function is very simple: iterate over the 90th percentile buckets and, for each one, build a document holding the metric key, the surprise value and the execution time.  Each document is appended to an array, which is returned once we are done iterating over the buckets.


The returned array sits under the _doc key, which Watcher treats as a set of documents to bulk index into the “atlas” index under the “data” type.  Once this watch is added to the cluster, Elasticsearch will begin collecting hourly surprise metrics just like we did in the simulator.  Perfect!  Let’s write the watch that finds anomalies now.
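
For reference, each document that ends up in the atlas index has a shape along these lines (metric name and value invented for illustration):


{
   "metric":"requests",
   "value":82.3,
   "execution_time":"2016-01-01T10:00:00.000Z"
}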


Anomaly Detection Watch

The goal of this watch is to replicate what we did in Part 2 with Timelion.  Namely, it needs to construct a threshold that is three standard deviations above the moving average of the 90th percentile surprise...per metric.  Then it needs to raise some kind of alert if that threshold is broken.


This watch follows a similar layout as the last one, but has a bit more custom logic.  The whole watch looks like this:


PUT _watcher/watch/atlas_analytics
{
   "trigger": {
      "schedule": {
         "hourly" : { "minute" : 5 }
      }
   },
   "input": {
      "search": {
         "request": {
            "indices": "atlas",
            "types": "data",
            "body": {
               "query": {
                  "filtered": {
                     "filter": {
                        "range": {
                           "execution_time": {
                              "gte": "now-6h"
                           }
                        }
                     }
                  }
               },
               "size": 0,
               "aggs": {
                  "metrics": {
                     "terms": {
                        "field": "metric"
                     },
                     "aggs": {
                        "series": {
                           "date_histogram": {
                              "field": "execution_time",
                              "interval": "hour"
                           },
                           "aggs": {
                              "avg": {
                                 "avg": {
                                    "field": "value"
                                 }
                              }
                           }
                        },
                        "series_stats": {
                           "extended_stats": {
                              "field": "value",
                              "sigma": 3
                           }
                        }
                     }
                  }
               }
            }
         },
         "extract": [
            "aggregations.metrics.buckets"
         ]
      }
   },
   "condition": {
      "script": {
         "file": "analytics_condition"
      }
   },
   "transform": {
      "script": {
         "file": "analytics_transform"
      }
   },
   "actions": {
      "index_payload": {
         "logging": {
            "text": "{{ctx.alerts}}"
         }
      },
      "email_alert" : {
        "email": {
          "to": "'John Doe <john.doe@example.com>'",
          "subject": "Atlas Alerts Triggered!",
          "body": "Metrics that appear anomalous: {{ctx.alerts}}"
        }
      }
   }
}

We'll walk through it step-by-step again.  Similar to the first watch, we PUT the watch into the cluster with a specific name (“atlas_analytics”) and set up an hourly schedule for it to run.  However, the schedule is offset by five minutes this time to give the first watch time to complete.

We also use a search input again:

   "input": {
      "search": {
         "request": {
            "indices": "atlas",
            "types": "data",
            "body": {
               "query": {
                  "filtered": {
                     "filter": {
                        "range": {
                           "execution_time": {
                              "gte": "now-6h"
                           }
                        }
                     }
                  }
               },
               "size": 0,
               "aggs": {
                  "metrics": {
                     "terms": {
                        "field": "metric"
                     },
                     "aggs": {
                        "series": {
                           "date_histogram": {
                              "field": "execution_time",
                              "interval": "hour"
                           },
                           "aggs": {
                              "avg": {
                                 "avg": {
                                    "field": "value"
                                 }
                              }
                           }
                        },
                        "series_stats": {
                           "extended_stats": {
                              "field": "value",
                              "sigma": 3
                           }
                        }
                     }
                  }
               }
            }
         },
         "extract": [
            "aggregations.metrics.buckets"
         ]
      }
   },

This search is a little different.  First, it is querying /atlas/data instead of /data/data; this watch aggregates the results of the previous watch rather than the raw data.  The query also filters to just the last six hours, which scopes the threshold calculation to a recent window of surprise values.


An aggregation builds a date_histogram per metric (i.e. a time series per metric), and inside each hourly bucket we calculate the average surprise.  Alongside the series, an extended_stats aggregation computes the mean and standard deviation over the whole six-hour window (making sure to ask for three standard deviations via the sigma parameter).  Finally, we extract just the buckets because we don’t care about the rest of the response.
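
The convenient part is that extended_stats hands us the threshold directly: because sigma is set to 3, std_deviation_bounds.upper is the average plus three standard deviations.  Each metric bucket therefore contains something like this (abridged, with invented numbers):


"series_stats":{
   "avg": 45.1,
   "std_deviation": 12.4,
   "std_deviation_bounds":{
      "upper": 82.3,
      "lower": 7.9
   }
}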


You’ll notice that in Part 2 we used a moving average and standard deviation to calculate this data, while here it is a plain average / stddev.  Why is that?  Because this watch executes every hour, the window of time will naturally slide across the data. Unlike the Timelion implementation -- which had to display all points of time in one graph -- we are only concerned with generating the data points for this hour, so a simple avg works fine.


So at this point, our watch has all the required information to flag an anomaly...but we need to run some custom logic to tie it together.  That’s what happens next, in the condition clause:


   "condition": {
      "script": {
         "file": "analytics_condition"
      }
   },


A condition is a gatekeeper to the action: if the condition evaluates to true, the action is run.  Our condition uses another Groovy script, analytics_condition.groovy:


def status = false;
for(item in ctx.payload.aggregations.metrics.buckets) {
  // Upper bound (avg + 3 * std_deviation) computed by the extended_stats agg
  def std_upper = Double.valueOf(item.series_stats.std_deviation_bounds.upper);
  // Average surprise of the most recent hourly bucket
  def avg = Double.valueOf(item.series.buckets.last().avg.value);
  if (Double.isNaN(std_upper) || Double.isNaN(avg)) {
    continue;
  }
  // If any metric breaches its threshold, the watch should fire
  if (avg > std_upper) {
    status = true;
    break;
  }
}
return status;


The script is really very simple: extract the standard deviation upper bound (which the aggregation provides natively) and the latest average, then check whether the average is greater than the upper bound.  If the average is indeed greater, set a flag and return true.


At this point, if the condition returns false, the watch ends: nothing is anomalous.  But if it returns true, we continue on to the transform clause:


   "transform": {
      "script": {
         "file": "analytics_transform"
      }
   },


Transformations can be used to modify, enrich and manipulate data.  We’ll use the transformation to tidy up the data so that a list of alerts can be embedded in an email easily.  Again, we use a Groovy script to do the transformation, this one called analytics_transform.groovy:


def alerts = [];
for(item in ctx.payload.aggregations.metrics.buckets) {
  def std_upper = Double.valueOf(item.series_stats.std_deviation_bounds.upper);
  def avg = Double.valueOf(item.series.buckets.last().avg.value);
  if (Double.isNaN(std_upper) || Double.isNaN(avg)) {
    continue;
  }
  // Collect the key of every metric whose latest average breaches its threshold
  if (avg > std_upper) {
    alerts << item.key;
  }
}
return [alerts: alerts];


Look familiar?  This is basically the same as the analytics_condition.groovy script used in the condition clause.  The only difference is that the key of each anomalous metric is appended to an array, instead of just flipping a flag.  The array is returned as the new payload, which we can use in our final email action:


 "actions": {
      "index_payload": {
         "logging": {
            "text": "{{ctx.alerts}}"
         }
      },
      "email_alert" : {
        "email": {
          "to": "'John Doe <john.doe@example.com>'",
          "subject": "Atlas Alerts Triggered!",
          "body": "Metrics that appear anomalous: {{ctx.alerts}}"
        }
      }
   }
}


In the last part of the watch, we perform two actions.  First, we log the anomalies (for debugging purposes).  We also define an email_alert, which fires off an email.  The body of the email can use Mustache for templating, which is how we embed the list of alerts (via {{ctx.payload.alerts}}, the array we built in the transformation step).
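
As a hypothetical example, if two metrics breached their thresholds, the transformed payload handed to the actions would simply be the following, and Mustache would substitute that array into both the log line and the email body:


{
   "alerts": [ "requests", "latency" ]
}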

Conclusion

And that’s it!  The watches are long, but relatively straightforward when you work through them step-by-step.  All the difficult work was done in Parts 1 and 2...moving the logic into Watcher is mostly trivial.


Once these watches are enabled, the cluster will automatically start monitoring and alerting on an hourly basis.  It is very tunable because watches can be modified at any time via API calls. You could make the interval shorter or longer, extend the amount of data in each aggregation pass, modify any of the aggregation settings, change the types of moving averages in the pipeline agg, introduce entirely new metrics, etc.  It is a very easy system to tweak even once it is live and in production.
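
For example, using the Watcher plugin’s management API (a quick sketch, not an exhaustive reference), you can retrieve a watch to inspect it, force an execution to test a change, or delete it entirely:


GET _watcher/watch/atlas

POST _watcher/watch/atlas/_execute

DELETE _watcher/watch/atlas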


I hope you’ve enjoyed this three-part series.  It was a very fun project to work on, and really helped me understand the power that pipeline aggregations, Timelion and Watcher bring to the table (especially when combined).  Until next time!