A case for self-monitoring systems

Theory

Network infrastructure has evolved in recent years.  Systems reside in a combination of the corporate office, datacentre co-locations, and the cloud.  Trends in automation and configuration management have simplified massive scaling.

A single security domain interconnected by private lines is no longer a luxury that engineers can expect.  Ad hoc measures, such as virtual private networks and least-privilege firewall permissions, must be implemented on a case-by-case basis, requiring extensive architecture and maintenance.

System logging and monitoring are essential components of stable, intelligent business operations, but too often they complicate security, or worse, compromise it.  A smart engineer will usually choose the simplest approach, as it is often the most stable and secure.

In most of today’s networks, logging and monitoring are handled by separate systems: one pushes data back to a central location, while the other polls from a central location.  The polling half of this dual architecture has several problems:

  • Polling requires either individual firewall rules or unrestricted access from an operations network.  These rules must be manually maintained in each target datacentre or cloud domain.
  • Each polled system and each monitored service must be configured and maintained separately from existing automation or configuration management systems.

In theory, it is now possible to simplify this traditional architecture, streamlining logging, monitoring, performance statistics, and business reporting into a single platform: Elasticsearch.

Proof of Concept

Beats

For a solution to be viable, it must ultimately address any need that arises, on any platform.  Beats are lightweight, open source data shippers that run on most common operating systems.  The community has already begun contributing Beats to monitor their favourite services.

A wealth of Nagios checks already exists (currently over 5,000 in the official community directory), and they can be integrated with Beats via Nagioscheckbeat.  Here is an example of how we would configure nagioscheckbeat to check the Redis service every 10 seconds:

period: "10s"                                   # how often to run the check
name: "redis"                                   # a name for this check
cmd: "/usr/lib/nagios/plugins/check_redis.pl"   # path to any standard Nagios plugin
args: "-H 127.0.0.1 -R -m -T -f -A"             # arguments passed to the plugin

Elasticsearch

Beats must have an output, such as Logstash or an Elasticsearch cluster.  Whether self-hosted or consumed as a cloud service, the organization’s cluster must be secure and reachable by every end system.
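
As a rough sketch, and using placeholder hostnames, credentials, and certificate paths rather than values from this proof of concept, a Beat’s output section pointing at a secured cluster might look something like this (exact keys vary by Beat and version):

output:
  elasticsearch:
    # Placeholder endpoint and credentials; substitute your own cluster details
    hosts: ["https://monitoring-cluster.example.com:9200"]
    username: "beats_writer"
    password: "changeme"
    tls:
      # Trust the CA that signed the cluster certificate so check results travel encrypted
      certificate_authorities: ["/etc/pki/tls/certs/monitoring-ca.crt"]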

Kibana

Elasticsearch and Kibana serve nicely for reporting performance metrics:

[Screenshot: Kibana dashboard of performance metrics]

Apache Workers fit perfectly into a percentage visualization:

[Screenshot: Apache worker percentage visualization]

Kibana’s new Timelion plugin allows for more detailed time series analysis.  For example, we could overlay arbitrary metrics in a single visualization for custom correlations during a root cause analysis (RCA).

[Screenshot: Timelion visualization overlaying multiple metrics]
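
Timelion expressions are plain text, so an overlay like the one above can be defined directly.  As a rough illustration, and assuming hypothetical metric field names in the nagioscheckbeat indices, two series could be combined like this:

.es(index=nagioscheckbeat*, metric=avg:load1).label("Load (1m avg)"),
  .es(index=nagioscheckbeat*, metric=avg:apache_busy_workers).label("Busy Apache workers")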

Alerting on Crossed Thresholds or Unresponsive Services

Nagioscheckbeat does two things: it publishes performance metrics for graphing, and it reports the status of each check against the warning and critical thresholds defined in its configuration.  Those statuses are published separately from the metrics, so we can watch the results directly.  Elastic’s Watcher plugin can be configured to alert on them with little effort, and even supports PagerDuty notifications.
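
For example, with a standard Nagios plugin the thresholds travel in the check’s arguments.  A hypothetical disk check, with an illustrative plugin path and thresholds, might look like this:

period: "1m"
name: "disk"
cmd: "/usr/lib/nagios/plugins/check_disk"
args: "-w 20% -c 10% -p /"    # warn below 20% free, critical below 10% free on the root filesystem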

This Watcher configuration will send an email when any nagioscheck has reported a CRITICAL status in the last 30 minutes:

PUT _watcher/watch/critical_watch
{
  "trigger" : {
      "schedule" : {
        "interval" : "1m"
    }
  },
  "input" : {
    "search" : {
      "request" : {
        "indices" : [ "nagioscheckbeat*" ],
        "body" : {
           "query":{
              "filtered":{
                 "query" : {
                    "bool" : {
                      "must" : [
                            { "term" : {"_type": "nagioscheck"} },
                            { "range" : {"@timestamp" : {"gte" : "now-30m"}} },
                            { "term" : {"status" : "CRITICAL" } }
                      ]
                    }
                 }
              }
           }
        }
      }
    }
  },
  "condition" : {
    "compare" : { "ctx.payload.hits.total" : { "gt" : 0 }}
  },
  "actions" : {
    "send_email" : {
      "throttle_period": "30m",
      "email" : {
        "to" : "me@elastic.co",
        "subject" : "Alert from Watcher - Service(s) Critical",
        "body" : "One or more services reported a CRITICAL state within the last 30 minutes.  See the attached file for details.  You will not receive another notification for 30 minutes.",
        "attach_data" : true
      }
    }
  }
}

Notice the throttle_period directive, which ensures that, while the condition persists, the alert is re-sent at most once every 30 minutes.

Alerting on Lost Heartbeats

No monitoring system would be complete without knowing whether a host is down.  Consider this experimental approach: on each host, a special “heartbeat” check is published every few seconds.

name: "heartbeat"
cmd: "/usr/lib64/nagios/plugins/check_dummy"
args: "0 Hello"
period: "10s"

The watch works like this: any host that has sent a heartbeat within the last day, but has not been heard from in the last 5 minutes, is considered “down”.  The watch uses an Elasticsearch Top Hits aggregation to retrieve the most recent heartbeat from each host.
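
A rough sketch of that aggregation follows; the host field name and the terms size are assumptions for illustration, not copied from the actual watch:

"aggs" : {
  "hosts" : {
    "terms" : { "field" : "host", "size" : 0 },
    "aggs" : {
      "latest_heartbeat" : {
        "top_hits" : {
          "size" : 1,
          "sort" : [ { "@timestamp" : { "order" : "desc" } } ]
        }
      }
    }
  }
}

Because the hits in each bucket are sorted by @timestamp, the condition script can read the newest heartbeat’s timestamp from hits[0].sort[0].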

...
  "condition" : {
    "script" : {
      "file" : "monitor-hosts",
      "params" : {
        "threshold" : 0
      }
    }
  }
...

(see the entire watch here)

Notice that our condition in the watch is a script, which looks like this:

// A host is considered "down" if its most recent heartbeat is older than this many minutes
def minutes = 5
def now = DateTime.now().getMillis()
ctx.vars.hosts = [ up: [], down: [] ]
// Each bucket is one host; latest_heartbeat is a top_hits aggregation sorted by @timestamp descending
ctx.payload.aggregations.hosts.buckets.each {
  // sort[0] is the epoch-millisecond timestamp of the newest heartbeat
  def last_heartbeat = it.latest_heartbeat.hits.hits[0].sort[0]
  def ms_ago = now - last_heartbeat
  if (ms_ago > (minutes * 60 * 1000)) {
    ctx.vars.hosts.down.add([ hostname: it.key, last_heartbeat: new Date(last_heartbeat) ])
  } else {
    ctx.vars.hosts.up.add([ hostname: it.key, last_heartbeat: new Date(last_heartbeat) ])
  }
}
// Trigger the watch actions only if at least one host is down
return ctx.vars.hosts.down.size() > 0


A Complete Open Source Solution

As an alternative to Watcher, integrating Elasticsearch with any existing monitoring system would be trivial: one could simply set up a check that alerts and escalates under similar conditions.  A monitoring system that talks to a single Elasticsearch cluster is more secure and scalable than one that reaches out to every service on every end system.
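
For instance, an external check could count recent CRITICAL results with a query along these lines, mirroring the fields used by the watch above, and raise an alert whenever the count is greater than zero:

GET nagioscheckbeat*/_count
{
  "query" : {
    "bool" : {
      "must" : [
        { "term" : { "_type" : "nagioscheck" } },
        { "term" : { "status" : "CRITICAL" } },
        { "range" : { "@timestamp" : { "gte" : "now-30m" } } }
      ]
    }
  }
}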

Also, in Kibana, we can easily report on services that are in a CRITICAL state:

[Screenshot: Kibana view of services in a CRITICAL state]
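
A view like this can be driven by a simple search filter in Kibana, using the same fields as the watch above:

_type:nagioscheck AND status:CRITICAL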

Conclusion

When various Beats are used in combination, we have a unified solution for exporting system logs, application data, and performance metrics.   

In the changing landscape of infrastructure operations, networks increasingly rely on cloud services such as Found and software-as-a-service offerings such as PagerDuty.  Leveraging these services is popular because it simplifies architecture and operations.

Modern automation technologies have simplified the host-configuration process.  When a host’s monitoring configuration is applied via automation, that configuration is spun up and torn down along with the host, autoscaling with the rest of the fleet.

Please see the entire Proof of Concept on GitHub, and feel free to clone and run it yourself.