Structured logging with Filebeat
The idea behind structured logging is simple: instead of having applications write logs as free-form text that has to be parsed with regular expressions into JSON objects before being indexed into Elasticsearch, you make the application write JSON objects directly.
To illustrate, let’s say you are writing a Python web application and you are using the standard library for logging. When a user signs in, you might use a logging statement that looks like this:
logging.debug("User '{}' (id: {}) successfully logged in. Session id: {}" .format(user["name"], user["id"], session_id))
And then when the user gets verified:
logging.debug("User '{}' (id: {}) changed state to verified." .format(user["name"], user["id"]))
This results in log lines like this:
DEBUG:root:User 'arthur' (id: 42) successfully logged in. Session id: 91e5b9d
DEBUG:root:User 'arthur' (id: 42) changed state to verified.
This way of logging has been popular ever since the good old days of printf. Back then, you typically had a single server, you would tail and grep its log files, and everything was great.
Times have changed, however, and today it is more likely that you have tens, hundreds, or thousands of servers / virtual machines / containers creating massive amounts of logs, so you centralize them in Elasticsearch and use its magical (that’s just how it feels) search and aggregation features to navigate through them.
This works best when the logs are pre-parsed into structured objects, so you can search and aggregate on individual fields. So before indexing, you would typically use Logstash’s amazing Grok filter to parse the application logs into JSON objects, but this means that you have to write and maintain Grok patterns and spend CPU cycles to do the parsing.
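To make that cost concrete, here is a rough sketch of the parsing work, using Python’s re module instead of Grok (the pattern and field names are illustrative; a real deployment would keep this logic in Logstash):

import re

# One pattern per message format, kept in sync by hand with the code that logs it.
LOGIN_RE = re.compile(
    r"User '(?P<user>[^']+)' \(id: (?P<id>\d+)\) successfully logged in\. "
    r"Session id: (?P<session_id>\w+)"
)

line = "DEBUG:root:User 'arthur' (id: 42) successfully logged in. Session id: 91e5b9d"
match = LOGIN_RE.search(line)
if match:
    doc = match.groupdict()     # {'user': 'arthur', 'id': '42', 'session_id': '91e5b9d'}
    doc["id"] = int(doc["id"])  # field types also have to be restored by hand

Every new log message needs its own pattern, and any change to the log format silently breaks the parsing.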
Now let’s try the same example with structured logging. While usually not in the standard library, all major programming languages have libraries that make structured logging easy. James Turnbull created a list in this blog post, which also goes into detail about how to do this for a Rails application. In Python, there’s the structlog library, which we will use here:
log = log.bind(user='arthur', id=42, session_id='91e5b9d', verified=False)
log.msg('logged_in')
log.msg('changed_state', verified=True)
Which results in log lines like this:
verified=False user='arthur' session_id='91e5b9d' id=42 event='logged_in'
verified=True user='arthur' session_id='91e5b9d' id=42 event='changed_state'
One thing to notice is that the code is less repetitive, and it encourages the developer to include all the relevant data rather than only what seems important at the time of writing. Also note that in this format, the log lines are still reasonably easy for a human to follow during development. When moving to production, however, it makes more sense to use the JSON renderer:
log = wrap_logger(PrintLogger(), processors=[JSONRenderer()])
This results in log lines like this:
{"verified": false, "user": "arthur", "session_id": "91e5b9d", "id": 42, "event": "logged_in"} {"verified": true, "user": "arthur", "session_id": "91e5b9d", "id": 42, "event": "changed_state"}
This is less readable to human eyes, but has the advantage that the data is already structured in the format that Elasticsearch likes.
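To put the pieces together, here is a minimal, self-contained sketch of the application side, assuming the events are written one JSON object per line to a file that a shipper can pick up (the file path and the PrintLogger-based setup are illustrative assumptions, not the only way to do it):

import structlog
from structlog.processors import JSONRenderer

# Illustrative: append one JSON object per line to a file that can be tailed.
log_file = open("test/structured.log", "a")
log = structlog.wrap_logger(structlog.PrintLogger(file=log_file),
                            processors=[JSONRenderer()])

# Bind the request-scoped context once, then log events against it.
log = log.bind(user="arthur", id=42, session_id="91e5b9d", verified=False)
log.msg("logged_in")
log.msg("changed_state", verified=True)

Each call appends a JSON line like the ones shown above, which is exactly the kind of file a log shipper is designed to pick up.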
Filebeat is an open source log shipper, written in Go, that can send log lines to Logstash and Elasticsearch. It offers “at-least-once” guarantees, so you never lose a log line, and it uses a back-pressure sensitive protocol, so it won’t overload your pipeline. Basic filtering and multi-line correlation are also included.
Starting with version 5.0 (currently in alpha, but you can give it a try), Filebeat can also natively decode JSON objects if they are stored one per line, as in the example above.
Here is a sample configuration file that configures Filebeat to pick up the files and send the JSON objects to Elasticsearch:
filebeat.prospectors:
- input_type: log
  paths: ["test/*"]
  json.message_key: event
  json.keys_under_root: true
  fields:
    planet: Magrathea
    service: ${SERVICE_NAME}
  tags: [ "i", "heart", "json" ]

output.elasticsearch:
  hosts: ["192.168.99.100:9200"]
  template.name: filebeat
  template.path: filebeat.template.json
This is one of the JSON objects that gets indexed this way:
{ "@timestamp": "2016-04-22T22:09:09.631Z", "beat": { "hostname": "DeepThought", "name": "DeepThought" }, "event": "logged_in", "fields": { "planet": "Magrathea", "service": "Answerer" }, "id": 42, "input_type": "log", "offset": 0, "session_id": "91e5b9d", "source": "test/structured.log", "tags": [ "i", "heart", "json" ], "type": "log", "user": "arthur", "verified": false }
As you can see, Filebeat automatically adds a timestamp. Note that this is the time when the log line was read, which can be different from when the application wrote the log line. You can set up the structlog library to generate timestamps if you need better accuracy.
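For example, a minimal sketch, assuming structlog’s TimeStamper processor is placed before the JSON renderer (the field name and format are structlog’s defaults, not something Filebeat requires):

from structlog import wrap_logger, PrintLogger
from structlog.processors import JSONRenderer, TimeStamper

# Assumed chain: record an ISO-8601 UTC timestamp at write time, then render JSON.
log = wrap_logger(
    PrintLogger(),
    processors=[TimeStamper(fmt="iso", utc=True), JSONRenderer()],
)
log.msg("logged_in", user="arthur", id=42)

This adds a "timestamp" field to each event, recorded at the moment the application writes the line rather than when Filebeat reads it.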
Filebeat also automatically adds some metadata like the host name, and it makes it easy to add custom fields and tags via the configuration file. This means that the application doesn’t have to worry about adding metadata from the environment.
That’s about all you need. Simple things should be simple :-)