Explore your data with runtime fields
Consider a large set of log data that you want to extract fields from. Indexing the data is time consuming and uses a lot of disk space, and you just want to explore the data structure without committing to a schema up front.
You know that your log data contains specific fields that you want to extract. In this case, we want to focus on the @timestamp and message fields. By using runtime fields, you can define scripts to calculate values at search time for these fields.
Define indexed fields as a starting point
You can start with a simple example by adding the @timestamp and message fields to the my-index-000001 mapping as indexed fields. To remain flexible, use wildcard as the field type for message:
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        "@timestamp": {
          format: 'strict_date_optional_time||epoch_second',
          type: 'date'
        },
        message: {
          type: 'wildcard'
        }
      }
    }
  }
)
puts response
PUT /my-index-000001/
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "format": "strict_date_optional_time||epoch_second",
        "type": "date"
      },
      "message": {
        "type": "wildcard"
      }
    }
  }
}
Ingest some data
After mapping the fields you want to retrieve, index a few records from your log data into Elasticsearch. The following request uses the bulk API to index raw log data into my-index-000001. Instead of indexing all of your log data, you can use a small sample to experiment with runtime fields.
The final document is not a valid Apache log format, but we can account for that scenario in our script.
response = client.bulk(
  index: 'my-index-000001',
  refresh: true,
  body: [
    { index: {} },
    {
      timestamp: '2020-04-30T14:30:17-05:00',
      message: '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
    },
    { index: {} },
    {
      timestamp: '2020-04-30T14:30:53-05:00',
      message: '232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
    },
    { index: {} },
    {
      timestamp: '2020-04-30T14:31:12-05:00',
      message: '26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
    },
    { index: {} },
    {
      timestamp: '2020-04-30T14:31:19-05:00',
      message: '247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] "GET /french/splash_inet.html HTTP/1.0" 200 3781'
    },
    { index: {} },
    {
      timestamp: '2020-04-30T14:31:22-05:00',
      message: '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0'
    },
    { index: {} },
    {
      timestamp: '2020-04-30T14:31:27-05:00',
      message: '252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
    },
    { index: {} },
    {
      timestamp: '2020-04-30T14:31:28-05:00',
      message: 'not a valid apache log'
    }
  ]
)
puts response
POST /my-index-000001/_bulk?refresh
{"index":{}}
{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
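The bulk request body is newline-delimited JSON: each document is preceded by an action line, every line is a complete JSON object, and the body ends with a newline. As a minimal sketch of that layout, the following Ruby snippet assembles such a body from two of the sample records using only the standard json library. The to_bulk_body helper is a hypothetical name introduced for illustration and is not part of the Elasticsearch client.

```ruby
require 'json'

# Hypothetical helper (not part of any Elasticsearch client): turns an array
# of documents into a newline-delimited JSON body for the bulk API.
def to_bulk_body(docs)
  lines = docs.flat_map do |doc|
    # One action line, then the document itself, each as a single JSON line.
    [JSON.generate({ index: {} }), JSON.generate(doc)]
  end
  # The body must end with a newline character.
  lines.join("\n") + "\n"
end

docs = [
  { timestamp: '2020-04-30T14:31:22-05:00',
    message: '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0' },
  { timestamp: '2020-04-30T14:31:28-05:00', message: 'not a valid apache log' }
]

body = to_bulk_body(docs)
puts body
```

The Ruby client request shown earlier accepts an array of hashes and handles this serialization for you, so a helper like this is only relevant when you build the raw request body yourself.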
At this point, you can view how Elasticsearch stores your raw data.
response = client.indices.get(
  index: 'my-index-000001'
)
puts response
GET /my-index-000001
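The mapping contains the two fields you defined, @timestamp and message, plus a timestamp field. Elasticsearch added timestamp through dynamic mapping because the sample documents use timestamp rather than @timestamp as the field name.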
{
  "my-index-000001" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "@timestamp" : {
          "type" : "date",
          "format" : "strict_date_optional_time||epoch_second"
        },
        "message" : {
          "type" : "wildcard"
        },
        "timestamp" : {
          "type" : "date"
        }
      }
    },
    ...
  }
}
Define a runtime field with a grok pattern
If you want to retrieve results that include clientip, you can add that field as a runtime field in the mapping. The following runtime script defines a grok pattern that extracts structured fields out of a single text field within a document. A grok pattern is like a regular expression that supports aliased expressions that you can reuse.
The script matches on the %{COMMONAPACHELOG} log pattern, which understands the structure of Apache logs. If the pattern matches (clientip != null), the script emits the value of the matching IP address. If the pattern doesn’t match, the script emits nothing, so the document has no value for the field and the script doesn’t crash.
PUT my-index-000001/_mappings
{
  "runtime": {
    "http.client_ip": {
      "type": "ip",
      "script": """
        String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip);
      """
    }
  }
}
The if (clientip != null) condition ensures that the script doesn’t crash even if the pattern of the message doesn’t match.
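To make the match-or-skip behavior concrete, here is a minimal Ruby sketch that mimics the runtime script outside Elasticsearch. It is an illustration under stated assumptions: the regular expression is a simplified stand-in for %{COMMONAPACHELOG}, not the actual grok definition, and extract_clientip is a hypothetical helper name.

```ruby
# Mimics the runtime script: returns the client IP, or nil when the message
# does not match. The regex is a simplified stand-in for %{COMMONAPACHELOG}
# (an assumption for this sketch); the real grok pattern is more permissive.
def extract_clientip(message)
  pattern = /\A(?<clientip>\S+) (?<ident>\S+) (?<auth>\S+) \[(?<timestamp>[^\]]+)\] "(?<verb>\S+) (?<request>\S+) HTTP\/(?<httpversion>[\d.]+)" (?<response>\d{3}) (?<bytes>\d+|-)\z/
  match = pattern.match(message)
  match && match[:clientip]
end

messages = [
  '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736',
  'not a valid apache log'
]

# Equivalent of "if (clientip != null) emit(clientip);": nil results are
# dropped instead of raising an error.
emitted = messages.map { |m| extract_clientip(m) }.compact
puts emitted.inspect
```

Only the first message yields a value. The invalid line produces nil and is skipped, which corresponds to a document that has no value for the runtime field.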
Alternatively, you can define the same runtime field in the context of a search request. The script is exactly the same as the one defined previously in the index mapping; this example names the field http.clientip. Copy the definition into the search request under the runtime_mappings section and include a query that matches on the runtime field. The query returns the same results as if you had defined the http.clientip runtime field in your index mappings, but the field exists only in the context of this specific search:
GET my-index-000001/_search
{
  "runtime_mappings": {
    "http.clientip": {
      "type": "ip",
      "script": """
        String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip);
      """
    }
  },
  "query": {
    "match": {
      "http.clientip": "40.135.0.0"
    }
  },
  "fields" : ["http.clientip"]
}
Define a composite runtime field
You can also define a composite runtime field to emit multiple fields from a single script. You can define a set of typed subfields and emit a map of values. At search time, each subfield retrieves the value associated with its name in the map. This means that you only need to specify your grok pattern one time and can return multiple values:
response = client.indices.put_mapping(
  index: 'my-index-000001',
  body: {
    runtime: {
      http: {
        type: 'composite',
        script: 'emit(grok("%{COMMONAPACHELOG}").extract(doc["message"].value))',
        fields: {
          clientip: {
            type: 'ip'
          },
          verb: {
            type: 'keyword'
          },
          response: {
            type: 'long'
          }
        }
      }
    }
  }
)
puts response
PUT my-index-000001/_mappings
{
  "runtime": {
    "http": {
      "type": "composite",
      "script": "emit(grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message\"].value))",
      "fields": {
        "clientip": {
          "type": "ip"
        },
        "verb": {
          "type": "keyword"
        },
        "response": {
          "type": "long"
        }
      }
    }
  }
}
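The lookup that a composite field performs can be pictured as follows. This Ruby sketch is only an analogy, not how Elasticsearch implements it: the emitted map is written by hand from one sample message, and the type handling is reduced to converting long values to integers.

```ruby
# Map of values as the grok extraction might produce it for one sample
# message. Written by hand for this sketch; every value starts as a string.
emitted = {
  'clientip' => '40.135.0.0',
  'verb' => 'GET',
  'request' => '/images/hm_bg.jpg',
  'response' => '200',
  'bytes' => '24736'
}

# Typed subfields, analogous to the "fields" section of the composite mapping.
subfields = { 'clientip' => :ip, 'verb' => :keyword, 'response' => :long }

# Each subfield looks up the value stored under its own name and ignores
# every other key in the map.
http = subfields.each_with_object({}) do |(name, type), out|
  raw = emitted[name]
  next if raw.nil?
  out["http.#{name}"] = type == :long ? Integer(raw) : raw
end

puts http.inspect
```

Keys that have no declared subfield, such as request and bytes in this sketch, are simply not exposed, which is why the pattern only needs to be specified once regardless of how many subfields you declare.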
Search for a specific IP address
Using the http.clientip runtime field, you can define a simple query to run a search for a specific IP address and return all related fields.
GET my-index-000001/_search
{
  "query": {
    "match": {
      "http.clientip": "40.135.0.0"
    }
  },
  "fields" : ["*"]
}
The API returns the following result. Because http is a composite runtime field, the response includes each of the sub-fields under fields, including any associated values that match the query. Without building your data structure in advance, you can search and explore your data in meaningful ways to experiment and determine which fields to index.
{
  ...
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_id" : "sRVHBnwBB-qjgFni7h_O",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:30:17-05:00",
          "message" : "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        },
        "fields" : {
          "http.verb" : [ "GET" ],
          "http.clientip" : [ "40.135.0.0" ],
          "http.response" : [ 200 ],
          "message" : [ "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736" ],
          "http.client_ip" : [ "40.135.0.0" ],
          "timestamp" : [ "2020-04-30T19:30:17.000Z" ]
        }
      }
    ]
  }
}
Also, remember that if statement in the script?
if (clientip != null) emit(clientip);
If the script didn’t include this condition, the query would fail on any shard containing a document that doesn’t match the pattern. By including this condition, the query skips data that doesn’t match the grok pattern.
Search for documents in a specific range
You can also run a range query that operates on the timestamp field. The following query returns any documents where the timestamp is greater than or equal to 2020-04-30T14:31:27-05:00:
GET my-index-000001/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "2020-04-30T14:31:27-05:00"
      }
    }
  }
}
The response includes the document where the log format doesn’t match, but the timestamp falls within the defined range.
{
  ...
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_id" : "hdEhyncBRSB6iD-PoBqe",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:31:27-05:00",
          "message" : "252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        }
      },
      {
        "_index" : "my-index-000001",
        "_id" : "htEhyncBRSB6iD-PoBqe",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:31:28-05:00",
          "message" : "not a valid apache log"
        }
      }
    ]
  }
}
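The range comparison works on the parsed instant, not on the text of the message, which is why the malformed log line still qualifies. As a rough check outside Elasticsearch, this Ruby sketch applies the same gte bound to the seven sample timestamps using the standard time library. It only illustrates the comparison and says nothing about how Elasticsearch evaluates range queries internally.

```ruby
require 'time'

# The timestamp values from the sample bulk request, in ingest order.
timestamps = [
  '2020-04-30T14:30:17-05:00',
  '2020-04-30T14:30:53-05:00',
  '2020-04-30T14:31:12-05:00',
  '2020-04-30T14:31:19-05:00',
  '2020-04-30T14:31:22-05:00',
  '2020-04-30T14:31:27-05:00',
  '2020-04-30T14:31:28-05:00'
]

# Same lower bound as the "gte" clause in the range query.
lower_bound = Time.iso8601('2020-04-30T14:31:27-05:00')

matches = timestamps.select { |t| Time.iso8601(t) >= lower_bound }
puts matches

# Date values come back normalized to UTC in the "fields" section of a
# search response; the same instant rendered in UTC with milliseconds:
puts Time.iso8601(timestamps.first).utc.iso8601(3)
```

Two timestamps satisfy the bound, matching the two hits in the response, and the last one belongs to the document whose message is not a valid Apache log.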
Define a runtime field with a dissect pattern
If you don’t need the power of regular expressions, you can use dissect patterns instead of grok patterns. Dissect patterns match on fixed delimiters but are typically faster than grok.
You can use dissect to achieve the same results as parsing the Apache logs with a grok pattern. Instead of matching on a log pattern, you include the parts of the string that you want to discard. Paying special attention to the parts of the string you want to discard will help build successful dissect patterns.
PUT my-index-000001/_mappings
{
  "runtime": {
    "http.client.ip": {
      "type": "ip",
      "script": """
        String clientip=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip);
      """
    }
  }
}
Similarly, you can define a dissect pattern to extract the HTTP response code:
PUT my-index-000001/_mappings
{
  "runtime": {
    "http.responses": {
      "type": "long",
      "script": """
        String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
        if (response != null) emit(Integer.parseInt(response));
      """
    }
  }
}
You can then run a query to retrieve a specific HTTP response using the http.responses runtime field. Use the fields parameter of the _search request to indicate which fields you want to retrieve:
response = client.search(
  index: 'my-index-000001',
  body: {
    query: {
      match: {
        "http.responses": '304'
      }
    },
    fields: [
      'http.client_ip',
      'timestamp',
      'http.verb'
    ]
  }
)
puts response
GET my-index-000001/_search
{
  "query": {
    "match": {
      "http.responses": "304"
    }
  },
  "fields" : ["http.client_ip","timestamp","http.verb"]
}
The response includes a single document where the HTTP response is 304:
{
  ...
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_id" : "A2qDy3cBWRMvVAuI7F8M",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:31:22-05:00",
          "message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
        },
        "fields" : {
          "http.verb" : [ "GET" ],
          "http.client_ip" : [ "247.37.0.0" ],
          "timestamp" : [ "2020-04-30T19:31:22.000Z" ]
        }
      }
    ]
  }
}
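To see why fixed delimiters are enough for this log format, the following Ruby sketch implements a deliberately minimal dissect-style matcher. It is written for illustration only: simple_dissect is a hypothetical helper, it ignores the modifiers and edge cases that the real dissect implementation handles, and it assumes that the literal text between keys never contains a percent sign.

```ruby
# Minimal dissect-style matcher (illustrative sketch, not the real algorithm).
# Returns a hash of key => value, or nil when the text does not fit.
def simple_dissect(pattern, text)
  # Split the pattern into %{key} placeholders and literal delimiters.
  # Each token is a [key, literal] pair with exactly one non-nil entry.
  tokens = pattern.scan(/%\{([^}]*)\}|([^%]+)/)
  result = {}
  pos = 0
  tokens.each_with_index do |(key, literal), i|
    if literal
      # A delimiter must appear exactly at the current position.
      return nil unless text[pos, literal.length] == literal
      pos += literal.length
    else
      # A key captures everything up to the next delimiter, or to the end
      # of the text when it is the last token.
      next_literal = tokens[i + 1] && tokens[i + 1][1]
      stop = next_literal ? text.index(next_literal, pos) : text.length
      return nil if stop.nil?
      result[key] = text[pos...stop]
      pos = stop
    end
  end
  pos == text.length ? result : nil
end

pattern = '%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}'
valid = '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0'

parsed = simple_dissect(pattern, valid)
puts parsed['clientip']
puts parsed['status']
puts simple_dissect(pattern, 'not a valid apache log').inspect
```

The delimiters do all the work here: the spaces, square brackets, quotation marks, and the HTTP/ prefix are the parts of the string that get discarded, and everything between them is captured. The invalid line fails as soon as an expected delimiter is missing, so the sketch returns nil, just as the runtime script emits nothing when extraction yields no value.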