Elasticsearch Command Line Debugging With The _cat API
One of the most useful utilities for investigating Elasticsearch from the command line is the _cat API. Whereas the usual Elasticsearch API endpoints are ideal for consuming JSON from within a fully-fledged programming language, the cat API (as its name would imply) is especially suited for command-line tools.
In previous posts we've explored some of the different endpoints for the API. We can build upon that and the existing plethora of command-line utilities to build even more useful patterns that we can combine for simple (and effective) monitoring and debugging use cases.
A Primer for your Cat
A basic familiarity with the cat API is helpful before reading on. In particular:
- The h query string parameter allows us to ask the API to return only certain fields that we're interested in (i.e., the header) - see the example below for how to discover which fields an endpoint supports.
- Note that the -s argument is used in curl - this enables silent output; otherwise, extraneous HTTP transfer data may get into the pipeline when chaining commands together.
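As a quick illustration of discovering which fields you can pass to h, each cat endpoint documents its own columns via the help parameter (the nodes endpoint is used here only as an example):

# list the column names (and descriptions) that _cat/nodes can return
$ curl -s 'localhost:9200/_cat/nodes?help'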
Diving Into The Heap
An oft-asked question is how to debug OutOfMemoryError messages. You've taken the right measures to configure your heap, but after some time of normal use, heap usage grows again and instability ensues. How can you dig further?
There are lots of good resources to track this sort of resource utilization: Marvel offers commercial monitoring, and there are many open source options as well. However, when you're debugging a red cluster at the last minute, you need immediate options. What tools can you reach for easily?
The cat API offers many endpoints, and piping a curl command that retrieves heap metrics into sort can quickly answer the question, "Which node is experiencing the most memory pressure right now?"
$ curl -s 'localhost:9200/_cat/nodes?h=host,heap.percent' | sort -r -n -k2
es02 71
es00 60
es01 59
We can see that node es02 is using 71% of the JVM heap. Following the pipeline:
- We ask for the nodes endpoint, querying the hostname and heap percentage in use, then
- Pipe to sort, sorting on the second column (heap percentage in use) in -r (reverse) order to get the highest usage first.
Coupled with other utilities like head and tail, we can find both over- and under-utilized nodes in very large clusters very quickly.
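For example, limiting the sorted output to its first few lines surfaces the most heavily loaded nodes (swap head for tail to see the least loaded):

# top 3 nodes by JVM heap usage
$ curl -s 'localhost:9200/_cat/nodes?h=host,heap.percent' | sort -r -n -k2 | head -n 3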
This is useful, but we can do more. It would be nice if we could query heap usage at a more granular level in order to determine what, exactly, is using space on our nodes.
It turns out we can:
$ curl -s 'localhost:9200/_cat/nodes?h=name,fm,fcm,sm,qcm,im&v'
name fm      fcm     sm
es01 781.4mb 675.6mb 734.5mb
es02 1.6gb   681.3mb 892.2mb
es00 1.4gb   620.1mb 899.4mb
Note: These are abbreviated column names for fielddata.memory_size (fm), filter_cache.memory_size (fcm), and segments.memory (sm). Other fields exist as well; consult curl -s 'localhost:9200/_cat/nodes?help' | grep -i memory for additional information.
From this we can see that fielddata.memory_size is consuming a fairly large part of our nodes' memory. Armed with this knowledge, mitigations such as increased use of doc_values can aid in shrinking that aspect of heap usage.
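As a rough sketch of what that mitigation looks like on Elasticsearch versions where doc_values are not yet enabled by default, a field can opt in through its mapping. The index, type, and field names below are purely hypothetical:

# hypothetical index/type/field names; doc_values keeps fielddata off the JVM heap
$ curl -s -XPUT 'localhost:9200/my-index' -d '{
  "mappings": {
    "events": {
      "properties": {
        "user": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}'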
Gazing Into The Thread Pool
Many Elasticsearch operations take place in thread pools, which makes them a useful place to look when inspecting what your cluster is busy doing. During peak times, the thread pool stats can be a useful reflection of what operations (searching, indexing, etc.) are keeping machines busy.
The cat API is useful here, too. By default it returns each common thread pool's active, queue, and rejected counts, which can often help pinpoint requests that are backing up in queues under heavy load. Consider this generic output:
$ curl -s 'localhost:9200/_cat/thread_pool'
es03  10.xxx.xx.xxx 0 0 0 0 0 0 1 0 0
elk00 10.xx.xxx.xxx 0 0 0 0 0 0 1 0 0
es00  10.xx.xx.xxx  0 0 0 0 0 0 0 0 0
Note: the table headers are omitted here, but the numbers following the node IPs are the active, queue, and rejected counts for the bulk, index, and search thread pools, respectively (a command that requests explicit headers is sketched below).
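If you'd rather see labeled columns, the h and v parameters work on this endpoint as well. A minimal sketch, assuming the standard _cat/thread_pool column names:

# explicit columns for the bulk and search pools; add index.* for the index pool
$ curl -s 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected,search.active,search.queue,search.rejected'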
This cluster is serving a single search request, which isn't terribly exciting. However, what if your cluster is having problems and you need to closely watch operations? A watch command can help here:
$ watch 'curl -s localhost:9200/_cat/thread_pool | sort -n -k1'
watch executes the command every 2 seconds by default. We sort on the first column to keep the ordering consistent, and adding the -d flag makes watch highlight values as they change, so it's easy to spot problems such as a deluge of search requests getting queued when users are hitting the cluster hard.
Diffing Indices for Fun and Profit
Another common use case is migrating data from one cluster to another: there are several ways to do this including snapshots and utilities like logstash. With large datasets, this can take quite some time, so how can we gain visibility into the process?
The cat API offers simple endpoints for index metrics through _cat/indices, which includes information such as disk usage and document count per index.
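For instance, a single index's document count and size can be spot-checked directly; foo-index below stands in for whatever index you're interested in:

# document count and on-disk size for one index
$ curl -s 'localhost:9200/_cat/indices/foo-index?v&h=index,docs.count,store.size'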
Given a scenario in which we're streaming documents from one index to another on a different cluster, we can perform some command-line gymnastics to watch a diff between index document counts. Consider the following command:
$ join \
    <(curl -s localhost:9200/_cat/indices | awk '$3 ~ /foo-index/ { print $3 " " $6; }') \
    <(curl -s otherhost:9200/_cat/indices | awk '$3 ~ /foo-index/ { print $3 " " $6; }') \
  | awk '{ print $1 " " $2 - $3; }'
foo-index -231700
This command makes use of bash process substitution to create temporary file descriptors from the output of the commands between the <() parentheses. We use awk to find the index of interest, foo-index, and print the index name and document count for the local index and the remote one we're streaming to. The join command then merges the two lines on their first field (the index name), and we pipe the result through another awk to calculate the difference between the two document counts.
The results of the example command indicate that there's a disparity in document count that should converge as the stream nears completion and finishes replicating the data.
By placing this command into either a watch command or a bash loop, we can quickly whip up a small utility to watch the progress of our index import, as sketched below.
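A minimal sketch of such a loop, using a simple while loop and an arbitrary 10-second polling interval:

$ while true; do
    join \
      <(curl -s localhost:9200/_cat/indices | awk '$3 ~ /foo-index/ { print $3 " " $6; }') \
      <(curl -s otherhost:9200/_cat/indices | awk '$3 ~ /foo-index/ { print $3 " " $6; }') \
    | awk '{ print $1 " " $2 - $3; }'
    sleep 10  # adjust the polling interval to taste
  done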
Going Further
These are just a few examples of how everyday command line utilities can be paired with the cat API to make lives easier for administering Elasticsearch. There are plenty of other potential applications to consider - for example:
- Use the _cat/recovery API to watch the recovery progress of a cluster in a shell loop
- Consume metrics like thread pools or node statistics in tools like ganglia or nagios for simple alerting on Elasticsearch health
- Use _cat/shards when in a pinch to find exactly which shard is causing your cluster to go into a red or yellow state (see the sketch below)
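As a quick sketch of that last item, problem shards can be filtered out of _cat/shards with grep; the state names used here (STARTED, UNASSIGNED, and so on) are the standard values the endpoint reports:

# show every shard that is not happily started
$ curl -s 'localhost:9200/_cat/shards' | grep -v STARTED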
Good luck, and may the cat be with you!