Terms Filter Lookup
NOTE: This article now contains outdated information. Please reference our docs, peruse our latest blogs, and visit our forums for the latest and greatest. Thank you.
There is a new feature in the 0.90 branch that is pretty awesome: the Terms Filter now supports document lookups.
In a normal Terms Filter, you provide a list of Terms that you want to filter against. This is fine for small lists, but what if you have 100 terms? A thousand terms? That is a lot of data to pass over the wire. If that list of terms is stored in your index somewhere, you also have to retrieve it first…just so you can pass it back to Elasticsearch.
The new lookup feature tells Elasticsearch to use another document as your terms array. Instead of passing 1000 terms, you simply tell the Terms Filter “Hey, all the terms I want are in this document”. Elasticsearch will fetch that document internally, extract the terms and perform your query.
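In skeleton form, the lookup version of the Terms Filter looks something like this. The angle-bracket placeholders are mine; the full, concrete query appears later in this post:
{
  "terms" : {
    "<field to filter on>" : {
      "index" : "<index holding the lookup document>",
      "type" : "<type of the lookup document>",
      "id" : "<id of the lookup document>",
      "path" : "<field inside the lookup document that holds the terms>"
    }
  }
}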
Let’s work our way through a concrete example of how it all works.
A concrete example
The Terms Filter page has a good Twitter example. I encourage you to read over it. But I come from a biology background and you can only read so many Twitter examples…so let’s try something a bit different and bioinformatics-y.
Imagine you run a bioinformatics search engine and database. You have two indices holding two different types of data. First, you have an index that stores academic research articles.
This document represents a single scientific paper. It has two fields, the title of the paper and a list of all proteins relevant to the topic:
curl -XPUT localhost:9200/papers/paper/paper789 -d '{
  "title" : "Ahi1, whose human ortholog is mutated in Joubert syndrome, is required for Rab8a localization, ciliogenesis and vesicle trafficking.",
  "proteins" : [ "Ahi1", "Rab8a" ]
}'
Next, you have an index which holds data from a microarray experiment. Microarrays are a method to determine whether a gene’s activity is more (“up-regulated”) or less (“down-regulated”) than its normal rate. Microarrays are about the size of a postage stamp and test thousands of genes at once.
The resulting "upregulated_proteins" field could potentially hold thousands of proteins:
curl -XPUT localhost:9200/microarrays/experiment/experiment1234 -d '{
  "upregulated_proteins" : [ "Ahi1", "Wnt", "SHH", "CDC42", "GSK-3B", [.......] ]
}'
Given this data, a very common question might be: “Show me all papers about the proteins up-regulated in this experiment”.
Filtering the papers
Before lookups, the way to accomplish this was a GET plus a Filtered Search. First you have to GET the microarray experiment and extract the array of terms:
curl -XGET localhost:9200/microarrays/experiment/experiment1234
# ...extract the array in your app...
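The GET response wraps your original document in a _source field, which is where your application would pull the array from. The response below is an abbreviated sketch from memory rather than exact 0.90 output:
{
  "_index" : "microarrays",
  "_type" : "experiment",
  "_id" : "experiment1234",
  "_source" : {
    "upregulated_proteins" : [ "Ahi1", "Wnt", "SHH", "CDC42", "GSK-3B", [.......] ]
  }
}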
Then perform a search query using the filter you need:
curl -XGET localhost:9200/papers/paper/_search -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "terms" : {
          "proteins" : [ "Ahi1", "Wnt", "SHH", "CDC42", "GSK-3B", [.......] ]
        }
      }
    }
  }
}'
This works, but you can see why it isn’t ideal. Not only do we need to perform two requests – a GET and a Search – but we have to shuffle a potentially large term array across the wire twice. The lookup feature allows you to bypass this inefficiency.
Filtering, this time with Lookups
Lookups use documents themselves as the list of Terms, which means you can avoid unnecessary requests. Let’s try again, but this time with lookups.
The data is organized the same as before, but when we search we skip the extraneous GET phase and go straight to the Terms Filter with the new Lookup syntax:
curl -XGET localhost:9200/papers/paper/_search -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "terms" : {
          "proteins" : {
            "index" : "microarrays",
            "type" : "experiment",
            "id" : "experiment1234",
            "path" : "upregulated_proteins"
          },
          "_cache_key" : "experiment_1234"
        }
      }
    }
  }
}'
Neat! How does it work? Let’s just look at the new Terms Filter syntax line by line:
{ "terms":{ "proteins":{Filter terms in the"proteins"
field…."index":"microarrays", "type":"experiment", "id":"experiment1234", "path":"upregulated_proteins"…matching the array of terms located in the"upregulated_proteins"
field, inside the document located at/microarrays/experiment/experiment1234
…}, "_cache_key":"experiment_1234" } }…and save (cache) the result of this filter under the name “experiment_1234″ so we can use it again later without loading the document.
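To see the cache key pay off, a later request can reuse exactly the same lookup block and "_cache_key", and the filter result is served from the cache instead of re-fetching the lookup document. Here is a sketch of such a follow-up search; the "match" query on the title is purely illustrative and not part of the original example:
curl -XGET localhost:9200/papers/paper/_search -d '{
  "query" : {
    "filtered" : {
      "query" : {
        "match" : { "title" : "ciliogenesis" }
      },
      "filter" : {
        "terms" : {
          "proteins" : {
            "index" : "microarrays",
            "type" : "experiment",
            "id" : "experiment1234",
            "path" : "upregulated_proteins"
          },
          "_cache_key" : "experiment_1234"
        }
      }
    }
  }
}'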
More than just convenience
The new Lookup feature is certainly useful. But it offers more than just convenience: there are tangible performance benefits.
- Eliminates the need for extraneous round-trips, which removes network latency
- Caches the result, removing the need to even load the term lookup document on subsequent requests
- Supports a custom cache key name, which reduces cache memory usage. Terms Filters usually construct cache-key names by concatenating the list of terms together, so a 1,000-term cache key is a very long string that uses a lot of memory (see the settings sketch just after this list).
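If the terms lookup cache does become a memory concern, it can be tuned at the node level. The setting below is how I remember it from the 0.90 Terms Filter documentation, so treat it as a pointer and confirm the name and default against the docs for your version:
# elasticsearch.yml: cap the dedicated terms lookup cache (the docs list a 10mb default)
indices.cache.filter.terms.size: 10mb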
Performance can be boosted even more if the Lookup index (microarrays in this example) is fully replicated to each node. The Lookup will prefer shards that are local, removing the need to query other nodes to fetch the lookup document. While inter-node latency is usually pretty low, zero latency is always faster.
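One way to get that full replication on a small lookup index is to let Elasticsearch expand the replica count to every node. The sketch below uses the index update-settings API with auto_expand_replicas; whether that fits your cluster is an assumption on my part, so adjust it (or simply raise number_of_replicas) as appropriate:
curl -XPUT localhost:9200/microarrays/_settings -d '{
  "index" : {
    "auto_expand_replicas" : "0-all"
  }
}'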
Conclusion
This is just one example of using the new Lookup feature. Lookups are predominantly used to boost performance when filtering large Term lists. Check out the Terms Filter documentation for more details about settings (adjusting cache memory, etc.) as well as the standard Twitter example.