Terms Filter

edit

Filters documents that have fields that match any of the provided terms (not analyzed). For example:

{
    "constant_score" : {
        "filter" : {
            "terms" : { "user" : ["kimchy", "elasticsearch"]}
        }
    }
}

The terms filter is also aliased with in as the filter name for simpler usage.

Execution Mode

edit

The way terms filter executes is by iterating over the terms provided and finding matches docs (loading into a bitset) and caching it. Sometimes, we want a different execution model that can still be achieved by building more complex queries in the DSL, but we can support them in the more compact model that terms filter provides.

The execution option now has the following options :

plain

The default. Works as today. Iterates over all the terms, building a bit set matching it, and filtering. The total filter is cached.

fielddata

Generates a terms filters that uses the fielddata cache to compare terms. This execution mode is great to use when filtering on a field that is already loaded into the fielddata cache from faceting, sorting, or index warmers. When filtering on a large number of terms, this execution can be considerably faster than the other modes. The total filter is not cached unless explicitly configured to do so.

bool

Generates a term filter (which is cached) for each term, and wraps those in a bool filter. The bool filter itself is not cached as it can operate very quickly on the cached term filters.

and

Generates a term filter (which is cached) for each term, and wraps those in an and filter. The and filter itself is not cached.

or

Generates a term filter (which is cached) for each term, and wraps those in an or filter. The or filter itself is not cached. Generally, the bool execution mode should be preferred.

If you don’t want the generated individual term queries to be cached, you can use: bool_nocache, and_nocache or or_nocache instead, but be aware that this will affect performance.

The "total" terms filter caching can still be explicitly controlled using the _cache option. Note the default value for it depends on the execution value.

For example:

{
    "constant_score" : {
        "filter" : {
            "terms" : {
                "user" : ["kimchy", "elasticsearch"],
                "execution" : "bool",
                "_cache": true
            }
        }
    }
}

Caching

edit

The result of the filter is automatically cached by default. The _cache can be set to false to turn it off.

Terms lookup mechanism

edit

When it’s needed to specify a terms filter with a lot of terms it can be beneficial to fetch those term values from a document in an index. A concrete example would be to filter tweets tweeted by your followers. Potentially the amount of user ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.

The terms lookup mechanism supports the following options:

index

The index to fetch the term values from. Defaults to the current index.

type

The type to fetch the term values from.

id

The id of the document to fetch the term values from.

path

The field specified as path to fetch the actual values for the terms filter.

routing

A custom routing value to be used when retrieving the external terms doc.

cache

Whether to cache the filter built from the retrieved document (true - default) or whether to fetch and rebuild the filter on every request (false). See "Terms lookup caching" below

The values for the terms filter will be fetched from a field in a document with the specified id in the specified type and index. Internally a get request is executed to fetch the values from the specified path. At the moment for this feature to work the _source needs to be stored.

Also, consider using an index with a single shard and fully replicated across all nodes if the "reference" terms data is not large. The lookup terms filter will prefer to execute the get request on a local node if possible, reducing the need for networking.

Terms lookup caching

edit

There is an additional cache involved, which caches the lookup of the lookup document to the actual terms. This lookup cache is a LRU cache. This cache has the following options:

indices.cache.filter.terms.size
The size of the lookup cache. The default is 10mb.
indices.cache.filter.terms.expire_after_access
The time after the last read an entry should expire. Disabled by default.
indices.cache.filter.terms.expire_after_write
The time after the last write an entry should expire. Disabled by default.

All options for the lookup of the documents cache can only be configured via the elasticsearch.yml file.

When using the terms lookup the execution option isn’t taken into account and behaves as if the execution mode was set to plain.

Terms lookup twitter example

edit
# index the information for user with id 2, specifically, its followers
curl -XPUT localhost:9200/users/user/2 -d '{
   "followers" : ["1", "3"]
}'

# index a tweet, from user with id 2
curl -XPUT localhost:9200/tweets/tweet/1 -d '{
   "user" : "2"
}'

# search on all the tweets that match the followers of user 2
curl -XGET localhost:9200/tweets/_search -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "terms" : {
          "user" : {
            "index" : "users",
            "type" : "user",
            "id" : "2",
            "path" : "followers"
          },
          "_cache_key" : "user_2_friends"
        }
      }
    }
  }
}'

The above is highly optimized, both in a sense that the list of followers will not be fetched if the filter is already cached in the filter cache, and with internal LRU cache for fetching external values for the terms filter. Also, the entry in the filter cache will not hold all the terms reducing the memory required for it.

_cache_key is recommended to be set, so its simple to clear the cache associated with it using the clear cache API. For example:

curl -XPOST 'localhost:9200/tweets/_cache/clear?filter_keys=user_2_friends'

The structure of the external terms document can also include array of inner objects, for example:

curl -XPUT localhost:9200/users/user/2 -d '{
 "followers" : [
   {
     "id" : "1"
   },
   {
     "id" : "2"
   }
 ]
}'

In which case, the lookup path will be followers.id.