Common Grams Token Filter
Token filter that generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the Stop Token Filter when you don't want to completely ignore common terms.
For example, assuming "the", "is" and "a" are common words, the text "the quick brown is a fox" will be tokenized as "the", "the_quick", "quick", "brown", "brown_is", "is", "is_a", "a", "a_fox", "fox".
When `query_mode` is enabled, the token filter removes common words and single terms followed by a common word. This parameter should be enabled in the search analyzer. For example, the query "the quick brown is a fox" will be tokenized as "the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".
The following settings can be configured:

| Setting | Description |
|---|---|
| `common_words` | A list of common words to use. |
| `common_words_path` | A path (either relative to the `config` location, or absolute) to a file containing a list of common words, one word per line. |
| `ignore_case` | If true, common words matching will be case insensitive (defaults to `false`). |
| `query_mode` | Generates bigrams then removes common words and single terms followed by a common word (defaults to `false`). |

Note that either the `common_words` or the `common_words_path` setting is required.
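For instance, a filter that loads its word list from a file and matches case-insensitively might be configured as follows (the index name and file path here are illustrative; the path is resolved relative to the `config` directory unless absolute):

```json
PUT /common_grams_file_example
{
  "settings": {
    "analysis": {
      "filter": {
        "common_grams_from_file": {
          "type": "common_grams",
          "common_words_path": "analysis/common_words.txt",
          "ignore_case": true
        }
      }
    }
  }
}
```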
Here is an example:

```json
PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams"]
        },
        "search_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams_query"]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": ["the", "is", "a"]
        },
        "common_grams_query": {
          "type": "common_grams",
          "query_mode": true,
          "common_words": ["the", "is", "a"]
        }
      }
    }
  }
}
```
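In practice, the two analyzers are typically paired on a text field, so that bigrams are generated at index time and `query_mode` pruning is applied at search time. A minimal sketch, assuming the index above and a hypothetical `body` field:

```json
PUT /common_grams_example/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "analyzer": "index_grams",
      "search_analyzer": "search_grams"
    }
  }
}
```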
You can see the output by using e.g. the `_analyze` endpoint:

```json
POST /common_grams_example/_analyze
{
  "analyzer" : "index_grams",
  "text" : "the quick brown is a fox"
}
```
And the response will be:

```json
{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },
    { "token" : "the_quick", "start_offset" : 0, "end_offset" : 9, "type" : "gram", "position" : 0, "positionLength" : 2 },
    { "token" : "quick", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 1 },
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "word", "position" : 2 },
    { "token" : "brown_is", "start_offset" : 10, "end_offset" : 18, "type" : "gram", "position" : 2, "positionLength" : 2 },
    { "token" : "is", "start_offset" : 16, "end_offset" : 18, "type" : "word", "position" : 3 },
    { "token" : "is_a", "start_offset" : 16, "end_offset" : 20, "type" : "gram", "position" : 3, "positionLength" : 2 },
    { "token" : "a", "start_offset" : 19, "end_offset" : 20, "type" : "word", "position" : 4 },
    { "token" : "a_fox", "start_offset" : 19, "end_offset" : 24, "type" : "gram", "position" : 4, "positionLength" : 2 },
    { "token" : "fox", "start_offset" : 21, "end_offset" : 24, "type" : "word", "position" : 5 }
  ]
}
```