IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Common grams token filter

Generates bigrams for a specified set of common words.

For example, you can specify is and the as common words. This filter then converts the tokens [the, quick, fox, is, brown] to [the, the_quick, quick, fox, fox_is, is, is_brown, brown].

You can use the common_grams filter in place of the stop token filter when you don’t want to completely ignore common words.

This filter uses Lucene’s CommonGramsFilter.
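The transformation described above can be approximated in a few lines of plain Ruby. This is a simplified sketch for illustration only, not Lucene's CommonGramsFilter: the helper name common_grams is hypothetical, and the `_` separator is taken from the example output on this page.

```ruby
# Simplified sketch of the bigram generation described above.
# `common_grams` is a hypothetical helper, not part of any client library.
def common_grams(tokens, common_words, separator: '_')
  output = []
  tokens.each_with_index do |token, i|
    # Every original token is kept as a unigram.
    output << token
    following = tokens[i + 1]
    next if following.nil?

    # A bigram is added when either word of the pair is a common word.
    if common_words.include?(token) || common_words.include?(following)
      output << "#{token}#{separator}#{following}"
    end
  end
  output
end

p common_grams(%w[the quick fox is brown], %w[is the])
# prints ["the", "the_quick", "quick", "fox", "fox_is", "is", "is_brown", "brown"]
```

Note that quick and fox do not form a bigram because neither is a common word.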

Example

The following analyze API request creates bigrams for is and the:

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'common_grams',
        common_words: [
          'is',
          'the'
        ]
      }
    ],
    text: 'the quick fox is brown'
  }
)
puts response

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [
    {
      "type": "common_grams",
      "common_words": ["is", "the"]
    }
  ],
  "text" : "the quick fox is brown"
}

The filter produces the following tokens:

[ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]

Add to an analyzer

The following create index API request uses the common_grams filter to configure a new custom analyzer:

response = client.indices.create(
  index: 'common_grams_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          index_grams: {
            tokenizer: 'whitespace',
            filter: [
              'common_grams'
            ]
          }
        },
        filter: {
          common_grams: {
            type: 'common_grams',
            common_words: [
              'a',
              'is',
              'the'
            ]
          }
        }
      }
    }
  }
)
puts response

PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "a", "is", "the" ]
        }
      }
    }
  }
}

Configurable parameters

common_words

(Required*, array of strings) A list of tokens. The filter generates bigrams for these tokens.

Either this or the common_words_path parameter is required.

common_words_path

(Required*, string) Path to a file containing a list of tokens. The filter generates bigrams for these tokens.

This path must be absolute or relative to the config location. The file must be UTF-8 encoded. Each token in the file must be separated by a line break.

Either this or the common_words parameter is required.
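As a rough illustration of the file requirements above, the following Ruby sketch writes a UTF-8 word list with one token per line and builds a filter definition that references it. The helper names and the file name common_words.txt are hypothetical, and a temporary directory stands in for the config location.

```ruby
require 'tmpdir'

# Hypothetical helper: writes one token per line, UTF-8 encoded.
def write_common_words_file(path, words)
  File.write(path, words.join("\n") + "\n", encoding: 'UTF-8')
end

# Hypothetical helper: a filter definition that points at a file
# instead of listing the tokens inline with common_words.
def common_grams_filter_from_path(path)
  { type: 'common_grams', common_words_path: path }
end

# A temporary directory stands in for the config location in this sketch.
Dir.mktmpdir do |config_dir|
  path = File.join(config_dir, 'common_words.txt')
  write_common_words_file(path, %w[a is the])
  puts File.read(path, encoding: 'UTF-8')
  # The path is given relative to the config location.
  p common_grams_filter_from_path('common_words.txt')
end
```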

ignore_case

(Optional, Boolean) If true, matching for common words is case-insensitive. Defaults to false.
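The effect of ignore_case can be pictured with a small Ruby predicate. This is only a sketch of the matching rule; common_word? is a hypothetical helper and does not reflect how the filter is implemented internally.

```ruby
# Hypothetical helper showing case-sensitive versus case-insensitive
# matching of a token against the list of common words.
def common_word?(token, common_words, ignore_case: false)
  return common_words.include?(token) unless ignore_case

  common_words.any? { |word| word.casecmp?(token) }
end

p common_word?('The', %w[is the])                    # prints false
p common_word?('The', %w[is the], ignore_case: true) # prints true
```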

query_mode

(Optional, Boolean) If true, the filter excludes the following tokens from the output:

Unigrams for common words
Unigrams for terms followed by common words

Defaults to false. We recommend enabling this parameter for search analyzers.

For example, you can enable this parameter and specify is and the as common words. This filter converts the tokens [the, quick, fox, is, brown] to [the_quick, quick, fox_is, is_brown].
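The query_mode behavior can be sketched in plain Ruby as a second pass over the regular output. This is a simplified illustration and not Lucene's implementation: common_grams_query is a hypothetical helper, and the handling of the final token in step 3 is an assumption chosen so that the sketch reproduces the example in this section.

```ruby
# Simplified sketch of query_mode; `common_grams_query` is hypothetical.
def common_grams_query(tokens, common_words, separator: '_')
  # Step 1: build the regular output, tagging each token as :word or :gram.
  stream = []
  tokens.each_with_index do |token, i|
    stream << [token, :word]
    following = tokens[i + 1]
    next if following.nil?

    if common_words.include?(token) || common_words.include?(following)
      stream << ["#{token}#{separator}#{following}", :gram]
    end
  end

  # Step 2: hold each token back until the next one is seen. A bigram
  # replaces the held unigram it starts with, so that unigram is dropped.
  output = []
  pending = nil
  last_emitted_type = nil
  stream.each do |entry|
    if pending && entry[1] == :word
      output << pending[0]
      last_emitted_type = pending[1]
    end
    pending = entry
  end

  # Step 3 (assumption): keep the final token unless the last emitted
  # token was a bigram, which already covers it.
  output << pending[0] if pending && last_emitted_type != :gram
  output
end

p common_grams_query(%w[the quick fox is brown], %w[is the])
# prints ["the_quick", "quick", "fox_is", "is_brown"]
```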

Customize

To customize the common_grams filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom common_grams filter with ignore_case and query_mode set to true:

PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams_query" ]
        }
      },
      "filter": {
        "common_grams_query": {
          "type": "common_grams",
          "common_words": [ "a", "is", "the" ],
          "ignore_case": true,
          "query_mode": true
        }
      }
    }
  }
}
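For the Ruby client, the same request body can be written as a hash in the style of the earlier examples. The sketch below only builds and prints the body so that it runs without a cluster; the method name custom_common_grams_body is hypothetical, and the client object in the comment is assumed to be configured elsewhere.

```ruby
require 'json'

# Hypothetical helper returning the request body for the custom filter,
# mirroring the console request above.
def custom_common_grams_body
  {
    settings: {
      analysis: {
        analyzer: {
          index_grams: {
            tokenizer: 'whitespace',
            filter: ['common_grams_query']
          }
        },
        filter: {
          common_grams_query: {
            type: 'common_grams',
            common_words: %w[a is the],
            ignore_case: true,
            query_mode: true
          }
        }
      }
    }
  }
end

# With a configured client (assumed, not created here), the call would
# follow the earlier examples on this page:
#   response = client.indices.create(index: 'common_grams_example', body: custom_common_grams_body)
puts JSON.pretty_generate(custom_common_grams_body)
```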
