Keyword marker token filter

Marks specified tokens as keywords, which are not stemmed.

The keyword_marker filter assigns specified tokens a keyword attribute of true. Stemmer token filters, such as stemmer or porter_stem, skip tokens with a keyword attribute of true.

Important

To work properly, the keyword_marker filter must be listed before any stemmer token filters in the analyzer configuration.

The keyword_marker filter uses Lucene’s KeywordMarkerFilter.

To see how the keyword_marker filter works, you first need to produce a token stream containing stemmed tokens.

The following analyze API request uses the stemmer filter to create stemmed tokens for fox running and jumping.

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stemmer" ],
  "text": "fox running and jumping"
}

The request produces the following tokens. Note that running was stemmed to run and jumping was stemmed to jump.

[ fox, run, and, jump ]

To prevent jumping from being stemmed, add the keyword_marker filter before the stemmer filter in the previous analyze API request. Specify jumping in the keywords parameter of the keyword_marker filter.

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    },
    "stemmer"
  ],
  "text": "fox running and jumping"
}

The request produces the following tokens. The token running is still stemmed to run, but jumping is not stemmed.

[ fox, run, and, jumping ]
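To see why filter order matters, here is a sketch of the same request with the filters reversed. Because the stemmer runs first, jumping is stemmed to jump before the keyword_marker filter can match it:

```console
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "stemmer",
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    }
  ],
  "text": "fox running and jumping"
}
```

The reversed request produces [ fox, run, and, jump ], losing the protected token.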

To see the keyword attribute for these tokens, add the following arguments to the analyze API request:

  • explain: true
  • attributes: keyword

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    },
    "stemmer"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}

The API returns the following response. Note the jumping token has a keyword attribute of true.

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "whitespace",
      "tokens": [
        {
          "token": "fox",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 0
        },
        {
          "token": "running",
          "start_offset": 4,
          "end_offset": 11,
          "type": "word",
          "position": 1
        },
        {
          "token": "and",
          "start_offset": 12,
          "end_offset": 15,
          "type": "word",
          "position": 2
        },
        {
          "token": "jumping",
          "start_offset": 16,
          "end_offset": 23,
          "type": "word",
          "position": 3
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "__anonymous__keyword_marker",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          }
        ]
      },
      {
        "name": "stemmer",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "run",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          }
        ]
      }
    ]
  }
}

Configurable parameters

ignore_case
(Optional, Boolean) If true, matching for the keywords and keywords_path parameters ignores letter case. Defaults to false.
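As a sketch, the following request repeats the earlier example with a capitalized token. With ignore_case enabled, the keyword list matches regardless of letter case:

```console
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ],
      "ignore_case": true
    },
    "stemmer"
  ],
  "text": "fox running and Jumping"
}
```

With ignore_case set to true, Jumping matches the jumping keyword and is left unstemmed.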
keywords

(Required*, array of strings) Array of keywords. Tokens that match these keywords are not stemmed.

This parameter, keywords_path, or keywords_pattern must be specified. You cannot specify this parameter and keywords_pattern.

keywords_path
(Required*, string) Path to a file that contains a list of keywords. Tokens that match these keywords are not stemmed.

This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.

This parameter, keywords, or keywords_pattern must be specified. You cannot specify this parameter and keywords_pattern.
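For illustration, a hypothetical analysis/example_word_list.txt keyword file would contain one word per line:

```text
jumping
running
```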

keywords_pattern
(Required*, string) Java regular expression used to match tokens. Tokens that match this expression are marked as keywords and not stemmed.

This parameter, keywords, or keywords_path must be specified. You cannot specify this parameter and keywords or keywords_path.

Warning

Poorly written regular expressions can cause Elasticsearch to run slowly or result in stack overflow errors, causing the running node to suddenly exit.
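As a sketch, the following request uses a keywords_pattern that matches any token ending in ing, protecting both running and jumping from the stemmer. Note that the pattern must match the entire token:

```console
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords_pattern": ".*ing"
    },
    "stemmer"
  ],
  "text": "fox running and jumping"
}
```

The resulting token stream should keep both matching tokens intact: [ fox, running, and, jumping ].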

To customize the keyword_marker filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom keyword_marker filter and the porter_stem filter to configure a new custom analyzer.

The custom keyword_marker filter marks tokens specified in the analysis/example_word_list.txt file as keywords. The porter_stem filter does not stem these tokens.

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_custom_keyword_marker_filter",
            "porter_stem"
          ]
        }
      },
      "filter": {
        "my_custom_keyword_marker_filter": {
          "type": "keyword_marker",
          "keywords_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
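To verify the analyzer, you could run the analyze API against the new index. Assuming the hypothetical example_word_list.txt file contains jumping, that token would pass through porter_stem unchanged while running is still stemmed to run:

```console
GET /my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "fox running and jumping"
}
```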