Keyword repeat token filter

Outputs a keyword version of each token in a stream. These keyword tokens are not stemmed.

The keyword_repeat filter assigns keyword tokens a keyword attribute of true. Stemmer token filters, such as stemmer or porter_stem, skip tokens with a keyword attribute of true.

You can use the keyword_repeat filter with a stemmer token filter to output a stemmed and unstemmed version of each token in a stream.

To work properly, the keyword_repeat filter must be listed before any stemmer token filters in the analyzer configuration.

Stemming does not affect all tokens. This means streams could contain duplicate tokens in the same position, even after stemming.

To remove these duplicate tokens, add the remove_duplicates filter after the stemmer filter in the analyzer configuration.
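
Taken together, these rules fix the relative order of the three filters: keyword_repeat first, then a stemmer, then remove_duplicates. The following sketch shows only that order; a complete analyzer configuration appears in the Add to an analyzer section below.

"filter": [
  "keyword_repeat",
  "stemmer",
  "remove_duplicates"
]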

The keyword_repeat filter uses Lucene’s KeywordRepeatFilter.

Example

The following analyze API request uses the keyword_repeat filter to output a keyword and non-keyword version of each token in fox running and jumping.

To return the keyword attribute for these tokens, the analyze API request also includes the following arguments:

  • explain: true
  • attributes: keyword

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}

The API returns the following response. Note that one version of each token has a keyword attribute of true.

Response
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": ...,
    "tokenfilters": [
      {
        "name": "keyword_repeat",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": true
          },
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": true
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": true
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": false
          }
        ]
      }
    ]
  }
}

To stem the non-keyword tokens, add the stemmer filter after the keyword_repeat filter in the previous analyze API request.

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}

The API returns the following response. Note the following changes:

  • The non-keyword version of running was stemmed to run.
  • The non-keyword version of jumping was stemmed to jump.

Response
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": ...,
    "tokenfilters": [
      {
        "name": "keyword_repeat",
        "tokens": ...
      },
      {
        "name": "stemmer",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": true
          },
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": true
          },
          {
            "token": "run",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": true
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          },
          {
            "token": "jump",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": false
          }
        ]
      }
    ]
  }
}

However, the keyword and non-keyword versions of the fox and and tokens are identical and occupy the same positions.

To remove these duplicate tokens, add the remove_duplicates filter after stemmer in the analyze API request.

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer",
    "remove_duplicates"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}

The API returns the following response. Note that the duplicate tokens for fox and and have been removed.

Response
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": ...,
    "tokenfilters": [
      {
        "name": "keyword_repeat",
        "tokens": ...
      },
      {
        "name": "stemmer",
        "tokens": ...
      },
      {
        "name": "remove_duplicates",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": true
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": true
          },
          {
            "token": "run",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": true
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          },
          {
            "token": "jump",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": false
          }
        ]
      }
    ]
  }
}

Add to an analyzer

The following create index API request uses the keyword_repeat filter to configure a new custom analyzer.

This custom analyzer uses the keyword_repeat and porter_stem filters to create a stemmed and unstemmed version of each token in a stream. The remove_duplicates filter then removes any duplicate tokens from the stream.

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "porter_stem",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}
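
To verify the new analyzer's output, you can run the analyze API against the index. The request below reuses the example text from this page; any text works.

GET /my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "fox running and jumping"
}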