Pattern replace token filter

edit

Uses a regular expression to match and replace token substrings.

The pattern_replace filter uses Java’s regular expression syntax. By default, the filter replaces matching substrings with an empty substring ("").

Replacement substrings can use Java’s $g syntax to reference capture groups from the original token text.

A poorly-written regular expression may run slowly or return a StackOverflowError, causing the node running the expression to exit suddenly.

Read more about pathological regular expressions and how to avoid them.

This filter uses Lucene’s PatternReplaceFilter.

Example

edit

The following analyze API request uses the pattern_replace filter to prepend watch to the substring dog in foxes jump lazy dogs.

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(dog)",
      "replacement": "watch$1"
    }
  ],
  "text": "foxes jump lazy dogs"
}

The filter produces the following tokens.

[ foxes, jump, lazy, watchdogs ]

Configurable parameters

edit
all
(Optional, boolean) If true, all substrings matching the pattern parameter’s regular expression are replaced. If false, the filter replaces only the first matching substring in each token. Defaults to true.
pattern
(Required, string) Regular expression, written in Java’s regular expression syntax. The filter replaces token substrings matching this pattern with the substring in the replacement parameter.
replacement
(Optional, string) Replacement substring. Defaults to an empty substring ("").

Customize and add to an analyzer

edit

To customize the pattern_replace filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

The following create index API request configures a new custom analyzer using a custom pattern_replace filter, my_pattern_replace_filter.

The my_pattern_replace_filter filter uses the regular expression [£|€] to match and remove the currency symbols £ and . The filter’s all parameter is false, meaning only the first matching symbol in each token is removed.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "my_pattern_replace_filter"
          ]
        }
      },
      "filter": {
        "my_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "[£|€]",
          "replacement": "",
          "all": false
        }
      }
    }
  }
}