Pattern replace token filter
Uses a regular expression to match and replace token substrings.
The pattern_replace filter uses Java’s regular expression syntax. By default, the filter replaces matching substrings with an empty substring (""). Replacement substrings can use Java’s $g syntax, where g is a capture group number (for example, $1 for the first group), to reference capture groups from the original token text.
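As a rough illustration of the $g syntax, the following Java sketch applies replacement strings containing group references using the standard java.util.regex classes. The class name CaptureGroupSketch and the helper rewrite are hypothetical names chosen for this example; the sketch assumes the filter resolves $g the same way java.util.regex.Matcher does, and it is not the filter's own code.

```java
import java.util.regex.Pattern;

class CaptureGroupSketch {
    // Hypothetical helper: replaces every match of regex in token,
    // resolving $1, $2, ... in replacement to the matched capture groups.
    static String rewrite(String token, String regex, String replacement) {
        return Pattern.compile(regex).matcher(token).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // $2 and $1 swap the two captured words.
        System.out.println(rewrite("lazy-dogs", "(\\w+)-(\\w+)", "$2-$1")); // dogs-lazy
        // $1 re-inserts the captured text after a literal prefix.
        System.out.println(rewrite("dogs", "(dog)", "watch$1"));            // watchdogs
    }
}
```

To emit a literal dollar sign in a Java replacement string, escape it as \$ (written "\\$" in Java source).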
A poorly written regular expression may run slowly or throw a StackOverflowError, causing the node running the expression to exit suddenly.
Read more about pathological regular expressions and how to avoid them.
This filter uses Lucene’s PatternReplaceFilter.
Example
The following analyze API request uses the pattern_replace filter to prepend watch to the substring dog in foxes jump lazy dogs.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(dog)",
      "replacement": "watch$1"
    }
  ],
  "text": "foxes jump lazy dogs"
}
The filter produces the following tokens.
[ foxes, jump, lazy, watchdogs ]
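To see why only the last token changes, the request can be approximated in plain Java. This is a minimal sketch, assuming whitespace tokenization is equivalent to splitting on runs of whitespace and that the filter is applied to each token independently; AnalyzeSketch and analyze are hypothetical names, and no Elasticsearch or Lucene classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AnalyzeSketch {
    // Approximates the analyze request: split on whitespace,
    // then apply the (dog) -> watch$1 replacement to each token.
    static List<String> analyze(String text) {
        Pattern pattern = Pattern.compile("(dog)");
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            tokens.add(pattern.matcher(token).replaceAll("watch$1"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Only "dogs" contains the substring "dog", so only it is rewritten.
        System.out.println(analyze("foxes jump lazy dogs")); // [foxes, jump, lazy, watchdogs]
    }
}
```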
Configurable parameters
all
    (Optional, Boolean) If true, all substrings matching the pattern parameter’s regular expression are replaced. If false, the filter replaces only the first matching substring in each token. Defaults to true.
pattern
    (Required, string) Regular expression, written in Java’s regular expression syntax. The filter replaces token substrings matching this pattern with the substring in the replacement parameter.
replacement
    (Optional, string) Replacement substring. Defaults to an empty substring ("").
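The effect of the all parameter can be sketched with the two replacement methods on java.util.regex.Matcher. The sketch below assumes that all set to true corresponds to replaceAll and all set to false corresponds to replaceFirst; AllParameterSketch and apply are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllParameterSketch {
    // Hypothetical helper mirroring the three parameters described above.
    static String apply(String token, String regex, String replacement, boolean all) {
        Matcher matcher = Pattern.compile(regex).matcher(token);
        return all ? matcher.replaceAll(replacement)     // every match in the token
                   : matcher.replaceFirst(replacement);  // first match only
    }

    public static void main(String[] args) {
        System.out.println(apply("a-b-c", "-", "_", true));  // a_b_c
        System.out.println(apply("a-b-c", "-", "_", false)); // a_b-c
        // An empty replacement, the default, removes the matches.
        System.out.println(apply("a-b-c", "-", "", true));   // abc
    }
}
```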
Customize and add to an analyzer
To customize the pattern_replace filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
The following create index API request configures a new custom analyzer using a custom pattern_replace filter, my_pattern_replace_filter.
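The my_pattern_replace_filter filter uses the regular expression [£|€] to match and remove the currency symbols £ and €. Inside a character class, | is a literal character rather than an alternation operator, so this pattern also matches a literal |; the class [£€] would match only the two currency symbols. The filter’s all parameter is false, meaning only the first matching symbol in each token is removed.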
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "my_pattern_replace_filter"
          ]
        }
      },
      "filter": {
        "my_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "[£|€]",
          "replacement": "",
          "all": false
        }
      }
    }
  }
}
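The behavior of this configuration on a single token can be approximated in plain Java. This is a sketch under two assumptions: the keyword tokenizer emits the whole input as one token, and all set to false behaves like Matcher.replaceFirst. CurrencyFilterSketch and stripFirstSymbol are hypothetical names; the currency symbols are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

class CurrencyFilterSketch {
    // Same character class as the request above: [£|€]
    // (\u00A3 is the pound sign, \u20AC is the euro sign).
    private static final Pattern SYMBOLS = Pattern.compile("[\u00A3|\u20AC]");

    // Removes only the first matching character, as with "all": false.
    static String stripFirstSymbol(String token) {
        return SYMBOLS.matcher(token).replaceFirst("");
    }

    public static void main(String[] args) {
        // The whole input is treated as one token, so only the leading
        // pound sign is removed and the later euro sign is kept.
        System.out.println(stripFirstSymbol("\u00A31 and \u20AC2"));
        // A literal | is also in the character class, so it is removed too.
        System.out.println(stripFirstSymbol("|100"));
    }
}
```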