WARNING: Version 1.6 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.


Pattern Tokenizer

A tokenizer of type `pattern` that uses a regular expression to split text into terms. It accepts the following settings:

Setting	Description
`pattern`	The regular expression pattern, defaults to `\W+`.
`flags`	The regular expression flags.
`group`	Which group to extract into tokens. Defaults to `-1` (split).
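As an illustration of how these settings might be supplied, the following minimal sketch builds an index settings body as a Python dictionary and prints it as JSON. The names `comma_tokenizer` and `comma_analyzer` are hypothetical, and the comma pattern is just an example value; only `type`, `pattern`, and `group` come from the table above.

```python
import json

# Hypothetical index settings: a pattern tokenizer that splits on commas,
# wired into a custom analyzer. The names "comma_tokenizer" and
# "comma_analyzer" are made up for this example.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "comma_tokenizer": {
                    "type": "pattern",  # tokenizer type described on this page
                    "pattern": ",",     # regex matching the token separators
                    "group": -1,        # -1 (the default) means "split"
                }
            },
            "analyzer": {
                "comma_analyzer": {
                    "type": "custom",
                    "tokenizer": "comma_tokenizer",
                }
            },
        }
    }
}

# Serialize to the JSON body that would be sent when creating the index.
print(json.dumps(index_settings, indent=2))
```

Omitting `group` here should behave the same way, since `-1` is the documented default.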

IMPORTANT: With the default `group` of `-1`, the regular expression should match the token separators, not the tokens themselves.

Setting `group` to `-1` (the default) is equivalent to "split": the text between matches becomes the tokens. Using `group` >= 0 instead emits the matching group as the token (`0` is the whole match). For example, if you have:

pattern = '([^']+)'
group   = 0
input   = aaa 'bbb' 'ccc'

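the output will be two tokens: `'bbb'` and `'ccc'` (including the `'` marks), because the single quotes are literal characters in the pattern and group 0 is the entire match. With the same input but using `group = 1`, the output would be `bbb` and `ccc` (no `'` marks).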
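The split and group behaviours described above can be approximated with a short Python sketch. This is an emulation using Python's `re` module, not the tokenizer's actual implementation; the function name `pattern_tokenize` is hypothetical, and the sketch assumes empty tokens are dropped. Python's `\W` is Unicode-aware by default, so results may differ from the real tokenizer on non-ASCII text.

```python
import re

def pattern_tokenize(text, pattern=r"\W+", group=-1, flags=0):
    """Emulate the split/group semantics of a pattern tokenizer (sketch only)."""
    regex = re.compile(pattern, flags)
    tokens = []
    if group < 0:
        # Split mode: matches are separators, tokens are the text between them.
        start = 0
        for m in regex.finditer(text):
            if m.start() > start:
                tokens.append(text[start:m.start()])
            start = m.end()
        if start < len(text):
            tokens.append(text[start:])
    else:
        # Group mode: each match contributes the selected group as a token.
        for m in regex.finditer(text):
            tok = m.group(group)
            if tok:  # skip empty or non-participating groups
                tokens.append(tok)
    return tokens

text = "aaa 'bbb' 'ccc'"
print(pattern_tokenize(text, r"'([^']+)'", group=0))  # ["'bbb'", "'ccc'"]
print(pattern_tokenize(text, r"'([^']+)'", group=1))  # ['bbb', 'ccc']
print(pattern_tokenize(text))                         # ['aaa', 'bbb', 'ccc']
```

The last call uses the default pattern `\W+` in split mode, so the quotes and spaces act as separators and `aaa` is kept as a token.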
