WARNING: Version 2.3 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.


Pattern Tokenizer


A tokenizer of type `pattern` that splits text into terms using a regular expression. It accepts the following settings:

Setting	Description
`pattern`	The regular expression pattern, defaults to `\W+`.
`flags`	The Java regular expression flags, pipe-separated, for example `CASE_INSENSITIVE|COMMENTS`.
`group`	Which group to extract into tokens. Defaults to `-1` (split).

IMPORTANT: With the default `group` of `-1`, the regular expression should match the token separators, not the tokens themselves.
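
As a rough illustration of how these settings fit together, the sketch below builds an index settings body that registers a pattern tokenizer and a custom analyzer that uses it. The names `comma_tokenizer` and `comma_analyzer` are hypothetical, and the comma separator pattern is an assumption chosen for the example. The script only prints the JSON body and does not contact a cluster.

```python
import json

# Hypothetical names: "comma_tokenizer" and "comma_analyzer" are illustrative only.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "comma_tokenizer": {
                    "type": "pattern",
                    # The pattern matches the separators (a comma with optional
                    # surrounding whitespace), not the tokens themselves.
                    "pattern": "\\s*,\\s*",
                    # -1 is the default and means "split on matches".
                    "group": -1,
                }
            },
            "analyzer": {
                "comma_analyzer": {
                    "type": "custom",
                    "tokenizer": "comma_tokenizer",
                }
            },
        }
    }
}

# Print the body that would be supplied when creating an index.
print(json.dumps(index_body, indent=2))
```

Because `group` is left at `-1`, the pattern here describes what sits between tokens, as the note above requires.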

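Setting `group` to `-1` (the default) is equivalent to "split": the text between matches becomes the tokens. Setting `group` to `0` or higher emits that capture group of each match as a token, where `0` is the whole match. For example, given: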

pattern = '([^']+)'
group   = 0
input   = aaa 'bbb' 'ccc'

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
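
The difference between split mode and group mode can be approximated outside Elasticsearch. The following is a minimal sketch using Python's `re` module, not the tokenizer's actual implementation: the function `pattern_tokenize` is hypothetical, and Elasticsearch uses Java regular expressions, whose syntax and flags differ from Python's in some details. Dropping empty tokens is also an assumption of this sketch.

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Approximate the `group` setting of a pattern tokenizer (illustrative only)."""
    regex = re.compile(pattern)
    if group == -1:
        # Split mode: matches are separators, tokens are the text between them.
        tokens = []
        start = 0
        for match in regex.finditer(text):
            tokens.append(text[start:match.start()])
            start = match.end()
        tokens.append(text[start:])
    else:
        # Group mode: each match contributes the requested capture group.
        tokens = [match.group(group) for match in regex.finditer(text)]
    # Assumption: empty (or non-participating) tokens are dropped.
    return [token for token in tokens if token]


text = "aaa 'bbb' 'ccc'"
print(pattern_tokenize(text, r"'([^']+)'", group=0))  # ["'bbb'", "'ccc'"]
print(pattern_tokenize(text, r"'([^']+)'", group=1))  # ['bbb', 'ccc']
print(pattern_tokenize(text, r"\W+"))                 # ['aaa', 'bbb', 'ccc']
```

The first two calls reproduce the example above. The third uses the default `\W+` pattern in split mode, where the quotes and spaces act as separators.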
