IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

« Path hierarchy tokenizer Simple pattern tokenizer »

› › ›

Pattern tokenizer

edit

IMPORTANT: This documentation is no longer updated. Refer to Elastic's version policy and the latest documentation.

Pattern tokenizer

edit

The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

The default pattern is \W+, which splits text whenever it encounters non-word characters.

Beware of Pathological Regular Expressions

The pattern tokenizer uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Example output

edit

POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}

Copy as curl Try in Elastic

The above sentence would produce the following terms:

[ The, foo_bar_size, s, default, is, 5 ]

Configuration

edit

The pattern tokenizer accepts the following parameters:

`pattern`	A Java regular expression, defaults to `\W+`.
`flags`	Java regular expression flags. Flags should be pipe-separated, eg `"CASE_INSENSITIVE\|COMMENTS"`.
`group`	Which capture group to extract as tokens. Defaults to `-1` (split).

Example configuration

edit

In this example, we configure the pattern tokenizer to break text into tokens when it encounters commas:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}

Copy as curl Try in Elastic

The above example produces the following terms:

[ comma, separated, values ]

In the next example, we configure the pattern tokenizer to capture values enclosed in double quotes (ignoring embedded escaped quotes \"). The regex itself looks like this:

"((?:\\"|[^"]|\\")*)"

And reads as follows:

A literal "
Start capturing:
- A literal \" OR any character except "
- Repeat until no more characters match
A literal closing "

When the pattern is specified in JSON, the " and \ characters need to be escaped, so the pattern ends up looking like:

\"((?:\\\\\"|[^\"]|\\\\\")+)\"

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}

Copy as curl Try in Elastic

The above example produces the following two terms:

[ value, value with embedded \" quote ]

« Path hierarchy tokenizer Simple pattern tokenizer »

Was this helpful?

Feedback

The Search AI Company

ELK Stack

Elastic Cloud

Generative AI

Search

Security

Observability

By solution

Industries

Customer spotlight

Research

Build

Learn

Connect

Pattern tokenizer

Pattern tokenizer

Example output

Configuration

Example configuration

Follow us

About us

Join us

Partners

Trust & Security

Investor relations

Excellence Awards

About us

Join us

Partners

Trust & Security

Investor relations

Excellence Awards