Pattern tokenizer
The pattern tokenizer uses a regular expression either to split text into
terms whenever it matches a word separator, or to capture matching text as
terms.
The default pattern is \W+
, which splits text whenever it encounters
non-word characters.
Beware of Pathological Regular Expressions
The pattern tokenizer uses Java Regular Expressions.
A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.
Read more about pathological regular expressions and how to avoid them.
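As an illustration (not taken from the reference itself), nested quantifiers are a classic source of this kind of pathological backtracking. A minimal Ruby sketch, assuming a backtracking regex engine such as Java's `java.util.regex` or older Rubies:

```ruby
# /(a+)+b/ and /a+b/ match exactly the same strings, but the nested
# quantifier in the first pattern lets a backtracking engine try
# exponentially many ways of grouping the 'a's when the overall match fails.
nested = /(a+)+b/   # pathological on non-matching input like "aaa...c"
flat   = /a+b/      # equivalent and safe: only one way to match the 'a's

# Safe to run: both patterns agree on short inputs.
puts nested.match?("aaab")                     # true
puts flat.match?("aaab")                       # true
puts nested.match?("ab") == flat.match?("ab")  # true

# Do NOT feed the nested pattern a long non-matching input such as
# "a" * 40 + "c" on an engine without backtracking memoization --
# it may effectively never finish.
```

Rewriting the pattern so that each input character can be consumed in only one way is the usual fix.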
Example output
response = client.indices.analyze(
  body: {
    tokenizer: 'pattern',
    text: "The foo_bar_size's default is 5."
  }
)
puts response
POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
The above sentence would produce the following terms:
[ The, foo_bar_size, s, default, is, 5 ]
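Because the default mode is a plain regex split, the result can be approximated locally with Ruby's `String#split` (a sketch; Ruby's `\W` differs from Java's only in corner cases such as Unicode word characters):

```ruby
# Split on runs of non-word characters, as the default \W+ pattern does.
terms = "The foo_bar_size's default is 5.".split(/\W+/)
p terms  # => ["The", "foo_bar_size", "s", "default", "is", "5"]
```

Note how the apostrophe in `foo_bar_size's` is itself a non-word character, which is why `s` comes out as a separate term.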
Configuration
The pattern tokenizer accepts the following parameters:
pattern

    A Java regular expression, defaults to \W+.

flags

    Java regular expression flags.
    Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS".

group

    Which capture group to extract as tokens. Defaults to -1 (split).
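The difference between split mode and capture-group mode can be sketched with plain Ruby regex operations (an analogy, not the tokenizer's actual implementation; the text and pattern below are made up for illustration):

```ruby
text = "key=one;key=two"

# group: -1 (the default): the pattern marks the separators, and the
# text between matches becomes the tokens (like String#split).
p text.split(/;/)                 # => ["key=one", "key=two"]

# group: 1: the pattern is matched against the text and only the first
# capture group of each match is kept as a token (like String#scan).
p text.scan(/key=(\w+)/).flatten  # => ["one", "two"]
```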
Example configuration
In this example, we configure the pattern tokenizer to break text into
tokens when it encounters commas:
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'pattern',
            pattern: ','
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: 'comma,separated,values'
  }
)
puts response
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
The above example produces the following terms:
[ comma, separated, values ]
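The same split can be checked locally (a sketch substituting Ruby's regex engine for Java's, which is identical here since the pattern is a literal comma):

```ruby
# A literal comma as the split pattern, mirroring the tokenizer config.
p 'comma,separated,values'.split(/,/)  # => ["comma", "separated", "values"]
```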
In the next example, we configure the pattern
tokenizer to capture values
enclosed in double quotes (ignoring embedded escaped quotes \"
). The regex
itself looks like this:
"((?:\\"|[^"]|\\")*)"
And reads as follows:
- A literal "
- Start capturing:
  - A literal \" OR any character except "
  - Repeat until no more characters match
- A literal closing "
When the pattern is specified in JSON, the "
and \
characters need to be
escaped, so the pattern ends up looking like:
\"((?:\\\\\"|[^\"]|\\\\\")+)\"
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'pattern',
            pattern: '"((?:\\\"|[^"]|\\\")+)"',
            group: 1
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: '"value", "value with embedded \" quote"'
  }
)
puts response
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}
The above example produces the following two terms:
[ value, value with embedded \" quote ]
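The capture-group behaviour can be reproduced locally with `String#scan`, which, like `group: 1`, keeps only the first capture group of each match. This is a sketch using Ruby's regex syntax with the same pattern as above; Ruby and Java agree on this construct:

```ruby
# A quote, then a capture of (escaped quote \" OR any non-quote
# character) repeated, then a closing quote.
pattern = /"((?:\\"|[^"]|\\")+)"/

# %q keeps the backslash before the inner quote literal.
text = %q{"value", "value with embedded \" quote"}

# scan returns one entry per match, each holding capture group 1:
# the tokens value and value with embedded \" quote.
p text.scan(pattern).flatten
```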