WARNING: Version 1.3 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Pattern Analyzer
An analyzer of type pattern that can flexibly separate text into terms via a regular expression. The following settings can be set for a pattern analyzer type:
Setting | Description
---|---
lowercase | Should terms be lowercased or not. Defaults to true.
pattern | The regular expression pattern, defaults to \W+.
flags | The regular expression flags.
stopwords | A list of stopwords to initialize the stop filter with. Defaults to an empty stopword list. [1.0.0.RC1] Added in 1.0.0.RC1. Previously defaulted to the English stopwords list. Check Stop Analyzer for more details.
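For illustration only (this sketch is not one of the examples further down; the analyzer name comma_split, the pattern and the stopword list are made up), the settings above could be combined into a single analyzer that splits on commas, lowercases terms, and drops a custom stopword:

curl -XDELETE localhost:9200/test

curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "comma_split":{
                    "type": "pattern",
                    "pattern": "\\s*,\\s*",
                    "lowercase": true,
                    "stopwords": ["and"]
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=comma_split' -d 'Foo, and, Bar'
# "foo", "bar"    <- terms are lowercased, "and" is removed by the stop filter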
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS". Check Java Pattern API for more details about flags options.
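As a hedged sketch of the flags setting (the analyzer name and_split, the pattern and the sample text are illustrative, not taken from this page), two pipe-separated flags can be passed to the underlying Java Pattern so that " AND " is treated as a token separator regardless of case:

curl -XDELETE localhost:9200/test

curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "and_split":{
                    "type": "pattern",
                    "pattern": "\\s+and\\s+|\\s*,\\s*",
                    "flags": "CASE_INSENSITIVE|UNICODE_CASE"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=and_split' -d 'foo AND bar, baz'
# "foo", "bar", "baz"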
Pattern Analyzer Examples
In order to try out these examples, you should delete the test index before running each example:
curl -XDELETE localhost:9200/test
Whitespace tokenizer
curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "whitespace":{
                    "type": "pattern",
                    "pattern": "\\s+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
# "foo,bar", "baz"
Non-word character tokenizer
curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "nonword":{
                    "type": "pattern",
                    "pattern": "[^\\w]+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
# "foo,bar baz" becomes "foo", "bar", "baz"

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
# "type_1", "type_4"
CamelCase tokenizer
curl -XPUT 'localhost:9200/test?pretty=1' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "camel":{
                    "type": "pattern",
                    "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
    MooseX::FTPClass2_beta
'
# "moose", "x", "ftp", "class", "2", "beta"
The regex above is easier to understand as:
  ([^\\p{L}\\d]+)                   # swallow non letters and numbers,
| (?<=\\D)(?=\\d)                   # or non-number followed by number,
| (?<=\\d)(?=\\D)                   # or number followed by non-number,
| (?<=[ \\p{L} && [^\\p{Lu}]])      # or lower case
  (?=\\p{Lu})                       #   followed by upper case,
| (?<=\\p{Lu})                      # or upper case
  (?=\\p{Lu}                        #   followed by upper case
    [\\p{L}&&[^\\p{Lu}]]            #   then lower case
  )