WARNING: Version 1.5 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.

› › ›

Pattern Analyzer

edit

Pattern Analyzer

edit

An analyzer of type pattern that can flexibly separate text into terms via a regular expression. Accepts the following settings:

The following are settings that can be set for a pattern analyzer type:

`lowercase`	Should terms be lowercased or not. Defaults to `true`.
`pattern`	The regular expression pattern, defaults to `\W+`.
`flags`	The regular expression flags.
`stopwords`	A list of stopwords to initialize the stop filter with. Defaults to an empty stopword list Check Stop Analyzer for more details.

IMPORTANT: The regular expression should match the token separators, not the tokens themselves.

Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS". Check Java Pattern API for more details about flags options.

Pattern Analyzer Examples

edit

In order to try out these examples, you should delete the test index before running each example.

Whitespace tokenizer

edit

DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=whitespace&text=foo,bar baz

# "foo,bar", "baz"

Non-word character tokenizer

edit

DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nonword": {
          "type": "pattern",
          "pattern": "[^\\w]+" 
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"

GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"

CamelCase tokenizer

edit

DELETE test

PUT /test?pretty=1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"

The regex above is easier to understand as:

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )

« Keyword Analyzer Language Analyzers »