WARNING: This documentation covers Elasticsearch 2.x. The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

Custom Analyzers


While Elasticsearch comes with a number of analyzers available out of the box, the real power comes from the ability to create your own custom analyzers by combining character filters, tokenizers, and token filters in a configuration that suits your particular data.

In Analysis and Analyzers, we said that an analyzer is a wrapper that combines three functions into a single package. These functions are executed in the following sequence:

Character filters

Character filters are used to “tidy up” a string before it is tokenized. For instance, if our text is in HTML format, it will contain HTML tags like <p> or <div> that we don’t want to be indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character Á.

An analyzer may have zero or more character filters.
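
To make the idea of a character filter concrete, here is a minimal Python sketch of the kind of cleanup described above. It is only an approximation for illustration: the helper name strip_html is hypothetical, and the real html_strip filter is implemented inside Elasticsearch and handles many more cases than this regular expression does.

```python
import html
import re

# Hypothetical stand-in for the html_strip character filter.
# Assumption: a simple regex is enough to spot tags in this toy input.
TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Remove tags, then decode entities such as &Aacute; into Unicode."""
    without_tags = TAG_RE.sub("", text)
    return html.unescape(without_tags)

print(strip_html("<p>&Aacute;rbol</p>"))
```

The important point is the ordering: the string is cleaned as a whole before any tokenizer sees it.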

Tokenizers

An analyzer must have a single tokenizer. The tokenizer breaks up the string into individual terms or tokens. The standard tokenizer, which is used in the standard analyzer, breaks up a string into individual terms on word boundaries, and removes most punctuation, but other tokenizers exist that have different behavior.

For instance, the keyword tokenizer outputs exactly the same string as it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer can be used to split text on a matching regular expression.
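
The difference between these three tokenizers can be sketched in a few lines of Python. The function names below are hypothetical and the code only imitates the behavior described above; in particular, the default pattern used for pattern_tokenize is an assumption chosen for illustration.

```python
import re

def keyword_tokenize(text):
    """Emit the whole input as a single token, unchanged."""
    return [text]

def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached."""
    return text.split()

def pattern_tokenize(text, pattern=r"\W+"):
    """Split wherever the regular expression matches, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]

sample = "Quick-brown fox, jumps!"
print(keyword_tokenize(sample))
print(whitespace_tokenize(sample))
print(pattern_tokenize(sample))
```

Note that each function returns exactly one token stream for one input string, which mirrors the rule that an analyzer has a single tokenizer.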

Token filters

After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified.

Token filters may change, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but there are many more available in Elasticsearch. Stemming token filters “stem” words to their root form. The asciifolding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete.
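
The three kinds of change (modify, remove, add) can be illustrated with a rough Python sketch. All function names here are hypothetical, and the logic only approximates what the corresponding Elasticsearch filters do: diacritic folding is imitated with Unicode decomposition, and the gram sizes are assumed values.

```python
import unicodedata

def lowercase(tokens):
    """Change tokens: normalize case."""
    return [t.lower() for t in tokens]

def stop(tokens, stopwords):
    """Remove tokens: drop anything in the stopword set."""
    return [t for t in tokens if t not in stopwords]

def fold_diacritics(tokens):
    """Change tokens: decompose characters and drop combining marks."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded

def edge_ngrams(token, min_gram=1, max_gram=3):
    """Add tokens: emit prefixes of the token, useful for autocomplete."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]

# Filters run in the order listed, each consuming the previous output.
tokens = stop(lowercase(["The", "Très", "Quick"]), {"the"})
print(fold_diacritics(tokens))
print(edge_ngrams("quick"))
```

Because each filter consumes the output of the one before it, swapping the order of lowercase and stop in this sketch would leave "The" in the stream.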

In Search in Depth, we discuss examples of where and how to use these tokenizers and filters. But first, we need to explain how to create a custom analyzer.

Creating a Custom Analyzer

In the same way as we configured the es_std analyzer previously, we can configure character filters, tokenizers, and token filters in their respective sections under analysis:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

As an example, let’s set up a custom analyzer that will do the following:

Strip out HTML by using the html_strip character filter.

Replace & characters with " and ", using a custom mapping character filter:

"char_filter": {
    "&_to_and": {
        "type":       "mapping",
        "mappings": [ "&=> and "]
    }
}

Tokenize words, using the standard tokenizer.

Lowercase terms, using the lowercase token filter.

Remove a custom list of stopwords, using a custom stop token filter:

"filter": {
    "my_stopwords": {
        "type":        "stop",
        "stopwords": [ "the", "a" ]
    }
}

Our analyzer definition combines the predefined tokenizer and filters with the custom filters that we have configured previously:

"analyzer": {
    "my_analyzer": {
        "type":           "custom",
        "char_filter":  [ "html_strip", "&_to_and" ],
        "tokenizer":      "standard",
        "filter":       [ "lowercase", "my_stopwords" ]
    }
}

To put it all together, the whole create-index request looks like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type":        "stop",
                    "stopwords": [ "the", "a" ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type":           "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":      "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
                }
            }
        }
    }
}
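
Since the analyzer refers to its components by name, a typo in one of those names is an easy mistake to make. The following Python sketch builds the same settings as a dictionary and checks that every name the analyzer references is either defined in the request or predefined. The undefined_components helper and the PREDEFINED table are hypothetical, and the table lists only the predefined components used in this example.

```python
import json

# The analysis settings from the request above, as a Python dict.
analysis = {
    "char_filter": {
        "&_to_and": {"type": "mapping", "mappings": ["&=> and "]}
    },
    "filter": {
        "my_stopwords": {"type": "stop", "stopwords": ["the", "a"]}
    },
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip", "&_to_and"],
            "tokenizer": "standard",
            "filter": ["lowercase", "my_stopwords"],
        }
    },
}

# Assumption: only the predefined components used in this example.
PREDEFINED = {
    "char_filter": {"html_strip"},
    "tokenizer": {"standard"},
    "filter": {"lowercase"},
}

def undefined_components(analysis, predefined=PREDEFINED):
    """Return (analyzer, section, name) for each reference that resolves nowhere."""
    missing = []
    for analyzer_name, analyzer in analysis.get("analyzer", {}).items():
        for section in ("char_filter", "tokenizer", "filter"):
            refs = analyzer.get(section, [])
            if isinstance(refs, str):  # "tokenizer" is a single name
                refs = [refs]
            for ref in refs:
                custom = analysis.get(section, {})
                if ref not in custom and ref not in predefined.get(section, set()):
                    missing.append((analyzer_name, section, ref))
    return missing

print(undefined_components(analysis))
body = json.dumps({"settings": {"analysis": analysis}}, indent=4)
```

The resulting body string is the JSON payload of the create-index request; how it is sent to the cluster is left out of this sketch.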


After creating the index, use the analyze API to test the new analyzer:

GET /my_index/_analyze
{
    "text": "The quick & brown fox",
    "analyzer": "my_analyzer"
}


The following abbreviated results show that our analyzer is working correctly:

{
  "tokens" : [
      { "token" :   "quick",    "position" : 2 },
      { "token" :   "and",      "position" : 3 },
      { "token" :   "brown",    "position" : 4 },
      { "token" :   "fox",      "position" : 5 }
    ]
}
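
To see why these particular tokens come out, it can help to trace the pipeline by hand. The sketch below imitates the five steps of my_analyzer in plain Python. It is an approximation under stated assumptions: tags are removed with a simple regular expression, the standard tokenizer is imitated by extracting runs of word characters, and the function name simulate_my_analyzer is hypothetical. It does not reproduce token positions.

```python
import html
import re

def simulate_my_analyzer(text, stopwords=("the", "a")):
    """Imitate my_analyzer: char filters, then tokenizer, then token filters."""
    # Character filter 1: rough html_strip (drop tags, decode entities).
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    # Character filter 2: the &_to_and mapping.
    text = text.replace("&", " and ")
    # Tokenizer: approximate the standard tokenizer with word-character runs.
    tokens = re.findall(r"\w+", text)
    # Token filter 1: lowercase.
    tokens = [t.lower() for t in tokens]
    # Token filter 2: custom stopwords.
    return [t for t in tokens if t not in stopwords]

print(simulate_my_analyzer("The quick & brown fox"))
# prints ['quick', 'and', 'brown', 'fox']
```

The & becomes a token only because the mapping character filter runs before tokenization; a tokenizer that strips punctuation would otherwise discard it.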

The analyzer is not much use unless we tell Elasticsearch where to use it. We can apply it to a string field with a mapping such as the following:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":      "string",
            "analyzer":  "my_analyzer"
        }
    }
}
