Normalizers

Normalizers are similar to analyzers except that they may only emit a single token. As a consequence, they do not have a tokenizer and only accept a subset of the available char filters and token filters. Only the filters that work on a per-character basis are allowed. For instance, a lowercasing filter is allowed, but a stemming filter is not, because it needs to look at the keyword as a whole. The filters that can currently be used in a normalizer definition are: arabic_normalization, asciifolding, bengali_normalization, cjk_width, decimal_digit, elision, german_normalization, hindi_normalization, indic_normalization, lowercase, pattern_replace, persian_normalization, scandinavian_folding, serbian_normalization, sorani_normalization, trim, uppercase.
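
As a rough illustration of the restriction above, the following sketch checks a list of built-in filter names against the list in the previous paragraph before a request is sent. The helper `unsupported_filters` and the constant `NORMALIZER_FILTERS` are hypothetical names introduced here, not part of any Elasticsearch client. The check only covers built-in filter names, so custom-defined filters would need separate handling.

```python
# Filter names copied from the list in the paragraph above.
NORMALIZER_FILTERS = frozenset({
    "arabic_normalization", "asciifolding", "bengali_normalization",
    "cjk_width", "decimal_digit", "elision", "german_normalization",
    "hindi_normalization", "indic_normalization", "lowercase",
    "pattern_replace", "persian_normalization", "scandinavian_folding",
    "serbian_normalization", "sorani_normalization", "trim", "uppercase",
})


def unsupported_filters(filter_names):
    """Return, in order, the names that do not appear in the list above."""
    return [name for name in filter_names if name not in NORMALIZER_FILTERS]


# "stemmer" is flagged because stemming needs the keyword as a whole.
print(unsupported_filters(["lowercase", "stemmer", "asciifolding"]))
```

This is only a local pre-check under the stated assumptions and does not replace any validation performed by the cluster.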

Elasticsearch ships with a lowercase built-in normalizer. For other forms of normalization, a custom configuration is required.
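
Because the lowercase normalizer is built in, a keyword field can reference it by name without any analysis settings. The sketch below builds such a mappings body in Python; the helper name `keyword_with_lowercase` and the field name `foo` are illustrative assumptions.

```python
import json


def keyword_with_lowercase(field_name):
    """Build a mappings body whose keyword field uses the built-in normalizer."""
    return {
        "properties": {
            field_name: {
                "type": "keyword",
                # Refers to the built-in normalizer; no custom definition needed.
                "normalizer": "lowercase",
            }
        }
    }


mappings = keyword_with_lowercase("foo")
print(json.dumps(mappings, indent=2))
```

The resulting dictionary could be passed as the `mappings` argument of `client.indices.create`, in the same way as in the custom normalizer example on this page.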

Custom normalizers

Custom normalizers take a list of character filters and a list of token filters.

Python:

resp = client.indices.create(
    index="index",
    settings={
        "analysis": {
            "char_filter": {
                "quote": {
                    "type": "mapping",
                    "mappings": [
                        "« => \"",
                        "» => \""
                    ]
                }
            },
            "normalizer": {
                "my_normalizer": {
                    "type": "custom",
                    "char_filter": [
                        "quote"
                    ],
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    },
    mappings={
        "properties": {
            "foo": {
                "type": "keyword",
                "normalizer": "my_normalizer"
            }
        }
    },
)
print(resp)

Ruby:

response = client.indices.create(
  index: 'index',
  body: {
    settings: {
      analysis: {
        char_filter: {
          quote: {
            type: 'mapping',
            mappings: [
              '« => "',
              '» => "'
            ]
          }
        },
        normalizer: {
          my_normalizer: {
            type: 'custom',
            char_filter: [
              'quote'
            ],
            filter: [
              'lowercase',
              'asciifolding'
            ]
          }
        }
      }
    },
    mappings: {
      properties: {
        foo: {
          type: 'keyword',
          normalizer: 'my_normalizer'
        }
      }
    }
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "index",
  settings: {
    analysis: {
      char_filter: {
        quote: {
          type: "mapping",
          mappings: ['« => "', '» => "'],
        },
      },
      normalizer: {
        my_normalizer: {
          type: "custom",
          char_filter: ["quote"],
          filter: ["lowercase", "asciifolding"],
        },
      },
    },
  },
  mappings: {
    properties: {
      foo: {
        type: "keyword",
        normalizer: "my_normalizer",
      },
    },
  },
});
console.log(response);

Console:

PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quote": {
          "type": "mapping",
          "mappings": [
            "« => \"",
            "» => \""
          ]
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["quote"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
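
To make the effect of `my_normalizer` more concrete, here is a rough pure-Python approximation of its three steps: the `quote` char filter, then `lowercase`, then `asciifolding`. This is not Elasticsearch's implementation. In particular, ASCII folding is approximated with Unicode NFKD decomposition followed by removal of combining marks, which handles accented Latin letters but not every character the real filter folds. The function name is hypothetical.

```python
import unicodedata

# Same replacements as the "quote" mapping char filter in the request above.
QUOTE_MAPPINGS = {"«": '"', "»": '"'}


def normalize_like_my_normalizer(value):
    """Approximate my_normalizer: char filter first, then the token filters."""
    # 1. Char filter: replace guillemets with straight double quotes.
    for source, target in QUOTE_MAPPINGS.items():
        value = value.replace(source, target)
    # 2. lowercase token filter.
    value = value.lower()
    # 3. Approximation of asciifolding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_like_my_normalizer("«Café»"))  # prints "cafe" (with the quotes)
```

Note that the whole input stays a single value from start to finish, which mirrors the single-token constraint described at the top of this page.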
