IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
ICU Normalization Token Filter
editICU Normalization Token Filter
editNormalizes characters as explained
here. It registers
itself as the icu_normalizer
token filter, which is available to all indices
without any further configuration. The type of normalization can be specified
with the name
parameter, which accepts nfc
, nfkc
, and nfkc_cf
(default).
Which letters are normalized can be controlled by specifying the
unicode_set_filter
parameter, which accepts a
UnicodeSet.
You should probably prefer the Normalization character filter.
Here are two examples, the default usage and a customised token filter:
PUT icu_sample { "settings": { "index": { "analysis": { "analyzer": { "nfkc_cf_normalized": { "tokenizer": "icu_tokenizer", "filter": [ "icu_normalizer" ] }, "nfc_normalized": { "tokenizer": "icu_tokenizer", "filter": [ "nfc_normalizer" ] } }, "filter": { "nfc_normalizer": { "type": "icu_normalizer", "name": "nfc" } } } } } }