IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
ICU Normalization Character Filter
Normalizes characters as explained here. It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd, respectively.
Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.
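As a rough sketch of how unicodeSetFilter might be combined with the other parameters, the request below defines a character filter that applies NFD (nfc with mode set to decompose) to everything except a few Swedish letters. The index name icu_filter_sample, the filter name nfd_except_swedish, the analyzer name nfd_partial, and the set pattern are all hypothetical and chosen for illustration only. The sketch assumes the parameter is accepted under the name given above.

```console
PUT icu_filter_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfd_partial": {
            "tokenizer": "icu_tokenizer",
            "char_filter": [ "nfd_except_swedish" ]
          }
        },
        "char_filter": {
          "nfd_except_swedish": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose",
            "unicodeSetFilter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
```

The pattern [^åäöÅÄÖ] is a negated UnicodeSet, so the normalizer is restricted to characters outside that set and the listed letters are left as they are.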
Here are two examples, the default usage and a customised character filter:
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": {
            "tokenizer": "icu_tokenizer",
            "char_filter": [ "icu_normalizer" ]
          },
          "nfd_normalized": {
            "tokenizer": "icu_tokenizer",
            "char_filter": [ "nfd_normalizer" ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
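One way to check what each analyzer does is to run sample text through the analyze API. The request below is a minimal sketch against the icu_sample index defined above. The sample text is an arbitrary illustration, and the request body form is assumed to be supported by the version in use.

```console
GET icu_sample/_analyze
{
  "analyzer": "nfkc_cf_normalized",
  "text": "Ｆｕｌｌｗｉｄｔｈ ﬁlter"
}
```

Swapping the analyzer value for nfd_normalized runs the same text through the customised character filter instead, which makes it easy to compare the tokens produced by the two normalization settings.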