WARNING: The 1.x versions of Elasticsearch have passed their EOL dates. If you are running a 1.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Living in a Unicode World
When Elasticsearch compares one token with another, it does so at the byte level. In other words, for two tokens to be considered the same, they need to consist of exactly the same bytes. Unicode, however, allows you to write the same letter in different ways.
For instance, what’s the difference between é and é? It depends on who you ask. According to Elasticsearch, the first one consists of the two bytes 0xC3 0xA9, and the second one consists of three bytes, 0x65 0xCC 0x81.
According to Unicode, the difference in how they are represented as bytes is irrelevant, and they are the same letter. The first one is the single letter é, while the second is a plain e combined with an acute accent ´.
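The byte sequences quoted above can be checked outside Elasticsearch. The following is a minimal Python sketch, assuming UTF-8 encoding; the variable names are illustrative only and not part of any Elasticsearch API.

```python
# Two ways of writing the same letter (names are illustrative).
composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes.hex(" "))     # c3 a9
print(decomposed_bytes.hex(" "))   # 65 cc 81

# A byte-level (or code-point-level) comparison treats them as different.
print(composed == decomposed)      # False
```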
If you get your data from more than one source, it may happen that you have the same letters encoded in different ways, which may result in one form of déjà not matching another!
Fortunately, a solution is at hand. There are four Unicode normalization forms, all of which convert Unicode characters into a standard format, making all characters comparable at a byte level: nfc, nfd, nfkc, and nfkd.
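To see what the four forms do to the déjà example, here is a small sketch using Python's standard unicodedata module rather than Elasticsearch itself. The helper normalized_forms is a hypothetical name introduced for illustration.

```python
import unicodedata

composed = "d\u00e9j\u00e0"        # déjà with precomposed letters
decomposed = "de\u0301ja\u0300"    # déjà with combining accents

def normalized_forms(text):
    """Return the text in each of the four Unicode normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) for form in forms}

a = normalized_forms(composed)
b = normalized_forms(decomposed)

for form in a:
    # Within any single form, both spellings end up as identical bytes.
    same = a[form].encode("utf-8") == b[form].encode("utf-8")
    print(form, same, len(a[form].encode("utf-8")))
```

Whichever form is chosen, the two inputs become byte-identical; the forms differ only in whether the result is composed (shorter) or decomposed (longer).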
It doesn’t really matter which normalization form you choose, as long as all your text is in the same form. That way, the same tokens consist of the same bytes. That said, the compatibility forms (nfkc and nfkd) allow you to compare ligatures like ﬃ (the single character U+FB03) with their simpler representation, the three letters ffi.
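The difference between the canonical and compatibility forms can be sketched with Python's unicodedata module. This illustrates the Unicode forms themselves and is not the ICU plugin; the variable names are assumptions.

```python
import unicodedata

ligature = "\ufb03"   # the single-character ligature for f, f, i

canonical = unicodedata.normalize("NFC", ligature)       # ligature is kept
compatibility = unicodedata.normalize("NFKC", ligature)  # ligature is expanded

print(canonical == ligature)     # True
print(compatibility == "ffi")    # True
```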
You can use the icu_normalizer token filter to ensure that all of your tokens are in the same form:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "nfkc_normalizer" ]
        }
      }
    }
  }
}
Besides the icu_normalizer token filter mentioned previously, there is also an icu_normalizer character filter, which does the same job as the token filter, but does so before the text reaches the tokenizer. When using the standard tokenizer or icu_tokenizer, this doesn’t really matter. These tokenizers know how to deal with all forms of Unicode correctly.
However, if you plan on using a different tokenizer, such as the ngram, edge_ngram, or pattern tokenizers, it would make sense to use the icu_normalizer character filter in preference to the token filter.
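Why the order matters for such tokenizers can be sketched in plain Python. The char_ngrams helper below is a hypothetical stand-in that slides a window over code points; it only approximates an n-gram tokenizer and is not the Elasticsearch implementation.

```python
import unicodedata

def char_ngrams(text, n=2):
    """Split text into overlapping n-grams of code points (illustrative only)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

composed = "d\u00e9j\u00e0"                            # déjà, 4 code points
decomposed = unicodedata.normalize("NFD", composed)    # déjà, 6 code points

# Tokenizing before normalizing: the two spellings yield different grams,
# and the decomposed one can split a letter from its accent.
grams_composed = char_ngrams(composed)
grams_decomposed = char_ngrams(decomposed)
print(len(grams_composed), len(grams_decomposed))      # 3 5

# Normalizing first (the character filter's role) makes the grams agree.
grams_normalized_first = char_ngrams(unicodedata.normalize("NFC", decomposed))
print(grams_normalized_first == grams_composed)        # True
```

Normalizing each gram afterward, as a token filter would, cannot repair a gram that contains only an accent or only a bare letter, which is why the character filter is the better fit here.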
Usually, though, you will want to not only normalize the byte representation of tokens, but also lowercase them. This can be done with icu_normalizer, using the custom normalization form nfkc_cf, which we discuss in the next section.