IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
ICU Collation Token Filter
Collations are used for sorting documents in a language-specific word order. The icu_collation token filter is available to all indices and defaults to using the DUCET collation, which is a best-effort attempt at language-neutral sorting.
Below is an example of how to set up a field for sorting German names in “phonebook” order:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_phonebook": {
          "type": "icu_collation",
          "language": "de",
          "country": "DE",
          "variant": "@collation=phonebook"
        }
      },
      "analyzer": {
        "german_phonebook": {
          "tokenizer": "keyword",
          "filter": [ "german_phonebook" ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "sort": {
              "type": "string",
              "analyzer": "german_phonebook"
            }
          }
        }
      }
    }
  }
}

GET _search
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}
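The name field uses the standard analyzer, and so supports full-text queries.
The name.sort field uses the keyword tokenizer to preserve the name as a single token, and applies the german_phonebook token filter to index the value in German phonebook order.
An example query which searches the name field and sorts on the name.sort field.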
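For comparison with the German phonebook setup, the following is a minimal sketch of a filter that relies on the default DUCET collation by omitting the locale parameters. It is written as the body of an index-creation request; the filter and analyzer name default_collation is a hypothetical choice for illustration and is not part of the plugin.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "default_collation": {
          "type": "icu_collation"
        }
      },
      "analyzer": {
        "default_collation": {
          "tokenizer": "keyword",
          "filter": [ "default_collation" ]
        }
      }
    }
  }
}
```

As in the German example, the keyword tokenizer keeps each value as a single token so that the whole field value is collated as one unit.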
Collation options
- strength
  The strength property determines the minimum level of difference considered significant during comparison. Possible values are primary, secondary, tertiary, quaternary or identical. See the ICU Collation documentation for a more detailed explanation of each value. Defaults to tertiary unless otherwise specified in the collation.
- decomposition
  Possible values: no (default, but collation-dependent) or canonical. Setting decomposition to canonical allows the Collator to handle unnormalized text properly, producing the same results as if the text were normalized. If decomposition is set to no, it is the user’s responsibility to ensure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting the decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world’s languages do not require text normalization, most locales set no as the default decomposition mode.
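The two options above are set alongside the locale parameters in the filter definition. The sketch below assumes the German phonebook collation from the earlier example and adds strength and decomposition; the filter name german_phonebook_primary is hypothetical. At primary strength only base-letter differences are significant, so case and accent differences should not affect the sort order.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "german_phonebook_primary": {
          "type": "icu_collation",
          "language": "de",
          "country": "DE",
          "variant": "@collation=phonebook",
          "strength": "primary",
          "decomposition": "canonical"
        }
      }
    }
  }
}
```

Setting decomposition to canonical here trades some speed for correct handling of text that has not been normalized before indexing.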
The following options are expert only:
- alternate
  Possible values: shifted or non-ignorable. Sets the alternate handling for strength quaternary to be either shifted or non-ignorable. In practice, this determines whether punctuation and whitespace are ignored (shifted) or not (non-ignorable).
- caseLevel
  Possible values: true or false (default). Whether case-level sorting is required. When enabled with strength set to primary, accent differences are ignored while case differences remain significant.
- caseFirst
  Possible values: lower or upper. Controls which case is sorted first when case is not ignored, as with strength tertiary. The default depends on the collation.
- numeric
  Possible values: true or false (default). Whether digits are sorted according to their numeric value. For example, when set to true, the value egg-9 is sorted before the value egg-21.
- variableTop
  Single character or contraction. Controls what is variable for alternate.
- hiraganaQuaternaryMode
  Possible values: true or false. Whether to distinguish between Katakana and Hiragana characters at quaternary strength.
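As an illustration of the expert options, the sketch below enables numeric so that values such as egg-9 sort before egg-21, and sets alternate to shifted. It uses the default collation, and the name numeric_collation is a hypothetical choice. The boolean is written as a JSON literal here, which is an assumption about the accepted form, since the option list only names the values true and false.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "numeric_collation": {
          "type": "icu_collation",
          "numeric": true,
          "alternate": "shifted"
        }
      },
      "analyzer": {
        "numeric_collation": {
          "tokenizer": "keyword",
          "filter": [ "numeric_collation" ]
        }
      }
    }
  }
}
```

A sort sub-field mapped with this analyzer would then be referenced in the sort clause of a search request in the same way as name.sort in the German example.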