WARNING: Version 1.4 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.

› ›

ICU Analysis Plugin

edit

ICU Analysis Plugin

edit

The ICU analysis plugin allows for unicode normalization, collation and folding. The plugin is called elasticsearch-analysis-icu.

The plugin includes the following analysis components:

ICU Normalization

edit

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}

ICU Folding

edit

Folding of unicode characters based on UTR#30. It registers itself under icu_folding and icuFolding names. The filter also does lowercasing, which means the lowercase filter can normally be left out. Sample setting:

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}

Filtering

edit

The folding can be filtered by a set of unicode characters with the parameter unicodeSetFilter. This is useful for a non-internationalized search engine where retaining a set of national characters which are primary letters in a specific language is wanted. See syntax for the UnicodeSet here.

The Following example exempts Swedish characters from the folding. Note that the filtered characters are NOT lowercased which is why we add that filter below.

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            }
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding"
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}

ICU Collation

edit

Uses collation token filter. Allows to either specify the rules for collation (defined here) using the rules parameter (can point to a location or expressed in the settings, location can be relative to config location), or using the language parameter (further specialized by country and variant). By default registers under icu_collation or icuCollation and uses the default locale.

Here is a sample settings:

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}

And here is a sample of custom collation:

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}

Options

edit

strength

The strength property determines the minimum level of difference considered significant during comparison.

The default strength for the Collator is tertiary, unless specified otherwise by the locale used to create the Collator. Possible values: primary, secondary, tertiary, quaternary or identical.

See ICU Collation documentation for a more detailed explanation for the specific values.

decomposition

Possible values: no or canonical. Defaults to no. Setting this decomposition property with canonical allows the Collator to handle un-normalized text properly, producing the same results as if the text were normalized. If no is set, it is the user’s responsibility to insure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world’s languages do not require text normalization, most locales set no as the default decomposition mode.

Expert options:

edit

`alternate`	Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary` to be either shifted or non-ignorable. What boils down to ignoring punctuation and whitespace.
`caseLevel`	Possible values: `true` or `false`. Default is `false`. Whether case level sorting is required. When strength is set to `primary` this will ignore accent differences.
`caseFirst`	Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored for strength `tertiary`.
`numeric`	Possible values: `true` or `false`. Whether digits are sorted according to numeric representation. For example the value `egg-9` is sorted before the value `egg-21`. Defaults to `false`.
`variableTop`	Single character or contraction. Controls what is variable for `alternate`.
`hiraganaQuaternaryMode`	Possible values: `true` or `false`. Defaults to `false`. Distinguishing between Katakana and Hiragana characters in `quaternary` strength .

ICU Tokenizer

edit

Breaks text into words according to UAX #29: Unicode Text Segmentation http://www.unicode.org/reports/tr29/.

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "icu_tokenizer",
                }
            }
        }
    }
}

ICU Normalization CharFilter

edit

Normalizes characters as explained here. It registers itself by default under icu_normalizer or icuNormalizer using the default settings. Allows for the name parameter to be provided which can include the following values: nfc, nfkc, and nfkc_cf. Allows for the mode parameter to be provided which can include the following values: compose and decompose. Use decompose with nfc or nfkc, to get nfd or nfkd, respectively. Here is a sample settings:

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "char_filter" : ["icu_normalizer"]
                }
            }
        }
    }
}

« Pattern Replace Char Filter Modules »