ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
Rules customization
This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
You can customize the icu_tokenizer behavior by specifying per-script rule
files. See the RBBI rules syntax reference for a more detailed explanation.
To add ICU tokenizer rules, set the rule_files setting, which should contain a
comma-separated list of code:rulefile pairs in the following format: a
four-letter ISO 15924 script code, followed by a colon, then a rule file name.
Rule files are placed in the ES_HOME/config directory.
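The code:rulefile format described above can be sketched in a few lines of Python. This is only an illustration: build_rule_files and parse_rule_files are hypothetical helper names, not part of Elasticsearch or its clients, and the checks they apply (four ASCII letters for the script code, no comma or colon in the file name, surrounding whitespace ignored) are assumptions of this sketch rather than the plugin's own validation.

```python
import re

# ISO 15924 script codes are four letters, conventionally written like "Latn".
_SCRIPT_CODE = re.compile(r"[A-Za-z]{4}")


def build_rule_files(rules):
    """Join a {script_code: rule_file_name} dict into a rule_files value.

    Hypothetical helper for illustration; not part of Elasticsearch.
    """
    pairs = []
    for code, filename in rules.items():
        if not _SCRIPT_CODE.fullmatch(code):
            raise ValueError(f"not a four-letter script code: {code!r}")
        # A comma or colon in the name would make the joined string ambiguous.
        if not filename or "," in filename or ":" in filename:
            raise ValueError(f"unsuitable rule file name: {filename!r}")
        pairs.append(f"{code}:{filename}")
    return ",".join(pairs)


def parse_rule_files(value):
    """Split a rule_files value back into a {script_code: rule_file_name} dict."""
    rules = {}
    for pair in value.split(","):
        # Whitespace is stripped here as a convenience of this sketch.
        code, _, filename = pair.strip().partition(":")
        code, filename = code.strip(), filename.strip()
        if not _SCRIPT_CODE.fullmatch(code) or not filename:
            raise ValueError(f"malformed code:rulefile pair: {pair!r}")
        rules[code] = filename
    return rules


if __name__ == "__main__":
    print(build_rule_files({"Latn": "KeywordTokenizer.rbbi"}))
```

The string returned by build_rule_files is what would go in the rule_files setting of an icu_tokenizer definition. The rule files themselves still have to exist in the config directory.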
As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi:
.+ {200};
This rule matches any run of one or more characters and tags the match with rule status 200.
Then create an analyzer to use this rule file as follows:
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

POST icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
The above analyze request returns the following:
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
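The response above shows that the custom rule turned the whole input into a single token. The following minimal Python sketch checks that property against the response body as printed. It does not contact a cluster, and is_single_keyword_token is a hypothetical helper name introduced only for this illustration. Comparing end_offset with len(text) is valid here because the sample text is plain ASCII.

```python
import json

# The request text and the response body, copied from the example above.
text = "Elasticsearch. Wow!"
response = json.loads("""
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
""")


def is_single_keyword_token(text, response):
    """Return True if the response holds exactly one token covering all of text."""
    tokens = response["tokens"]
    if len(tokens) != 1:
        return False
    token = tokens[0]
    return (
        token["token"] == text
        and token["start_offset"] == 0
        and token["end_offset"] == len(text)
    )


if __name__ == "__main__":
    print(is_single_keyword_token(text, response))  # prints True
```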