ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard
tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample { "settings": { "index": { "analysis": { "analyzer": { "my_icu_analyzer": { "tokenizer": "icu_tokenizer" } } } } } }
Rules customization
This functionality is marked as experimental in Lucene.
You can customize the icu_tokenizer
behavior by specifying per-script rule files. See the
RBBI rules syntax reference
for a more detailed explanation.
To add ICU tokenizer rules, set the rule_files
setting, which should contain a comma-separated list of
code:rulefile
pairs in the following format: a
four-letter ISO 15924 script code,
followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config
directory.
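For example, a rule_files value that assigns separate rule files to Latin and Cyrillic text could look like the following (both file names are hypothetical):
"rule_files" : "Latn:my_latin_rules.rbbi,Cyrl:my_cyrillic_rules.rbbi"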
As a demonstration of how the rule files can be used, save the following rule file to $ES_HOME/config/KeywordTokenizer.rbbi:
.+ {200};
This single rule matches any sequence of characters as one token, so text handled by this rule file is emitted unchanged rather than split into words.
Then create an analyzer to use this rule file as follows:
PUT icu_sample { "settings": { "index":{ "analysis":{ "tokenizer" : { "icu_user_file" : { "type" : "icu_tokenizer", "rule_files" : "Latn:KeywordTokenizer.rbbi" } }, "analyzer" : { "my_analyzer" : { "type" : "custom", "tokenizer" : "icu_user_file" } } } } } } GET icu_sample/_analyze { "analyzer": "my_analyzer", "text": "Elasticsearch. Wow!" }
The above analyze
request returns the following:
{ "tokens": [ { "token": "Elasticsearch. Wow!", "start_offset": 0, "end_offset": 19, "type": "<ALPHANUM>", "position": 0 } ] }