ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard
tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
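To try the analyzer out, you can run an _analyze request against this index. As a sketch, the request below uses a short Thai sample sentence; the exact tokens produced depend on the ICU dictionaries bundled with the plugin, so no expected output is shown here:

```console
GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "สวัสดีครับ"
}
```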
Rules customization
This functionality is marked as experimental in Lucene.
You can customize the icu_tokenizer
behavior by specifying per-script rule files. See the
RBBI rules syntax reference
for a more detailed explanation.
To add ICU tokenizer rules, set the rule_files
setting, which should contain a comma-separated list of
code:rulefile
pairs in the following format: a
four-letter ISO 15924 script code,
followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config
directory.
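For example, a rule_files value covering two scripts might look like the following sketch (the rule file names here are placeholders, not files shipped with the plugin):

```console
"rule_files": "Latn:my_latin_rules.rbbi,Cyrl:my_cyrillic_rules.rbbi"
```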
As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi:
.+ {200};
This rule matches any run of characters as a single token, effectively turning the tokenizer into a keyword tokenizer for Latin-script text. Then create an analyzer that uses this rule file as follows:
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
The above analyze request returns the following:
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}