IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

« ICU Normalization Character Filter ICU Normalization Token Filter »

› › ›

ICU Tokenizer

edit

ICU Tokenizer

edit

Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}

Copy as curl Try in Elastic

Rules customization

edit

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

You can customize the icu-tokenizer behavior by specifying per-script rule files, see the RBBI rules syntax reference for a more detailed explanation.

To add icu tokenizer rules, set the rule_files settings, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a rule file name. Rule files are placed ES_HOME/config directory.

As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi:

.+ {200};

Then create an analyzer to use this rule file as follows:

PUT icu_sample
{
    "settings": {
        "index":{
            "analysis":{
                "tokenizer" : {
                    "icu_user_file" : {
                       "type" : "icu_tokenizer",
                       "rule_files" : "Latn:KeywordTokenizer.rbbi"
                    }
                },
                "analyzer" : {
                    "my_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "icu_user_file"
                    }
                }
            }
        }
    }
}

POST icu_sample/_analyze?analyzer=my_analyzer&text=Elasticsearch. Wow!

Copy as curl Try in Elastic

The above analyze request returns the following:

{
   "tokens": [
      {
         "token": "Elasticsearch. Wow!",
         "start_offset": 0,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 0
      }
   ]
}

« ICU Normalization Character Filter ICU Normalization Token Filter »

Was this helpful?

Feedback

The Search AI Company

ELK Stack

Elastic Cloud

Generative AI

Search

Security

Observability

By solution

Industries

Customer spotlight

Research

Build

Learn

Connect

ICU Tokenizer

ICU Tokenizer

Rules customization

Follow us

About us

Join us

Partners

Trust & Security

Investor relations

Excellence Awards

About us

Join us

Partners

Trust & Security

Investor relations

Excellence Awards