ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard
tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
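To try the analyzer out, you can run an _analyze request against this index. As a sketch, the request below uses a short Thai sample sentence; the exact tokens produced depend on the ICU dictionaries bundled with the plugin, so no expected output is shown here:

```console
GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "สวัสดีครับ"
}
```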
Rules customization
This functionality is marked as experimental in Lucene.
You can customize the icu_tokenizer
behavior by specifying per-script rule files. See the
RBBI rules syntax reference
for a more detailed explanation.
To add ICU tokenizer rules, set the rule_files
setting, which should contain a comma-separated list of
code:rulefile
pairs in the following format: a
four-letter ISO 15924 script code,
followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config
directory.
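For example, a rule_files value covering two scripts might look like the following sketch (the rule file names here are placeholders, not files shipped with the plugin):

```console
"rule_files": "Latn:my_latin_rules.rbbi,Cyrl:my_cyrillic_rules.rbbi"
```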
As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi:
.+ {200};
This rule matches any run of characters as a single token, effectively turning the tokenizer into a keyword tokenizer for Latin-script text. Then create an analyzer that uses this rule file as follows:
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
The above analyze request returns the following:
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}