ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard
tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
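To see the dictionary-based segmentation in action, you could run the analyzer against a short piece of Thai text with the _analyze API. The request below is a sketch: the Thai sample phrase and the resulting word splits are illustrative assumptions, not output taken from the reference.

GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "สวัสดีครับ"
}

Because the text contains no spaces, the icu_tokenizer should use its dictionary to emit each Thai word as a separate token, whereas the standard tokenizer would typically keep such a spaceless run together as a single token.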