
CJK width token filter


Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters as follows:

  • Folds full-width ASCII character variants into the equivalent basic Latin characters, as shown in the sketch after this list
  • Folds half-width Katakana character variants into the equivalent Kana characters
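
For example, the first fold maps full-width Latin letters to their ASCII equivalents. A minimal sketch using the _analyze API (the sample text is illustrative):

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "Ｅｌａｓｔｉｃｓｅａｒｃｈ"
}

The full-width input is folded into the single basic Latin token Elasticsearch.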

This filter is included in Elasticsearch’s built-in CJK language analyzer. It uses Lucene’s CJKWidthFilter.
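
Because the cjk analyzer runs this filter before its other token filters, you can observe the width folding by analyzing half-width text with that analyzer directly. A quick illustrative request; note that the analyzer's CJK bigram filter will then split the folded text into bigram tokens:

GET /_analyze
{
  "analyzer" : "cjk",
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}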

This token filter can be viewed as a subset of NFKC/NFKD Unicode normalization. See the analysis-icu plugin for full normalization support.
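
For full normalization, the plugin provides an icu_normalizer token filter. A minimal sketch, assuming the analysis-icu plugin is installed (the index and custom filter names here are illustrative):

PUT /icu_nfkc_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nfkc_normalized": {
          "tokenizer": "standard",
          "filter": [ "nfkc" ]
        }
      },
      "filter": {
        "nfkc": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      }
    }
  }
}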

Example

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "シーサイドライナー"
}

The filter produces the following token:

シーサイドライナー

Add to an analyzer


The following create index API request uses the CJK width token filter to configure a new custom analyzer.

PUT /cjk_width_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_cjk_width": {
          "tokenizer": "standard",
          "filter": [ "cjk_width" ]
        }
      }
    }
  }
}
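
To verify the new analyzer, you can run the _analyze API against the index; the sample text mirrors the example above:

GET /cjk_width_example/_analyze
{
  "analyzer": "standard_cjk_width",
  "text": "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}

This returns the full-width token シーサイドライナー.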