CJK width token filter
Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters as follows:
- Folds full-width ASCII character variants into the equivalent basic Latin characters
- Folds half-width Katakana character variants into the equivalent Kana characters
This filter is included in Elasticsearch’s built-in CJK language analyzer. It uses Lucene’s CJKWidthFilter.
This token filter can be viewed as a subset of NFKC/NFKD Unicode normalization. See the analysis-icu plugin for full normalization support.
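Because the filter can be viewed as a subset of NFKC normalization, the two folds listed above can be approximated outside Elasticsearch with Python's standard unicodedata module. This is only a rough sketch: it applies full NFKC rather than Lucene's CJKWidthFilter, so it may also change characters that cjk_width leaves untouched. The helper name fold_width_nfkc is hypothetical.

```python
import unicodedata


def fold_width_nfkc(text: str) -> str:
    """Approximate cjk_width folding with full NFKC normalization.

    Assumption: NFKC covers more than the filter does, so this is an
    illustration of the width folds, not a drop-in replacement.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Full-width ASCII variants fold to basic Latin characters.
    print(fold_width_nfkc("ＥＳ８"))
    # Half-width Katakana variants fold to the equivalent Kana.
    print(fold_width_nfkc("ｼｰｻｲﾄﾞﾗｲﾅｰ"))
```

For index-time or search-time analysis, the cjk_width filter (or the analysis-icu plugin for full normalization) remains the appropriate tool; the snippet is only meant to make the folding behavior visible.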
Example
resp = client.indices.analyze(
    tokenizer="standard",
    filter=["cjk_width"],
    text="ｼｰｻｲﾄﾞﾗｲﾅｰ",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: ['cjk_width'],
    text: 'ｼｰｻｲﾄﾞﾗｲﾅｰ'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: ["cjk_width"],
  text: "ｼｰｻｲﾄﾞﾗｲﾅｰ",
});
console.log(response);

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["cjk_width"],
  "text": "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}
The filter produces the following token:
シーサイドライナー
Add to an analyzer
The following create index API request uses the CJK width token filter to configure a new custom analyzer.
resp = client.indices.create(
    index="cjk_width_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_cjk_width": {
                    "tokenizer": "standard",
                    "filter": ["cjk_width"],
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'cjk_width_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          standard_cjk_width: {
            tokenizer: 'standard',
            filter: ['cjk_width']
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "cjk_width_example",
  settings: {
    analysis: {
      analyzer: {
        standard_cjk_width: {
          tokenizer: "standard",
          filter: ["cjk_width"],
        },
      },
    },
  },
});
console.log(response);

PUT /cjk_width_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_cjk_width": {
          "tokenizer": "standard",
          "filter": ["cjk_width"]
        }
      }
    }
  }
}
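Once the index exists, the custom analyzer can be exercised through the analyze API by naming the index and the analyzer. The following is a minimal sketch, assuming the same Python client as in the requests above. The helper names analyzer_request and analyzed_tokens are hypothetical, and the code assumes the response carries a "tokens" list whose entries each have a "token" field.

```python
def analyzer_request(text):
    """Build analyze API arguments that target the custom analyzer."""
    return {
        "index": "cjk_width_example",       # index created by the request above
        "analyzer": "standard_cjk_width",   # custom analyzer defined in its settings
        "text": text,
    }


def analyzed_tokens(client, text):
    """Return only the token strings from the analyze API response."""
    resp = client.indices.analyze(**analyzer_request(text))
    return [entry["token"] for entry in resp["tokens"]]
```

With a connected client, calling analyzed_tokens(client, "ｼｰｻｲﾄﾞﾗｲﾅｰ") would be expected to return the single full-width token shown in the example section. Keeping the request arguments in a separate function makes them easy to inspect without a running cluster.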