CJK bigram token filter
Forms bigrams out of CJK (Chinese, Japanese, and Korean) tokens.
This filter is included in Elasticsearch’s built-in CJK language analyzer. It uses Lucene’s CJKBigramFilter.
Example
The following analyze API request demonstrates how the CJK bigram token filter works.
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_bigram"],
  "text" : "東京都は、日本の首都であり"
}
The filter produces the following tokens:
[ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
Add to an analyzer
The following create index API request uses the CJK bigram token filter to configure a new custom analyzer.
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_bigram" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_bigram"]
        }
      }
    }
  }
}
Configurable parameters
ignored_scripts
(Optional, array of character scripts) Array of character scripts for which to disable bigrams. Possible values:

- han
- hangul
- hiragana
- katakana

All non-CJK input is passed through unmodified.
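For example, the analyze API accepts inline custom filter definitions, so the effect of ignored_scripts can be sketched as follows (here disabling bigrams for Han characters only; hiragana and katakana characters in the text are still paired):

```console
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : [
    {
      "type" : "cjk_bigram",
      "ignored_scripts" : [ "han" ]
    }
  ],
  "text" : "東京都は、日本の首都であり"
}
```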
output_unigrams
(Optional, Boolean) If true, emit tokens in both bigram and unigram form. If false, a CJK character is output in unigram form when it has no adjacent characters. Defaults to false.
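As a sketch, output_unigrams can likewise be tried directly through the analyze API's inline filter definition; with it enabled, single-character tokens such as 東 are emitted alongside bigrams such as 東京:

```console
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : [
    {
      "type" : "cjk_bigram",
      "output_unigrams" : true
    }
  ],
  "text" : "東京都は、日本の首都であり"
}
```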
Customize
To customize the CJK bigram token filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "han_bigrams" : {
          "tokenizer" : "standard",
          "filter" : ["han_bigrams_filter"]
        }
      },
      "filter" : {
        "han_bigrams_filter" : {
          "type" : "cjk_bigram",
          "ignored_scripts" : [
            "hangul",
            "hiragana",
            "katakana"
          ],
          "output_unigrams" : true
        }
      }
    }
  }
}