NOTE: You are looking at documentation for an older release. For the latest information, see the current release documentation.
CJK Bigram Token Filter
The cjk_bigram token filter forms bigrams out of the CJK terms that are generated by the standard tokenizer or the icu_tokenizer (see analysis-icu plugin).
By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you always want to output both unigrams and bigrams, set the output_unigrams flag to true. This can be used for a combined unigram+bigram approach.
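The effect of output_unigrams can be illustrated with a small Python model. This is a simplified sketch, not the filter's actual implementation: the function name cjk_bigrams is hypothetical, the input is assumed to be a single run of adjacent CJK characters, and the ordering of unigrams relative to bigrams in the output is chosen for readability only.

```python
def cjk_bigrams(run, output_unigrams=False):
    """Model the tokens produced for one run of adjacent CJK characters.

    Illustrative only: `cjk_bigrams` is a hypothetical helper, not part of
    Elasticsearch or Lucene.
    """
    chars = list(run)
    # A character with no neighbour cannot form a bigram, so it is
    # emitted as a unigram even when output_unigrams is False.
    if len(chars) == 1:
        return chars
    tokens = []
    for i, ch in enumerate(chars):
        if output_unigrams:
            tokens.append(ch)
        # Overlapping bigram starting at this character, if one exists.
        if i + 1 < len(chars):
            tokens.append(ch + chars[i + 1])
    return tokens


print(cjk_bigrams("東京都"))                        # ['東京', '京都']
print(cjk_bigrams("東京都", output_unigrams=True))  # ['東', '東京', '京', '京都', '都']
print(cjk_bigrams("東"))                            # ['東']
```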
Bigrams are generated for characters in han, hiragana, katakana and hangul, but bigrams can be disabled for particular scripts with the ignored_scripts parameter. All non-CJK input is passed through unmodified.
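A rough Python model of ignored_scripts follows. It rests on several assumptions: tokens arrive as (text, script) pairs already labelled by the tokenizer, all tokens are adjacent in the source text, and a token from an ignored script is treated like non-CJK input, meaning it passes through unchanged and ends the current bigram run. The names filter_tokens, bigrams and CJK_SCRIPTS are hypothetical and exist only for this sketch.

```python
# Scripts for which bigrams are formed unless they are ignored.
CJK_SCRIPTS = {"han", "hiragana", "katakana", "hangul"}


def bigrams(run):
    """Overlapping bigrams for a run of characters; a lone character stays a unigram."""
    if len(run) == 1:
        return list(run)
    return [a + b for a, b in zip(run, run[1:])]


def filter_tokens(tokens, ignored_scripts=()):
    """Model of bigramming with some scripts ignored (hypothetical helper).

    `tokens` is a list of (text, script) pairs; script is None for non-CJK input.
    """
    active = CJK_SCRIPTS - set(ignored_scripts)
    out, run = [], []
    for text, script in tokens:
        if script in active:
            # Collect characters of bigram-eligible tokens into the current run.
            run.extend(text)
        else:
            # Ignored-script and non-CJK tokens end the run and pass through as-is.
            out.extend(bigrams(run))
            run = []
            out.append(text)
    out.extend(bigrams(run))
    return out


tokens = [("東", "han"), ("京", "han"), ("へ", "hiragana"),
          ("行", "han"), ("く", "hiragana"), ("Tokyo", None)]

print(filter_tokens(tokens))
# ['東京', '京へ', 'へ行', '行く', 'Tokyo']
print(filter_tokens(tokens, ignored_scripts=["hiragana", "katakana", "hangul"]))
# ['東京', 'へ', '行', 'く', 'Tokyo']
```

With hiragana ignored, 行 has no bigram-eligible neighbour left, so the model emits it as a unigram, in line with the default behaviour described earlier.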
PUT /cjk_bigram_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignored_scripts": [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}