WARNING: Version 2.0 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
CJK Bigram Token Filter
The cjk_bigram token filter forms bigrams out of the CJK terms that are generated by the standard tokenizer or the icu_tokenizer (see ICU Analysis Plugin).
By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you always want to output both unigrams and bigrams, set the output_unigrams flag to true. This can be used for a combined unigram+bigram approach.
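The effect of output_unigrams can be sketched outside Elasticsearch. The snippet below is a minimal illustration in Python, not the filter's actual implementation: the function name cjk_bigrams is hypothetical, it assumes the input is a single run of adjacent CJK characters, and it ignores script detection, token positions, and offsets. The interleaved unigram-then-bigram ordering is also an assumption about how the combined output is arranged.

```python
def cjk_bigrams(run, output_unigrams=False):
    """Illustrative sketch only: form bigrams from one run of adjacent CJK characters.

    `run` is assumed to contain only CJK characters with no gaps between them.
    """
    chars = list(run)
    if len(chars) == 1:
        # A lone character has no neighbour, so it is emitted as a unigram.
        return chars

    tokens = []
    for i, ch in enumerate(chars):
        if output_unigrams:
            # Combined approach: emit every single character as well.
            tokens.append(ch)
        if i + 1 < len(chars):
            # Overlapping bigram made of this character and the next one.
            tokens.append(ch + chars[i + 1])
    return tokens


if __name__ == "__main__":
    print(cjk_bigrams("東京都"))
    print(cjk_bigrams("東京都", output_unigrams=True))
    print(cjk_bigrams("東"))
```

With the default setting, a three-character run yields two overlapping bigrams. With output_unigrams enabled, each character is emitted in addition to the bigrams, which is the combined unigram+bigram approach described above.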
Bigrams are generated for characters in han, hiragana, katakana and hangul, but bigrams can be disabled for particular scripts with the ignored_scripts parameter. All non-CJK input is passed through unmodified.
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignored_scripts" : [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}