kuromoji analyzer
The kuromoji analyzer uses the following analysis chain:
- CJKWidthCharFilter from Lucene
- kuromoji_tokenizer
- kuromoji_baseform token filter
- kuromoji_part_of_speech token filter
- ja_stop token filter
- kuromoji_stemmer token filter
- lowercase token filter
It supports the mode and user_dictionary settings from kuromoji_tokenizer.
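As a minimal sketch of how these two settings might be passed to the analyzer, the Python snippet below builds the body of a create-index request and prints it as JSON. The analyzer name my_kuromoji and the dictionary file userdict_ja.txt are hypothetical placeholders, and the accepted mode values (normal, search, extended) are assumed to match those of kuromoji_tokenizer.

```python
import json
from typing import Optional

# Hypothetical names, used only for illustration.
ANALYZER_NAME = "my_kuromoji"
USER_DICTIONARY = "userdict_ja.txt"  # assumed to live in the Elasticsearch config directory

# Assumption: same mode values as kuromoji_tokenizer.
VALID_MODES = {"normal", "search", "extended"}


def kuromoji_analyzer_settings(mode: str = "search",
                               user_dictionary: Optional[str] = None) -> dict:
    """Build index settings that configure the built-in kuromoji analyzer."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    analyzer = {"type": "kuromoji", "mode": mode}
    if user_dictionary is not None:
        analyzer["user_dictionary"] = user_dictionary
    return {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {ANALYZER_NAME: analyzer}
                }
            }
        }
    }


body = kuromoji_analyzer_settings("search", USER_DICTIONARY)
# The printed JSON can be sent as the body of a create-index request.
print(json.dumps(body, indent=2))
```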
Normalize full-width characters
The kuromoji_tokenizer tokenizer uses characters from the MeCab-IPADIC dictionary to split text into tokens. The dictionary includes some full-width characters, such as ｏ and ｆ. If a text contains full-width characters, the tokenizer can produce unexpected tokens.
For example, the kuromoji_tokenizer tokenizer converts the text Ｃｕｌｔｕｒｅ　ｏｆ　Ｊａｐａｎ to the tokens [ culture, o, f, japan ] instead of [ culture, of, japan ].
To avoid this, add the icu_normalizer character filter to a custom analyzer based on the kuromoji analyzer. The icu_normalizer character filter converts full-width characters to their normal equivalents.
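The effect of this kind of normalization can be illustrated outside Elasticsearch. The sketch below uses Python's standard unicodedata module to apply NFKC normalization followed by case folding, which only approximates the nfkc_cf form that icu_normalizer is assumed to apply by default. It does not call the plugin itself, and the helper names are hypothetical.

```python
import unicodedata


def to_full_width(text: str) -> str:
    """Map printable ASCII to its full-width form (demo helper only)."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")  # ideographic (full-width) space
        elif "!" <= ch <= "~":
            out.append(chr(ord(ch) + 0xFEE0))  # U+FF01..U+FF5E block
        else:
            out.append(ch)
    return "".join(out)


def normalize_width(text: str) -> str:
    """Approximate nfkc_cf: NFKC compatibility mapping, then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


full_width = to_full_width("Culture of Japan")
normalized = normalize_width(full_width)
print(repr(full_width))
print(repr(normalized))  # 'culture of japan'
```

Because the text reaches the tokenizer only after the character filter has run, the tokenizer would see ordinary half-width letters rather than the full-width forms.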
First, duplicate the kuromoji analyzer to create the basis for a custom analyzer. Then add the icu_normalizer character filter to the custom analyzer. For example:
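A minimal sketch of such a custom analyzer, written as a Python script that builds and prints the create-index request body. The index name index-00001 and the analyzer name kuromoji_normalize are hypothetical. The sketch assumes that the analysis-kuromoji and analysis-icu plugins are installed, and that the cjk_width token filter is an acceptable stand-in for the width-normalization step of the built-in chain.

```python
import json

# Hypothetical names, used only for illustration.
INDEX_NAME = "index-00001"
ANALYZER_NAME = "kuromoji_normalize"

# Custom analyzer mirroring the kuromoji chain, with icu_normalizer added
# as a character filter so full-width text is normalized before tokenizing.
custom_analyzer = {
    "char_filter": ["icu_normalizer"],
    "tokenizer": "kuromoji_tokenizer",
    "filter": [
        "kuromoji_baseform",
        "kuromoji_part_of_speech",
        "cjk_width",  # assumption: token-filter stand-in for the width step
        "ja_stop",
        "kuromoji_stemmer",
        "lowercase",
    ],
}

request_body = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {ANALYZER_NAME: custom_analyzer}
            }
        }
    }
}

# Print the request in a console-like form; sending it is left to the reader.
print(f"PUT /{INDEX_NAME}")
print(json.dumps(request_body, indent=2))
```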