kuromoji_tokenizer
The kuromoji_tokenizer accepts the following settings:
mode
    The tokenization mode determines how the tokenizer handles compound and unknown words. It can be set to:

    normal
        Normal segmentation, no decomposition for compound words. Example output:

        関西国際空港
        アブラカダブラ

    search
        Segmentation geared towards search. This includes a decompounding process for long nouns, also including the full compound token as a synonym. Example output:

        関西, 関西国際空港, 国際, 空港
        アブラカダブラ

    extended
        Extended mode outputs unigrams for unknown words. Example output:

        関西, 国際, 空港
        ア, ブ, ラ, カ, ダ, ブ, ラ
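The modes can be compared directly with the _analyze API by defining a tokenizer per mode. A minimal sketch for search mode, assuming an illustrative index name kuromoji_modes (the tokenizer and analyzer names are also illustrative):

    PUT kuromoji_modes
    {
      "settings": {
        "index": {
          "analysis": {
            "tokenizer": {
              "kuromoji_search_tok": {
                "type": "kuromoji_tokenizer",
                "mode": "search"
              }
            },
            "analyzer": {
              "search_analyzer": {
                "type": "custom",
                "tokenizer": "kuromoji_search_tok"
              }
            }
          }
        }
      }
    }

    POST kuromoji_modes/_analyze?analyzer=search_analyzer&text=関西国際空港

In search mode the compound 関西国際空港 is returned decompounded, alongside the full compound token as a synonym, as in the example output above.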
discard_punctuation
    Whether punctuation should be discarded from the output. Defaults to true.
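When discard_punctuation is left at its default, punctuation marks such as 。 and 、 are dropped from the token stream; setting it to false keeps them as tokens. A minimal sketch, assuming an illustrative index name kuromoji_punct:

    PUT kuromoji_punct
    {
      "settings": {
        "index": {
          "analysis": {
            "tokenizer": {
              "kuromoji_keep_punct": {
                "type": "kuromoji_tokenizer",
                "discard_punctuation": "false"
              }
            },
            "analyzer": {
              "punct_analyzer": {
                "type": "custom",
                "tokenizer": "kuromoji_keep_punct"
              }
            }
          }
        }
      }
    }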
user_dictionary
    The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:

    <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ja.txt:
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
Then create an analyzer as follows:
    PUT kuromoji_sample
    {
      "settings": {
        "index": {
          "analysis": {
            "tokenizer": {
              "kuromoji_user_dict": {
                "type": "kuromoji_tokenizer",
                "mode": "extended",
                "discard_punctuation": "false",
                "user_dictionary": "userdict_ja.txt"
              }
            },
            "analyzer": {
              "my_analyzer": {
                "type": "custom",
                "tokenizer": "kuromoji_user_dict"
              }
            }
          }
        }
      }
    }

    POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=東京スカイツリー
The above analyze request returns the following:
    # Result
    {
      "tokens" : [
        {
          "token" : "東京",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "スカイツリー",
          "start_offset" : 2,
          "end_offset" : 8,
          "type" : "word",
          "position" : 2
        }
      ]
    }