Compound Word Token Filter

edit

Compound Word Token Filter

edit

Token filters that allow to decompose compound words. There are two types available: dictionary_decompounder and hyphenation_decompounder.

The following are settings that can be set for a compound word token filter type:

Setting Description

word_list

A list of words to use.

word_list_path

A path (either relative to config location, or absolute) to a list of words.

min_word_size

Minimum word size(Integer). Defaults to 5.

min_subword_size

Minimum subword size(Integer). Defaults to 2.

max_subword_size

Maximum subword size(Integer). Defaults to 15.

only_longest_match

Only matching the longest(Boolean). Defaults to false

Here is an example:

index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : standard
                filter : [myTokenFilter1, myTokenFilter2]
        filter :
            myTokenFilter1 :
                type : dictionary_decompounder
                word_list: [one, two, three]
            myTokenFilter2 :
                type : hyphenation_decompounder
                word_list_path: path/to/words.txt
                max_subword_size : 22