Dictionary decompounder token filter

In most cases, we recommend using the faster hyphenation_decompounder token filter in place of this filter. However, you can use the dictionary_decompounder filter to check the quality of a word list before implementing it in the hyphenation_decompounder filter.

Uses a specified list of words and a brute force approach to find subwords in compound words. If found, these subwords are included in the token output.

This filter uses Lucene’s DictionaryCompoundWordTokenFilter, which was built for Germanic languages.
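As a rough illustration of the brute force approach, the sketch below tries every start offset and every length inside a token and keeps each substring that appears in the word list. The function name `decompound` is hypothetical, matching is case-sensitive as a simplifying assumption, and this is a simplified model rather than Lucene's actual implementation.

```python
def decompound(token, word_list):
    """Return the original token followed by every listed word found inside it."""
    dictionary = set(word_list)
    subwords = []
    # Brute force: test every (start, end) substring against the dictionary.
    for start in range(len(token)):
        for end in range(start + 1, len(token) + 1):
            candidate = token[start:end]
            if candidate in dictionary:
                subwords.append(candidate)
    # The original token is kept and the subwords are added after it.
    return [token] + subwords


print(decompound("Donaudampfschiff", ["Donau", "dampf", "meer", "schiff"]))
# ['Donaudampfschiff', 'Donau', 'dampf', 'schiff']
```

Because every substring is tested, the cost grows quickly with token length, which is one reason a faster filter is usually the better choice for production use.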

Example

The following analyze API request uses the dictionary_decompounder filter to find subwords in Donaudampfschiff. The filter then checks these subwords against the specified list of words: Donau, dampf, meer, and schiff.

Python:

resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        {
            "type": "dictionary_decompounder",
            "word_list": [
                "Donau",
                "dampf",
                "meer",
                "schiff"
            ]
        }
    ],
    text="Donaudampfschiff",
)
print(resp)

Ruby:

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      {
        type: 'dictionary_decompounder',
        word_list: [
          'Donau',
          'dampf',
          'meer',
          'schiff'
        ]
      }
    ],
    text: 'Donaudampfschiff'
  }
)
puts response

JavaScript:

const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: [
    {
      type: "dictionary_decompounder",
      word_list: ["Donau", "dampf", "meer", "schiff"],
    },
  ],
  text: "Donaudampfschiff",
});
console.log(response);

Console:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"]
    }
  ],
  "text": "Donaudampfschiff"
}

The filter produces the following tokens:

[ Donaudampfschiff, Donau, dampf, schiff ]

Configurable parameters

word_list

(Required*, array of strings) A list of subwords to look for in the token stream. If found, the subword is included in the token output.

Either this parameter or word_list_path must be specified.

word_list_path

(Required*, string) Path to a file that contains a list of subwords to find in the token stream. If found, the subword is included in the token output.

This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each token in the file must be separated by a line break.

Either this parameter or word_list must be specified.

max_subword_size

(Optional, integer) Maximum subword character length. Longer subword tokens are excluded from the output. Defaults to 15.

min_subword_size

(Optional, integer) Minimum subword character length. Shorter subword tokens are excluded from the output. Defaults to 2.

min_word_size

(Optional, integer) Minimum word character length. Shorter word tokens are not checked for subwords and pass through unchanged. Defaults to 5.

only_longest_match

(Optional, Boolean) If true, only include the longest matching subword. Defaults to false.
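To make the interaction of these parameters concrete, here is a minimal Python sketch of a dictionary decompounder that honors the four size and matching options. The function name `decompound_with_limits` is hypothetical, and the sketch rests on several assumptions: matching is case-sensitive, a token shorter than `min_word_size` is passed through without being split, and the longest match is chosen per start offset. It models the documented behavior and is not the filter's actual code.

```python
def decompound_with_limits(token, word_list, min_word_size=5,
                           min_subword_size=2, max_subword_size=15,
                           only_longest_match=False):
    """Return the token plus the listed subwords that satisfy the size limits."""
    dictionary = set(word_list)
    # Assumption: tokens below min_word_size are not searched for subwords.
    if len(token) < min_word_size:
        return [token]
    subwords = []
    for start in range(len(token) - min_subword_size + 1):
        longest = None
        # Only candidate lengths between the two subword limits are tried.
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(token):
                break
            candidate = token[start:start + size]
            if candidate in dictionary:
                if only_longest_match:
                    # Sizes increase, so the last hit at this offset is the longest.
                    longest = candidate
                else:
                    subwords.append(candidate)
        if longest is not None:
            subwords.append(longest)
    return [token] + subwords


words = ["Donau", "dampf", "dampfschiff", "schiff"]
print(decompound_with_limits("Donaudampfschiff", words))
print(decompound_with_limits("Donaudampfschiff", words, only_longest_match=True))
print(decompound_with_limits("Donaudampfschiff", words, max_subword_size=5))
```

With the defaults, both dampf and dampfschiff are reported because they start at the same offset. Setting `only_longest_match=True` keeps only dampfschiff at that offset, and `max_subword_size=5` drops schiff and dampfschiff because they exceed five characters.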

Customize and add to an analyzer

To customize the dictionary_decompounder filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom dictionary_decompounder filter to configure a new custom analyzer.

The custom dictionary_decompounder filter finds the subwords listed in the analysis/example_word_list.txt file. Subwords longer than 22 characters are excluded from the token output.

Python:

resp = client.indices.create(
    index="dictionary_decompound_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_dictionary_decompound": {
                    "tokenizer": "standard",
                    "filter": [
                        "22_char_dictionary_decompound"
                    ]
                }
            },
            "filter": {
                "22_char_dictionary_decompound": {
                    "type": "dictionary_decompounder",
                    "word_list_path": "analysis/example_word_list.txt",
                    "max_subword_size": 22
                }
            }
        }
    },
)
print(resp)

JavaScript:

const response = await client.indices.create({
  index: "dictionary_decompound_example",
  settings: {
    analysis: {
      analyzer: {
        standard_dictionary_decompound: {
          tokenizer: "standard",
          filter: ["22_char_dictionary_decompound"],
        },
      },
      filter: {
        "22_char_dictionary_decompound": {
          type: "dictionary_decompounder",
          word_list_path: "analysis/example_word_list.txt",
          max_subword_size: 22,
        },
      },
    },
  },
});
console.log(response);

Console:

PUT dictionary_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_dictionary_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_dictionary_decompound" ]
        }
      },
      "filter": {
        "22_char_dictionary_decompound": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "max_subword_size": 22
        }
      }
    }
  }
}
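
The request above expects a word list file at analysis/example_word_list.txt, relative to the config location. As a hedged sketch of how such a file could be prepared, the snippet below writes one word per line in UTF-8 and reads it back as a sanity check. The helper names `write_word_list` and `read_word_list` are hypothetical, and a temporary directory stands in for the real config directory, whose location depends on your installation.

```python
import tempfile
from pathlib import Path


def write_word_list(path, words):
    """Write one word per line, UTF-8 encoded, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return path


def read_word_list(path):
    """Read the file back for a local check, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# A temporary directory stands in for the config location (an assumption).
with tempfile.TemporaryDirectory() as config_dir:
    word_file = write_word_list(
        Path(config_dir) / "analysis" / "example_word_list.txt",
        ["Donau", "dampf", "schiff"],
    )
    print(read_word_list(word_file))
```

The read-back step only confirms the file layout locally. It does not reproduce how the filter itself loads the list.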