Word Delimiter Graph Token Filter

This functionality is marked as experimental in Lucene.

Named word_delimiter_graph, this filter splits words into subwords and performs optional transformations on subword groups. Words are split into subwords according to the following rules (a short example follows the list):

  • split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" → "Wi", "Fi"
  • split on case transitions: "PowerShot" → "Power", "Shot"
  • split on letter-number transitions: "SD500" → "SD", "500"
  • leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" → "hello", "there", "dude"
  • trailing "'s" are removed for each subword: "O’Neil’s" → "O", "Neil"

Unlike the word_delimiter filter, this token filter correctly handles positions for multi-term expansion at search time when any of the following options is set to true (see the sketch after this list):

  • preserve_original
  • catenate_numbers
  • catenate_words
  • catenate_all
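
With catenate_words, for example, the catenated token is emitted at the same position as the first subword and spans the subword parts in the token graph, which is what makes search-time multi-term expansion work correctly. A minimal sketch using an inline filter definition in the _analyze API; for "wi-fi" the output contains "wifi" alongside "wi" and "fi":

    GET /_analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        {
          "type": "word_delimiter_graph",
          "catenate_words": true
        }
      ],
      "text": "wi-fi"
    }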

Parameters include:

generate_word_parts
If true causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.
generate_number_parts
If true causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.
catenate_words
If true causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to false.
catenate_numbers
If true causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.
catenate_all
If true causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.
split_on_case_change
If true causes "PowerShot" to be two tokens; ("Power-Shot" remains two parts regards). Defaults to true.
preserve_original
If true includes original words in subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.
split_on_numerics
If true causes "j2se" to be three tokens; "j" "2" "se". Defaults to true.
stem_english_possessive
If true causes trailing "'s" to be removed for each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.
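
In practice these parameters are set on a custom filter that a custom analyzer then references. A sketch of such index settings; the names my-index, my_word_delimiter, and my_analyzer are placeholders:

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "whitespace",
              "filter": [ "my_word_delimiter" ]
            }
          },
          "filter": {
            "my_word_delimiter": {
              "type": "word_delimiter_graph",
              "split_on_case_change": false,
              "stem_english_possessive": false,
              "preserve_original": true
            }
          }
        }
      }
    }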

Advanced settings include:

protected_words
A list of words protected from being delimited. Specify either an array of words, or set protected_words_path to the path of a file containing the protected words (one per line). A relative path resolves against the config/ directory.
adjust_offsets
By default, the filter tries to output subtokens with adjusted offsets to reflect their actual position in the token stream. However, when used in combination with other filters that alter the length or starting position of tokens without changing their offsets (e.g. trim) this can cause tokens with illegal offsets to be emitted. Setting adjust_offsets to false will stop word_delimiter_graph from adjusting these internal offsets.
type_table
A custom type mapping table, for example (when configured using type_table_path):
    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \u200D => ALPHANUM
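
Both advanced settings can also be supplied inline in the filter definition. A sketch combining them; the protected words and type mappings below are illustrative only:

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "my_word_delimiter": {
              "type": "word_delimiter_graph",
              "protected_words": [ "Wi-Fi", "C++" ],
              "type_table": [ "$ => DIGIT", "% => DIGIT" ],
              "adjust_offsets": false
            }
          }
        }
      }
    }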

Using a tokenizer like the standard tokenizer may interfere with the catenate_* and preserve_original parameters, as the original string may already have lost punctuation during tokenization. Instead, you may want to use the whitespace tokenizer.
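
To illustrate: with the whitespace tokenizer, "Wi-Fi" reaches the filter intact, so preserve_original can emit it alongside the subwords; the standard tokenizer would already have split it into "Wi" and "Fi", leaving no original to preserve. A minimal sketch:

    GET /_analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      ],
      "text": "Wi-Fi"
    }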