Word delimiter token filter
We recommend using the word_delimiter_graph instead of the word_delimiter filter.
The word_delimiter filter can produce invalid token graphs. See Differences between word_delimiter_graph and word_delimiter.
The word_delimiter filter also uses Lucene’s WordDelimiterFilter, which is marked as deprecated.
Splits tokens at non-alphanumeric characters. The word_delimiter filter also performs optional token normalization based on a set of rules. By default, the filter uses the following rules:
- Split tokens at non-alphanumeric characters. The filter uses these characters as delimiters. For example: Super-Duper → Super, Duper
- Remove leading or trailing delimiters from each token. For example: XL---42+'Autocoder' → XL, 42, Autocoder
- Split tokens at letter case transitions. For example: PowerShot → Power, Shot
- Split tokens at letter-number transitions. For example: XL500 → XL, 500
- Remove the English possessive ('s) from the end of each token. For example: Neil's → Neil
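The default rules listed above can be approximated in a few lines of Python. This is a minimal sketch for ASCII input only, not the filter's actual implementation; the function name approximate_word_delimiter and the regular expressions are assumptions made for illustration.

```python
import re

# Illustrative approximation only (ASCII input); the names and patterns
# below are assumptions, not part of Elasticsearch or Lucene.
POSSESSIVE = re.compile(r"(?<=[A-Za-z])'s(?=[^A-Za-z0-9]|$)")
DELIMITERS = re.compile(r"[^A-Za-z0-9]+")
# A subword is a run of digits, an optional uppercase run followed by
# lowercase letters, or an uppercase-only run.
SUBWORD = re.compile(r"[0-9]+|[A-Z]*[a-z]+|[A-Z]+")


def approximate_word_delimiter(text):
    """Apply rough equivalents of the default rules to a single token."""
    # Possessive rule: drop an English possessive ('s) that ends a word.
    text = POSSESSIVE.sub("", text)
    tokens = []
    # Delimiter rules: split at runs of non-alphanumeric characters; empty
    # chunks from leading or trailing delimiters contribute nothing.
    for chunk in DELIMITERS.split(text):
        # Transition rules: split at lower-to-upper case changes and at
        # letter-number boundaries.
        tokens.extend(SUBWORD.findall(chunk))
    return tokens


print(approximate_word_delimiter("Neil's-Super-Duper-XL500--42+AutoCoder"))
```

Under these assumptions, the sketch prints ['Neil', 'Super', 'Duper', 'XL', '500', '42', 'Auto', 'Coder']. The real filter may behave differently for non-ASCII text and other edge cases.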
The word_delimiter filter was designed to remove punctuation from complex identifiers, such as product IDs or part numbers. For these use cases, we recommend using the word_delimiter filter with the keyword tokenizer.
Avoid using the word_delimiter filter to split hyphenated words, such as wi-fi. Because users often search for these words both with and without hyphens, we recommend using the synonym_graph filter instead.
Example
The following analyze API request uses the word_delimiter filter to split Neil's-Super-Duper-XL500--42+AutoCoder into normalized tokens using the filter’s default rules:
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'word_delimiter'
    ],
    text: "Neil's-Super-Duper-XL500--42+AutoCoder"
  }
)
puts response

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
The filter produces the following tokens:
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
Add to an analyzer
The following create index API request uses the word_delimiter filter to configure a new custom analyzer.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter" ]
        }
      }
    }
  }
}
Avoid using the word_delimiter filter with tokenizers that remove punctuation, such as the standard tokenizer. This could prevent the word_delimiter filter from splitting tokens correctly. It can also interfere with the filter’s configurable parameters, such as catenate_all or preserve_original. We recommend using the keyword or whitespace tokenizer instead.
Configurable parameters
catenate_all
  (Optional, Boolean) If true, the filter produces catenated tokens for chains of alphanumeric characters separated by non-alphabetic delimiters. For example: super-duper-xl-500 → [ super, superduperxl500, duper, xl, 500 ]. Defaults to false.
  When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
catenate_numbers
  (Optional, Boolean) If true, the filter produces catenated tokens for chains of numeric characters separated by non-alphabetic delimiters. For example: 01-02-03 → [ 01, 010203, 02, 03 ]. Defaults to false.
  When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
catenate_words
  (Optional, Boolean) If true, the filter produces catenated tokens for chains of alphabetical characters separated by non-alphabetic delimiters. For example: super-duper-xl → [ super, superduperxl, duper, xl ]. Defaults to false.
  When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
generate_number_parts
  (Optional, Boolean) If true, the filter includes tokens consisting of only numeric characters in the output. If false, the filter excludes these tokens from the output. Defaults to true.
generate_word_parts
  (Optional, Boolean) If true, the filter includes tokens consisting of only alphabetical characters in the output. If false, the filter excludes these tokens from the output. Defaults to true.
preserve_original
  (Optional, Boolean) If true, the filter includes the original version of any split tokens in the output. This original version includes non-alphanumeric delimiters. For example: super-duper-xl-500 → [ super-duper-xl-500, super, duper, xl, 500 ]. Defaults to false.
protected_words
  (Optional, array of strings) Array of tokens the filter won’t split.
protected_words_path
  (Optional, string) Path to a file that contains a list of tokens the filter won’t split.
  This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each token in the file must be separated by a line break.
split_on_case_change
  (Optional, Boolean) If true, the filter splits tokens at letter case transitions. For example: camelCase → [ camel, Case ]. Defaults to true.
split_on_numerics
  (Optional, Boolean) If true, the filter splits tokens at letter-number transitions. For example: j2se → [ j, 2, se ]. Defaults to true.
stem_english_possessive
  (Optional, Boolean) If true, the filter removes the English possessive ('s) from the end of each token. For example: O'Neil's → [ O, Neil ]. Defaults to true.
type_table
  (Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
  For example, the following array maps the plus (+) and hyphen (-) characters as alphanumeric, which means they won’t be treated as delimiters:
  [ "+ => ALPHA", "- => ALPHA" ]
  Supported types include:
  - ALPHA (Alphabetical)
  - ALPHANUM (Alphanumeric)
  - DIGIT (Numeric)
  - LOWER (Lowercase alphabetical)
  - SUBWORD_DELIM (Non-alphanumeric delimiter)
  - UPPER (Uppercase alphabetical)
type_table_path
  (Optional, string) Path to a file that contains custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
  For example, the contents of this file may contain the following:

  # Map the $, %, '.', and ',' characters to DIGIT
  # This might be useful for financial data.
  $ => DIGIT
  % => DIGIT
  . => DIGIT
  \\u002C => DIGIT

  # in some cases you might not want to split on ZWJ
  # this also tests the case where we need a bigger byte[]
  # see https://en.wikipedia.org/wiki/Zero-width_joiner
  \\u200D => ALPHANUM

  Supported types include:
  - ALPHA (Alphabetical)
  - ALPHANUM (Alphanumeric)
  - DIGIT (Numeric)
  - LOWER (Lowercase alphabetical)
  - SUBWORD_DELIM (Non-alphanumeric delimiter)
  - UPPER (Uppercase alphabetical)
  This file path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each mapping in the file must be separated by a line break.
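To make the catenate_all, catenate_numbers, and catenate_words parameters described earlier in this list more concrete, the following Python sketch reproduces their documented examples from a list of already-split parts. The catenate function and its mode argument are hypothetical and exist only for illustration. Token positions are not modeled, and the output order simply mirrors the examples given for those parameters.

```python
from itertools import groupby


def catenate(parts, mode):
    """Return parts plus catenated tokens; mode is "all", "words", or "numbers".

    Hypothetical helper for illustration. Assumes each part is either all
    digits or all letters, as produced by splitting with the default rules.
    """
    if mode not in ("all", "words", "numbers"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not parts:
        return []
    if mode == "all":
        # catenate_all joins the whole chain regardless of part type.
        runs = [list(parts)]
    else:
        # Otherwise, group adjacent parts into numeric and alphabetical runs.
        runs = [list(run) for _, run in groupby(parts, key=str.isdigit)]

    tokens = []
    for run in runs:
        is_numeric_run = run[0].isdigit()
        selected = mode == "all" or is_numeric_run == (mode == "numbers")
        if selected and len(run) > 1:
            # Emit the catenated token right after the first part,
            # matching the order used in the documented examples.
            tokens += [run[0], "".join(run), *run[1:]]
        else:
            tokens += run
    return tokens


print(catenate(["super", "duper", "xl", "500"], "all"))
print(catenate(["01", "02", "03"], "numbers"))
print(catenate(["super", "duper", "xl"], "words"))
```

A single mode argument keeps the sketch short. The real filter accepts the three parameters independently, and this sketch does not cover combinations of them.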
Customize
To customize the word_delimiter filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following request creates a word_delimiter filter that uses the following rules:
- Split tokens at non-alphanumeric characters, except the hyphen (-) character.
- Remove leading or trailing delimiters from each token.
- Do not split tokens at letter case transitions.
- Do not split tokens at letter-number transitions.
- Remove the English possessive ('s) from the end of each token.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_custom_word_delimiter_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_filter": {
          "type": "word_delimiter",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
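As a rough illustration of the rules this custom filter applies, the following Python sketch approximates them for ASCII input. It is not the filter's implementation; the function name approximate_custom_filter and its regular expressions are assumptions made for this example.

```python
import re

# Illustrative approximation only (ASCII input); names and patterns are
# assumptions. The hyphen counts as a token character, mirroring the
# "- => ALPHA" type_table entry in the request above.
TOKEN_CHARS = "A-Za-z0-9-"
POSSESSIVE = re.compile(rf"(?<=[A-Za-z])'s(?=[^{TOKEN_CHARS}]|$)")
DELIMITERS = re.compile(rf"[^{TOKEN_CHARS}]+")


def approximate_custom_filter(text):
    """Apply rough equivalents of the customized rules to a single token."""
    # Drop an English possessive ('s) that ends a word.
    text = POSSESSIVE.sub("", text)
    # Split only at delimiter runs. Case and letter-number transitions are
    # left alone, and empty chunks from leading or trailing delimiters
    # are dropped.
    return [chunk for chunk in DELIMITERS.split(text) if chunk]


print(approximate_custom_filter("Neil's+wi-fi+XL500/AutoCoder"))
```

Under these assumptions, the sketch prints ['Neil', 'wi-fi', 'XL500', 'AutoCoder']: the hyphen in wi-fi is kept, and neither XL500 nor AutoCoder is split. The real filter may differ in edge cases, so verify the actual behavior with the analyze API.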