Edge n-gram tokenizer
The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.
Edge N-Grams are useful for search-as-you-type queries.
When you need search-as-you-type for text that has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when autocompleting words that can appear in any order.
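The splitting and anchoring described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Elasticsearch's implementation: the function `edge_ngram_tokens` and its `is_token_char` parameter are hypothetical names introduced here, and the sketch assumes that a word shorter than `min_gram` produces no grams.

```python
from typing import Callable, List, Optional


def edge_ngram_tokens(
    text: str,
    min_gram: int = 1,
    max_gram: int = 2,
    is_token_char: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Emulate edge n-gram tokenization (illustrative, not Elasticsearch code)."""
    if is_token_char is None:
        # No character filter: the whole text is treated as a single word.
        words = [text] if text else []
    else:
        # Split wherever a character is rejected by the filter.
        words, current = [], []
        for ch in text:
            if is_token_char(ch):
                current.append(ch)
            elif current:
                words.append("".join(current))
                current = []
        if current:
            words.append("".join(current))

    grams = []
    for word in words:
        # Every gram is a prefix, so it is anchored to the start of the word.
        for length in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:length])
    return grams


print(edge_ngram_tokens("Quick Fox"))                     # ['Q', 'Qu']
print(edge_ngram_tokens("Quick Fox", 2, 5, str.isalpha))  # ['Qu', 'Qui', 'Quic', 'Quick', 'Fo', 'Fox']
```

Because each word contributes its own prefixes, a query prefix such as `Fo` can match regardless of where `Fox` appears in the text, which is the any-order advantage mentioned above.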
Example output
With the default settings, the edge_ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2:
POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}
The above sentence would produce the following terms:
[ Q, Qu ]
These default gram lengths are almost entirely useless. You need to configure the edge_ngram tokenizer before using it.
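As a rough sketch of what such a configuration might look like, the snippet below builds an index settings body that defines a custom edge_ngram tokenizer and an analyzer that uses it. The names `my_edge_ngram` and `my_autocomplete` are hypothetical, and the gram lengths and character classes are illustrative values, not recommendations. The body is only printed here; sending it to a cluster when creating an index is left out.

```python
import json

# Hypothetical names chosen for this sketch.
TOKENIZER_NAME = "my_edge_ngram"
ANALYZER_NAME = "my_autocomplete"

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # A custom analyzer that delegates tokenization to our tokenizer.
                ANALYZER_NAME: {"type": "custom", "tokenizer": TOKENIZER_NAME}
            },
            "tokenizer": {
                TOKENIZER_NAME: {
                    "type": "edge_ngram",
                    "min_gram": 2,    # illustrative value
                    "max_gram": 10,   # illustrative value
                    "token_chars": ["letter", "digit"],
                }
            },
        }
    }
}

# Print the JSON that would form the body of an index creation request.
print(json.dumps(index_body, indent=2))
```

With `token_chars` set this way, whitespace and punctuation act as word boundaries, so each word gets its own set of prefixes instead of the whole text being treated as one token.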
Configuration
The edge_ngram tokenizer accepts the following parameters:
- min_gram: Minimum length of characters in a gram. Defaults to 1.
- max_gram: Maximum length of characters in a gram. Defaults to 2.
- token_chars: Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters). Character classes may be any of the following:
  - letter: for example a, b, ï or 京
  - digit: for example 3 or 7
  - whitespace: for example " " or "\n"
  - punctuation: for example ! or "
  - symbol: for example $ or √
-