Edge n-gram token filter

Forms an n-gram of a specified length from the beginning of a token.
For example, you can use the edge_ngram token filter to change quick to qu.
When not customized, the filter creates 1-character edge n-grams by default.
This filter uses Lucene’s EdgeNGramTokenFilter.
The edge_ngram filter is similar to the ngram token filter. However, the edge_ngram filter only outputs n-grams that start at the beginning of a token. These edge n-grams are useful for search-as-you-type queries.
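The difference between the two filters can be sketched in plain Python (a simplified model for illustration, not the Lucene implementation):

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of length min_gram..max_gram (sliding window)."""
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

def edge_ngrams(token, min_gram, max_gram):
    """Only the prefixes of length min_gram..max_gram."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(ngrams("quick", 1, 2))       # every 1- and 2-character substring
print(edge_ngrams("quick", 1, 2))  # ['q', 'qu'] -- prefixes only
```

Because edge n-grams are anchored at the start of the token, they grow linearly with token length instead of quadratically, which is one reason they suit prefix-style queries.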
Example

The following analyze API request uses the edge_ngram filter to convert the quick brown fox jumps to 1-character and 2-character edge n-grams:
Python:

```python
resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 2
        }
    ],
    text="the quick brown fox jumps",
)
print(resp)
```

Ruby:

```ruby
response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      {
        type: 'edge_ngram',
        min_gram: 1,
        max_gram: 2
      }
    ],
    text: 'the quick brown fox jumps'
  }
)
puts response
```

JavaScript:

```javascript
const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: [
    {
      type: "edge_ngram",
      min_gram: 1,
      max_gram: 2,
    },
  ],
  text: "the quick brown fox jumps",
});
console.log(response);
```

Console:

```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 2 }
  ],
  "text": "the quick brown fox jumps"
}
```
The filter produces the following tokens:
[ t, th, q, qu, b, br, f, fo, j, ju ]
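The same output can be reproduced with a short Python sketch (a simplified model: whitespace splitting stands in for the standard tokenizer, which behaves identically on this lowercase sample):

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Prefixes of length min_gram..max_gram, capped at the token length."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

text = "the quick brown fox jumps"
tokens = [gram for word in text.split() for gram in edge_ngrams(word)]
print(tokens)  # ['t', 'th', 'q', 'qu', 'b', 'br', 'f', 'fo', 'j', 'ju']
```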
Add to an analyzer

The following create index API request uses the edge_ngram filter to configure a new custom analyzer.
Python:

```python
resp = client.indices.create(
    index="edge_ngram_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_edge_ngram": {
                    "tokenizer": "standard",
                    "filter": ["edge_ngram"]
                }
            }
        }
    },
)
print(resp)
```

Ruby:

```ruby
response = client.indices.create(
  index: 'edge_ngram_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          standard_edge_ngram: {
            tokenizer: 'standard',
            filter: ['edge_ngram']
          }
        }
      }
    }
  }
)
puts response
```

JavaScript:

```javascript
const response = await client.indices.create({
  index: "edge_ngram_example",
  settings: {
    analysis: {
      analyzer: {
        standard_edge_ngram: {
          tokenizer: "standard",
          filter: ["edge_ngram"],
        },
      },
    },
  },
});
console.log(response);
```

Console:

```console
PUT edge_ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_edge_ngram": {
          "tokenizer": "standard",
          "filter": ["edge_ngram"]
        }
      }
    }
  }
}
```
Configurable parameters

max_gram
(Optional, integer) Maximum character length of a gram. For custom token filters, defaults to 2. For the built-in edge_ngram filter, defaults to 1.

min_gram
(Optional, integer) Minimum character length of a gram. Defaults to 1.

preserve_original
(Optional, Boolean) Emits the original token when set to true. Defaults to false.

side
(Optional, string) Deprecated in 8.16.0; use the reverse token filter instead. Indicates whether to truncate tokens from the front or back. Defaults to front.
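How these parameters interact can be sketched in plain Python (a simplified model for illustration, not the Lucene implementation; the deprecated side option is included only to show its effect):

```python
def edge_ngrams(token, min_gram=1, max_gram=1,
                preserve_original=False, side="front"):
    """Simplified model of the edge_ngram filter's parameters."""
    grams = []
    for n in range(min_gram, min(max_gram, len(token)) + 1):
        # side="front" takes prefixes; side="back" takes suffixes
        grams.append(token[:n] if side == "front" else token[-n:])
    if preserve_original and token not in grams:
        grams.append(token)  # emit the full token as well
    return grams

print(edge_ngrams("quick", 1, 2))                          # ['q', 'qu']
print(edge_ngrams("quick", 1, 2, preserve_original=True))  # ['q', 'qu', 'quick']
print(edge_ngrams("quick", 1, 2, side="back"))             # ['k', 'ck']
```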
Customize

To customize the edge_ngram filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom edge_ngram filter that forms n-grams between 3 and 5 characters.
Python:

```python
resp = client.indices.create(
    index="edge_ngram_custom_example",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": ["3_5_edgegrams"]
                }
            },
            "filter": {
                "3_5_edgegrams": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 5
                }
            }
        }
    },
)
print(resp)
```

Ruby:

```ruby
response = client.indices.create(
  index: 'edge_ngram_custom_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: ['3_5_edgegrams']
          }
        },
        filter: {
          "3_5_edgegrams": {
            type: 'edge_ngram',
            min_gram: 3,
            max_gram: 5
          }
        }
      }
    }
  }
)
puts response
```

JavaScript:

```javascript
const response = await client.indices.create({
  index: "edge_ngram_custom_example",
  settings: {
    analysis: {
      analyzer: {
        default: {
          tokenizer: "whitespace",
          filter: ["3_5_edgegrams"],
        },
      },
      filter: {
        "3_5_edgegrams": {
          type: "edge_ngram",
          min_gram: 3,
          max_gram: 5,
        },
      },
    },
  },
});
console.log(response);
```

Console:

```console
PUT edge_ngram_custom_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": ["3_5_edgegrams"]
        }
      },
      "filter": {
        "3_5_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}
```
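A plain-Python sketch (a simplified model, not the real analyzer) of what this custom filter produces for a whitespace-tokenized sentence; tokens shorter than min_gram yield no grams, which matches the filter dropping them:

```python
def edge_ngrams(token, min_gram=3, max_gram=5):
    """Prefixes of length min_gram..max_gram, capped at the token length."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

for word in "the quick brown fox jumps".split():
    print(word, "->", edge_ngrams(word))
# the -> ['the']
# quick -> ['qui', 'quic', 'quick']
# brown -> ['bro', 'brow', 'brown']
# fox -> ['fox']
# jumps -> ['jum', 'jump', 'jumps']
```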
Limitations of the max_gram parameter

The edge_ngram filter's max_gram value limits the character length of tokens. When the edge_ngram filter is used with an index analyzer, this means search terms longer than the max_gram length may not match any indexed terms.

For example, if the max_gram is 3, searches for apple won't match the indexed term app.

To account for this, you can use the truncate filter with a search analyzer to shorten search terms to the max_gram character length. However, this could return irrelevant results.
For example, if the max_gram is 3 and search terms are truncated to three characters, the search term apple is shortened to app. This means searches for apple return any indexed terms matching app, such as apply, snapped, and apple.

We recommend testing both approaches to see which best fits your use case and desired search experience.
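The trade-off can be illustrated with a small plain-Python model (a sketch, not the real index): without truncation the long search term matches nothing, and with truncation it matches related but possibly unintended terms:

```python
def edge_ngrams(token, min_gram=1, max_gram=3):
    """Prefixes of length min_gram..max_gram, capped at the token length."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# Index-time: each document term is expanded into edge n-grams (max_gram=3).
index = {term: set(edge_ngrams(term)) for term in ["apple", "apply", "app"]}

query = "apple"
print([t for t, grams in index.items() if query in grams])      # [] -- no match
truncated = query[:3]  # what a truncate filter would do at search time
print([t for t, grams in index.items() if truncated in grams])  # all three match
```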