Remove duplicates token filter
Removes duplicate tokens in the same position. The remove_duplicates filter uses Lucene's RemoveDuplicatesTokenFilter.
Example
To see how the remove_duplicates filter works, you first need to produce a token stream containing duplicate tokens in the same position.

The following analyze API request uses the keyword_repeat and stemmer filters to create stemmed and unstemmed tokens for jumping dog.
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer"
  ],
  "text": "jumping dog"
}
The API returns the following response. Note that the dog token in position 1 is duplicated: the keyword_repeat filter emits an unstemmed copy of each token, and because the stemmer leaves dog unchanged, the two copies are identical.
{ "tokens": [ { "token": "jumping", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 }, { "token": "jump", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 }, { "token": "dog", "start_offset": 8, "end_offset": 11, "type": "word", "position": 1 }, { "token": "dog", "start_offset": 8, "end_offset": 11, "type": "word", "position": 1 } ] }
To remove one of the duplicate dog tokens, add the remove_duplicates filter to the previous analyze API request.
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer",
    "remove_duplicates"
  ],
  "text": "jumping dog"
}
The API returns the following response. There is now only one dog token in position 1.
{ "tokens": [ { "token": "jumping", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 }, { "token": "jump", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 }, { "token": "dog", "start_offset": 8, "end_offset": 11, "type": "word", "position": 1 } ] }
Add to an analyzer
The following create index API request uses the remove_duplicates filter to configure a new custom analyzer.
This custom analyzer uses the keyword_repeat and stemmer filters to create a stemmed and unstemmed version of each token in a stream. The remove_duplicates filter then removes any duplicate tokens in the same position.
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "stemmer",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}
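To confirm the custom analyzer behaves as expected, you can run the analyze API against the index. This is a quick verification sketch; it assumes the my-index-000001 index above was created and references the my_custom_analyzer name defined in its settings.

GET my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "jumping dog"
}

The response should contain jumping and jump at position 0 and a single dog token at position 1, matching the deduplicated token stream shown earlier.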