Trim token filter
Removes leading and trailing whitespace from each token in a stream. While this can change the length of a token, the trim filter does not change a token’s offsets.
The trim filter uses Lucene’s TrimFilter.
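As an illustration only (plain Python with hypothetical helper names, not the Lucene implementation), the filter’s contract can be sketched like this: the token text is stripped, while start_offset and end_offset keep pointing at the original, untrimmed span of the input.

```python
def keyword_tokenize(text):
    """Emit the whole input as a single token, like the keyword tokenizer."""
    return [{"token": text, "start_offset": 0, "end_offset": len(text)}]

def trim_filter(tokens):
    """Strip leading/trailing whitespace from each token's text only."""
    return [
        {**tok, "token": tok["token"].strip()}  # offsets left untouched
        for tok in tokens
    ]

tokens = trim_filter(keyword_tokenize(" fox "))
print(tokens)  # [{'token': 'fox', 'start_offset': 0, 'end_offset': 5}]
```

Note that end_offset stays 5 even though the trimmed token is only three characters long.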
Many commonly used tokenizers, such as the standard or whitespace tokenizer, remove whitespace by default. When using these tokenizers, you don’t need to add a separate trim filter.
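A rough sketch of why (hypothetical helpers, not the actual tokenizer implementations): splitting on whitespace can never leave whitespace attached to a token, whereas a keyword-style tokenizer keeps the input verbatim.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace, like the whitespace tokenizer."""
    return text.split()

def keyword_tokenize(text):
    """Emit the whole input as one token, like the keyword tokenizer."""
    return [text]

print(whitespace_tokenize(" fox "))  # ['fox']    -- already trimmed
print(keyword_tokenize(" fox "))     # [' fox ']  -- whitespace survives
```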
Example
To see how the trim filter works, you first need to produce a token containing whitespace.

The following analyze API request uses the keyword tokenizer to produce a token for " fox ".
```python
resp = client.indices.analyze(
    tokenizer="keyword",
    text=" fox ",
)
print(resp)
```

```ruby
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    text: ' fox '
  }
)
puts response
```

```javascript
const response = await client.indices.analyze({
  tokenizer: "keyword",
  text: " fox ",
});
console.log(response);
```

```console
GET _analyze
{
  "tokenizer" : "keyword",
  "text" : " fox "
}
```
The API returns the following response. Note the " fox " token contains the original text’s whitespace.
```json
{
  "tokens": [
    {
      "token": " fox ",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
```
To remove the whitespace, add the trim filter to the previous analyze API request.
```python
resp = client.indices.analyze(
    tokenizer="keyword",
    filter=["trim"],
    text=" fox ",
)
print(resp)
```

```ruby
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'trim'
    ],
    text: ' fox '
  }
)
puts response
```

```javascript
const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["trim"],
  text: " fox ",
});
console.log(response);
```

```console
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["trim"],
  "text" : " fox "
}
```
The API returns the following response. The returned fox token does not include any leading or trailing whitespace. However, even though the trim filter shortened the token, the start_offset and end_offset remain the same.
```json
{
  "tokens": [
    {
      "token": "fox",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
```
Add to an analyzer
The following create index API request uses the trim filter to configure a new custom analyzer.
```python
resp = client.indices.create(
    index="trim_example",
    settings={
        "analysis": {
            "analyzer": {
                "keyword_trim": {
                    "tokenizer": "keyword",
                    "filter": ["trim"]
                }
            }
        }
    },
)
print(resp)
```

```ruby
response = client.indices.create(
  index: 'trim_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          keyword_trim: {
            tokenizer: 'keyword',
            filter: [
              'trim'
            ]
          }
        }
      }
    }
  }
)
puts response
```

```javascript
const response = await client.indices.create({
  index: "trim_example",
  settings: {
    analysis: {
      analyzer: {
        keyword_trim: {
          tokenizer: "keyword",
          filter: ["trim"],
        },
      },
    },
  },
});
console.log(response);
```

```console
PUT trim_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_trim": {
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        }
      }
    }
  }
}
```