IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Trim token filter

Removes leading and trailing whitespace from each token in a stream. While this can change the length of a token, the trim filter does not change a token’s offsets.

The trim filter uses Lucene’s TrimFilter.
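
The behavior described above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not Lucene's implementation: the `Token` class and `trim_tokens` function are hypothetical names, and Python's `str.strip()` is assumed to be a close enough stand-in for the filter's whitespace handling.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for an analyzed token; not an Elasticsearch type."""
    text: str
    start_offset: int
    end_offset: int
    position: int = 0


def trim_tokens(tokens: Iterable[Token]) -> Iterator[Token]:
    """Strip leading and trailing whitespace from each token's text."""
    for token in tokens:
        # Only the text changes; the offsets still describe the original span.
        yield replace(token, text=token.text.strip())


untrimmed = Token(text=" fox ", start_offset=0, end_offset=5)
trimmed = next(trim_tokens([untrimmed]))
print(trimmed)
```

The trimmed token's text is three characters long, yet its offsets still span all five characters of the original input.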

Many commonly used tokenizers, such as the standard or whitespace tokenizer, remove whitespace by default. When using these tokenizers, you don’t need to add a separate trim filter.
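
As a rough illustration of why a separate trim filter is redundant with such tokenizers, the sketch below approximates a whitespace-splitting tokenizer with a regular expression. The `whitespace_tokenize` function is a hypothetical helper, not part of any Elasticsearch client, and it only approximates the real tokenizer's rules.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start_offset, end_offset) for each run of non-whitespace."""
    # Splitting on whitespace means no token can begin or end with whitespace.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


tokens = whitespace_tokenize(" fox ")
print(tokens)  # [('fox', 1, 4)]
```

Because the whitespace is discarded during tokenization, the token's offsets cover only the characters of `fox`, and there is nothing left for a trim filter to remove.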

Example

To see how the trim filter works, you first need to produce a token containing whitespace.

The following analyze API request uses the keyword tokenizer to produce a token for " fox ".

resp = client.indices.analyze(
    tokenizer="keyword",
    text=" fox ",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    text: ' fox '
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "keyword",
  text: " fox ",
});
console.log(response);

GET _analyze
{
  "tokenizer" : "keyword",
  "text" : " fox "
}

The API returns the following response. Note that the " fox " token retains the original text’s whitespace.

{
  "tokens": [
    {
      "token": " fox ",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}

To remove the whitespace, add the trim filter to the previous analyze API request.

resp = client.indices.analyze(
    tokenizer="keyword",
    filter=[
        "trim"
    ],
    text=" fox ",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'trim'
    ],
    text: ' fox '
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["trim"],
  text: " fox ",
});
console.log(response);

GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["trim"],
  "text" : " fox "
}

The API returns the following response. The returned fox token does not include any leading or trailing whitespace. Although the token is now shorter, its start_offset and end_offset are unchanged.

{
  "tokens": [
    {
      "token": "fox",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}

Add to an analyzer

The following create index API request uses the trim filter to configure a new custom analyzer.

resp = client.indices.create(
    index="trim_example",
    settings={
        "analysis": {
            "analyzer": {
                "keyword_trim": {
                    "tokenizer": "keyword",
                    "filter": [
                        "trim"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'trim_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          keyword_trim: {
            tokenizer: 'keyword',
            filter: [
              'trim'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "trim_example",
  settings: {
    analysis: {
      analyzer: {
        keyword_trim: {
          tokenizer: "keyword",
          filter: ["trim"],
        },
      },
    },
  },
});
console.log(response);

PUT trim_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_trim": {
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        }
      }
    }
  }
}
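
Once the index exists, one way to check the analyzer is to run the analyze API against the index and name the custom analyzer. The sketch below assumes a Python `client` like the one used in the earlier examples and the `trim_example` index created above; `analyze_with_keyword_trim` is a hypothetical helper name.

```python
from typing import Any, List


def analyze_with_keyword_trim(
    client: Any, text: str, index: str = "trim_example"
) -> List[str]:
    """Analyze text with the index's keyword_trim analyzer and return token strings."""
    # The index is specified so the custom analyzer defined in its settings resolves.
    resp = client.indices.analyze(index=index, analyzer="keyword_trim", text=text)
    return [token["token"] for token in resp["tokens"]]


# Example usage, assuming a connected client and the trim_example index:
#   print(analyze_with_keyword_trim(client, " fox "))
```

With the analyzer configured as shown, the input " fox " should come back as a single token without surrounding whitespace.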
