Keyword tokenizer
The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters to normalise output, e.g. lower-casing email addresses.
Example output
resp = client.indices.analyze(
    tokenizer="keyword",
    text="New York",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    text: 'New York'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "keyword",
  text: "New York",
});
console.log(response);
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
The above sentence would produce the following term:
[ New York ]
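The “noop” behaviour can be sketched in plain Python (a conceptual model only, not the Elasticsearch implementation), contrasting it with a tokenizer that splits on whitespace:

```python
def keyword_tokenize(text):
    # keyword tokenizer: the entire input becomes a single term
    return [text]

def whitespace_tokenize(text):
    # for contrast: a whitespace-style tokenizer splits on spaces
    return text.split()

print(keyword_tokenize("New York"))     # ['New York']
print(whitespace_tokenize("New York"))  # ['New', 'York']
```

The keyword tokenizer keeps multi-word input like "New York" intact as one term, which is why it suits structured values that should never be split.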
Combine with token filters
You can combine the keyword tokenizer with token filters to normalise structured data, such as product IDs or email addresses.
For example, the following analyze API request uses the keyword tokenizer and lowercase filter to convert an email address to lowercase.
resp = client.indices.analyze(
    tokenizer="keyword",
    filter=[
        "lowercase"
    ],
    text="john.SMITH@example.COM",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'lowercase'
    ],
    text: 'john.SMITH@example.COM'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["lowercase"],
  text: "john.SMITH@example.COM",
});
console.log(response);
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [
    "lowercase"
  ],
  "text": "john.SMITH@example.COM"
}
The request produces the following token:
[ john.smith@example.com ]
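The tokenizer-then-filter pipeline can be modelled in a few lines of plain Python (again a conceptual sketch, not the actual Elasticsearch code): the tokenizer runs first and produces terms, then each token filter transforms the resulting token stream.

```python
def keyword_tokenize(text):
    # keyword tokenizer: whole input as a single term
    return [text]

def lowercase_filter(tokens):
    # lowercase token filter: normalises each term
    return [token.lower() for token in tokens]

def analyze(text):
    # tokenizer first, then the token-filter chain
    return lowercase_filter(keyword_tokenize(text))

print(analyze("john.SMITH@example.COM"))  # ['john.smith@example.com']
```

Because the keyword tokenizer emits exactly one term, the filter chain effectively normalises the whole input value in place.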
Configuration
The keyword tokenizer accepts the following parameters:
buffer_size
    The number of characters read into the term buffer in a single pass. Defaults to 256.
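As an illustration, the tokenizer could be configured as part of a custom analyzer in the index settings; the index, tokenizer, and analyzer names below (my-index-000001, my_tokenizer, my_analyzer) are hypothetical placeholders:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "keyword",
          "buffer_size": 256
        }
      }
    }
  }
}

Leaving buffer_size at its default is usually fine; the buffer only controls how many characters are read per pass, not the maximum term length.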