Letter tokenizer
The letter tokenizer breaks text into terms whenever it encounters a character that is not a letter. It does a reasonable job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
Example output
Python:

resp = client.indices.analyze(
    tokenizer="letter",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp)
Ruby:

response = client.indices.analyze(
  body: {
    tokenizer: 'letter',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response
JavaScript:

const response = await client.indices.analyze({
  tokenizer: "letter",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response);
Console:

POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
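If you run the Python example above, the individual token strings can be pulled out of the analyze response, which contains a top-level tokens array. This is a minimal sketch of that step, not part of the reference itself:

# Illustrative: collect just the token strings from the analyze
# response returned by the Python example above.
tokens = [entry["token"] for entry in resp["tokens"]]
print(tokens)
# ['The', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']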
Configuration
The letter tokenizer is not configurable.
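Although the letter tokenizer takes no parameters, it can still be referenced by name from a custom analyzer. The sketch below shows one way to wire it into an index at creation time using the Python client; the index name my-letter-index, the analyzer name letter_analyzer, and the field title are illustrative placeholders, not part of the reference above.

# Illustrative sketch: a custom analyzer that uses the built-in
# "letter" tokenizer, applied to a text field at index creation.
resp = client.indices.create(
    index="my-letter-index",          # placeholder index name
    settings={
        "analysis": {
            "analyzer": {
                "letter_analyzer": {  # placeholder analyzer name
                    "type": "custom",
                    "tokenizer": "letter",
                }
            }
        }
    },
    mappings={
        "properties": {
            "title": {"type": "text", "analyzer": "letter_analyzer"}
        }
    },
)
print(resp)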