Standard tokenizer

The standard tokenizer provides grammar-based tokenization, following the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29, and works well for most languages.

Example output

resp = client.indices.analyze(
    tokenizer="standard",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "standard",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response);

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence produces the following terms:

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
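The analyze API returns more than the bare terms: each entry in the response's tokens array also carries offsets, a type, and a position. As a minimal sketch, the terms listed above could be pulled out of such a response as follows. The sample_response dictionary is a hand-written, abbreviated stand-in covering only the first three tokens, and extract_terms is a hypothetical helper, not part of any Elasticsearch client.

```python
# Hand-written, abbreviated stand-in for an _analyze response body.
# Assumption: only the first three tokens of the example sentence are shown.
sample_response = {
    "tokens": [
        {"token": "The", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "2", "start_offset": 4, "end_offset": 5,
         "type": "<NUM>", "position": 1},
        {"token": "QUICK", "start_offset": 6, "end_offset": 11,
         "type": "<ALPHANUM>", "position": 2},
    ]
}


def extract_terms(response):
    """Return the term text of every token, in the order it was emitted.

    Hypothetical helper: works on any mapping that has a "tokens" list
    whose entries contain a "token" key.
    """
    return [token["token"] for token in response["tokens"]]


terms = extract_terms(sample_response)
print(terms)  # ['The', '2', 'QUICK']
```

With the Python client, the resp object from the example above should be usable in place of sample_response, assuming it supports the same key-based access.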

Configuration

The standard tokenizer accepts the following parameters:

max_token_length

The maximum token length. A token that exceeds this length is split at max_token_length intervals. Defaults to 255.
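The splitting rule can be illustrated in plain Python. This is only a sketch of the behavior described above, not the tokenizer's actual implementation: split_token is a hypothetical helper, the limit of 5 is chosen purely for demonstration, and Python's len counts Unicode code points, which may differ from how the tokenizer measures length for characters outside the Basic Multilingual Plane.

```python
def split_token(token, max_token_length=255):
    """Split a token into consecutive pieces of at most max_token_length.

    Hypothetical helper that mimics the documented rule: tokens longer
    than the limit are cut at max_token_length intervals.
    """
    if max_token_length < 1:
        raise ValueError("max_token_length must be at least 1")
    return [
        token[start:start + max_token_length]
        for start in range(0, len(token), max_token_length)
    ]


# Terms produced by the standard tokenizer for the example sentence.
terms = ["The", "2", "QUICK", "Brown", "Foxes", "jumped",
         "over", "the", "lazy", "dog's", "bone"]

# Apply a demonstration limit of 5 to every term.
limited = [piece for term in terms for piece in split_token(term, 5)]
print(limited)
# ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumpe', 'd',
#  'over', 'the', 'lazy', "dog's", 'bone']
```

Only jumped is longer than five characters, so it is the only term that is split; dog's is exactly five characters and passes through unchanged.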

Example configuration

In this example, we configure the standard tokenizer to have a max_token_length of 5 (for demonstration purposes):

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "standard",
                    "max_token_length": 5
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp1)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'standard',
            max_token_length: 5
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "my_tokenizer",
        },
      },
      tokenizer: {
        my_tokenizer: {
          type: "standard",
          max_token_length: 5,
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response1);

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above example produces the following terms:

[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]
