Analyze API

Performs analysis on a text string and returns the resulting tokens.

response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: 'Quick Brown Foxes!'
  }
)
puts response

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "Quick Brown Foxes!"
}

Request

GET /_analyze

POST /_analyze

GET /<index>/_analyze

POST /<index>/_analyze

Prerequisites

  • If the Elasticsearch security features are enabled, you must have the manage index privilege for the specified index.

Path parameters

<index>

(Optional, string) Index used to derive the analyzer.

If specified, the analyzer or field parameter overrides this value.

If neither analyzer nor field is specified, the analyze API uses the default analyzer for the index.

If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.

Query parameters

analyzer

(Optional, string) The name of the analyzer that should be applied to the provided text. This could be a built-in analyzer, or an analyzer that’s been configured in the index.

If this parameter is not specified, the analyze API uses the analyzer defined in the field’s mapping.

If no field is specified, the analyze API uses the default analyzer for the index.

If no index is specified, or the index does not have a default analyzer, the analyze API uses the standard analyzer.

attributes
(Optional, array of strings) Array of token attributes used to filter the output of the explain parameter.
char_filter
(Optional, array of strings) Array of character filters used to preprocess characters before the tokenizer. See Character filters reference for a list of character filters.
explain
(Optional, Boolean) If true, the response includes token attributes and additional details. Defaults to false. [preview] The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.
field

(Optional, string) Field used to derive the analyzer. To use this parameter, you must specify an index.

If specified, the analyzer parameter overrides this value.

If no field is specified, the analyze API uses the default analyzer for the index.

If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.

filter
(Optional, array of strings) Array of token filters to apply after the tokenizer. See Token filter reference for a list of token filters.
normalizer
(Optional, string) Normalizer to use to convert text into a single token. See Normalizers for a list of normalizers.
text
(Required, string or array of strings) Text to analyze. If an array of strings is provided, it is analyzed as a multi-value field.
tokenizer
(Optional, string) Tokenizer to use to convert text into tokens. See Tokenizer reference for a list of tokenizers.
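
The fallback order described for the analyzer, field, and <index> parameters can be summarized in a few lines of Ruby. This is only a sketch of the documented precedence: resolve_analyzer and its keyword arguments are hypothetical names introduced for illustration, and the method does not call Elasticsearch.

```ruby
# Hypothetical helper that mirrors the documented fallback order only.
# It does not contact Elasticsearch or read real mappings.
def resolve_analyzer(analyzer: nil, field_analyzer: nil, index_default: nil)
  # 1. An explicit analyzer parameter overrides everything else.
  return analyzer if analyzer
  # 2. Otherwise, the analyzer from the field's mapping is used.
  return field_analyzer if field_analyzer
  # 3. Otherwise, the index's default analyzer is used, if it has one.
  return index_default if index_default

  # 4. With no index, or no index default, the standard analyzer applies.
  'standard'
end

puts resolve_analyzer(analyzer: 'whitespace', field_analyzer: 'english') # whitespace
puts resolve_analyzer(field_analyzer: 'english', index_default: 'simple') # english
puts resolve_analyzer                                                     # standard
```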

Examples

No index specified

You can apply any of the built-in analyzers to the text string without specifying an index.

response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: 'this is a test'
  }
)
puts response

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}

Array of text strings

If the text parameter is provided as an array of strings, it is analyzed as a multi-value field.

response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: [
      'this is a test',
      'the second text'
    ]
  }
)
puts response

GET /_analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}

Custom analyzer

You can use the analyze API to test a custom transient analyzer built from tokenizers, token filters, and character filters. Token filters use the filter parameter, and character filters use the char_filter parameter:

response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'lowercase'
    ],
    text: 'this is a test'
  }
)
puts response

GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}

response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'lowercase'
    ],
    char_filter: [
      'html_strip'
    ],
    text: 'this is a <b>test</b>'
  }
)
puts response

GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}

Custom tokenizers, token filters, and character filters can be specified in the request body as follows:

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'lowercase',
      {
        type: 'stop',
        stopwords: [
          'a',
          'is',
          'this'
        ]
      }
    ],
    text: 'this is a test'
  }
)
puts response

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
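
To make the order of the stages concrete, the following Ruby sketch imitates the request above locally: a whitespace tokenizer, then a lowercase filter, then a stop filter. The lambdas and the analyze_sketch method are simplified, hypothetical stand-ins written for this illustration. They are not Elasticsearch's implementations and they ignore offsets, positions, and token types.

```ruby
# Simplified stand-ins for the analysis components named in the request.
# They only approximate the behavior for plain ASCII input.
WHITESPACE_TOKENIZER = ->(text) { text.split }
LOWERCASE_FILTER = ->(tokens) { tokens.map(&:downcase) }

# Builds a stop filter for the given stopword list.
def stop_filter(stopwords)
  ->(tokens) { tokens.reject { |token| stopwords.include?(token) } }
end

# Hypothetical helper: run the tokenizer first, then each token filter in order.
def analyze_sketch(text, tokenizer:, filters: [])
  filters.reduce(tokenizer.call(text)) { |tokens, filter| filter.call(tokens) }
end

p analyze_sketch(
  'this is a test',
  tokenizer: WHITESPACE_TOKENIZER,
  filters: [LOWERCASE_FILTER, stop_filter(%w[a is this])]
)
# prints ["test"]
```

Filters run in the order they are listed, so the stop filter here sees lowercased tokens.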

Specific index

You can also run the analyze API against a specific index:

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    text: 'this is a test'
  }
)
puts response

GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}

The request above analyzes the text "this is a test" using the default analyzer of the analyze_sample index. To use a different analyzer, specify the analyzer parameter:

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    analyzer: 'whitespace',
    text: 'this is a test'
  }
)
puts response

GET /analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}

Derive analyzer from a field mapping

The analyzer can be derived from a field mapping. For example:

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    field: 'obj1.field1',
    text: 'this is a test'
  }
)
puts response

GET /analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}

This request analyzes the text using the analyzer configured in the mapping for obj1.field1. If the mapping does not configure an analyzer, the default index analyzer is used.

Normalizer

To test a normalizer used by a keyword field, specify a normalizer that is defined in the analyze_sample index:

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    normalizer: 'my_normalizer',
    text: 'BaR'
  }
)
puts response

GET /analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}

You can also build a custom transient normalizer from token filters and character filters:

response = client.indices.analyze(
  body: {
    filter: [
      'lowercase'
    ],
    text: 'BaR'
  }
)
puts response

GET /_analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
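
The difference from an analyzer is that a normalizer has no tokenizer, so the whole input is kept as a single token. The Ruby sketch below illustrates that idea only. NORMALIZER_LOWERCASE and normalize_sketch are hypothetical names, and the lambda is a simplified stand-in for the lowercase filter, not Elasticsearch code.

```ruby
# Simplified stand-in for the lowercase token filter.
NORMALIZER_LOWERCASE = ->(value) { value.downcase }

# Hypothetical helper: with no tokenizer, the input is never split,
# so the result is always a one-element token list.
def normalize_sketch(text, filters)
  [filters.reduce(text) { |value, filter| filter.call(value) }]
end

p normalize_sketch('BaR', [NORMALIZER_LOWERCASE]) # prints ["bar"]
```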

Explain analyze

To get more detailed output, set explain to true (defaults to false). The response then includes all token attributes for each token. To limit the output to specific token attributes, set the attributes option.

The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'snowball'
    ],
    text: 'detailed output',
    explain: true,
    attributes: [
      'keyword'
    ]
  }
)
puts response

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] 
}

Setting attributes to ["keyword"] outputs only the keyword attribute.

The request returns the following result:

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false 
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false 
      } ]
    } ]
  }
}

Only the "keyword" attribute is output, because "attributes" was specified in the request.
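
As a sketch of how a client might read this structure, the following Ruby snippet parses an abbreviated copy of the response above (offsets and types omitted for brevity) and lists the tokens produced at each stage. The tokens_by_stage helper is a hypothetical name introduced here. Because the detail format is experimental, code like this may need to change with future versions.

```ruby
require 'json'

# Abbreviated copy of the explain response shown above.
EXPLAIN_RESPONSE = JSON.parse(<<~JSON)
  {
    "detail": {
      "custom_analyzer": true,
      "charfilters": [],
      "tokenizer": {
        "name": "standard",
        "tokens": [
          {"token": "detailed", "position": 0},
          {"token": "output", "position": 1}
        ]
      },
      "tokenfilters": [
        {
          "name": "snowball",
          "tokens": [
            {"token": "detail", "position": 0, "keyword": false},
            {"token": "output", "position": 1, "keyword": false}
          ]
        }
      ]
    }
  }
JSON

# Hypothetical helper: map each stage name to the token strings it emitted,
# tokenizer first, then token filters in the order they ran.
def tokens_by_stage(detail)
  stages = [detail['tokenizer']] + detail.fetch('tokenfilters', [])
  stages.to_h { |stage| [stage['name'], stage['tokens'].map { |t| t['token'] }] }
end

tokens_by_stage(EXPLAIN_RESPONSE['detail']).each do |name, tokens|
  puts "#{name}: #{tokens.join(', ')}"
end
# prints:
# standard: detailed, output
# snowball: detail, output
```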

Setting a token limit

Generating an excessive number of tokens may cause a node to run out of memory. The following setting limits the number of tokens that can be produced:

index.analyze.max_token_count
The maximum number of tokens that can be produced using the _analyze API. The default value is 10000. If more tokens than this limit are generated, the request returns an error. The _analyze endpoint without a specified index always uses 10000 as the limit. This setting lets you control the limit for a specific index:

response = client.indices.create(
  index: 'analyze_sample',
  body: {
    settings: {
      "index.analyze.max_token_count": 20_000
    }
  }
)
puts response

PUT /analyze_sample
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    text: 'this is a test'
  }
)
puts response

GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}
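
The rule for which limit applies can be sketched in Ruby as follows. This only models the behavior described above: the per-index setting is used when an index is specified and defines it, and 10000 is used otherwise. The method names and the ArgumentError are assumptions made for this illustration and do not reflect the error type Elasticsearch returns.

```ruby
DEFAULT_MAX_TOKEN_COUNT = 10_000

# Hypothetical helper: index_settings is nil when no index is specified.
def max_token_count(index_settings = nil)
  return DEFAULT_MAX_TOKEN_COUNT unless index_settings

  index_settings.fetch('index.analyze.max_token_count', DEFAULT_MAX_TOKEN_COUNT)
end

# Hypothetical check: raise once the produced token count exceeds the limit.
def check_token_count!(count, index_settings = nil)
  limit = max_token_count(index_settings)
  raise ArgumentError, "#{count} tokens exceeds the limit of #{limit}" if count > limit

  count
end

analyze_sample_settings = { 'index.analyze.max_token_count' => 20_000 }
puts max_token_count                          # 10000
puts max_token_count(analyze_sample_settings) # 20000
```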