Standard analyzer
The standard analyzer is the default analyzer, used when none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
Example output
resp = client.indices.analyze(
    analyzer="standard",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp)

response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

const response = await client.indices.analyze({
  analyzer: "standard",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response);

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
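To make the two steps behind this output concrete (tokenize on word boundaries, then lowercase), here is a minimal Python sketch. It is only a rough approximation: the regular expression and the function name `approximate_standard_analyze` are assumptions made for illustration, not the Unicode Text Segmentation algorithm that Elasticsearch actually uses, and the two will diverge on many inputs.

```python
import re

# Rough stand-in for the standard tokenizer: runs of word characters,
# optionally joined by apostrophes so that "dog's" stays one token.
# This is NOT the full UAX #29 algorithm.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*")


def approximate_standard_analyze(text: str) -> list[str]:
    """Split text into word-like tokens, then lowercase each token."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


terms = approximate_standard_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
)
print(terms)
```

For the example sentence, this sketch yields the same eleven terms listed above; for authoritative results, use the `_analyze` API.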
Configuration
The standard analyzer accepts the following parameters:
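- max_token_length: The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
- stopwords_path: The path to a file containing stop words.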
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the standard analyzer to have a max_token_length of 5 (for demonstration purposes), and to use the pre-defined list of English stop words:
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_english_analyzer": {
                    "type": "standard",
                    "max_token_length": 5,
                    "stopwords": "_english_"
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_english_analyzer",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp1)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_english_analyzer: {
            type: 'standard',
            max_token_length: 5,
            stopwords: '_english_'
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_english_analyzer',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_english_analyzer: {
          type: "standard",
          max_token_length: 5,
          stopwords: "_english_",
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_english_analyzer",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response1);

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above example produces the following terms:
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
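The effect of the two settings can be traced with a small Python sketch: over-long tokens are cut into pieces of at most max_token_length characters (which is why "jumped" becomes "jumpe" and "d"), and stop words are dropped after lowercasing. Everything here is an assumption made for illustration: the regex tokenizer is a crude stand-in for the real one, the function names are hypothetical, and `ILLUSTRATIVE_STOPWORDS` is a small hand-picked subset, not the actual `_english_` list.

```python
import re

# Crude stand-in for the standard tokenizer (not UAX #29).
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*")

# Hypothetical, hand-picked subset used only for this demonstration;
# the real _english_ list is longer.
ILLUSTRATIVE_STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def split_at_intervals(token: str, max_token_length: int) -> list[str]:
    """Cut a token into consecutive pieces of at most max_token_length chars."""
    return [
        token[start:start + max_token_length]
        for start in range(0, len(token), max_token_length)
    ]


def approximate_configured_analyze(
    text: str,
    max_token_length: int = 255,
    stopwords: frozenset = frozenset(),
) -> list[str]:
    """Tokenize, split over-long tokens, lowercase, then drop stop words."""
    terms = []
    for token in TOKEN_PATTERN.findall(text):
        for piece in split_at_intervals(token, max_token_length):
            lowered = piece.lower()
            if lowered not in stopwords:
                terms.append(lowered)
    return terms


terms = approximate_configured_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    max_token_length=5,
    stopwords=ILLUSTRATIVE_STOPWORDS,
)
print(terms)
```

On the example sentence the sketch reproduces the terms shown above: both occurrences of "the" are removed, and "jumped" is the only token longer than five characters.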
Definition
The standard analyzer consists of:
- Tokenizer
  - Standard Tokenizer
- Token Filters
  - Lower Case Token Filter
  - Stop Token Filter (disabled by default)
If you need to customize the standard analyzer beyond the configuration parameters, you need to recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in standard analyzer, and you can use it as a starting point:
resp = client.indices.create(
    index="standard_example",
    settings={
        "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'standard_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          rebuilt_standard: {
            tokenizer: 'standard',
            filter: [
              'lowercase'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "standard_example",
  settings: {
    analysis: {
      analyzer: {
        rebuilt_standard: {
          tokenizer: "standard",
          filter: ["lowercase"],
        },
      },
    },
  },
});
console.log(response);
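As one possible way to use the rebuilt analyzer as a starting point, the sketch below builds the same settings in Python and appends extra token filters after lowercase. The helper name `rebuilt_standard_settings` is hypothetical, and the built-in stop filter is only an example of a filter you might add. The code builds and prints the settings without contacting a cluster.

```python
import json


def rebuilt_standard_settings(extra_filters=()):
    """Return index settings for the rebuilt standard analyzer.

    Any names in extra_filters are appended after "lowercase".
    """
    return {
        "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", *extra_filters],
                }
            }
        }
    }


# Example: add the built-in "stop" token filter after lowercasing.
settings = rebuilt_standard_settings(["stop"])
print(json.dumps(settings, indent=2))

# With a running cluster, the dict could be passed to the client, e.g.:
# client.indices.create(index="standard_example", settings=settings)
```

Order matters in the filter list: filters run in sequence, so a filter placed after lowercase sees tokens that are already lowercased.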