WARNING: Version 2.3 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Language Analyzers
editLanguage Analyzers
editA set of analyzers aimed at analyzing specific language text. The
following types are supported:
arabic
,
armenian
,
basque
,
brazilian
,
bulgarian
,
catalan
,
cjk
,
czech
,
danish
,
dutch
,
english
,
finnish
,
french
,
galician
,
german
,
greek
,
hindi
,
hungarian
,
indonesian
,
irish
,
italian
,
latvian
,
lithuanian
,
norwegian
,
persian
,
portuguese
,
romanian
,
russian
,
sorani
,
spanish
,
swedish
,
turkish
,
thai
.
Configuring language analyzers
editStopwords
editAll analyzers support setting custom stopwords
either internally in
the config, or by using an external stopwords file by setting
stopwords_path
. Check Stop Analyzer for
more details.
Excluding words from stemming
editThe stem_exclusion
parameter allows you to specify an array
of lowercase words that should not be stemmed. Internally, this
functionality is implemented by adding the
keyword_marker
token filter
with the keywords
set to the value of the stem_exclusion
parameter.
The following analyzers support setting custom stem_exclusion
list:
arabic
, armenian
, basque
, catalan
, bulgarian
, catalan
,
czech
, finnish
, dutch
, english
, finnish
, french
, galician
,
german
, irish
, hindi
, hungarian
, indonesian
, italian
, latvian
,
lithuanian
, norwegian
, portuguese
, romanian
, russian
, sorani
,
spanish
, swedish
, turkish
.
Reimplementing language analyzers
editThe built-in language analyzers can be reimplemented as custom
analyzers
(as described below) in order to customize their behaviour.
If you do not intend to exclude words from being stemmed (the
equivalent of the stem_exclusion
parameter above), then you should remove
the keyword_marker
token filter from the custom analyzer configuration.
arabic
analyzer
editThe arabic
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "arabic_stop": { "type": "stop", "stopwords": "_arabic_" }, "arabic_keywords": { "type": "keyword_marker", "keywords": [] }, "arabic_stemmer": { "type": "stemmer", "language": "arabic" } }, "analyzer": { "arabic": { "tokenizer": "standard", "filter": [ "lowercase", "arabic_stop", "arabic_normalization", "arabic_keywords", "arabic_stemmer" ] } } } } }
armenian
analyzer
editThe armenian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "armenian_stop": { "type": "stop", "stopwords": "_armenian_" }, "armenian_keywords": { "type": "keyword_marker", "keywords": [] }, "armenian_stemmer": { "type": "stemmer", "language": "armenian" } }, "analyzer": { "armenian": { "tokenizer": "standard", "filter": [ "lowercase", "armenian_stop", "armenian_keywords", "armenian_stemmer" ] } } } } }
basque
analyzer
editThe basque
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "basque_stop": { "type": "stop", "stopwords": "_basque_" }, "basque_keywords": { "type": "keyword_marker", "keywords": [] }, "basque_stemmer": { "type": "stemmer", "language": "basque" } }, "analyzer": { "basque": { "tokenizer": "standard", "filter": [ "lowercase", "basque_stop", "basque_keywords", "basque_stemmer" ] } } } } }
brazilian
analyzer
editThe brazilian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "brazilian_stop": { "type": "stop", "stopwords": "_brazilian_" }, "brazilian_keywords": { "type": "keyword_marker", "keywords": [] }, "brazilian_stemmer": { "type": "stemmer", "language": "brazilian" } }, "analyzer": { "brazilian": { "tokenizer": "standard", "filter": [ "lowercase", "brazilian_stop", "brazilian_keywords", "brazilian_stemmer" ] } } } } }
bulgarian
analyzer
editThe bulgarian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "bulgarian_stop": { "type": "stop", "stopwords": "_bulgarian_" }, "bulgarian_keywords": { "type": "keyword_marker", "keywords": [] }, "bulgarian_stemmer": { "type": "stemmer", "language": "bulgarian" } }, "analyzer": { "bulgarian": { "tokenizer": "standard", "filter": [ "lowercase", "bulgarian_stop", "bulgarian_keywords", "bulgarian_stemmer" ] } } } } }
catalan
analyzer
editThe catalan
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "catalan_elision": { "type": "elision", "articles": [ "d", "l", "m", "n", "s", "t"] }, "catalan_stop": { "type": "stop", "stopwords": "_catalan_" }, "catalan_keywords": { "type": "keyword_marker", "keywords": [] }, "catalan_stemmer": { "type": "stemmer", "language": "catalan" } }, "analyzer": { "catalan": { "tokenizer": "standard", "filter": [ "catalan_elision", "lowercase", "catalan_stop", "catalan_keywords", "catalan_stemmer" ] } } } } }
cjk
analyzer
editThe cjk
analyzer could be reimplemented as a custom
analyzer as follows:
czech
analyzer
editThe czech
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "czech_stop": { "type": "stop", "stopwords": "_czech_" }, "czech_keywords": { "type": "keyword_marker", "keywords": [] }, "czech_stemmer": { "type": "stemmer", "language": "czech" } }, "analyzer": { "czech": { "tokenizer": "standard", "filter": [ "lowercase", "czech_stop", "czech_keywords", "czech_stemmer" ] } } } } }
danish
analyzer
editThe danish
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "danish_stop": { "type": "stop", "stopwords": "_danish_" }, "danish_keywords": { "type": "keyword_marker", "keywords": [] }, "danish_stemmer": { "type": "stemmer", "language": "danish" } }, "analyzer": { "danish": { "tokenizer": "standard", "filter": [ "lowercase", "danish_stop", "danish_keywords", "danish_stemmer" ] } } } } }
dutch
analyzer
editThe dutch
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "dutch_stop": { "type": "stop", "stopwords": "_dutch_" }, "dutch_keywords": { "type": "keyword_marker", "keywords": [] }, "dutch_stemmer": { "type": "stemmer", "language": "dutch" }, "dutch_override": { "type": "stemmer_override", "rules": [ "fiets=>fiets", "bromfiets=>bromfiets", "ei=>eier", "kind=>kinder" ] } }, "analyzer": { "dutch": { "tokenizer": "standard", "filter": [ "lowercase", "dutch_stop", "dutch_keywords", "dutch_override", "dutch_stemmer" ] } } } } }
english
analyzer
editThe english
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "english_stop": { "type": "stop", "stopwords": "_english_" }, "english_keywords": { "type": "keyword_marker", "keywords": [] }, "english_stemmer": { "type": "stemmer", "language": "english" }, "english_possessive_stemmer": { "type": "stemmer", "language": "possessive_english" } }, "analyzer": { "english": { "tokenizer": "standard", "filter": [ "english_possessive_stemmer", "lowercase", "english_stop", "english_keywords", "english_stemmer" ] } } } } }
finnish
analyzer
editThe finnish
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "finnish_stop": { "type": "stop", "stopwords": "_finnish_" }, "finnish_keywords": { "type": "keyword_marker", "keywords": [] }, "finnish_stemmer": { "type": "stemmer", "language": "finnish" } }, "analyzer": { "finnish": { "tokenizer": "standard", "filter": [ "lowercase", "finnish_stop", "finnish_keywords", "finnish_stemmer" ] } } } } }
french
analyzer
editThe french
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "french_elision": { "type": "elision", "articles_case": true, "articles": [ "l", "m", "t", "qu", "n", "s", "j", "d", "c", "jusqu", "quoiqu", "lorsqu", "puisqu" ] }, "french_stop": { "type": "stop", "stopwords": "_french_" }, "french_keywords": { "type": "keyword_marker", "keywords": [] }, "french_stemmer": { "type": "stemmer", "language": "light_french" } }, "analyzer": { "french": { "tokenizer": "standard", "filter": [ "french_elision", "lowercase", "french_stop", "french_keywords", "french_stemmer" ] } } } } }
galician
analyzer
editThe galician
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "galician_stop": { "type": "stop", "stopwords": "_galician_" }, "galician_keywords": { "type": "keyword_marker", "keywords": [] }, "galician_stemmer": { "type": "stemmer", "language": "galician" } }, "analyzer": { "galician": { "tokenizer": "standard", "filter": [ "lowercase", "galician_stop", "galician_keywords", "galician_stemmer" ] } } } } }
german
analyzer
editThe german
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "german_stop": { "type": "stop", "stopwords": "_german_" }, "german_keywords": { "type": "keyword_marker", "keywords": [] }, "german_stemmer": { "type": "stemmer", "language": "light_german" } }, "analyzer": { "german": { "tokenizer": "standard", "filter": [ "lowercase", "german_stop", "german_keywords", "german_normalization", "german_stemmer" ] } } } } }
greek
analyzer
editThe greek
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "greek_stop": { "type": "stop", "stopwords": "_greek_" }, "greek_lowercase": { "type": "lowercase", "language": "greek" }, "greek_keywords": { "type": "keyword_marker", "keywords": [] }, "greek_stemmer": { "type": "stemmer", "language": "greek" } }, "analyzer": { "greek": { "tokenizer": "standard", "filter": [ "greek_lowercase", "greek_stop", "greek_keywords", "greek_stemmer" ] } } } } }
hindi
analyzer
editThe hindi
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "hindi_stop": { "type": "stop", "stopwords": "_hindi_" }, "hindi_keywords": { "type": "keyword_marker", "keywords": [] }, "hindi_stemmer": { "type": "stemmer", "language": "hindi" } }, "analyzer": { "hindi": { "tokenizer": "standard", "filter": [ "lowercase", "indic_normalization", "hindi_normalization", "hindi_stop", "hindi_keywords", "hindi_stemmer" ] } } } } }
hungarian
analyzer
editThe hungarian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "hungarian_stop": { "type": "stop", "stopwords": "_hungarian_" }, "hungarian_keywords": { "type": "keyword_marker", "keywords": [] }, "hungarian_stemmer": { "type": "stemmer", "language": "hungarian" } }, "analyzer": { "hungarian": { "tokenizer": "standard", "filter": [ "lowercase", "hungarian_stop", "hungarian_keywords", "hungarian_stemmer" ] } } } } }
indonesian
analyzer
editThe indonesian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "indonesian_stop": { "type": "stop", "stopwords": "_indonesian_" }, "indonesian_keywords": { "type": "keyword_marker", "keywords": [] }, "indonesian_stemmer": { "type": "stemmer", "language": "indonesian" } }, "analyzer": { "indonesian": { "tokenizer": "standard", "filter": [ "lowercase", "indonesian_stop", "indonesian_keywords", "indonesian_stemmer" ] } } } } }
irish
analyzer
editThe irish
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "irish_elision": { "type": "elision", "articles": [ "h", "n", "t" ] }, "irish_stop": { "type": "stop", "stopwords": "_irish_" }, "irish_lowercase": { "type": "lowercase", "language": "irish" }, "irish_keywords": { "type": "keyword_marker", "keywords": [] }, "irish_stemmer": { "type": "stemmer", "language": "irish" } }, "analyzer": { "irish": { "tokenizer": "standard", "filter": [ "irish_stop", "irish_elision", "irish_lowercase", "irish_keywords", "irish_stemmer" ] } } } } }
italian
analyzer
editThe italian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "italian_elision": { "type": "elision", "articles": [ "c", "l", "all", "dall", "dell", "nell", "sull", "coll", "pell", "gl", "agl", "dagl", "degl", "negl", "sugl", "un", "m", "t", "s", "v", "d" ] }, "italian_stop": { "type": "stop", "stopwords": "_italian_" }, "italian_keywords": { "type": "keyword_marker", "keywords": [] }, "italian_stemmer": { "type": "stemmer", "language": "light_italian" } }, "analyzer": { "italian": { "tokenizer": "standard", "filter": [ "italian_elision", "lowercase", "italian_stop", "italian_keywords", "italian_stemmer" ] } } } } }
latvian
analyzer
editThe latvian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "latvian_stop": { "type": "stop", "stopwords": "_latvian_" }, "latvian_keywords": { "type": "keyword_marker", "keywords": [] }, "latvian_stemmer": { "type": "stemmer", "language": "latvian" } }, "analyzer": { "latvian": { "tokenizer": "standard", "filter": [ "lowercase", "latvian_stop", "latvian_keywords", "latvian_stemmer" ] } } } } }
lithuanian
analyzer
editThe lithuanian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "lithuanian_stop": { "type": "stop", "stopwords": "_lithuanian_" }, "lithuanian_keywords": { "type": "keyword_marker", "keywords": [] }, "lithuanian_stemmer": { "type": "stemmer", "language": "lithuanian" } }, "analyzer": { "lithuanian": { "tokenizer": "standard", "filter": [ "lowercase", "lithuanian_stop", "lithuanian_keywords", "lithuanian_stemmer" ] } } } } }
norwegian
analyzer
editThe norwegian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "norwegian_stop": { "type": "stop", "stopwords": "_norwegian_" }, "norwegian_keywords": { "type": "keyword_marker", "keywords": [] }, "norwegian_stemmer": { "type": "stemmer", "language": "norwegian" } }, "analyzer": { "norwegian": { "tokenizer": "standard", "filter": [ "lowercase", "norwegian_stop", "norwegian_keywords", "norwegian_stemmer" ] } } } } }
persian
analyzer
editThe persian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "char_filter": { "zero_width_spaces": { "type": "mapping", "mappings": [ "\\u200C=> "] } }, "filter": { "persian_stop": { "type": "stop", "stopwords": "_persian_" } }, "analyzer": { "persian": { "tokenizer": "standard", "char_filter": [ "zero_width_spaces" ], "filter": [ "lowercase", "arabic_normalization", "persian_normalization", "persian_stop" ] } } } } }
portuguese
analyzer
editThe portuguese
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "portuguese_stop": { "type": "stop", "stopwords": "_portuguese_" }, "portuguese_keywords": { "type": "keyword_marker", "keywords": [] }, "portuguese_stemmer": { "type": "stemmer", "language": "light_portuguese" } }, "analyzer": { "portuguese": { "tokenizer": "standard", "filter": [ "lowercase", "portuguese_stop", "portuguese_keywords", "portuguese_stemmer" ] } } } } }
romanian
analyzer
editThe romanian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "romanian_stop": { "type": "stop", "stopwords": "_romanian_" }, "romanian_keywords": { "type": "keyword_marker", "keywords": [] }, "romanian_stemmer": { "type": "stemmer", "language": "romanian" } }, "analyzer": { "romanian": { "tokenizer": "standard", "filter": [ "lowercase", "romanian_stop", "romanian_keywords", "romanian_stemmer" ] } } } } }
russian
analyzer
editThe russian
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "russian_stop": { "type": "stop", "stopwords": "_russian_" }, "russian_keywords": { "type": "keyword_marker", "keywords": [] }, "russian_stemmer": { "type": "stemmer", "language": "russian" } }, "analyzer": { "russian": { "tokenizer": "standard", "filter": [ "lowercase", "russian_stop", "russian_keywords", "russian_stemmer" ] } } } } }
sorani
analyzer
editThe sorani
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "sorani_stop": { "type": "stop", "stopwords": "_sorani_" }, "sorani_keywords": { "type": "keyword_marker", "keywords": [] }, "sorani_stemmer": { "type": "stemmer", "language": "sorani" } }, "analyzer": { "sorani": { "tokenizer": "standard", "filter": [ "sorani_normalization", "lowercase", "sorani_stop", "sorani_keywords", "sorani_stemmer" ] } } } } }
spanish
analyzer
editThe spanish
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "spanish_stop": { "type": "stop", "stopwords": "_spanish_" }, "spanish_keywords": { "type": "keyword_marker", "keywords": [] }, "spanish_stemmer": { "type": "stemmer", "language": "light_spanish" } }, "analyzer": { "spanish": { "tokenizer": "standard", "filter": [ "lowercase", "spanish_stop", "spanish_keywords", "spanish_stemmer" ] } } } } }
swedish
analyzer
editThe swedish
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "swedish_stop": { "type": "stop", "stopwords": "_swedish_" }, "swedish_keywords": { "type": "keyword_marker", "keywords": [] }, "swedish_stemmer": { "type": "stemmer", "language": "swedish" } }, "analyzer": { "swedish": { "tokenizer": "standard", "filter": [ "lowercase", "swedish_stop", "swedish_keywords", "swedish_stemmer" ] } } } } }
turkish
analyzer
editThe turkish
analyzer could be reimplemented as a custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "turkish_stop": { "type": "stop", "stopwords": "_turkish_" }, "turkish_lowercase": { "type": "lowercase", "language": "turkish" }, "turkish_keywords": { "type": "keyword_marker", "keywords": [] }, "turkish_stemmer": { "type": "stemmer", "language": "turkish" } }, "analyzer": { "turkish": { "tokenizer": "standard", "filter": [ "apostrophe", "turkish_lowercase", "turkish_stop", "turkish_keywords", "turkish_stemmer" ] } } } } }
thai
analyzer
editThe thai
analyzer could be reimplemented as a custom
analyzer as follows: