Hunspell token filter
Basic support for hunspell stemming. Hunspell dictionaries will be
picked up from a dedicated hunspell directory on the filesystem
(<path.conf>/hunspell). Each dictionary is expected to
have its own directory named after its associated locale (language).
This dictionary directory is expected to hold a single *.aff file and
one or more *.dic files (all of which will automatically be picked up).
For example, assuming the default hunspell location is used, the
following directory layout will define the en_US dictionary:
- conf
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff
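The layout rule above (one *.aff file and at least one *.dic file per locale directory) can be checked before starting a node. The following is a minimal sketch in Python, not part of Elasticsearch; the function name check_hunspell_dictionary and its error messages are hypothetical.

```python
from pathlib import Path


def check_hunspell_dictionary(hunspell_dir, locale):
    """Return (aff_file, dic_files) for one locale, or raise ValueError.

    Hypothetical helper: it only mirrors the layout rule from the text,
    i.e. exactly one *.aff file and at least one *.dic file inside
    <hunspell_dir>/<locale>/.
    """
    dict_dir = Path(hunspell_dir) / locale
    if not dict_dir.is_dir():
        raise ValueError(f"no dictionary directory for locale {locale!r}: {dict_dir}")

    aff_files = sorted(dict_dir.glob("*.aff"))
    dic_files = sorted(dict_dir.glob("*.dic"))

    if len(aff_files) != 1:
        raise ValueError(f"expected exactly one *.aff file, found {len(aff_files)}")
    if not dic_files:
        raise ValueError("expected at least one *.dic file, found none")
    return aff_files[0], dic_files
```

For the layout shown above, check_hunspell_dictionary("conf/hunspell", "en_US") would return the path to en_US.aff together with a one-element list holding en_US.dic.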
Each dictionary can be configured with one setting:
- ignore_case: If true, dictionary matching will be case insensitive (defaults to false).
This setting can be configured globally in elasticsearch.yml using
- indices.analysis.hunspell.dictionary.ignore_case
or for specific dictionaries:
- indices.analysis.hunspell.dictionary.en_US.ignore_case
It is also possible to add a settings.yml file under the dictionary
directory which holds these settings (this will override any other
settings defined in elasticsearch.yml).
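The precedence between these places can be sketched as a small lookup. This Python snippet is an illustration only, not Elasticsearch code. It assumes that the key inside settings.yml is named ignore_case and that a per-dictionary key in elasticsearch.yml takes precedence over the global one; neither detail is stated above. The function name resolve_ignore_case is hypothetical.

```python
def resolve_ignore_case(locale, node_settings, dictionary_settings=None):
    """Resolve ignore_case for one dictionary.

    node_settings: flat dict of keys as they would appear in elasticsearch.yml.
    dictionary_settings: dict parsed from <locale>/settings.yml, if present
    (assumption: the key there is simply "ignore_case").
    """
    prefix = "indices.analysis.hunspell.dictionary"

    # 1. settings.yml in the dictionary directory overrides elasticsearch.yml.
    if dictionary_settings and "ignore_case" in dictionary_settings:
        return bool(dictionary_settings["ignore_case"])

    # 2. Assumption: the per-dictionary key is more specific than the global key.
    per_dictionary_key = f"{prefix}.{locale}.ignore_case"
    if per_dictionary_key in node_settings:
        return bool(node_settings[per_dictionary_key])

    # 3. Global key, falling back to the documented default of false.
    return bool(node_settings.get(f"{prefix}.ignore_case", False))
```

With an empty node_settings dict and no settings.yml, the function returns the documented default, False.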
The hunspell stem filter can be used by configuring it in the analysis settings:
PUT /hunspell_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": ["lowercase", "en_US"]
        }
      },
      "filter": {
        "en_US": {
          "type": "hunspell",
          "locale": "en_US",
          "dedup": true
        }
      }
    }
  }
}
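When the same settings are needed for several locales, the request body can be generated instead of written by hand. The sketch below builds the body shown above as a Python dict; the helper names hunspell_index_body and render are hypothetical, and actually sending the body to a cluster is left out.

```python
import json


def hunspell_index_body(locale, analyzer_name="en", dedup=True, longest_only=None):
    """Build the index-creation body for one hunspell filter.

    The filter is named after the locale, as in the example request.
    longest_only is only included when the caller sets it explicitly.
    """
    filter_config = {"type": "hunspell", "locale": locale, "dedup": dedup}
    if longest_only is not None:
        filter_config["longest_only"] = longest_only

    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "tokenizer": "standard",
                        "filter": ["lowercase", locale],
                    }
                },
                "filter": {locale: filter_config},
            }
        }
    }


def render(body):
    """Serialize the body as indented JSON, ready to paste into a request."""
    return json.dumps(body, indent=2)
```

Calling render(hunspell_index_body("en_US")) yields JSON equivalent to the body of the PUT /hunspell_example request.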
The hunspell token filter accepts four options:
- locale: A locale for this filter. If this is unset, lang or language is used instead, so one of these has to be set.
- dictionary: The name of a dictionary. The path to your hunspell dictionaries should be configured via indices.analysis.hunspell.dictionary.location beforehand.
- dedup: If only unique terms should be returned, this needs to be set to true. Defaults to true.
- longest_only: If only the longest term should be returned, set this to true. Defaults to false: all possible stems are returned.
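The effect of dedup and longest_only on the stems produced for a single token can be sketched as follows. This is an illustration of the described behaviour, not the actual filter implementation; the function name select_stems is hypothetical, and the tie-breaking rule (first candidate wins when several stems share the maximum length) is an assumption.

```python
def select_stems(candidate_stems, dedup=True, longest_only=False):
    """Pick which stems to emit for one token.

    candidate_stems: all stems a dictionary lookup produced, in order.
    """
    stems = list(candidate_stems)

    if longest_only and stems:
        # Keep a single stem: the longest one. max() returns the first
        # candidate of maximal length (assumed tie-breaking rule).
        stems = [max(stems, key=len)]

    if dedup:
        # Drop repeated stems while preserving the original order.
        seen = set()
        unique = []
        for stem in stems:
            if stem not in seen:
                seen.add(stem)
                unique.append(stem)
        stems = unique

    return stems
```

With the made-up candidate list ["drink", "drinkable", "drink"], the defaults return ["drink", "drinkable"], while longest_only=True returns ["drinkable"].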
Unlike the snowball stemmers, which are algorithm based, this is a dictionary-lookup-based stemmer, so the quality of the stemming is determined by the quality of the dictionary.
Dictionary loading
By default, the standard Hunspell directory (config/hunspell/) is checked
for dictionaries when the node starts up, and any dictionaries found are
loaded automatically.
Dictionary loading can be deferred until a dictionary is actually used by setting
indices.analysis.hunspell.dictionary.lazy to true in the config file.
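The difference between eager and lazy loading can be pictured with a small registry. This is a conceptual sketch only and does not reflect how Elasticsearch implements dictionary loading; the class DictionaryRegistry and its loader callback are hypothetical.

```python
class DictionaryRegistry:
    """Holds dictionaries by locale, loading them eagerly or on first use."""

    def __init__(self, available_locales, loader, lazy=False):
        self._available = set(available_locales)
        self._loader = loader  # callable: locale -> loaded dictionary
        self._loaded = {}
        if not lazy:
            # Eager mode: load everything up front, as at node startup.
            for locale in sorted(self._available):
                self._loaded[locale] = loader(locale)

    def get(self, locale):
        if locale not in self._available:
            raise KeyError(locale)
        if locale not in self._loaded:
            # Lazy mode: the first request for a locale triggers loading.
            self._loaded[locale] = self._loader(locale)
        return self._loaded[locale]
```

In lazy mode, startup costs nothing, but the first use of each dictionary pays the loading cost; in eager mode, the cost is paid once at startup for every dictionary, including unused ones.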
References
Hunspell is a spell checker and morphological analyzer designed for languages with rich morphology and complex word compounding and character encoding.
- Wikipedia, http://en.wikipedia.org/wiki/Hunspell
- Source code, http://hunspell.sourceforge.net/
- Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries
- Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/
- Chromium Hunspell dictionaries, http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/