Synonym token filter
The synonym token filter makes it easy to handle synonyms during the analysis process. Synonyms are configured using a configuration file. Here is an example:
response = client.indices.create(
  index: 'test_index',
  body: {
    settings: {
      index: {
        analysis: {
          analyzer: {
            synonym: {
              tokenizer: 'whitespace',
              filter: [
                'synonym'
              ]
            }
          },
          filter: {
            synonym: {
              type: 'synonym',
              synonyms_path: 'analysis/synonym.txt'
            }
          }
        }
      }
    }
  }
)
puts response
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.txt"
          }
        }
      }
    }
  }
}
The above configures a synonym filter, with a path of analysis/synonym.txt (relative to the config location). The synonym analyzer is then configured with the filter.
This filter tokenizes synonyms with whatever tokenizer and token filters appear before it in the chain.
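To check the analyzer, you can run a quick test with the _analyze API. This is a minimal sketch: it assumes analysis/synonym.txt contains a rule such as i-pod, i pod => ipod, which is not shown in the example above.

GET /test_index/_analyze
{
  "analyzer": "synonym",
  "text": "i-pod"
}

With that rule in place, the single whitespace token i-pod is replaced by ipod in the response.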
Additional settings are:

- updateable (defaults to false). If true allows reloading search analyzers to pick up changes to synonym files. Only to be used for search analyzers; a reload sketch follows the lenient example below.
- expand (defaults to true).
- lenient (defaults to false). If true ignores exceptions while parsing the synonym configuration. It is important to note that only those synonym rules which cannot get parsed are ignored. For instance, consider the following request:
response = client.indices.create(
  index: 'test_index',
  body: {
    settings: {
      index: {
        analysis: {
          analyzer: {
            synonym: {
              tokenizer: 'standard',
              filter: [
                'my_stop',
                'synonym'
              ]
            }
          },
          filter: {
            my_stop: {
              type: 'stop',
              stopwords: [
                'bar'
              ]
            },
            synonym: {
              type: 'synonym',
              lenient: true,
              synonyms: [
                'foo, bar => baz'
              ]
            }
          }
        }
      }
    }
  }
)
puts response
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": [ "my_stop", "synonym" ]
          }
        },
        "filter": {
          "my_stop": {
            "type": "stop",
            "stopwords": [ "bar" ]
          },
          "synonym": {
            "type": "synonym",
            "lenient": true,
            "synonyms": [ "foo, bar => baz" ]
          }
        }
      }
    }
  }
}
With the above request the word bar gets skipped but a mapping foo => baz is still added. However, if the mapping being added was foo, baz => bar, nothing would get added to the synonym list. This is because the target word for the mapping is itself eliminated because it was a stop word. Similarly, if the mapping was "bar, foo, baz" and expand was set to false, no mapping would get added, because when expand=false the target mapping is the first word. However, if expand=true then the mappings added would be equivalent to foo, baz => foo, baz, i.e., all mappings other than the stop word.
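When updateable is set to true, changes to the synonyms file can be picked up without reopening the index by calling the reload search analyzers API. A minimal sketch, assuming the filter is used only in a search analyzer; the index name reloadable_index and the title field are illustrative, not part of the examples above:

PUT /reloadable_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.txt",
            "updateable": true
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym"
      }
    }
  }
}

POST /reloadable_index/_reload_search_analyzers

After editing analysis/synonym.txt on each node, the reload call rebuilds the search analyzers so that subsequent queries pick up the new rules.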
tokenizer and ignore_case are deprecated
The tokenizer parameter controls the tokenizer that will be used to tokenize the synonyms; it exists for backwards compatibility with indices created before 6.0. The ignore_case parameter works with the tokenizer parameter only.
Two synonym formats are supported: Solr and WordNet.
Solr synonyms
The following is a sample format of the file:
# Blank lines and lines starting with pound are comments.

# Explicit mappings match any token sequence on the LHS of "=>"
# and replace with all alternatives on the RHS. These types of mappings
# ignore the expand parameter in the schema.
# Examples:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit

# Equivalent synonyms may be separated with commas and give
# no explicit mapping. In this case the mapping behavior will
# be taken from the expand parameter in the schema. This allows
# the same synonym file to be used in different synonym handling strategies.
# Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos
lol, laughing out loud

# If expand==true, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod

# If expand==false, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod

# Multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
# is equivalent to
foo => foo bar, baz
You can also define synonyms for the filter directly in the configuration file (note use of synonyms instead of synonyms_path):
response = client.indices.create(
  index: 'test_index',
  body: {
    settings: {
      index: {
        analysis: {
          filter: {
            synonym: {
              type: 'synonym',
              synonyms: [
                'i-pod, i pod => ipod',
                'universe, cosmos'
              ]
            }
          }
        }
      }
    }
  }
)
puts response
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "i-pod, i pod => ipod",
              "universe, cosmos"
            ]
          }
        }
      }
    }
  }
}
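To see how the expand setting from the sample file interacts with equivalent synonyms, here is a small sketch; the index and filter names are illustrative. With expand set to false, every term in the group is mapped to the first term only:

PUT /expand_false_example
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "contracting_synonyms": {
            "type": "synonym",
            "expand": false,
            "synonyms": [ "ipod, i-pod, i pod" ]
          }
        }
      }
    }
  }
}

Here the group behaves like the explicit mapping ipod, i-pod, i pod => ipod from the sample file above.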
However, it is recommended to define large synonym sets in a file using synonyms_path, because specifying them inline increases cluster size unnecessarily.
WordNet synonyms
Synonyms based on WordNet format can be declared using format:
response = client.indices.create(
  index: 'test_index',
  body: {
    settings: {
      index: {
        analysis: {
          filter: {
            synonym: {
              type: 'synonym',
              format: 'wordnet',
              synonyms: [
                "s(100000001,1,'abstain',v,1,0).",
                "s(100000001,2,'refrain',v,1,0).",
                "s(100000001,3,'desist',v,1,0)."
              ]
            }
          }
        }
      }
    }
  }
)
puts response
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "format": "wordnet",
            "synonyms": [
              "s(100000001,1,'abstain',v,1,0).",
              "s(100000001,2,'refrain',v,1,0).",
              "s(100000001,3,'desist',v,1,0)."
            ]
          }
        }
      }
    }
  }
}
Using synonyms_path to define WordNet synonyms in a file is supported as well.
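For example, a file-based variant might look like the following sketch; the file name analysis/wn_s.pl is an assumption (wn_s.pl is the conventional name of the WordNet prolog synonyms file), not something fixed by the filter:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "format": "wordnet",
            "synonyms_path": "analysis/wn_s.pl"
          }
        }
      }
    }
  }
}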
Parsing synonym files
Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries. Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here. Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms, e.g. asciifolding will only produce the folded version of the token. Others, e.g. multiplexer, word_delimiter_graph or ngram will throw an error.
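As a concrete illustration of this parsing behavior, consider a sketch where the synonym filter sits after a stemmer; the index name, filter names, and the rule running, sprinting are illustrative:

PUT /stemmed_synonyms_example
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "stemmed_synonyms": {
            "tokenizer": "standard",
            "filter": [ "lowercase", "porter_stem", "my_synonyms" ]
          }
        },
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [ "running, sprinting" ]
          }
        }
      }
    }
  }
}

Because the entries are parsed with lowercase and porter_stem applied first, the rule effectively becomes run, sprint and matches the stemmed tokens produced at index and search time.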
If you need to build analyzers that include both multi-token filters and synonym filters, consider using the multiplexer filter, with the multi-token filters in one branch and the synonym filter in the other.
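A minimal sketch of that recommendation, with illustrative index and filter names: word_delimiter_graph runs in one branch of the multiplexer and the synonym filter in the other, so the synonym entries are never parsed through the multi-token filter:

PUT /multiplexed_synonyms_example
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [ "my_multiplexer" ]
          }
        },
        "filter": {
          "my_multiplexer": {
            "type": "multiplexer",
            "filters": [ "word_delimiter_graph", "my_synonyms" ]
          },
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [ "universe, cosmos" ]
          }
        }
      }
    }
  }
}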