WARNING: Version 1.4 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Phrase Suggester
In order to understand the format of suggestions, please read the Suggesters page first.
The term suggester provides a convenient API for accessing word alternatives on a per-token basis within a certain string distance. The API allows accessing each token in the stream individually, while the selection of suggestions is left to the API consumer. Often, however, pre-selected suggestions are required in order to present them to the end user. The phrase suggester adds logic on top of the term suggester to select entire corrected phrases, instead of individual tokens, weighted based on n-gram language models. In practice, this suggester is able to make better decisions about which tokens to pick based on co-occurrence and frequencies.
API Example
The phrase request is defined alongside the query part in the JSON request:
curl -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 1,
        "real_word_error_likelihood" : 0.95,
        "max_errors" : 0.5,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_length" : 1
        } ],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}'
The response contains suggestions ordered by score, with the most likely spelling correction first. In this case we received the expected correction xorr the god jewel first, while in the second correction only one of the errors is corrected. Note that the request is executed with max_errors set to 0.5, so 50% of the terms can contain misspellings (see the parameter descriptions below).
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2938,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "simple_phrase" : [ {
      "text" : "Xor the Got-Jewel",
      "offset" : 0,
      "length" : 17,
      "options" : [ {
        "text" : "xorr the god jewel",
        "highlighted": "<em>xorr</em> the <em>god</em> jewel",
        "score" : 0.17877324
      }, {
        "text" : "xor the god jewel",
        "highlighted": "xor the <em>god</em> jewel",
        "score" : 0.14231323
      } ]
    } ]
  }
}
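As a rough illustration of consuming this response, the following Python sketch parses a trimmed copy of the JSON above and picks the highest-scoring option. The helper name best_suggestion is hypothetical and not part of any Elasticsearch client; only the standard json module is used.

```python
import json

# Trimmed copy of the example response above (only the "suggest" part).
RESPONSE = json.loads("""
{
  "suggest": {
    "simple_phrase": [
      {
        "text": "Xor the Got-Jewel",
        "offset": 0,
        "length": 17,
        "options": [
          {"text": "xorr the god jewel",
           "highlighted": "<em>xorr</em> the <em>god</em> jewel",
           "score": 0.17877324},
          {"text": "xor the god jewel",
           "highlighted": "xor the <em>god</em> jewel",
           "score": 0.14231323}
        ]
      }
    ]
  }
}
""")


def best_suggestion(response, name):
    """Return the highest-scoring option for the named suggestion, or None.

    Hypothetical helper: it compares scores explicitly instead of relying
    on the order in which the options were returned.
    """
    options = [
        option
        for entry in response.get("suggest", {}).get(name, [])
        for option in entry.get("options", [])
    ]
    return max(options, key=lambda option: option["score"], default=None)


best = best_suggestion(RESPONSE, "simple_phrase")
print(best["text"])         # the corrected phrase
print(best["highlighted"])  # the same phrase with changed terms wrapped in <em>
```

A caller that only needs a "did you mean" string could display the highlighted field directly and fall back to the original text when the helper returns None.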
Basic Phrase suggest API parameters
|
The name of the field used for n-gram lookups for the language model. The suggester uses this field to gather statistics to score corrections. This field is mandatory. |
|
sets max size of the n-grams (shingles) in the |
|
the likelihood of a term being misspelled even if the term exists in the dictionary. The default is
|
|
The confidence level defines a factor applied to the input phrase's score, which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance a confidence level of |
|
the maximum percentage of the terms that can at most be considered misspellings in order to form a correction. This method accepts a float value in the range |
|
the separator that is used to separate terms in the bigram field. If not set the whitespace character is used as a separator. |
|
the number of candidates that are generated for each individual query term. Low numbers like |
|
Sets the analyzer to analyze the suggest text with. Defaults to the search analyzer of the suggest field passed via |
|
Sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase, only the top N suggestions are returned based on the |
|
Sets the text / query to provide suggestions for. |
|
Sets up suggestion highlighting. If not provided then
no |
|
Checks each suggestion against the specified |
curl -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "field" : "bigram",
        "size" : 1,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_length" : 1
        } ],
        "collate": {
          "query": {
            "match": {
              "{{field_name}}" : "{{suggestion}}"
            }
          },
          "params": {"field_name" : "title"},
          "preference": "_primary",
          "prune": true
        }
      }
    }
  }
}'
This query will be run once for every suggestion. |
|
The |
|
An additional |
|
The default |
|
All suggestions will be returned with an extra |
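To make the confidence description above more concrete, here is a minimal Python sketch of the thresholding it describes: the input phrase's score is multiplied by the confidence factor, and only candidates that score strictly higher are kept. The function name and all scores are made up for illustration; this is not the suggester's actual implementation.

```python
def filter_by_confidence(input_score, candidates, confidence):
    """Keep candidates whose score is higher than confidence * input_score.

    `candidates` is a list of (text, score) pairs. The names and data are
    hypothetical; only the thresholding rule is taken from the text above.
    """
    threshold = confidence * input_score
    return [(text, score) for text, score in candidates if score > threshold]


# Illustrative scores, not real suggester output.
input_score = 0.10
candidates = [("xorr the god jewel", 0.18), ("xor the god jewel", 0.14)]

for confidence in (1.0, 1.5, 2.0):
    kept = filter_by_confidence(input_score, candidates, confidence)
    print(confidence, [text for text, _ in kept])
```

Under these assumed scores, raising the confidence factor progressively removes candidates until none clears the threshold.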
Smoothing Models
The phrase suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (grams that appear at least once in the index).
|
a simple backoff model that backs off to lower
order n-gram models if the higher order count is |
|
a smoothing model that uses an additive smoothing where a
constant (typically |
|
a smoothing model that takes the weighted
mean of the unigrams, bigrams and trigrams based on user supplied
weights (lambdas). Linear Interpolation doesn’t have any default values.
All parameters ( |
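The three ideas described above (backing off to a lower-order model, additive smoothing, and a weighted mean of unigram, bigram and trigram estimates) can be sketched in a few lines of Python. This is a textbook-style illustration over a toy token list, not the suggester's scoring code: the function names, the discount of 0.4, the constant alpha of 1.0 and the lambda weights are all assumptions chosen for the example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of size n; keys are tuples of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def stupid_backoff(bigram, bigrams, unigrams, total, discount=0.4):
    """Use the bigram estimate if the bigram was seen; otherwise back off
    to the unigram estimate scaled by an assumed discount."""
    if bigrams[bigram] > 0:
        return bigrams[bigram] / unigrams[bigram[:1]]
    return discount * unigrams[bigram[1:]] / total


def laplace(bigram, bigrams, unigrams, vocab_size, alpha=1.0):
    """Additive smoothing: add an assumed constant alpha to every count."""
    return (bigrams[bigram] + alpha) / (unigrams[bigram[:1]] + alpha * vocab_size)


def linear_interpolation(trigram, trigrams, bigrams, unigrams, total, lambdas):
    """Weighted mean of trigram, bigram and unigram relative frequencies."""
    tri_lambda, bi_lambda, uni_lambda = lambdas
    history2, history1, word = trigram[:2], trigram[1:2], trigram[2:]
    p_tri = trigrams[trigram] / bigrams[history2] if bigrams[history2] else 0.0
    p_bi = bigrams[trigram[1:]] / unigrams[history1] if unigrams[history1] else 0.0
    p_uni = unigrams[word] / total
    return tri_lambda * p_tri + bi_lambda * p_bi + uni_lambda * p_uni


# Toy "index" contents, purely illustrative.
tokens = "xorr the god jewel and the god king".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
total, vocab_size = len(tokens), len(unigrams)

seen = stupid_backoff(("the", "god"), bigrams, unigrams, total)
unseen = stupid_backoff(("the", "jewel"), bigrams, unigrams, total)
smoothed = laplace(("the", "got"), bigrams, unigrams, vocab_size)
mixed = linear_interpolation(("the", "god", "jewel"), trigrams, bigrams,
                             unigrams, total, lambdas=(0.5, 0.3, 0.2))
print(seen, unseen, smoothed, mixed)
```

In all three cases an unseen gram still receives a non-zero score, which is the point of smoothing: a correction is not ruled out merely because one of its shingles is missing from the index.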
Candidate Generators
The phrase suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a term suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms to form suggestion candidates.
Currently only one type of candidate generator is supported, the direct_generator. The phrase suggest API accepts a list of generators under the key direct_generator; each of the generators in the list is called per term in the original text.
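The way per-term candidates are combined into whole-phrase candidates can be pictured with a small Python sketch. The candidate lists below are invented for illustration, and the exhaustive Cartesian product is a simplification: it only shows the shape of the search space, whereas the suggester scores and prunes combinations rather than enumerating all of them.

```python
from itertools import product


def phrase_candidates(per_term_candidates):
    """Combine one candidate per term into every possible whole phrase."""
    return [" ".join(choice) for choice in product(*per_term_candidates)]


# Hypothetical generator output: one list per term of the input text,
# each list containing the original term plus its alternatives.
per_term = [
    ["xor", "xorr"],
    ["the"],
    ["got", "god"],
    ["jewel"],
]

phrases = phrase_candidates(per_term)
for phrase in phrases:
    print(phrase)
```

With two alternatives for two of the four terms, four phrase candidates result; each would then be scored against the n-gram language model.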
Direct Generators
The direct generators support the following parameters:
|
The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion. |
|
The maximum corrections to be returned per suggest text token. |
|
The suggest mode controls which suggestions are included, or for which suggest text terms suggestions should be generated. Three possible values can be specified:
|
|
The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2. |
|
The minimum number of prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don’t occur at the beginning of terms. (Old name "prefix_len" is deprecated) |
|
The minimum length a suggest text term must have in order to be included. Defaults to 4. (Old name "min_word_len" is deprecated) |
|
A factor that is used to multiply with the
|
|
The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified then the number cannot be fractional. The shard level document frequencies are used for this option. |
|
The maximum threshold in number of documents in which a suggest text token can exist in order to be included. Can be a relative percentage number (e.g. 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified, then it cannot be fractional. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked. High frequency terms are usually spelled correctly, and excluding them also improves spellcheck performance. The shard level document frequencies are used for this option. |
|
a filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated. |
|
a filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer. |
The following example shows a phrase suggest call with two generators: the first uses a field containing ordinary indexed terms, and the second uses a field whose terms are indexed with a reverse filter (tokens are indexed in reverse order). This is used to overcome the limitation that direct generators require a constant prefix in order to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.
curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 4,
        "real_word_error_likelihood" : 0.95,
        "confidence" : 2.0,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_length" : 1
        }, {
          "field" : "reverse",
          "suggest_mode" : "always",
          "min_word_length" : 1,
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}'
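Why a reversed field helps can be shown with a small Python sketch. It assumes a toy generator that only considers dictionary terms sharing a fixed-length prefix with the input term, which mimics the constant-prefix limitation described above. All function names, parameters and term lists are hypothetical and do not reflect how the suggester is implemented.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def generate(term, terms, prefix_chars=1, max_distance=2):
    """Return terms that share the first `prefix_chars` characters with
    `term` and lie within `max_distance` edits of it."""
    prefix = term[:prefix_chars]
    return [t for t in terms
            if t != term
            and t.startswith(prefix)
            and edit_distance(term, t) <= max_distance]


def generate_reversed(term, reversed_terms, **kwargs):
    """Reverse the term (pre-filter step), look it up among reversed terms,
    then reverse the candidates back (post-filter step)."""
    return [c[::-1] for c in generate(term[::-1], reversed_terms, **kwargs)]


body_terms = ["god", "jewel", "the", "xorr"]    # hypothetical "body" terms
reverse_terms = [t[::-1] for t in body_terms]   # hypothetical "reverse" terms

print(generate("got", body_terms))                # typo at the end of the term
print(generate("yewel", body_terms))              # typo in the first character
print(generate_reversed("yewel", reverse_terms))  # the suffix acts as the prefix
```

A misspelling in the first character defeats the prefix requirement on the ordinary field, while on the reversed field the intact ending of the word serves as the prefix, so the candidate can still be found.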
pre_filter and post_filter can also be used to inject synonyms after candidates are generated. For instance, for the query captain usq we might generate the candidate usa for the term usq. Since usa is a synonym for america, this allows us to present captain america to the user if this phrase scores high enough.