Similarity module
A similarity (scoring / ranking model) defines how matching documents are scored. Similarity is per field, meaning that a different similarity can be defined for each field via the mapping.
Configuring a custom similarity is considered an expert feature, and the built-in similarities are most likely sufficient, as described in the similarity mapping parameter documentation.
Configuring a similarity
Most existing or custom similarities have configuration options, which can be set via the index settings as shown below. The index options can be provided when creating an index or updating index settings.
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
Here we configure the DFRSimilarity so that it can be referenced as my_similarity in mappings, as illustrated in the example below:
PUT /index/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "similarity": "my_similarity"
    }
  }
}
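As a rough illustration of how the two requests fit together, the following Python sketch builds both request bodies and checks that every similarity named in the mapping is either defined in the settings or built in. The helper names (`similarity_settings`, `field_mapping`, `undefined_similarities`) are hypothetical and not part of any Elasticsearch client, and the set of built-in names is an assumption that may vary by version. Sending the bodies to a cluster is left out.

```python
import json

# Assumption: similarities that can be referenced without custom settings.
BUILTIN = {"BM25", "classic", "boolean"}

def similarity_settings(name, options):
    """Build an index settings body that defines one named similarity."""
    return {"settings": {"index": {"similarity": {name: dict(options)}}}}

def field_mapping(field, similarity):
    """Build a mapping body that assigns a similarity to a text field."""
    return {"properties": {field: {"type": "text", "similarity": similarity}}}

def undefined_similarities(settings, mapping):
    """Return similarity names used in the mapping but defined nowhere."""
    defined = set(settings["settings"]["index"]["similarity"]) | BUILTIN
    used = {
        spec["similarity"]
        for spec in mapping["properties"].values()
        if "similarity" in spec
    }
    return sorted(used - defined)

settings = similarity_settings("my_similarity", {
    "type": "DFR",
    "basic_model": "g",
    "after_effect": "l",
    "normalization": "h2",
    "normalization.h2.c": "3.0",
})
mapping = field_mapping("title", "my_similarity")

print(json.dumps(settings, indent=2))
print(undefined_similarities(settings, mapping))
```

A check like this only catches naming mistakes before the requests are sent; Elasticsearch itself validates the option values.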
Available similarities
BM25 similarity (default)
TF/IDF based similarity that has built-in tf normalization and is supposed to work better for short fields (like names). See Okapi BM25 for more details. This similarity has the following options:
- k1: Controls non-linear term frequency normalization (saturation). The default value is 1.2.
- b: Controls to what degree document length normalizes tf values. The default value is 0.75.
- discount_overlaps: Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norms. By default this is true, meaning overlap tokens do not count when computing norms.
Type name: BM25
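To make the two tuning options more concrete, here is a minimal Python sketch of the term-frequency component of textbook Okapi BM25. It is not Lucene's exact implementation; the function name `bm25_tf` and the sample document lengths are illustrative assumptions, and the defaults used for the saturation parameter (1.2) and the length-normalization parameter (0.75) are the commonly cited ones.

```python
def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of textbook Okapi BM25 (illustrative only)."""
    # Length normalization: b = 0 disables it, b = 1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the result grows with freq but stays below k1 + 1.
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)

# Repeated occurrences of a term add less and less to the score.
for freq in (1, 2, 5, 50):
    print(freq, round(bm25_tf(freq, doc_len=10, avg_doc_len=10), 3))

# A shorter-than-average document scores higher for the same frequency.
print(bm25_tf(1, doc_len=5, avg_doc_len=10) > bm25_tf(1, doc_len=20, avg_doc_len=10))
```

Raising k1 delays saturation, so additional occurrences keep contributing for longer; lowering b reduces the penalty applied to long documents.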
Classic similarity
The classic similarity, which is based on the TF/IDF model. This similarity has the following option:
- discount_overlaps: Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norms. By default this is true, meaning overlap tokens do not count when computing norms.
Type name: classic
DFR similarity
Similarity that implements the divergence from randomness framework. This similarity has the following options:
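- basic_model: Possible values: be, d, g, if, in, ine and p.
- after_effect: Possible values: no, b and l.
- normalization: Possible values: no, h1, h2, h3 and z.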
All options but the first option need a normalization value.
Type name: DFR
DFI similarity
Similarity that implements the divergence from independence model. This similarity has the following options:
- independence_measure: Possible values: standardized, saturated and chisquared.
Type name: DFI
IB similarity
Information based model. The algorithm is based on the concept that the information content in any symbolic distribution sequence is primarily determined by the repetitive usage of its basic elements. For written texts this challenge would correspond to comparing the writing styles of different authors. This similarity has the following options:
- distribution: Possible values: ll and spl.
- lambda: Possible values: df and ttf.
- normalization: Same as in the DFR similarity.
Type name: IB
LM Dirichlet similarity
LM Dirichlet similarity. This similarity has the following options:
- mu: Defaults to 2000.
Type name: LMDirichlet
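For intuition about what the smoothing parameter does, the sketch below shows the textbook Dirichlet-smoothed term probability used in language-model retrieval. This is a simplified illustration, not Lucene's scoring code; the function name `dirichlet_prob`, the parameter name `mu`, and the sample numbers are assumptions made for the example.

```python
def dirichlet_prob(freq, doc_len, collection_prob, mu=2000.0):
    """Textbook Dirichlet-smoothed probability of a term in a document."""
    # The document's counts are blended with mu "pseudo-occurrences"
    # drawn from the collection-wide term distribution.
    return (freq + mu * collection_prob) / (doc_len + mu)

# A term that occurs 5 times in a 100-token document and has a
# collection-wide probability of 0.01 (made-up numbers).
for mu in (10.0, 2000.0, 100000.0):
    print(mu, round(dirichlet_prob(5, 100, 0.01, mu=mu), 5))

# A term that is absent from the document still gets a non-zero probability.
print(dirichlet_prob(0, 100, 0.01))
```

Small values of the parameter trust the document's own statistics; large values pull every document towards the collection average, which matters most for short documents.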
LM Jelinek Mercer similarity
LM Jelinek Mercer similarity. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:
- lambda: The optimal value depends on both the collection and the query. The optimal value is around 0.1 for title queries and 0.7 for long queries. Defaults to 0.1.
Type name: LMJelinekMercer
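The role of the interpolation parameter can be sketched with the textbook Jelinek-Mercer smoothed term probability. This is an illustrative simplification, not Lucene's scoring code; the function name `jelinek_mercer_prob`, the parameter name `lam`, and the sample numbers are assumptions.

```python
def jelinek_mercer_prob(freq, doc_len, collection_prob, lam=0.1):
    """Textbook Jelinek-Mercer smoothed probability of a term in a document."""
    # Linear interpolation: (1 - lam) weights the document model,
    # lam weights the collection-wide model.
    return (1.0 - lam) * (freq / doc_len) + lam * collection_prob

# A term that occurs 5 times in a 100-token document and has a
# collection-wide probability of 0.01 (made-up numbers).
for lam in (0.0, 0.1, 0.7, 1.0):
    print(lam, jelinek_mercer_prob(5, 100, 0.01, lam=lam))
```

With a small value the document's own term frequencies dominate; with a larger value the collection statistics carry more weight, which smooths out noisy counts in long queries.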
Scripted similarity
A similarity that allows you to use a script to specify how scores should be computed. For instance, the example below shows how to reimplement TF-IDF:
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}

PUT /index/_doc/1
{ "field": "foo bar foo" }

PUT /index/_doc/2
{ "field": "bar baz" }

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
Which yields:
{
  "took": 12,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": { "field": "foo bar foo" },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
              "details": [
                { "value": 1.0, "description": "weight", "details": [] },
                { "value": 1.7, "description": "query.boost", "details": [] },
                { "value": 2.0, "description": "field.docCount", "details": [] },
                { "value": 4.0, "description": "field.sumDocFreq", "details": [] },
                { "value": 5.0, "description": "field.sumTotalTermFreq", "details": [] },
                { "value": 1.0, "description": "term.docFreq", "details": [] },
                { "value": 2.0, "description": "term.totalTermFreq", "details": [] },
                { "value": 2.0, "description": "doc.freq", "details": [] },
                { "value": 3.0, "description": "doc.length", "details": [] }
              ]
            }
          ]
        }
      }
    ]
  }
}
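The _explanation in the response lists every statistic that was fed into the script, so the score can be checked by hand. The Python sketch below mirrors the Painless script line by line and plugs in the reported values (query.boost 1.7, doc.freq 2, doc.length 3, field.docCount 2, term.docFreq 1). The function name `scripted_tfidf` is illustrative only.

```python
import math

def scripted_tfidf(boost, freq, doc_length, doc_count, term_doc_freq):
    """Python mirror of the Painless TF-IDF script from the example."""
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (term_doc_freq + 1.0)) + 1.0
    norm = 1 / math.sqrt(doc_length)
    return boost * tf * idf * norm

# Values taken from the _explanation details of document 1.
score = scripted_tfidf(
    boost=1.7, freq=2.0, doc_length=3.0, doc_count=2.0, term_doc_freq=1.0
)
print(score)
```

The result agrees with the reported _score of 1.9508477 up to single-precision rounding.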
You might have noticed that a significant part of the script depends on statistics that are the same for every document. It is possible to make the above slightly more efficient by providing a weight_script, which computes the document-independent part of the score and makes it available under the weight variable. When no weight_script is provided, weight is equal to 1. The weight_script has access to the same variables as the script except doc, since it is supposed to compute a document-independent contribution to the score.
The below configuration will give the same tf-idf scores but is slightly more efficient:
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}
Type name: scripted
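To see why splitting the computation does not change the result, the following Python sketch mirrors both variants: a single script, and a weight_script plus script pair in which the weight is computed once per term and reused for every document. The function names and the sample documents are hypothetical; only the formulas come from the configurations above.

```python
import math

def single_script(boost, doc_count, term_doc_freq, freq, doc_length):
    """All-in-one variant: idf is recomputed for every document."""
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (term_doc_freq + 1.0)) + 1.0
    norm = 1 / math.sqrt(doc_length)
    return boost * tf * idf * norm

def weight_script(boost, doc_count, term_doc_freq):
    """Document-independent part, computed once per query term."""
    idf = math.log((doc_count + 1.0) / (term_doc_freq + 1.0)) + 1.0
    return boost * idf

def script(weight, freq, doc_length):
    """Document-dependent part, reusing the precomputed weight."""
    tf = math.sqrt(freq)
    norm = 1 / math.sqrt(doc_length)
    return weight * tf * norm

# Made-up documents as (freq, doc_length) pairs for one query term.
docs = [(2.0, 3.0), (1.0, 8.0), (4.0, 20.0)]
weight = weight_script(boost=1.7, doc_count=2.0, term_doc_freq=1.0)
for freq, length in docs:
    split = script(weight, freq, length)
    whole = single_script(1.7, 2.0, 1.0, freq, length)
    print(round(split, 6), round(whole, 6))
```

The saving comes from evaluating the logarithm once per term instead of once per matching document.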
Default Similarity
By default, Elasticsearch will use whatever similarity is configured as default.
You can change the default similarity for all fields in an index when it is created:
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }
}
If you want to change the default similarity after creating the index, you must close the index, send the following request, and open it again afterwards:
POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "boolean"
      }
    }
  }
}

POST /index/_open