Similarity module

A similarity (scoring/ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field.
Configuring a custom similarity is considered an expert feature; the built-in similarities, described under the similarity mapping parameter, are most likely sufficient.
Configuring a similarity

Most existing or custom similarities have configuration options which can be set via the index settings, as shown below. The index options can be provided when creating an index or when updating index settings.
PUT /index { "settings" : { "index" : { "similarity" : { "my_similarity" : { "type" : "DFR", "basic_model" : "g", "after_effect" : "l", "normalization" : "h2", "normalization.h2.c" : "3.0" } } } } }
Here we configure the DFR similarity so it can be referenced as my_similarity in mappings, as illustrated in the example below:
PUT /index/_mapping/book
{
  "properties": {
    "title": {
      "type": "text",
      "similarity": "my_similarity"
    }
  }
}
Available similarities

BM25 similarity (default)

TF/IDF-based similarity that has built-in tf normalization and is supposed to work better for short fields (like names). See Okapi_BM25 for more details. This similarity has the following options:

- k1 - Controls non-linear term frequency normalization (saturation). The default value is 1.2.
- b - Controls to what degree document length normalizes tf values. The default value is 0.75.
- discount_overlaps - Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norms. By default this is true, meaning overlap tokens do not count when computing norms.
Type name: BM25
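For example, a BM25 similarity with custom k1 and b values could be registered like this (a minimal sketch; my_bm25 and the chosen values are illustrative, not recommendations):

PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "k1": 1.4,
          "b": 0.9
        }
      }
    }
  }
}

Fields can then reference my_bm25 in their mappings, exactly as shown for my_similarity above.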
Classic similarity

The classic similarity that is based on the TF/IDF model. This similarity has the following option:

- discount_overlaps - Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norms. By default this is true, meaning overlap tokens do not count when computing norms.
Type name: classic
DFR similarity

Similarity that implements the divergence from randomness framework. This similarity has the following options:

- basic_model - Possible values: be, d, g, if, in, ine and p.
- after_effect - Possible values: no, b and l.
- normalization - Possible values: no, h1, h2, h3 and z.

All normalization values but the first (no) require a normalization parameter, for example normalization.h2.c in the example above.
Type name: DFR
DFI similarity

Similarity that implements the divergence from independence model. This similarity has the following option:

- independence_measure - Possible values: standardized, saturated, chisquared.
Type name: DFI
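A minimal configuration sketch, assuming an illustrative name my_dfi:

PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_dfi": {
          "type": "DFI",
          "independence_measure": "standardized"
        }
      }
    }
  }
}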
IB similarity

Information based model. The algorithm is based on the concept that the information content in any symbolic distribution sequence is primarily determined by the repetitive usage of its basic elements. For written texts this would correspond to comparing the writing styles of different authors. This similarity has the following options:

- distribution - Possible values: ll and spl.
- lambda - Possible values: df and ttf.
- normalization - Same as in the DFR similarity.
Type name: IB
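Following the same pattern as the DFR example above, an IB similarity might be configured like this (my_ib and the option values are illustrative):

PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_ib": {
          "type": "IB",
          "distribution": "ll",
          "lambda": "df",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}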
LM Dirichlet similarity

This similarity has the following option:

- mu - Defaults to 2000.
Type name: LMDirichlet
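A sketch of overriding mu (my_lm_dirichlet and the value are illustrative):

PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_dirichlet": {
          "type": "LMDirichlet",
          "mu": 1000
        }
      }
    }
  }
}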
LM Jelinek Mercer similarity

The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following option:

- lambda - The optimal value depends on both the collection and the query. The optimal value is around 0.1 for title queries and 0.7 for long queries. Defaults to 0.1. When the value approaches 0, documents that match more query terms will be ranked higher than those that match fewer terms.
Type name: LMJelinekMercer
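For instance, a similarity tuned for long queries might set lambda to 0.7 (my_lm_jm is an illustrative name):

PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_jm": {
          "type": "LMJelinekMercer",
          "lambda": 0.7
        }
      }
    }
  }
}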
Scripted similarity

A similarity that allows you to use a script in order to specify how scores should be computed. For instance, the example below shows how to reimplement TF-IDF:
PUT /index { "settings": { "number_of_shards": 1, "similarity": { "scripted_tfidf": { "type": "scripted", "script": { "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;" } } } }, "mappings": { "doc": { "properties": { "field": { "type": "text", "similarity": "scripted_tfidf" } } } } } PUT /index/doc/1 { "field": "foo bar foo" } PUT /index/doc/2 { "field": "bar baz" } POST /index/_refresh GET /index/_search?explain=true { "query": { "query_string": { "query": "foo^1.7", "default_field": "field" } } }
Which yields:
{ "took": 12, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.9508477, "hits": [ { "_shard": "[index][0]", "_node": "OzrdjxNtQGaqs4DmioFw9A", "_index": "index", "_type": "doc", "_id": "1", "_score": 1.9508477, "_source": { "field": "foo bar foo" }, "_explanation": { "value": 1.9508477, "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 1.9508477, "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:", "details": [ { "value": 1.0, "description": "weight", "details": [] }, { "value": 1.7, "description": "query.boost", "details": [] }, { "value": 2.0, "description": "field.docCount", "details": [] }, { "value": 4.0, "description": "field.sumDocFreq", "details": [] }, { "value": 5.0, "description": "field.sumTotalTermFreq", "details": [] }, { "value": 1.0, "description": "term.docFreq", "details": [] }, { "value": 2.0, "description": "term.totalTermFreq", "details": [] }, { "value": 2.0, "description": "doc.freq", "details": [] }, { "value": 3.0, "description": "doc.length", "details": [] } ] } ] } } ] } }
You might have noticed that a significant part of the script depends on statistics that are the same for every document. It is possible to make the above slightly more efficient by providing a weight_script which will compute the document-independent part of the score and will be available under the weight variable. When no weight_script is provided, weight is equal to 1. The weight_script has access to the same variables as the script except doc, since it is supposed to compute a document-independent contribution to the score.
The configuration below will give the same tf-idf scores but is slightly more efficient:
PUT /index { "settings": { "number_of_shards": 1, "similarity": { "scripted_tfidf": { "type": "scripted", "weight_script": { "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;" }, "script": { "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;" } } } }, "mappings": { "doc": { "properties": { "field": { "type": "text", "similarity": "scripted_tfidf" } } } } }
Type name: scripted
Default Similarity

By default, Elasticsearch will use whatever similarity is configured as default.
You can change the default similarity for all fields in an index when it is created:
PUT /index { "settings": { "index": { "similarity": { "default": { "type": "classic" } } } } }
If you want to change the default similarity after creating the index, you must close your index, send the following request, and open it again afterwards:
POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "classic"
      }
    }
  }
}

POST /index/_open