WARNING: Version 2.2 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
More Like This Query
The More Like This Query (MLT Query) finds documents that are "like" a given
set of documents. To do so, MLT selects a set of representative terms
from these input documents, forms a query using these terms, executes the query,
and returns the results. The user controls the input documents, how the terms
are selected, and how the query is formed. more_like_this can be shortened to mlt.
The simplest use case consists of asking for documents that are similar to a provided piece of text. Here, we are asking for all movies that have some text similar to "Once upon a time" in their "title" and in their "description" fields, limiting the number of selected terms to 12.
{
  "more_like_this" : {
    "fields" : ["title", "description"],
    "like" : "Once upon a time",
    "min_term_freq" : 1,
    "max_query_terms" : 12
  }
}
A more complicated use case consists of mixing texts with documents already existing in the index. In this case, the syntax to specify a document is similar to the one used in the Multi GET API.
{
  "more_like_this" : {
    "fields" : ["title", "description"],
    "like" : [
      {
        "_index" : "imdb",
        "_type" : "movies",
        "_id" : "1"
      },
      {
        "_index" : "imdb",
        "_type" : "movies",
        "_id" : "2"
      },
      "and potentially some more text here as well"
    ],
    "min_term_freq" : 1,
    "max_query_terms" : 12
  }
}
Finally, users can mix free text, a chosen set of indexed documents, and documents not necessarily present in the index. The syntax for providing documents not present in the index is similar to that of artificial documents.
{
  "more_like_this" : {
    "fields" : ["name.first", "name.last"],
    "like" : [
      {
        "_index" : "marvel",
        "_type" : "quotes",
        "doc" : {
          "name": {
            "first": "Ben",
            "last": "Grimm"
          },
          "tweet": "You got no idea what I'd... what I'd give to be invisible."
        }
      },
      {
        "_index" : "marvel",
        "_type" : "quotes",
        "_id" : "2"
      }
    ],
    "min_term_freq" : 1,
    "max_query_terms" : 12
  }
}
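When such requests are built programmatically, it can help to keep the three kinds of "like" input (free text, indexed documents, artificial documents) apart. The following is a minimal Python sketch that only assembles the request body as a dictionary; the helper names indexed_doc, artificial_doc, and mlt_query are hypothetical and not part of any Elasticsearch client.

```python
import json


def indexed_doc(index, doc_type, doc_id):
    """Reference a document that already exists in the index (hypothetical helper)."""
    return {"_index": index, "_type": doc_type, "_id": doc_id}


def artificial_doc(index, doc_type, source):
    """Describe a document that is not in the index by inlining its content."""
    return {"_index": index, "_type": doc_type, "doc": source}


def mlt_query(fields, like, **params):
    """Assemble a more_like_this body; `like` may be a string or a list of inputs."""
    body = {"fields": list(fields), "like": like}
    body.update(params)  # e.g. min_term_freq, max_query_terms
    return {"more_like_this": body}


query = mlt_query(
    ["name.first", "name.last"],
    [
        artificial_doc("marvel", "quotes", {"name": {"first": "Ben", "last": "Grimm"}}),
        indexed_doc("marvel", "quotes", "2"),
        "and potentially some more text here as well",
    ],
    min_term_freq=1,
    max_query_terms=12,
)

# Serialize to the JSON that would be sent as the query part of a search request.
print(json.dumps(query, indent=2))
```

The sketch performs no validation and sends nothing to a cluster; it only shows how the mixed "like" list is structured.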
How it Works
Suppose we wanted to find all documents similar to a given input document.
Obviously, the input document itself should be its best match for that type of
query. According to the Lucene scoring formula, this is mostly
due to the terms with the highest tf-idf. Therefore, the terms of the input
document that have the highest tf-idf are good representatives of that
document, and could be used within a disjunctive query (an OR query) to retrieve similar
documents. The MLT query simply extracts the text from the input document,
analyzes it, usually using the same analyzer as the field, then selects the
top K terms with the highest tf-idf to form a disjunctive query of these terms.
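The idea can be illustrated with a small, self-contained Python sketch. It is not the Lucene implementation: the whitespace tokenizer stands in for a real analyzer, the smoothed idf is a textbook variant rather than Lucene's exact formula, and the function names select_terms and disjunctive_query are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def select_terms(input_text, corpus, k=3):
    """Return the k terms of input_text with the highest tf-idf over corpus."""
    term_freqs = Counter(tokenize(input_text))
    doc_term_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, tf in term_freqs.items():
        df = sum(1 for terms in doc_term_sets if term in terms)
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0  # smoothed idf (assumption)
        scores[term] = tf * idf
    # Highest score first; ties are broken alphabetically to keep the result stable.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def disjunctive_query(field, terms):
    """Combine the selected terms into an OR query (a bool query with should clauses)."""
    return {"bool": {"should": [{"term": {field: t}} for t in terms]}}


corpus = [
    "once upon a time in the west",
    "once upon a time in america",
    "a time to kill",
    "the good the bad and the ugly",
]
terms = select_terms(corpus[0], corpus, k=3)
print(terms)
print(disjunctive_query("title", terms))
```

In this toy corpus the rarest term of the input document ranks first, while terms shared by most documents fall to the bottom and are dropped once K is reached.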
The fields on which to perform MLT must be indexed and of type string.
Additionally, when using like with documents, either _source must be enabled
or the fields must be stored or have term_vector enabled. To speed up
analysis, it could help to store term vectors at index time.
For example, if we wish to perform MLT on the "title" and "tags.raw" fields,
we can explicitly store their term_vector at index time. We can still
perform MLT on the "description" and "tags" fields, as _source is enabled by
default, but there will be no speed-up on analysis for these fields.
curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields" : {
            "raw": {
              "type" : "string",
              "index" : "not_analyzed",
              "term_vector" : "yes"
            }
          }
        }
      }
    }
  }
}'
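To check which fields of such a mapping would benefit from stored term vectors, one could walk the mapping and collect every field whose term_vector setting is not "no". The Python sketch below does this for the mapping above; the function name fields_with_term_vectors is hypothetical, and the sketch only follows multi-fields ("fields"), not nested object properties.

```python
mapping = {
    "mappings": {
        "movies": {
            "properties": {
                "title": {"type": "string", "term_vector": "yes"},
                "description": {"type": "string"},
                "tags": {
                    "type": "string",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "term_vector": "yes",
                        }
                    },
                },
            }
        }
    }
}


def fields_with_term_vectors(properties, prefix=""):
    """List the full paths of fields (and multi-fields) that store term vectors."""
    found = []
    for name, spec in properties.items():
        path = prefix + name
        # A field without an explicit term_vector setting is treated as "no".
        if spec.get("term_vector", "no") != "no":
            found.append(path)
        # Multi-fields such as tags.raw are declared under "fields".
        found.extend(fields_with_term_vectors(spec.get("fields", {}), path + "."))
    return found


print(fields_with_term_vectors(mapping["mappings"]["movies"]["properties"]))
```

For this mapping the result contains "title" and "tags.raw", matching the two fields described above as having term vectors stored at index time.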
Parameters
The only required parameter is like; all other parameters have sensible
defaults. There are three types of parameters: one to specify the document
input, one for term selection, and one for query formation.
Document Input Parameters
|
The only required parameter of the MLT query is |
|
The |
|
A list of fields to fetch and analyze the text from. Defaults to the |
|
[2.0.0-beta1]
Deprecated in 2.0.0-beta1. Replaced by |
|
[2.0.0-beta1]
Deprecated in 2.0.0-beta1. Replaced by |
Term Selection Parameters
|
The maximum number of query terms that will be selected. Increasing this value
gives greater accuracy at the expense of query execution speed. Defaults to
|
|
The minimum term frequency below which the terms will be ignored from the
input document. Defaults to |
|
The minimum document frequency below which the terms will be ignored from the
input document. Defaults to |
|
The maximum document frequency above which the terms will be ignored from the
input document. This could be useful in order to ignore highly frequent words
such as stop words. Defaults to unbounded ( |
|
The minimum word length below which the terms will be ignored. The old name
|
|
The maximum word length above which the terms will be ignored. The old name
|
|
An array of stop words. Any word in this set is considered "uninteresting" and ignored. If the analyzer allows for stop words, you might want to tell MLT to explicitly ignore them, as for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting". |
|
The analyzer that is used to analyze the free form text. Defaults to the
analyzer associated with the first field in |
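How these term selection filters interact can be sketched in Python as a single predicate applied to each candidate term. The keyword names below follow the parameter names of the MLT query as commonly documented (min_term_freq, min_doc_freq, max_doc_freq, min_word_length, max_word_length, stop_words), but they are assumptions here, since the names are not legible in the table above. The default values in the sketch are illustrative only and are not Elasticsearch's defaults, and None is used as a hypothetical marker for "unbounded".

```python
def keep_term(term, term_freq, doc_freq, *, min_term_freq=1, min_doc_freq=1,
              max_doc_freq=None, min_word_length=0, max_word_length=None,
              stop_words=frozenset()):
    """Return True if a candidate term passes every term selection filter."""
    if term in stop_words:
        return False  # explicitly uninteresting
    if term_freq < min_term_freq:
        return False  # too rare in the input document
    if doc_freq < min_doc_freq:
        return False  # too rare in the index
    if max_doc_freq is not None and doc_freq > max_doc_freq:
        return False  # too common in the index, e.g. a stop word
    if len(term) < min_word_length:
        return False  # too short
    if max_word_length is not None and len(term) > max_word_length:
        return False  # too long
    return True


# Candidate terms mapped to (term frequency in the input, document frequency in the index).
candidates = {"the": (5, 900), "time": (2, 40), "xy": (3, 12), "upon": (1, 25)}

selected = [
    term
    for term, (tf, df) in candidates.items()
    if keep_term(term, tf, df, min_term_freq=2, max_doc_freq=500, min_word_length=3)
]
print(selected)
```

Only the terms that survive all filters become candidates for the top terms by tf-idf, of which at most the configured maximum number of query terms is kept.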
Query Formation Parameters
|
After the disjunctive query has been formed, this parameter controls the
number of terms that must match.
The syntax is the same as the minimum should match.
(Defaults to |
|
Each term in the formed query could be further boosted by their tf-idf score.
This sets the boost factor to use when using this feature. Defaults to
deactivated ( |
|
Specifies whether the input documents should also be included in the search
results returned. Defaults to |
|
Sets the boost value of the whole query. Defaults to |
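To make the effect of the minimum should match setting concrete, the Python sketch below computes how many of the selected terms must match for a given specification. It covers only two forms of the syntax, a non-negative integer and a non-negative percentage (rounded down), and is a simplification under those assumptions; the function name required_matches and the example values are hypothetical.

```python
def required_matches(n_terms, spec):
    """Number of terms that must match, for an integer or percentage spec.

    Only non-negative integers (e.g. 3) and non-negative percentages
    (e.g. "50%") are handled; other forms of the syntax are out of scope.
    """
    if isinstance(spec, str) and spec.endswith("%"):
        needed = n_terms * int(spec[:-1]) // 100  # percentage, rounded down
    else:
        needed = int(spec)
    if needed < 0:
        raise ValueError("negative specifications are not covered by this sketch")
    # Never require more terms than the query contains.
    return min(n_terms, needed)


for n_terms, spec in [(12, "50%"), (5, "50%"), (12, 3), (2, 5)]:
    print(n_terms, spec, required_matches(n_terms, spec))
```

A higher requirement makes the query stricter: fewer documents match, but each match shares more of the selected terms with the input.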