More Like This Query

edit

More like this query find documents that are "like" provided text by running it against one or more fields.

{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "like_text" : "text like this one",
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}

Additionally, More Like This can find documents that are "like" a set of chosen documents. The syntax to specify one or more documents is similar to the Multi GET API, and supports the ids or docs array. If only one document is specified, the query behaves the same as the More Like This API.

{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "docs" : [
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "1"
        },
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "2"
        }
        ],
        "ids" : ["3", "4"],
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}

more_like_this can be shortened to mlt.

Under the hood, more_like_this simply creates multiple should clauses in a bool query of interesting terms extracted from some provided text. The interesting terms are selected with respect to their tf-idf scores. These are controlled by min_term_freq, min_doc_freq, and max_doc_freq. The number of interesting terms is controlled by max_query_terms. While the minimum number of clauses that must be satisfied is controlled by percent_terms_to_match. The terms are extracted from like_text which is analyzed by the analyzer associated with the field, unless specified by analyzer. There are other parameters, such as min_word_length, max_word_length or stop_words, to control what terms should be considered as interesting. In order to give more weight to more interesting terms, each boolean clause associated with a term could be boosted by the term tf-idf score times some boosting factor boost_terms. When a search for multiple docs is issued, More Like This generates a more_like_this query per document field in fields. These fields are specified as a top level parameter or within each doc.

The fields must be indexed and of type string. Additionally, when using ids or docs, the fields must be either stored, store term_vector or _source must be enabled.

The more_like_this top level parameters include:

Parameter Description

fields

A list of the fields to run the more like this query against. Defaults to the _all field for like_text and to all possible fields for ids or docs.

like_text

The text to find documents like it, required if ids or docs are not specified.

ids or docs

A list of documents following the same syntax as the Multi GET API. The text is fetched from fields unless specified otherwise in each doc.

include

When using ids or docs, specifies whether the documents should be included from the search. Defaults to false.

exclude

When using ids or docs, specifies whether the documents should be excluded from the search. Defaults to true.

percent_terms_to_match

From the generated query, the percentage of terms that must match (float value between 0 and 1). Defaults to 0.3 (30 percent).

min_term_freq

The frequency below which terms will be ignored in the source doc. The default frequency is 2.

max_query_terms

The maximum number of query terms that will be included in any generated query. Defaults to 25.

stop_words

An array of stop words. Any word in this set is considered "uninteresting" and ignored. Even if your Analyzer allows stopwords, you might want to tell the MoreLikeThis code to ignore them, as for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".

min_doc_freq

The frequency at which words will be ignored which do not occur in at least this many docs. Defaults to 5.

max_doc_freq

The maximum frequency in which words may still appear. Words that appear in more than this many docs will be ignored. Defaults to unbounded.

min_word_length

The minimum word length below which words will be ignored. Defaults to 0.(Old name "min_word_len" is deprecated)

max_word_length

The maximum word length above which words will be ignored. Defaults to unbounded (0). (Old name "max_word_len" is deprecated)

boost_terms

Sets the boost factor to use when boosting terms. Defaults to deactivated (0). Any other value activates boosting with given boost factor.

boost

Sets the boost value of the query. Defaults to 1.0.

analyzer

The analyzer that will be used to analyze the like text. Defaults to the analyzer associated with the first field in fields.