Multi-Field Search Just Got Better
The match query is the go-to query for matching on a single field. It understands the field mapping and uses the appropriate analyzer for the field. It can match any word or require all words (with operator set to "or" or "and"), or it can match a minimum number or percentage of words with minimum_should_match. It can do fuzzy matching and phrase or proximity matching. In short, it is very flexible and very powerful.
Multi-field search, on the other hand, is hard. Elasticsearch provides the multi_match query, which makes multi-field search look simple:
{
  "multi_match": {
    "query": "quick brown fox",
    "fields": [ "title", "body" ]
  }
}
But in reality, it is not as simple as it looks. Unless you understand how the multi_match query works, you will often use it incorrectly and get suboptimal results. Elasticsearch v1.1.0 added some new features to the multi_match query which make multi-field search much more powerful and easier to use.
Types of multi-field search
How you search across multiple fields depends on how your data is indexed and the type of search that you need. There are three main scenarios:
Best matching field
When searching across multiple fields for a single “concept”, you want to look for as many words as possible within the same field. For instance, “brown fox” in a single field is more meaningful than “brown” in one field and “fox” in the other. In other words, you’re looking for the single best matching field.
This type of query can be executed by running a match query against each field, and choosing the relevance _score from the best matching field, using the dis_max query:
{
  "dis_max": {
    "queries": [
      { "match": { "title": "quick brown fox" }},
      { "match": { "body": "quick brown fox" }}
    ]
  }
}
The multi_match query accepts a type parameter which tells it how to execute the query. The default type is best_fields, which results in exactly the same dis_max query as we have above:
{
  "multi_match": {
    "query": "quick brown fox",
    "fields": [ "title", "body" ],
    "type": "best_fields" # default
  }
}
This query, as written above, will choose the single best matching field, but will ignore other lesser matches. We can still take these secondary matches into account by specifying the tie_breaker parameter:
{
  "multi_match": {
    "query": "quick brown fox",
    "fields": [ "title", "body" ],
    "type": "best_fields",
    "tie_breaker": 0.2
  }
}
The above query will still use the _score from the best matching field, but will also add in the _score from any other matching fields, multiplied by 0.2.
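To make the arithmetic concrete, here is a minimal Python sketch of how the per-field scores are combined as described above. The function name dis_max_score and the example scores are invented for illustration; this mirrors the description in the text, not the actual Lucene implementation.

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    """Best field's score, plus tie_breaker times the other fields' scores."""
    scores = list(field_scores)
    if not scores:
        return 0.0
    best = max(scores)
    others = sum(scores) - best  # everything except the best matching field
    return best + tie_breaker * others


# Hypothetical per-field scores for one document.
scores = {"title": 1.5, "body": 0.5}

best_only = dis_max_score(scores.values())                       # tie_breaker = 0
with_tie_breaker = dis_max_score(scores.values(), tie_breaker=0.2)
```

With a tie_breaker of 0 only the title score counts. With 0.2, the body score contributes a fifth of its value on top of the best field's score.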
Most matching fields
Often we index the same text with several different analyzers, perhaps as stemmed and unstemmed, with synonyms, with shingles for proximity matching, with edge-ngrams for autocomplete, and so on. In this case, we want to query all of the fields and add up the _score from each match to find the documents with the most matching fields.
We could write such a query by wrapping individual match clauses with a bool query:
{
  "bool": {
    "should": [
      { "match": { "title": "quick brown fox" }},
      { "match": { "title.stemmed": "quick brown fox" }},
      { "match": { "title.synonym": "quick brown fox" }},
      { "match": { "title.shingle": "quick brown fox" }},
      { "match": { "title.edge_ng": "quick brown fox" }}
    ]
  }
}
This is the same query that would be executed by the multi_match query when the type parameter is set to most_fields:
{
  "multi_match": {
    "query": "quick brown fox",
    "fields": [ "title", "title.*" ],
    "type": "most_fields"
  }
}
You can give extra “weight” to one or more fields by specifying a boost on that field, using the caret (^) syntax:
{
  "multi_match": {
    "query": "quick brown fox",
    "fields": [ "title^2", "title.*" ],
    "type": "most_fields"
  }
}
In the above query, the title field is twice as important as the other fields.
Cross field matching
Finally, we often need to search for entities whose data is spread across multiple fields, such as when we search for "John Smith" in the first_name and last_name fields of a user object. In this case, we want to find as many individual words as possible in any field. The most_fields approach may appear to be the answer here, but there are several reasons why it will not give good results.
Both best_fields and most_fields are field-centric queries: they match each field separately. This means that:
- The operator and minimum_should_match parameters would apply to each field, rather than to each word in any field. Requiring both John and Smith with the and operator would never match any documents, because they never occur in the same field.
- With the most_fields approach, if the same word appears in multiple fields, it will be counted multiple times, instead of just being counted once.
- Term frequencies in each field are different. Imagine we had a user whose name was “Smith Jones”. Smith as a last name is very common, but as a first name is very uncommon. A most_fields query for “Peter Smith” may well return the “Smith Jones” user as the first result, as the high weight of Smith-as-a-first-name trumps all documents with Smith-as-a-last-name.
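The first point can be illustrated with a toy matcher in Python. This is only a sketch: the function names are made up, and lowercasing plus whitespace splitting stands in for a real analyzer. The first function applies the and operator per field, as a field-centric query does; the second applies it per word across all fields, for contrast.

```python
def tokens(text):
    # Stand-in for an analyzer: lowercase, then split on whitespace.
    return text.lower().split()


def field_centric_and(doc, terms, fields):
    # Field-centric: all terms must occur together in one single field.
    return any(all(t in tokens(doc[f]) for t in terms) for f in fields)


def term_centric_and(doc, terms, fields):
    # Word-centric: each term must occur in at least one of the fields.
    return all(any(t in tokens(doc[f]) for f in fields) for t in terms)


user = {"first_name": "John", "last_name": "Smith"}
fields = ["first_name", "last_name"]
terms = ["john", "smith"]

field_centric_hit = field_centric_and(user, terms, fields)  # no single field holds both words
term_centric_hit = term_centric_and(user, terms, fields)    # each word is found somewhere
```

The field-centric check fails for this user even though both words are present, because neither first_name nor last_name contains both of them.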
One solution to this problem is just to index the data from first_name and last_name into the single field full_name, which we can do automatically with this mapping:
{
  "first_name": { "type": "string", "copy_to": "full_name" },
  "last_name": { "type": "string", "copy_to": "full_name" },
  "full_name": { "type": "string" }
}
Then we can just query the full_name field with a simple match query. That said, it is often useful to be able to achieve the same thing across multiple fields. Elasticsearch v1.1.0 added the new word-centric cross_fields execution type which allows you to do just that:
{
  "multi_match": {
    "query": "Peter Smith",
    "fields": [ "first_name", "last_name" ],
    "type": "cross_fields"
  }
}
The cross_fields approach first analyzes the query string into individual terms, then it looks for each term in any field, much like this:
{
  "bool": {
    "should": [
      { "dis_max": {
        "queries": [
          { "term": { "first_name": "peter" }},
          { "term": { "last_name": "peter" }}
        ]}},
      { "dis_max": {
        "queries": [
          { "term": { "first_name": "smith" }},
          { "term": { "last_name": "smith" }}
        ]}}
    ]
  }
}
The operator and minimum_should_match parameters would work as you expect, as each word is queried (and so can be counted) separately. But this still leaves the problem of term frequencies. In the above query, Smith-as-a-first-name would still score higher than Smith-as-a-last-name.
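The word-centric shape of that bool query is easy to generate mechanically: one dis_max clause per term, each containing one term query per field. The helper below is a hypothetical sketch (the name term_centric_query is ours, and it assumes the query string has already been analyzed into terms); it only reproduces the approximation shown above.

```python
def term_centric_query(terms, fields):
    """Build one dis_max clause per term, each trying that term in every field."""
    return {
        "bool": {
            "should": [
                {"dis_max": {"queries": [{"term": {field: term}} for field in fields]}}
                for term in terms
            ]
        }
    }


# Terms as they might come out of analysis of "Peter Smith".
query = term_centric_query(["peter", "smith"], ["first_name", "last_name"])
```

Because every term gets its own clause, a per-word and operator or minimum_should_match has something to count.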
In fact, the cross_fields approach doesn’t use dis_max queries. Instead it uses a special blended query which combines the term frequency of Smith-as-a-first-name with the term frequency of Smith-as-a-last-name and uses that value for both fields. In other words, it treats first_name and last_name as if they were one big field.
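A toy calculation shows why blending helps. The sketch below rests on several assumptions: the document frequencies are invented, the IDF formula is a simplified classic TF/IDF shape, and blending is modelled as taking the largest per-field document frequency. The text above does not spell out how the frequencies are combined, so treat the max as a stand-in.

```python
import math


def idf(doc_freq, doc_count):
    # Simplified inverse document frequency: rarer terms get a higher weight.
    return 1.0 + math.log(doc_count / (doc_freq + 1.0))


def blended_doc_freq(doc_freqs):
    # Assumption for illustration: use the largest per-field frequency everywhere.
    return max(doc_freqs.values())


doc_count = 1000
# Invented numbers: "smith" is rare as a first name, common as a last name.
smith_doc_freq = {"first_name": 2, "last_name": 300}

per_field_idf = {f: idf(df, doc_count) for f, df in smith_doc_freq.items()}
blended_idf = idf(blended_doc_freq(smith_doc_freq), doc_count)
```

Without blending, smith in first_name gets a much higher weight than smith in last_name. With the blended value, both fields use the same weight, so a rare Smith-as-a-first-name no longer dominates.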
It has certain advantages over the one-big-field approach:
- It is a search-time solution rather than having to be set up at index time.
- The index will be smaller without the copy_to field.
- Individual fields can be boosted, which can’t be done with the copy_to field.
- Each field preserves its own length-norm, which gives more weight to shorter fields like the title field.
Note about analysis
All fields used in a cross_fields query should use the same analyzer so that they all produce the same list of query terms. If fields with different analyzers are queried, then they will be grouped together by analyzer. Each group will be queried with the cross_fields approach, then the scores from all groups will be combined with a bool query.
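The grouping step can be pictured with a few lines of Python. This is a sketch only: the field names, the analyzer names, and the function group_fields_by_analyzer are hypothetical, and the field-to-analyzer lookup is assumed to be available as a plain dictionary.

```python
from collections import defaultdict


def group_fields_by_analyzer(field_analyzers):
    """Group field names by the analyzer each one is mapped to."""
    groups = defaultdict(list)
    for field, analyzer in field_analyzers.items():
        groups[analyzer].append(field)
    return dict(groups)


# Hypothetical mapping of fields to analyzers.
field_analyzers = {
    "first_name": "standard",
    "last_name": "standard",
    "first_name.edge": "autocomplete",
}

groups = group_fields_by_analyzer(field_analyzers)
```

Here first_name and last_name would be blended together as one group, while first_name.edge would form a group of its own, so a single word could match in either group independently.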
Alternatively, you can force the same analyzer across all fields by specifying an analyzer in the query:
{
  "multi_match": {
    "query": "Quick brown fox",
    "fields": [ "title", "body" ],
    "type": "cross_fields",
    "analyzer": "standard"
  }
}
Conclusion
The cross_fields feature is a really important addition to Elasticsearch. It adds functionality that was impossible to replicate on the client side. You can read more about this topic in the Multi-field search chapter in our upcoming book: The Definitive Guide to Elasticsearch.