Phrase Matching

edit

In the same way that the match query is the go-to query for standard full-text search, the match_phrase query is the one you should reach for when you want to find words that are near each other:

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}

Like the match query, the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same positions relative to each other. A query for the phrase quick fox would not match any of our documents, because no document contains the word quick immediately followed by fox.

The match_phrase query can also be written as a match query with type phrase:

"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}

Term Positions

edit

When a string is analyzed, the analyzer returns not only a list of terms, but also the position, or order, of each term in the original string:

GET /_analyze?analyzer=standard
Quick brown fox

This returns the following:

{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1 
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 
      }
   ]
}

The position of each term in the original string.

Positions can be stored in the inverted index, and position-aware queries like the match_phrase query can use them to match only documents that contain all the words in exactly the order specified, with no words in-between.

What Is a Phrase

edit

For a document to be considered a match for the phrase “quick brown fox,” the following must be true:

  • quick, brown, and fox must all appear in the field.
  • The position of brown must be 1 greater than the position of quick.
  • The position of fox must be 2 greater than the position of quick.

If any of these conditions is not met, the document is not considered a match.

Internally, the match_phrase query uses the low-level span query family to do position-aware matching. Span queries are term-level queries, so they have no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use the span queries directly, as the match_phrase query is usually good enough. However, certain specialized fields, like patent searches, use these low-level queries to perform very specific, carefully constructed positional searches.