Distance Feature Query: Time and Geo in Elasticsearch Result Ranking
In Elasticsearch, you will soon be able to use proximity (geographical or time) as part of your ranking score. The new way for combining proximity within the ranking score is easy to configure, performs well, and will likely yield superior ranking in many scenarios. This blog explains why and how.
Why use geographical distance and time for ranking?
Consider a search for a restaurant. One typically uses a criteria, including type, price range, rating, as well as physical proximity. All other things kept equal, you would prefer the nearest restaurant. This is not unique — it would be the same in countless other relevance ranking scenarios e.g. when searching for a job, a date, a plumber, a dog-walker, a babysitter, a tourist attraction. In fact, relevance ranking for anything in which physical location matters — which is almost anything that exists in the physical world — would benefit from relating to geographical distance.
A similar situation occurs with time related search, e.g. when searching for an academic article, a newspaper article, a blogpost or a product review, one will typically prefer recent records over less recent records (assuming all other aspects are equal). As with physical proximity in search for things that have physical location in the real world, recency would be the typical preference in searches for information. In other words, almost any relevance ranking criteria would benefit from the inclusion of proximity, either date proximity or geographic proximity.
We have frequently seen date and geographical proximity used in either of the following ways:
- As a filter, i.e. dichotomic criteria, e.g. retrieve only records nearer than x or within the last y days, and rank these records by other search criteria (e.g. using BM25 on text)
- As the only ranking criteria, e.g. sort all the records that answer a dichotomic criteria by their geographical or date proximity
In other words, proximity is either used as a retrieval criterion, rather than for relevance ranking, or is used as a single criterion for relevance ranking.
The problem is that the user is often willing to walk a longer distance for a better ranked restaurant, wait a little longer for a better ranked babysitter or read the most cited article even if it is not the most recent. In other words our actual search criteria as searchers combines proximity (space and time) with other criteria (such as word frequency related criteria) into a single relevance score.
Can’t you do it with Function Score or Painless script?
Picking our other hat, as the people designing, tuning and programming search algorithms, we frequently avoided such a combined proximity and word frequency (or various other numeric) ranking criteria because it was technically challenging. There are two main challenges:
- Normalization - The proximity score has to be normalized so that it will have the right impact when summed together with other relevance scores, such as BM25. The default in Lucene is to simply sum the various scores into one number and in most cases it makes sense, as long as the different scores have been normalized first. The normalization itself, however, is not trivial.
- Performance - Re-ranking with proximity took a significant performance toll. To deal with it, the more sophisticated Elastic users frequently formed a criteria that did not include proximity to select a very small subset of most relevant records, and then reranked only these records after taking proximity into account. This compromise means the algorithm will not consider results that could become the most relevant ones after proximity is taken into account.
We want to assist our users with non-trivial algorithmic challenges, so we sprung into action. In an upcoming version, we will enable users to rank all records with a combination of the ‘normal’ ranking score and a normalized proximity (time or space) score.
So can you add proximity to the ranking algorithm using Function Score or Script Score? It depends. It is possible to relate to proximity and to normalize it with function score and Painless script currently, but only on a small subset of the result set, and with a significant performance toll. Sometimes these quantitative differences sum up to a qualitative difference: with distance_feature queries, ranking by proximity in combination with other relevance ranking algorithms, becomes the reasonable principle ranking mechanism in many scenarios.
Saturation Normalization Function
For the normalization, we used the concept of the Saturation normalization function. The concept behind the Saturation normalization function is similar to the one used in BM25 to relate to the length of the text in the document in setting the ranking score for that document. A record’s score will be computed using the following formula: . Assuming , the function is a nonlinear strictly decreasing function starting from when (assuming ) with . One way to think of the parameter is that it defines the distance at which but another way to think of it is that it defines how quickly the contribution of distance to the score diminishes. The boost parameter is an optional parameter that by default is set to , and allows to set the influence of distance on the overall score in a way that is reduced linearly to the distance, i.e. it is a way to indicate how important will distance be in the overall score, regardless of the specific area of the scale a record is in.
The above is the function plot for the function with set to and plotted on the X axis.
The bottom line is that by providing one mandatory parameter () the user gets a normalized score, ranging from to (given that our computers are discrete and have a ‘highest number’ that functions as infinity in this case). This score allows distance to impact more when the distance is small and less when the distance is large, which is how people typically want distance to impact.
Better performance enabling better ranking
In theory, all of this could be done currently using re-ranking through Painless script. The addition of Distance Feature Query simplifies it considerably and opens the door for those who are less proficient with Painless, which is important, but not the main benefit. The main issue is that in practice the performance toll frequently meant that the proximity impact on the ranking score could only be calculated for a tiny fraction of the records. The implication was that it was quite likely that the most relevant records after the introduction of the proximity score would not be in the set for which this impact would be calculated.
The Distance Feature Query performs better than the Painless script, but that’s not the big news. The really big news is that the distance feature query was designed to leverage the performance improvements resulting from the faster top-k queries (see Magic WAND: Faster Retrieval of Top Hits in Elasticsearch blog post). Those who can afford to get a lower boundary for number of results instead of an exact count can enjoy a potentially unbounded improvement in performance because the algorithm will disqualify at an early stage all records that cannot be one of the top K candidates.
Examples
Let’s plan an application for hotel reviews. A core functionality is to allow searching for reviews about a hotel. The user will want to search for the specific hotel, and to get reviews about aspects that are of interest to the user in the hotel as well as being comprehensive and as recent as possible.
Say our record structure is as follows:
PUT reviews { "mappings": { "properties": { "hotel": { "type": "text" }, "review": { "type": "text" }, "date": { "type": "date" } } } } PUT reviews/_doc/1 { "hotel": "Fawlty Towers", "review": "The hotel would have been managed in a perfect way, if it wasn't for the guests", "date" : "2017-12-27" } PUT reviews/_doc/2 { "hotel": "Fawlty Towers", "review": "The restaurant is, well, interesting", "date": "2018-07-07" } PUT reviews/_doc/3 { "hotel": "Fawlty Towers", "review": "We had an interesting vacation there. The service at the restaurant is a great chance to brush the dust off your Spanish. Mr. Fawlty himself had recently moved to The Life of Brian.", "date": "2019-03-25" } GET reviews/_search { "query": { "bool" : { "must" : { "match" : {"hotel": "Fawlty Towers"} }, "should" : [ {"distance_feature" : {"field" : "date","pivot" : "30d","origin": "now"}}, {"match" : {"review": "restaurant"}} ] } } } Results: { "took": 5, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1.5571662, "hits": [ { "_index": "reviews", "_type": "_doc", "_id": "3", "_score": 1.5571662, "_source": { "hotel": "Fawlty Towers", "review": "We had an interesting vacation there. The service at the restaurant is a great chance to brush the dust off your Spanish. Mr. Fawlty himself had recently moved to The Life of Brian.", "date": "2019-03-25" } }, { "_index": "reviews", "_type": "_doc", "_id": "2", "_score": 1.0365787, "_source": { "hotel": "Fawlty Towers", "review": "The restaurant is, well, interesting", "date": "2018-07-07" } }, { "_index": "reviews", "_type": "_doc", "_id": "1", "_score": 0.32892755, "_source": { "hotel": "Fawlty Towers", "review": "The hotel would have been managed in a perfect way, if it wasn't for the guests", "date": "2017-12-27" } } ] } }
The query filters to show only records in which the hotel is “Fawlty Towers”. Within these records, the ranking is based on the word restaurant, which two records contain. By BM25 only, record 2 would be ranked higher than record 1, just because both records include the word “restaurant” once and the review in the second record is longer than in the 3rd one (the first record does not include the word restaurant and so would be ranked lowest). However, due to the time distance query, the 3rd record would show up first, and the user will get the latest update about Fawlty Towers.
The distance feature query skips non-competitive records and through that achieved better performance, which allows running the combined relevance criteria that includes both time proximity and BM25 on natural language.
Conclusion
Proximity in either geographical space or time is an important aspect of almost every relevance ranking use case. Adding distance ranking to the relevance ranking, together with the normalization capability, will for sure prove useful in real life scenarios, and represents our long term strategic investment in relevance ranking.
Just as important is the full story of how this capability came about. It’s all started with a user from the community that came up with a suggestion to incorporate the MAXSCORE algorithm into Lucene. That original suggestion, although it was not accepted to Lucene in its original form, later led to our contribution of the block-max WAND algorithm to Lucene, and more generally introduced APIs that help compute top hits without evaluating all matches. This in-turn led to the development of Lucene's FeatureField for static features — exposed in Elasticsearch via the rank_feature and rank_features fields — and now feature queries for dynamically computed distances. This cycle of contributing and exposing the code we produce so that the community can come up with feedback and ideas, helps steer us towards new, useful and innovative directions.
We’re excited about this new feature, and we hope you are too. So for now, read more about the distance_feature
query, and once it is available, download it and take it for a spin. Once you’ve had a chance to test this new geo-ranking feature out, give us your feedback and comments about how you want to use distance feature query. Your input will help direct us in our future development.