WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Ignoring TF/IDF
Sometimes we just don’t care about TF/IDF. All we want to know is that a certain word appears in a field. Perhaps we are searching for a vacation home and we want to find houses that have as many of these features as possible:
- WiFi
- Garden
- Pool
The vacation home documents look something like this:
{ "description": "A delightful four-bedroomed house with ... " }
We could use a simple match query:
GET /_search { "query": { "match": { "description": "wifi garden pool" } } }
However, this isn’t really full-text search. In this case, TF/IDF just gets in the way. We don’t care whether wifi is a common term, or how often it appears in the document. All we care about is that it does appear.
In fact, we just want to rank houses by the number of features they have: the more, the better. If a feature is present, it should score 1, and if it isn’t, 0.
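The ranking we are after can be sketched in a few lines of Python. This only illustrates the intended scoring, not how Elasticsearch computes it. The function name `feature_score` and the sample descriptions are invented for the example, and the naive lowercase-and-split tokenization stands in for a real analyzer.

```python
def feature_score(description, wanted):
    """Score 1 for each wanted feature that appears in the description, else 0."""
    # Naive stand-in for analysis: lowercase and split on whitespace.
    tokens = set(description.lower().split())
    return sum(1 for feature in wanted if feature in tokens)


wanted = ["wifi", "garden", "pool"]

# Hypothetical documents, invented for illustration.
homes = {
    "A": "Cottage with wifi and a large garden",
    "B": "Villa with pool garden and wifi wifi wifi",
    "C": "Flat with wifi",
}

# Rank homes by how many of the wanted features they mention.
ranked = sorted(homes, key=lambda name: feature_score(homes[name], wanted), reverse=True)
print(ranked)
```

Note that repeating a word in a description does not raise the score here: each feature counts once, which is exactly the behavior TF/IDF does not give us.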
constant_score Query
Enter the constant_score query. This query can wrap either a query or a filter, and assigns a score of 1 to any documents that match, regardless of TF/IDF:
GET /_search { "query": { "bool": { "should": [ { "constant_score": { "query": { "match": { "description": "wifi" }} }}, { "constant_score": { "query": { "match": { "description": "garden" }} }}, { "constant_score": { "query": { "match": { "description": "pool" }} }} ] } } }
Perhaps not all features are equally important—some have more value to the user than others. If the most important feature is the pool, we could boost that clause to make it count for more:
GET /_search { "query": { "bool": { "should": [ { "constant_score": { "query": { "match": { "description": "wifi" }} }}, { "constant_score": { "query": { "match": { "description": "garden" }} }}, { "constant_score": { "boost": 2, "query": { "match": { "description": "pool" }} }} ] } } }
A matching pool clause would have a score of 2, while the other clauses would have a score of 1 each.
The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and query normalization factor are still taken into account.
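To see why the sum alone is not the final score, here is a simplified model in Python. It assumes that the coordination factor is the fraction of `should` clauses that matched. It treats the query normalization factor as a constant, since that factor is the same for every document within one query and so does not change the ranking. The function name and the exact formula are assumptions made for illustration; this is not Lucene's implementation.

```python
def relative_score(matched_boosts, total_clauses, query_norm=1.0):
    """Simplified model: query_norm * coord * sum of matching clause scores."""
    # coord rewards documents that match a larger share of the clauses.
    coord = len(matched_boosts) / total_clauses
    return query_norm * coord * sum(matched_boosts)


# Three should clauses: wifi (boost 1), garden (boost 1), pool (boost 2).
pool_only = relative_score([2], total_clauses=3)            # matches pool only
wifi_and_garden = relative_score([1, 1], total_clauses=3)   # matches wifi and garden
all_three = relative_score([1, 1, 2], total_clauses=3)      # matches every clause

print(pool_only, wifi_and_garden, all_three)
```

In this model, a home with only a pool and a home with WiFi and a garden both have clause scores that sum to 2, but the second ranks higher because it matches more of the clauses.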
We could improve our vacation home documents by adding a not_analyzed features field:
{ "features": [ "wifi", "pool", "garden" ] }
By default, a not_analyzed field has field-length norms disabled and has index_options set to docs, disabling term frequencies, but the problem remains: the inverse document frequency of each term is still taken into account.
We could use the same approach that we used previously, with the constant_score query:
GET /_search { "query": { "bool": { "should": [ { "constant_score": { "query": { "match": { "features": "wifi" }} }}, { "constant_score": { "query": { "match": { "features": "garden" }} }}, { "constant_score": { "boost": 2, "query": { "match": { "features": "pool" }} }} ] } } }
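When the list of features comes from user input, the request body can be generated rather than written by hand. The following Python sketch builds the same bool/should structure as the request above. `build_feature_query` is a hypothetical helper, not part of any Elasticsearch client, and it only produces the JSON body without sending it anywhere.

```python
import json


def build_feature_query(field, boosts):
    """Build a bool query with one constant_score should clause per feature."""
    should = []
    for feature, boost in boosts.items():
        clause = {"query": {"match": {field: feature}}}
        if boost != 1:
            # Only emit boost when it differs from the default of 1.
            clause["boost"] = boost
        should.append({"constant_score": clause})
    return {"query": {"bool": {"should": should}}}


# Feature names mapped to their boosts; pool counts double.
body = build_feature_query("features", {"wifi": 1, "garden": 1, "pool": 2})
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` also guarantees well-formed JSON, which avoids small slips such as a missing comma between `boost` and `query`.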
Really, though, each of these features should be treated like a filter. A vacation home either has the feature or it doesn’t—a filter seems like it would be a natural fit. On top of that, if we use filters, we can benefit from filter caching.
The problem is this: filters don’t score. What we need is a way of bridging the gap between filters and queries. The function_score query does this and a whole lot more.