WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Stopwords and Performance
The biggest disadvantage of keeping stopwords is performance. When Elasticsearch performs a full-text search, it has to calculate the relevance `_score` of every matching document in order to return the top 10 matches.
While most words typically occur in far fewer than 0.1% of all documents, a few words such as `the` may occur in almost all of them. Imagine you have an index of one million documents. A query for `quick brown fox` may match fewer than 1,000 documents. But a query for `the quick brown fox` has to score and sort almost all of the one million documents in your index, just to return the top 10!
The problem is that `the quick brown fox` is really a query for `the OR quick OR brown OR fox`: any document that contains nothing more than the almost meaningless term `the` is included in the result set. What we need is a way of reducing the number of documents that need to be scored.
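To make the effect concrete, here is a minimal sketch that counts the documents an OR query has to consider. It uses a toy in-memory inverted index rather than Elasticsearch itself; the corpus, `build_index`, and `or_candidates` are hypothetical names introduced only for illustration.

```python
# Toy corpus: a hypothetical stand-in for a real index.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the cat sat on the mat",
    4: "a quick reply",
    5: "the end",
}

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def or_candidates(index, query):
    """Return the documents that contain ANY query term (OR semantics)."""
    matched = set()
    for term in query.lower().split():
        matched |= index.get(term, set())
    return matched

index = build_index(docs)
without_the = or_candidates(index, "quick brown fox")
with_the = or_candidates(index, "the quick brown fox")

print(sorted(without_the))  # [1, 4]
print(sorted(with_the))     # [1, 2, 3, 4, 5]
```

Adding the single term `the` pulls every document in this toy corpus into the set that would have to be scored.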
and Operator
The easiest way to reduce the number of documents is simply to use the `and` operator with the `match` query, in order to make all words required.
A `match` query like this:
{ "match": { "text": { "query": "the quick brown fox", "operator": "and" } } }
is rewritten as a `bool` query like this:
{ "bool": { "must": [ { "term": { "text": "the" }}, { "term": { "text": "quick" }}, { "term": { "text": "brown" }}, { "term": { "text": "fox" }} ] } }
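The rewrite shown above can be sketched in a few lines of Python. This is only an illustration of the shape of the resulting query, not how Elasticsearch performs the rewrite internally. The helper `match_and_to_bool` is hypothetical, and it assumes terms are produced by lowercasing and splitting on whitespace, whereas a real analyzer may tokenize differently.

```python
import json

def match_and_to_bool(field, query_text):
    """Build a bool query that requires every term of query_text.

    Assumption: terms come from lowercasing and whitespace splitting;
    the analyzer configured on a real field may behave differently.
    """
    terms = query_text.lower().split()
    return {"bool": {"must": [{"term": {field: term}} for term in terms]}}

rewritten = match_and_to_bool("text", "the quick brown fox")
print(json.dumps(rewritten, indent=2))
```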
The `bool` query is intelligent enough to execute each `term` query in the optimal order: it starts with the least frequent term. Because all terms are required, only documents that contain the least frequent term can possibly match. Using the `and` operator greatly speeds up multiterm queries.
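The idea of starting with the least frequent term can be illustrated with plain set intersection. This is a simplified sketch, not Lucene's actual implementation; the posting sets and the `and_candidates` helper are hypothetical.

```python
def and_candidates(index, terms):
    """Intersect posting sets, rarest term first.

    Returns the matching document IDs and the order in which the
    terms were processed.
    """
    order = sorted(terms, key=lambda term: len(index.get(term, set())))
    if not order:
        return set(), order
    # Only documents containing the rarest term can possibly match.
    result = set(index.get(order[0], set()))
    for term in order[1:]:
        if not result:
            break  # nothing left to match, skip the remaining terms
        result &= index.get(term, set())
    return result, order

# Hypothetical posting sets: "the" appears in almost every document.
index = {
    "the": set(range(1, 1001)),
    "quick": {3, 17, 42, 256},
    "brown": {17, 42, 99},
    "fox": {42, 512},
}

matches, order = and_candidates(index, ["the", "quick", "brown", "fox"])
print(order)    # ['fox', 'brown', 'quick', 'the']
print(matches)  # {42}
```

Because the candidate set starts with the two documents that contain `fox`, the large posting set for `the` is only ever checked against a handful of documents.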
minimum_should_match
In Controlling Precision, we discussed using the `minimum_should_match` parameter to trim the long tail of less-relevant results. It is useful for this purpose alone but, as a nice side effect, it offers a similar performance benefit to the `and` operator:
{ "match": { "text": { "query": "the quick brown fox", "minimum_should_match": "75%" } } }
In this example, at least three out of the four terms must match. A document that contains neither of the two least frequent terms can match at most two terms, so the only documents that need to be considered are those that contain either the least frequent or the second least frequent term.
This offers a huge performance gain over a simple query with the default `or` operator! But we can do better yet…
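The reasoning behind the `minimum_should_match` example can be written out as a small calculation. This is a sketch under stated assumptions: the percentage is rounded down to a whole number of terms, the document frequencies are made up, and `required_count` and `candidate_terms` are hypothetical helpers rather than part of any Elasticsearch API.

```python
def required_count(num_terms, percent):
    """Number of terms that must match, rounding the percentage down."""
    return num_terms * percent // 100

def candidate_terms(doc_freq, required):
    """Rarest terms of which every matching document must contain at least one.

    If `required` of n terms must match, a document that misses the
    (n - required + 1) rarest terms can match at most required - 1 terms.
    """
    by_rarity = sorted(doc_freq, key=doc_freq.get)
    return by_rarity[: len(by_rarity) - required + 1]

# Hypothetical document frequencies for a one-million-document index.
doc_freq = {"the": 990_000, "quick": 800, "brown": 650, "fox": 120}

required = required_count(len(doc_freq), 75)
print(required)                             # 3
print(candidate_terms(doc_freq, required))  # ['fox', 'brown']
```

With four terms and three required, only documents containing `fox` or `brown` need to be examined, which matches the two least frequent terms described in the text.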