Introducing the sparse vector query: Searching sparse vectors with inference or precomputed query vectors

Learn about the Elasticsearch sparse vector query, how it works, and how to effectively use it.

Sparse vector queries take advantage of Elasticsearch’s powerful inference API, allowing easy built-in setup for Elastic-hosted models such as ELSER and E5, as well as the flexibility to host other models.

Introduction

Vector search is evolving, and as our needs for vector search evolve, so does the need for a consistent, forward-thinking vector search API.

When Elastic first launched semantic search, we leveraged existing rank_features fields using the text_expansion query. We then reintroduced the sparse_vector field type for semantic search use cases.

As we think about where sparse vector search is headed, we’ve introduced a new sparse vector query. As of Elasticsearch 8.15.0, both the text_expansion query and the weighted_tokens query have been deprecated in favor of the new sparse_vector query.

The sparse vector query supports two modes of querying: using an inference ID and using precomputed query vectors. Both modes require data to be indexed into a sparse_vector mapped field.
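
For example, a minimal index mapping with a sparse_vector field (named embeddings here, to match the examples throughout this post) could look like this:

PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "embeddings": {
        "type": "sparse_vector"
      }
    }
  }
}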

Data indexed into a sparse_vector field is stored as token-weight pairs produced by an inference model. At query time, query vectors are calculated using the same inference model that was used to create the indexed tokens.

Let’s look at an example. Say we’ve indexed a document detailing when Orion is most visible in the night sky, and the model has expanded it into token-weight pairs like these (the weights below are illustrative, not actual model output):
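
{
  "content": "Orion is most visible in the northern hemisphere's night sky from November to February.",
  "embeddings": {
    "orion": 1.6,
    "constellation": 1.6,
    "northern": 1.2,
    "hemisphere": 1.2,
    "sky": 1.1,
    "visible": 0.9,
    "november": 0.7,
    "winter": 0.5
  }
}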

Now, assume we’re looking for constellations that are visible in the northern hemisphere, and we run this query through the same learned sparse encoder model. The output might look similar to this (the same token-weight pairs reused in the precomputed query vector example later in this post):
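
{
  "constellation": 2.5,
  "northern": 1.9,
  "hemisphere": 1.8,
  "orion": 1.5,
  "galaxy": 1.4,
  "astronomy": 0.9,
  "telescope": 0.3,
  "star": 0.01
}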

At query time, these query vectors are ORed together, and scoring is effectively a dot product calculation between the stored dimensions and the query dimensions. Only dimensions present in both vectors contribute, which, with the illustrative weights above, would score this example at 10.84:
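
constellation: 2.5 × 1.6 = 4.00
northern:      1.9 × 1.2 = 2.28
hemisphere:    1.8 × 1.2 = 2.16
orion:         1.5 × 1.6 = 2.40
                   score = 10.84

The remaining query tokens (galaxy, astronomy, telescope, star) match no stored dimension and contribute nothing to the score.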

Sparse vector queries with inference

Sparse vector queries using inference work in a very similar way to the previous text_expansion query. Instead of sending in a trained model ID directly, we create an inference endpoint associated with the model we want to use.

Here’s an example of how to create an inference endpoint for ELSER:

PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
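
To generate the token-weight pairs at ingest time, one option is an ingest pipeline with an inference processor that references the endpoint. Here’s a minimal sketch, assuming the processor can reference the endpoint by ID; the pipeline name my-elser-pipeline and the content field are illustrative:

PUT _ingest/pipeline/my-elser-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my-elser-endpoint",
        "input_output": [
          {
            "input_field": "content",
            "output_field": "embeddings"
          }
        ]
      }
    }
  ]
}

Documents indexed through this pipeline have their content expanded into token-weight pairs stored in the embeddings field.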

Use the same inference endpoint both to index your sparse vector data and as the input to your sparse_vector query. For example:

POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "inference_id": "my-elser-endpoint",
      "query": "constellations in the northern hemisphere"
    }
  }
}

Sparse vector queries with precomputed query vectors

You may have precomputed query vectors that don’t require inference at query time. You can send these directly into the sparse_vector query instead of an inference ID. Here is an example:

POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "query_vector": {
        "constellation": 2.5,
        "northern": 1.9,
        "hemisphere": 1.8,
        "orion": 1.5,
        "galaxy": 1.4,
        "astronomy": 0.9,
        "telescope": 0.3,
        "star": 0.01
      }
    }
  }
}
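
Note that precomputed query vectors only produce meaningful scores when they come from the same model, and therefore the same token vocabulary and weighting scheme, as the vectors you indexed.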

Query optimization with token pruning

Like text expansion search, the sparse vector query is subject to performance penalties from huge boolean queries. Therefore, the same token pruning strategies available for the text_expansion query are available in the sparse_vector query. You can see the impact of token pruning in our nightly MS MARCO Passage Ranking benchmarks.

In order to enable pruning with the default pruning configuration (which has been tuned for ELSER V2), simply add prune: true to your request:

POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "inference_id": "my-elser-endpoint",
      "query": "constellations in the northern hemisphere",
      "prune": true
    }
  }
}

Alternatively, you can adjust the pruning configuration by sending it directly with the request:

GET my-index/_search
{
   "query":{
      "sparse_vector":{
         "field": "embeddings",
         "inference_id": "my-elser-endpoint",
         "query": "constellations in the northern hemisphere",
         "prune": true,
         "pruning_config": {
           "tokens_freq_ratio_threshold": 5,
           "tokens_weight_threshold": 0.4,
           "only_score_pruned_tokens": false
         }
      }
   }
}
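
In this configuration, tokens_freq_ratio_threshold controls how much more frequent than the average token in the field a token may be before it is considered an outlier, tokens_weight_threshold sets the weight below which a token is considered insignificant, and only_score_pruned_tokens determines whether the query scores the tokens it keeps (false, as here) or only the tokens it prunes (true, which is mainly useful in a rescore, as shown next).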

Because token pruning can incur a recall penalty, we recommend adding the pruned tokens back in a rescore:

GET my-index/_search
{
   "query":{
      "sparse_vector":{
         "field": "embeddings",
         "inference_id": "my-elser-endpoint",
         "query": "constellations in the northern hemisphere",
         "prune": true,
         "pruning_config": {
           "tokens_freq_ratio_threshold": 5,
           "tokens_weight_threshold": 0.4,
           "only_score_pruned_tokens": false
         }
      }
   },
   "rescore": {
      "window_size": 100,
      "query": {
         "rescore_query": {
            "sparse_vector": {
               "field": "embeddings",
               "inference_id": "my-elser-endpoint",
               "query": "constellations in the northern hemisphere",
               "prune": true,
               "pruning_config": {
                   "tokens_freq_ratio_threshold": 5,
                   "tokens_weight_threshold": 0.4,
                   "only_score_pruned_tokens": true
               }
            }
         }
      }
   }
}
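
In this pattern, the main query scores only the tokens that survive pruning, while the rescore query, with only_score_pruned_tokens set to true, re-scores the top 100 hits using only the pruned tokens, adding their scoring contribution back for the most relevant documents.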

What's next?

While the text_expansion query is GA’d and will be supported throughout Elasticsearch 8.x, we recommend moving to the sparse_vector query as soon as possible to ensure you’re using the most up-to-date features as we continually improve the vector search experience in Elasticsearch.

If you are using the weighted_tokens query, note that it was never GA’d and will be replaced by the sparse_vector query very soon.

The sparse_vector query is available starting with 8.15.0 and is already available in Serverless - try it out today!

Ready to try this out on your own? Start a free trial or use this self-paced hands-on learning for Search AI.

Elasticsearch has integrations for tools from LangChain, Cohere and more. Join our advanced semantic search webinar to build your next GenAI app!

Ready to build state of the art search experiences?

Sufficiently advanced search isn’t achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as you are. Let’s connect and work together to build the magical search experience that will get you the results you want.
