Improving information retrieval in the Elastic Stack: Steps to improve search relevance

Elasticsearch has native integrations with the industry-leading Gen AI tools and providers. Check out our webinars on going Beyond RAG Basics, or building prod-ready apps with the Elastic vector database.

To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.

Since 8.0 and the release of third-party natural language processing (NLP) models for text embeddings, users of the Elastic Stack have access to a wide variety of models to embed their text documents and perform query-based information retrieval using vector search.

Given all these components and their parameters, and depending on the text corpus you want to search in, it can be overwhelming to choose which settings will give the best search relevance.

In this series of blog posts, we will introduce a number of tests we ran using various publicly available data sets and information retrieval techniques that are available in the Elastic Stack. We’ll then provide recommendations of the best techniques to use depending on the setup.

To kick off this series of blogs, we want to set the stage by describing the problem we are addressing and describe some methods we will dig further into in subsequent blogs.

Background and terminology

BM25: A sparse, unsupervised model for lexical search

The classic way documents are ranked for relevance by Elasticsearch according to a text query uses the Lucene implementation of the Okapi BM25 model. Although a few hyperparameters of this model were fine-tuned to optimize the results in most scenarios, this technique is considered unsupervised as labeled queries and documents are not required to use it: it’s very likely that the model will perform reasonably well on any corpus of text, without relying on annotated data. BM25 is known to be a strong baseline in zero-shot retrieval settings.

Under the hood, this kind of model builds a matrix of term frequencies (how many times a term appears in each document) and inverse document frequencies (inverse of how many documents contain each term). It then scores each query term for each document that was indexed based on those frequencies. Because each document typically contains a small fraction of all words used in the corpus, the matrix contains a lot of zeros. This is why this type of representation is called sparse.

Also, this model sums the relevance score of each individual term within a query for a document, without taking into account any semantic knowledge (synonyms, context, etc.). This is called lexical search (as opposed to semantic search). Its shortcoming is the so-called vocabulary mismatch problem, that query vocabulary is slightly different to the document vocabulary. This motivates other scoring models that try to incorporate semantic knowledge to avoid this problem.

Dense models: A dense, supervised model for semantic search

More recently, transformer-based models have allowed for a dense, context aware representation of text, addressing the principal shortcomings mentioned above.

To build such models, the following steps are required:

1. Pre-training
We first need to train a neural network to understand the basic syntax of natural language.

Using a huge corpus of text, the model learns semantic knowledge by training on unsupervised tasks (like Masked Word Prediction or Next Sentence Prediction).
BERT is probably the best known example of these models — it was trained on Wikipedia (2.5B words) and BookCorpus (800M words) using Masked Word Prediction.

This is called pre-training. The model learns vector representations of language tokens, which can be adapted for other tasks with much less training.

Note that at this step, the model wouldn’t perform well on downstream NLP tasks.

This step is very expensive, but many such foundational models exist that can be used off the shelf.

2. Task-specific training
Now that the model has built a representation of natural language, it’ll train much more effectively on a specific task such as Dense Passage Retrieval (DPR) that allows Question Answering.

To do so, we must slightly adapt the model’s architecture and then train it on a large number of instances of the task, which, for DPR, consists in matching a relevant passage taken from a relevant document.

So this requires a labeled data set, that is, a collection of triplets :

A query: "What is gold formed in?"
A document or passage taken from a document: "The core of large stars, especially during a nova"
Optionally, a score of degree of relevance for this (query, document) pair (If no score is given, we assume that the score is binary, and that all the other documents can be considered as irrelevant for the given query.)

A very popular and publicly available data set to perform such a training for DPR is the MS MARCO data set.

This data set was created using queries and top results from Microsoft’s Bing search engine. As such, the queries and documents it contains fall in the general knowledge linguistic domain, as opposed to specific linguistic domain (think about research papers or language used in law).

This notion of linguistic domain is important, as the semantic knowledge learned by those models is giving them an important advantage “in-domain”: when BERT came out, it improved previous state of the art models on this MS MARCO data set by a huge margin.

3. Domain-specific training
Depending on how different your data is from the data set used for task-specific training, you might need to train your model using a domain specific labeled data set. This step is also referred to as fine tuning for domain adaptation or domain-adaptation.

The good news is that you don’t need as large a data set as was required for the previous steps — a few thousands or tens of thousands of instances of the tasks can be enough.

The bad news is that these query-document pairs need to be built by domain experts, so it’s usually a costly option.

The domain adaptation is roughly similar to the task-specific training.

Having introduced these various techniques, we will measure how they perform on a wide variety of data sets. This sort of general purpose information retrieval task is of particular interest for us. We want to provide tools and guidance for a range of users, including those who don’t want to train models themselves in order to gain some of the benefits they bring to search. In the next blog post of this series, we will describe the methodology and benchmark suite we will be using.