Configurable chunking settings for inference API endpoints

The Elasticsearch open inference API now supports configurable chunking for documents ingested into semantic text fields.

The Elasticsearch Inference API lets users run machine learning models from a variety of providers to perform inference operations. One common use case of this API is to power semantic text fields used for semantic search within an index. As a document’s data grows, creating a single embedding over all of it yields less accurate results, and some inference models also limit the size of the inputs they can process. To handle this, the inference API uses a process called chunking to break large documents down into smaller, more manageable subsections of the original data, called chunks, as they are ingested into an index. The inference operations are then run against each individual chunk, and the inference results for each chunk are stored within the index.

In this blog, we’ll go over the available chunking strategies, explain how Elasticsearch chunks text, and show how to configure chunking settings for an inference endpoint.

What can I configure with chunking settings?

From 8.16, users can select from two strategies for generating chunks, each with its own configurable properties.

Word based chunking strategy

Configurable values provided by the user:

  • (required) max_chunk_size: The maximum number of words in a chunk.
  • (required) overlap: The number of overlapping words for chunks.
    • Note: This cannot be defined as more than half of the max_chunk_size.

Word based chunking splits input data into chunks with word counts up to the provided max_chunk_size. This strategy will always fill a chunk to the maximum size before building the next chunk unless it reaches the end of the input data. Each chunk after the first will have a number of words overlapping from the previous chunk based on the provided overlap value. The purpose of this overlap is to increase inference accuracy by preventing useful context for inference results from being split across chunks.
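
For example, a word based configuration passed in the chunking_settings object when creating an inference endpoint could look like the following (the values shown here are illustrative; note that overlap is at most half of max_chunk_size):

"chunking_settings": {
    "strategy": "word",
    "max_chunk_size": 250,
    "overlap": 100
}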

Sentence based chunking strategy

Configurable values provided by the user:

  • (required) max_chunk_size: The maximum number of words in a chunk.
  • (required) sentence_overlap: The number of overlapping sentences for chunks.
    • Note: This can only be defined as 0 or 1.

Sentence based chunking will split input data into chunks containing full sentences. Chunks will contain only complete sentences, except when a sentence is longer than max_chunk_size, in which case it will be split across chunks. Each chunk after the first will have a number of sentences from the previous chunk overlapping based on the provided sentence_overlap value.
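
The equivalent sentence based configuration could look like this (again with illustrative values; the full create request in Step 1 below uses this strategy):

"chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
}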

Note: If no chunking settings are provided when creating an inference endpoint after 8.16, the default chunking settings will use a sentence strategy with max_chunk_size of 250 and a sentence_overlap of 1. For inference endpoints created before 8.16, the default chunking settings will use a word strategy with a max_chunk_size of 250 and an overlap of 1.

How do I select a chunking strategy?

There is no one-size-fits-all solution for the best chunking strategy. The best chunking strategy will vary based on the documents being ingested, the underlying model being used and any compute constraints you have. We recommend taking a subset of your corpus and some example queries and seeing how changing the strategy, chunk size and overlap affects your use case. For example, you might parameter sweep over different chunk overlaps and lengths and measure the time to ingest, the impact on search latency and the relevance of the top results for each query.
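
As a rough sketch of what such a sweep could look like with the Python client used in the walkthrough below: the snippet creates one ELSER endpoint and one index per candidate configuration and times ingestion of a sample corpus. It assumes an existing Elasticsearch Python client instance named client; the endpoint and index names, the candidate settings and the sample_docs corpus are placeholders to replace with your own. Measuring search latency and result relevance for your example queries would follow the same pattern.

import time

# Placeholder corpus; replace with a representative subset of your own documents.
sample_docs = [
    "Replace this with real documents from your corpus.",
    "Use enough text that chunking behavior is actually exercised.",
]

# Candidate chunking configurations to compare.
candidate_settings = [
    {"strategy": "sentence", "max_chunk_size": 250, "sentence_overlap": 1},
    {"strategy": "sentence", "max_chunk_size": 120, "sentence_overlap": 0},
    {"strategy": "word", "max_chunk_size": 250, "overlap": 100},
    {"strategy": "word", "max_chunk_size": 150, "overlap": 50},
]

for i, chunking in enumerate(candidate_settings):
    endpoint_id = f"elser_sweep_{i}"
    index_name = f"chunking_sweep_{i}"

    # One inference endpoint and one index per configuration under test.
    client.inference.put(
        task_type="sparse_embedding",
        inference_id=endpoint_id,
        body={
            "service": "elasticsearch",
            "service_settings": {
                "num_allocations": 1,
                "num_threads": 1,
                "model_id": ".elser_model_2",
            },
            "chunking_settings": chunking,
        },
    )
    client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "infer_field": {"type": "semantic_text", "inference_id": endpoint_id}
            }
        },
    )

    # Time ingestion for this configuration.
    start = time.time()
    for doc in sample_docs:
        client.index(index=index_name, document={"infer_field": doc})
    client.indices.refresh(index=index_name)
    print(endpoint_id, chunking, f"ingest took {time.time() - start:.2f}s")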

The following are a few guidelines to help when starting out with configurable chunking:

Picking a chunking strategy

Generally, a sentence based chunking strategy works well to minimize context loss. However, it can often result in more chunks being generated as the process prioritizes keeping sentences intact over maximally filling each chunk. As such, an optimized word based chunking strategy may produce fewer chunks, which are more efficient to ingest and search.

Picking a chunk size

The chunk size should be selected to avoid splitting useful contextual information across chunks while keeping each chunk coherent in topic. Typically, chunks as close as possible to the maximum sequence length the model supports work better. However, long chunks are more likely to contain a mixture of topics that are less well represented.

Picking a chunk overlap

As the overlap between chunks increases, so does the number of chunks generated. Similar to chunk size, you'll want to select an overlap that helps to minimize the chance of splitting important context across chunks, subject to your compute constraints. Typically, more overlap, up to half the typical chunk length, results in better retrieval quality but comes at an increased cost.

How does Elasticsearch chunk text?

Elasticsearch uses the ICU4J library to detect word and sentence boundaries. Word boundaries are identified by following a series of rules, not just the presence of a whitespace character. For written languages that do not use whitespace, such as Chinese or Japanese, dictionary lookups are used to detect word boundaries. Sentence boundaries are similarly identified by following a series of rules, not just the presence of a period character. This ensures that sentence boundaries are accurately identified across languages in which sentence structures and sentence breaking characteristics may vary.
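
To get a feel for ICU's rule based sentence detection, you can experiment with the same ICU rules from Python via the PyICU bindings. This is just an illustrative sketch of the underlying library's behavior, not how Elasticsearch invokes it, and it assumes PyICU is installed:

from icu import BreakIterator, Locale

text = "The meeting starts at 3 p.m. on Friday. Please don't be late!"

# ICU sentence breaking follows a set of rules, so the period in "p.m."
# followed by a lowercase word does not end the sentence, while the period
# before "Please" does.
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

start = bi.first()
for end in bi:  # iterating yields successive sentence boundary offsets
    print(repr(text[start:end]))
    start = end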

Finally, we note that sometimes chunks benefit from long range context, which can't be retained by any simple chunking strategy. In these cases, if you are prepared to pay the cost, chunks can be enriched with additional generated context. For more details, see this discussion.

How do I configure chunking settings for an inference endpoint?

Prerequisites

Before configuring chunking settings, ensure that you have met the following requirements:

  1. You have a valid enterprise license.
  2. If you are configuring chunking settings for an inference endpoint connecting to any third-party integration, you have set up any necessary permissions to access these services (e.g., created accounts, retrieved API keys, etc.).
    • For the purposes of this guide, we will be configuring chunking settings for an inference endpoint using Elastic’s ELSER model, for which the only requirement is having a valid enterprise license. To find the information required to create an inference endpoint for a third-party integration, see the create inference endpoint API documentation.

Step 1: Configure chunking settings during inference endpoint creation

Create the inference endpoint with the create inference API, providing the chunking settings in the request body. Here the endpoint uses a sentence strategy with a max_chunk_size of 25 and a sentence_overlap of 1:

client.inference.put(
    task_type="sparse_embedding",
    inference_id="my_elser_endpoint",
    body={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": ".elser_model_2"
        },
        "chunking_settings": {
            "strategy": "sentence",
            "max_chunk_size": 25,
            "sentence_overlap": 1
        }
    }
)

Step 2: Create an index with a semantic text field

Create an index containing a semantic_text field that references the inference endpoint created in Step 1:

client.indices.create(
    index="my_index",
    mappings={
        "properties": {
            "infer_field": {
                "type": "semantic_text",
                "inference_id": "my_elser_endpoint"
            }
        }
    }
)

Step 3: Ingest a document into the index

Ingest a document into the index created above by calling the index document API:

client.index(index="my_index", document={
    "infer_field": "This is some sample document data. The data is being used to demonstrate the configurable chunking settings feature. The configured chunking settings will determine how this text is broken down into chunks to help increase inference accuracy."
})

The generated chunks and their corresponding inference results are stored in the document under the chunks key within the _inference_fields metafield. To see the stored chunks, you can search for all documents in the index with the search API:

client.search(index="my_index", body={
    "size": 100,
    "query": {
        "match_all": {}
    },
    "fields": ["_inference_fields"]
})

The chunks can be seen in the response. Before 8.18, the chunks were stored as full-chunk text values. From 8.18, the chunks are stored as a list of character offset values:

...
'chunks': {
    'infer_field': [
        {'start_offset': 0, 'end_offset': 117, 'embeddings': [...]},
        {'start_offset': 34, 'end_offset': 198, 'embeddings': [...]},
        {'start_offset': 120, 'end_offset': 242, 'embeddings': [...]}
    ]
}
...
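
Because the chunks are stored as offsets into the original field value, you can recover each chunk's text on the client side by slicing the source string. A minimal sketch using the sample document from Step 3 and the offsets from the example response above (embeddings omitted):

# Original value of infer_field for the document ingested in Step 3
doc_text = (
    "This is some sample document data. The data is being used to demonstrate the "
    "configurable chunking settings feature. The configured chunking settings will "
    "determine how this text is broken down into chunks to help increase inference accuracy."
)

# Offsets as returned in the example search response above
chunks = [
    {"start_offset": 0, "end_offset": 117},
    {"start_offset": 34, "end_offset": 198},
    {"start_offset": 120, "end_offset": 242},
]

for chunk in chunks:
    print(doc_text[chunk["start_offset"]:chunk["end_offset"]])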

Get started with configurable chunking today!

For more information on utilizing this feature, view the documentation on configuring chunking. Try out this notebook to get started with configurable chunking settings: Configuring Chunking Settings For Inference Endpoints.

Elasticsearch has native integrations with industry-leading Gen AI tools and providers. Check out our webinars on going Beyond RAG Basics, or building prod-ready apps with the Elastic Vector Database.

To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.
