Azure AI Document Intelligence is a powerful tool for extracting structured data from PDFs, including both text and table data. Once extracted, that data can be indexed into Elastic Cloud Serverless to power RAG (Retrieval Augmented Generation).
In this blog, we will demonstrate how powerful Azure AI Document Intelligence is by ingesting four recent Elastic N.V. quarterly reports. The PDFs range from 43 to 196 pages in length, and each contains both text and table data. We will test the retrieval of table data with the following prompt: "Compare/contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?"
This prompt is tricky because it requires context from four different PDFs that represent this information in tabular format.

Let’s walk through an end-to-end reference that consists of two main parts:
Python notebook
- Downloads four quarters’ worth of PDF 10-Q filings for Elastic N.V.
- Uses Azure AI Document Intelligence to parse the text and table data from each PDF file
- Outputs the text and table data into a JSON file
- Ingests the JSON files into Elastic Cloud Serverless
Elastic Cloud Serverless
- Creates vector embeddings for PDF text+table data
- Powers vector search database queries for RAG
- Provides a pre-configured OpenAI connector for LLM integration
- Provides an A/B test interface for chatting with the 10-Q filings
Prerequisites
The code blocks in this notebook require API keys for Azure AI Document Intelligence and Elasticsearch. The best starting point for Azure AI Document Intelligence is to create a Document Intelligence resource. For Elastic Cloud Serverless, refer to the get started guide. You will need Python 3.9+ to run these code blocks.
Create an .env file
Place secrets for Azure AI Document Intelligence and Elastic Cloud Serverless in a .env file.
AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT=YOUR_AZURE_RESOURCE_ENDPOINT
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY=YOUR_AZURE_RESOURCE_API_KEY
ES_URL=YOUR_ES_URL
ES_API_KEY=YOUR_ES_API_KEY
Install Python packages
!pip install elasticsearch python-dotenv tqdm azure-core azure-ai-documentintelligence requests httpx
Create input and output folders
import os
input_folder_pdf = "./pdf"output_folder_pdf = "./json"
folders = [input_folder_pdf, output_folder_pdf]
def create_folders_if_not_exist(folders): for folder in folders: os.makedirs(folder, exist_ok=True) print(f"Folder '{folder}' created or already exists.")
create_folders_if_not_exist(folders)
Download PDF files
Download four recent Elastic 10-Q quarterly reports. If you already have PDF files, feel free to place them in the ‘./pdf’ folder.
import os
import requests

def download_pdf(url, directory='./pdf', filename=None):
    if not os.path.exists(directory):
        os.makedirs(directory)

    response = requests.get(url)
    if response.status_code == 200:
        if filename is None:
            filename = url.split('/')[-1]
        filepath = os.path.join(directory, filename)
        with open(filepath, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded {filepath}")
    else:
        print(f"Failed to download file from {url}")

print("Downloading 4 recent 10-Q reports for Elastic NV.")

base_url = 'https://s201.q4cdn.com/217177842/files/doc_financials'
download_pdf(f'{base_url}/2025/q2/e5aa7a0a-6f56-468d-a5bd-661792773d71.pdf', filename='elastic-10Q-Q2-2025.pdf')
download_pdf(f'{base_url}/2025/q1/18656e06-8107-4423-8e2b-6f2945438053.pdf', filename='elastic-10Q-Q1-2025.pdf')
download_pdf(f'{base_url}/2024/q4/9949f03b-09fb-4941-b105-62a304dc1411.pdf', filename='elastic-10Q-Q4-2024.pdf')
download_pdf(f'{base_url}/2024/q3/7e60e3bd-ff50-4ae8-ab12-5b3ae19420e6.pdf', filename='elastic-10Q-Q3-2024.pdf')
Parse PDFs using Azure AI Document Intelligence
There is a lot going on in the code blocks that parse the PDF files. Here’s a quick summary:
- Set Azure AI Document Intelligence imports and environment variables
- Parse PDF paragraphs using AnalyzeResult
- Parse PDF tables using AnalyzeResult
- Combine PDF paragraph and table data
- Bring it all together by running steps 1-4 for each PDF file and storing the result in JSON
Set Azure AI Document Intelligence imports and environment variables
The most important import is AnalyzeResult. This class represents the outcome of a document analysis and contains details about the document. The details we care about are pages, paragraphs and tables.
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
import json
from dotenv import load_dotenv
from tqdm import tqdm

load_dotenv()

AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT = os.getenv('AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT')
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY = os.getenv('AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY')
Parse PDF paragraphs using AnalyzeResult
Extract the paragraph text from each page. Do not extract table data.
def parse_paragraphs(analyze_result):
    table_offsets = []
    page_content = {}

    for paragraph in analyze_result.paragraphs:
        for span in paragraph.spans:
            if span.offset not in table_offsets:
                for region in paragraph.bounding_regions:
                    page_number = region.page_number
                    if page_number not in page_content:
                        page_content[page_number] = []
                    page_content[page_number].append({
                        "content_text": paragraph.content
                    })

    return page_content, table_offsets
Parse the PDF tables using AnalyzeResult
Extract the table content from each page. Do not extract paragraph text. The most interesting side effect of this technique is that there is no need to transform table data. LLMs know how to read text that looks like: “Cell [0, 1]: table data…”.
def parse_tables(analyze_result, table_offsets):
    page_content = {}

    for table in analyze_result.tables:
        table_data = []
        for region in table.bounding_regions:
            page_number = region.page_number
            for cell in table.cells:
                for span in cell.spans:
                    table_offsets.append(span.offset)
                table_data.append(f"Cell [{cell.row_index}, {cell.column_index}]: {cell.content}")

        if page_number not in page_content:
            page_content[page_number] = []

        page_content[page_number].append({
            "content_text": "\n".join(table_data)
        })

    return page_content
Combine PDF paragraph and table data
Chunking at the page level during pre-processing preserves context, which makes it easy to validate RAG retrieval manually. Later, you will see that this page-level chunking does not have a negative effect on the RAG output.
def combine_paragraphs_tables(filepath, paragraph_content, table_content):
    page_content_concatenated = {}
    structured_data = []

    # Combine paragraph and table content
    for p_number in set(paragraph_content.keys()).union(table_content.keys()):
        concatenated_text = ""

        if p_number in paragraph_content:
            for content in paragraph_content[p_number]:
                concatenated_text += content["content_text"] + "\n"

        if p_number in table_content:
            for content in table_content[p_number]:
                concatenated_text += content["content_text"] + "\n"

        page_content_concatenated[p_number] = concatenated_text.strip()

    # Append a single item per page to the structured_data list
    for p_number, concatenated_text in page_content_concatenated.items():
        structured_data.append({
            "page_number": p_number,
            "content_text": concatenated_text,
            "pdf_file": os.path.basename(filepath)
        })

    return structured_data
Bring it all together
Open each PDF in the ./pdf folder, parse the text and table data and save the result in a JSON file that has entries for page_number, content_text and pdf_file. The content_text field represents the page paragraphs and table data for each page.
pdf_files = [
    os.path.join(input_folder_pdf, file)
    for file in os.listdir(input_folder_pdf)
    if file.endswith(".pdf")
]

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT,
    credential=AzureKeyCredential(AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY),
    connection_timeout=600
)

for filepath in tqdm(pdf_files, desc="Parsing PDF files"):
    with open(filepath, "rb") as file:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout",
            AnalyzeDocumentRequest(bytes_source=file.read())
        )

    analyze_result: AnalyzeResult = poller.result()

    paragraph_content, table_offsets = parse_paragraphs(analyze_result)
    table_content = parse_tables(analyze_result, table_offsets)
    structured_data = combine_paragraphs_tables(filepath, paragraph_content, table_content)

    # Convert the structured data to JSON format
    json_output = json.dumps(structured_data, indent=4)

    # Get the filename without the ".pdf" extension
    filename_without_ext = os.path.splitext(os.path.basename(filepath))[0]

    # Write the JSON output to a file
    output_json_file = f"{output_folder_pdf}/{filename_without_ext}.json"

    with open(output_json_file, "w") as json_file:
        json_file.write(json_output)
Load data into Elastic Cloud Serverless
The code blocks below do the following:
- Set imports for the Elasticsearch client and environment variables
- Create index in Elastic Cloud Serverless
- Load the JSON files from the ./json directory into the pdf-chat index
Set imports for the Elasticsearch client and environment variables
The most important import is Elasticsearch. This class is responsible for connecting to Elastic Cloud Serverless to create and populate the pdf-chat index.
import json
from dotenv import load_dotenv
from elasticsearch import Elasticsearch
from tqdm import tqdm
import os

load_dotenv()

ES_URL = os.getenv('ES_URL')
ES_API_KEY = os.getenv('ES_API_KEY')

es = Elasticsearch(hosts=ES_URL, api_key=ES_API_KEY, request_timeout=300)
Create index in Elastic Cloud Serverless
This code block creates an index named "pdf-chat" that has the following mappings:
- page_content - For testing RAG using full-text search
- page_content_sparse - For testing RAG using sparse vectors
- page_content_dense - For testing RAG using dense vectors
- page_number - Useful for constructing citations
- pdf_file - Useful for constructing citations
Notice the use of copy_to and semantic_text. The copy_to utility copies page_content to two semantic text fields. Each semantic text field maps to an ML inference endpoint, one for the sparse vector and one for the dense vector. Elastic-powered ML inference will auto-chunk each page into 250-token chunks with a 100-token overlap.
index_name= "pdf-chat"index_body = { "mappings": { "properties": {
"page_content": {"type": "text", "copy_to": ["page_content_sparse", "page_content_dense"]}, "page_content_sparse": {"type": "semantic_text", "inference_id": ".elser-2-elasticsearch"}, "page_content_dense": {"type": "semantic_text", "inference_id": ".multilingual-e5-small-elasticsearch"}, "page_number": {"type": "text"}, "pdf_file": { "type": "text", "fields": {"keyword": {"type": "keyword"}} } } }}
if es.indices.exists(index=index_name): es.indices.delete(index=index_name) print(f"Index '{index_name}' deleted successfully.")
response = es.indices.create(index=index_name, body=index_body)if 'acknowledged' in response and response['acknowledged']: print(f"Index '{index_name}' created successfully.")elif 'error' in response: print(f"Failed to create: '{index_name}'") print(f"Error: {response['error']['reason']}")else: print(f"Index '{index_name}' already exists.")
Load the JSON files from the ./json directory into the pdf-chat index
This process will take several minutes to run because we are:
- Loading 402 pages of PDF data
- Creating sparse text embeddings for each page_content chunk
- Creating dense text embeddings for each page_content chunk
files = os.listdir(output_folder_pdf)

with tqdm(total=len(files), desc="Indexing PDF docs") as pbar_files:
    for file in files:
        with open(output_folder_pdf + "/" + file) as f:
            data = json.loads(f.read())

        with tqdm(total=len(data), desc=f"Processing {file}") as pbar_pages:
            for page in data:
                doc = {
                    "page_content": page['content_text'],
                    "page_number": page['page_number'],
                    "pdf_file": page['pdf_file']
                }
                id = f"{page['pdf_file']}_{page['page_number']}"
                es.index(index=index_name, id=id, body=json.dumps(doc))
                pbar_pages.update(1)

        pbar_files.update(1)
There is one last code trick to call out: we set the Elasticsearch document ID using the naming convention FILENAME_PAGENUMBER. This makes it easy to see the PDF file and page number associated with each citation in Playground.
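For example, once indexing finishes, a single page can be fetched directly by that ID. A quick sketch, reusing the es client and index_name from above (the file name and page number in the ID are placeholder values):

# Fetch one page by its FILENAME_PAGENUMBER document ID (placeholder values).
doc = es.get(index=index_name, id="elastic-10Q-Q2-2025.pdf_10")
print(doc["_source"]["pdf_file"], doc["_source"]["page_number"])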
Elastic Cloud Serverless
Elastic Cloud Serverless is an excellent choice for prototyping a new Retrieval-Augmented Generation (RAG) system because it offers fully managed, scalable infrastructure without the complexity of manual cluster management. It supports both sparse and dense vector search out of the box, allowing you to experiment with different retrieval strategies efficiently. With built-in semantic text embedding, relevance ranking, and hybrid search capabilities, Elastic Cloud Serverless accelerates iteration cycles for search-powered applications.
With the help of Azure AI Document Intelligence and a little Python code, we are ready to see if we can get the LLM to answer questions grounded in truth. Let’s open Playground and conduct some manual A/B testing using different query strategies.
Full-text search

This query will return the top ten pages of content that get a full-text search hit.
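This is not the exact query Playground generates, but a minimal sketch of the same strategy with the Python Elasticsearch client (reusing the es client and index_name from the earlier code blocks) looks roughly like this:

# Full-text (lexical) search sketch -- assumes the `es` client and `index_name`
# defined earlier; Playground's generated query may differ in detail.
question = "Compare/contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?"

response = es.search(
    index=index_name,
    body={
        "query": {"match": {"page_content": question}},
        "size": 10,  # top ten pages of content
        "_source": ["pdf_file", "page_number"]
    }
)

for hit in response["hits"]["hits"]:
    print(hit["_source"]["pdf_file"], hit["_source"]["page_number"], hit["_score"])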

Full-text search came close, but it was only able to provide the right answer for three out of four quarters. This is understandable because we are stuffing the LLM context with ten full pages of data, and we are not leveraging semantic search.
Sparse vector search

This query will return the top two semantic text fragments from pages that match our query using powerful sparse vector search.
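A hedged sketch of this strategy with the Python client uses a semantic query against the ELSER-backed semantic_text field and reuses the question string from the full-text sketch above. Playground surfaces the best-matching chunks from this field for the LLM context; the sketch below simply returns the top two matching pages.

# Sparse vector (ELSER) sketch -- the semantic query runs inference through the
# .elser-2-elasticsearch endpoint configured on page_content_sparse.
response = es.search(
    index=index_name,
    body={
        "query": {
            "semantic": {
                "field": "page_content_sparse",
                "query": question
            }
        },
        "size": 2
    }
)

for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])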

Sparse vector search powered by Elastic’s ELSER does a really good job retrieving table data from all four PDF files. We can easily double check the answers by opening the PDF page number associated with each citation.
Dense vector search

Elastic also provides an excellent dense vector option for semantic text (E5). E5 is good for multilingual data and has lower inference latency for high queries-per-second use cases. This query will return the top two semantic text fragments that match our user input. The results are the same as with sparse search, but notice how similar both queries are; the only difference is the field name.
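For reference, a sketch of the dense variant is identical to the sparse sketch above except for the field name:

# Dense vector (E5) sketch -- same query shape, different semantic_text field.
response = es.search(
    index=index_name,
    body={
        "query": {
            "semantic": {
                "field": "page_content_dense",
                "query": question
            }
        },
        "size": 2
    }
)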
Hybrid search
ELSER is so good for this use case that we do not need hybrid search. But if we wanted to, we could combine sparse and dense vector search into a single query and then rerank the results using RRF (Reciprocal Rank Fusion).
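A hedged sketch of what that might look like, fusing the two semantic queries with an RRF retriever (retriever syntax from the Elasticsearch retrievers API; Playground may generate something different):

# Hybrid search sketch -- fuses sparse (ELSER) and dense (E5) retrievers with
# Reciprocal Rank Fusion; reuses `es`, `index_name` and `question` from above.
response = es.search(
    index=index_name,
    body={
        "retriever": {
            "rrf": {
                "retrievers": [
                    {"standard": {"query": {"semantic": {"field": "page_content_sparse", "query": question}}}},
                    {"standard": {"query": {"semantic": {"field": "page_content_dense", "query": question}}}}
                ]
            }
        },
        "size": 2
    }
)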

So what did we learn?
Azure AI Document Intelligence
- Is very capable of parsing both text and table data in PDF files.
- Integrates well with the Elasticsearch Python client.
Elastic Cloud Serverless
- Has built-in ML inference for sparse and dense vector embeddings at ingest and query time.
- Has powerful RAG A/B test tooling that can be used to identify the best retrieval technique for a specific use case.
There are other techniques and technologies that can be used to parse PDF files. If your organization is all-in on Azure, this approach can deliver an excellent RAG system.