Parse PDF text and table data with Azure AI Document Intelligence

Learn how to parse PDF documents that contain text and table data with Azure AI Document Intelligence.

Azure AI Document Intelligence is a powerful tool for extracting structured data from PDFs. It can be used to extract text and table data effectively. Once the data is extracted, it can be indexed into Elastic Cloud Serverless to power RAG (Retrieval Augmented Generation).

In this blog, we will demonstrate how powerful Azure AI Document Intelligence is by ingesting four recent Elastic N.V. quarterly reports. The PDFs range from 43 to 196 pages in length and each PDF contains both text and table data. We will test the retrieval of table data with the following prompt: Compare/contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?

This prompt is tricky because it requires context from four different PDFs that represent this information in tabular format.

Let’s walk through an end-to-end reference that consists of two main parts:

Python notebook

  1. Downloads four quarters’ worth of PDF 10-Q filings for Elastic N.V.
  2. Uses Azure AI Document Intelligence to parse the text and table data from each PDF file
  3. Outputs the text and table data into a JSON file
  4. Ingests the JSON files into Elastic Cloud Serverless

Elastic Cloud Serverless

  1. Creates vector embeddings for PDF text+table data
  2. Powers vector search database queries for RAG
  3. Provides a pre-configured OpenAI connector for LLM integration
  4. Provides an A/B test interface for chatting with the 10-Q filings

Prerequisites

The code blocks in this notebook require API keys for Azure AI Document Intelligence and Elasticsearch. The best starting point for Azure AI Document Intelligence is to create a Document Intelligence resource. For Elastic Cloud Serverless, refer to the get started guide. You will need Python 3.9+ to run these code blocks.

Create an .env file

Place secrets for Azure AI Document Intelligence and Elastic Cloud Serverless in a .env file.

AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT=YOUR_AZURE_RESOURCE_ENDPOINT
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY=YOUR_AZURE_RESOURCE_API_KEY

ES_URL=YOUR_ES_URL
ES_API_KEY=YOUR_ES_API_KEY

Install Python packages

!pip install elasticsearch python-dotenv tqdm azure-core azure-ai-documentintelligence requests httpx

Create input and output folders

import os

input_folder_pdf = "./pdf"
output_folder_pdf = "./json"

folders = [input_folder_pdf, output_folder_pdf]

def create_folders_if_not_exist(folders):
    for folder in folders:
        os.makedirs(folder, exist_ok=True)
        print(f"Folder '{folder}' created or already exists.")

create_folders_if_not_exist(folders)

Download PDF files

Download four recent Elastic 10-Q quarterly reports. If you already have PDF files, feel free to place them in the ‘./pdf’ folder.

import os
import requests

def download_pdf(url, directory='./pdf', filename=None):
    if not os.path.exists(directory):
        os.makedirs(directory)
    response = requests.get(url)
    if response.status_code == 200:
        if filename is None:
            filename = url.split('/')[-1]
        filepath = os.path.join(directory, filename)
        with open(filepath, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded {filepath}")
    else:
        print(f"Failed to download file from {url}")

print("Downloading 4 recent 10-Q reports for Elastic NV.")
base_url = 'https://s201.q4cdn.com/217177842/files/doc_financials'
download_pdf(f'{base_url}/2025/q2/e5aa7a0a-6f56-468d-a5bd-661792773d71.pdf', filename='elastic-10Q-Q2-2025.pdf')
download_pdf(f'{base_url}/2025/q1/18656e06-8107-4423-8e2b-6f2945438053.pdf', filename='elastic-10Q-Q1-2025.pdf')
download_pdf(f'{base_url}/2024/q4/9949f03b-09fb-4941-b105-62a304dc1411.pdf', filename='elastic-10Q-Q4-2024.pdf')
download_pdf(f'{base_url}/2024/q3/7e60e3bd-ff50-4ae8-ab12-5b3ae19420e6.pdf', filename='elastic-10Q-Q3-2024.pdf')

Parse PDFs using Azure AI Document Intelligence

A lot is going on in the code blocks that parse the PDF files. Here’s a quick summary:

  1. Set Azure AI Document Intelligence imports and environment variables
  2. Parse PDF paragraphs using AnalyzeResult
  3. Parse PDF tables using AnalyzeResult
  4. Combine PDF paragraph and table data
  5. Bring it all together by running steps 1-4 for each PDF file and storing the result in JSON

Set Azure AI Document Intelligence imports and environment variables

The most important import is AnalyzeResult. This class represents the outcome of a document analysis and contains details about the document. The details we care about are pages, paragraphs and tables.

import os
import json

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from dotenv import load_dotenv
from tqdm import tqdm

load_dotenv()

AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT = os.getenv('AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT')
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY = os.getenv('AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY')

Parse PDF paragraphs using AnalyzeResult

Extract the paragraph text from each page. Do not extract table data.

def parse_paragraphs(analyze_result):
    table_offsets = []
    page_content = {}

    for paragraph in analyze_result.paragraphs:
        for span in paragraph.spans:
            if span.offset not in table_offsets:
                for region in paragraph.bounding_regions:
                    page_number = region.page_number
                    if page_number not in page_content:
                        page_content[page_number] = []
                    page_content[page_number].append({
                        "content_text": paragraph.content
                    })

    return page_content, table_offsets

Parse the PDF tables using AnalyzeResult

Extract the table content from each page. Do not extract paragraph text. The most interesting side effect of this technique is that there is no need to transform table data. LLMs know how to read text that looks like: “Cell [0, 1]: table data…”.

def parse_tables(analyze_result, table_offsets):
    page_content = {}

    for table in analyze_result.tables:
        table_data = []
        for region in table.bounding_regions:
            page_number = region.page_number
            for cell in table.cells:
                for span in cell.spans:
                    table_offsets.append(span.offset)
                table_data.append(f"Cell [{cell.row_index}, {cell.column_index}]: {cell.content}")

        if page_number not in page_content:
            page_content[page_number] = []
        page_content[page_number].append({
            "content_text": "\n".join(table_data)
        })

    return page_content

Combine PDF paragraph and table data

Chunking at the page level during pre-processing preserves context, so we can easily validate RAG retrieval manually. Later, you will see that this page-level chunking does not have a negative effect on the RAG output.

def combine_paragraphs_tables(filepath, paragraph_content, table_content):
    page_content_concatenated = {}
    structured_data = []

    # Combine paragraph and table content
    for p_number in set(paragraph_content.keys()).union(table_content.keys()):
        concatenated_text = ""

        if p_number in paragraph_content:
            for content in paragraph_content[p_number]:
                concatenated_text += content["content_text"] + "\n"

        if p_number in table_content:
            for content in table_content[p_number]:
                concatenated_text += content["content_text"] + "\n"

        page_content_concatenated[p_number] = concatenated_text.strip()

    # Append a single item per page to the structured_data list
    for p_number, concatenated_text in page_content_concatenated.items():
        structured_data.append({
            "page_number": p_number,
            "content_text": concatenated_text,
            "pdf_file": os.path.basename(filepath)
        })

    return structured_data

Bring it all together

Open each PDF in the ./pdf folder, parse the text and table data, and save the result in a JSON file with entries for page_number, content_text and pdf_file. The content_text field holds the combined paragraph and table data for each page.

pdf_files = [
    os.path.join(input_folder_pdf, file)
    for file in os.listdir(input_folder_pdf)
    if file.endswith(".pdf")
]

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT,
    credential=AzureKeyCredential(AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY),
    connection_timeout=600
)

for filepath in tqdm(pdf_files, desc="Parsing PDF files"):
    with open(filepath, "rb") as file:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout",
            AnalyzeDocumentRequest(bytes_source=file.read())
        )
        analyze_result: AnalyzeResult = poller.result()

        paragraph_content, table_offsets = parse_paragraphs(analyze_result)
        table_content = parse_tables(analyze_result, table_offsets)
        structured_data = combine_paragraphs_tables(filepath, paragraph_content, table_content)

        # Convert the structured data to JSON format
        json_output = json.dumps(structured_data, indent=4)

        # Get the filename without the ".pdf" extension
        filename_without_ext = os.path.splitext(os.path.basename(filepath))[0]

        # Write the JSON output to a file
        output_json_file = f"{output_folder_pdf}/{filename_without_ext}.json"
        with open(output_json_file, "w") as json_file:
            json_file.write(json_output)
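
As a quick sanity check, here is a small snippet (assuming the cell above has already run and produced ./json/elastic-10Q-Q2-2025.json) that confirms the shape of the output: each JSON file contains one entry per page with page_number, content_text and pdf_file keys.

# Sanity check: inspect one of the generated JSON files.
with open(f"{output_folder_pdf}/elastic-10Q-Q2-2025.json") as f:
    sample = json.load(f)

print(len(sample))        # number of pages parsed from the PDF
print(sample[0].keys())   # dict_keys(['page_number', 'content_text', 'pdf_file'])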

Load data into Elastic Cloud Serverless

The code blocks below handle:

  1. Set imports for the Elasticsearch client and environment variables
  2. Create index in Elastic Cloud Serverless
  3. Load the JSON files from ./json directory into the pdf-chat index

Set imports for the Elasticsearch client and environment variables

The most important import is Elasticsearch. This class is responsible for connecting to Elastic Cloud Serverless to create and populate the pdf-chat index.

import os
import json

from dotenv import load_dotenv
from elasticsearch import Elasticsearch
from tqdm import tqdm

load_dotenv()

ES_URL = os.getenv('ES_URL')
ES_API_KEY = os.getenv('ES_API_KEY')

es = Elasticsearch(hosts=ES_URL, api_key=ES_API_KEY, request_timeout=300)

Create index in Elastic Cloud Serverless

This code block creates an index named “pdf-chat” that has the following mappings:

  • page_content - For testing RAG using full-text search
  • page_content_sparse - For testing RAG using sparse vectors
  • page_content_dense - For testing RAG using dense vectors
  • page_number - Useful for constructing citations
  • pdf_file - Useful for constructing citations

Notice the use of copy_to and semantic_text. The copy_to utility copies page_content to the two semantic text fields. Each semantic text field maps to an ML inference endpoint, one for the sparse vector and one for the dense vector. Elastic-powered ML inference will auto-chunk each page into 250 token chunks with a 100 token overlap.

index_name = "pdf-chat"

index_body = {
    "mappings": {
        "properties": {
            "page_content": {
                "type": "text",
                "copy_to": ["page_content_sparse", "page_content_dense"]
            },
            "page_content_sparse": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elasticsearch"
            },
            "page_content_dense": {
                "type": "semantic_text",
                "inference_id": ".multilingual-e5-small-elasticsearch"
            },
            "page_number": {"type": "text"},
            "pdf_file": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}}
            }
        }
    }
}

if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)
    print(f"Index '{index_name}' deleted successfully.")

response = es.indices.create(index=index_name, body=index_body)

if 'acknowledged' in response and response['acknowledged']:
    print(f"Index '{index_name}' created successfully.")
elif 'error' in response:
    print(f"Failed to create: '{index_name}'")
    print(f"Error: {response['error']['reason']}")
else:
    print(f"Index '{index_name}' already exists.")

Load the JSON files from ./json directory into the pdf-chat index

This process will take several minutes to run because we are:

  1. Loading 402 pages of PDF data
  2. Creating sparse text embeddings for each page_content chunk
  3. Creating dense text embeddings for each page_content chunk

files = os.listdir(output_folder_pdf)

with tqdm(total=len(files), desc="Indexing PDF docs") as pbar_files:
    for file in files:
        with open(output_folder_pdf + "/" + file) as f:
            data = json.loads(f.read())

        with tqdm(total=len(data), desc=f"Processing {file}") as pbar_pages:
            for page in data:
                doc = {
                    "page_content": page['content_text'],
                    "page_number": page['page_number'],
                    "pdf_file": page['pdf_file']
                }
                id = f"{page['pdf_file']}_{page['page_number']}"
                es.index(index=index_name, id=id, body=json.dumps(doc))
                pbar_pages.update(1)

        pbar_files.update(1)

There is one last code trick to call out. We set the Elasticsearch document ID using the naming convention FILENAME_PAGENUMBER. This makes it easy to see the PDF file and page number associated with each citation in Playground.

Elastic Cloud Serverless

Elastic Cloud Serverless is an excellent choice for prototyping a new Retrieval-Augmented Generation (RAG) system because it offers fully managed, scalable infrastructure without the complexity of manual cluster management. It supports both sparse and dense vector search out of the box, allowing you to experiment with different retrieval strategies efficiently. With built-in semantic text embedding, relevance ranking, and hybrid search capabilities, Elastic Cloud Serverless accelerates iteration cycles for search-powered applications.

With the help of Azure AI Document Intelligence and a little Python code, we are ready to see if we can get the LLM to answer questions grounded in truth. Let’s open Playground and conduct some manual A/B testing using different query strategies.

This query will return the top ten pages of content that get a full-text search hit.
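
For reference, here is a minimal sketch of that full-text query, issued through the Python client used earlier rather than through the Playground UI (the exact query Playground generates may differ):

# Full-text search sketch: top ten pages whose page_content matches the question.
user_question = "Compare/contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?"

response = es.search(
    index=index_name,
    query={"match": {"page_content": user_question}},
    size=10
)

for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])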

Full-text search came close, but it was only able to provide the right answer for three out of four quarters. This is understandable because we are stuffing the LLM context with ten full pages of data, and we are not leveraging semantic search.

This query will return the top two semantic text fragments from pages that match our query using powerful sparse vector search.
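
Here is a minimal sketch of that sparse vector query, using the semantic query type against the ELSER-backed page_content_sparse field (Playground additionally pulls the best-matching fragments from each hit into the LLM context):

# Sparse vector (ELSER) sketch: top two matches from the semantic_text field.
# Reuses user_question from the full-text example above.
response = es.search(
    index=index_name,
    query={
        "semantic": {
            "field": "page_content_sparse",
            "query": user_question
        }
    },
    size=2
)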

Sparse vector search powered by Elastic’s ELSER does a really good job retrieving table data from all four PDF files. We can easily double check the answers by opening the PDF page number associated with each citation.

Elastic also provides an excellent dense vector option for semantic text (E5). E5 is good for multilingual data and has lower inference latency for high queries-per-second use cases. This query will return the top two semantic text fragments that match our user input. The results are the same as with sparse search, but notice how similar both queries are. The only difference is the “field” name.
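
A sketch of the dense vector variant, identical except for the field name:

# Dense vector (E5) sketch: same query, different semantic_text field.
response = es.search(
    index=index_name,
    query={
        "semantic": {
            "field": "page_content_dense",
            "query": user_question
        }
    },
    size=2
)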

ELSER is so good for this use case that we do not need hybrid search. But if we wanted to, we could combine sparse and dense vector search into a single query and rerank the results using RRF (Reciprocal Rank Fusion).
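
A hedged sketch of what such a hybrid query could look like, combining both semantic_text fields with an RRF retriever (assumes a recent Elasticsearch Python client that supports the retriever parameter):

# Hybrid sketch: sparse (ELSER) + dense (E5) retrieval, reranked with Reciprocal Rank Fusion.
response = es.search(
    index=index_name,
    retriever={
        "rrf": {
            "retrievers": [
                {"standard": {"query": {"semantic": {"field": "page_content_sparse", "query": user_question}}}},
                {"standard": {"query": {"semantic": {"field": "page_content_dense", "query": user_question}}}}
            ]
        }
    },
    size=2
)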

So what did we learn?

Azure AI Document Intelligence

  • Does an excellent job of parsing both the text and the table data in lengthy PDF files.
  • Returns table cells as plain text (e.g. “Cell [0, 1]: …”) that an LLM can read without further transformation.

Elastic Cloud Serverless

  • Has built-in ML inference for sparse and dense vector embeddings at ingest and query time.
  • Has powerful RAG A/B test tooling that can be used to identify the best retrieval technique for a specific use case.

There are other techniques and technologies that can be used to parse PDF files. If your organization is all-in on Azure, this approach can deliver an excellent RAG system.

