Automating Document Processing with Azure Document Intelligence
Processing hundreds of multi-page PDF forms manually is exactly the kind of repetitive work that AI should handle. I built an automated document processing pipeline on Azure that splits PDFs, extracts structured data using AI, and stores results — all triggered by a simple file upload.
The Problem
Imagine receiving a 200-page PDF containing 100 two-page forms. Each form has the same fields — name, date, checkboxes, handwritten notes — but the data is different on every form. Manually entering this data into a database would take days and would be error-prone.
This is a common pattern in government agencies, healthcare organizations, and any institution that still deals with paper forms. The solution needs to be:
- Automated: Drop a PDF, get structured data
- Accurate: AI extraction with confidence scores
- Scalable: Handle bursts of documents without provisioning servers
- Auditable: Track every step of processing for compliance
Pipeline Architecture
The pipeline uses serverless Azure services orchestrated by Synapse Analytics:
| Component | Technology | Role |
|---|---|---|
| Ingestion | Azure Blob Storage | PDFs uploaded to incoming/ folder trigger the pipeline |
| Orchestration | Azure Synapse Analytics | Coordinates the multi-step pipeline |
| Processing | Azure Functions (Python) | PDF splitting, field extraction, data transformation |
| AI Extraction | Azure Document Intelligence | Custom and prebuilt models for form field extraction |
| Storage | Azure Cosmos DB | Extracted data with links to source PDFs |
| Monitoring | Application Insights | End-to-end pipeline telemetry |
| Infrastructure | Azure Bicep | Everything defined as code |
Processing Flow
1. PDF uploaded to Blob Storage (incoming/)
│
2. Synapse pipeline triggered (blob event)
│
3. Azure Function: Split PDF
│ ├── Detects page boundaries
│ ├── Splits into 2-page form chunks
│ └── Saves splits to _splits/ folder
│
4. Azure Function: Extract Fields (parallel)
│ ├── Sends each split to Document Intelligence
│ ├── Custom model extracts form-specific fields
│ ├── Prebuilt model handles common patterns
│ └── Returns structured JSON with confidence scores
│
5. Azure Function: Transform & Store
│ ├── Validates extracted data
│ ├── Maps fields to schema
│ └── Writes to Cosmos DB
│
6. Original PDF → processed/ (archive)
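The splitting step (step 3) is mostly mechanical once page boundaries are known. Here is a minimal sketch of the chunking part, assuming the pypdf library and a fixed two-page form length; the function name and constant are illustrative, and the real step also detects boundaries rather than assuming them:
# Minimal split sketch (assumes pypdf and fixed two-page forms)
import io
from pypdf import PdfReader, PdfWriter

PAGES_PER_FORM = 2  # assumed form length

def split_pdf(pdf_bytes: bytes) -> list[bytes]:
    """Split a combined PDF into per-form chunks of PAGES_PER_FORM pages."""
    reader = PdfReader(io.BytesIO(pdf_bytes))
    chunks = []
    for start in range(0, len(reader.pages), PAGES_PER_FORM):
        writer = PdfWriter()
        for i in range(start, min(start + PAGES_PER_FORM, len(reader.pages))):
            writer.add_page(reader.pages[i])
        buffer = io.BytesIO()
        writer.write(buffer)
        chunks.append(buffer.getvalue())
    return chunks
Each chunk can then be written to the _splits/ folder and processed independently, which is what makes step 4 easy to parallelize.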
Azure Document Intelligence Deep Dive
Document Intelligence (formerly Form Recognizer) is the AI engine at the core of the pipeline. It supports two approaches:
Prebuilt Models
Microsoft provides pre-trained models for common document types — invoices, receipts, ID documents, tax forms. No training required:
# Using a prebuilt model
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.identity import DefaultAzureCredential

client = DocumentAnalysisClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential()
)

# Analyze an invoice
poller = client.begin_analyze_document(
    "prebuilt-invoice",
    document=pdf_bytes
)
result = poller.result()

for document in result.documents:
    vendor = document.fields.get("VendorName")
    total = document.fields.get("InvoiceTotal")
    print(f"Vendor: {vendor.value} (confidence: {vendor.confidence})")
    print(f"Total: {total.value} (confidence: {total.confidence})")
Custom Models
For domain-specific forms, you train custom models using labeled samples. The pipeline supports this workflow (a usage sketch follows the list):
- Label training data using Document Intelligence Studio (web UI)
- Train the model with as few as 5 labeled samples
- Deploy — the model gets an ID that the pipeline references
- Extract — custom fields extracted with per-field confidence scores
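Calling a deployed custom model looks just like the prebuilt case, except the pipeline passes the custom model's ID. A sketch, assuming the same client as above and a hypothetical model ID and input variable:
# Analyze a split form with a custom model (model ID is hypothetical)
poller = client.begin_analyze_document(
    "form-model-v2",
    document=form_bytes
)
result = poller.result()

for doc in result.documents:
    for name, field in doc.fields.items():
        print(f"{name}: {field.value} (confidence: {field.confidence})")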
Error Handling & Dead Letter Queue
Not every document processes cleanly. Handwritten text, poor scan quality, or unexpected layouts can cause extraction failures. The pipeline handles this gracefully:
# Retry logic with dead letter
import asyncio

MAX_RETRIES = 3

async def process_document(doc_path: str, retry_count: int = 0):
    try:
        result = await extract_fields(doc_path)
        await store_result(result)
    except ExtractionError as e:
        if retry_count < MAX_RETRIES:
            # Retry with exponential backoff
            await asyncio.sleep(2 ** retry_count)
            await process_document(doc_path, retry_count + 1)
        else:
            # Move to dead letter queue
            await move_to_dead_letter(doc_path, str(e))
            await send_alert(f"Document failed after {MAX_RETRIES} retries")
Pipeline Monitoring
Application Insights provides end-to-end visibility with custom metrics:
// Forms processed per hour
customMetrics
| where name == "forms_processed"
| summarize count() by bin(timestamp, 1h), tostring(customDimensions.model_id)
// Average processing time by model
customMetrics
| where name == "processing_duration_ms"
| summarize avg(value) by tostring(customDimensions.model_id)
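These queries assume the functions emit custom metrics with a model_id dimension. One way to do that from Python is the Azure Monitor OpenTelemetry distro; the metric and attribute names below match the queries, while the meter name and surrounding variables are illustrative assumptions:
# Emitting the custom metrics the queries above rely on (sketch)
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

configure_azure_monitor()  # reads APPLICATIONINSIGHTS_CONNECTION_STRING

meter = metrics.get_meter("doc_pipeline")
forms_processed = meter.create_counter("forms_processed")
processing_duration = meter.create_histogram("processing_duration_ms")

# After each successful extraction:
forms_processed.add(1, {"model_id": model_id})
processing_duration.record(elapsed_ms, {"model_id": model_id})
OpenTelemetry attributes surface as customDimensions in Application Insights, which is why the KQL groups by customDimensions.model_id.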
Infrastructure as Code
Every Azure resource is defined in Bicep templates. The entire pipeline can be deployed to a new environment in minutes:
# Deploy the complete pipeline
az deployment group create \
--resource-group rg-doc-pipeline \
--template-file infra/main.bicep \
--parameters environment=prod
Interactive Notebooks
The repo includes Jupyter notebooks for exploring and testing the pipeline components independently — great for prototyping extraction logic before deploying to production.
The full pipeline with documentation, Bicep templates, and sample notebooks: fgarofalo56/azure-doc-intelligence-pipeline