Automating Document Processing with Azure Document Intelligence

Processing hundreds of multi-page PDF forms manually is exactly the kind of repetitive work that AI should handle. I built an automated document processing pipeline on Azure that splits PDFs, extracts structured data using AI, and stores results — all triggered by a simple file upload.

The Problem

Imagine receiving a 200-page PDF containing 100 two-page forms. Each form has the same fields — name, date, checkboxes, handwritten notes — but the values differ on every form. Manually entering this data into a database would take days and be error-prone.

This is a common pattern in government agencies, healthcare organizations, and any institution that still deals with paper forms. The solution needs to be:

  • Automated: Drop a PDF, get structured data
  • Accurate: AI extraction with confidence scores
  • Scalable: Handle bursts of documents without provisioning servers
  • Auditable: Track every step of processing for compliance

Pipeline Architecture

The pipeline uses serverless Azure services orchestrated by Synapse Analytics:

Component      | Technology                   | Role
-------------- | ---------------------------- | ------------------------------------------------------
Ingestion      | Azure Blob Storage           | PDFs uploaded to incoming/ folder trigger the pipeline
Orchestration  | Azure Synapse Analytics      | Coordinates the multi-step pipeline
Processing     | Azure Functions (Python)     | PDF splitting, field extraction, data transformation
AI Extraction  | Azure Document Intelligence  | Custom and prebuilt models for form field extraction
Storage        | Azure Cosmos DB              | Extracted data with links to source PDFs
Monitoring     | Application Insights         | End-to-end pipeline telemetry
Infrastructure | Azure Bicep                  | Everything defined as code

Processing Flow

1. PDF uploaded to Blob Storage (incoming/)
   │
2. Synapse pipeline triggered (blob event)
   │
3. Azure Function: Split PDF
   │  ├── Detects page boundaries
   │  ├── Splits into 2-page form chunks
   │  └── Saves splits to _splits/ folder
   │
4. Azure Function: Extract Fields (parallel)
   │  ├── Sends each split to Document Intelligence
   │  ├── Custom model extracts form-specific fields
   │  ├── Prebuilt model handles common patterns
   │  └── Returns structured JSON with confidence scores
   │
5. Azure Function: Transform & Store
   │  ├── Validates extracted data
   │  ├── Maps fields to schema
   │  └── Writes to Cosmos DB
   │
6. Original PDF → processed/ (archive)
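
Step 3 is worth unpacking. Because every form in this scenario is exactly two pages, splitting reduces to slicing the PDF at fixed page offsets. Below is a minimal sketch of that logic, assuming pypdf as the library (the post doesn't name one, and the real function's boundary detection may be more involved):

# Fixed-offset split into two-page form chunks (pypdf is an assumed
# library choice; real boundary detection may be smarter)
from io import BytesIO
from pypdf import PdfReader, PdfWriter

PAGES_PER_FORM = 2

def split_pdf(pdf_bytes: bytes) -> list[bytes]:
    reader = PdfReader(BytesIO(pdf_bytes))
    chunks = []
    for start in range(0, len(reader.pages), PAGES_PER_FORM):
        writer = PdfWriter()
        # Copy the pages belonging to one form into a fresh document
        for i in range(start, min(start + PAGES_PER_FORM, len(reader.pages))):
            writer.add_page(reader.pages[i])
        buffer = BytesIO()
        writer.write(buffer)
        chunks.append(buffer.getvalue())
    return chunks

Each returned chunk can then be written to the _splits/ folder and fanned out to the extraction step.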

Azure Document Intelligence Deep Dive

Document Intelligence (formerly Form Recognizer) is the AI engine at the core of the pipeline. It supports two approaches:

Prebuilt Models

Microsoft provides pre-trained models for common document types — invoices, receipts, ID documents, tax forms. No training required:

# Using a prebuilt model
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.identity import DefaultAzureCredential

client = DocumentAnalysisClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential()
)

# Analyze an invoice
poller = client.begin_analyze_document(
    "prebuilt-invoice",
    document=pdf_bytes
)
result = poller.result()

for document in result.documents:
    # fields.get returns None when a field wasn't detected
    vendor = document.fields.get("VendorName")
    total = document.fields.get("InvoiceTotal")
    if vendor is not None:
        print(f"Vendor: {vendor.value} (confidence: {vendor.confidence})")
    if total is not None:
        print(f"Total: {total.value} (confidence: {total.confidence})")

Custom Models

For domain-specific forms, you train custom models using labeled samples. The pipeline supports this workflow:

  1. Label training data using Document Intelligence Studio (web UI)
  2. Train the model with as few as 5 labeled samples
  3. Deploy — the model gets an ID that the pipeline references
  4. Extract — custom fields extracted with per-field confidence scores
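
Once deployed, a custom model is called exactly like a prebuilt one; only the model ID changes. A minimal sketch, reusing the client from the prebuilt example (the model ID is hypothetical):

# Using a custom model by ID (the ID is hypothetical; client is the
# DocumentAnalysisClient from the prebuilt example above)
poller = client.begin_analyze_document(
    "intake-form-v2",   # hypothetical custom model ID
    document=pdf_bytes
)
result = poller.result()

for document in result.documents:
    for name, field in document.fields.items():
        print(f"{name}: {field.value} (confidence: {field.confidence})")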

Error Handling & Dead Letter Queue

Not every document processes cleanly. Handwritten text, poor scan quality, or unexpected layouts can cause extraction failures. The pipeline handles this gracefully:

# Retry logic with exponential backoff and a dead-letter fallback
import asyncio

MAX_RETRIES = 3

async def process_document(doc_path: str, retry_count: int = 0):
    try:
        result = await extract_fields(doc_path)
        await store_result(result)
    except ExtractionError as e:
        if retry_count < MAX_RETRIES:
            # Retry with exponential backoff
            await asyncio.sleep(2 ** retry_count)
            await process_document(doc_path, retry_count + 1)
        else:
            # Move to dead letter queue for manual inspection
            await move_to_dead_letter(doc_path, str(e))
            await send_alert(f"{doc_path} failed after {MAX_RETRIES} retries")
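
The move_to_dead_letter helper above is left undefined. Here is one way it could look with the async azure-storage-blob client — the container name, deadletter/ prefix, and STORAGE_ACCOUNT_URL setting are all assumptions, not details from the post:

# One possible move_to_dead_letter (container name, deadletter/ prefix,
# and STORAGE_ACCOUNT_URL are assumptions, not from the original pipeline)
import os
from azure.identity.aio import DefaultAzureCredential
from azure.storage.blob.aio import BlobServiceClient

async def move_to_dead_letter(doc_path: str, error: str) -> None:
    account_url = os.environ["STORAGE_ACCOUNT_URL"]  # hypothetical app setting
    async with BlobServiceClient(account_url,
                                 credential=DefaultAzureCredential()) as service:
        container = service.get_container_client("documents")  # hypothetical
        source = container.get_blob_client(doc_path)
        target = container.get_blob_client(
            f"deadletter/{doc_path.rsplit('/', 1)[-1]}")

        # Re-upload with the failure reason attached as metadata,
        # then remove the original so it isn't retried
        downloader = await source.download_blob()
        data = await downloader.readall()
        await target.upload_blob(data, overwrite=True,
                                 metadata={"error": error[:4096]})
        await source.delete_blob()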

Pipeline Monitoring

Application Insights provides end-to-end visibility with custom metrics:

// Forms processed per hour
customMetrics
| where name == "forms_processed"
| summarize count() by bin(timestamp, 1h), tostring(customDimensions.model_id)

// Average processing time by model
customMetrics
| where name == "processing_duration_ms"
| summarize avg(value) by tostring(customDimensions.model_id)
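
The post doesn't show how these metrics are emitted from the Functions. One option is the applicationinsights Python SDK (the newer azure-monitor-opentelemetry distro is an alternative); a sketch, assuming the standard APPINSIGHTS_INSTRUMENTATIONKEY app setting:

# Emit the custom metrics queried above via the applicationinsights SDK
# (helper name and wiring are illustrative)
import os
from applicationinsights import TelemetryClient

tc = TelemetryClient(os.environ["APPINSIGHTS_INSTRUMENTATIONKEY"])

def record_form_processed(model_id: str, duration_ms: float) -> None:
    # properties become customDimensions in Application Insights
    tc.track_metric("forms_processed", 1,
                    properties={"model_id": model_id})
    tc.track_metric("processing_duration_ms", duration_ms,
                    properties={"model_id": model_id})
    tc.flush()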

Infrastructure as Code

Every Azure resource is defined in Bicep templates. The entire pipeline can be deployed to a new environment in minutes:

# Deploy the complete pipeline
az deployment group create \
  --resource-group rg-doc-pipeline \
  --template-file infra/main.bicep \
  --parameters environment=prod

Interactive Notebooks

The repo includes Jupyter notebooks for exploring and testing the pipeline components independently — great for prototyping extraction logic before deploying to production.

The full pipeline with documentation, Bicep templates, and sample notebooks: fgarofalo56/azure-doc-intelligence-pipeline
