Processing large multi-page documents with AI requires careful orchestration. When documents are hundreds of pages long, they can’t be processed in a single LLM prompt due to token limits. This post describes a real-world implementation that uses Temporal to orchestrate document processing workflows with DocRouter.AI, solving the specific problem of extracting patient information from surgery schedule documents.
The implementation is available at doc-router-temporal and processes surgery schedule documents containing hundreds of pages, extracting patient names, dates of birth, and medical insurance information.
The Problem and Solution
Surgery schedule documents can contain hundreds of pages with medical insurance cards, pre-operative documentation, anesthesia records, and other patient information. The challenge is that these documents are too large to process in a single LLM prompt due to token limits (typically 128K-200K tokens).
The solution requires:
- Chunking: Split the PDF into individual pages
- Classification: Identify page types (insurance card, pre-op, anesthesia records, etc.)
- Grouping: Group patient pages by patient (using name and DOB as keys)
- Extraction: Extract structured data from each patient's set of pages
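The four stages above compose into a straightforward pipeline. As a rough sketch (with hypothetical stub functions standing in for the real DocRouter.AI calls, which run as Temporal activities in the actual implementation):

```python
from collections import defaultdict

# Hypothetical stubs for the four stages; the real versions call
# DocRouter.AI and run inside Temporal activities.
def chunk(pdf_pages):
    """Split the document into individual pages (here: already a list)."""
    return list(pdf_pages)

def group_by_patient(pages):
    """Group classified pages by a (name, dob) key."""
    groups = defaultdict(list)
    for page in pages:
        groups[(page["name"], page["dob"])].append(page)
    return dict(groups)

def extract(patient_pages):
    """Extract structured data from one patient's set of pages."""
    return {"name": patient_pages[0]["name"],
            "dob": patient_pages[0]["dob"],
            "page_types": [p["type"] for p in patient_pages]}

def process(pdf_pages):
    pages = chunk(pdf_pages)
    groups = group_by_patient(pages)
    return [extract(group) for group in groups.values()]
```

The real workflow adds classification before grouping and fuzzy matching on the grouping key, but the data flow is the same.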
Why Temporal?
Temporal provides durable workflow orchestration that’s perfect for this use case. Unlike traditional approaches (queues, background jobs, or simple scripts), Temporal handles:
- Durable execution: Resumes from crashes during 200-page processing
- Parallel processing: Processes multiple pages simultaneously while maintaining order
- Error handling: Automatic retries for API rate limits and network issues
- State management: Tracks processed pages and identified patients
- Long-running workflows: Handles processes that take minutes to hours
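Temporal provides the retry behavior above by attaching a retry policy to each activity. Conceptually, the semantics are equivalent to this plain-Python sketch of retry with exponential backoff (function and parameter names are illustrative, not the Temporal API):

```python
import time

def call_with_retries(fn, max_attempts=5, initial_interval=1.0,
                      backoff=2.0, sleep=time.sleep):
    """Retry fn with exponential backoff -- roughly what a Temporal
    retry policy does for a failing activity (e.g. on API rate limits)."""
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the last error
            sleep(interval)
            interval *= backoff
```

The key difference in Temporal is that retries survive worker crashes: the attempt count lives in durable workflow history, not in process memory.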
The Workflow Implementation
The implementation uses a hierarchical workflow structure with two main workflows:
- Classify and Group PDF Pages (ClassifyAndGroupPDFPagesWorkflow): Chunks the PDF, classifies each page, and groups pages by patient
- Extract Insurance Information (ClassifyGroupAndExtractInsuranceWorkflow): Creates patient-specific PDFs and extracts insurance card data
Creating Schemas with Claude Agent
Before building the Temporal workflow, we created the extraction schemas and prompts using the Claude Agent for DocRouter.AI (an MCP server at doc-router/packages/typescript/mcp).
The Claude Agent allows Claude Code to create extraction schemas and prompts. For example, you can prompt: “Create a schema for extracting patient information from surgery schedule pages” and it will validate, create, and test the schema automatically.
For this implementation, we created:
- anesthesia_bundle_page_classifier: Classifies pages as surgery schedule, patient information, or other
- insurance_card: Extracts insurance card information from patient pages
Workflow Implementation
The main workflow (ClassifyGroupAndExtractInsuranceWorkflow) orchestrates the entire process. Complete implementation: workflows/classify_group_and_extract_insurance.py.
Step 1: Classify and Group Pages
The workflow calls ClassifyAndGroupPDFPagesWorkflow to:
- Chunk the PDF into individual pages
- Classify each page using DocRouter.AI
- Group pages by patient using name and DOB matching
The grouping logic (activities/group_classification_results.py) includes name normalization, DOB parsing, and fuzzy matching with Levenshtein distance to handle typos and variations.
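The exact matching rules live in the repository; a minimal sketch of the idea is to normalize names, then treat names within a small Levenshtein distance (with identical DOBs) as the same patient:

```python
def normalize_name(name):
    """Lowercase, strip punctuation, collapse whitespace."""
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def same_patient(name1, dob1, name2, dob2, max_distance=2):
    """Match on identical DOB plus near-identical normalized name."""
    return (dob1 == dob2 and
            levenshtein(normalize_name(name1), normalize_name(name2)) <= max_distance)
```

Requiring an exact DOB match keeps the fuzzy name comparison from merging distinct patients who happen to have similar names.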
Step 2: Extract Insurance Information
For each patient group, the workflow:
- Creates patient-specific PDFs with only that patient’s pages (activities/create_and_upload_patient_pdf.py)
- Uploads them to DocRouter.AI for insurance card extraction
- Polls for completion and retrieves results
To avoid passing large binary data through Temporal, PDFs are read from disk and uploaded directly to DocRouter.AI.
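The completion polling can be a bounded loop inside an activity. A sketch, where `get_status` is a hypothetical callable wrapping a DocRouter.AI status check (not the actual SDK API):

```python
import time

def poll_until_complete(get_status, poll_interval=5.0, timeout=600.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call get_status() until it reports 'completed', or raise on timeout.
    get_status is a hypothetical callable wrapping a DocRouter.AI status
    endpoint; in the real workflow this logic runs inside an activity."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == "completed":
            return status
        if clock() >= deadline:
            raise TimeoutError("extraction did not complete in time")
        sleep(poll_interval)
```

Running the loop in an activity (rather than the workflow) keeps each poll attempt retryable and avoids bloating the workflow's event history.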
Key Implementation Details
The workflow was developed in Cursor over 2 days with 3-4 iterations, adding chunking, classification, grouping, insurance extraction, and error handling.
Design Decisions
- Avoid large data transfer: PDFs are read from disk and uploaded directly to DocRouter.AI, not passed through Temporal
- Parallel processing: Multiple patients processed concurrently with status polling
- Error handling: Retry logic, graceful degradation, and timeout handling
- State management: Only document IDs and metadata flow through Temporal to keep history efficient
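The parallel-processing decision maps to a familiar fan-out/fan-in pattern. Temporal expresses it with concurrent child workflows or activities; a pure-asyncio analogue (with `extract_for_patient` as a hypothetical stand-in for the per-patient extraction) looks like:

```python
import asyncio

async def extract_for_patient(patient_id):
    """Hypothetical stand-in for the per-patient extraction step."""
    await asyncio.sleep(0)  # simulate an I/O-bound DocRouter.AI call
    return {"patient_id": patient_id, "status": "completed"}

async def extract_all(patient_ids):
    # Launch one task per patient and wait for all of them;
    # gather preserves input order in the results.
    return await asyncio.gather(*(extract_for_patient(p) for p in patient_ids))

results = asyncio.run(extract_all(["p1", "p2", "p3"]))
```

In Temporal the same shape gains durability: if the worker dies mid-fan-out, completed patients are not reprocessed on resume.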
Results
The implementation successfully processes surgery schedule documents with hundreds of pages, extracting patient names, dates of birth, and medical insurance information. It handles large documents (200+ pages), parallel patient processing, error recovery, and long-running operations.
Running the Workflow
```shell
# Start the Temporal worker
python worker.py

# In another terminal, run the client
python client_classify_group_and_extract_insurance.py <path_to_pdf>
```
See the README and client script for details.
The workflow returns JSON with file name, page classifications, schedule pages, and patient data with insurance information.
Conclusion
Temporal and DocRouter.AI together provide reliable, scalable document processing with durable workflows, parallel processing, and rapid schema iteration using the Claude Agent. The implementation took just 2 days to build.
Code available at doc-router-temporal.