DocRouter Schema Definition Manual
Overview
DocRouter uses OpenAI’s Structured Outputs JSON Schema format to define extraction schemas for document processing. Schemas ensure that AI-extracted data from documents follows a consistent, validated structure.
Table of Contents
- Schema Format Specification
- Basic Schema Structure
- Field Types
- Required Fields and Strict Mode
- Advanced Schema Features
- Best Practices
- Examples
- API Integration
Schema Format Specification
Root Structure
All DocRouter schemas follow this format:
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {},
      "required": [],
      "additionalProperties": false
    },
    "strict": true
  }
}
Components
Component | Type | Required | Description |
---|---|---|---|
type | string | Yes | Must be "json_schema" |
json_schema | object | Yes | Container for schema definition |
json_schema.name | string | Yes | Identifier for the schema |
json_schema.schema | object | Yes | JSON Schema specification following JSON Schema Draft 7 |
json_schema.strict | boolean | Yes | Must be true. Ensures 100% schema adherence |
Strict Mode Constraints
DocRouter always uses strict: true mode, which requires these additional constraints beyond standard JSON Schema:
- All properties MUST be in the required array: no optional fields are allowed
- additionalProperties: false MUST be set at every level, including nested objects
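The two strict-mode constraints can be checked mechanically before a schema is submitted. The following is a minimal sketch, not part of DocRouter: the function name `strict_mode_violations` and the example schema are hypothetical, and the function takes the inner JSON Schema object (the value of `json_schema.schema`), not the outer envelope.

```python
from typing import Any


def strict_mode_violations(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Return human-readable strict-mode problems found in a JSON Schema node."""
    problems: list[str] = []
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        required = set(schema.get("required", []))
        # Constraint 1: every property must be listed in "required".
        missing = sorted(set(properties) - required)
        if missing:
            problems.append(f"{path}: not listed in 'required': {missing}")
        # Constraint 2: additionalProperties must be explicitly false.
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: 'additionalProperties' must be false")
        # Recurse into nested objects and arrays.
        for name, child in properties.items():
            problems.extend(strict_mode_violations(child, f"{path}.{name}"))
    elif schema.get("type") == "array":
        items = schema.get("items")
        if isinstance(items, dict):
            problems.extend(strict_mode_violations(items, f"{path}[]"))
    return problems


# Hypothetical schema: "address" is missing from the root "required" array,
# and the nested object lacks both "required" entries and additionalProperties.
example = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": [],
        },
    },
    "required": ["name"],
    "additionalProperties": False,
}
violations = strict_mode_violations(example)
for line in violations:
    print(line)
```

Running a check like this locally catches nested-object mistakes, which are easy to miss because the constraints apply at every level.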
Basic Schema Structure
Minimal Schema Example
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {
        "field_name": {
          "type": "string",
          "description": "Human-readable description of this field"
        }
      },
      "required": ["field_name"],
      "additionalProperties": false
    },
    "strict": true
  }
}
Schema Object Properties
Property | Type | Required | Description |
---|---|---|---|
type | string | Yes | Must be "object" for root schema |
properties | object | Yes | Defines all extractable fields |
required | array | Yes | Must list ALL properties when strict: true |
additionalProperties | boolean | Yes | Must be false when strict: true |
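Because the envelope and the root-level `required` and `additionalProperties` values are fully determined by the field definitions, they can be generated rather than written by hand. The sketch below assumes a hypothetical helper named `build_schema`; it only fills in the root level, so nested objects must still carry their own `required` and `additionalProperties` entries.

```python
import json
from typing import Any


def build_schema(fields: dict[str, dict[str, Any]],
                 name: str = "document_extraction") -> dict[str, Any]:
    """Wrap field definitions in the root structure described above."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "schema": {
                "type": "object",
                "properties": fields,
                # Strict mode: every property is required.
                "required": list(fields),
                "additionalProperties": False,
            },
            "strict": True,
        },
    }


minimal = build_schema({
    "field_name": {
        "type": "string",
        "description": "Human-readable description of this field",
    }
})
print(json.dumps(minimal, indent=2))
```

Serialized with `json.dumps`, the result matches the minimal schema example (Python's `False` and `True` become JSON `false` and `true`).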
Field Types
DocRouter schemas support standard JSON Schema data types:
String Fields
{
  "field_name": {
    "type": "string",
    "description": "A text field"
  }
}
Use for: Names, emails, addresses, free-text descriptions, comma-separated lists
Number Fields
{
  "amount": {
    "type": "number",
    "description": "A numeric value"
  }
}
Use for: Quantities, amounts, percentages, measurements
Integer Fields
{
  "count": {
    "type": "integer",
    "description": "A whole number"
  }
}
Use for: Counts, years, age, quantity of items
Boolean Fields
{
  "is_verified": {
    "type": "boolean",
    "description": "True/false indicator"
  }
}
Use for: Yes/no questions, checkboxes, status flags
Array Fields
{
  "skills": {
    "type": "array",
    "description": "List of programming skills",
    "items": {
      "type": "string"
    }
  }
}
Use for: Lists, multiple values, repeated items
Object Fields (Nested)
{
  "address": {
    "type": "object",
    "description": "Address information",
    "properties": {
      "street": {
        "type": "string",
        "description": "Street address"
      },
      "city": {
        "type": "string",
        "description": "City name"
      },
      "postal_code": {
        "type": "string",
        "description": "Postal code"
      }
    },
    "required": ["street", "city", "postal_code"],
    "additionalProperties": false
  }
}
Use for: Grouped related fields, structured sub-data
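On the consuming side, each JSON Schema type maps to a Python type once the extraction result has been parsed with `json.loads`. The sketch below walks a result against a schema and reports type mismatches. It is a simplified illustration, not a full JSON Schema validator: the function name `type_errors` and the sample data are hypothetical, and a float such as `1.0` is treated as not being an integer.

```python
from typing import Any

# Python-side checks for each scalar JSON Schema type. bool is excluded from
# the numeric checks because bool is a subclass of int in Python.
_SCALAR_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def type_errors(value: Any, schema: dict[str, Any], path: str = "$") -> list[str]:
    """List places where a parsed result does not match the schema's types."""
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        errors: list[str] = []
        for name, child in schema.get("properties", {}).items():
            if name not in value:
                errors.append(f"{path}.{name}: missing")
            else:
                errors.extend(type_errors(value[name], child, f"{path}.{name}"))
        return errors
    if expected == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        errors = []
        for index, item in enumerate(value):
            errors.extend(type_errors(item, schema.get("items", {}), f"{path}[{index}]"))
        return errors
    check = _SCALAR_CHECKS.get(expected)
    if check is not None and not check(value):
        return [f"{path}: expected {expected}"]
    return []


schema = {
    "type": "object",
    "properties": {
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "postal_code": {"type": "string"},
            },
        },
        "skills": {"type": "array", "items": {"type": "string"}},
    },
}
# Hypothetical result with two deliberate type mistakes.
result = {
    "address": {"street": "1 Main St", "city": "Springfield", "postal_code": 12345},
    "skills": ["Python", 3],
}
errors = type_errors(result, schema)
print(errors)
```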
How the LLM Handles Missing Data
Since all fields must be required in strict mode, the LLM handles missing data as follows:
- String fields: Returns empty string "" if data not found in document
- Number/Integer fields: Returns 0 if data not found
- Boolean fields: Returns false if data not found
- Array fields: Returns empty array [] if data not found
- Object fields: Returns object with all nested required fields populated with default values
Best Practice: Design your schema knowing that all fields will always be present in the response, but may contain empty/default values when data is not found in the document.
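Application code can flag fields that came back with these default values. The sketch below uses hypothetical helper names (`is_default_value`, `default_fields`) and a made-up result. Note the inherent ambiguity: a value of 0 or false may be a genuine extraction rather than a placeholder, so the output is best treated as a list of fields to review, not as proof that data was missing.

```python
from typing import Any


def is_default_value(value: Any) -> bool:
    """True if the value looks like a 'not found' placeholder."""
    if isinstance(value, bool):          # check bool before int: bool is an int subclass
        return value is False
    if isinstance(value, (int, float)):
        return value == 0
    if isinstance(value, (str, list)):
        return len(value) == 0
    if isinstance(value, dict):
        # An object counts as default only if all of its fields are defaults.
        return all(is_default_value(v) for v in value.values())
    return value is None


def default_fields(result: dict[str, Any]) -> list[str]:
    """Names of top-level fields whose values look like placeholders."""
    return [name for name, value in result.items() if is_default_value(value)]


# Hypothetical extraction result.
result = {
    "name": "",
    "count": 0,
    "is_verified": False,
    "skills": [],
    "address": {"street": "", "city": ""},
    "vendor": "Acme",
}
flagged = default_fields(result)
print(flagged)
```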
Advanced Schema Features
⚠️ PORTABILITY WARNING: The features below are supported by OpenAI’s Structured Outputs, but not recommended for DocRouter schemas. These constraints may not be portable across different LLM providers (Anthropic Claude, Google Gemini, etc.). For maximum compatibility and reliability:
- Use basic types only: string, number, integer, boolean, array, object
- Avoid enums, patterns, minimum/maximum, minItems/maxItems, uniqueItems
- Handle validation in your application code instead of in the schema
- Use detailed descriptions to guide the LLM rather than strict constraints
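A schema can be scanned for the keywords listed above before it is shared across providers. This is a minimal sketch with a hypothetical function name and sample schema; the keyword set is taken directly from the list above and can be extended as needed.

```python
from typing import Any

# Keywords this manual recommends avoiding for portability.
NON_PORTABLE = {"enum", "pattern", "minimum", "maximum",
                "minItems", "maxItems", "uniqueItems"}


def non_portable_keywords(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Report constraint keywords used anywhere in a JSON Schema node."""
    # Only the node's own keys are checked, so a property that happens to be
    # *named* "pattern" is not reported.
    found = [f"{path}: {key}" for key in sorted(NON_PORTABLE.intersection(schema))]
    for name, child in schema.get("properties", {}).items():
        found.extend(non_portable_keywords(child, f"{path}.{name}"))
    items = schema.get("items")
    if isinstance(items, dict):
        found.extend(non_portable_keywords(items, f"{path}[]"))
    return found


# Hypothetical schema using two discouraged keywords.
schema = {
    "type": "object",
    "properties": {
        "invoice_status": {"type": "string", "enum": ["paid", "unpaid"]},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"quantity": {"type": "integer", "minimum": 1}},
            },
        },
    },
}
report = non_portable_keywords(schema)
print(report)
```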
Best Practices
1. Use Clear, Descriptive Field Names and Detailed Descriptions
Good:
{
  "current_academic_program": {
    "type": "string",
    "description": "Current degree program (e.g., MEng Computing, BSc Computer Science)"
  }
}
Avoid:
{
  "prog": {
    "type": "string",
    "description": "program"
  }
}
Descriptions guide the LLM on what to extract and should include:
- What data to extract
- Expected format
- Examples when helpful
Good:
{
  "total_amount": {
    "type": "string",
    "description": "Total invoice amount including tax, with currency symbol and commas (e.g., $1,234.56)"
  }
}
Avoid:
{
  "total_amount": {
    "type": "string",
    "description": "total"
  }
}
Additional Context in Prompts: While field descriptions provide specific guidance for each field, you can also provide additional context and instructions in the prompt associated with the schema. This is particularly useful for:
- Document-specific extraction rules
- Edge case handling instructions
- Format preferences that apply across multiple fields
- Business logic that guides the overall extraction process
2. Choose Appropriate Field Types
- Use string for currency values with formatting (e.g., “$1,234.56”)
- Use number for numeric calculations
- Use array for multiple items instead of comma-separated strings
- Use object to group related fields
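When a currency value is extracted as a formatted string, the application can convert it to a number at the point where arithmetic is needed. The sketch below uses a hypothetical helper, `parse_amount`, and assumes a dot as the decimal separator and commas as thousands separators; other locales would need different handling.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Convert a formatted amount such as "$1,234.56" to a Decimal.

    Returns None when no number can be recovered (for example, an empty
    string returned because the value was not found in the document).
    """
    # Drop everything except digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


total = parse_amount("$1,234.56")
print(total)
```

Decimal is used instead of float so that sums of monetary values do not pick up binary rounding errors.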
3. Avoid Advanced Constraints for Portability
For maximum portability across LLM providers (OpenAI, Anthropic, Gemini, etc.):
- Use basic types only and avoid enums, patterns, min/max constraints
- Put constraints in descriptions instead:
"Status (paid, unpaid, overdue, or cancelled)"
- Validate data in your application rather than in the schema
- This ensures your schemas work consistently across all supported LLM providers
Not recommended:
{
  "invoice_status": {
    "type": "string",
    "enum": ["paid", "unpaid", "overdue", "cancelled"]
  }
}
Recommended:
{
  "invoice_status": {
    "type": "string",
    "description": "Invoice status (paid, unpaid, overdue, or cancelled)"
  }
}
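With the allowed values moved into the description, the check that an enum would have enforced belongs in application code. A minimal sketch follows, with the hypothetical names `ALLOWED_STATUSES` and `normalize_status`; how unrecognized values are handled (rejected, logged, or sent for review) is left to the application.

```python
from typing import Optional

# The same values the description asks the LLM to choose from.
ALLOWED_STATUSES = {"paid", "unpaid", "overdue", "cancelled"}


def normalize_status(raw: str) -> Optional[str]:
    """Return the canonical status, or None if the value is not recognized."""
    value = raw.strip().lower()
    return value if value in ALLOWED_STATUSES else None


print(normalize_status(" Paid "))   # recognized after trimming and lowercasing
print(normalize_status("pending"))  # not in the allowed set
```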
Examples
Example 1: Invoice Schema
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": {
          "type": "string",
          "description": "Unique invoice identifier"
        },
        "invoice_date": {
          "type": "string",
          "description": "Date of invoice in YYYY-MM-DD format"
        },
        "vendor_name": {
          "type": "string",
          "description": "Name of the vendor/supplier"
        },
        "vendor_address": {
          "type": "string",
          "description": "Complete vendor address"
        },
        "customer_name": {
          "type": "string",
          "description": "Name of the customer/buyer"
        },
        "line_items": {
          "type": "array",
          "description": "List of items on the invoice",
          "items": {
            "type": "object",
            "properties": {
              "description": {
                "type": "string",
                "description": "Item description"
              },
              "quantity": {
                "type": "string",
                "description": "Quantity ordered"
              },
              "unit_price": {
                "type": "string",
                "description": "Price per unit with currency"
              },
              "total": {
                "type": "string",
                "description": "Line total with currency"
              }
            },
            "required": ["description", "quantity", "unit_price", "total"],
            "additionalProperties": false
          }
        },
        "subtotal": {
          "type": "string",
          "description": "Subtotal before tax with currency"
        },
        "tax_amount": {
          "type": "string",
          "description": "Tax amount with currency"
        },
        "total_amount": {
          "type": "string",
          "description": "Total amount due with currency"
        },
        "payment_terms": {
          "type": "string",
          "description": "Payment terms (e.g., Net 30, Due on Receipt)"
        }
      },
      "required": [
        "invoice_number",
        "invoice_date",
        "vendor_name",
        "vendor_address",
        "customer_name",
        "line_items",
        "subtotal",
        "tax_amount",
        "total_amount",
        "payment_terms"
      ],
      "additionalProperties": false
    },
    "strict": true
  }
}
Example 2: Resume/CV Schema
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {
        "Name": {
          "type": "string",
          "description": "Candidate's full name"
        },
        "Email": {
          "type": "string",
          "description": "Email address"
        },
        "Telephone": {
          "type": "string",
          "description": "Phone number"
        },
        "Current Academic Program": {
          "type": "string",
          "description": "Current degree program (e.g., MEng Computing)"
        },
        "Current Grade": {
          "type": "string",
          "description": "Academic year or GPA/grade information"
        },
        "High School Qualification": {
          "type": "string",
          "description": "A-levels, GCSEs, or equivalent qualifications"
        },
        "Programming Languages": {
          "type": "string",
          "description": "Comma-separated list of programming languages"
        },
        "Experiences": {
          "type": "string",
          "description": "Professional or research experiences"
        },
        "Projects": {
          "type": "string",
          "description": "Academic or personal projects with descriptions"
        },
        "Awards": {
          "type": "string",
          "description": "Academic awards, honors, competition placements"
        },
        "Work Experience": {
          "type": "string",
          "description": "Employment history with companies and roles"
        },
        "Extracurricular": {
          "type": "string",
          "description": "Clubs, hobbies, volunteer work, sports"
        },
        "Languages": {
          "type": "string",
          "description": "Spoken languages and proficiency levels"
        }
      },
      "required": [
        "Name",
        "Email",
        "Telephone",
        "Current Academic Program",
        "Current Grade",
        "High School Qualification",
        "Programming Languages",
        "Experiences",
        "Projects",
        "Awards",
        "Work Experience",
        "Extracurricular",
        "Languages"
      ],
      "additionalProperties": false
    },
    "strict": true
  }
}
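This schema returns "Programming Languages" as a single comma-separated string, so a consumer that needs individual items has to split it. The sketch below shows one way to do that; `split_list` and the sample result are hypothetical. Field names that contain spaces must be read with bracket access, not attribute access.

```python
def split_list(text: str) -> list[str]:
    """Split a comma-separated string into trimmed, non-empty items."""
    return [part.strip() for part in text.split(",") if part.strip()]


# Hypothetical extraction result for the resume schema (abbreviated).
result = {
    "Name": "Ada Example",
    "Programming Languages": "Python, C++, , Haskell",
}
languages = split_list(result["Programming Languages"])
print(languages)
```

An empty string, which is what the LLM returns when nothing is found, yields an empty list.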
Example 3: Financial Statement Schema
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {
        "net_interest_income": {
          "type": "string",
          "description": "Net interest income in thousands with formatting"
        },
        "net_fee_and_commission_income": {
          "type": "string",
          "description": "Net fee and commission income"
        },
        "other_operating_income": {
          "type": "string",
          "description": "Other operating income"
        },
        "credit_loss_expense": {
          "type": "string",
          "description": "Credit loss expense (negative values in parentheses)"
        },
        "net_operating_income": {
          "type": "string",
          "description": "Net operating income"
        },
        "personnel_expenses": {
          "type": "string",
          "description": "Personnel expenses"
        },
        "other_operating_expenses": {
          "type": "string",
          "description": "Other operating expenses"
        },
        "total_expenses": {
          "type": "string",
          "description": "Total expenses"
        },
        "profit_loss_before_tax": {
          "type": "string",
          "description": "Profit/loss before tax"
        },
        "tax_expense_credit": {
          "type": "string",
          "description": "Tax expense or credit"
        },
        "profit_loss_for_the_year": {
          "type": "string",
          "description": "Final profit/loss for the year"
        }
      },
      "required": [
        "net_interest_income",
        "net_fee_and_commission_income",
        "other_operating_income",
        "credit_loss_expense",
        "net_operating_income",
        "personnel_expenses",
        "other_operating_expenses",
        "total_expenses",
        "profit_loss_before_tax",
        "tax_expense_credit",
        "profit_loss_for_the_year"
      ],
      "additionalProperties": false
    },
    "strict": true
  }
}
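Values extracted with this schema keep the statement's formatting, including parentheses for negative amounts, so they need converting before any arithmetic. The following sketch uses a hypothetical helper, `parse_accounting`. It assumes commas as thousands separators, does not strip currency symbols, and does not rescale figures reported in thousands.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_accounting(text: str) -> Optional[Decimal]:
    """Convert "1,234" or "(1,234)" to a Decimal; parentheses mean negative."""
    stripped = text.strip()
    negative = stripped.startswith("(") and stripped.endswith(")")
    digits = stripped.strip("()").replace(",", "")
    try:
        value = Decimal(digits)
    except InvalidOperation:
        # Empty string or non-numeric text: nothing usable was extracted.
        return None
    return -value if negative else value


credit_loss = parse_accounting("(1,234)")
net_income = parse_accounting("5,678")
print(credit_loss, net_income)
```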
API Integration
DocRouter provides multiple ways to interact with schemas programmatically:
- TypeScript/JavaScript SDK - Type-safe client library for Node.js and browsers (see packages/typescript/docrouter-sdk/)
- Python SDK - Type-safe Python client library (see packages/docrouter_sdk/)
- REST API - Direct HTTP requests (see API documentation for endpoints)
- MCP (Model Context Protocol) - Integration with AI assistants like Claude Code
All methods support the same schema operations: create, list, retrieve, update, delete, and validate against schemas.
Schema Workflow
1. Design Phase
- Identify document type and key fields to extract
- Choose appropriate data types for each field
- Design nested structures for complex data
- Remember: ALL fields will be required in strict mode
2. Creation Phase
- Create schema using API or UI
- Test with sample documents
- Iterate based on extraction results
3. Prompt Integration
- Link schema to extraction prompt
- Configure LLM model (e.g., gpt-4o-mini, gemini-2.0-flash)
- Associate with document tags for automatic processing
4. Processing Phase
- Upload documents with appropriate tags
- LLM extracts data according to schema
- Results available via getLLMResult API
5. Validation Phase
- Review extracted data
- Verify against schema requirements
- Mark as verified when accurate
Troubleshooting
Common Issues
Issue: LLM returns empty strings for all fields
- Solution: Check prompt content, ensure it references the schema, verify document has OCR text
Issue: Extra fields appear in extraction
- Solution: Ensure additionalProperties: false is set in schema
Issue: Error “all properties must be required when strict is true”
- Solution: Ensure ALL properties are listed in the required array at every level (including nested objects)
Issue: Error “additionalProperties must be false when strict is true”
- Solution: Set additionalProperties: false on all objects in the schema, including nested objects
Issue: Required fields missing from extraction
- Solution: This should not happen with strict: true. Verify schema matches exactly, check field names
Issue: Number fields returned as strings
- Solution: For formatted numbers (with commas, currency), use string type. For calculations, use number type.
Issue: Array fields contain single concatenated string
- Solution: Update prompt to explicitly instruct LLM to return array of items
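The two strict-mode errors above ("all properties must be required" and "additionalProperties must be false") can be repaired mechanically, since the fix is the same at every level. The sketch below is one possible approach and not a DocRouter utility: `make_strict` is a hypothetical name, and it operates on the inner JSON Schema object, returning a corrected copy and leaving the input untouched.

```python
import copy
from typing import Any


def _fix_in_place(node: dict[str, Any]) -> None:
    """Apply the strict-mode constraints to a node and its descendants."""
    if node.get("type") == "object":
        properties = node.setdefault("properties", {})
        node["required"] = list(properties)       # every property is required
        node["additionalProperties"] = False
        for child in properties.values():
            _fix_in_place(child)
    elif node.get("type") == "array" and isinstance(node.get("items"), dict):
        _fix_in_place(node["items"])


def make_strict(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the schema that satisfies the strict-mode constraints."""
    fixed = copy.deepcopy(schema)
    _fix_in_place(fixed)
    return fixed


# Hypothetical schema that would trigger both errors.
loose = {
    "type": "object",
    "properties": {
        "vendor": {"type": "object", "properties": {"name": {"type": "string"}}},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"sku": {"type": "string"}},
                "required": [],
            },
        },
    },
}
fixed = make_strict(loose)
```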
Version Control
DocRouter maintains schema versioning:
- Each schema update creates a new version
- schema_version increments with each change
- schema_revid uniquely identifies each version
- Previous versions remain accessible for historical extractions
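The relationship between the two identifiers can be illustrated with a small in-memory example. Only the field names schema_version and schema_revid come from this manual; the record shape, the revision ID values, and the helper functions below are assumptions made for illustration and may not match what the API returns.

```python
from typing import Any, Optional

# Hypothetical version records for one schema, oldest first.
versions = [
    {"schema_revid": "rev-001", "schema_version": 1, "name": "invoice"},
    {"schema_revid": "rev-002", "schema_version": 2, "name": "invoice"},
    {"schema_revid": "rev-003", "schema_version": 3, "name": "invoice"},
]


def latest_version(records: list[dict[str, Any]]) -> dict[str, Any]:
    """The record with the highest schema_version is the current one."""
    return max(records, key=lambda record: record["schema_version"])


def find_revision(records: list[dict[str, Any]], revid: str) -> Optional[dict[str, Any]]:
    """Look up the exact version that a historical extraction used."""
    return next((r for r in records if r["schema_revid"] == revid), None)


current = latest_version(versions)
historical = find_revision(versions, "rev-001")
```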
References
Document Version: 1.0
Last Updated: 2025-10-11
Maintained by: DocRouter Development Team