How We Achieve 95%+ Invoice OCR Accuracy
How Invoice OCR achieves 95%+ field extraction accuracy — a technical look at the architecture, training approach, and engineering decisions behind it.
Ninety-five percent accuracy on invoice data extraction is not a marketing number — it is a measurement on a held-out test set of real-world invoices that we publish and update quarterly. Getting there required solving four distinct problems: handling document quality variation, extracting structured line items from unstructured layouts, normalizing numeric and date fields without hallucination, and generalizing across 109 languages and dozens of invoice conventions. This post explains the technical decisions behind each.
The measurement methodology
We measure field-level accuracy, not document-level accuracy. A field is counted as correct only if the extracted value exactly matches the ground truth, a strict standard with no credit for partial or approximate extraction. Our test set contains 2,000 invoices sourced from real customer uploads (anonymized and shared with consent), distributed as follows: 800 standard commercial PDF invoices from 200+ vendors, 400 scanned paper invoices at varying DPI and scan quality, 300 multi-page invoices with 5–50 line items each, 300 non-English invoices across 12 languages, and 200 edge-case documents (hand-annotated, damaged scans, unusual layouts). We measure six fields: vendor name, invoice number, invoice date, line items (set match), subtotal, and grand total. The reported 95.6% is the arithmetic mean field accuracy across all six fields on the full test set, not a number cherry-picked from the easy subset.
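To make the scoring rule concrete, here is a minimal sketch of how per-field and mean field accuracy can be computed. The field names mirror the six fields above, but the data shapes and helper names are illustrative assumptions, not our evaluation harness.

```python
# Illustrative scoring sketch: extractions and ground truth are assumed to be
# dicts keyed by field name; line items are lists of row dicts.

FIELDS = ["vendor_name", "invoice_number", "invoice_date",
          "line_items", "subtotal", "grand_total"]

def field_correct(field: str, predicted, truth) -> bool:
    if predicted is None:
        return False
    if field == "line_items":
        # Set match: the extracted rows must equal the ground-truth rows exactly.
        as_set = lambda rows: {tuple(sorted(r.items())) for r in rows}
        return as_set(predicted) == as_set(truth)
    return predicted == truth  # exact match, no partial credit

def mean_field_accuracy(predictions: list[dict], ground_truth: list[dict]) -> float:
    per_field = {
        field: sum(field_correct(field, p.get(field), t.get(field))
                   for p, t in zip(predictions, ground_truth)) / len(predictions)
        for field in FIELDS
    }
    # The headline figure is the arithmetic mean of the six per-field accuracies.
    return sum(per_field.values()) / len(per_field)
```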
Architecture: vision encoder + structured decoder
Invoice OCR uses a two-stage architecture. The first stage is a vision encoder — a fine-tuned ViT-L variant trained on a large corpus of document images. Unlike general image encoders, this model is specifically optimized for document text extraction, handling multi-column layouts, rotated text, mixed fonts, and low-contrast scans. The encoder produces a high-dimensional token representation of the document layout and text content. The second stage is a structured generation decoder, which takes the encoder output and generates a JSON object with a fixed schema. The key constraint: the decoder is trained with a copy-faithfulness objective that prevents it from generating token sequences not present in the source document. This eliminates hallucination on numeric fields — a critical requirement for financial data where a hallucinated total is worse than no extraction at all.
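One way to picture the copy-faithfulness constraint is as a vocabulary mask at decoding time: value tokens can only be drawn from tokens that actually appear in the source document, while the fixed JSON schema tokens remain available. The sketch below is a simplified illustration under that assumption, not the production training objective or decoder.

```python
import torch

def copy_constrained_logits(logits: torch.Tensor,
                            source_token_ids: set[int],
                            schema_token_ids: set[int]) -> torch.Tensor:
    """Mask one decoding step's logits so field values can only be copied.

    `logits` is a 1-D tensor over the decoder vocabulary. Tokens from the fixed
    JSON schema (braces, quotes, field names) stay allowed; every other token
    must have appeared in the encoded source document, which rules out
    hallucinated digits in numeric fields.
    """
    allowed = torch.tensor(sorted(source_token_ids | schema_token_ids), dtype=torch.long)
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    return logits + mask
```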
Training data: quantity, diversity, and quality filtering
The model was trained on a proprietary corpus of 4.2 million invoice images with field-level annotations. Data quality was the primary driver of accuracy improvement — we spent more engineering time on annotation quality than on model architecture iteration. The training pipeline includes: (1) automatic annotation using an ensemble of three independent extraction models, with disagreements flagged for human review; (2) vendor diversity sampling to ensure representation from small regional suppliers, not just large enterprise vendors who tend to have consistent layouts; (3) language-stratified sampling to maintain coverage of all 109 supported languages; and (4) quality tier stratification (high-quality scan, medium-quality scan, poor-quality scan) so the model sees the full distribution of document quality, not just clean PDFs. The result is a training corpus that reflects what finance teams actually process — not a curated dataset of ideal documents.
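Step (1) of that pipeline, ensemble annotation with disagreement flagging, can be sketched as follows; the extractor interface and field list are assumptions made for illustration.

```python
# Sketch of ensemble-based auto-annotation: three independent extractors vote,
# unanimous fields are accepted, disagreements are queued for human review.

def annotate_with_ensemble(document, extractors, fields):
    candidates = [extractor.extract(document) for extractor in extractors]
    annotation, needs_review = {}, []
    for field in fields:
        values = [candidate.get(field) for candidate in candidates]
        if len({str(v) for v in values}) == 1:
            annotation[field] = values[0]   # unanimous: accept automatically
        else:
            needs_review.append(field)      # disagreement: flag for human review
    return annotation, needs_review
```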
Line item extraction: the hardest problem
Line item extraction is the hardest part of invoice OCR, and the area where most competing tools fall short. The challenge is structural: line items are tabular data rendered in hundreds of different layouts — some use grids, some use indentation, some split across pages, some mix line items with running totals. Our approach uses a two-pass extraction strategy. The first pass identifies the line item region using a layout segmentation model trained to detect table boundaries regardless of visual style. The second pass extracts individual rows using a row-level decoder that is conditioned on the inferred column structure. On our test set, multi-page line item extraction achieves 92.1% set match accuracy — meaning the full set of line items is correctly extracted in 92.1% of multi-page invoices. This is significantly higher than single-model approaches, which typically achieve 75–80% on multi-page documents.
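In outline, the two-pass strategy looks something like the sketch below; the segmenter and row decoder interfaces are hypothetical names standing in for the actual models.

```python
# Illustrative outline of two-pass line item extraction. Pass 1 finds table
# regions on every page (so tables that continue across page breaks are
# collected together); pass 2 decodes rows conditioned on the column structure.

def extract_line_items(pages, segmenter, row_decoder):
    regions = [region for page in pages
               for region in segmenter.find_table_regions(page)]
    if not regions:
        return []

    # Infer the column structure once from the first region and reuse it for
    # continuation regions on later pages.
    columns = segmenter.infer_columns(regions[0])

    rows = []
    for region in regions:
        rows.extend(row_decoder.decode_rows(region, columns=columns))
    return rows
```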
Numeric and date normalization
Extracted text is not the same as structured data. An invoice might show "1,234.56 USD", "$1.234,56" (European decimal convention), or "¥123,456" (no decimal places); each formatting convention requires its own parsing logic. Invoice OCR applies a normalization layer between the raw text extraction and the JSON output. All amounts are normalized to a decimal number with an explicit currency field. All dates are normalized to ISO 8601 format (YYYY-MM-DD) regardless of the source format. The normalization layer handles 47 date format patterns and 23 currency formatting conventions identified in our training corpus. Normalization failures are surfaced as a confidence flag on the affected field, allowing downstream systems to route low-confidence fields for human review rather than silently accepting a potentially wrong value.
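A compressed sketch of the idea follows. The production layer covers 47 date patterns and 23 currency conventions; the handful of patterns, the separator heuristic, and the return shape below are simplified assumptions, not the real normalizer.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# A few date patterns stand in for the 47 the production layer handles.
DATE_PATTERNS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d.%m.%Y", "%B %d, %Y"]

def normalize_date(raw: str) -> dict:
    for pattern in DATE_PATTERNS:
        try:
            iso = datetime.strptime(raw.strip(), pattern).strftime("%Y-%m-%d")
            return {"value": iso, "confidence": "high"}
        except ValueError:
            continue
    # No pattern matched: keep the raw text and flag the field for review.
    return {"value": raw, "confidence": "low"}

def normalize_amount(raw: str, currency: str) -> dict:
    digits = "".join(ch for ch in raw if ch.isdigit() or ch in ",.")
    if "," in digits and "." in digits:
        # Both separators present: the rightmost one is the decimal mark.
        decimal_sep = "," if digits.rfind(",") > digits.rfind(".") else "."
        thousands_sep = "." if decimal_sep == "," else ","
        cleaned = digits.replace(thousands_sep, "").replace(decimal_sep, ".")
    else:
        # One separator: a trailing group of exactly three digits is treated
        # as a thousands group (e.g. "¥123,456"); otherwise it is a decimal mark.
        sep = "," if "," in digits else ("." if "." in digits else None)
        if sep and len(digits.rsplit(sep, 1)[-1]) == 3:
            cleaned = digits.replace(sep, "")
        else:
            cleaned = digits.replace(",", ".")
    try:
        return {"value": Decimal(cleaned), "currency": currency, "confidence": "high"}
    except InvalidOperation:
        return {"value": raw, "currency": currency, "confidence": "low"}
```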
Continuous accuracy monitoring in production
We measure accuracy in production, not just at training time. Every extraction result has a per-field confidence score derived from the decoder's output probability distribution. Fields with confidence below a threshold are routed to a human-in-the-loop review queue. Reviewed results feed back into our fine-tuning pipeline, creating a continuous improvement loop: production traffic that triggers human review becomes training signal for the next model version. This approach allows the model to improve on the specific failure modes encountered in real customer workflows, not just synthetic test cases. The production accuracy monitoring system also powers our quarterly accuracy report — we track field accuracy on a rolling 90-day window of human-verified extractions and publish the number publicly.
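The routing rule itself is simple to express. A minimal sketch follows, where the threshold value, queue interface, and result shape are assumptions for illustration rather than the production system.

```python
# Hypothetical routing sketch: `result` maps each field to a (value, confidence)
# pair; low-confidence fields go to human review, and reviewed results later
# re-enter the fine-tuning pipeline as training signal.

CONFIDENCE_THRESHOLD = 0.90  # illustrative cutoff; tuned per field in practice

def route_extraction(result: dict, review_queue, training_buffer) -> dict:
    low_confidence = {
        field: value
        for field, (value, confidence) in result.items()
        if confidence < CONFIDENCE_THRESHOLD
    }
    if low_confidence:
        ticket = review_queue.submit(low_confidence)
        training_buffer.add_pending(ticket)  # verified later, then used for fine-tuning
    return {field: value for field, (value, _confidence) in result.items()}
```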
Test our accuracy on your invoices
Upload up to 50 invoices free and see the extracted structured data — no account required to start.