Beyond OCR — OCR-free Document Understanding and Unified Models

Introduction: Why Go Beyond OCR
- OCR and Document Understanding Are Different
Core Principles: The Traditional OCR Pipeline and Its Limits
Deep Dive 1: OCR-free End-to-End Document Understanding
- The Donut Family: Structured Output Directly From Image Without OCR
- VLM-Based Document Understanding
Deep Dive 2: High Resolution, Layout, Tables, Formulas
- High-Resolution Handling
- Layout, Tables, Formulas
Deep Dive 3: Multilingual OCR and Unified Models
- Multilingual OCR
- Unified Models: Detection-Recognition-Parsing as One
Deep Dive 3.5: The Difficulty of Reading Order and Layout
Deep Dive 3.6: Concrete Ways to Structure Tables
Deep Dive 4: Benchmarks and Evaluation Methods
Deep Dive 5: Designing the Inference Pipeline
Deep Dive 6: Hard Cases Like Handwriting, Low Quality, Rotation
Comparison Table: Traditional OCR vs OCR-free
Deep Dive 9.5: Controlling Output via Prompt Design
Practical Application: Receipts, Forms, Documents
Deep Dive 7: Hybrid of OCR-free and Traditional OCR
Deep Dive 8: Getting Structured Output Reliably
Deep Dive 9: A Cost and Latency View
Deep Dive 12: Additional Challenges of Multilingual Documents
Pitfalls and Limits
Deep Dive 10: Considerations When Applying by Domain
Deep Dive 11: Directions Ahead
Closing
References

Introduction: Why Go Beyond OCR

Pulling the total from a receipt, extracting line items from a contract, turning a chart into a table: document-understanding tasks like these have long been built on OCR (optical character recognition). But OCR only reads characters; it does not tell you what those characters mean or which cell they belong to.

So traditional document processing chains several stages together. It detects where the characters are, recognizes what they are, analyzes the page layout, and finally interprets the meaning. The more stages there are, the more an error in one stage propagates to the next. This is the accumulated-error problem.

Recently an OCR-free approach has gained ground, in which a single model replaces the whole chain. It takes an image as input and directly produces the answer we want (text, key-value pairs, a table). This post covers the limits of the traditional pipeline, OCR-free end-to-end models, and the move toward unified models that bundle detection, recognition, and parsing into one.

OCR and Document Understanding Are Different

First, let us clarify terms. OCR is transcribing characters from an image into text. Document understanding is grasping what structure and meaning that text has. They are different stages, and traditional systems handled them separately.

OCR:                  image -> "total 32,500"  (characters to text)
document understanding: "total 32,500" + position/context -> {field: total, value: 32500}  (assigning meaning)

The core claim of the OCR-free approach is that there is no need to separate the two. When one model grasps characters and meaning at once while looking at the image, it can preserve information lost passing through intermediate text (position, font, layout context). It is similar to how a person looking at a receipt does not read each character and then separately attach meaning.

Core Principles: The Traditional OCR Pipeline and Its Limits

A traditional document-understanding pipeline usually consists of the following stages.

[document image]
   |
   v
[1. text detection] --> find boxes of character/word regions
   |
   v
[2. text recognition] --> turn characters in each box into a string
   |
   v
[3. layout analysis] --> grasp title/body/table/cell structure
   |
   v
[4. information extraction] --> key-value, table, meaning interpretation

Each stage is usually a separate model, trained and tuned separately. This structure has clear limits.

Accumulated error: if stage 1 detection misses part of a character, stage 2 recognition cannot recover it no matter how good it is. Errors pile up at each stage.
Information disconnect between stages: the recognition stage sees only character shapes and cannot use semantic context. It struggles to correct a blurry character using context.
Layout fragility: layout analysis often breaks on merged table cells, complex forms, and documents mixed with handwriting.
Pipeline maintenance cost: you must manage a model and rules per stage, so operations are complex.

These limits motivate the OCR-free approach. Fewer stages means less accumulated error, and a single model that sees characters and meaning together enables context correction.
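The accumulated-error argument can be made concrete with a little arithmetic. The sketch below assumes that stage errors are independent and uses made-up per-stage accuracies; real pipelines will not match these numbers, but the multiplication shows why four individually good stages can still yield a noticeably weaker end-to-end result.

```python
from math import prod

def end_to_end_accuracy(stage_accuracies):
    """Probability that every stage is correct, assuming independent errors."""
    return prod(stage_accuracies)

# Hypothetical accuracies for detection, recognition, layout, extraction.
stages = [0.98, 0.97, 0.95, 0.96]
print(f"end-to-end accuracy: {end_to_end_accuracy(stages):.3f}")
```

With these illustrative values, roughly 13 documents in 100 would contain at least one stage error even though no single stage is below 95%.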

Deep Dive 1: OCR-free End-to-End Document Understanding

The Donut Family: Structured Output Directly From Image Without OCR

Donut (Document Understanding Transformer) proposed an approach that removes the OCR stage entirely and generates structured output (e.g., JSON) directly from a document image. A vision encoder reads the image and a text decoder autoregressively generates the result we want.

[document image]
   |
   v
[vision encoder] --> sequence of image features
   |
   v
[text decoder] --> structured output (e.g., JSON key-value)

The strength is clear. There is no separate OCR box detection or recognition model, and the model learns character shape and meaning at once. Prepare training answers in the format of the information you want to extract, and the model learns to produce answers in that format directly. The error that accumulates between stages largely goes away and the pipeline simplifies.

There is a trade-off. With no explicit OCR stage, quality depends heavily on a consistent answer format and on having enough training data. And because the decoder generates output token by token, long outputs can be slow.
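One way to picture "answers in the format you want to extract" is to flatten a JSON answer into a tagged sequence that a text decoder can be trained to emit. The sketch below is modeled loosely on the tag style associated with Donut (`<s_key>...</s_key>` with a separator between list items); the function name and exact tags are illustrative assumptions, not the library's API.

```python
def json_to_sequence(obj):
    """Flatten nested dicts/lists into a tagged string a decoder could learn to emit."""
    if isinstance(obj, dict):
        return "".join(
            f"<s_{key}>{json_to_sequence(value)}</s_{key}>" for key, value in obj.items()
        )
    if isinstance(obj, list):
        return "<sep/>".join(json_to_sequence(item) for item in obj)
    return str(obj)

# Hypothetical receipt answer used as a training target.
answer = {"store": "Cafe", "items": [{"name": "coffee"}, {"name": "tea"}], "total": "32,500"}
print(json_to_sequence(answer))
```

Because the model imitates whatever sequence it is trained on, every training answer should be serialized by the same function with the same key order.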

VLM-Based Document Understanding

The vision-language model (VLM) covered in the earlier post inherently has OCR-free document understanding ability. It turns images into visual tokens and an LLM reads them together with text, so it can take a document image and answer questions or extract tables.

[document image] --> [VLM] --> answer questions like "what's the total?" in natural language
                          --> structure tables into markdown/JSON
                          --> even point to the location (coordinates) of a specific field

The advantage of a VLM is generality. One model handles captioning, VQA, OCR, and grounding, and you can flexibly get the output you want with natural-language instructions. Thanks to strong language ability it tends to infer blurry or partially occluded characters from context.

Deep Dive 2: High Resolution, Layout, Tables, Formulas

Documents differ from natural images. They are dense with small text and full of structures like tables and formulas. Handling them well requires a few devices.

High-Resolution Handling

Reading small text in a document requires sufficient resolution. Forcing the image down to a small fixed size smears the characters. So OCR-free models use strategies to handle high-resolution input.

Tiling/slicing: cut a large page into several tiles, process each, then merge.
Arbitrary resolution: like the dynamic resolution seen in the earlier post, preserve original ratio and size to keep detail.

high-resolution document strategies
method A: tiling             large page -> several tiles -> process each -> merge
method B: dynamic resolution  keep original ratio -> token count proportional to size

The higher the resolution the better the detail, but token count and cost grow. This balance is central to document model design.
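The two strategies above can be sketched in a few lines. The tile size, overlap, and patch size below are hypothetical defaults chosen for illustration, and the token estimate simply assumes one visual token per square patch; actual models differ in how they count tokens.

```python
from math import ceil

def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes that cover the page with overlapping tiles."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # last tile sits flush with the page edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]

def visual_token_estimate(width, height, patch=28):
    """Rough token count if each patch x patch region becomes one token (assumption)."""
    return ceil(width / patch) * ceil(height / patch)

print(tile_boxes(2000, 1000))
print(visual_token_estimate(2000, 1000), visual_token_estimate(1000, 500))
```

Halving both sides of the page cuts the estimated token count to roughly a quarter, which is the cost side of the resolution trade-off.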

Layout, Tables, Formulas

Layout: you must understand region structure such as titles, paragraphs, headers, and tables to extract information in the right context. OCR-free models internalize this through training instead of a separate stage.
Tables: you must read while preserving the structure of rows, columns, and merged cells. Train to generate output as a markdown table or structured format.
Formulas: formulas have a 2D structure (fractions, exponents, subscripts) hard to express as plain text. The common approach trains the model to convert to a standard notation and output it.

Since the format of the answer data determines the model output format for such structural elements, preparing consistent structured answers is important.

Deep Dive 3: Multilingual OCR and Unified Models

Multilingual OCR

Documents are written in many languages. Traditional OCR often kept a separate recognition model or dictionary per language, but an OCR-free VLM can handle several languages with one model by training on image-text data across languages. Efforts continue to broadly handle not only Latin scripts but also CJK scripts (Chinese, Japanese, Korean) and right-to-left scripts. That said, accuracy can still drop on low-resource languages or rare fonts, so data augmentation matched to the target language distribution matters.

Unified Models: Detection-Recognition-Parsing as One

A recent trend is the unified model, which consolidates detection, recognition, layout analysis, and information extraction into a single model. Where the traditional pipeline kept a model per stage, a single encoder-decoder handles them all at once.

traditional pipeline:  detection model -> recognition model -> layout model -> extraction model
unified model:         single model -> (detection + recognition + layout + parsing) simultaneously

Advantages: reduced accumulated error, simplified operations, context-based correction. The information disconnect between stages disappears.
Flexibility: give the same model different instructions to read the whole text, extract a specific field, structure a table, and more.
Considerations: since the whole process is internalized via training, it depends heavily on data quality and diversity, and you need a verification stage to guarantee output accuracy.

Deep Dive 3.5: The Difficulty of Reading Order and Layout

A frequently overlooked difficulty in understanding documents is reading order. A person looking at a multi-column newspaper naturally knows to read the left column top to bottom then move to the right column. But to a model this is not obvious.

multi-column layout example
  +------------------+------------------+
  | col1 first para  | col2 first para  |
  | col1 second para | col2 second para |
  +------------------+------------------+

  correct reading order: all of col1 -> all of col2
  wrong order:           alternate left-right per row (meaning breaks)

The traditional pipeline computed this order explicitly in the layout-analysis stage. An OCR-free model must learn it through training, so if it has not seen enough data with diverse layouts, the reading order can come out scrambled. Mix in tables, footnotes, sidebars, and captions and the difficulty rises further.

This problem is partly eased by output-format design. For example, training to output structured by region makes the model more conscious of region boundaries and better at ordering. The key is to verify with validation data whether the model emits a consistent reading order.
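To see why reading order is not obvious, compare a naive top-to-bottom sort with a column-aware one on the two-column example above. This is a deliberately simple heuristic, assuming each text block has a top-left `x`/`y` coordinate and that columns are separated by a horizontal gap larger than `column_gap` (a hypothetical parameter); it is not how an OCR-free model decides order internally, but it is useful for checking model output on validation pages.

```python
def reading_order(blocks, column_gap=50):
    """Group blocks into columns by x position, then read each column top to bottom."""
    columns = []
    for block in sorted(blocks, key=lambda b: b["x"]):
        if columns and block["x"] - columns[-1][-1]["x"] <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b["y"]))
    return [b["text"] for b in ordered]

def naive_order(blocks):
    """Row-major sort: alternates between columns and breaks the meaning."""
    return [b["text"] for b in sorted(blocks, key=lambda b: (b["y"], b["x"]))]

blocks = [
    {"x": 400, "y": 10, "text": "col2 first para"},
    {"x": 10, "y": 10, "text": "col1 first para"},
    {"x": 12, "y": 60, "text": "col1 second para"},
    {"x": 402, "y": 60, "text": "col2 second para"},
]
print(reading_order(blocks))
print(naive_order(blocks))
```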

Deep Dive 3.6: Concrete Ways to Structure Tables

Tables are one of the trickiest elements in document understanding, because you must preserve row-column relationships, merged cells, and header hierarchy all at once. In what format does an OCR-free model learn to output a table?

table output format choices
  markdown table   simple, fits simple tables, limited merged-cell expression
  HTML table       can express merged cells (rowspan/colspan), verbose
  structured JSON  specifies per-cell coordinates/content, easy to parse, most flexible

Each format has trade-offs. Markdown is human-readable but struggles to express complex merged cells. HTML tables can express merges but the output gets long. Structured JSON is most flexible but you must prepare the answer data consistently in that format.

The key is to pick the format that fits the task and unify the answer data in that format. Since the model imitates the answer format, if the table output format is uneven, the model also produces inconsistent tables. And when evaluating table accuracy, comparing per cell and per structure, not simple string match, captures the real quality.
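As an illustration of the structured-JSON option, the sketch below represents a table as a list of cells and shows two things: rendering it to an HTML table with merged cells, and scoring a prediction per cell rather than by string match. The cell schema (`row`, `col`, `text`, optional `rowspan`/`colspan`) is a hypothetical choice for this example, and the score is a simplified stand-in for a fuller structure-aware metric.

```python
from html import escape

def to_html(cells):
    """Render a list of cell dicts as an HTML table, emitting span attributes when present."""
    rows = {}
    for cell in sorted(cells, key=lambda c: (c["row"], c["col"])):
        attrs = ""
        if cell.get("rowspan", 1) > 1:
            attrs += f' rowspan="{cell["rowspan"]}"'
        if cell.get("colspan", 1) > 1:
            attrs += f' colspan="{cell["colspan"]}"'
        rows.setdefault(cell["row"], []).append(f"<td{attrs}>{escape(cell['text'])}</td>")
    body = "".join(f"<tr>{''.join(tds)}</tr>" for _, tds in sorted(rows.items()))
    return f"<table>{body}</table>"

def cell_accuracy(pred, gold):
    """Share of gold cells reproduced exactly (position, spans, and text)."""
    def key(c):
        return (c["row"], c["col"], c.get("rowspan", 1), c.get("colspan", 1), c["text"].strip())
    gold_keys = {key(c) for c in gold}
    pred_keys = {key(c) for c in pred}
    return len(gold_keys & pred_keys) / len(gold_keys) if gold_keys else 1.0

gold = [
    {"row": 0, "col": 0, "text": "Item", "colspan": 2},
    {"row": 1, "col": 0, "text": "A"},
    {"row": 1, "col": 1, "text": "1"},
]
print(to_html(gold))
```

Keeping one canonical representation and deriving the other formats from it is one way to keep the answer data consistent.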

Deep Dive 4: Benchmarks and Evaluation Methods

How do you evaluate an OCR-free model? Metrics differ by task.

Text recognition accuracy: how closely the recognized string matches the answer. Edit-distance-based metrics are common.
Key-value extraction accuracy: whether fields in receipts/forms were matched per field. Look at per-field precision and recall.
Table structure accuracy: evaluate not only cell-level match but also whether rows, columns, and merge structure are correct.
Document VQA accuracy: whether the answer to a question about the document is correct. Look at semantic match with the answer.

evaluation perspective summary
  what did it read        text recognition accuracy (edit distance)
  what did it extract     key-value precision/recall
  did it keep structure   table cell/structure match
  did it understand       document VQA accuracy

The important point is that no single score represents all of a model's abilities. One model reads plain text well but is weak on table structure; another is strong in English but drops on multilingual documents. So the safest course is to build your own evaluation set matched to your actual document distribution. Public benchmark scores are only a starting point; they do not directly reflect the noise and diversity of real operational data.
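Two of the metrics above are small enough to write out. The sketch below computes a character error rate from Levenshtein edit distance and a per-document precision/recall over extracted fields. Exact string equality is used for field matching, which is an assumption; a real evaluation set would usually normalize values first.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def character_error_rate(pred, gold):
    """Edit distance normalized by the length of the reference string."""
    return edit_distance(pred, gold) / max(len(gold), 1)

def field_precision_recall(pred, gold):
    """Precision and recall over key-value fields, counting exact value matches."""
    correct = sum(1 for key, value in pred.items() if gold.get(key) == value)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

print(character_error_rate("total 32,600", "total 32,500"))
print(field_precision_recall(
    {"total": "32500", "date": "2024-01-01"},
    {"total": "32500", "date": "2024-01-02", "store": "Cafe"},
))
```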

Deep Dive 5: Designing the Inference Pipeline

The inference pipeline for putting an OCR-free model into a real service does not end with a single model call. It is better seen as three stages: preprocessing, model inference, and post-processing/verification.

[original document]
   |
   v
[preprocessing]   resolution normalization, rotation correction, denoising
   |
   v
[model inference]  OCR-free model -> generate structured output
   |
   v
[post-processing]  format validation (JSON schema), confidence judgment, fallback
   |
   v
[result]    passes validation -> auto process / fails -> human review

Each stage has a clear role.

Preprocessing: scanned documents are often skewed or shadowed. Rotation correction and normalization produce input the model can read well.
Model inference: call the model with the instruction that matches the task (read all, extract field, structure table).
Post-processing: verify the output follows the expected schema, and double-check fields like numbers and dates with format rules. If confidence is low, hand off to human review.

The key in this structure is to see the model as one stage of the pipeline, not the endpoint of trust. Designing verification and fallback on the premise that the model can be wrong is what makes it operate stably in real work.
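The three-stage flow can be sketched as a single function that takes the preprocessing step and the model as plain callables. Everything here is a stand-in: `fake_preprocess` and `fake_model` are stubs, the model is assumed to return an output string plus a confidence score, and the 0.9 threshold is an arbitrary example value.

```python
import json

def run_pipeline(image, preprocess, model, required_fields, min_confidence=0.9):
    """Preprocess -> model inference -> post-processing with a human-review fallback."""
    prepared = preprocess(image)             # e.g. deskew, denoise, resize
    raw_text, confidence = model(prepared)   # assumed to return (str, float)

    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return {"route": "human_review", "reason": "output is not valid JSON"}
    if not isinstance(data, dict):
        return {"route": "human_review", "reason": "output is not a JSON object"}

    missing = [field for field in required_fields if field not in data]
    if missing:
        return {"route": "human_review", "reason": f"missing fields: {missing}"}
    if confidence < min_confidence:
        return {"route": "human_review", "reason": "low confidence", "data": data}
    return {"route": "auto", "data": data}

# Stubs standing in for real components.
def fake_preprocess(image):
    return image

def fake_model(image):
    return '{"store": "Cafe", "total": 32500}', 0.95

result = run_pipeline(b"raw-bytes", fake_preprocess, fake_model, ["store", "total"])
print(result["route"])
```

Passing the stages in as arguments keeps the verification logic testable without loading a model.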

Deep Dive 6: Hard Cases Like Handwriting, Low Quality, Rotation

Real documents are not only clean print. Here is what goes wrong on tricky input.

Handwriting: handwriting differs per person, making recognition hard. Augmenting with handwriting data, or routing low-confidence fields to human review, is realistic.
Low quality/compression noise: blurriness or heavy compression artifacts collapse small text. Secure original high quality where possible and reduce noise with preprocessing.
Rotation/skew: a tilted scan angle hurts recognition. Rotation-correction preprocessing helps.
Mixed layout: complex pages with multi-column layout, sidebars, and footnotes confuse reading order. Training and verifying that the model emits a consistent reading order is important.

These hard cases are hard to solve perfectly with the model alone. Preprocessing that raises input quality and post-processing that filters uncertain output must go together to reach an accuracy that holds up in practice.

Comparison Table: Traditional OCR vs OCR-free

Item	Traditional OCR pipeline	OCR-free / unified model
Structure	multi-stage detect-recognize-layout-extract	single-model end to end
Accumulated error	piles up at each stage	greatly reduced
Context correction	hard (recognition sees only shape)	possible (uses language ability)
Tables/formulas	needs separate rules/models	structured output via training
Multilingual	tends to need per-language model/dictionary	multilingual with one model
Operational complexity	high (manage several models)	low (single model)
Data dependence	per-stage intermediate labels	depends on final answer format
Output accuracy guarantee	per-stage verification easy	verification stage recommended

Deep Dive 9.5: Controlling Output via Prompt Design

A VLM-based OCR-free model produces different output depending on the prompt (instruction) even for the same image. This is powerful flexibility and at the same time a source of variability.

same receipt, different instructions
  "read this receipt"               -> free natural-language summary
  "answer only the total as a number" -> 32500
  "extract items as a JSON array"     -> structured list
  "store name, date, total as key-value" -> clear field mapping

To get stable results in practice, designing prompts concretely and consistently is important.

Specify output format: clearly write the desired format (JSON schema, field list) in the instruction.
Provide examples: showing one or two examples of expected output improves format adherence.
Instruct uncertainty handling: specify to emit an empty value rather than guess when a value cannot be found, reducing hallucination.
Keep the prompt fixed: use the same prompt for the same task to reduce output variability.

The prompt is the cheapest knob to adjust output without retraining the model. But accuracy problems that the prompt alone cannot solve must ultimately be handled with data, fine-tuning, and post-processing. The prompt is a starting point, not a universal solution.
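The four guidelines above can be folded into one prompt-building function so that every call for a given task uses identical wording. The wording of the instructions and the function name are illustrative assumptions; the right phrasing depends on the model you use.

```python
import json

def build_extraction_prompt(fields, example=None):
    """Build a fixed, explicit extraction prompt for a given field list."""
    lines = [
        "Extract the following fields from the document image.",
        "Return a single JSON object with exactly these keys: " + ", ".join(fields) + ".",
        "If a value cannot be found, use null instead of guessing.",
    ]
    if example is not None:
        lines.append("Example output: " + json.dumps(example, ensure_ascii=False))
    return "\n".join(lines)

prompt = build_extraction_prompt(
    ["store", "date", "total"],
    example={"store": "Cafe", "date": "2024-01-05", "total": 32500},
)
print(prompt)
```

Generating the prompt from the same field list that the validator uses keeps the instruction and the schema from drifting apart.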

Practical Application: Receipts, Forms, Documents

Points to consider when applying an OCR-free approach to real work, organized by task.

Receipts/invoices: extract key-values like store name, date, items, total. Defining output as a fixed JSON schema makes post-processing easy. A misread or mis-formatted amount is a serious error, so add validation logic.
Forms/applications: checkboxes, signatures, and handwriting are mixed in. Handwriting recognition is still tricky, so a human-in-the-loop design that routes low-confidence fields to human review is safe.
Table-heavy documents: financial statements, spec sheets, etc. center on preserving table structure. Verify with validation data whether merged cells and headers are captured accurately.
Varied formats: even the same document type differs by vendor. OCR-free models are relatively robust to format variation, but fine-tuning with domain data raises accuracy a lot.

Across all of these tasks, what matters in practice is keeping safeguards such as format validation, confidence thresholds, and human review rather than trusting model output as is. The model can confidently produce wrong values, so the more important the field, the stronger the verification should be.
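For receipts, one inexpensive form of validation logic is an arithmetic check: the line items should add up to the total. The sketch below assumes a hypothetical receipt schema with `total` and `items[].price`, treats commas as thousands separators, and ignores tax or discount lines, all of which would need adapting to real documents.

```python
from decimal import Decimal, InvalidOperation

def parse_amount(text):
    """Parse an amount string such as '32,500'; return None if it is not a number."""
    try:
        return Decimal(str(text).replace(",", "").strip())
    except InvalidOperation:
        return None

def validate_receipt(receipt):
    """Return a list of problems; an empty list means the amounts are consistent."""
    errors = []
    total = parse_amount(receipt.get("total", ""))
    if total is None:
        errors.append("total is not a number")
    item_sum = Decimal(0)
    for item in receipt.get("items", []):
        price = parse_amount(item.get("price", ""))
        if price is None:
            errors.append(f"bad price for {item.get('name')!r}")
        else:
            item_sum += price
    if not errors and item_sum != total:
        errors.append("items do not sum to total")
    return errors

receipt = {
    "total": "32,500",
    "items": [{"name": "set menu", "price": "30,000"}, {"name": "tea", "price": "2,500"}],
}
print(validate_receipt(receipt))
```

A receipt that fails this check is a natural candidate for the human-review path.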

Deep Dive 7: Hybrid of OCR-free and Traditional OCR

Since OCR-free is not a cure-all, in practice a hybrid design mixing traditional OCR and OCR-free often appears. The idea is to use the strengths of each.

hybrid design (example)
  1. quickly extract text and positions with traditional OCR
   |
   v
  2. OCR-free/VLM interprets meaning seeing that text and the image together
   |
   v
  3. cross-check the two paths' results to estimate confidence

Inject OCR result as a hint: feed the text extracted by traditional OCR into the VLM's input to help the model read characters better.
Cross-verification: if the two paths produce the same value, confidence is high; if they differ, hand off to human review. Useful for accuracy-critical financial and legal documents.
Speed-accuracy trade-off: process first with fast OCR and send only hard cases to the heavy VLM to balance cost and accuracy.

The lesson of the hybrid is clear. Rather than new technology fully replacing the old, combining the two to cover weaknesses is often more robust in practice.
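Cross-verification between the two paths can be as simple as comparing field values after light normalization. The sketch below assumes both paths have already been reduced to flat field dictionaries; the normalization (dropping commas and spaces, lowercasing) is a minimal illustrative choice.

```python
def normalize(value):
    """Strip thousands separators and whitespace so '32,500' and '32500' compare equal."""
    return str(value).replace(",", "").replace(" ", "").lower()

def cross_check(ocr_fields, vlm_fields):
    """Accept a field when both paths agree; otherwise route it to human review."""
    result = {}
    for key in sorted(set(ocr_fields) | set(vlm_fields)):
        a, b = ocr_fields.get(key), vlm_fields.get(key)
        if a is not None and b is not None and normalize(a) == normalize(b):
            result[key] = {"status": "agree", "value": b}
        else:
            result[key] = {"status": "human_review", "candidates": [a, b]}
    return result

ocr = {"total": "32,500", "date": "2024-01-05"}
vlm = {"total": "32500", "date": "2024-01-06", "store": "Cafe"}
for field, outcome in cross_check(ocr, vlm).items():
    print(field, outcome)
```

Agreement between two independent paths is evidence, not proof: both can misread the same smudged digit, so high-stakes fields may still warrant sampling for review.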

Deep Dive 8: Getting Structured Output Reliably

When you take structured output (JSON, tables) from an OCR-free model, the model does not always emit a perfect format. You need devices for stable parsing.

structured-output stabilization flow
  [model output string]
   |
   v
  [attempt format parse]  JSON parse, etc.
   |
   +-- success --> [schema validation]  check field presence/type
   |                  |
   |                  +-- pass --> accept result
   |                  +-- fail --> fallback/retry
   |
   +-- fail --> [recovery attempt]  partial extraction or regenerate

Fix the schema: predefine the expected fields and types and validate the output against them.
Retry strategy: if the format breaks, try once more with a different prompt or setting.
Partial acceptance: if only some fields are trustworthy, accept those and hand the rest to a human.

Only with such post-processing can you safely connect model output to a real system. Design on the premise that the model occasionally going out of format is normal.
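The stabilization flow above maps onto three small functions: a lenient parse with a recovery attempt, a schema check that separates trustworthy fields from the rest, and a bounded retry loop. The schema, the regex-based recovery, and the `generate(attempt)` callable are all assumptions made for the sketch; `generate` stands in for a model call that may vary its prompt or settings per attempt.

```python
import json
import re

SCHEMA = {"store": str, "date": str, "total": int}  # hypothetical expected fields and types

def parse_json_lenient(text):
    """Try a strict parse first, then fall back to the outermost {...} span in the text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match is None:
            return None
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None

def validate(data, schema):
    """Split fields into accepted (present, right type) and rejected."""
    if not isinstance(data, dict):
        return {}, list(schema)
    accepted, rejected = {}, []
    for field, expected_type in schema.items():
        value = data.get(field)
        if isinstance(value, expected_type) and not isinstance(value, bool):
            accepted[field] = value
        else:
            rejected.append(field)
    return accepted, rejected

def extract(generate, schema, max_attempts=2):
    """Call generate(attempt) until the output validates or attempts run out."""
    accepted, rejected = {}, list(schema)
    for attempt in range(max_attempts):
        accepted, rejected = validate(parse_json_lenient(generate(attempt)), schema)
        if not rejected:
            break
    return accepted, rejected  # rejected fields go to a human

outputs = [
    'Here is the result: {"store": "Cafe", "date": "2024-01-05", "total": "32500"}',
    '{"store": "Cafe", "date": "2024-01-05", "total": 32500}',
]
print(extract(lambda attempt: outputs[attempt], SCHEMA))
```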

Deep Dive 9: A Cost and Latency View

OCR-free models, especially VLM-based ones, are heavier than traditional OCR, so cost and latency need to be managed deliberately.

cost/latency comparison (intuition)
  traditional OCR    light, fast, cheap
  small OCR-free     medium
  large VLM          heavy, slow, expensive, but flexible/accurate

Tiered routing: send easy documents through a light path and only hard documents to the heavy VLM to lower average cost.
Resolution adjustment: when there is no small text, lower resolution to reduce tokens and cost.
Batch processing: process bulk documents in batches rather than real time to raise throughput.
Caching: when documents of the same form repeat, cache and reuse prompts or intermediate results.

Chase only accuracy and cost becomes unbearable; chase only cost and quality drops. Balancing between the task's accuracy requirement and budget is the core of practical design.
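The effect of tiered routing on average cost is easy to estimate. The sketch below assumes a cascade in which every document first goes through the light path and only those below a confidence threshold are escalated to the heavy model; the unit costs and confidences are invented for illustration.

```python
def cascade_cost(confidences, light_cost, heavy_cost, threshold):
    """Average per-document cost and escalation rate for a light-then-heavy cascade."""
    escalated = sum(1 for c in confidences if c < threshold)
    total = len(confidences) * light_cost + escalated * heavy_cost
    return total / len(confidences), escalated / len(confidences)

# Hypothetical light-path confidences for five documents.
confidences = [0.99, 0.95, 0.60, 0.97, 0.80]
average, rate = cascade_cost(confidences, light_cost=1, heavy_cost=20, threshold=0.9)
print(f"average cost {average}, escalation rate {rate}")
```

With these made-up numbers, the cascade averages 9 cost units per document against 20 for sending everything to the heavy model; raising the threshold trades that saving back for accuracy.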

Deep Dive 12: Additional Challenges of Multilingual Documents

Multilingual documents carry challenges beyond simply recognizing several languages.

Mixed languages: it is common for one document to mix languages (e.g., English body + local-language annotations). The model must handle language boundaries well.
Character direction: when right-to-left scripts or vertical writing mix in, reading order gets more complex.
Number/date formats: date and currency notation differs by region, so normalization after extraction is needed.
Font diversity: fonts vary widely from language to language, and accuracy tends to be lower for low-resource languages where training data covers fewer of them.

stages of multilingual handling
  recognition    various writing systems to text
   |
   v
  structure understanding   grasp layout/tables regardless of language
   |
   v
  normalization  unify region-specific number/date/currency formats

The key is that you must consider not only recognition accuracy but also normalization after recognition to get a usable result in practice. Clarifying the target languages and regions, and preparing data and post-processing rules matched to that distribution, is the starting point of multilingual document handling.
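The normalization stage can be illustrated with dates and amounts. In the sketch below, the region keys and their date formats are hypothetical configuration, and the separators for amounts are passed in explicitly because the same string (for example "1.234") means different things in different regions.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical mapping from a region label to its usual date layout.
DATE_FORMATS = {"US": "%m/%d/%Y", "EU": "%d.%m.%Y", "ISO": "%Y-%m-%d"}

def normalize_date(text, region):
    """Convert a region-specific date string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text.strip(), DATE_FORMATS[region]).date().isoformat()

def normalize_amount(text, decimal_sep=".", group_sep=","):
    """Convert a localized amount string to a Decimal."""
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)

print(normalize_date("03/04/2024", "US"), normalize_date("03.04.2024", "EU"))
print(normalize_amount("1,234.56"), normalize_amount("1.234,56", decimal_sep=",", group_sep="."))
```

The same digits "03" and "04" become March 4 or April 3 depending on the region, which is why the target region has to be known before normalization rather than guessed afterward.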

Pitfalls and Limits

Hallucination: the model can plausibly generate values not in the document. It is especially risky on blurry or empty fields. You need verification to filter ungrounded output.
Rare fonts/languages: accuracy drops on fonts, languages, and domains with little training data. Data augmentation matched to the target distribution matters.
High-resolution cost: raising resolution to read small text spikes tokens and cost. You need a balance that raises only as much as needed.
Output format deviation: structured output can produce results that break the schema. Prepare format validation and fallback in the parsing stage.
Gap between benchmark and reality: even with high benchmark scores, behavior can differ on the diversity and noise of real work documents. Evaluating with your own data is safe.

Deep Dive 10: Considerations When Applying by Domain

When applying an OCR-free model to a specific industry, each domain has different constraints and demands.

Finance: numeric accuracy is non-negotiable. A single-digit error can cause a serious problem, so put strong cross-verification and human review in place. Audit trails for regulatory compliance are also needed.
Healthcare: terms on prescriptions and lab sheets are specialized, and the cost of error is high. Domain-data fine-tuning and conservative confidence thresholds matter.
Legal: long documents, complex layouts, and precise citations are central. Reading order and structure preservation are especially tricky.
Logistics/manufacturing: invoices, labels, and barcodes mix, and forms differ by vendor. A model robust to form variation and post-processing rules are useful.

priorities by domain
  finance     accuracy > speed, strong verification
  healthcare  accuracy > cost, domain fine-tuning
  legal       structure/order preservation, long-document handling
  logistics   handling form diversity, throughput

The common lesson is to tune verification strength in proportion to the domain's cost of error. The more damaging an error is in a domain, the less you should trust raw model output and the more safeguards you should layer on.

Deep Dive 11: Directions Ahead

Document-understanding models are advancing fast. Let us note a few currents.

Longer documents: the ability to handle multiple pages at once is becoming important. Maintaining cross-page context is the challenge.
Efficiency: research to lower high-resolution processing cost with token pruning, adaptive resolution, and the like is active.
Agent integration: there is a current of combining with agents that read a document and take follow-up actions (search, calculate, API call) from the result.
Built-in verification: one line of research has the model emit a confidence for its own output, so that it flags uncertain parts itself.

These directions all aim at one thing: maintaining accuracy while reducing human intervention. For now, though, rather than full automation, a design combining the model with verification and human review is most robust in practice.

Closing

Document AI has shifted its center of gravity from a chain of character-reading stages to a single model that takes an image and produces meaning directly. The OCR-free approach reduces accumulated error, corrects context with language ability, and simplifies operations by unifying detection, recognition, and parsing into one.

Of course it is not a cure-all. Limits such as hallucination, rare domains, and high-resolution cost remain, and you need verification to guarantee output accuracy. Still, the direction of one model seeing characters and meaning together is a powerful current in document understanding, and its scope of application keeps widening with the advance of VLMs. The key is to understand both the strengths and limits of the tool and use it with appropriate safeguards matched to the task's accuracy requirements.

References

Qwen2-VL: Enhancing Vision-Language Model's Perception (arXiv: 2409.12191) — arxiv.org/abs/2409.12191
Attention Is All You Need (arXiv: 1706.03762) — arxiv.org/abs/1706.03762
FlashAttention: Fast and Memory-Efficient Exact Attention (arXiv: 2205.14135) — arxiv.org/abs/2205.14135
Qwen official repository — github.com/QwenLM
Hugging Face Transformers docs — huggingface.co/docs
PyTorch official docs — pytorch.org
vLLM docs (including multimodal serving) — docs.vllm.ai
vLLM repository — github.com/vllm-project/vllm