Prologue — Why OCR is still hard in 2026

In 2026 we still suffer from PDFs. Invoices, contracts, insurance policies, medical records, academic papers, government forms — most business documents humanity has ever produced are still PDFs. Many are scanned images, and even the ones with embedded text mix tables, images, equations, and multi-column layouts so badly that "just extract the text" doesn't work.

LLMs got smart enough — surely document parsing is solved? No. The fact that Mistral OCR, OlmoOCR, Reducto, new LlamaParse versions, Docling 1.0, Marker 1.0, Surya 0.6 all landed within 2025 is itself proof that there is no settled answer. Nobody one-shotted it.

This post maps the document AI ecosystem as of May 2026 across 13 candidates. Where each is strong, where each falls apart, and what to pick if your team is doing invoices, contracts, papers, or RAG ingestion.

1. The 2026 document AI map — a four-stage pipeline

Treating document AI as a single blob makes tool selection impossible. You have to split it into stages. We use a four-stage model.

| --- | --- | --- | --- |

Each stage can be solved by a separate tool, or one tool can absorb several. The 2026 trend is clear — **stages are merging**.

- Mistral OCR, OlmoOCR, Marker, Docling — OCR plus Layout in one shot

- LlamaParse, Unstract, Reducto — add Extraction on top

- Multimodal LLMs (Pixtral, Florence-2, Claude, GPT-4o) — try to absorb all four stages

Merging isn't a pure win. When one stage breaks inside a merged tool, it's harder to see where. So we run both in parallel: a **stage-by-stage pipeline** and **end-to-end LLM calls**. Which wins depends on the domain.
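To make the visibility argument concrete, here is a minimal sketch of a stage-by-stage runner that records which stage failed. The stage names and the toy stage functions are placeholders invented for illustration, not any tool's API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StageReport:
    """Outcome of one stage, kept so a failure can be traced to its stage."""
    name: str
    ok: bool
    error: str = ""


def run_stages(page: Any, stages: list[tuple[str, Callable[[Any], Any]]]):
    """Run stages in order; stop at the first failure and report where it happened."""
    reports: list[StageReport] = []
    value = page
    for name, stage in stages:
        try:
            value = stage(value)
        except Exception as exc:  # record the failing stage instead of crashing the batch
            reports.append(StageReport(name, ok=False, error=str(exc)))
            return None, reports
        reports.append(StageReport(name, ok=True))
    return value, reports


# Hypothetical toy stages standing in for real OCR / layout / extraction tools.
def toy_ocr(image_name: str) -> str:
    return "Invoice 42 Total 100.00"


def toy_layout(text: str) -> dict:
    return {"blocks": [{"type": "body", "text": text}]}


def toy_extract(doc: dict) -> dict:
    words = doc["blocks"][0]["text"].split()
    return {"total": float(words[words.index("Total") + 1])}


PIPELINE = [("ocr", toy_ocr), ("layout", toy_layout), ("extraction", toy_extract)]
result, reports = run_stages("page-1.png", PIPELINE)
```

An end-to-end LLM call gives you only the final value; the per-stage reports are what you give up when stages merge.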

Key insight — **no single model wins all of OCR in 2026.** What's good on invoices breaks on academic papers; what's good in Korean breaks in Japanese. There is a domain-specific best.

2. Mistral OCR (March 2025) — the new standard API

In March 2025 Mistral shipped a product literally named "OCR API." The industry was surprised — why is an LLM company shipping a standalone OCR? The answer was, "to do RAG with our own models we need a good PDF parser, and nothing on the market was good enough."

Three things Mistral OCR changed.

- **Speed first** — roughly 2,000 pages/minute in batch mode. An order of magnitude faster than open-source baselines like Marker.

- **Returns layout + equations + tables + image positions, all as markdown** — not just text, but structure.

- **Per-page pricing** — about $1 per 1,000 pages. Half to a third of cloud OCR (Document AI, Textract).

The API is plain.

    from mistralai import Mistral

    client = Mistral(api_key="...")

    response = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": "https://example.com/contract.pdf"},
        include_image_base64=True,
    )

    # Each page carries its own markdown string.
    for page in response.pages:
        print(page.markdown)

The return value is per-page markdown. Tables come as GFM tables, equations as LaTeX, images embedded as base64. One call gives you markdown plus images plus page metadata.
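Because the result is per-page, a RAG pipeline usually has to stitch pages back together. The sketch below shows one way to do that while keeping page markers; `Page` is a stand-in dataclass with only the two fields used here, not the SDK's response type, and the HTML-comment marker is our own convention.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Stand-in for one OCR result page (only the fields used here)."""
    index: int
    markdown: str


def merge_pages(pages: list[Page]) -> str:
    """Join per-page markdown into one document, keeping a page marker
    as an HTML comment so chunks can be traced back to their page later."""
    parts = []
    for page in sorted(pages, key=lambda p: p.index):
        parts.append(f"<!-- page {page.index} -->\n{page.markdown.strip()}")
    return "\n\n".join(parts) + "\n"


# Pages may arrive out of order when requests are batched; sort before joining.
pages = [
    Page(1, "| a | b |\n| --- | --- |\n| 1 | 2 |"),
    Page(0, "# Contract\n\nClause 1."),
]
document = merge_pages(pages)
```

The markers survive markdown rendering invisibly, and a chunker can split on them if page-level chunks are wanted.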

Weaknesses — non-Latin scripts like Korean and Japanese aren't as stable as English/French (late-2025 saw additional Korean training, but reports still place it slightly below Naver CLOVA OCR). And it's a closed API — no self-hosting.

When to pick it — **English-heavy PDFs, fast markdown conversion, and you can afford API costs.** Especially when you need to ship a RAG ingestion pipeline in a week, this is almost always the first try.

3. Marker (Vik Paruchuri) — the open-source PDF-to-Markdown champion

Marker is an open-source project by Vik Paruchuri (also founder of Datalab). It is the best-known open-source "PDF to markdown" tool, and has crossed 30k GitHub stars (as of May 2026). Before Mistral OCR, it was effectively the only good open-source answer for PDF-to-markdown conversion.

Marker's stack.

- **Surya** — the multilingual OCR + layout model by the same author (next chapter)

- **Layout model** — DocLayoutYOLO-style block classification

- **Markdown converter** — recognises tables, equations, code blocks and converts to GFM/LaTeX

- **Optional LLM refinement** — re-asks an LLM about ambiguous regions for polish

Usage is one line.

    pip install marker-pdf

    marker_single contract.pdf --output_dir output/ --output_format markdown

Batch processing, worker count, GPU/CPU choice, and LLM refinement are all CLI flags.

Marker's strengths are **equations and tables**. On academic papers full of LaTeX math, it often produces cleaner output than Mistral OCR. And it self-hosts — on a single-GPU machine you'll see 30 to 100 pages per minute. A good fit for enterprises that can't send data out.

The weakness is speed. CPU-only takes 5 to 30 seconds per page. Even on GPU it's slower than Mistral OCR.

When to pick it — **you can't send PDFs to a third party (security), you handle academic / financial documents with lots of equations and tables, or you have GPU capacity to spare.**

4. Surya — multilingual OCR + layout + reading order

Surya is the OCR module inside Marker, but it stands alone too. Same author (Vik Paruchuri), and it currently holds the spot of "best multilingual OCR in open source."

Four things Surya does in one model.

- **Text detection** — find glyph regions on a page

- **Text recognition (OCR)** — 90 languages including Korean, Japanese, Chinese

- **Layout analysis** — classify headings / body / tables / images / captions

- **Reading order** — re-sort multi-column text, image captions, footnotes into the human reading order

That last one matters more than people think. Newspapers and academic papers have multiple columns and sidebars; footnotes and captions weave into the body. A simple top-left to bottom-right scan tangles the text. Surya uses a transformer to predict the reading order the way a human would.
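A small sketch shows how the naive scan tangles a two-column page. The column-aware function is a simple midpoint heuristic written for illustration; it is not how Surya works (Surya predicts the order with a model), and the block coordinates are made up.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def naive_order(blocks):
    """Top-to-bottom, then left-to-right: interleaves the two columns."""
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def two_column_order(blocks, page_width):
    """Heuristic: read the whole left column, then the whole right column.
    A block belongs to the column that contains its horizontal center."""
    mid = page_width / 2
    left = [b for b in blocks if (b.x0 + b.x1) / 2 < mid]
    right = [b for b in blocks if (b.x0 + b.x1) / 2 >= mid]
    return sorted(left, key=lambda b: b.y0) + sorted(right, key=lambda b: b.y0)


# Two columns, two paragraphs each, on a 600-unit-wide page.
blocks = [
    Block(50, 100, 290, 140, "L1"),
    Block(310, 100, 550, 140, "R1"),
    Block(50, 150, 290, 190, "L2"),
    Block(310, 150, 550, 190, "R2"),
]
naive = [b.text for b in naive_order(blocks)]                            # L1, R1, L2, R2
column_aware = [b.text for b in two_column_order(blocks, page_width=600)]  # L1, L2, R1, R2
```

The heuristic already fails on full-width headings, sidebars, and footnotes, which is the gap a learned reading-order model is meant to close.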

    from surya.recognition import RecognitionPredictor
    from surya.detection import DetectionPredictor
    from PIL import Image

    image = Image.open("page.png")

    det = DetectionPredictor()
    rec = RecognitionPredictor()

    # The call signature differs between Surya releases; check the README of the version you install.
    predictions = rec([image], det_predictor=det)

    for line in predictions[0].text_lines:
        print(line.bbox, line.text)

Surya's Korean / Japanese quality is at the top of open source as of 2026. Slightly below Naver CLOVA OCR or Google Document AI's Korean / Japanese processors, but the fact that it's a self-hostable OSS model is decisive.

When to pick it — **multilingual (especially CJK) PDFs that need self-hosting.** Most people use it via Marker, but you can call it directly when you only need OCR.

5. LlamaParse (LlamaIndex) — a parser tuned for RAG

LlamaParse is a PDF parser by the LlamaIndex team. Its starting point is different — "produce good input for RAG." So its markdown output is designed to mesh well with LlamaIndex's chunkers.

LlamaParse highlights.

- **Three modes** — fast (cheap), balanced (default), premium (LLM refinement, best table accuracy)

- **Tables as accurate GFM** — premium mode is widely reported as the strongest at tables

- **Image position preserved** — image placeholders in the markdown

- **Per-page pricing** — first 1,000 pages free on fast, about $3 / 1,000 on premium

The API is plain.

    from llama_parse import LlamaParse

    parser = LlamaParse(
        api_key="...",
        result_type="markdown",
        premium_mode=True,  # mode selection has changed across releases; check your installed version
    )

    docs = parser.load_data("contract.pdf")

    for doc in docs:
        print(doc.text)

A distinctive feature is **parsing instructions** — you can pass natural-language hints like "this is an insurance policy, treat clause numbers as headings." Internally an LLM ingests the hint and adjusts the parse.

Weaknesses — cost and being a closed API. Premium at $3 per 1,000 pages is 3x Mistral OCR. And no self-hosting.
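The price gap is easy to put in numbers. The sketch below uses only the per-1,000-page prices quoted in this post; the page volume is a made-up example.

```python
def cost_usd(pages: int, usd_per_1000_pages: float) -> float:
    """Linear per-page pricing: pages / 1000 * price per 1,000 pages."""
    return pages / 1000 * usd_per_1000_pages


# USD per 1,000 pages, as quoted in this post (verify against current price lists).
PRICES = {"mistral-ocr": 1.0, "llamaparse-premium": 3.0}

pages = 250_000  # hypothetical monthly volume
estimate = {name: cost_usd(pages, price) for name, price in PRICES.items()}
# mistral-ocr: 250.0 USD, llamaparse-premium: 750.0 USD
```

At that volume the difference is 500 USD a month, which is small next to engineering time if premium mode saves manual table cleanup, and large if it doesn't.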

When to pick it — **you already run LlamaIndex-based RAG and your documents are table-heavy.** LlamaParse → SemanticSplitterNodeParser → vector index is a smooth flow.

6. Docling (IBM, open source) — the enterprise all-in-one

Docling is an open-source document parser released in late 2024 by IBM Research. It packages parsing knowhow IBM accumulated while building "watsonx Discovery," their internal RAG product.

Three things Docling brings.

- **Rich output formats** — not just markdown but a native JSON format ("DoclingDocument") that preserves coordinates, block IDs, paragraphs, table cells

- **TableFormer** — IBM's separately trained table-structure model. SOTA on ICDAR benchmarks

- **Local + GPU options** — self-hostable, runs on CPU too

Docling in code.

    import json

    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert("contract.pdf")

    print(result.document.export_to_markdown())

    # or as a JSON structure
    print(json.dumps(result.document.export_to_dict(), indent=2))

DoclingDocument JSON is a tree. Under body sit sections, under sections sit paragraphs / tables / figures, and each node carries a page number and bbox. That matters — with plain markdown you can't trace "where in the PDF did this sentence come from," but Docling JSON lets you back-reference down to page and coordinates. Drop it straight into citations for RAG.
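As a rough sketch of that back-referencing, the code below walks a tree of the shape described above and returns page plus bbox for every node matching a query. The dict layout (`title`, `children`, `text`, `page`, `bbox`) is a simplified, hypothetical stand-in; the real DoclingDocument export uses its own field names, so adapt the accessors to what `export_to_dict()` actually returns.

```python
from typing import Iterator


def iter_text_nodes(node: dict, path: tuple = ()) -> Iterator[tuple]:
    """Depth-first walk that yields (section_path, node) for every node carrying text."""
    here = path + ((node["title"],) if "title" in node else ())
    if "text" in node:
        yield here, node
    for child in node.get("children", []):
        yield from iter_text_nodes(child, here)


def find_citations(tree: dict, needle: str) -> list[dict]:
    """Return section, page, and bbox for every text node containing `needle`."""
    hits = []
    for path, node in iter_text_nodes(tree):
        if needle.lower() in node["text"].lower():
            hits.append({"section": " > ".join(path), "page": node["page"], "bbox": node["bbox"]})
    return hits


# Simplified, hypothetical tree; not the actual DoclingDocument schema.
tree = {
    "children": [
        {"title": "5. Termination", "children": [
            {"text": "Either party may terminate with 30 days notice.",
             "page": 12, "bbox": [72, 540, 523, 560]},
        ]},
        {"title": "6. Liability", "children": [
            {"text": "Liability is capped at the fees paid.",
             "page": 13, "bbox": [72, 700, 523, 720]},
        ]},
    ]
}
citations = find_citations(tree, "30 days notice")
```

In a real RAG setup the lookup would go the other way: store page and bbox as chunk metadata at ingestion, then return them with each retrieved chunk.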

Weaknesses — speed. Slightly slower than Marker, and CPU runs can exceed 10 seconds per page. Korean and Japanese aren't as mature as English.

When to pick it — **you need self-hosting and your RAG needs exact citation (which page, which region) for compliance.** A great fit for enterprise compliance settings.

7. DocLayoutYOLO — layout detection, fast

DocLayoutYOLO is not OCR. It does **layout only**. Feed it a page image and it returns bounding boxes like "this region is a title, this is body, this is a table, this is a figure."

Why a separate model when every other tool embeds layout? **Speed and composition freedom.** DocLayoutYOLO is YOLOv10-based and handles a page in roughly 10 ms on GPU. An order of magnitude faster than the layout pieces inside Marker or Docling.

Combine DocLayoutYOLO with your OCR of choice (Tesseract, Surya, PaddleOCR) and you have your own pipeline.

    from doclayout_yolo import YOLOv10

    model = YOLOv10("doclayout_yolo_docstructbench_imgsz1024.pt")

    results = model.predict("page.png", imgsz=1024, conf=0.2)

    # One box per detected region: class id plus corner coordinates.
    for box in results[0].boxes:
        print(box.cls, box.xyxy)
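Composing a layout model with a separate OCR engine mostly comes down to geometry: decide which region each OCR line belongs to. The sketch below is one simple way to do it (center-point containment); the `Region` type, the coordinates, and the sample lines are all invented for illustration and assume both tools report boxes in the same pixel space.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    label: str
    box: tuple  # (x0, y0, x1, y1) in page pixels
    lines: list = field(default_factory=list)


def contains(box, x, y) -> bool:
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def assign_lines(regions, ocr_lines):
    """Attach each OCR line to the first region containing the line's center.
    Lines that fall outside every region are returned as orphans."""
    orphans = []
    for box, text in ocr_lines:
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        for region in regions:
            if contains(region.box, cx, cy):
                region.lines.append(text)
                break
        else:  # no region matched
            orphans.append(text)
    return orphans


regions = [Region("title", (50, 40, 550, 90)), Region("table", (50, 300, 550, 500))]
ocr_lines = [
    ((60, 50, 400, 80), "Quarterly Report"),
    ((60, 310, 200, 330), "Item Qty"),
    ((60, 340, 200, 360), "Widget 3"),
    ((60, 700, 200, 720), "Page 1"),
]
orphans = assign_lines(regions, ocr_lines)
```

Orphans are worth logging rather than dropping: a high orphan rate usually means the layout model missed a region or the two coordinate spaces are out of sync.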

When to pick it — **you only need layout, or you're already running OCR with another model and want layout separately and fast.** A great match for high-volume pipelines that need to keep every stage's GPU saturated.

8. Nougat (Meta) — academic papers only

Nougat is Meta's 2023 OCR model for academic papers. The name decodes to "Neural Optical Understanding for Academic Documents." It was trained on roughly 8 million pages, most of them from arXiv papers, and it excels at **academic PDFs full of equations and tables**.

Nougat's output is markdown plus LaTeX equations. While other OCRs leave equation regions as images or as broken text, Nougat reconstructs the LaTeX. Matrices, integrals, summations — all of it.

    pip install nougat-ocr

    nougat path/to/paper.pdf -o out --model 0.1.0-base

Weakness — **it breaks on anything but academic papers.** On invoices, contracts, general PDFs you often get garbage. The training distribution is academic.

It also hallucinates. Nougat is an encoder-decoder transformer, so it can generate "plausible LaTeX," sometimes producing equations that aren't in the original or rewriting them wrong. Even in the academic domain you need verification.

When to pick it — **batch-converting arXiv-style papers to markdown / LaTeX.** Outside that, another tool is almost always better.

9. OlmoOCR (Allen AI, February 2025) — open weights, cloud quality

OlmoOCR (styled olmOCR by AI2) is an open-weight OCR model released in February 2025 by the Allen Institute for AI (AI2). The name follows AI2's OLMo line of open LLMs, although the model itself is a fine-tuned vision-language model.

The shock OlmoOCR delivered was **open weights, but GPT-4o-class OCR quality.** On AI2's own benchmark (olmOCR-Bench) it lands at par with GPT-4o and Mistral OCR, and ahead on certain domains.

Training is unusual. AI2 took Qwen2-VL-7B as the base vision-language model and fine-tuned it on roughly 260,000 PDF pages paired with ground-truth text. The labels were generated with GPT-4o and reviewed by humans.

    import torch
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16
    )

    processor = AutoProcessor.from_pretrained("allenai/olmOCR-7B-0225-preview")

    # ... inference with image + prompt

Inference fits on a single GPU since it's 7B. About 1 second per page on A100. Batch with vLLM or SGLang for throughput.

Weaknesses: it's 7B, and at 2 bytes per parameter the bf16 weights alone are about 14 GB, so plan for a 16+ GB GPU unless you quantize. It's also a VLM rather than a pure OCR engine, which sometimes makes it "want to obey the prompt and summarise." System prompts need tuning.

When to pick it — **you can't expose data to cloud APIs but need cloud-grade quality, and you have GPU.** Open weights give you self-hosting freedom and no licence drag.

10. Tesseract 5 / LayoutLMv3 / Donut — the classics and pretrained models

Don't only chase the new tools. The classics are still alive.

Tesseract 5

Tesseract started at HP in 1985, was open-sourced by HP in 2005, and has been sponsored by Google since 2006. The LSTM recognition engine arrived in 4.0 (2018) and brought a big accuracy jump; 5.x (2021 onward) refined it. 100+ languages, fast even on CPU, free.

    tesseract scan.png output -l kor+eng --psm 6

The weakness is clear — it doesn't see layout. It only emits text. It can't tell a table is a table or a heading is a heading. So in 2026 Tesseract is used as "Tesseract + a separate layout model" — or as a fallback pipeline for clean single-column PDFs.

When — **offline, ultra-cheap, single-column documents.**

LayoutLMv3 (Microsoft)

LayoutLMv3 is a pretrained document understanding model from Microsoft Research, released 2022. Text + coordinates + image patches are fed jointly into one transformer at pretraining. So it shines on classification / extraction tasks like "this region is a header, that region is a totals field."

It doesn't do OCR itself — it takes OCR output (text + coords) and only does classification / extraction. So "Tesseract / Azure OCR + fine-tuned LayoutLMv3" is the common combo. Strong on IE benchmarks like FUNSD, CORD, SROIE.
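One practical detail in that combo: the LayoutLM family expects each word's bounding box scaled to a 0 to 1000 grid, independent of the page's pixel size. A minimal sketch of the conversion, with made-up OCR words and a 612 x 792 page as the example:

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 grid
    that LayoutLM-style models take as layout input."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output: (word, pixel box) pairs from any OCR engine.
ocr_words = [("Total", (306, 396, 400, 420)), ("100.00", (410, 396, 612, 420))]
page_width, page_height = 612, 792

words = [w for w, _ in ocr_words]
boxes = [normalize_bbox(b, page_width, page_height) for _, b in ocr_words]
```

`words` and `boxes` are then the two parallel lists a LayoutLMv3 processor or tokenizer takes when OCR is supplied externally.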

When — **invoices / receipts / forms where the field set is fixed, and you can fine-tune with your own data.**

Donut (NAVER CLOVA AI)

Donut is an "OCR-free" document understanding model released 2022 by NAVER CLOVA AI. The name shortens "Document Understanding Transformer." Unlike everything else, Donut **skips the OCR stage**. Image goes into the transformer, JSON / markdown comes out.

Why does that matter? Traditional pipelines are OCR → layout → extraction in three stages, and errors compound. Donut does it in one model. Especially strong on CORD (receipts), DocVQA, TICKET.
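The compounding is simple arithmetic. If each stage is right 95 percent of the time and errors are independent (both numbers and the independence are assumptions for illustration), three stages in a row are right only about 86 percent of the time:

```python
import math


def end_to_end_accuracy(stage_accuracies):
    """Probability that every stage is correct, assuming independent errors."""
    return math.prod(stage_accuracies)


three_stage = end_to_end_accuracy([0.95, 0.95, 0.95])  # 0.95 ** 3, about 0.857
one_stage = end_to_end_accuracy([0.95])                # a single model at the same per-stage rate
```

A single model does not get the per-stage number for free, but it has only one place to lose accuracy instead of three.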

The weakness is that it needs domain fine-tuning. Stock Donut on arbitrary PDFs is poor. With a few thousand pages of your domain, it shines.

When — **one document type repeated hundreds of thousands of times (Korean receipts, US W-2s), and you can collect training data.**

11. Multimodal LLMs — Pixtral 12B / Florence-2 / KOSMOS-2.5

A new 2025–2026 current is **just throwing a PDF page at a multimodal LLM**. No separate OCR, no separate layout model, no separate extractor. The VLM does it.

Pixtral 12B (Mistral, September 2024)

Pixtral 12B is Mistral's first multimodal model. Image in, text out. Hand it a PDF page image with "transcribe this page to markdown" and you get markdown. Before Mistral OCR existed, many users were doing OCR with Pixtral.

Florence-2 (Microsoft, June 2024)

Florence-2 is a small vision model (0.23B / 0.77B). It does OCR, captioning, object detection, and region extraction in one model. The "small but strong" concept — great for edge / on-device.

KOSMOS-2.5 (Microsoft, September 2023)

KOSMOS-2.5 is a multimodal model purpose-built for text-rich images. Screenshots and document images go in, markdown comes out. Academically influential, but in production OlmoOCR has gradually taken its slot.

Claude / GPT-4o / Gemini 1.5/2.0

Commercial multimodal LLMs are widely used for PDF parsing too. Throw a page image with "convert to markdown" — done. Quality is good, but token cost is high (thousands of tokens per page), and it falls over on large page counts.

**One-line takeaway** — multimodal LLMs win on small page counts, complex layouts, and structured (JSON) extraction. For bulk OCR, Mistral OCR / OlmoOCR / Marker dominate on cost and speed.

12. Korean — Naver CLOVA OCR and the Korean domain

Korean documents have weak points across global tools. Hangul morphology, mixed Hanja, mixed horizontal / vertical text, and Korean-specific table styles all bite. That's why Naver CLOVA OCR is broadly recognised as the strongest in the Korean domain.

CLOVA OCR features.

- **Korean accuracy above 95 percent** (Naver's own benchmarks, printed text)

- **Templates for IDs, passports, vehicle registration — domain specialised**

- **Table recognition tuned for Korean-style tables (merged cells, dotted borders)**

- **Per-call pricing** (free tier up to 10,000 calls per month)

headers = {"X-OCR-SECRET": "..."}

files = {"file": open("doc.pdf", "rb")}

payload = {

"message": json.dumps({

"version": "V2",

"requestId": "uuid",

"timestamp": 0,

"images": [{"format": "pdf", "name": "doc"}],

})

}

r = requests.post("https://...apigw.ntruss.com/custom/v1/.../general", headers=headers, files=files, data=payload)

Alternatives — **PaddleOCR** (strong on Chinese and Korean) and **EasyOCR** (80 languages) are common open-source picks. Surya's Korean has improved a lot too (since 0.5 in 2025).

When to pick CLOVA — **bulk processing of Korean business documents** (tax invoices, business registration certificates, receipts, bank statements). The domain templates hit directly.

13. Japanese — Yomi-toku, Google Cloud Document AI Japanese

Japanese has different pains from Korean. Mixed horizontal / vertical writing, furigana (ruby text), mixed kanji + hiragana + katakana, and a high share of handwriting.

Yomi-toku

Yomi-toku is a Japanese-specialised open-source OCR. Strong on Japanese print, vertical writing, and furigana. Adopted in some Japan Digital Agency projects, which raised its profile.

    pip install yomitoku

    yomitoku scan.pdf -f json -o results

Google Cloud Document AI Japanese processors

Google Cloud's Document AI is the most stable global cloud OCR for Japanese. Dedicated processors exist for Japanese invoices (請求書), quotations (見積書), and receipts (領収書). Japanese SaaS / fintech companies adopt it often.

Others

- **Tegaki** — OSS library for handwriting

- **NDL OCR** — model trained by Japan's National Diet Library, strong on classical Japanese and vertical writing

When to pick what — **Korea-only business docs → CLOVA; Japan-only → Document AI Japanese processors; neither fits → start with Surya / OlmoOCR and fine-tune on your domain.**
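That routing rule can be automated with a cheap script check on a text sample (from an embedded text layer or a fast first-pass OCR). The sketch below counts Hangul and kana code points; the 20 percent threshold and the route labels are arbitrary choices for illustration, and kanji-only Japanese text would not be detected this way.

```python
def script_shares(text: str) -> dict:
    """Share of non-space characters that are Hangul or Japanese kana."""
    counts = {"hangul": 0, "kana": 0, "other": 0}
    for ch in text:
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul syllables and jamo
            counts["hangul"] += 1
        elif 0x3040 <= cp <= 0x30FF:                           # hiragana and katakana
            counts["kana"] += 1
        elif not ch.isspace():                                 # Latin, digits, kanji, punctuation
            counts["other"] += 1
    total = sum(counts.values()) or 1
    return {name: n / total for name, n in counts.items()}


def pick_ocr(sample_text: str, threshold: float = 0.2) -> str:
    """Route by dominant script; labels are just names for the options in this post."""
    shares = script_shares(sample_text)
    if shares["hangul"] >= threshold:
        return "clova"
    if shares["kana"] >= threshold:
        return "document-ai-ja"
    return "surya-or-olmocr"


routes = [pick_ocr(s) for s in ("세금계산서 공급가액 합계", "請求書のご案内です", "Invoice total due")]
```

Kana is used as the Japanese signal because kanji code points are shared with Chinese; a production router would likely add a language-ID model on top.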

14. Who should pick what — four scenarios

You don't need to memorise 13 tools. Compress them by scenario.

Scenario A — Bulk invoice / receipt processing (IE)

Documents with a fixed field set (vendor, amount, date, line items) processed at volume, output as JSON.

- **Recommended stack** — Mistral OCR / Docling for markdown → LLM (Claude 3.5 / GPT-4o) for JSON extraction

- **Need self-hosting** — Marker → vLLM-hosted LLM

- **Hundreds of thousands plus** — fine-tune Donut on your data

Scenario B — Contract analysis

100–300 page PDFs where you need to find and compare clauses.

- **Recommended stack** — LlamaParse premium / Docling for structure-preserving extraction → chunking → RAG

- **Citation matters** — Docling JSON (page + coords) for accurate citation

Scenario C — Academic papers (arXiv, journals)

Equation-heavy, table-heavy PDFs processed in bulk.

- **Recommended** — Marker (equations) or Nougat (arXiv-style only)

- **After** — RAG / search index over LaTeX-preserved markdown

Scenario D — RAG ingestion pipeline (mixed bag)

Hundreds of varied PDFs land daily and need to enter the search index.

- **Quick and easy** — Mistral OCR API + automatic chunking

- **Self-hosted** — Docling + custom chunking policy

- **Quality first, budget allows** — LlamaParse premium

Key — **no single tool finishes the job.** When your domain is mixed, build a fallback chain. Try Mistral OCR first, then Marker on failure, then multimodal LLM on second failure.
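A fallback chain is a short loop. In this sketch the three parsers are toy stand-ins (one raises, one returns nothing useful, one succeeds) so the control flow can be seen without any API keys; in practice each would wrap a real tool call, and `is_good` would be a real quality check.

```python
from typing import Callable, Optional

Parser = Callable[[str], Optional[str]]


def parse_with_fallback(pdf_path: str, chain: list, is_good: Callable[[str], bool]):
    """Try each (name, parser) in order; return (name, markdown) from the
    first parser whose output passes the quality check."""
    for name, parser in chain:
        try:
            markdown = parser(pdf_path)
        except Exception:
            continue  # treat an exception like any other failed attempt
        if markdown is not None and is_good(markdown):
            return name, markdown
    raise RuntimeError(f"all parsers failed for {pdf_path}")


# Toy stand-ins for the real tools, named after the chain suggested above.
def fake_mistral_ocr(path: str) -> str:
    raise TimeoutError("simulated API timeout")


def fake_marker(path: str) -> str:
    return ""  # simulated empty output


def fake_multimodal_llm(path: str) -> str:
    return "# Title\n\nBody text."


CHAIN = [("mistral-ocr", fake_mistral_ocr), ("marker", fake_marker), ("multimodal-llm", fake_multimodal_llm)]
tool, markdown = parse_with_fallback("mixed.pdf", CHAIN, is_good=lambda md: bool(md.strip()))
```

Returning the tool name alongside the markdown matters: it is the only way to learn later which documents needed the expensive last resort.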

15. Building your team's first pipeline — a concrete recipe

Theory is enough. Let's say you want to convert 10,000 PDFs to markdown in a week. The steps are:

Step 1 — **Sample 100 pages, profile the domain.** Lots of tables? Lots of equations? Lots of non-Latin script? Any handwriting? If the domain is mixed, accept up front that one tool won't solve it.

Step 2 — **Process the same 100 pages with 2–3 tools.** Candidates are Mistral OCR, Marker, Docling, LlamaParse — pick 2–3 by domain fit. Use an in-house evaluation script to measure GFM table accuracy, heading extraction, equation preservation.

Step 3 — **Winner becomes main, runner-up becomes fallback.** Run the main tool, and when a page fails (zero blocks, broken table, low confidence) auto-reprocess with the fallback.

Step 4 — **Keep metadata consistent.** Markdown body plus (page number, bbox, tool name, version, confidence) metadata. You need this to later compare which tool did well where.

Step 5 — **Keep LLM extraction as a separate step.** Don't conflate OCR / parsing with IE (field extraction). Make extraction a Claude / GPT-4o / Gemini call that takes markdown and emits JSON. That way model swaps are easy.
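The failure test in Step 3 can start very simple. The sketch below checks the three signals named there: empty output, a confidence score under a threshold, and a GFM table whose rows disagree on column count. The 0.6 threshold is an arbitrary placeholder, many tools report no confidence at all, and escaped pipes inside cells are not handled.

```python
from typing import Optional


def page_failure(markdown: str, confidence: Optional[float] = None,
                 min_confidence: float = 0.6) -> Optional[str]:
    """Return a failure reason, or None if the page looks acceptable."""
    if not markdown.strip():
        return "zero blocks"
    if confidence is not None and confidence < min_confidence:
        return "low confidence"
    widths = []  # column counts of the table currently being scanned
    for line in markdown.splitlines() + [""]:  # trailing "" flushes the last table
        row = line.strip()
        if len(row) > 1 and row.startswith("|") and row.endswith("|"):
            widths.append(len(row.strip("|").split("|")))
            continue
        if len(set(widths)) > 1:  # rows of one table disagree on width
            return "broken table"
        widths = []
    return None


good = "# Totals\n\n| item | qty |\n| --- | --- |\n| widget | 3 |"
bad = "| item | qty |\n| --- | --- |\n| widget | 3 | 4 |"
reasons = [page_failure(""), page_failure(good), page_failure(bad), page_failure("text", confidence=0.3)]
```

Pages with a non-`None` reason are the ones to send to the fallback tool; the reason string is worth storing with the page metadata.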

As a diagram.

    [10,000 PDFs]
          ↓
    [Sample 100]
          ↓
    [Mistral OCR / Marker / Docling in parallel]
          ↓
    [In-house eval script picks the tool]
          ↓
    [Main tool runs on remaining 9,900, fallback on failure]
          ↓
    [Markdown + metadata stored]
          ↓
    [Optional LLM step to extract JSON fields]
          ↓
    [Vector index / DB]

Run that sequence and you'll have a stable pipeline in a week.
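For the "markdown + metadata stored" box, one JSON line per page is often enough. The record below carries the fields listed in Step 4; the class name, field names, and sample values are our own choices for illustration, not a format any of the tools emit.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PageRecord:
    """One parsed page plus the provenance needed to compare tools later."""
    doc_id: str
    page: int
    tool: str
    tool_version: str
    markdown: str
    bbox: Optional[list] = None          # page or block coordinates, if the tool reports them
    confidence: Optional[float] = None   # None when the tool gives no score


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines; ensure_ascii=False keeps CJK text readable."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


records = [
    PageRecord("contract-001", 1, "mistral-ocr", "example-version", "# 계약서\n\n제1조 ..."),
    PageRecord("contract-001", 2, "marker", "example-version", "| a | b |", confidence=0.91),
]
jsonl = to_jsonl(records)
```

Keeping `tool` and `tool_version` on every page is what makes a later question like "which pages went through the fallback, and were they worse?" answerable with a single query.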

Epilogue — Document AI in 2026 still has a long road

Document AI has just crossed the "usable" threshold. As recently as 2024 it was nearly impossible to cleanly extract a table from a PDF. In 2026 you can pick Mistral OCR / Marker / Docling and it mostly works. A huge jump in two years.

But it's not done.

- **Long-document context** — even after you've extracted a 100-page contract page-by-page into markdown, "how does clause 5 amend clause 12" is a separate cross-page problem.

- **Handwriting and old documents** — Yomi-toku and NDL OCR exist for specialised cases, but generic handwriting is still hard.

- **Charts and diagrams** — bar charts, circuit diagrams, medical charts are not OCR but a different problem. Florence-2 and GPT-4o try, but they aren't there.

- **Verification and hallucination** — LLM / VLM-based OCR hallucinates. In mission-critical domains (law, medicine, finance) you need a verification layer.
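A verification layer does not have to be elaborate to be useful. One cheap approach, sketched here under the assumption that you have a second, independent read of the same page (an embedded text layer, or a different OCR engine), is to measure token-level agreement and to flag every number in the model's output that the second read does not contain. The sample strings are invented.

```python
import re
from difflib import SequenceMatcher

NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")  # 42, 1,250, 1,250.00


def tokens(text: str) -> list:
    """Lowercased whitespace tokens, so layout differences don't count as errors."""
    return text.lower().split()


def agreement(candidate: str, reference: str) -> float:
    """Similarity ratio in [0, 1] between the two token sequences."""
    return SequenceMatcher(None, tokens(candidate), tokens(reference)).ratio()


def unsupported_numbers(candidate: str, reference: str) -> list:
    """Numbers present in the candidate but absent from the reference read."""
    known = set(NUMBER.findall(reference))
    return [n for n in NUMBER.findall(candidate) if n not in known]


vlm_output = "Total due: 1,250.00 USD by 2026-05-31"
text_layer = "Total due: 1,280.00 USD by 2026-05-31"

score = agreement(vlm_output, text_layer)
suspects = unsupported_numbers(vlm_output, text_layer)  # the amount that differs
```

High overall agreement with a non-empty `suspects` list is exactly the case a human should look at: the page reads fine, and one figure is different.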

You don't need to memorise all 13 tools. **Split into stages, group into scenarios, and accept the answer that the domain gives you.** Mistral OCR dominates invoices but loses to Marker on academic papers. That's normal. Document AI has no single "one shot."

References

- Mistral OCR launch (March 2025): [https://mistral.ai/news/mistral-ocr](https://mistral.ai/news/mistral-ocr)

- Marker GitHub (Vik Paruchuri): [https://github.com/VikParuchuri/marker](https://github.com/VikParuchuri/marker)

- Surya GitHub: [https://github.com/VikParuchuri/surya](https://github.com/VikParuchuri/surya)

- LlamaParse (LlamaIndex): [https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/)

- Docling (IBM): [https://github.com/DS4SD/docling](https://github.com/DS4SD/docling)

- DocLayoutYOLO: [https://github.com/opendatalab/DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)

- Nougat (Meta): [https://github.com/facebookresearch/nougat](https://github.com/facebookresearch/nougat)

- OlmoOCR (Allen AI, Feb 2025): [https://github.com/allenai/olmocr](https://github.com/allenai/olmocr)

- olmOCR paper: [https://arxiv.org/abs/2502.18443](https://arxiv.org/abs/2502.18443)

- Tesseract OCR: [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)

- LayoutLMv3 (Microsoft): [https://github.com/microsoft/unilm/tree/master/layoutlmv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3)

- Donut (NAVER CLOVA AI): [https://github.com/clovaai/donut](https://github.com/clovaai/donut)

- Pixtral 12B (Mistral): [https://mistral.ai/news/pixtral-12b](https://mistral.ai/news/pixtral-12b)

- Florence-2 (Microsoft): [https://huggingface.co/microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)

- KOSMOS-2.5 (Microsoft): [https://arxiv.org/abs/2309.11419](https://arxiv.org/abs/2309.11419)

- Naver CLOVA OCR: [https://www.ncloud.com/product/aiService/ocr](https://www.ncloud.com/product/aiService/ocr)

- Yomi-toku: [https://github.com/kotaro-kinoshita/yomitoku](https://github.com/kotaro-kinoshita/yomitoku)

- Google Cloud Document AI: [https://cloud.google.com/document-ai](https://cloud.google.com/document-ai)

- Reducto: [https://reducto.ai/](https://reducto.ai/)

- Unstract: [https://github.com/Zipstack/unstract](https://github.com/Zipstack/unstract)

- PaddleOCR: [https://github.com/PaddlePaddle/PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)

- EasyOCR: [https://github.com/JaidedAI/EasyOCR](https://github.com/JaidedAI/EasyOCR)