Reasoning Models in 2026 — A Deep Dive on o3, o4, DeepSeek R1, Claude Thinking, Gemini Deep Think, and QwQ
By Youngju Kim (@fjvbn20031)
Prologue — "Think longer, score higher"
In September 2024, OpenAI shipped o1-preview. The model itself was not large. What was new was one thing: before it answers, it talks to itself for a while.
Earlier LLMs played "predict the next token." o1 added one move. It generates a pile of hidden chain-of-thought tokens, refines its reasoning there, and only then emits the answer. Spend more tokens — that is, "think" longer — and the answer gets closer to correct. That is test-time compute scaling.
This one-line idea redrew the model landscape across 2025–2026. o3 went GA, DeepSeek R1 reproduced the same curve in open weights, Anthropic baked "extended thinking" into Sonnet/Opus 4.5 as a per-request toggle, and Google shipped Gemini 2.5 Pro and Deep Think to GA. Alibaba's QwQ and QwQ-Plus created the second large open-weights reasoning stream.
2024's question: "Which model do we use?" 2026's question: "For this task, do we turn thinking on or off — and how much?"
This post lays out where reasoning models stand in 2026. Six families across thinking behavior, benchmarks, and price, on one page. And it answers the question that actually matters in production: when does a reasoning model earn its cost, and when does a fast non-thinking model win?
1. What is test-time compute?
Classic LLM scaling rode three axes.
| Axis | Meaning |
|---|---|
| Parameters | Make the model larger |
| Training data | Feed it more |
| Training compute | Train it longer |
o1 added a fourth: test-time compute. Spend more tokens at inference and accuracy goes up.
accuracy
▲
R1 ────│ ╱── thinking ON
│ ╱
base ──│ ╱
│ ╱──── thinking OFF (immediate)
└────────────────────────▶ inference token budget
The curve differs per model and per problem. On math, code, and proofs — verifiable problems — it is steep. On creative writing, summarization, and chit-chat it is nearly flat — thinking longer barely helps.
What "thinking tokens" actually are
A reasoning model's "thinking" tokens are usually one of three forms.
- Hidden reasoning — o1, o3, o4. The user never sees the raw chain of thought; only summaries.
- Visible reasoning — DeepSeek R1, QwQ. Raw reasoning streamed inside a <think>...</think> block.
- Toggleable — Claude Sonnet/Opus 4.5 extended thinking. Per-request on/off with a token budget.
Hidden vs visible is not just a UX choice. Visible traces are great for debugging, teaching, and verification, but vulnerable to imitation and distillation. The wave of distillation work that appeared the day R1's weights went open is exactly that risk realized.
2. RLVR — the recipe behind reasoning models
A reasoning model is a base model with two extra layers on top.
2-1. The ability to generate long CoT
First the model must be able to lay out a long chain of thought. Base models prefer short, confident answers. Long-CoT SFT teaches the habit of unspooling reasoning.
2-2. RLVR — Reinforcement Learning with Verifiable Rewards
The key is the second layer. RLVR uses rewards that can be checked automatically.
RLVR loop:
1. Hand the model a problem (math, code, logic)
2. The model emits a long CoT plus a final answer
3. A verifier grades it:
- Math: does the answer match?
- Code: do the tests pass?
- Formal reasoning: is the proof valid?
4. Pass = +1, fail = 0 (or negative)
5. Update with a policy gradient (PPO, GRPO, ...)
6. Repeat
The point is "verifiable" rewards. RLHF (human feedback) is expensive and inconsistent. RLVR is graded by compilers, test runners, and math checkers — nearly free and perfectly consistent.
The DeepSeek R1 paper (Jan 2025) was the shock: starting almost from cold, RLVR alone produced R1-Zero. The model spontaneously discovered self-correction patterns like "wait, let me reconsider" — emergent reasoning, with no human teaching it that move.
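The verifier in step 3 of the loop can be as small as a string match. A minimal sketch in Python, assuming the model is prompted to end its output with an "Answer: ..." line (that convention is an illustrative assumption, not part of any specific paper):

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Verifiable reward for a math problem: +1.0 if the final answer
    matches the gold answer exactly, else 0.0. Assumes the model was
    instructed to end with a line like "Answer: 42"."""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    if match is None:
        return 0.0  # no parseable answer counts as a failure
    return 1.0 if match.group(1) == gold_answer else 0.0
```

The RLVR loop then feeds this scalar into a policy-gradient update (PPO, GRPO, ...); the verifier itself never needs a human in the loop.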
Where RLVR pays off
| Domain | Verifier | RLVR impact |
|---|---|---|
| Math | Answer match | Huge (AIME jumps) |
| Code | Tests pass | Large (LiveCodeBench, SWE-bench) |
| Logic puzzles | Formal check | Large |
| Tool use | Intended tool calls | Moderate |
| Writing / summarization | Needs humans | Small (weak verifier) |
| Safety / honesty | Human or model judge | Small (RLHF is a better fit) |
So reasoning models are not universally better. They dominate only where verifiers are strong.
3. OpenAI — o3 / o3-pro / o4
OpenAI, the company that created the category, has the broadest lineup in 2026.
3-1. o3 (GA from 2025 Q2)
Eval results landed in December 2024, GA in April 2025. Ships a reasoning effort dial (low / medium / high) — same weights, different thinking budget. High effort can take minutes per response.
Notable behavior:
- Uses tools during reasoning ("agentic reasoning") — searches the web, runs code interpreter, feeds results back into thinking.
- Hidden CoT — users see summaries, not raw reasoning.
- First widely available model to approach human performance on ARC-AGI (at high effort).
3-2. o3-pro
For genuinely hard problems. Same weights, run much longer. Costs roughly an order of magnitude more, responses can take minutes. Used for research, deep analysis, and complex debugging.
3-3. o4 / o4-mini
The next generation, late 2025. Multimodal reasoning (looks at images and diagrams while thinking) and tighter tool-use integration. o4-mini is fast yet posts coding numbers near o3 — the new default for coding workloads.
| Model | Thinking | Tools in-loop | Strength |
|---|---|---|---|
| o3 | Hidden, 3-step dial | Yes | General reasoning, ARC-AGI |
| o3-pro | Hidden, very long | Yes | Truly hard problems |
| o4 | Hidden, multimodal | Yes | Complex multi-step |
| o4-mini | Hidden, short | Yes | Coding, cost efficiency |
4. DeepSeek — R1 / R1-0528 / V3.1 reasoner
The bomb that landed in the open-weights camp. When R1 dropped in January 2025, the industry stopped.
4-1. DeepSeek R1 (Jan 2025, MIT license)
- 671B MoE (37B active). Base is V3.
- Built with RLVR alone as R1-Zero, then a light SFT pass to ship R1.
- Streams raw CoT inside <think>...</think> — heaven for debugging and research, a nightmare for closed labs (imitation risk).
- Matches the o1 curve on AIME, MATH, and coding evals.
- An order of magnitude cheaper than closed models.
4-2. R1-0528 (May 2025 update)
Same weight class, more RL on top. A real step up on complex coding and long-context reasoning. SWE-bench Verified moved meaningfully.
4-3. V3.1 reasoner (early 2026)
A unified model with thinking as a toggle on the V3.1 base — like Claude 4.5. One set of weights, thinking on or off; with thinking on, it emits R1-style <think> blocks. The first time the open camp shipped toggleable reasoning.
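As a concrete sketch of the toggle, here is how a request to DeepSeek's OpenAI-compatible endpoint can be built. The model names follow DeepSeek's published API docs (deepseek-reasoner returns its visible CoT in a reasoning_content field alongside the answer), but verify current names and fields before relying on them:

```python
def deepseek_request(prompt: str, thinking: bool) -> dict:
    """Build a chat request for DeepSeek's OpenAI-compatible endpoint.
    The thinking toggle is the model name: 'deepseek-reasoner' emits a
    visible CoT (returned as message.reasoning_content), while
    'deepseek-chat' answers immediately with no reasoning trace."""
    return {
        "model": "deepseek-reasoner" if thinking else "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
    }

# POST this body to https://api.deepseek.com/chat/completions with any
# OpenAI-compatible client; with thinking on, read both
# message.reasoning_content (the trace) and message.content (the answer).
```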
Why DeepSeek matters: it proved reasoning models are not a closed-model monopoly. Anyone with 8 H100s can self-host. For regulated industries or on-prem requirements, this is effectively the default.
5. Anthropic — Claude Sonnet 4.5 / Opus 4.5 extended thinking
Anthropic took a different path. Not a separate model family — a state of the same model.
5-1. What extended thinking is
Sonnet 4.5 and Opus 4.5 ship a per-request toggle. The API call sets thinking with a token budget. The model spends up to that budget generating reasoning blocks, then emits the answer.
Request:
thinking: { type: "enabled", budget_tokens: 16000 }
Response:
- thinking block (up to budget)
- final answer (assistant message)
5-2. Notable behavior
- One set of weights, two modes — operationally simple.
- Interleaved thinking — calls tools mid-reasoning and continues thinking after seeing results.
- The thinking content arrives in the API response as text; it is not hidden. In multi-turn conversations, earlier thinking blocks are dropped from context automatically.
- Strong on coding and SWE-bench Verified. Sonnet 4.5 with extended thinking is one of the strongest options for real PR automation.
5-3. Budget sizing intuition
| Task | Suggested budget |
|---|---|
| Instant-answer question | thinking off |
| One or two reasoning steps | 2k–4k |
| Small code patch | 8k–16k |
| Complex bug debugging | 32k–64k |
| Math / proofs / research | 64k or more |
Rule: scale budget to difficulty. Do not turn thinking on by reflex.
6. Google — Gemini 2.5 Pro / Deep Think
Gemini 2.5 Pro shipped with reasoning baked into the general model from day one.
6-1. Gemini 2.5 Pro
- Thinking defaults to ON with dynamic thinking — the model decides how long to think based on the problem.
- One-million-token context plus thinking — strong for reasoning over long documents.
- Multimodal — video, audio, and images participate in reasoning.
6-2. Deep Think (Gemini 2.5)
For genuinely hard work. Parallel thinking — multiple hypotheses run in parallel and merge. Made headlines at IMO 2025 as the first model at human gold-medal level. GA in late 2025.
| Model | Thinking | Context | Strength |
|---|---|---|---|
| Gemini 2.5 Flash | Dynamic, short | 1M | Fast reasoning, cost efficiency |
| Gemini 2.5 Pro | Dynamic, long | 1M | General, multimodal |
| Gemini 2.5 Deep Think | Parallel, very long | 1M | Hard math and proofs |
7. Alibaba — Qwen QwQ / QwQ-Plus
The second big stream in open-weights reasoning. Together with R1, the two pillars of open reasoning.
- QwQ-32B-Preview (Nov 2024) — an open 32B that approached o1-preview on reasoning. A shock.
- QwQ-Plus (2025) — the next generation. A step up on both code and math.
- Qwen3 reasoner — larger sizes, Apache 2.0.
Like R1, QwQ exposes visible CoT. Self-host friendly. Strong on Korean, Japanese, Chinese, and English — the preferred default for in-house use across Asia.
8. xAI — Grok 3 / 4 Heavy thinking
Grok 3 thinking, Grok 4, and Grok 4 Heavy all have thinking modes.
- Grok 3 Thinking (early 2025) — long chain-of-thought mode. Trained heavily on X (Twitter) data, so strong on current events.
- Grok 4 / 4 Heavy (late 2025) — Heavy runs multi-agent thinking, with several instances reasoning in parallel and merging. Top scores on extremely hard evals like HLE (Humanity's Last Exam).
| Model | Thinking | Notes |
|---|---|---|
| Grok 3 thinking | Partially visible | Live X data |
| Grok 4 | Hidden, long | General |
| Grok 4 Heavy | Parallel multi-agent | HLE leader |
9. Comparison matrix — one page
Benchmark numbers move with each release. The table below shows relative positioning as of early 2026.
9-1. Thinking behavior
| Model | Thinking form | Budget control | Tools in-thinking |
|---|---|---|---|
| OpenAI o3 | Hidden (summary only) | low/med/high | Yes |
| OpenAI o3-pro | Hidden, very long | Auto (very large) | Yes |
| OpenAI o4 / o4-mini | Hidden | low/med/high | Yes |
| DeepSeek R1 / 0528 | Visible (<think>) | Auto | Partial |
| DeepSeek V3.1 reasoner | Visible, toggleable | API toggle | Partial |
| Claude Sonnet 4.5 | Visible, toggleable | Token budget | Yes (interleaved) |
| Claude Opus 4.5 | Visible, toggleable | Token budget | Yes (interleaved) |
| Gemini 2.5 Pro | Hidden, dynamic | Dynamic auto | Yes |
| Gemini 2.5 Deep Think | Hidden, parallel | Dynamic auto | Yes |
| Qwen QwQ / QwQ-Plus | Visible (<think>) | Auto | Partial |
| Grok 4 / 4 Heavy | Hidden / parallel | Mode select | Yes |
9-2. Benchmark position (early 2026, relative)
| Model | AIME-style math | LiveCodeBench | SWE-bench Verified | Cost / latency |
|---|---|---|---|---|
| o3 (high) | Top-tier | Top-tier | Top-tier | Expensive, slow |
| o3-pro | Top-tier | Top-tier | Top-tier | Very expensive, very slow |
| o4-mini | High | High | High | Moderate, moderate |
| R1-0528 | High | High | Top-tier-ish | Cheap (open), moderate |
| Sonnet 4.5 thinking | High | Top-tier | Top-tier | Moderate, moderate |
| Opus 4.5 thinking | Top-tier | Top-tier | Top-tier | Expensive, moderate |
| Gemini 2.5 Pro | High | High | High | Moderate, moderate |
| Deep Think | Top-tier (IMO) | High | High | Expensive, very slow |
| QwQ-Plus | High | High | Mid-to-high | Cheap (open), moderate |
| Grok 4 Heavy | Top-tier | High | High | Expensive, slow |
Absolute numbers shift with each release and eval methodology. Decide with your own eval suite on your data, your tasks, and your SLAs.
10. Pricing and the thinking-token economy
Reasoning models price differently. Thinking tokens count as output tokens, and they typically run tens of times longer than the visible answer.
Request: "find the bug in this code (200 tokens)"
Response: [thinking: 8,000 tokens] ← billed as output
[answer: 600 tokens] ← billed as output
Total = input(200) + output(8,600)
The implication: thinking budget is the price tag. Turning thinking on for a tiny task can be 10–50x the normal cost.
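The arithmetic is worth making explicit. A small helper, with illustrative prices only (real rates vary by provider and change often):

```python
def request_cost(input_tokens: int, thinking_tokens: int, answer_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one reasoning-model call. Thinking tokens are
    billed at the output rate, so long thinking dominates the bill."""
    output_tokens = thinking_tokens + answer_tokens  # both billed as output
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# The request above, at illustrative rates of $3/M input and $15/M output:
cost = request_cost(200, 8_000, 600, 3.0, 15.0)   # 0.1296
no_think = request_cost(200, 0, 600, 3.0, 15.0)   # 0.0096
# thinking on is ~13.5x the cost of the same answer with thinking off
```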
10-1. Rough per-1M-output-token positioning
Prices change often. The table is for relative comparison — get exact figures from each provider.
| Model | Input/1M | Output/1M | Thinking in output? |
|---|---|---|---|
| o3 | Moderate-high | Very high | Yes |
| o3-pro | Very high | Very, very high | Yes |
| o4-mini | Low-moderate | Moderate | Yes |
| R1 (DeepSeek API) | Very low | Low | Yes |
| Sonnet 4.5 thinking | Moderate | High | Yes (thinking counts as output) |
| Opus 4.5 thinking | High | Very high | Yes |
| Gemini 2.5 Pro | Moderate | High | Yes |
| Deep Think | High | Very high | Yes |
| QwQ-Plus (Alibaba API) | Very low | Low | Yes |
| Grok 4 Heavy | High | Very high | Yes |
Token cost for open-weights models like R1 and QwQ goes to zero once self-hosted (only infra cost remains). For high-volume repeated work, the gap is enormous.
10-2. Budget guidelines
| Task | Suggested |
|---|---|
| FAQ, summary, translation | Thinking off (use a non-reasoning model) |
| Short code snippet | Thinking off or minimal |
| Routine bug fix | Thinking low / 4k |
| Complex debug | Thinking medium / 16k |
| Hard math / proofs | Thinking high / 64k+ |
| Deep research, hard one-shots | o3-pro, Deep Think, Grok 4 Heavy |
11. When you actually need a reasoning model
Reasoning models are not universal. There are clear times to turn it on, and more times to turn it off.
11-1. Reasoning models shine when
- Math, logic, proofs — multi-step reasoning is where value lives.
- Complex code changes — coherent edits across many files in a large repo. The SWE-bench shape.
- Agent planning — figuring out which tool to call in what order on a new task.
- Debugging — hypothesizing, gathering evidence, falsifying.
- Research and analysis — surfacing trade-offs, counter-examples, and falsifiable claims.
- One-shot exam-like questions — IMO, AIME, HLE, where you must be right the first time.
11-2. Reasoning models cost you when
- Instant factual lookup — no reason to spend 16k thinking tokens on "what's the date?"
- High-volume classification or tagging — cost multiplies per item.
- Latency-sensitive chat UX — thinking is slow; users leave.
- Creative writing — verifier is weak; a general model is more varied and natural.
- Casual or emotional dialogue — overthinking reads as awkward.
- Templated reports — you are just filling slots.
Rule: thinking costs money. Turn it on only when accuracy gains justify it.
11-3. The routing pattern
request arrives
│
▼
complexity classifier (cheap fast model: Haiku, Flash, 4o-mini)
│
├── "simple" → fast non-reasoning model (immediate)
├── "medium" → reasoning model, low budget
└── "hard" → reasoning model, high budget or pro/Heavy
This is the default shape of production AI systems in 2026. Sending every request to a reasoning model is cost-and-latency suicide.
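A minimal sketch of that router, with placeholder model names and budgets; the classifier is any call to a cheap, fast model (Haiku / Flash / 4o-mini class) that returns a difficulty label:

```python
def route(request: str, classify) -> dict:
    """Route by difficulty before spending any thinking tokens.
    classify(request) -> 'simple' | 'medium' | 'hard' via a cheap model.
    Model names and budgets here are placeholders, not recommendations."""
    tier = classify(request)
    if tier == "simple":
        return {"model": "fast-non-reasoning", "thinking_budget": 0}
    if tier == "medium":
        return {"model": "reasoning", "thinking_budget": 4_000}
    return {"model": "reasoning-pro", "thinking_budget": 32_000}  # hard
```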
12. Accuracy / cost / latency — the triangle
The same task at the same accuracy is a different system if cost and latency differ.
12-1. Three axes
            accuracy
               ▲
              ╱ ╲
             ╱   ╲  ← Pareto frontier
            ╱     ╲
   cost ◀──────────▶ latency
(expensive)          (slow)
Pareto frontier: gaining on one axis costs you on another. o3-pro buys accuracy, paying in both cost and latency. Self-hosted R1 buys cost. Haiku/Flash buy latency.
12-2. Which point to buy
| Product shape | Recommended point |
|---|---|
| Interactive chat (sub-second) | Non-reasoning model or minimal thinking |
| Async agent (minutes OK) | Thinking medium / high |
| Batch analysis (overnight OK) | Highest-accuracy model, optimize cost only |
| On-prem / regulated | Open-weights (R1, QwQ) |
| High-stakes one-off decision | Pro / Heavy / Deep Think |
12-3. Dynamic budget — graduated thinking
Advanced pattern: escalate budget on failure.
1. Try thinking 2k
2. Self-consistency: is the answer stable across reruns?
3. If stable → done
4. If unstable → retry at 4k
5. Still unstable → 16k or a different model
This escalation pattern lowers average cost a lot — easy problems stay cheap, only hard ones go expensive.
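A sketch of that escalation loop; ask stands in for any reasoning-model call that accepts a thinking budget:

```python
from collections import Counter

def graduated_answer(ask, question: str,
                     budgets=(2_000, 4_000, 16_000), runs: int = 3):
    """Escalate the thinking budget only when answers disagree.
    ask(question, budget) is any call into a reasoning model (sampled
    with temperature > 0 so reruns can differ). Unanimous answers
    return immediately at the cheap tier; only genuinely hard
    questions reach the expensive budgets."""
    for budget in budgets:
        answers = [ask(question, budget) for _ in range(runs)]
        top, count = Counter(answers).most_common(1)[0]
        if count == runs:  # self-consistent: stop escalating
            return top, budget
    return top, budget     # best guess at the largest budget
```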
13. The open-vs-closed reasoning ladder
The 2026 reasoning landscape on an open/closed axis:
Closed (closed-weights)
│
o3-pro · Opus 4.5 thinking · Deep Think · Grok 4 Heavy
│ ← "strongest" but expensive and gated
│
o3 · Sonnet 4.5 thinking · Gemini 2.5 Pro · Grok 4
│ ← standard for general work
│
o4-mini · Gemini 2.5 Flash · Grok 3 thinking
│ ← fast reasoning
│
─────────┼─────────────────────────── price / latency
│
QwQ-Plus · Qwen3 reasoner
│
DeepSeek R1-0528 · V3.1 reasoner
│
Open (open-weights, self-hostable)
Why pick open
- Data must not leave — healthcare, finance, defense, government.
- High-volume repeats — token-cost goes to zero.
- Further fine-tuning — domain adaptation is possible.
- Reproducibility and audit — the weights make every decision traceable.
Why pick closed
- Top-tier performance — for some tasks 1–3 percentage points decide.
- Operations outsourced — hosting, updates, safety.
- Multimodal integration — image, video, audio, tools in one API.
- Frontier rotation — instant access to the latest.
Reality in 2026: serious shops run both. Sensitive data goes to open self-host; general public-OK work goes to closed APIs. Routing is the hardest decision.
14. Working with reasoning models — practical tips
14-1. Keep the prompt short, keep context rich
A reasoning model's job is to think to itself. Forcing "step 1, step 2, ..." in the prompt gets in its way. State the goal clearly, state the constraints clearly, and leave the structure to the model.
14-2. Drop CoT directives
"Think step by step" helped non-reasoning models. In reasoning models, that already happens inside the thinking block, and adding the directive duplicates work or can even shorten the internal reasoning. Remove it.
14-3. Tool use differs by family
- o3 / o4, Sonnet 4.5, Gemini 2.5 Pro: interleaved thinking — tool results merge smoothly into reasoning.
- R1, QwQ: weaker tool integration. Wrap with an external ReAct loop.
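For the R1/QwQ case, a minimal external ReAct wrapper might look like this. The ACTION/FINAL output convention is an assumption you would have to spell out in the system prompt, not a model feature:

```python
import json

def react_loop(ask, tools: dict, question: str, max_steps: int = 5) -> str:
    """Minimal external ReAct wrapper for models without native tool use.
    Assumed prompt convention: each turn the model emits either
    'ACTION: {"tool": ..., "args": {...}}' or 'FINAL: <answer>'."""
    transcript = question
    for _ in range(max_steps):
        out = ask(transcript)
        if out.startswith("FINAL:"):
            return out[len("FINAL:"):].strip()
        if out.startswith("ACTION:"):
            call = json.loads(out[len("ACTION:"):])
            result = tools[call["tool"]](**call.get("args", {}))
            # feed the observation back so the model can keep reasoning
            transcript += f"\n{out}\nOBSERVATION: {result}"
        else:
            transcript += f"\n{out}"
    return "no answer within step budget"
```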
14-4. Self-consistency
Run the same question N times and majority-vote. Especially powerful with reasoning models: cost goes up Nx, but accuracy moves measurably. Useful for high-stakes decisions in domains like medicine and finance.
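A sketch of the voting step; the vote share doubles as a cheap confidence signal worth logging:

```python
from collections import Counter

def self_consistent(ask, question: str, n: int = 5):
    """Majority vote over n independent samples (temperature > 0).
    Returns the winning answer and its vote share; a low share is a
    signal to escalate rather than trust the result."""
    votes = Counter(ask(question) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n
```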
14-5. Log thinking traces — where you can
Visible-CoT models (R1, QwQ, Claude) return the trace; log it. It is a goldmine for debugging, improvement, and evaluation. But do not show raw thinking to end users verbatim — wrong hypotheses can read as facts.
14-6. Use caching
If the system prompt is long, thinking happens on top of it. Prompt caching (Anthropic, OpenAI, Gemini) can cut input cost up to 90%. Thinking tokens themselves are not cached — they regenerate every call.
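With Anthropic's API, caching is opt-in per content block via cache_control (OpenAI and Gemini handle caching differently); a sketch, with an assumed model id:

```python
def cached_system_request(system_text: str, user_text: str) -> dict:
    """Mark a long, stable system prompt as cacheable (Anthropic-style
    cache_control). Later calls reusing the same prefix pay the reduced
    cached-input rate. Thinking tokens are regenerated and billed in
    full on every call regardless."""
    return {
        "model": "claude-sonnet-4-5",  # assumed model id; check the docs
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }],
        "messages": [{"role": "user", "content": user_text}],
    }
```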
Epilogue — two lines, then what's next
Two-line summary:
- Reasoning models are not universally better — they dominate only where verifiers are strong.
- In 2026, the decision is not "which model" but "which model x which thinking mode x which routing."
12-point checklist
- Do you decide per-task whether to turn thinking on?
- Do you scale thinking budget to task difficulty?
- Is there a router (cheap classifier + expensive reasoning model)?
- Do you use self-consistency for high-stakes decisions?
- Does your cost model reflect that thinking counts as output tokens?
- Does your tool-use design lean on interleaved thinking properly?
- Do you log reasoning traces from visible-CoT models?
- Do you have your own eval suite (not vendor benchmarks alone)?
- Have you evaluated open-weights options (for regulated or high-volume)?
- Have you cut input cost with prompt caching?
- Did you remove "think step by step" from reasoning-model prompts?
- Have you blocked raw thinking from leaking into user UI?
Ten anti-patterns
- Reasoning model on every request — cost-and-latency suicide.
- Forced CoT in prompts — counter-productive for reasoning models.
- Defaulting thinking budget to max — billing time bomb.
- Trusting vendor benchmarks only — not your tasks.
- Showing raw reasoning to end users — wrong guesses look like facts.
- Self-consistency everywhere — Nx cost.
- Picking either open or closed only — routing is the answer.
- Not monitoring thinking tokens — cost tracking is impossible.
- Sensitive data through external reasoning APIs — compliance violation.
- Reasoning models in raw chat UX — no one waits a minute.
Next post candidates
- Building a reasoning-model eval suite — measuring thinking on your data
- Agents x reasoning — patterns for tool use and thinking together
- Self-hosting open reasoning models — vLLM vs SGLang vs TGI
"Not bigger models, but models that think better — and then, models that know when not to think."
— Reasoning models 2026 guide, end.
References
- OpenAI, "Learning to reason with LLMs (o1)" — https://openai.com/index/learning-to-reason-with-llms/
- OpenAI, "Introducing o3 and o4-mini" — https://openai.com/index/introducing-o3-and-o4-mini/
- OpenAI, "OpenAI o3-mini" — https://openai.com/index/openai-o3-mini/
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (arXiv:2501.12948) — https://arxiv.org/abs/2501.12948
- DeepSeek, "DeepSeek-R1-0528 release notes" — https://api-docs.deepseek.com/news/news250528
- Anthropic, "Claude's extended thinking" — https://www.anthropic.com/news/visible-extended-thinking
- Anthropic, "Claude Sonnet 4.5" — https://www.anthropic.com/news/claude-sonnet-4-5
- Anthropic, "Extended thinking docs" — https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking
- Google DeepMind, "Gemini 2.5: Our most intelligent AI model" — https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
- Google DeepMind, "Try Deep Think in the Gemini app" — https://blog.google/products/gemini/gemini-2-5-deep-think/
- Alibaba Qwen, "QwQ-32B: Reflect deeply on the boundaries of the unknown" — https://qwenlm.github.io/blog/qwq-32b-preview/
- Alibaba Qwen, "QwQ-Plus / Qwen3 reasoning" — https://qwenlm.github.io/blog/qwen3/
- xAI, "Grok 3 Beta" — https://x.ai/news/grok-3
- xAI, "Grok 4 and Grok 4 Heavy" — https://x.ai/news/grok-4
- Kimi/Moonshot, "Kimi k1.5: Scaling RL with LLMs" (RLVR comparison) — https://arxiv.org/abs/2501.12599
- ARC Prize, "ARC-AGI-1 Leaderboard" — https://arcprize.org/
- SWE-bench Verified leaderboard — https://www.swebench.com/
- LiveCodeBench — https://livecodebench.github.io/
- HLE (Humanity's Last Exam) — https://lastexam.ai/
- AIME 2024/2025 discussion — https://artofproblemsolving.com/community/c3416_2024_aime_i
- Lilian Weng, "Why we think" — https://lilianweng.github.io/posts/2025-05-01-thinking/