Reasoning Models in 2026 — A Deep Dive on o3, o4, DeepSeek R1, Claude Thinking, Gemini Deep Think, and QwQ
By Youngju Kim (@fjvbn20031)
Prologue — "Think longer, score higher"
In September 2024, OpenAI shipped o1-preview. The model itself was not large. What was new was one thing: before it answers, it talks to itself for a while.
Earlier LLMs played "predict the next token." o1 added one move. It generates a pile of hidden chain-of-thought tokens, refines its reasoning there, and only then emits the answer. Spend more tokens — that is, "think" longer — and the answer gets closer to correct. That is test-time compute scaling.
This one-line idea redrew the model landscape across 2025–2026. o3 went GA, DeepSeek R1 reproduced the same curve in open weights, Anthropic baked "extended thinking" into Sonnet/Opus 4.5 as a per-request toggle, and Google shipped Gemini 2.5 Pro and Deep Think to GA. Alibaba's QwQ and QwQ-Plus created the second large open-weights reasoning stream.
2024's question: "Which model do we use?" 2026's question: "For this task, do we turn thinking on or off — and how much?"
This post lays out where reasoning models stand in 2026. Six families across thinking behavior, benchmarks, and price, on one page. And it answers the question that actually matters in production: when does a reasoning model earn its cost, and when does a fast non-thinking model win?
1. What is test-time compute?
Classic LLM scaling rode three axes.
| Axis | Meaning |
|---|---|
| Parameters | Make the model larger |
| Training data | Feed it more |
| Training compute | Train it longer |
o1 added a fourth: test-time compute. Spend more tokens at inference and accuracy goes up.
accuracy
▲
R1 ────│ ╱── thinking ON
│ ╱
base ──│ ╱
│ ╱──── thinking OFF (immediate)
└────────────────────────▶ inference token budget
The curve differs per model and per problem. On math, code, and proofs — verifiable problems — it is steep. On creative writing, summarization, and chit-chat it is nearly flat — thinking longer barely helps.
What "thinking tokens" actually are
A reasoning model's "thinking" tokens are usually one of three forms.
- Hidden reasoning — o1, o3, o4. The user never sees the raw chain of thought; only summaries.
- Visible reasoning — DeepSeek R1, QwQ. Raw reasoning streamed inside a <think>...</think> block.
- Toggleable — Claude Sonnet/Opus 4.5 extended thinking. Per-request on/off with a token budget.
Hidden vs visible is not just a UX choice. Visible traces are great for debugging, teaching, and verification, but vulnerable to imitation and distillation. The wave of distillation work that appeared the day R1's weights went open is exactly that risk realized.
2. RLVR — the recipe behind reasoning models
A reasoning model is a base model with two extra layers on top.
2-1. The ability to generate long CoT
First the model must be able to lay out a long chain of thought. Base models prefer short, confident answers. Long-CoT SFT teaches the habit of unspooling reasoning.
2-2. RLVR — Reinforcement Learning with Verifiable Rewards
The key is the second layer. RLVR uses rewards that can be checked automatically.
RLVR loop:
1. Hand the model a problem (math, code, logic)
2. The model emits a long CoT plus a final answer
3. A verifier grades it:
- Math: does the answer match?
- Code: do the tests pass?
- Formal reasoning: is the proof valid?
4. Pass = +1, fail = 0 (or negative)
5. Update with a policy gradient (PPO, GRPO, ...)
6. Repeat
The point is "verifiable" rewards. RLHF (human feedback) is expensive and inconsistent. RLVR is graded by compilers, test runners, and math checkers — nearly free and perfectly consistent.
The DeepSeek R1 paper (Jan 2025) was the shock: starting almost from cold, RLVR alone produced R1-Zero. The model spontaneously discovered self-correction patterns like "wait, let me reconsider" — emergent reasoning, with no human teaching it that move.
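The verifier in step 3 of the loop can be as small as a string match. A minimal sketch in Python, assuming the model is prompted to end its output with an "Answer: ..." line (that convention is an illustrative assumption, not part of any specific paper):

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Verifiable reward for a math problem: +1.0 if the final answer
    matches the gold answer exactly, else 0.0. Assumes the model was
    instructed to end with a line like "Answer: 42"."""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    if match is None:
        return 0.0  # no parseable answer counts as a failure
    return 1.0 if match.group(1) == gold_answer else 0.0
```

The RLVR loop then feeds this scalar into a policy-gradient update (PPO, GRPO, ...); the verifier itself never needs a human in the loop.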
Where RLVR pays off
| Domain | Verifier | RLVR impact |
|---|---|---|
| Math | Answer match | Huge (AIME jumps) |
| Code | Tests pass | Large (LiveCodeBench, SWE-bench) |
| Logic puzzles | Formal check | Large |
| Tool use | Intended tool calls | Moderate |
| Writing / summarization | Needs humans | Small (weak verifier) |
| Safety / honesty | Human or model judge | Small (RLHF is a better fit) |
So reasoning models are not universally better. They dominate only where verifiers are strong.
3. OpenAI — o3 / o3-pro / o4
OpenAI, the company that created the category, has the broadest lineup in 2026.
3-1. o3 (GA from 2025 Q2)
Eval results landed in December 2024, GA in April 2025. Ships a reasoning effort dial (low / medium / high) — same weights, different thinking budget. High effort can take minutes per response.
Notable behavior:
- Uses tools during reasoning ("agentic reasoning") — searches the web, runs code interpreter, feeds results back into thinking.
- Hidden CoT — users see summaries, not raw reasoning.
- First widely available model to approach human performance on ARC-AGI (at high effort).
3-2. o3-pro
For genuinely hard problems. Same weights, run much longer. Costs roughly an order of magnitude more, responses can take minutes. Used for research, deep analysis, and complex debugging.
3-3. o4 / o4-mini
The next generation, late 2025. Multimodal reasoning (looks at images and diagrams while thinking) and tighter tool-use integration. o4-mini is fast yet posts coding numbers near o3 — the new default for coding workloads.
| Model | Thinking | Tools in-loop | Strength |
|---|---|---|---|
| o3 | Hidden, 3-step dial | Yes | General reasoning, ARC-AGI |
| o3-pro | Hidden, very long | Yes | Truly hard problems |
| o4 | Hidden, multimodal | Yes | Complex multi-step |
| o4-mini | Hidden, short | Yes | Coding, cost efficiency |
4. DeepSeek — R1 / R1-0528 / V3.1 reasoner
The bomb that landed in the open-weights camp. When R1 dropped in January 2025, the industry stopped.
4-1. DeepSeek R1 (Jan 2025, MIT license)
- 671B MoE (37B active). Base is V3.
- Built with RLVR alone as R1-Zero, then a light SFT pass to ship R1.
- Streams raw CoT inside <think>...</think> — heaven for debugging and research, a nightmare for closed labs (imitation risk).
- Matches the o1 curve on AIME, MATH, and coding evals.
- An order of magnitude cheaper than closed models.
4-2. R1-0528 (May 2025 update)
Same weight class, more RL on top. A real step up on complex coding and long-context reasoning. SWE-bench Verified moved meaningfully.
4-3. V3.1 reasoner (early 2026)
A unified model with thinking as a toggle on the V3.1 base — like Claude 4.5. One set of weights, thinking on or off; with thinking on, it emits R1-style <think> blocks. The first time the open camp shipped toggleable reasoning.
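As a concrete sketch of the toggle, here is how a request to DeepSeek's OpenAI-compatible endpoint can be built. The model names follow DeepSeek's published API docs (deepseek-reasoner returns its visible CoT in a reasoning_content field alongside the answer), but verify current names and fields before relying on them:

```python
def deepseek_request(prompt: str, thinking: bool) -> dict:
    """Build a chat request for DeepSeek's OpenAI-compatible endpoint.
    The thinking toggle is the model name: 'deepseek-reasoner' emits a
    visible CoT (returned as message.reasoning_content), while
    'deepseek-chat' answers immediately with no reasoning trace."""
    return {
        "model": "deepseek-reasoner" if thinking else "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
    }

# POST this body to https://api.deepseek.com/chat/completions with any
# OpenAI-compatible client; with thinking on, read both
# message.reasoning_content (the trace) and message.content (the answer).
```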
Why DeepSeek matters: it proved reasoning models are not a closed-model monopoly. Anyone with 8 H100s can self-host. For regulated industries or on-prem requirements, this is effectively the default.
5. Anthropic — Claude Sonnet 4.5 / Opus 4.5 extended thinking
Anthropic took a different path. Not a separate model family — a state of the same model.
5-1. What extended thinking is
Sonnet 4.5 and Opus 4.5 ship a per-request toggle. The API call sets thinking with a token budget. The model spends up to that budget generating reasoning blocks, then emits the answer.
Request:
thinking: { type: "enabled", budget_tokens: 16000 }
Response:
- thinking block (up to budget)
- final answer (assistant message)
5-2. Notable behavior
- One set of weights, two modes — operationally simple.
- Interleaved thinking — calls tools mid-reasoning and continues thinking after seeing results.
- The thinking content arrives in the API response as text; it is not hidden. In multi-turn conversations, earlier thinking blocks are dropped from context automatically.
- Strong on coding and SWE-bench Verified. Sonnet 4.5 with extended thinking is one of the strongest options for real PR automation.
5-3. Budget sizing intuition
| Task | Suggested budget |
|---|---|
| Instant-answer question | thinking off |
| One or two reasoning steps | 2k–4k |
| Small code patch | 8k–16k |
| Complex bug debugging | 32k–64k |
| Math / proofs / research | 64k or more |
Rule: scale budget to difficulty. Do not turn thinking on by reflex.
6. Google — Gemini 2.5 Pro / Deep Think
Gemini 2.5 Pro shipped with reasoning baked into the general model from day one.
6-1. Gemini 2.5 Pro
- Thinking defaults to ON with dynamic thinking — the model decides how long to think based on the problem.
- One-million-token context plus thinking — strong for reasoning over long documents.
- Multimodal — video, audio, and images participate in reasoning.
6-2. Deep Think (Gemini 2.5)
For genuinely hard work. Parallel thinking — multiple hypotheses run in parallel and merge. Made headlines at IMO 2025 as the first model at human gold-medal level. GA in late 2025.
| Model | Thinking | Context | Strength |
|---|---|---|---|
| Gemini 2.5 Flash | Dynamic, short | 1M | Fast reasoning, cost efficiency |
| Gemini 2.5 Pro | Dynamic, long | 1M | General, multimodal |
| Gemini 2.5 Deep Think | Parallel, very long | 1M | Hard math and proofs |
7. Alibaba — Qwen QwQ / QwQ-Plus
The second big stream in open-weights reasoning. Together with R1, the two pillars of open reasoning.
- QwQ-32B-Preview (Nov 2024) — an open 32B that approached o1-preview on reasoning. A shock.
- QwQ-Plus (2025) — the next generation. A step up on both code and math.
- Qwen3 reasoner — larger sizes, Apache 2.0.
Like R1, QwQ exposes visible CoT. Self-host friendly. Strong on Korean, Japanese, Chinese, and English — the preferred default for in-house use across Asia.
8. xAI — Grok 3 / 4 Heavy thinking
Grok 3 thinking, Grok 4, and Grok 4 Heavy all have thinking modes.
- Grok 3 Thinking (early 2025) — long chain-of-thought mode. Trained heavily on X (Twitter) data, so strong on current events.
- Grok 4 / 4 Heavy (late 2025) — Heavy runs multi-agent thinking, with several instances reasoning in parallel and merging. Top scores on extremely hard evals like HLE (Humanity's Last Exam).
| Model | Thinking | Notes |
|---|---|---|
| Grok 3 thinking | Partially visible | Live X data |
| Grok 4 | Hidden, long | General |
| Grok 4 Heavy | Parallel multi-agent | HLE leader |
9. Comparison matrix — one page
Benchmark numbers move with each release. The table below shows relative positioning as of early 2026.
9-1. Thinking behavior
| Model | Thinking form | Budget control | Tools in-thinking |
|---|---|---|---|
| OpenAI o3 | Hidden (summary only) | low/med/high | Yes |
| OpenAI o3-pro | Hidden, very long | Auto (very large) | Yes |
| OpenAI o4 / o4-mini | Hidden | low/med/high | Yes |
| DeepSeek R1 / 0528 | Visible (<think>) | Auto | Partial |
| DeepSeek V3.1 reasoner | Visible, toggleable | API toggle | Partial |
| Claude Sonnet 4.5 | Visible, toggleable | Token budget | Yes (interleaved) |
| Claude Opus 4.5 | Visible, toggleable | Token budget | Yes (interleaved) |
| Gemini 2.5 Pro | Hidden, dynamic | Dynamic auto | Yes |
| Gemini 2.5 Deep Think | Hidden, parallel | Dynamic auto | Yes |
| Qwen QwQ / QwQ-Plus | Visible (<think>) | Auto | Partial |
| Grok 4 / 4 Heavy | Hidden / parallel | Mode select | Yes |
9-2. Benchmark position (early 2026, relative)
| Model | AIME-style math | LiveCodeBench | SWE-bench Verified | Cost / latency |
|---|---|---|---|---|
| o3 (high) | Top-tier | Top-tier | Top-tier | Expensive, slow |
| o3-pro | Top-tier | Top-tier | Top-tier | Very expensive, very slow |
| o4-mini | High | High | High | Moderate, moderate |
| R1-0528 | High | High | Top-tier-ish | Cheap (open), moderate |
| Sonnet 4.5 thinking | High | Top-tier | Top-tier | Moderate, moderate |
| Opus 4.5 thinking | Top-tier | Top-tier | Top-tier | Expensive, moderate |
| Gemini 2.5 Pro | High | High | High | Moderate, moderate |
| Deep Think | Top-tier (IMO) | High | High | Expensive, very slow |
| QwQ-Plus | High | High | Mid-to-high | Cheap (open), moderate |
| Grok 4 Heavy | Top-tier | High | High | Expensive, slow |
Absolute numbers shift with each release and eval methodology. Decide with your own eval suite on your data, your tasks, and your SLAs.
10. Pricing and the thinking-token economy
Reasoning models price differently. Thinking tokens count as output tokens, and they typically run tens of times longer than the visible answer.
Request: "find the bug in this code (200 tokens)"
Response: [thinking: 8,000 tokens] ← billed as output
[answer: 600 tokens] ← billed as output
Total = input(200) + output(8,600)
The implication: thinking budget is the price tag. Turning thinking on for a tiny task can be 10–50x the normal cost.
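The arithmetic is worth making explicit. A small helper, with illustrative prices only (real rates vary by provider and change often):

```python
def request_cost(input_tokens: int, thinking_tokens: int, answer_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one reasoning-model call. Thinking tokens are
    billed at the output rate, so long thinking dominates the bill."""
    output_tokens = thinking_tokens + answer_tokens  # both billed as output
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# The request above, at illustrative rates of $3/M input and $15/M output:
cost = request_cost(200, 8_000, 600, 3.0, 15.0)   # 0.1296
no_think = request_cost(200, 0, 600, 3.0, 15.0)   # 0.0096
# thinking on is ~13.5x the cost of the same answer with thinking off
```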
10-1. Rough per-1M-output-token positioning
Prices change often. The table is for relative comparison — get exact figures from each provider.
| Model | Input/1M | Output/1M | Thinking in output? |
|---|---|---|---|
| o3 | Moderate-high | Very high | Yes |
| o3-pro | Very high | Very, very high | Yes |
| o4-mini | Low-moderate | Moderate | Yes |
| R1 (DeepSeek API) | Very low | Low | Yes |
| Sonnet 4.5 thinking | Moderate | High | Yes (thinking counts as output) |
| Opus 4.5 thinking | High | Very high | Yes |
| Gemini 2.5 Pro | Moderate | High | Yes |
| Deep Think | High | Very high | Yes |
| QwQ-Plus (Alibaba API) | Very low | Low | Yes |
| Grok 4 Heavy | High | Very high | Yes |
Token cost for open-weights models like R1 and QwQ goes to zero once self-hosted (only infra cost remains). For high-volume repeated work, the gap is enormous.
10-2. Budget guidelines
| Task | Suggested |
|---|---|
| FAQ, summary, translation | Thinking off (use a non-reasoning model) |
| Short code snippet | Thinking off or minimal |
| Routine bug fix | Thinking low / 4k |
| Complex debug | Thinking medium / 16k |
| Hard math / proofs | Thinking high / 64k+ |
| Deep research, hard one-shots | o3-pro, Deep Think, Grok 4 Heavy |
11. When you actually need a reasoning model
Reasoning models are not universal. There are clear times to turn it on, and more times to turn it off.
11-1. Reasoning models shine when
- Math, logic, proofs — multi-step reasoning is where value lives.
- Complex code changes — coherent edits across many files in a large repo. The SWE-bench shape.
- Agent planning — figuring out which tool to call in what order on a new task.
- Debugging — hypothesizing, gathering evidence, falsifying.
- Research and analysis — surfacing trade-offs, counter-examples, and falsifiable claims.
- One-shot exam-like questions — IMO, AIME, HLE, where you must be right the first time.
11-2. Reasoning models cost you when
- Instant factual lookup — no reason to spend 16k thinking tokens on "what's the date?"
- High-volume classification or tagging — cost multiplies per item.
- Latency-sensitive chat UX — thinking is slow; users leave.
- Creative writing — verifier is weak; a general model is more varied and natural.
- Casual or emotional dialogue — overthinking reads as awkward.
- Templated reports — you are just filling slots.
Rule: thinking costs money. Turn it on only when accuracy gains justify it.
11-3. The routing pattern
request arrives
│
▼
complexity classifier (cheap fast model: Haiku, Flash, 4o-mini)
│
├── "simple" → fast non-reasoning model (immediate)
├── "medium" → reasoning model, low budget
└── "hard" → reasoning model, high budget or pro/Heavy
This is the default shape of production AI systems in 2026. Sending every request to a reasoning model is cost-and-latency suicide.
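A minimal sketch of that router, with placeholder model names and budgets; the classifier is any call to a cheap, fast model (Haiku / Flash / 4o-mini class) that returns a difficulty label:

```python
def route(request: str, classify) -> dict:
    """Route by difficulty before spending any thinking tokens.
    classify(request) -> 'simple' | 'medium' | 'hard' via a cheap model.
    Model names and budgets here are placeholders, not recommendations."""
    tier = classify(request)
    if tier == "simple":
        return {"model": "fast-non-reasoning", "thinking_budget": 0}
    if tier == "medium":
        return {"model": "reasoning", "thinking_budget": 4_000}
    return {"model": "reasoning-pro", "thinking_budget": 32_000}  # hard
```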
12. Accuracy / cost / latency — the triangle
The same task at the same accuracy is a different system if cost and latency differ.
12-1. Three axes
            accuracy
               ▲
              ╱ ╲
             ╱   ╲  ← Pareto frontier
            ╱     ╲
   cost ◀──────────▶ latency
(expensive)          (slow)
Pareto frontier: gaining on one axis costs you on another. o3-pro buys accuracy, paying in both cost and latency. Self-hosted R1 buys cost. Haiku/Flash buy latency.
12-2. Which point to buy
| Product shape | Recommended point |
|---|---|
| Interactive chat (sub-second) | Non-reasoning model or minimal thinking |
| Async agent (minutes OK) | Thinking medium / high |
| Batch analysis (overnight OK) | Highest-accuracy model, optimize cost only |
| On-prem / regulated | Open-weights (R1, QwQ) |
| High-stakes one-off decision | Pro / Heavy / Deep Think |
12-3. Dynamic budget — graduated thinking
Advanced pattern: escalate budget on failure.
1. Try thinking 2k
2. Self-consistency: is the answer stable across reruns?
3. If stable → done
4. If unstable → retry at 4k
5. Still unstable → 16k or a different model
This escalation pattern lowers average cost a lot — easy problems stay cheap, only hard ones go expensive.
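A sketch of that escalation loop; ask stands in for any reasoning-model call that accepts a thinking budget:

```python
from collections import Counter

def graduated_answer(ask, question: str,
                     budgets=(2_000, 4_000, 16_000), runs: int = 3):
    """Escalate the thinking budget only when answers disagree.
    ask(question, budget) is any call into a reasoning model (sampled
    with temperature > 0 so reruns can differ). Unanimous answers
    return immediately at the cheap tier; only genuinely hard
    questions reach the expensive budgets."""
    for budget in budgets:
        answers = [ask(question, budget) for _ in range(runs)]
        top, count = Counter(answers).most_common(1)[0]
        if count == runs:  # self-consistent: stop escalating
            return top, budget
    return top, budget     # best guess at the largest budget
```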
13. The open-vs-closed reasoning ladder
The 2026 reasoning landscape on an open/closed axis:
Closed (closed-weights)
│
o3-pro · Opus 4.5 thinking · Deep Think · Grok 4 Heavy
│ ← "strongest" but expensive and gated
│
o3 · Sonnet 4.5 thinking · Gemini 2.5 Pro · Grok 4
│ ← standard for general work
│
o4-mini · Gemini 2.5 Flash · Grok 3 thinking
│ ← fast reasoning
│
─────────┼─────────────────────────── price / latency
│
QwQ-Plus · Qwen3 reasoner
│
DeepSeek R1-0528 · V3.1 reasoner
│
Open (open-weights, self-hostable)
Why pick open
- Data must not leave — healthcare, finance, defense, government.
- High-volume repeats — token-cost goes to zero.
- Further fine-tuning — domain adaptation is possible.
- Reproducibility and audit — the weights make every decision traceable.
Why pick closed
- Top-tier performance — for some tasks 1–3 percentage points decide.
- Operations outsourced — hosting, updates, safety.
- Multimodal integration — image, video, audio, tools in one API.
- Frontier rotation — instant access to the latest.
Reality in 2026: serious shops run both. Sensitive data goes to open self-host; general public-OK work goes to closed APIs. Routing is the hardest decision.
14. Working with reasoning models — practical tips
14-1. Keep the prompt short, keep context rich
A reasoning model's job is to think to itself. Forcing "step 1, step 2, ..." in the prompt gets in its way. State the goal clearly, state the constraints clearly, and leave the structure to the model.
14-2. Drop CoT directives
"Think step by step" helped non-reasoning models. In reasoning models, that already happens inside the thinking block, and adding the directive duplicates work or can even shorten the internal reasoning. Remove it.
14-3. Tool use differs by family
- o3 / o4, Sonnet 4.5, Gemini 2.5 Pro: interleaved thinking — tool results merge smoothly into reasoning.
- R1, QwQ: weaker tool integration. Wrap with an external ReAct loop.
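For the R1/QwQ case, a minimal external ReAct wrapper might look like this. The ACTION/FINAL output convention is an assumption you would have to spell out in the system prompt, not a model feature:

```python
import json

def react_loop(ask, tools: dict, question: str, max_steps: int = 5) -> str:
    """Minimal external ReAct wrapper for models without native tool use.
    Assumed prompt convention: each turn the model emits either
    'ACTION: {"tool": ..., "args": {...}}' or 'FINAL: <answer>'."""
    transcript = question
    for _ in range(max_steps):
        out = ask(transcript)
        if out.startswith("FINAL:"):
            return out[len("FINAL:"):].strip()
        if out.startswith("ACTION:"):
            call = json.loads(out[len("ACTION:"):])
            result = tools[call["tool"]](**call.get("args", {}))
            # feed the observation back so the model can keep reasoning
            transcript += f"\n{out}\nOBSERVATION: {result}"
        else:
            transcript += f"\n{out}"
    return "no answer within step budget"
```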
14-4. Self-consistency
Run the same question N times and majority-vote. Especially powerful with reasoning models: cost goes up Nx, but accuracy moves measurably. Useful for high-stakes decisions in domains like medicine and finance.
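A sketch of the voting step; the vote share doubles as a cheap confidence signal worth logging:

```python
from collections import Counter

def self_consistent(ask, question: str, n: int = 5):
    """Majority vote over n independent samples (temperature > 0).
    Returns the winning answer and its vote share; a low share is a
    signal to escalate rather than trust the result."""
    votes = Counter(ask(question) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n
```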
14-5. Log thinking traces — where you can
Visible-CoT models (R1, QwQ, Claude) return the trace; log it. It is a goldmine for debugging, improvement, and evaluation. But do not show raw thinking to end users verbatim — wrong hypotheses can read as facts.
14-6. Use caching
If the system prompt is long, thinking happens on top of it. Prompt caching (Anthropic, OpenAI, Gemini) can cut input cost up to 90%. Thinking tokens themselves are not cached — they regenerate every call.
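With Anthropic's API, caching is opt-in per content block via cache_control (OpenAI and Gemini handle caching differently); a sketch, with an assumed model id:

```python
def cached_system_request(system_text: str, user_text: str) -> dict:
    """Mark a long, stable system prompt as cacheable (Anthropic-style
    cache_control). Later calls reusing the same prefix pay the reduced
    cached-input rate. Thinking tokens are regenerated and billed in
    full on every call regardless."""
    return {
        "model": "claude-sonnet-4-5",  # assumed model id; check the docs
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }],
        "messages": [{"role": "user", "content": user_text}],
    }
```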
Epilogue — two lines, then what's next
Two-line summary:
- Reasoning models are not universally better — they dominate only where verifiers are strong.
- In 2026, the decision is not "which model" but "which model x which thinking mode x which routing."
12-point checklist
- Do you decide per-task whether to turn thinking on?
- Do you scale thinking budget to task difficulty?
- Is there a router (cheap classifier + expensive reasoning model)?
- Do you use self-consistency for high-stakes decisions?
- Does your cost model reflect that thinking counts as output tokens?
- Does your tool-use design lean on interleaved thinking properly?
- Do you log reasoning traces from visible-CoT models?
- Do you have your own eval suite (not vendor benchmarks alone)?
- Have you evaluated open-weights options (for regulated or high-volume)?
- Have you cut input cost with prompt caching?
- Did you remove "think step by step" from reasoning-model prompts?
- Have you blocked raw thinking from leaking into user UI?
Ten anti-patterns
- Reasoning model on every request — cost-and-latency suicide.
- Forced CoT in prompts — counter-productive for reasoning models.
- Defaulting thinking budget to max — billing time bomb.
- Trusting vendor benchmarks only — not your tasks.
- Showing raw reasoning to end users — wrong guesses look like facts.
- Self-consistency everywhere — Nx cost.
- Picking either open or closed only — routing is the answer.
- Not monitoring thinking tokens — cost tracking is impossible.
- Sensitive data through external reasoning APIs — compliance violation.
- Reasoning models in raw chat UX — no one waits a minute.
Next post candidates
- Building a reasoning-model eval suite — measuring thinking on your data
- Agents x reasoning — patterns for tool use and thinking together
- Self-hosting open reasoning models — vLLM vs SGLang vs TGI
"Not bigger models, but models that think better — and then, models that know when not to think."
— Reasoning models 2026 guide, end.
References
- OpenAI, "Learning to reason with LLMs (o1)" — https://openai.com/index/learning-to-reason-with-llms/
- OpenAI, "Introducing o3 and o4-mini" — https://openai.com/index/introducing-o3-and-o4-mini/
- OpenAI, "OpenAI o3-mini" — https://openai.com/index/openai-o3-mini/
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (arXiv:2501.12948) — https://arxiv.org/abs/2501.12948
- DeepSeek, "DeepSeek-R1-0528 release notes" — https://api-docs.deepseek.com/news/news250528
- Anthropic, "Claude's extended thinking" — https://www.anthropic.com/news/visible-extended-thinking
- Anthropic, "Claude Sonnet 4.5" — https://www.anthropic.com/news/claude-sonnet-4-5
- Anthropic, "Extended thinking docs" — https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking
- Google DeepMind, "Gemini 2.5: Our most intelligent AI model" — https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
- Google DeepMind, "Try Deep Think in the Gemini app" — https://blog.google/products/gemini/gemini-2-5-deep-think/
- Alibaba Qwen, "QwQ-32B: Reflect deeply on the boundaries of the unknown" — https://qwenlm.github.io/blog/qwq-32b-preview/
- Alibaba Qwen, "QwQ-Plus / Qwen3 reasoning" — https://qwenlm.github.io/blog/qwen3/
- xAI, "Grok 3 Beta" — https://x.ai/news/grok-3
- xAI, "Grok 4 and Grok 4 Heavy" — https://x.ai/news/grok-4
- Kimi/Moonshot, "Kimi k1.5: Scaling RL with LLMs" (RLVR comparison) — https://arxiv.org/abs/2501.12599
- ARC Prize, "ARC-AGI-1 Leaderboard" — https://arcprize.org/
- SWE-bench Verified leaderboard — https://www.swebench.com/
- LiveCodeBench — https://livecodebench.github.io/
- HLE (Humanity's Last Exam) — https://lastexam.ai/
- AIME 2024/2025 discussion — https://artofproblemsolving.com/community/c3416_2024_aime_i
- Lilian Weng, "Why we think" — https://lilianweng.github.io/posts/2025-05-01-thinking/