Reasoning Models in 2026 — A Deep Dive on o3, o4, DeepSeek R1, Claude Thinking, Gemini Deep Think, and QwQ


Prologue — "Think longer, score higher"

In September 2024, OpenAI shipped o1-preview. The model itself was not large. What was new was one thing: before it answers, it talks to itself for a while.

Earlier LLMs played "predict the next token." o1 added one move. It generates a pile of hidden chain-of-thought tokens, refines its reasoning there, and only then emits the answer. Spend more tokens — that is, "think" longer — and the answer gets closer to correct. That is test-time compute scaling.

This one-line idea redrew the model landscape across 2025–2026. o3 went GA, DeepSeek R1 reproduced the same curve in open weights, Anthropic baked "extended thinking" into Sonnet/Opus 4.5 as a per-request toggle, and Google shipped Gemini 2.5 Pro and Deep Think to GA. Alibaba's QwQ and QwQ-Plus created the second large open-weights reasoning stream.

2024's question: "Which model do we use?" 2026's question: "For this task, do we turn thinking on or off — and how much?"

This post lays out where reasoning models stand in 2026. Six families across thinking behavior, benchmarks, and price, on one page. And it answers the question that actually matters in production: when does a reasoning model earn its cost, and when does a fast non-thinking model win?


1. What is test-time compute?

Classic LLM scaling rode three axes.

Axis              Meaning
────────────────  ─────────────────────
Parameters        Make the model larger
Training data     Feed it more
Training compute  Train it longer

o1 added a fourth: test-time compute. Spend more tokens at inference and accuracy goes up.

       accuracy
  R1 ────│              ╱── thinking ON
         │           ╱
  base ──│       ╱
         │   ╱──── thinking OFF (immediate)
         └────────────────────────▶ inference token budget

The curve differs per model and per problem. On math, code, and proofs — verifiable problems — it is steep. On creative writing, summarization, and chit-chat it is nearly flat — thinking longer barely helps.

What "thinking tokens" actually are

A reasoning model's "thinking" tokens are usually one of three forms.

  1. Hidden reasoning — o1, o3, o4. The user never sees the raw chain of thought; only summaries.
  2. Visible reasoning — DeepSeek R1, QwQ. Raw reasoning streamed inside a <think>...</think> block.
  3. Toggleable — Claude Sonnet/Opus 4.5 extended thinking. Per-request on/off with a token budget.

Hidden vs. visible is not just a UX choice. Visible reasoning is great for debugging, teaching, and building trust, but it invites imitation and distillation. The wave of distillation work that appeared the day R1's weights went public is exactly that risk materializing.
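With visible-CoT models, splitting the trace from the answer is straightforward. A minimal sketch in Python, assuming the R1/QwQ tag convention:

  import re

  def split_reasoning(raw: str) -> tuple[str, str]:
      # Separate the <think>...</think> block from the final answer.
      match = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
      thinking = match.group(1).strip() if match else ""
      answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
      return thinking, answer

  thinking, answer = split_reasoning("<think>2+2, carry the one...</think>4")
  # thinking == "2+2, carry the one...", answer == "4"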


2. RLVR — the recipe behind reasoning models

A reasoning model is a base model with two extra layers on top.

2-1. The ability to generate long CoT

First the model must be able to lay out a long chain of thought. Base models prefer short, confident answers. Long-CoT SFT teaches the habit of unspooling reasoning.

2-2. RLVR — Reinforcement Learning with Verifiable Rewards

The key is the second layer. RLVR uses rewards that can be checked automatically.

RLVR loop:
  1. Hand the model a problem (math, code, logic)
  2. The model emits a long CoT plus a final answer
  3. A verifier grades it:
     - Math: does the answer match?
     - Code: do the tests pass?
     - Formal reasoning: is the proof valid?
  4. Pass = +1, fail = 0 (or negative)
  5. Update with a policy gradient (PPO, GRPO, ...)
  6. Repeat

The point is the word "verifiable". RLHF (human feedback) is expensive and inconsistent. RLVR is graded by compilers, test runners, and math checkers — nearly free and perfectly consistent.
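To make the loop concrete, here is a minimal sketch of a verifier-based reward for math. The "Answer:" output format and the reward values are illustrative assumptions, not any lab's actual implementation:

  import re

  def extract_final_answer(completion: str) -> str | None:
      # Assumes the model is prompted to end with "Answer: <value>".
      match = re.search(r"Answer:\s*(.+)", completion)
      return match.group(1).strip() if match else None

  def math_reward(completion: str, gold_answer: str) -> float:
      # +1 if the checkable final answer matches, 0 otherwise.
      answer = extract_final_answer(completion)
      return 1.0 if answer == gold_answer else 0.0

  # Each sampled completion gets a reward; a policy gradient (PPO, GRPO, ...)
  # then pushes the model toward reasoning that ends in verified answers.
  rewards = [math_reward(c, "42") for c in ["...Answer: 42", "...Answer: 41"]]
  # rewards == [1.0, 0.0]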

The DeepSeek R1 paper (Jan 2025) was the shock: starting from the base model with no supervised cold-start data, RLVR alone produced R1-Zero. The model spontaneously discovered self-correction patterns like "wait, let me reconsider" — emergent reasoning, with no human teaching it that move.

Where RLVR pays off

Domain                   Verifier              RLVR impact
───────────────────────  ────────────────────  ────────────────────────────────
Math                     Answer match          Huge (AIME jumps)
Code                     Tests pass            Large (LiveCodeBench, SWE-bench)
Logic puzzles            Formal check          Large
Tool use                 Intended tool calls   Moderate
Writing / summarization  Needs humans          Small (weak verifier)
Safety / honesty         Human or model judge  Small (RLHF is a better fit)

So reasoning models are not universally better. They dominate only where verifiers are strong.


3. OpenAI — o3 / o3-pro / o4

OpenAI, the company that created the category, has the broadest lineup in 2026.

3-1. o3 (GA from 2025 Q2)

Announced with eval results in December 2024, GA in April 2025. It ships a reasoning-effort dial (low / medium / high) — same weights, different thinking budget. High effort can take minutes per response.

Notable behavior:

  • Uses tools during reasoning ("agentic reasoning") — searches the web, runs code interpreter, feeds results back into thinking.
  • Hidden CoT — users see summaries, not raw reasoning.
  • First widely available model to approach human performance on ARC-AGI (at high effort).
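The effort dial is a plain request parameter. A minimal sketch against the OpenAI Python SDK — the model id and prompt are placeholders, so verify against current docs:

  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  response = client.chat.completions.create(
      model="o3",                # assumption: check the current model id
      reasoning_effort="high",   # the dial: "low", "medium", or "high"
      messages=[{"role": "user", "content": "Find a counterexample to ..."}],
  )
  # Hidden CoT: only the final answer comes back, not the reasoning.
  print(response.choices[0].message.content)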

3-2. o3-pro

For genuinely hard problems. Same weights, run much longer. Costs roughly an order of magnitude more, responses can take minutes. Used for research, deep analysis, and complex debugging.

3-3. o4 / o4-mini

The next generation, late 2025. Multimodal reasoning (looks at images and diagrams while thinking) and tighter tool-use integration. o4-mini is fast yet posts coding numbers near o3 — the new default for coding workloads.

Model    Thinking             Tools in-loop  Strength
───────  ───────────────────  ─────────────  ──────────────────────────
o3       Hidden, 3-step dial  Yes            General reasoning, ARC-AGI
o3-pro   Hidden, very long    Yes            Truly hard problems
o4       Hidden, multimodal   Yes            Complex multi-step
o4-mini  Hidden, short        Yes            Coding, cost efficiency

4. DeepSeek — R1 / R1-0528 / V3.1 reasoner

The bomb that landed in the open-weights camp. When R1 dropped in January 2025, the industry stopped.

4-1. DeepSeek R1 (Jan 2025, MIT license)

  • 671B MoE (37B active). Base is V3.
  • Built with RLVR alone as R1-Zero, then a light SFT pass to ship R1.
  • Streams raw CoT inside <think>...</think> — heaven for debugging and research, a nightmare for closed labs (imitation risk).
  • Matches the o1 curve on AIME, MATH, and coding evals.
  • An order of magnitude cheaper than closed models.

4-2. R1-0528 (May 2025 update)

Same weight class, more RL on top. A real step up on complex coding and long-context reasoning. SWE-bench Verified moved meaningfully.

4-3. V3.1 reasoner (early 2026)

A unified model with thinking as a toggle on the V3.1 base — like Claude 4.5. One set of weights, thinking on/off; when on, it emits R1-style <think> blocks. The first time the open camp shipped toggleable reasoning.

Why DeepSeek matters: it proved reasoning models are not a closed-model monopoly. Anyone with 8 H100s can self-host. For regulated industries or on-prem requirements, this is effectively the default.


5. Anthropic — Claude Sonnet 4.5 / Opus 4.5 extended thinking

Anthropic took a different path. Not a separate model family — a state of the same model.

5-1. What extended thinking is

Sonnet 4.5 and Opus 4.5 ship a per-request toggle. The API call sets thinking with a token budget. The model spends up to that budget generating reasoning blocks, then emits the answer.

Request:
  thinking: { type: "enabled", budget_tokens: 16000 }

Response:
  - thinking block (up to budget)
  - final answer (assistant message)
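In code, with the Anthropic Python SDK (the model id is an assumption — check current docs):

  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

  response = client.messages.create(
      model="claude-sonnet-4-5",   # assumption: verify the current model id
      max_tokens=20_000,           # must be larger than the thinking budget
      thinking={"type": "enabled", "budget_tokens": 16_000},
      messages=[{"role": "user", "content": "Find the bug in this code: ..."}],
  )

  # The response interleaves thinking blocks with the final text answer.
  for block in response.content:
      if block.type == "thinking":
          print("[thinking]", block.thinking[:200], "...")
      elif block.type == "text":
          print("[answer]", block.text)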

5-2. Notable behavior

  • One set of weights, two modes — operationally simple.
  • Interleaved thinking — calls tools mid-reasoning and continues thinking after seeing results.
  • The thinking content arrives in the API response as text — it is not hidden. Prior turns' thinking blocks are automatically stripped from context, so they do not pile up across a conversation.
  • Strong on coding and SWE-bench Verified. Sonnet 4.5 with extended thinking is one of the strongest options for real PR automation.

5-3. Budget sizing intuition

Task                        Suggested budget
──────────────────────────  ────────────────
Instant-answer question     thinking off
One or two reasoning steps  2k–4k
Small code patch            8k–16k
Complex bug debugging       32k–64k
Math / proofs / research    64k or more

Rule: scale budget to difficulty. Do not turn thinking on by reflex.


6. Google — Gemini 2.5 Pro / Deep Think

Gemini 2.5 Pro shipped with reasoning baked into the general model from day one.

6-1. Gemini 2.5 Pro

  • Thinking defaults to ON with dynamic thinking — the model decides how long to think based on the problem.
  • One-million-token context plus thinking — strong for reasoning over long documents.
  • Multimodal — video, audio, and images participate in reasoning.
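Dynamic thinking is the default, but the budget can be pinned per request. A sketch with the google-genai Python SDK — the model id and budget are illustrative, and omitting thinking_config keeps dynamic thinking on:

  from google import genai
  from google.genai import types

  client = genai.Client()  # reads GEMINI_API_KEY from the environment

  response = client.models.generate_content(
      model="gemini-2.5-pro",  # assumption: verify the current model id
      contents="Reconcile these two contract clauses: ...",
      config=types.GenerateContentConfig(
          # Set an explicit budget to cap thinking, e.g. for latency control.
          thinking_config=types.ThinkingConfig(thinking_budget=8_192),
      ),
  )
  print(response.text)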

6-2. Deep Think (Gemini 2.5)

For genuinely hard work. Parallel thinking — multiple hypotheses are explored in parallel and merged. It made headlines at IMO 2025 by reaching human gold-medal level, and went GA in late 2025.

Model                  Thinking             Context  Strength
─────────────────────  ───────────────────  ───────  ───────────────────────────────
Gemini 2.5 Flash       Dynamic, short       1M       Fast reasoning, cost efficiency
Gemini 2.5 Pro         Dynamic, long        1M       General, multimodal
Gemini 2.5 Deep Think  Parallel, very long  1M       Hard math and proofs

7. Alibaba — Qwen QwQ / QwQ-Plus

The second big stream in open-weights reasoning — together with R1, one of the two pillars of the open camp.

  • QwQ-32B (Nov 2024) — an open 32B that approached o1-preview on reasoning. A shock.
  • QwQ-Plus (2025) — the next generation. A step up on both code and math.
  • Qwen3 reasoner — larger sizes, Apache 2.0.

Like R1, QwQ exposes visible CoT. Self-host friendly. Strong on Korean, Japanese, Chinese, and English — the preferred default for in-house use across Asia.


8. xAI — Grok 3 / 4 Heavy thinking

Grok 3 thinking, Grok 4, and Grok 4 Heavy all have thinking modes.

  • Grok 3 Thinking (early 2025) — long chain-of-thought mode. Trained heavily on X (Twitter) data, so strong on current events.
  • Grok 4 / 4 Heavy (late 2025) — Heavy runs multi-agent thinking, with several instances reasoning in parallel and merging. Top scores on extremely hard evals like HLE (Humanity's Last Exam).

Model            Thinking              Notes
───────────────  ────────────────────  ───────────
Grok 3 thinking  Partially visible     Live X data
Grok 4           Hidden, long          General
Grok 4 Heavy     Parallel multi-agent  HLE leader

9. Comparison matrix — one page

Benchmark numbers move with each release. The table below shows relative positioning as of early 2026.

9-1. Thinking behavior

Model                   Thinking form          Budget control     Tools in-thinking
──────────────────────  ─────────────────────  ─────────────────  ─────────────────
OpenAI o3               Hidden (summary only)  low/med/high       Yes
OpenAI o3-pro           Hidden, very long      Auto (very large)  Yes
OpenAI o4 / o4-mini     Hidden                 low/med/high       Yes
DeepSeek R1 / 0528      Visible (<think>)      Auto               Partial
DeepSeek V3.1 reasoner  Visible, toggleable    API toggle         Partial
Claude Sonnet 4.5       Visible, toggleable    Token budget       Yes (interleaved)
Claude Opus 4.5         Visible, toggleable    Token budget       Yes (interleaved)
Gemini 2.5 Pro          Hidden, dynamic        Dynamic auto       Yes
Gemini 2.5 Deep Think   Hidden, parallel       Dynamic auto       Yes
Qwen QwQ / QwQ-Plus     Visible (<think>)      Auto               Partial
Grok 4 / 4 Heavy        Hidden / parallel      Mode select        Yes

9-2. Benchmark position (early 2026, relative)

Model                AIME-style math  LiveCodeBench  SWE-bench Verified  Cost / latency
───────────────────  ───────────────  ─────────────  ──────────────────  ─────────────────────────
o3 (high)            Top-tier         Top-tier       Top-tier            Expensive, slow
o3-pro               Top-tier         Top-tier       Top-tier            Very expensive, very slow
o4-mini              High             High           High                Moderate, moderate
R1-0528              High             High           Top-tier-ish        Cheap (open), moderate
Sonnet 4.5 thinking  High             Top-tier       Top-tier            Moderate, moderate
Opus 4.5 thinking    Top-tier         Top-tier       Top-tier            Expensive, moderate
Gemini 2.5 Pro       High             High           High                Moderate, moderate
Deep Think           Top-tier (IMO)   High           High                Expensive, very slow
QwQ-Plus             High             High           Mid-to-high         Cheap (open), moderate
Grok 4 Heavy         Top-tier         High           High                Expensive, slow

Absolute numbers shift with each release and eval methodology. Decide with your own eval suite on your data, your tasks, and your SLAs.


10. Pricing and the thinking-token economy

Reasoning models price differently. Thinking tokens count as output tokens, and they typically run tens of times longer than the visible answer.

Request:  "find the bug in this code (200 tokens)"

Response: [thinking: 8,000 tokens]  ← billed as output
          [answer:   600 tokens]    ← billed as output

Total = input(200) + output(8,600)

The implication: thinking budget is the price tag. Turning thinking on for a tiny task can be 10–50× the normal cost.
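A back-of-the-envelope version of that bill, with hypothetical prices:

  # Hypothetical prices, for shape only — check your provider's sheet.
  input_tokens, thinking_tokens, answer_tokens = 200, 8_000, 600
  price_in  = 3.00 / 1_000_000   # $ per input token (illustrative)
  price_out = 15.00 / 1_000_000  # $ per output token (illustrative)

  cost = input_tokens * price_in + (thinking_tokens + answer_tokens) * price_out
  print(f"${cost:.4f}")  # $0.1296 — thinking tokens are ~93% of the bill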

10-1. Rough per-1M-output-token positioning

Prices change often. The table is for relative comparison — get exact figures from each provider.

Model                   Input /1M      Output /1M       Thinking in output?
──────────────────────  ─────────────  ───────────────  ───────────────────
o3                      Moderate-high  Very high        Yes
o3-pro                  Very high      Very, very high  Yes
o4-mini                 Low-moderate   Moderate         Yes
R1 (DeepSeek API)       Very low       Low              Yes
Sonnet 4.5 thinking     Moderate       High             Yes
Opus 4.5 thinking       High           Very high        Yes
Gemini 2.5 Pro          Moderate       High             Yes
Deep Think              High           Very high        Yes
QwQ-Plus (Alibaba API)  Very low       Low              Yes
Grok 4 Heavy            High           Very high        Yes

With open-weights models like R1 and QwQ, per-token cost effectively disappears once self-hosted — you pay only for infrastructure. For high-volume repeated work, the gap is enormous.

10-2. Budget guidelines

Task                           Suggested
─────────────────────────────  ────────────────────────────────────────
FAQ, summary, translation      Thinking off (use a non-reasoning model)
Short code snippet             Thinking off or minimal
Routine bug fix                Thinking low / 4k
Complex debug                  Thinking medium / 16k
Hard math / proofs             Thinking high / 64k+
Deep research, hard one-shots  o3-pro, Deep Think, Grok 4 Heavy

11. When you actually need a reasoning model

Reasoning models are not universal. There are clear times to turn it on, and more times to turn it off.

11-1. Reasoning models shine when

  1. Math, logic, proofs — multi-step reasoning is where value lives.
  2. Complex code changes — coherent edits across many files in a large repo. The SWE-bench shape.
  3. Agent planning — figuring out which tool to call in what order on a new task.
  4. Debugging — hypothesizing, gathering evidence, falsifying.
  5. Research and analysis — surfacing trade-offs, counter-examples, and falsifiable claims.
  6. One-shot exam-like questions — IMO, AIME, HLE, where you must be right the first time.

11-2. Reasoning models cost you when

  1. Instant factual lookup — no reason to spend 16k thinking tokens on "what's the date?"
  2. High-volume classification or tagging — cost multiplies per item.
  3. Latency-sensitive chat UX — thinking is slow; users leave.
  4. Creative writing — verifier is weak; a general model is more varied and natural.
  5. Casual or emotional dialogue — overthinking reads as awkward.
  6. Templated reports — you are just filling slots.

Rule: thinking costs money. Turn it on only when accuracy gains justify it.

11-3. The routing pattern

request arrives
  │
  ▼
complexity classifier (cheap fast model: Haiku, Flash, 4o-mini)
  ├── "simple" → fast non-reasoning model (immediate)
  ├── "medium" → reasoning model, low budget
  └── "hard"   → reasoning model, high budget or pro/Heavy

This is the default shape of production AI systems in 2026. Sending every request to a reasoning model is cost-and-latency suicide.
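A minimal router sketch. In production the classifier is a cheap LLM call; here a keyword stub stands in for it, and the model names are placeholders, not real API identifiers:

  # Keyword stub for the complexity classifier — in production, replace
  # with a call to a cheap fast model (Haiku, Flash, 4o-mini).
  HARD_HINTS = ("prove", "debug", "refactor", "optimize")
  MEDIUM_HINTS = ("why", "explain", "compare", "fix")

  def classify_complexity(request: str) -> str:
      text = request.lower()
      if any(h in text for h in HARD_HINTS):
          return "hard"
      if any(h in text for h in MEDIUM_HINTS):
          return "medium"
      return "simple"

  def route(request: str) -> dict:
      # Map a complexity label to a model choice and thinking budget.
      label = classify_complexity(request)
      if label == "simple":
          return {"model": "fast-model", "thinking_budget": 0}
      if label == "medium":
          return {"model": "reasoning-model", "thinking_budget": 4_000}
      return {"model": "reasoning-model-pro", "thinking_budget": 64_000}

  print(route("What's the capital of France?"))  # → fast path, no thinking
  print(route("Debug this race condition ..."))  # → high budget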


12. Accuracy / cost / latency — the triangle

The same task at the same accuracy is a different system if cost and latency differ.

12-1. Three axes

         accuracy ▲
              ╱│╲
             ╱ │ ╲    ← Pareto frontier
            ╱  │  ╲
   ────────●───┼───●─────
         cost      latency

Pareto frontier: gaining one axis costs another. o3-pro buys accuracy only. R1 self-host buys cost. Haiku/Flash buy latency.

12-2. Which point to buy

Product shape                  Recommended point
─────────────────────────────  ──────────────────────────────────────────
Interactive chat (sub-second)  Non-reasoning model or minimal thinking
Async agent (minutes OK)       Thinking medium / high
Batch analysis (overnight OK)  Highest-accuracy model, optimize cost only
On-prem / regulated            Open-weights (R1, QwQ)
High-stakes one-off decision   Pro / Heavy / Deep Think

12-3. Dynamic budget — graduated thinking

Advanced pattern: escalate budget on failure.

1. Try thinking 2k
2. Self-consistency: is the answer stable across reruns?
3. If stable → done
4. If unstable → retry at 4k
5. Still unstable → 16k or a different model

This escalation pattern substantially lowers average cost — easy problems stay cheap; only the hard ones get expensive.
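A minimal sketch of the loop, where ask(question, budget) stands in for your reasoning-model call:

  from collections import Counter

  def stable_answer(ask, question: str, budget: int, runs: int = 3) -> str | None:
      # Majority vote across reruns: stable if at least 2 of 3 samples agree.
      votes = Counter(ask(question, budget) for _ in range(runs))
      best, count = votes.most_common(1)[0]
      return best if count >= 2 else None

  def graduated(ask, question: str) -> str:
      for budget in (2_000, 4_000, 16_000):
          answer = stable_answer(ask, question, budget)
          if answer is not None:
              return answer         # stable at this budget → stop here
      return ask(question, 64_000)  # still unstable → max budget (or another model)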


13. The open-vs-closed reasoning ladder

The 2026 reasoning landscape on an open/closed axis:

        Closed (closed-weights)
o3-pro · Opus 4.5 thinking · Deep Think · Grok 4 Heavy
         │   ← "strongest" but expensive and gated
   o3 · Sonnet 4.5 thinking · Gemini 2.5 Pro · Grok 4
         │   ← standard for general work
   o4-mini · Gemini 2.5 Flash · Grok 3 thinking
         │   ← fast reasoning
─────────┼─────────────────────────── price / latency
   QwQ-Plus · Qwen3 reasoner
   DeepSeek R1-0528 · V3.1 reasoner
        Open (open-weights, self-hostable)

Why pick open

  • Data must not leave — healthcare, finance, defense, government.
  • High-volume repeats — token-cost goes to zero.
  • Further fine-tuning — domain adaptation is possible.
  • Reproducibility and audit — owning the weights keeps behavior reproducible and auditable.

Why pick closed

  • Top-tier performance — for some tasks 1–3 percentage points decide.
  • Operations outsourced — hosting, updates, safety.
  • Multimodal integration — image, video, audio, tools in one API.
  • Frontier rotation — instant access to the latest.

Reality in 2026: serious shops run both. Sensitive data goes to open self-host; general public-OK work goes to closed APIs. Routing is the hardest decision.


14. Working with reasoning models — practical tips

14-1. Keep the prompt short, keep context rich

A reasoning model's job is to think to itself. Forcing "step 1, step 2, ..." in the prompt gets in its way. State the goal clearly, state the constraints clearly, and leave the structure to the model.

14-2. Drop CoT directives

"Think step by step" helped non-reasoning models. In reasoning models, that already happens inside the thinking block. Adding the directive duplicates or even shortens reasoning. Remove it.

14-3. Tool use differs by family

  • o3 / o4, Sonnet 4.5, Gemini 2.5 Pro: interleaved thinking — tool results merge smoothly into reasoning.
  • R1, QwQ: weaker tool integration. Wrap with an external ReAct loop — see the sketch below.
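A minimal external wrapper, assuming the model is prompted to emit ACTION: <tool> <arg> or FINAL: <answer> lines — that protocol is this sketch's assumption, not R1's or QwQ's native format:

  def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
      transcript = question
      for _ in range(max_steps):
          step = llm(transcript)
          if step.startswith("FINAL:"):
              return step.removeprefix("FINAL:").strip()
          if step.startswith("ACTION:"):
              name, _, arg = step.removeprefix("ACTION:").strip().partition(" ")
              result = tools[name](arg)  # run the tool outside the model
              transcript += f"\n{step}\nOBSERVATION: {result}"
          else:
              transcript += f"\n{step}"  # plain thought, keep going
      return "max steps exceeded"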

14-4. Self-consistency

Run the same question N times and majority-vote — especially powerful with reasoning models. Cost goes N×, but accuracy moves measurably. Useful for high-stakes domains like medicine and finance.
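A compact version that also returns the agreement rate as a crude confidence signal — ask is again a placeholder for your model call:

  from collections import Counter

  def self_consistent(ask, question: str, n: int = 5) -> tuple[str, float]:
      # Sample n answers, return the majority and how many runs agreed.
      votes = Counter(ask(question) for _ in range(n))
      answer, count = votes.most_common(1)[0]
      return answer, count / n  # e.g. ("42", 0.8) → 4 of 5 runs agreed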

14-5. Log thinking traces — where you can

Visible-CoT models (R1, QwQ, Claude) log the trace. It is a goldmine for debugging, improvement, and evaluation. But do not show raw thinking to end users verbatim — wrong hypotheses can read as facts.

14-6. Use caching

If the system prompt is long, thinking happens on top of it. Prompt caching (Anthropic, OpenAI, Gemini) can cut input cost by up to 90%. Thinking tokens themselves are not cached — they regenerate every call.
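With the Anthropic SDK, caching means marking the stable prefix. A sketch building on the earlier extended-thinking example; field names follow the current Messages API, but verify against the docs:

  import anthropic

  client = anthropic.Anthropic()
  LONG_SYSTEM_PROMPT = "..."  # thousands of tokens, identical on every call

  response = client.messages.create(
      model="claude-sonnet-4-5",  # assumption, as in the earlier sketch
      max_tokens=20_000,
      thinking={"type": "enabled", "budget_tokens": 16_000},
      system=[{
          "type": "text",
          "text": LONG_SYSTEM_PROMPT,
          "cache_control": {"type": "ephemeral"},  # mark the prefix cacheable
      }],
      messages=[{"role": "user", "content": "..."}],
  )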


Epilogue — two lines, then what's next

Two-line summary:

  1. Reasoning models are not universally better — they dominate only where verifiers are strong.
  2. In 2026, the decision is not "which model" but "which model × which thinking mode × which routing."

12-point checklist

  1. Do you decide per-task whether to turn thinking on?
  2. Do you scale thinking budget to task difficulty?
  3. Is there a router (cheap classifier + expensive reasoning model)?
  4. Do you use self-consistency for high-stakes decisions?
  5. Does your cost model reflect that thinking counts as output tokens?
  6. Does your tool-use design lean on interleaved thinking properly?
  7. Do you log reasoning traces from visible-CoT models?
  8. Do you have your own eval suite (not vendor benchmarks alone)?
  9. Have you evaluated open-weights options (for regulated or high-volume)?
  10. Have you cut input cost with prompt caching?
  11. Did you remove "think step by step" from reasoning-model prompts?
  12. Have you blocked raw thinking from leaking into user UI?

Ten anti-patterns

  1. Reasoning model on every request — cost-and-latency suicide.
  2. Forced CoT in prompts — counter-productive for reasoning models.
  3. Defaulting thinking budget to max — billing time bomb.
  4. Trusting vendor benchmarks only — not your tasks.
  5. Showing raw reasoning to end users — wrong guesses look like facts.
  6. Self-consistency everywhere — N× the cost.
  7. Picking either open or closed only — routing is the answer.
  8. Not monitoring thinking tokens — cost tracking is impossible.
  9. Sensitive data through external reasoning APIs — compliance violation.
  10. Reasoning models in raw chat UX — no one waits a minute.

Next post candidates

  • Building a reasoning-model eval suite — measuring thinking on your data
  • Agents × reasoning — patterns for tool use and thinking together
  • Self-hosting open reasoning models — vLLM vs SGLang vs TGI

"Not bigger models, but models that think better — and then, models that know when not to think."

— Reasoning models 2026 guide, end.

