💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction — Why This Debate Is Blowing Up Right Now

In June 2026, a single tweet from Rich Sutton, the living legend of reinforcement learning (RL), took over the discussion threads of GeekNews and Hacker News. The gist is provocative.

> Generative AI trained by supervised learning to imitate human text is, in essence, an imitation model. Imitation is recombination of what is already known, so genuinely new scientific discovery is unlikely to come from that mechanism.

The timing was exquisite. 2026 is the year AI coding agents became ubiquitous, a generation of frontier models capable of hours of autonomous work arrived, and headlines saying "an LLM solved a hard math problem" appear on a regular cycle. At the very moment everyone is saying "LLMs will soon do science," a grandmaster of the RL camp declared head-on: not by that road.

Sutton is no mere critic. He is a 2024 Turing Award laureate, co-author of the canonical RL textbook, and above all the author of one of the most-cited essays in AI history, The Bitter Lesson (2019). The fascinating twist: many people have read the Bitter Lesson as "scaling is everything — therefore LLMs are right," while the author himself is critical of the current LLM paradigm.

In this post we reconstruct the claim precisely, connect it to the Bitter Lesson, and pin down the essential difference between imitation learning and reinforcement learning. Then we review the counterarguments (LLMs can discover too) and actual cases of scientific-discovery AI, and close with the practical lessons for developers building agents.

The Core of the Claim — The Ceiling of Imitation

Reconstructing the argument step by step:

1. **The training objective of an LLM is next-token prediction.** Mimicking the distribution of human-written text is the entire objective function.

2. **That is, by definition, imitation.** What the model is good at is producing plausible continuations within the training distribution.

3. **Scientific discovery lives outside the distribution.** A new theory is not a plausible continuation of existing text; it is a claim that contradicts existing consensus while agreeing with the world.

4. **There is no contact with the world.** An imitation model receives no direct feedback from the world about whether its output is right. It sees the world only through the secondary source of human-written text.

5. **Therefore only a system that learns from experience — an RL-style agent that acts with goals, observes outcomes, and corrects itself — can make real discoveries.**

Sutton has been packaging this view for several years under the phrase "The Era of Experience." The era of human data has hit its ceiling, and the next stage is the era in which agents learn from experience data they generate themselves.

The core intuition as a diagram:

The imitation worldview The experience worldview

[corpus of human text] [world / environment / simulator]

| ^ |

v | action | reward/observation

[next-token prediction model] | v

| [agent policy]

v |

"plausible text" v

(in-distribution interpolation) "actions validated by the world"

(out-of-distribution discovery possible)

Revisiting the Bitter Lesson — and the Part People Misread

The message of the Bitter Lesson (2019) fits in two sentences.

1. Across 70 years of AI history, approaches that hand-carve human knowledge into systems always win short-term and lose long-term.

2. What wins long-term are general methods that absorb growing computation as-is — namely **search and learning**.

Here is the frequently misread part. Many have compressed the essay into "scale and you win," citing it as the founding document of LLM scaling. But the original emphasizes two general methods, and one of them is search. The 2026 statement is not self-contradiction; it is closer to a re-emphasis of the original. From his standpoint, the current LLM looks like this:

- Learning has scaled. (Pass)

- But the source of that learning is human text — a finite, second-hand resource. (A refined variant of the old trap of injecting human knowledge)

- Search — the axis of interacting with the world to create new data — is still weak. (Fail)

In other words, to Sutton, LLMs may not be the winner of the Bitter Lesson but the most gigantic version ever of the "carve in human knowledge" approach. That reading is the real crux of the June 2026 debate.

Imitation Learning vs Reinforcement Learning — The Essential Difference

Comparison Table

| Axis | Imitation learning (supervised/SSL) | Reinforcement learning (RL) |

| --- | --- | --- |

| Data source | Fixed corpus made by humans | Experience generated by the agent |

| Objective | Distribution matching (next-token prediction) | Reward maximization |

| Criterion of truth | Did humans write it that way | Did it work in the world |

| Out-of-distribution behavior | Trained to avoid | Can be encouraged via exploration bonuses |

| Data limits | Ceiling when the corpus is exhausted | Unbounded, as far as the environment allows |

| Failure mode | Plausible nonsense (hallucination) | Reward hacking, exploding exploration cost |

| Signature wins | GPT family, translation, summarization | AlphaGo, AlphaZero, robot control |

Interpolation and Exploration — Why the Difference Is Essential

The generalization of an imitation model is essentially interpolation on the manifold defined by the training distribution. The surprise is that this manifold is far broader than expected: interpolation alone can produce "combinations never seen before." A Kubernetes incident report written in the style of Shakespeare exists nowhere in the corpus, yet an LLM produces it with ease.

The problem is that scientific discovery sometimes demands not combinational freshness but a **break with the distribution**. Heliocentrism, relativity, continental drift were claims of vanishingly low likelihood under the text distribution of their day. To a model that perfectly imitates all contemporary text, such claims are by definition anomalous output.

What makes RL different is that its criterion of truth is not the distribution but the reward. Move 37 of AlphaGo (game two) is the emblem. Under the distribution of human game records, the probability of that move was estimated around one in ten thousand, yet the accumulated experience of self-play discovered that the move wins. A move that looks like a mistake by the standard of human distribution was the better move by the standard of the world (the rules of Go).

distribution of human moves where move 37 sits

-------------------------- ----------------------

high probability ████

medium ██████

low █ <------- here (nearly unselectable for pure imitation)

but self-play value estimate: top contribution to win rate <-- RL follows this signal

An Important Caveat — There Was No AlphaGo Without Pretraining

To be fair, the original AlphaGo started with **imitation learning** on human game records. Imitation built a reasonable initial policy, and RL explored on top of it. AlphaZero, the pure self-play system, came a generation later. The historical fact is less "imitation vs RL" than "RL broke through the ceiling standing on the floor that imitation laid." This point becomes important again in the counterarguments below.

The Counterarguments — Claims That LLMs Can Discover

The pushback against Sutton is formidable. The main lines:

Counterargument 1: Combinational generalization is also discovery

Many discoveries in the history of science were not creation from nothing but new connections between existing concepts. Darwin connected Malthusian population theory to biology; Schroedinger connected the wave equation to quanta. If discovery means "the ability to connect concepts across fields," an LLM has a broader reading range than any individual human. Even within in-distribution interpolation, the distribution of all human text contains an astronomical number of combinations nobody has explicitly connected yet.

Counterargument 2: Move 37 was made by RL — and LLMs have already merged with RL

The frontier models of 2026 are no longer pure imitation models. Beyond RLHF, reinforcement learning on verifiable rewards (RLVR) has become the standard recipe for reasoning models. Train the model on objective rewards — correct math answers, passing code tests, formal proof checkers — and it discovers solution paths absent from human text. The "aha moment" of the DeepSeek-R1 line (the emergence of spontaneous self-correction behavior) is the canonical example. So the critique is valid against pure pretrained models, but the systems actually deployed today are hybrids that have already partially adopted his prescription: experience and reward.

Counterargument 3: Real discovery cases are accumulating

- **FunSearch (2023)**: an evolutionary loop in which an LLM proposes programs and an evaluator scores them found constructions for the cap set problem better than anything humans knew. Widely cited as a genuinely new LLM-assisted result in mathematics.

- **AlphaGeometry (2024)**: a combination of a neural model and a symbolic reasoning engine reached gold-medalist level on International Mathematical Olympiad geometry problems.

- **The AlphaFold line**: changed the pace of experimental science on the hard problem of protein structure prediction, culminating in the 2024 Nobel Prize in Chemistry.

But look closely and these cases share a structure: **not the LLM or neural net alone, but a loop coupled to an external verifier** (evaluator, proof checker, physical experiment). This is a counterargument and simultaneously a partial concession to Sutton. What produced the discovery was not the imitation model itself, but a system that uses the imitation model as a proposer while the world — or its proxy, the verifier — does the grading.

Counterargument 4: Human scientists start with imitation too

The first three years of a PhD are, in effect, imitation learning. You read papers (absorbing the corpus), reproduce existing techniques (fine-tuning), and mimic the style of your advisor. Discovery grows out of that imitation. If the logic were "imitation, therefore no discovery," humans could not discover either. Imitation may be not the opposite of discovery but its precondition.

Reviewing the Scientific-Discovery AI Cases — What Exactly Do We Know

Both excitement and skepticism are easily exaggerated here, so let me distill only what can be said with reasonable confidence as of mid-2026.

| --- | --- | --- | --- |

A pattern emerges. Wherever the achievement is real, there is always a **fast, accurate verifier**: the win/loss of Go, the proof checker of mathematics, the experimental data of proteins. Conversely, in domains where the verifier is slow (clinical trials) or fuzzy (social-science theory), definitive results remain scarce relative to flashy demos.

The Definition Problem of Discovery — Why the Debate Spins Its Wheels

The debate often spins because the word discovery is used in at least three different senses.

1. **Level 1 — rediscovery of known answers**: the model independently derives results humans already know. Useful for benchmarks; not discovery.

2. **Level 2 — new answers to known problems**: better-than-human answers to concrete open problems like cap sets. FunSearch reached this. But humans supplied the problem definition and the scoring function.

3. **Level 3 — posing new problems, concepts, theories**: paradigm proposals that change what should be asked at all. Relativity-grade. There is no consensus that any AI has reached this.

The strongest reading of Sutton is "imitation models cannot reach level 3"; the weakest is "imitation alone struggles even at level 2." The evidence from the counterargument camp clusters mostly at level 2. The two sides are frequently fighting about different levels. Agreeing on this distinction first resolves half the argument.

Practical Implications — Design Exploration and Verification Loops into Your Agents

This debate is not philosophy; it bears directly on agent design today. For anyone building the coding agents and research agents of 2026, the lesson is crisp: **use the model as a proposer, and embed a verifier in the system.**

The Basic Propose-Verify-Improve Loop

propose_verify_loop.py — the basic pattern for embedding experience learning in an agent

def discovery_loop(task, llm, verifier, budget):

best = None

history = [] # the "experience" of the agent

for step in range(budget):

candidates = llm.propose(

task=task,

history=summarize(history), # past attempts and failure reasons as context

diversity=temperature_schedule(step), # control exploration strength

)

for cand in candidates:

score, feedback = verifier.evaluate(cand) # proxy for the world

history.append((cand, score, feedback))

if best is None or score > best.score:

best = Result(cand, score)

if verifier.is_solved(best):

break

return best, history

This simple loop is the shared skeleton of FunSearch, the test-time search of reasoning models, and the "iterate until the tests pass" of coding agents. There are four design points.

1. **The quality of the verifier sets the ceiling.** For a coding agent the verifier is the test suite, the type checker, the linter. A flimsy verifier (poor test coverage) teaches the agent reward hacking — junk code that merely passes the tests. The old lesson of RL replays itself verbatim in the agent era.

2. **Feed failure history back as context.** The moment "why it failed" becomes input for the next proposal — not mere retrying — the system starts learning from experience rather than imitation.

3. **Manage the diversity schedule explicitly.** Raise temperature early to explore, lower it late to converge: the exploration-exploitation balance of RL applies unchanged to LLM loops.

4. **Store experience as an asset.** The history is not disposable. Accumulated attempt-outcome pairs are the training data for the next fine-tune — the experience data Sutton talks about.

The Value of Simulators — If You Cannot Buy a Verifier, Build One

The verifier spectrum (the further left, the stronger the agent loop)

instant/exact slow/fuzzy

|----------|------------|------------|------------|

compiler unit tests simulators human review real-world

type checker property (physics/ A/B tests experiments

tests economics) (clinical etc.)

The more expensive real-world experiments are in your domain, the more a simulator becomes a strategic asset. Discovery-style agents will start working first in domains where simulators are accurate enough — proteins (structure predictors), circuits (SPICE), fluids (CFD), economic policy (agent-based simulation). Put the other way: if you want AI to make discoveries in your domain, your first investment is probably not a bigger model but a better verifier and simulator.

The Research Landscape — Two Camps Converging

The research landscape of mid-2026 looks less like a war of "imitation camp vs RL camp" and more like convergence.

- **The LLM camp is absorbing RL**: RLVR, process reward models, and test-time search have become the standard stack. The relative share of pretraining is shrinking, while post-training and inference-time compute grow.

- **The RL camp is absorbing LLMs as priors**: research is active on using LLM priors to mitigate the sample inefficiency of pure RL — LLMs as policy initialization, reward design, exploration guides.

- **The remaining hard problems**: rewards that resist verification (what is a good theory), long horizons (the reward sparsity of a months-long research project), and safety (controlling systems that form and test their own hypotheses).

The role Sutton plays is to accelerate this convergence through criticism. The "human data ceiling" warning has become the most powerful narrative justifying investment in synthetic data, self-play-style environments, and experience-accumulation infrastructure.

The Perspective for Developers — Knowing the Limits of the Tool, and Using It

Practical takeaways for working developers:

1. **Do not expect "unverified novelty" from an LLM.** A new idea the model offers confidently is plausible within the distribution, not validated by the world. The fresher a proposal looks, the sooner you should budget its verification cost.

2. **Conversely, do not underrate the value of "broad imitation."** At connecting literatures, porting known techniques, and implementing baselines, LLMs are already superhuman. They dramatically cut the cost of everything upstream of discovery.

3. **Promote verifiers to first-class citizens in your pipeline.** Investment in tests, simulators, and evaluation functions depreciates more slowly than model upgrades. Models change every quarter; a good verifier lasts years.

4. **Write the exploration budget into your agent design docs.** How many candidates, at what diversity, until which stopping condition — these determine the discovery capacity of your agent. Exactly as the 2026 maxim goes: loop engineering replaces prompt engineering.

A Mini Exercise — Going Beyond Imitation with Verifiable Rewards

Here is a toy-scale way to experience the core idea of RLVR. In a small discovery task — finding the rule of an integer sequence — we compare pure sampling (imitation) with a verifier-coupled loop (experience).

tiny_rlvr_demo.py — a toy experiment showing the difference between imitation and a verification loop

def verifier(formula, examples):

"""The proxy for the world: does the candidate formula satisfy all examples"""

try:

return all(eval(formula, None, dict(n=n)) == y for n, y in examples)

except Exception:

return False

def imitation_only(llm, task, k=20):

"""Strategy A: sample k candidates once and stop (no-feedback imitation sampling)"""

candidates = [llm.sample(task) for _ in range(k)]

return [c for c in candidates if verifier(c, task.examples)]

def experience_loop(llm, task, budget=20):

"""Strategy B: iterate while feeding back the reason for failure (experience loop)"""

feedback = ""

for _ in range(budget):

cand = llm.sample(task, hint=feedback)

if verifier(cand, task.examples):

return cand

inject which example failed into the context of the next attempt

wrong = first_failing_example(cand, task.examples)

feedback = f"candidate {cand} failed on input {wrong}"

return None

Same model, same call budget — yet strategy B consistently finds harder rules. What makes the difference is not the model but the loop structure: the existence of a verifier and a feedback channel. I think of this as the entire debate compressed into five lines of code.

If you want to extend the experiment, try these variations.

1. Deliberately weaken the verifier (check only two examples). Reward hacking appears immediately: under a flimsy verifier the loop converges to a junk rule that fits only those two examples.

2. Turn the feedback into a cumulative history. You will observe both a regime where convergence speeds up versus one-shot feedback, and a regime where the lengthening context makes things worse. You will feel in your hands why context engineering is the keyword of 2026.

3. Add a schedule that lowers temperature step by step. The exploration-exploitation trade-off shows up clearly even at this tiny scale.

Parallel Lines of History — This Debate Is Not the First

"Imitation or search" is in fact roughly the third recurrence of this debate in AI history.

1997 Chess: heuristics from human games vs brute-force search (Deep Blue)

--> search won; but human knowledge survived in the evaluation function

2016 Go: imitating human records vs self-play RL (AlphaGo -> Zero)

--> started with imitation, transcended via RL, finally removed imitation

2026 Science: imitating human text vs experience/verification loops (ongoing)

--> ??? (the scene we are watching now)

The endings of the first two share a pattern. Human knowledge (imitation) was decisive as a bootstrap, but the final ceiling was always broken by search and experience. And each time, the camp saying "impossible without human knowledge" collided with the camp saying "human knowledge is the bias," and the answer was a staged hybrid. If the same pattern repeats in scientific discovery, both the current LLM skepticism and the optimism turn out to be only partially right.

One thing makes science decisively different from Go, though. The verifier of Go (win/loss judgment) was free; the verifier of science (experiments) is expensive and slow. Because of this asymmetry, the "AlphaZero moment of science" will likely arrive far more gradually than in Go, sequentially in the fields where verifiers are cheap — mathematics, code, simulable physics.

Frequently Asked Questions

**Q1. Is Sutton saying LLMs are useless?**

No. His claim is closer to a theory of limited use. Imitation models are superb at reorganizing known knowledge but structurally ill-suited as engines of new discovery. The accurate reading is a demand for balance in a regime where, of search and learning, only learning has hypertrophied.

**Q2. Do reasoning models trained with RLVR escape the critique?**

Only partially. RLVR introduces experience learning in domains where verifiable rewards exist (math, code), but those rewards live inside problems humans defined. It can reach level 2 (new answers to known problems), but level 3 (posing new problems) retains the fundamental issue that the reward cannot be defined.

**Q3. How does this relate to the AGI debate?**

Directly. The "scaling alone reaches AGI" position is the hypothesis that general intelligence emerges as an extension of imitation learning; the Sutton position is the hypothesis that the separate axis of experience-based learning is indispensable. That the roadmaps of frontier labs in 2026 are shifting weight toward post-training and agentic experience collection can be read as the industry effectively hedging toward the latter.

**Q4. From a career standpoint, what should developers prepare?**

The value of the ability to build verifiers is structurally rising. Skills like evaluation-function design, simulator construction, test infrastructure, and domain-specific benchmark building do not lose demand no matter how good models get. Many people can use models; people who can grade them remain scarce.

Infrastructure for the Era of Experience — What to Build Now

If you take the prescription seriously, the next bottleneck is not the model but the infrastructure for collecting, storing, and reusing experience. Here is a minimal setup worth designing now for an agent team.

An Experience Store Schema

-- experience_store.sql — a minimal schema for accumulating agent experience as a learning asset

CREATE TABLE episodes (

episode_id UUID PRIMARY KEY,

task_family TEXT NOT NULL, -- e.g. code_fix, theorem_search

task_spec JSONB NOT NULL, -- problem definition (must be reproducible)

agent_version TEXT NOT NULL, -- model+prompt+loop version

started_at TIMESTAMPTZ NOT NULL,

ended_at TIMESTAMPTZ

);

CREATE TABLE steps (

step_id BIGSERIAL PRIMARY KEY,

episode_id UUID REFERENCES episodes(episode_id),

action JSONB NOT NULL, -- proposed candidate / tool call

observation JSONB NOT NULL, -- verifier output, error messages

reward DOUBLE PRECISION, -- verifier score (NULL if absent)

created_at TIMESTAMPTZ DEFAULT now()

);

-- Crucial: store failures too. Failure cases are the learning signal for the next model.

CREATE INDEX idx_steps_reward ON steps (reward) WHERE reward IS NOT NULL;

Three points. First, do not discard failures: in RLVR-style post-training, wrong solution paths become material for contrastive learning. Second, enforce reproducibility at the schema level: experience accumulated without task_spec and agent_version cannot be used as training data. Third, include the reward column from day one: even if it is a heuristic score now, what matters is a structure you can later re-label with a more refined verifier.

A Verifier Portfolio Checklist

[ ] Instant verifiers: compile/type-check/lint — the first gate of the agent loop

[ ] Functional verifiers: unit/property/integration tests — coverage equals reward quality

[ ] Simulation verifiers: domain simulators — a cheap proxy for real-world experiments

[ ] Statistical verifiers: A/B, offline evaluation — slow but closest to the final verdict

[ ] Human verifiers: review/audit — the most expensive, so filter as much as possible upstream

[ ] Adversarial verifiers: reward-hacking detection — second-order checks that catch solutions which fool the verifier

The last item is the one most often missing. The moment a verifier becomes the reward, the loopholes of the verifier become the goal of the agent. Goodhart has been the old enemy of RL, and it is the operational risk of the agent era too.

The Research Landscape at a Glance

| --- | --- | --- | --- |

Closing — Imitation and Discovery Are Not Enemies

The debate in one sentence: **imitation can be the starting point of discovery, but completing a discovery requires grading by the world.**

I think the history of AlphaGo already showed the shape of the answer. It began by imitating human games (imitation), broke the ceiling through self-play (experience), and finally became stronger without imitation at all (AlphaZero). The relationship between LLMs and scientific discovery is likely to trace a similar arc. We are somewhere between the first and second stages right now.

The conclusion for developers is practical. Do not mistake the imitative competence of a model for evidence of discovery; but equally, do not ignore the fact that stacking a verification loop on top of that competence really does produce new things. The most productive reading of the provocation is not "abandon LLMs" but a demand to restore to our systems the missing half: experience and verification.

To finish, a checklist summary.

[ ] What is the verifier of our agent? Are we measuring its quality?

[ ] Does failure history feed back as context for the next attempt?

[ ] Is the exploration budget (candidate count, diversity, stop conditions) explicit?

[ ] Is experience (attempt-outcome pairs) accumulated in a reusable form?

[ ] Does a second-order check exist to catch reward hacking?

[ ] Do we consciously budget the verification cost of any "novel proposal"?

If you can answer yes to these six lines, your system is already moving away from being an imitation machine and toward being a small discovery machine.

References

- [The Rich Sutton tweet in question (June 2026)](https://twitter.com/RichardSSutton/status/2061216087744946656)

- [GeekNews — discussion of the Sutton claim on the limits of imitation models](https://news.hada.io/topic?id=30387)

- [The Bitter Lesson — Rich Sutton (2019)](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)

- [Rich Sutton homepage — Incomplete Ideas](http://www.incompleteideas.net/)

- [Reinforcement Learning: An Introduction (Sutton and Barto) official page](http://incompleteideas.net/book/the-book.html)

- [DeepMind — AlphaGo Zero: Starting from scratch](https://deepmind.google/discover/blog/alphago-zero-starting-from-scratch/)

- [DeepMind — FunSearch: making new discoveries in mathematical sciences using LLMs](https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/)

- [DeepMind — AlphaGeometry: an Olympiad-level AI system for geometry](https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/)

- [AlphaFold paper — Highly accurate protein structure prediction (Nature, 2021)](https://www.nature.com/articles/s41586-021-03819-2)

- [DeepSeek-R1 paper — reasoning emerging through RL (arXiv)](https://arxiv.org/abs/2501.12948)

- [Reward is Enough — Silver, Singh, Precup, Sutton (2021)](https://www.sciencedirect.com/science/article/pii/S0004370221000862)

- [Hacker News](https://news.ycombinator.com/)