AI Coding Assistant Limitations 2026 — An Honest but Fair Take on What Doesn't Work


Prologue — This is the moment for calibration, not cheerleading

Most AI-coding content over the past 18 months has been in one of three modes:

  1. Utopia — "Developers are obsolete now."
  2. Dystopia — "All AI code is trash."
  3. Marketing — "Our tool automates X% of your work."

This post is none of those. This post is about calibration.

AI coding assistants are powerful. But the shape of that power is not uniform. On some tasks the multiplier is 3x to 10x. On some it is 0x. On some it is negative — slower and worse than a human doing it alone.

As of mid-May 2026, the top agents on SWE-bench Verified sit around 75%. A great number. But if you do not know which tasks the failing 25% live in, you end up trusting the agent fully in exactly the zone where it fails — and you lose.

This post covers the ten things AI does not do well. For each:

  • Real failure pattern — code, not abstraction.
  • Why it happens — reduced to model architecture and harness limits.
  • What the developer should keep doing — where human judgment is a multiplier.

At the end you get a matrix, a checklist, and an anti-pattern list. Not AI skepticism — calibration so you can use AI tools better.


Chapter 1 · Deep bug debugging — "Agents write code well; they do not narrow bugs well"

Pattern

A production service had a P99 latency spike every five minutes. The root cause was not GC — it was TLS session renegotiation storms. The agent burned eight hours on GC tuning. At hour eight it asked for permission to "look at GC logs in more detail."

Why it happens

LLM agents are vulnerable to hypothesis tunneling. Once "this looks like a GC issue" enters the context, every subsequent tool call and reasoning step is biased toward confirming that hypothesis. A human debugger pauses around the one-hour mark and asks, "what if it is not this?" An agent keeps reinforcing the initial hypothesis as the context thickens — the most insidious form of context rot.

Also, the agent's tool calls see only local information. Loading a debugger, inspecting a core dump, tailing packet captures, reading kernel traces — only some of this is well-automated. And bugs that do not reproduce are an agent weak point: when reproduction is hard, the agent multiplies its guesses instead of narrowing them.

What humans should keep doing

  • Do not close hypotheses. Force the agent to "list five alternative hypotheses; this might not be GC."
  • Design falsifiable experiments. Agents are good at finding "evidence I am right" and bad at finding "evidence I am wrong."
  • The eight-hour rule. If you have been chasing the same hypothesis for eight hours, stop. The agent will not stop — you have to.

Sample dialog

Human: "P99 is spiking, looks like GC. Take a look."
Agent: (1 hour) "Let us tune the G1GC parameters..."
Human: "What other hypotheses could fit? Enumerate five."
Agent: "1) GC, 2) disk I/O storm, 3) network retransmits, 4) TLS renegotiation, 5) compaction work."
Human: "Spend five minutes on each of 2 through 5. Is there evidence?"

Without this forced branching, the agent never escapes hypothesis 1. The meta-instruction — "look at the other hypotheses" — is the multiplier.


Chapter 2 · Architecture in a large codebase — "Agents see slices, not the whole"

Pattern

In a one-million-line monorepo you ask the agent to add a feature. The agent cleanly builds it under new-feature/. Code works. A senior on the same team reads it and goes, "uh, there is already something similar under platform/shared/orchestration." The agent could not see it — it never read that directory.

Why it happens

LLMs reason inside their context window. A 1M-token model still cannot fit a million-line monorepo. Agents navigate by name-based search and mentioned paths — these miss implicit knowledge about which code is canonical.

The most dangerous variant is abstraction conflict. The agent makes a "clean abstraction." But if the new abstraction overlaps in intent with an existing one and differs only in shape, the codebase ends up with two similar abstraction layers. Six months later a new hire does not know which to use, uses both, and ultimately creates a third.

What humans should keep doing

  • Force a "search for existing similar code" step. Put it in AGENTS.md or in the prompt.
  • An "abstraction review" stage in PR — a senior asks, "how is this different from existing X?" Agents do not ask this of themselves.
  • Store architecture docs in agent-readable form. ADRs, module responsibility matrices. Agents do not see what they cannot read.

One step further — make it read existing code twice

A trick that has worked well in big monorepos: before giving the agent the actual task, force three explicit steps.

1) "Spend five minutes searching for existing modules that could be related to this task. Enumerate them."
2) "For each module, describe how its responsibility overlaps with the task. Propose three candidate locations for the new code."
3) "Wait for human confirmation before writing code."

This three-step gate reduces abstraction conflicts by 30 to 50% in our team's experience. The agent's default is "start writing code"; you have to explicitly tell it to stop and think first.
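
If the gate lives only in your head, you will forget to apply it. A hypothetical AGENTS.md excerpt that makes it the default for every task:

## Before writing any code (mandatory gate)
1. Search the repo for existing modules related to this task. Enumerate them.
2. For each, describe how its responsibility overlaps the task. Propose three candidate locations for the new code.
3. Stop and wait for explicit human confirmation.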


Chapter 3 · Subtle performance regressions — "Looks the same, ten times slower"

Pattern

The agent "refactors" an N+1 query. The code looks cleaner. All tests pass. You ship. P95 latency goes up 10x. Turns out the clean-looking code now uses a query pattern that misses the index, or lazy-load became eager-load.

Why it happens

LLMs see the shape of code. The runtime cost is something they infer, not observe. And that inference often depends on the data distribution — an N+1 invisible at 100 rows kills P95 at 1M rows. The agent does not know what your distribution looks like.

Agents also write poor micro-benchmarks. JIT warm-up, cache effects, measurement noise — all need human intuition.

What humans should keep doing

  • Bake performance regression tests into CI. The agent will not check regressions — CI has to.
  • Feed profiler results directly to the agent. "Look at this PR" is not the same as "look at this flamegraph and find the regression." Tool changes capability.
  • Owning the data distribution is a human job. The agent knows "works at 1k rows," not "works at 100M rows."

Small case — a hidden algorithmic-complexity regression

A common PR: the agent "simplifies" a Set-based dedup into Array.filter + includes. The code is shorter. At 100 items it is even faster (smaller constant). At 100k items it is O(n²) and 30x slower. The unit test has 10 items and does not catch it. Production catches it.

This pattern is dangerous because it looks like a refactor but is a regression. PR reviewers go, "ooh cleaner," and merge. Only automated performance regression tests give you an objective safety net.
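
A minimal TypeScript sketch of the two shapes (hypothetical data; the point is the asymptotics):

// Looks like a clean one-liner, but indexOf rescans the array for every
// element: O(n²) overall. Fast at 100 items, brutal at 100k.
function dedupQuadratic(ids: string[]): string[] {
  return ids.filter((id, i) => ids.indexOf(id) === i);
}

// Set membership is O(1) amortized, so this stays O(n) at any size.
function dedupLinear(ids: string[]): string[] {
  return [...new Set(ids)];
}

Both pass a 10-item unit test. Only a benchmark at production-like sizes separates them.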


Chapter 4 · Concurrency correctness — "Pattern-matches race-free idioms but misses the subtle data race"

Pattern

The agent writes a multi-threaded queue. The code uses Mutex and Condvar canonically. Reviewing it you go, "looks fine." Six months later, under load, you get a deadlock. Two locks were acquired in opposite orders in two places — on a path with rare contention.

Why it happens

Concurrency does not yield to case-based learning. The model imitates "common race-free patterns" very well — producer/consumer, reader/writer lock, channel-based message passing. But lock-order consistency, memory model semantics (Acquire/Release/SeqCst), reentrancy conditions — these require global reasoning. The model sees slices.

And tests do not catch them. Race conditions are non-deterministic; unit tests usually cover the deterministic cases. You need ThreadSanitizer, loom, or Jepsen — and agents are not fluent operators of those tools.

What humans should keep doing

  • Concurrency code always gets a second pair of human eyes. Label PRs "concurrency: review needed."
  • Use model checkers and formal tools. TLA+ for designs, loom for code — the only real guarantees for concurrency correctness still come from formal verification.
  • Tell the agent explicitly to "enumerate interleavings." It will not do it unless asked.

A one-line comment to enforce lock order

// LOCK_ORDER: always (config -> session) — see SECURITY.md L42
let _c = config.lock(); // acquire config first, every time
let _s = session.lock(); // then session — never the reverse

That explicit comment is what forces the same order in both places. Agents respect "context-marker comments" well. Humans, however, have to write them.
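
To see why opposite acquisition order bites, here is a self-contained TypeScript sketch — a toy async mutex, with names and sleeps purely illustrative — that deadlocks exactly the way the pattern above describes:

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Toy async mutex: lock() resolves to a release function once acquired.
class Mutex {
  private last: Promise<void> = Promise.resolve();
  lock(): Promise<() => void> {
    let release!: () => void;
    const gate = new Promise<void>((res) => (release = res));
    const prev = this.last;
    this.last = prev.then(() => gate); // the next caller waits on our gate
    return prev.then(() => release);
  }
}

const config = new Mutex();
const session = new Mutex();

async function taskA() {
  const freeConfig = await config.lock(); // A holds config...
  await sleep(10);
  const freeSession = await session.lock(); // ...and waits for session
  freeSession(); freeConfig();
}

async function taskB() {
  const freeSession = await session.lock(); // B holds session...
  await sleep(10);
  const freeConfig = await config.lock(); // ...and waits for config: ABBA deadlock
  freeConfig(); freeSession();
}

Promise.all([taskA(), taskB()]); // neither task ever completes

Under low contention the sleeps rarely overlap and everything "works" — which is exactly why the bug ships, and why tools like loom explore interleavings exhaustively instead of hoping a test hits the bad one.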


Chapter 5 · Judgment under ambiguity — "Which trade-off does the team prefer here"

Pattern

"Add error handling to this API." The agent adds clean Result-based error handling. The team's convention was panic-on-invariant-violation. Another team's convention was to emit errors via OpenTelemetry. The agent does not know the "right" answer — the answer depends on team choice.

Why it happens

Most of software engineering is taste and trade-offs. "Which paradigm here," "what is the right level of abstraction," "is 50ms of latency worth 30% better readability." These need information outside the context — team values, company priorities, user pain points.

The agent guesses. It makes good guesses. But a guess is a guess.

What humans should keep doing

  • Write a "preferences" doc. Put "errors are Results not panics," "logging is structured," "async is channel-based" in AGENTS.md (an example excerpt follows this list).
  • Make the agent ask about trade-offs. When the answer is ambiguous, the agent should request clarification. Otherwise it guesses.
  • Make "matches team convention" an explicit PR review step. Do not automate away taste.

Chapter 6 · Genuinely new tech — "Training cutoff bias; even with new docs, it reverts to old patterns"

Pattern

A library released in early 2026 (say, Effect-TS 3.x, new concurrent features in React 19.5, PPR stabilization in Next.js 16). You dump the full new docs into context. The agent uses the new API correctly for the first few hundred tokens. Then it reverts to 2024 patterns — layering useEffect patterns on top of the new concurrency hooks, calling new APIs with old signatures because it cannot remember the new ones.

Why it happens

LLMs are strongly influenced by frequency in training data. The new library's docs are tens of thousands of tokens; the old library is deeply baked into the weights. Context cannot fully overwrite the weight prior.

RAG does not fix this. Even with the new docs retrieved, the model produces a mixed hallucination — new signature plus old pattern.

What humans should keep doing

  • Write code in the new library by hand once. Learn the pattern yourself before delegating. Then you can recognize regressions.
  • Bake "version enforcement" into the toolchain. Lint rules, type checks. "Old API call is a compile error."
  • Correct the agent every time it reverts. Repeated correction within a session works.
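
For the version-enforcement bullet, ESLint's no-restricted-imports rule is one concrete lever. A .eslintrc.json sketch with hypothetical package names:

{
  "rules": {
    "no-restricted-imports": ["error", {
      "paths": [
        { "name": "old-sdk", "message": "Deprecated — use new-sdk 2.x." }
      ]
    }]
  }
}

A failing build corrects the agent far more reliably than repeated prompting.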

Verified regression signals

While working with a new library, suspect regression if you see any of:

  • import paths quietly returning to the old path (react instead of react/jsx-runtime).
  • new hooks used with the mental model of old hooks.
  • error messages that do not match the new version's signatures (meaning the try/catch assumes the old API).

If you see any one of these, start a new session and reinject the new docs with stronger emphasis.


Chapter 7 · Multi-repo cross-cutting changes — "Still hard"

Pattern

Forty microservices need the same change (e.g. deprecated SDK 1.x to 2.x). The agent does repo one well. Repo two, well. By repo five it starts missing the subtle differences each repo carries — test runner, build system, dependency versions. It opens 40 PRs; 12 of them are broken at build time.

Why it happens

Each repo carries hidden context. A multi-repo change is fundamentally about N contexts, but the LLM context window is one. The agent generalizes a decision from repo A to repo B — and fails wherever that generalization breaks.

Multi-repo work is also a dependency graph: to change A you have to deploy B first, and B depends on C being compatible. Reasoning over that graph inside a single context is brittle.

What humans should keep doing

  • Settle the pattern in one repo first, then scale. The first five repos are human-driven; the rest are agent plus human review.
  • Encode shared changes as codemods. AST transforms plus AI review beat pure AI (a minimal transform is sketched after this list).
  • Own the rollout order. Agents are weak at dependency-graph reasoning.
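
Concretely, the codemod bullet can be as small as this jscodeshift transform (package names hypothetical):

// transform.js — rewrite every import of "old-sdk" to "new-sdk"
module.exports = function transformer(fileInfo, api) {
  const j = api.jscodeshift;
  return j(fileInfo.source)
    .find(j.ImportDeclaration)
    .filter((path) => path.node.source.value === "old-sdk")
    .forEach((path) => {
      path.node.source.value = "new-sdk";
    })
    .toSource();
};

Run it per repo (npx jscodeshift -t transform.js src/) and let the agent review the diff instead of generating 40 slightly different ones.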

Chapter 8 · Security-critical code — "Looks right; has a subtle flaw"

Pattern

The agent writes a password hashing function. It uses bcrypt and generates random salts. Looks canonical. But the comparison is timing-attack vulnerable (== instead of constant_time_eq); the bcrypt cost factor is 8 (current guidance in 2026 is 10 to 12); and the error message echoes the username (a user-enumeration leak). All subtle. All easy to miss in code review.

Why it happens

Security is the discipline of negative space — "if something that should be there is missing, that is a flaw." LLMs reason about "what is there." Catching "what is not there" requires a security mental model. The agent's model is average, not expert.

Security vulnerabilities are also slow discoveries — they surface six months later in an incident. By then the agent is doing something else.

What humans should keep doing

  • Security-critical code requires a security expert review. AI review is supplementary.
  • Enforce automation — semgrep, bandit, gosec, cargo-audit. More deterministic than AI.
  • Mark "security risk" sections in AGENTS.md. "Code in this directory always requires human security review."

Commonly missed negative-space checklist

What we check in security reviews of agent-written code:

  • Is a timing-safe comparison used? (See the sketch after this list.)
  • Could a secret end up in logs?
  • Does the error response create a timing oracle?
  • Is rate limiting applied to every path?
  • Is input validation immediately before sanitization (not after)?
  • Are CSRF/CORS explicitly configured (not relying on defaults)?

Bake these six items into the PR template and you catch about 80% of security regressions. The remaining 20% is the domain of expert review.
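
For the first item, Node's crypto.timingSafeEqual is the usual tool. Hashing both inputs first is a common idiom that satisfies its equal-length requirement — a sketch:

import { createHash, timingSafeEqual } from "node:crypto";

// Constant-time comparison: hash both sides to a fixed 32 bytes so
// timingSafeEqual's equal-length requirement always holds, and the
// comparison cost never depends on where the strings first differ.
function safeEqual(a: string, b: string): boolean {
  const ha = createHash("sha256").update(a).digest();
  const hb = createHash("sha256").update(b).digest();
  return timingSafeEqual(ha, hb);
}

The == the agent reached for short-circuits at the first differing byte — a textbook timing oracle.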


Chapter 9 · Legacy code with implicit assumptions — "The code knows something the code does not say"

Pattern

A function in a 15-year-old monolith looks weird. A variable is named magic_offset_42. The agent decides "this needs cleanup" and refactors it. You ship. A month later an old data migration pipeline dies. Turns out magic_offset_42 was an off-by-one correction dating from a 2012 database migration accident. The code did not say so. But the code knew.

Why it happens

Legacy code is compressed implicit knowledge. The reason code looks the way it does is often not in the code — it is in a Slack channel somewhere, in the head of an engineer who left, in an incident report from five years ago. LLMs see the code only.

RAG does not solve this. Even if the incident report is retrieved, the model has to make the connection between "this is the line that needs to keep its weirdness" and the report — that connection often does not fire.

What humans should keep doing

  • Put "do not refactor" markers on legacy code. Comments, separate files.
  • Force git blame + history into context. "If you want to know why this line is this way, read the commit history."
  • Legacy changes always get human review. Automated refactor on legacy is dangerous.
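
The marker can be as blunt as the context-marker comment from Chapter 4 — something the agent cannot read past (the incident details here are illustrative):

// DO-NOT-REFACTOR: magic_offset_42 compensates for an off-by-one from the
// 2012 database migration. Removing it silently corrupts the legacy data
// migration pipeline. Read the 2012 incident report before touching this.
const magic_offset_42 = 42;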

Chapter 10 · Prompt drift in long sessions — "The agent slowly loses the thread"

Pattern

An eight-hour agent session. The first hour is flawless. Around hour four the code style gets weird — variable naming convention drifts slightly, error-handling patterns change a bit, comments thin out. By hour seven the agent acts as if it forgot the initial instructions.

Why it happens

The longer the context, the more the model weights recent tool results over the original instructions. This is essentially attention dilution. Automatic context compression (summarization) adds another step where subtle information disappears. The model slowly forgets who it is.

This is not fully solved by a bigger window. 200K, 1M, 10M tokens — the relative attention pattern remains.

What humans should keep doing

  • Keep sessions short. Break large work into multiple sessions. Each session has clear input and output.
  • Reinject AGENTS.md repeatedly. Even with a system prompt, restate the important conventions explicitly.
  • When you see drift, start a new session. No sunk-cost regret. Context reset is the most powerful tool you have.

A one-line heuristic for measuring drift

At the start of the session, ask "summarize the three core rules of this task in one line each," and save the answer. Four hours later, ask the same question. If the answer changes, that is drift. If the answer stays the same but the code style has changed, that is partial drift. Either way: start a new session.


Chapter 11 · So what does AI do well (the control group)

This post is not anti-AI. To be fair we have to name where AI clearly multiplies.

Tasks where AI shines

  • Bootstrap — new project scaffolds, boilerplate, the first 100 lines of familiar patterns.
  • Well-defined single tasks — "convert this function to TypeScript," "this SQL into ORM code," "add the missing test cases for this function."
  • Read, summarize, explain — summarize a 1000-line file in 200, find the real error in a build log, explain the intent of a PR change.
  • Repetitive well-defined migrations — anything expressible as a consistent codemod.
  • Test writing — especially unit tests where input/output shape is clear.
  • Doc generation — docstrings from code, README drafts, API specs.
  • Familiar debugging — null pointer, off-by-one, common missing-transaction patterns.

In these areas AI is a 3x to 10x multiplier. Nothing to argue against.

Areas where AI is strictly faster than humans

  • Typing itself — keyboard input is always slower for humans.
  • API surface memorization — human time spent flipping through docs goes to zero.
  • Format conversion — JSON to YAML, schema to TypeScript, REST to GraphQL.
  • Translation — both natural language and code language.

Chapter 12 · "Where AI shines" vs "Where humans multiply" matrix

Task type | AI alone | AI + human (human multiplier)
Boilerplate, scaffolds | Strong | Final 5-min human review
Well-defined function writing | Strong | Human defines the signature
Unit tests | Strong | Human suggests edge case candidates
Code summary, explanation | Strong | Fine without humans
Familiar debugging | OK | Human forces hypothesis diversity
Deep bug debugging | Weak | Human breaks hypothesis tunneling
Small refactors | OK | Human bounds scope
Large architecture | Weak | Human owns the big picture
Catching performance regressions | Weak | Human interprets profiler output
Concurrency code | Risky | Human + formal verification
Security code | Risky | Human security expert + automated tools
Legacy changes | Risky | Human teaches the implicit assumptions
New libraries | Regression risk | Human writes a pattern first, then delegates
Multi-repo changes | Inconsistent | Human writes codemod + reviews
Ambiguous trade-offs | Guesses | Human teaches team preference
Long-session work | Drifts | Split sessions (human decision)

The matrix is the message. The question is not "do I use AI or not" but "where in this task am I the multiplier".


Chapter 13 · Anti-patterns — common mistakes

Anti-pattern 1: "The agent has been chasing one hypothesis for eight hours; let it keep going"

Stop and doubt the hypothesis. First rule of human debugging.

Anti-pattern 2: "The agent wrote security code; humans did not look at it"

Security is negative space. Automated tools + expert review are non-negotiable.

Anti-pattern 3: "Give a one-line 'add feature X' to a big codebase"

The agent sees slices. Force a "look for existing similar code" step.

Anti-pattern 4: "Trust the agent's performance regression eyeballing without CI"

Regressions need deterministic measurement. AI measurement is supplementary.

Anti-pattern 5: "Let long sessions run"

Past four hours, assume drift. Split sessions.

Anti-pattern 6: "RAG alone for new libraries"

RAG does not overwrite weight priors. Write the pattern yourself first, then delegate.

Anti-pattern 7: "Hand legacy refactor entirely to the agent"

Legacy = implicit knowledge. Human review mandatory.

Anti-pattern 8: "Verify agent-written concurrency code with unit tests only"

Races do not fall out of unit tests. ThreadSanitizer, loom.

Anti-pattern 9: "Do not write team conventions in AGENTS.md"

The agent guesses if it does not know. Explicit conventions are a multiplier.

Anti-pattern 10: "Auto-merge AI results"

Human review is the last safety net. Auto-merge buys trust too cheaply.


Chapter 14 · Developer checklist (calibration tool)

Before starting a task:

  • Which of the ten failure modes above does this task touch?
  • Where in the workflow does human review need to happen (PR, design, before merge)?
  • Are the safety nets (CI, lint, SAST) actually in place?
  • Are team conventions written in AGENTS.md?
  • If new libraries or APIs are involved — have you written a sample pattern yourself?
  • If it is a multi-repo change — are the first one or two repos being done by hand?
  • Does this touch security, concurrency, performance-critical code? Is expert review scheduled?

During the task:

  • Have you been chasing the same hypothesis for more than four hours? (Remember the eight-hour hard stop from Chapter 1.)
  • Has the session exceeded four hours? Check for drift.
  • Is the agent reverting to "old API patterns"?

After the task:

  • Did a human actually read the result once?
  • Did performance regression tests and security scans run?
  • In multi-repo work — did every repo's build pass?

Epilogue — The developer as multiplier

Again: this is not AI skepticism.

AI coding assistants really are multipliers. But a multiplier only matters when applied to a non-zero value. That non-zero value is your judgment.

If judgment is zero — you hand 100% of the task to the agent and never review the result — any multiplier times zero is zero. Sometimes it is negative. (Cleaning up code the agent got wrong in hard-to-find ways takes longer than writing it from scratch.)

If judgment is one — you know the ten failure modes, install safety nets, break hypothesis tunneling, split sessions, hand security/concurrency/performance to humans where appropriate — that is when AI becomes a real multiplier.

The most important skill for a mid-2026 developer is not the ability to doubt AI but the ability to estimate AI's confidence interval accurately.

AI coding assistant capabilities are improving explosively. But some of the ten failure modes above will not be solved by model size. Hypothesis tunneling, implicit knowledge, team preference, security negative space — these need human context. That context lives in human heads for the foreseeable future.

Your job does not shrink. It changes shape. Less time typing code, more time deciding what to trust. That is the developer-as-multiplier shape.

Next posts

  • "Reviewing AI Code — 7 Signals Every Human Reviewer Should Watch" — where AI code is most suspect, which patterns lead to "passed review then incident," PR review checklists.
  • "How to Write AGENTS.md — Teaching Your Team Conventions to the Agent" — real working examples, what kinds of information move the needle and what is just noise.

References

Context rot, hypothesis tunneling, attention dilution

  • "Lost in the Middle" — Liu et al., 2023, first to quantify attention dilution in long context
  • "Context Rot" — Anthropic, long-context degradation analysis in the Claude 4 series
  • "Confirmation Bias in LLM-based Agents" — repeatedly reported in recent evaluation studies

Concurrency and security limits of LLMs

  • "LLMs and Concurrency: A Survey" — 2025, why formal verification remains necessary
  • "Security Implications of Code Generated by LLMs" — Pearce et al., 2025 update, common CWE frequencies

Practical guides and opinion

  • Simon Willison's blog — https://simonwillison.net/ — a model of calibrated takes
  • "When AI Coding Assistants Fail" — collections of failure cases across engineering blogs
  • Anthropic "Best practices for agentic coding" — official guide, names failure modes

Tools

  • Claude Code, Cursor, Codex CLI — mentioned in the body
  • semgrep, gosec, bandit — security automation (for AI-assisted use)
  • ThreadSanitizer, loom — concurrency verification
  • TLA+ — formal verification

TL;DR — AI coding assistants are powerful but uneven. Across deep debugging, architecture, performance regressions, concurrency, security, legacy, ambiguity, new tech, multi-repo, and long sessions, human judgment is still the multiplier. Do not doubt AI. Estimate AI's confidence interval accurately. That is the most valuable developer skill in mid-2026.
