How to Review AI-Generated Code: The Verification Discipline for Agent Output and Filtering 'AI Slop'

"When the cost of producing code approaches zero, the cost of confirming the code is correct becomes everything."

Prologue — The Bottleneck Is Now Review

A few years ago, the bottleneck in a software team was writing the code. Implementing one feature meant a human sat at a keyboard and typed it out line by line, and that time dominated the schedule.

In 2026, that bottleneck is gone. A coding agent generates hundreds of lines in minutes. The time it takes for one feature to land as a PR has dropped tenfold. But team throughput has not risen tenfold — because the bottleneck did not disappear, it moved. The bottleneck is now review.

Generation got faster; verification did not. So PRs pile up in the review queue, reviewers read hundreds of lines a day of code they did not write, and they oscillate between the temptation to "just approve it" and the dread of "I have to re-check everything."

It Is a Different Skill From Human Code Review

Here is the core claim up front. Reviewing AI-written code is a different skill from reviewing a human PR.

Human code review is largely a social act. PR etiquette, merge queues, tone calibration like "this might just be a matter of taste," comments that help a colleague grow — half of it is that. Human review assumes trust. The assumption underneath is: "this person knows the domain, and if they didn't, they would have asked."

AI code review has no such assumption. The agent does not ask when it does not know. It fills in with the most plausible thing. Confidence and accuracy are decoupled. So AI code review is not a social act — it is a verification discipline. There is no tone to calibrate, no growth to support. Instead you confirm "is this actually correct" systematically, with suspicion as the default.

This article is a practical guide to that verification discipline. It is not a generic human-code-review article — that is a different topic. This article covers only one thing: how to verify code an agent wrote.

What this article covers:

| Chapter | Topic |
| --- | --- |
| 1 | Why AI code needs a different review lens |
| 2 | The characteristic failure patterns of AI-generated code |
| 3 | The verification loop — types before tests before humans |
| 4 | Reading AI diffs efficiently |
| 5 | 'AI slop' — what it is and how to filter it |
| 6 | When AI reviews AI code |
| 7 | Making your codebase verifiable |
| 8 | The human's irreducible job |
| Epilogue | Checklist + anti-patterns + next-post teaser |

Chapter 1 · Why AI Code Needs a Different Review Lens

Bugs in human-written code and bugs in AI-written code have a different distribution. Looking through the same lens, you miss them.

1.1 Plausible-but-Wrong Code

A human's mistakes usually look "rough." The variable name is odd, the indentation is off, it visibly looks unfinished. The reviewer's eye naturally stops there.

AI's mistakes are different. The surface is smooth. The variable names are appropriate, the structure follows convention, there are even comments. And yet the logic is subtly wrong. This is the most dangerous case — because the code looks "well written," the reviewer's suspicion sensor never switches on.

Human code review: you look for "the parts that look wrong." AI code review: you look for "the parts that look right but are wrong." Much harder.

1.2 Confident Hallucination

An agent does not say "I'm not sure." It confidently calls APIs that do not exist, imports packages that are not there, and invents config keys with plausible names that do nothing. The tone of the output carries no trace of uncertainty.

A human leaves signals when working in unfamiliar territory — a question mark in a comment, an "is this right?" note, a draft PR. The agent's output has none of that. Confident code and hallucinated code look identical.

1.3 Missing Edge Cases

AI writes the "happy path" well. The code for when input is valid, the network is fine, and the array is not empty is almost always correct.

The problem is everything outside that. Empty arrays, null input, timeouts, concurrency, partial failure, integer overflow — the cases a senior human reflexively recalls, AI frequently omits unless you explicitly demand them. The code runs perfectly in the demo and falls apart in the third week of production.

1.4 Over-Engineering

There is a failure in the opposite direction too. You ask "uppercase the user's name" and back comes an abstract factory, a strategy-pattern interface, and a configurable transformation pipeline. AI tends to over-apply the "enterprise-grade" patterns it saw in training data to small problems.

Over-engineering is not a bug, but it is debt. There is more code to read, the maintenance surface grows, and the next person (or the next agent) gets lost.

1.5 Subtle API Misuse

AI uses APIs "approximately" correctly. The function name is right but the argument order is wrong, one key of the options object is from an old version, an async function is called without await, or the meaning of a return value is subtly misunderstood.
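The missing-await case is easy to sketch. The snippet below is a hypothetical illustration (fetchScore and the threshold are invented for the example): without await, the comparison runs against the Promise object itself, so the check silently always fails.

```javascript
// Hypothetical async lookup standing in for a network call.
async function fetchScore(userId) {
  return 42
}

// AI-style misuse: missing await, so `score` is a Promise object.
async function hasHighScoreBuggy(userId) {
  const score = fetchScore(userId)
  return score > 10 // a Promise coerces to NaN, so this is always false
}

// Correct: await resolves the number before comparing.
async function hasHighScore(userId) {
  const score = await fetchScore(userId)
  return score > 10
}
```

Notably, this is exactly the class a strong typechecker flags: comparing a Promise to a number is a compile error in TypeScript, which previews the point of Chapter 3.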

When the type system is strong, a good portion of this is caught at the compile stage. So the conclusion of Chapter 1 leads not to the next chapter but to Chapter 3 — what machines can catch should be left to machines.


Chapter 2 · The Characteristic Failure Patterns of AI-Generated Code

If Chapter 1 was "why it is different," Chapter 2 is "concretely, what to look for." These are the check items to keep loaded in your head while reviewing.

2.1 The Pattern Catalog

| Pattern | Symptom | How to catch it |
| --- | --- | --- |
| Hallucinated API/package | Functions, options, libraries that do not exist | Build/typecheck; check the dependency lockfile |
| Copy-paste inconsistency | The same logic, subtly different per file | Scan the whole diff at once |
| No error handling | No try/catch, failures ignored | Ask "what if this fails?" of every external call |
| Security blind spots | Unvalidated input, hardcoded secrets, injection | Trace data flow along the trust boundary |
| Tests that test nothing | Pass, but verify nothing | Deliberately break the test and see |
| Over-engineering | Big abstraction for a small problem | Ask "is this really needed?" |

2.2 Hallucinated APIs and Packages

The most common and the easiest to catch. The agent invents the API it "wishes existed."

// AI-generated code — plausible but wrong
import { retryWithBackoff } from 'lodash' // lodash has no such function
import dayjs from 'dayjs'

const result = await fetchUser(userId, {
  retry: 3,           // fetchUser options have no retry — hallucination
  timeout: '5s',      // timeout takes a number (ms) — type misuse
})

const formatted = dayjs(result.createdAt).format('YYYY-MM-DD')

The code above reads naturally. But lodash has no retryWithBackoff, and fetchUser has no retry option. The good news: this class is almost entirely caught by typecheck and build. A human does not need to find it by eye.

2.3 Copy-Paste Inconsistency

An agent does the same thing several ways even within a single PR. async/await in one file, .then() in another. Throws an error in one place, returns null in another. Each piece looks fine on its own, but as a whole there is no consistency.

To catch this you cannot look at the diff file by file — you have to scan the whole PR at once. The question is: "how many ways does this PR do the same kind of thing?"
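A condensed, hypothetical illustration of the symptom (the user store and both helpers are invented for the example): two helpers in the same PR settle the same "not found" question in two different ways.

```javascript
const users = new Map([['u1', { name: 'Ada' }]])

// File A's convention: a miss returns null.
function findUserA(id) {
  return users.get(id) ?? null
}

// File B's convention: a miss throws. Each is fine on its own;
// together, every caller must remember which convention applies where.
function findUserB(id) {
  const user = users.get(id)
  if (!user) throw new Error(`user not found: ${id}`)
  return user
}
```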

2.4 No Error Handling

The happy path of AI code is clean. And the unhappy path does not exist. There is no try/catch around the external API call, a file-read failure is ignored, JSON.parse is called without a guard.

Review rule: ask "what happens if this fails here?" of every external boundary (network, disk, parsing, third-party calls). In AI diffs, the answer to that question is "the app crashes" surprisingly often.
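One way to make the unhappy path exist is to return failure as a value. A minimal sketch (the result-object shape here is a choice for this example, not a prescription):

```javascript
// Guarded parse: the failure case is explicit instead of an uncaught throw.
function parseConfig(raw) {
  try {
    return { ok: true, value: JSON.parse(raw) }
  } catch (err) {
    return { ok: false, error: err.message }
  }
}
```

Whatever shape you pick, the review question stays the same: for every external boundary in the diff, where does the failure go?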

2.5 Security Blind Spots

An agent treats security as "not a feature." It builds code that does what was asked, but it does not assume adversarial input. Common blind spots:

  • Wiring user input into queries, commands, or paths without validation
  • Hardcoding secrets in the code (to "just make it work" in the demo)
  • Missing authorization checks — never asking "can this user do this"
  • Printing sensitive information to logs

Trace, finger on the screen, where data enters and where it goes along the trust boundary. AI does not do this on its own.

2.6 Tests That Verify Nothing

The most insidious pattern. Ask an agent to "write tests too" and it writes tests. Tests that pass. But look at what those tests verify and it is often nothing.

// AI-generated "test" — passes but is meaningless
test('calculateDiscount works', () => {
  const result = calculateDiscount(100, 0.1)
  expect(result).toBeDefined()        // passes for anything
  expect(typeof result).toBe('number') // passes for 90 or 9999
})

This test passes whether calculateDiscount returns 0 or -50, because it never verifies the actual value (90). Worse, some tests copy the implementation verbatim and confirm "the implementation equals the implementation."

The check is simple. Deliberately break the test. If you plant a bug in the implementation and the test still passes, that test is fake.
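The check can even be automated as a miniature mutation test. The sketch below (both implementations are invented, following the calculateDiscount example above) plants a bug and shows that the fake assertion survives it while a real assertion on the value does not:

```javascript
// A plausible implementation of the example function.
function calculateDiscount(price, rate) {
  return price * (1 - rate)
}

// The planted bug: same signature, wrong math.
function calculateDiscountMutant(price, rate) {
  return price * (1 + rate)
}

// The fake check from above: passes for the correct code AND the mutant.
const fakeCheck = (fn) => typeof fn(100, 0.1) === 'number'

// A real check on the value: kills the mutant.
const realCheck = (fn) => Math.abs(fn(100, 0.1) - 90) < 1e-9
```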


Chapter 3 · The Verification Loop — Types Before Tests Before Humans

Finding the patterns of Chapter 2 one by one with human eyes is exhausting. The core principle is this: what machines can catch, leave to machines; spend human attention on what machines cannot catch.

3.1 Three Layers of Filter

Think of verification as a three-stage filter. Cheap filters first, expensive filters later.

| Stage | Tool | What it catches | Cost |
| --- | --- | --- | --- |
| 1. Types | Typechecker, linter, compiler | Hallucinated APIs, type misuse, unused variables | Near zero (seconds) |
| 2. Tests | Unit/integration tests, static analysis | Logic errors, regressions, edge cases | Low (minutes) |
| 3. Humans | The reviewer's judgment | Architecture, intent, "does this make sense" | Expensive (human time) |

The principle: code that does not pass stage 1 is not worth a human's eyes. A human reviewing a PR with type errors is a waste of human time. Let CI filter it first.

3.2 Types as the First Line of Defense

A strong type system is the biggest leverage in AI code review. The hallucinated APIs, API misuse, and wrong arguments of Chapter 2 — a large share of these fall out as compile errors at typecheck, before a human reads a single line.

So loose typing gets more expensive in the AI era. In a codebase plastered with any, the first filter does not work, and that whole load shifts to stage 3 (humans).

3.3 Tests as the Second Line of Defense

If types check "is this code in a shape that makes sense," tests check "does this code behave correctly." But remember the trap of 2.6 — the tests the AI wrote are themselves subject to verification.

Recommended order: have the human write the tests first (or have the human pin down the spec), then leave the implementation to the agent. If the tests exist first, those tests cannot be "fakes that copied the implementation."
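A sketch of that order, reusing the discount example (the case table and tolerance are invented for illustration). The human pins the spec down as executable cases first; whatever implementation the agent then proposes is judged against tests it could not have copied from its own code:

```javascript
// Human-written spec: edge cases chosen deliberately, before any implementation.
const spec = [
  { price: 100, rate: 0.1, expected: 90 },  // happy path
  { price: 0,   rate: 0.5, expected: 0 },   // empty cart
  { price: 100, rate: 0,   expected: 100 }, // no discount
]

function satisfiesSpec(impl) {
  return spec.every(
    ({ price, rate, expected }) => Math.abs(impl(price, rate) - expected) < 1e-9
  )
}

// A correct agent implementation passes; a wrong one cannot fake its way through.
const agentImpl = (price, rate) => price * (1 - rate)
```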

3.4 Humans Last, but Most Important

Only code that passes the three-layer filter comes before a human. At that point, human attention concentrates on what machines cannot catch in principle — is the architecture right, does this change match the intent of the ticket, is this even a sensible approach in the first place. That is the subject of Chapter 8.

The verification loop in one line: after passing every cheap machine check, the human looks only at what machines cannot see.


Chapter 4 · Reading AI Diffs Efficiently

Even for a PR that has passed the verification loop, a human must still read the diff. And a human-written diff is read differently from an AI diff.

4.1 The First Question: Does It Match the Ticket?

When you open an AI diff, you feel the urge to look at code quality first. Resist it. The first question is always this: does this diff do what the ticket asked for?

An agent interprets the ticket "approximately." It does 80% of what was asked and 20% differently, or it adds things that were not asked for. No matter how clean the code is, if it implemented the wrong thing it is meaningless. Cross-check the ticket's acceptance criteria against the diff, line by line.

4.2 Detecting Scope Creep

The second question: did this diff touch something it should not have touched?

An agent tends to fix other things "while it is at it." It changes formatting, refactors unrelated files, bumps dependency versions. Each may be well-intentioned, but together they make a diff that cannot be reviewed.

| Signal | What to suspect |
| --- | --- |
| Number of changed files is large for the ticket's size | Scope creep |
| Changes scattered across unrelated directories | "While I'm at it" refactoring |
| Lockfile or config files changed for no reason | Unintended dependency change |
| Mass of lines where only formatting changed | Noise that hides the substantive change |

Send these diffs back. "Leave only the ticket scope and redo" is a legitimate request.

4.3 The Order to Read a Diff

There is an efficient order:

  1. PR description — what and why does it claim to have changed
  2. Tests — what does this change claim to guarantee
  3. Core logic — does the claim match the actual code
  4. Boundaries and error handling — is there an unhappy path
  5. The rest — config, formatting, incidental changes

Why read tests before logic: tests are the spec of "what this code is supposed to do." Read the spec first, and the divergences jump out when you read the logic.

4.4 Diff Size Is Inversely Proportional to Reviewability

A small diff can be read carefully. An 800-line diff, no reviewer can hold in focus to the end, human or otherwise. AI produces large diffs easily, so you must always ask "can this PR be split smaller?" Reviewability is inversely proportional to diff size.


Chapter 5 · 'AI Slop' — What It Is and How to Filter It

5.1 What AI Slop Is

AI slop is the term for AI-generated output that looks plausible but has low real value. AI slop in code is "code that compiles and passes tests but makes the codebase worse."

Slop is not an obvious bug. Bugs, frankly, are easier to catch. Slop is code that is subtly bloated, subtly inconsistent, subtly unnecessary. It does not show in one PR, but stack up 100 PRs and the codebase becomes a swamp.

5.2 The Signs of Slop

| Sign | Description |
| --- | --- |
| Verbosity | 30 lines for what takes 5; unnecessary helpers, wrappers, abstractions |
| Empty comments | Comments that repeat the code, like // gets the user above getUser() |
| Defensive noise | Checks for conditions that cannot happen, everywhere |
| Inconsistency | Five styles in the same codebase |
| Dead abstractions | An interface used once, a factory with a single implementation |
| Plausible dummies | Meaningless tests, functions with only a TODO, placeholder logic |

5.3 The Questions That Filter Slop

The questions to ask while reviewing are simple:

  • "If I delete this line, what breaks?" — if the answer is "nothing," it is slop.
  • "How many places use this abstraction?" — if it is one place, inline it.
  • "Does this comment say something the code does not?" — if not, delete it.
  • "Can this PR be cut to half its size?" — usually it can.
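The first two questions in action, on the over-engineering example from Chapter 1 (both versions are invented for illustration). Deleting the factory breaks nothing, and the abstraction is used exactly once, so the slop version inlines to a single line:

```javascript
// Slop: a configurable pipeline for a one-line transformation.
class NameTransformerFactory {
  create() {
    return { transform: (name) => name.toUpperCase() }
  }
}
function formatUserNameSlop(name) {
  return new NameTransformerFactory().create().transform(name)
}

// What the ticket actually asked for.
function formatUserName(name) {
  return name.toUpperCase()
}
```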

5.4 Slop Is a Problem of Acceptance, Not Generation

An important perspective: it is AI that produces slop, but it is the human who lets slop into the codebase. The agent only proposes slop; the one who presses the merge button is the reviewer.

So the solution to the slop problem is not "do not use AI" but "do not lower the review bar." You must not pass AI code more leniently than human code. If anything, because of the confident surface, you must look more strictly.


Chapter 6 · When AI Reviews AI Code

Adding one more "AI reviewer" stage to the verification loop is reasonable. But there is a trap.

6.1 Generator and Verifier Must Be Separated

The core principle: the agent that generated the code must not review the same code.

An agent with the same model, the same context, the same assumptions cannot see the hallucination it made itself. The reasoning process that invented the wrong API will, unchanged, judge that "this API is correct." In human terms, it is grading your own exam.

The verifier must be separated — a different model, or at minimum a separate session with a different context and a different prompt.

6.2 What an AI Reviewer Is Good and Bad At

| What an AI reviewer is good at | What an AI reviewer is bad at |
| --- | --- |
| Pattern matching: known anti-patterns, common bugs | Architectural judgment: "is this approach right?" |
| Consistency checks: style, naming | Intent matching: "is this really what the ticket wanted?" |
| Checklist application: missing error handling, etc. | Trade-offs: "is this complexity worth it?" |
| Surface-level security: hardcoded secrets, etc. | Domain correctness: are the business rules right? |

In summary: an AI reviewer is stage 1.5 of the verification loop — smarter than a typecheck but not a replacement for a human. It only widens the range of what machines can catch; the final judgment is still the human's.

6.3 A Practical Arrangement

The recommended configuration:

  1. Agent A generates the code
  2. Typecheck, tests, lint (machine stage 1)
  3. Agent B (different context) reviews — patterns, consistency, checklist
  4. A human does the final review — architecture, intent, judgment

The AI reviewer's comments are input, not conclusion. The human must filter both what the AI reviewer missed and what it overreacted to.


Chapter 7 · Making Your Codebase Verifiable

The efficiency of the verification loop depends less on your skill at verifying code and more on how verifiable the codebase is. A codebase that is easy to verify pays compound returns in the AI era.

7.1 Strong Types

Already said in Chapter 3, but worth stressing again. Strong types determine the performance of the first filter. Reduce any, express the domain as types (types like UserId, Email instead of primitives), and make function signatures honest. The stronger the types, the less a human has to read.
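In TypeScript the usual tool for this is branded (nominal) types, checked at compile time; the plain-JavaScript sketch below can only approximate the idea at runtime (the Email shape and the regex are simplifications for the example). The principle is the same: validate once at the boundary, then let signatures demand the validated thing instead of a raw primitive.

```javascript
// Construct an Email once, validating at the trust boundary.
function Email(value) {
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(value)) {
    throw new Error(`invalid email: ${value}`)
  }
  return { kind: 'Email', value }
}

// Downstream code demands an Email, not a raw string.
function sendWelcome(email) {
  if (email?.kind !== 'Email') throw new Error('expected an Email')
  return `welcome mail queued for ${email.value}`
}
```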

7.2 Fast Tests

If tests are slow, the verification loop breaks. A 30-minute test suite invites "merge it now and look later." Tests must be fast, deterministic, and trustworthy. A flaky test is as harmful as slop — because it turns signal into noise.
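One common source of flakiness is the clock. A small sketch of the fix (isExpired is invented for the example): inject time as a parameter, so production uses the real clock while tests pass a fixed one and the result is deterministic.

```javascript
// Deterministic by construction: the caller controls "now".
// Production calls isExpired(session); tests pass an explicit timestamp.
function isExpired(session, now = Date.now()) {
  return session.expiresAt <= now
}
```

The same recipe applies to randomness and the network: inject the dependency, and the test stops depending on the state of the world.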

7.3 Clear Conventions

An agent imitates the patterns of the codebase. If the codebase has consistent conventions, the agent's output is consistent. If there are no conventions, the agent brings a different style every time (the copy-paste inconsistency of 2.3 is, in fact, also a symptom of absent conventions).

Pin conventions in documents and, where possible, in linter rules. A CONTRIBUTING document, a clear directory structure, a guide file for agents — these directly raise the quality of the agent's output.

7.4 Verifiability Checklist

| Item | Question |
| --- | --- |
| Types | Is the share of any low? Is the domain expressed as types? |
| Tests | Does the full suite finish within minutes? Are there no flaky ones? |
| Lint | Are style and common mistakes caught automatically? |
| Conventions | Are the patterns new code should follow documented? |
| Boundaries | Are module boundaries clear, so a diff's blast radius is narrow? |

In a codebase with these five in place, AI code review is fast and trustworthy. In a codebase without them, every PR demands a full human review — and that is exactly the bottleneck.


Chapter 8 · The Human's Irreducible Job

Once machines take stages 1 and 2 and an AI reviewer takes stage 1.5, what is left for the human? A job that has shrunk but has not disappeared, and has in fact become more important.

8.1 Judgment

"This code is correct." That sentence means two things. (1) The code behaves per the spec — this, machines verify. (2) The spec itself is right — this, only a human can do.

An agent does what the ticket told it to. If the ticket is wrong, it implements the wrong thing perfectly. "Is this ticket even the right problem to solve" — machines do not ask that.

8.2 Architecture

Even when individual functions are correct, the system structure can be wrong. Whether this abstraction will hold up six months from now, whether this boundary is drawn in the right place, whether this dependency direction is healthy — these are questions pattern matching cannot answer. It is the work of imagining the future of the whole codebase, and that is a human's work.

8.3 "Does This Make Sense"

The most irreducible job is the simplest question: does this make sense.

The agent brought back perfectly functioning, type-correct, test-passing code — and the feature should perhaps never have been built in the first place. There may have been a simpler solution, or the problem should have been defined differently. Machines verify the "how" but the "why" and the "really?" are the human's.

8.4 Accountability

Finally, the person who presses the merge button is the owner of that code. "An AI wrote it" is not an excuse in front of a production outage. The essence of the verification discipline is here — whatever the tool generated, the decision to let it into the codebase is the human's, and that decision carries accountability.

The human's job did not shrink — it concentrated. The typing is gone; only the judgment remains.


Epilogue — Verification Is the New Core Competency

In an era when the cost of producing code approaches zero, the competitive edge is not "how fast you generate" but "how reliably you verify." Generation became a commodity; verification became scarce.

The one-line conclusion of this article: AI code review is not a social act but a verification discipline. Set suspicion as the default, leave what machines can catch to machines, and concentrate human attention on what machines cannot see in principle — judgment, architecture, "does this make sense."

AI Code Review Checklist

  1. Types first — a PR that does not pass typecheck and build is not worth a human's eyes
  2. Verify the tests — deliberately break the tests the AI wrote to confirm whether they are fake
  3. Cross-check the ticket — see whether the diff satisfies the acceptance criteria line by line
  4. Scope creep — see whether "while I'm at it" changes are mixed in
  5. Check boundaries — ask "what if this fails?" of every external call
  6. Trace security — finger on the screen, trace data flow along the trust boundary
  7. Filter slop — ask "what breaks if I delete this line"
  8. Generator != verifier — separate the AI reviewer from the agent that generated the code
  9. Diff size — split large PRs. Reviewability is inversely proportional to size
  10. Final judgment is the human's — the person who presses merge is the owner

Anti-Patterns to Avoid

  • Trusting the surface — switching off suspicion because the code looks smooth
  • Skipping types — having a human read a PR with type errors first
  • Blind faith in tests — mistaking "tests pass" for "verified"
  • Self-grading — having the generating agent review its own code
  • AI leniency — passing AI code more loosely than human code
  • Accepting giant diffs — approving an 800-line PR without reading it to the end
  • Dodging accountability — using "an AI wrote it" as an excuse for an outage

Next-Post Teaser

The next article is "A Testing Strategy for the AI Era — How to Make Agent-Written Tests Trustworthy." This article said "the tests the AI wrote are themselves subject to verification" — so then, how should trustworthy tests be designed? It covers how to pin down the spec first, how to verify tests with mutation testing, and when to leave tests to an agent and when not to.

In a world where generation became free, the most expensive skill is the verifying eye that knows how to say "no." That eye is engineering itself.