How to Review AI-Written Code — Redesigning Code Review
- Authors: Youngju Kim (@fjvbn20031)
- Introduction
- Why Review Became the Bottleneck
- Human Code vs AI Code — The Defect Patterns Differ
- Redesigning the Review Strategy
- Reviewing AI Code with AI — Automated First Pass, Human Final
- Review Checklist — Security, License, Performance
- Redesigning the PR Unit and Commit Hygiene
- The OSS Maintainer Perspective — What the jqwik Affair Left Behind
- A Team Policy Template
- Adoption Roadmap
- Frequently Asked Questions
- Metrics — How Do You Know the Redesign Works
- Pitfalls and Critical Perspectives
- Closing
- References
Introduction
In June 2026, a post titled "The jqwik Anti-AI Affair" by Johannes Link, maintainer of the testing library jqwik, set GeekNews and Hacker News ablaze. Written from the maintainer's own perspective, it chronicled a conflict that erupted in an OSS project over AI contributions — and it touched a nerve, putting into words an anxiety many developers had only vaguely felt: the gap between the speed at which AI pours out code and the speed at which humans can review it.
The numbers make the situation plain. In 2026, with coding agents ubiquitous, more than half of newly written code in many teams passes through AI hands. The cost of writing code has dropped to a tenth, while the cost of reviewing it has stayed nearly the same. The bottleneck is no longer writing — it is review. And the traditional code review practice — humans reading human-written code line by line — was never designed to carry this load.
In this post we cover how the defect patterns of AI-generated code differ from human code, how review strategy should be redesigned to match that difference, a pipeline for reviewing AI code with AI, OSS and team-level policies, and finally the metrics.
Why Review Became the Bottleneck
The asymmetry between writing and reviewing looks like this.
2023: humans write + humans review
writing ████████████████ (slow)
review ████ (small share relative to writing)
2026: AI writes + humans review
writing ██ (fast, high volume)
review ████████████████████████ (bottleneck!)
3x the PRs, 2x the diff size per PR,
reviewer hours unchanged
Two more factors make it worse.
First, the nature of review fatigue changed. Defects in human code usually come with a "something is off" signal — inconsistent naming, awkward structure — that draws the reviewer's attention. AI code is the opposite: superficially clean, and wrong with confidence. Reviewers struggle to find a thread of suspicion to pull.
Second, more and more authors cannot explain their own code. A PR where "why did you implement it this way" gets answered with "the agent did it that way" destroys the premise of review — that the author understands the code.
Human Code vs AI Code — The Defect Patterns Differ
To redesign the review strategy, first know your enemy. Defects in AI-generated code are distributed differently from human defects.
| Defect type | Human code | AI code |
|---|---|---|
| Typos, syntax slips | Common | Rare |
| Plausible logic errors | Occasional | Common (wrong with confidence) |
| Hallucinated APIs | Almost never | Common (nonexistent methods, packages) |
| Excessive defensive code | Rare | Common (needless try-catch, nested null checks) |
| Misread requirements | Resolved via questions | Implemented wrongly in silence |
| Convention violations | Rare (team learning) | Common without a guide file |
| Dead code, duplicate implementations | Occasional | Common (does not search existing utils) |
| Security vulnerabilities | Varied patterns | Reproduces stale patterns from training data |
Three of these deserve special attention.
Plausible-but-wrong errors
The signature defect of AI code. The types check, some tests pass, the variable names look right — and yet a boundary condition in the business logic is subtly wrong. Think inclusive-versus-exclusive bounds in date range comparisons, the timing of rounding in currency calculations, off-by-one in pagination. At exactly the points where a human would leave a comment or a question out of uncertainty, the AI generates smoothly incorrect code.
A typical example in code:
# Spec: "a subscription is valid through the expiry date itself"

def is_active(sub, today) -> bool:
    # AI-generated: smooth but wrong, the expiry day is excluded
    return sub.start_date <= today < sub.expires_on

def is_active_correct(sub, today) -> bool:
    # Correct implementation: boundary inclusive
    return sub.start_date <= today <= sub.expires_on
This diff passes the type checker and every happy-path test. The way to catch it is not reading the implementation, but checking whether a boundary test for the expiry day itself exists.
from datetime import date

# make_sub is assumed to be a test helper that builds a subscription
# from ISO date strings.
def test_active_on_expiry_day():
    sub = make_sub(start="2026-06-01", expires="2026-06-30")
    assert is_active(sub, date(2026, 6, 30))  # boundary case
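For a self-contained version of the same check, the sketch below assumes a hypothetical `Sub` dataclass with `start_date` and `expires_on` fields. Neither the class nor the helper names come from a real codebase. It runs both implementations against the days immediately around each boundary and reports where each one disagrees with the spec.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Sub:
    # Hypothetical subscription record, for illustration only.
    start_date: date
    expires_on: date


def is_active_buggy(sub: Sub, today: date) -> bool:
    # Exclusive upper bound: contradicts "valid through the expiry date itself".
    return sub.start_date <= today < sub.expires_on


def is_active_fixed(sub: Sub, today: date) -> bool:
    # Inclusive upper bound, as the spec requires.
    return sub.start_date <= today <= sub.expires_on


def boundary_cases(sub: Sub):
    """Yield (day, expected) pairs for the days around both boundaries."""
    one_day = timedelta(days=1)
    yield sub.start_date - one_day, False
    yield sub.start_date, True
    yield sub.expires_on, True
    yield sub.expires_on + one_day, False


def failing_days(impl, sub: Sub):
    """Return the boundary days on which impl disagrees with the spec."""
    return [day for day, expected in boundary_cases(sub) if impl(sub, day) != expected]


sub = Sub(start_date=date(2026, 6, 1), expires_on=date(2026, 6, 30))
print(failing_days(is_active_buggy, sub))  # only the expiry day fails
print(failing_days(is_active_fixed, sub))  # no failures
```

Listing the boundary days once and reusing them across implementations keeps the reviewer's question simple: does the test suite contain these four days or not.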
Hallucinated APIs and packages
Calling library methods that do not exist, or adding dependencies on packages that do not exist. The latter leads directly to security problems. Slopsquatting attacks, where attackers pre-register package names that AI models frequently hallucinate, have been observed in the wild, and the June 2026 npm supply chain attack was a fresh reminder that adding a dependency is a security decision. The arXiv study "We Have a Package for You!" put numbers on the problem: in its experiments, at least 5.2% of the packages recommended by commercial models and 21.7% of those recommended by open-source models did not exist.
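Part of the hallucinated-method problem can be checked mechanically. The following is a minimal sketch, not a complete tool: it assumes Python source, handles only plain `import x` and `import x as y` statements, and only looks at direct `module.attr` references. The function name `find_missing_attributes` is illustrative. It imports each referenced module, so it should only be run against modules you already trust.

```python
import ast
import importlib


def find_missing_attributes(source: str) -> list[str]:
    """Return 'module.attr' references whose attr does not exist on the module.

    Limitations: ignores `from x import y`, dotted imports, and chained access.
    """
    tree = ast.parse(source)
    aliases: dict[str, str] = {}  # local name -> real module name
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                aliases[alias.asname or alias.name] = alias.name

    missing = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        ):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.append(f"{module_name} (module not importable)")
                continue
            if not hasattr(module, node.attr):
                missing.append(f"{module_name}.{node.attr}")
    return sorted(set(missing))


sample = """
import json
import math as m

data = json.parse("{}")   # hallucinated: the real function is json.loads
root = m.sqrt(4)          # exists
"""
print(find_missing_attributes(sample))
```

A check like this only proves that a name exists, not that it is used correctly, so it complements the boundary tests rather than replacing them.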
Excessive defensive code
AI models tend to be tuned toward avoiding criticism, so they wrap code in try-catch where none is needed, nest unreachable null checks, and clone defensive logic into every function. This is not a mere style issue. A catch block that swallows errors silences outages, and redundant defenses destroy design information, namely the answer to "can this value actually be null."
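The error-swallowing case is the easiest one to flag automatically. As a rough sketch for Python code (where the construct is try/except), the function below walks the syntax tree and reports `except` blocks whose body does nothing but `pass` or `continue`. The function name and the sample are illustrative, and a real check would likely need an allowlist for intentional cases.

```python
import ast


def find_silent_handlers(source: str) -> list[int]:
    """Return line numbers of except blocks whose body only passes or continues."""
    silent = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, (ast.Pass, ast.Continue)) for stmt in node.body):
                silent.append(node.lineno)
    return sorted(silent)


sample = '''\
def load(path):
    try:
        return open(path).read()
    except Exception:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError as err:
        raise RuntimeError("bad input") from err
'''
print(find_silent_handlers(sample))  # line of the handler that swallows the error
```

The second handler is not reported because it re-raises with context, which preserves the failure signal instead of hiding it.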
Redesigning the Review Strategy
If the defect distribution is different, the allocation of reviewer attention must change too. Three principles.
Principle 1 — compare against the spec, not the lines
For human code, bottom-up review — read the code, spot the defects — worked. AI code has a clean surface, so bottom-up reading is low-yield. Go top-down instead. Read the requirements first (issue, spec, acceptance criteria), then trace "where in the code is this requirement satisfied." The primary goal is finding code with no spec behind it (scope explosion) and spec with no code behind it (omissions).
Bottom-up (legacy):    code --> read --> find defects --> check spec
Top-down (redesigned): spec --> list acceptance criteria
                            --> trace satisfaction points in code
                            --> leftover code = suspect
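The top-down trace can be written down as a small bookkeeping exercise. The sketch below assumes the reviewer (or a first-pass tool) has already mapped each acceptance criterion to the files cited as evidence. The criteria, file paths, and the `trace_spec` name are all hypothetical.

```python
def trace_spec(criteria: dict[str, list[str]], changed_files: list[str]):
    """Compare acceptance criteria against the files a PR changed.

    criteria maps each acceptance criterion to the files cited as evidence.
    Returns (missing, leftover): criteria with no evidence, and changed files
    that no criterion accounts for.
    """
    missing = [name for name, evidence in criteria.items() if not evidence]
    covered = {path for evidence in criteria.values() for path in evidence}
    leftover = [path for path in changed_files if path not in covered]
    return missing, leftover


criteria = {
    "retries use exponential backoff": ["billing/retry.py"],
    "duplicate webhooks are ignored": ["billing/webhook.py"],
    "retries stop after the configured maximum": [],
}
changed = ["billing/retry.py", "billing/webhook.py", "billing/report_export.py"]

missing, leftover = trace_spec(criteria, changed)
print(missing)   # spec with no code behind it (omission)
print(leftover)  # code with no spec behind it (scope explosion)
```

File-level granularity is a simplification; the same idea works with file:line ranges if the evidence is recorded that precisely.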
Principle 2 — review the tests first
Read the tests before the implementation. Check two things. First, do the tests actually verify the acceptance criteria of the spec (or are they tautologies copied from the implementation)? Second, are there boundary condition cases? The plausible errors of AI mostly hide at boundaries, so the presence of tests covering "empty input, maximum values, concurrency, time zones" is a strong quality signal. If the tests are trustworthy, the burden of reading the implementation drops sharply.
Principle 3 — enforce diff size
Review quality falls as diff size grows. Research and practice have confirmed this repeatedly (the SmartBear guide in the references, for example, recommends reviewing no more than about 400 lines at a time), but the AI era requires an enforcement mechanism: an agent will happily produce a 3,000-line PR in one go if you let it. A practical cap is around 400 lines per PR, checked mechanically in CI.
Putting the three principles together also changes how reviewers allocate their time. A recommended split for a 60-minute review:
Review time allocation guide (60 minutes, AI-involved PR)
Spec comparison ████████████████ 20 min
Test review ████████████ 15 min
Security/dependency check ██████████ 12 min
Implementation reading ██████ 8 min (sampled)
Writing design feedback ████ 5 min
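The diff size cap from Principle 3 can be enforced with a CI job like this:
# .github/workflows/pr-size-gate.yml
name: pr-size-gate
on: [pull_request]
jobs:
  check-size:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so the base branch is available
      - name: Fail if diff exceeds limit
        run: |
          # Sum added + deleted lines; binary files ("-") count as 0,
          # and an empty diff yields 0 instead of an empty string.
          LINES=$(git diff --numstat "origin/${{ github.base_ref }}..." \
            | awk '{ total += $1 + $2 } END { print total + 0 }')
          echo "changed lines: $LINES"
          if [ "$LINES" -gt 400 ]; then
            echo "PR too large. Split it."
            exit 1
          fi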
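The same count can be taken outside the shell, for example to reuse it in a dashboard. A minimal sketch, assuming the output of `git diff --numstat` has already been captured as a string; the function name and the sample paths are illustrative.

```python
def count_changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output.

    Each line holds three tab-separated fields: added, deleted, path.
    Binary files report "-" for both counts and are treated as zero here.
    """
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        total += int(added) if added.isdigit() else 0
        total += int(deleted) if deleted.isdigit() else 0
    return total


sample = "10\t2\tsrc/a.py\n-\t-\tassets/logo.png\n0\t5\tsrc/b.py\n"
changed = count_changed_lines(sample)
print(changed, "over limit" if changed > 400 else "within limit")
```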
Reviewing AI Code with AI — Automated First Pass, Human Final
If humans alone cannot handle the load AI created, putting AI into review is the natural conclusion. The key is the division of roles.
PR created
    |
    v
[Gate 0] CI: build, tests, lint, diff size, secret scan
    |
    v
[Gate 1] AI first-pass review (what machines do well)
    - Hallucinated API check: does every import and call exist
    - New dependency check: registry existence, downloads, license
    - Spec comparison: table of acceptance criteria vs satisfaction
    - Flags excessive defensive code, dead code, duplication
    - Test weakening detection (removed asserts, added skips)
    |
    v
[Gate 2] Human final review (what humans do well)
    - Design appropriateness: is this approach itself right
    - Product judgment: priority among edge cases
    - Security design: trust boundaries, permission model
    - Spot-check the AI first-pass findings
An example of the instructions given to the AI first-pass reviewer.
# Reviewer agent instructions (excerpt)
You are a first-pass reviewer. You have no merge authority.
Output only tables in the format below.

Checks:
1. List every external API call in the diff and verify each one
   exists in the official docs for that version.
   Mark UNVERIFIED when you cannot confirm.
2. For each newly added dependency: look up registry registration
   date, weekly downloads, and license, and tabulate them.
3. Build a table with the issue acceptance criteria as rows and
   the satisfying evidence (file:line) as columns.
   Mark criteria with no evidence as MISSING.
4. List every assert removed or weakened in the test diff.

Forbidden: praise, trivial style nits, approval opinions
based on guesses.
One important operating principle: the AI reviewer's output must be evidence collection, not a verdict. If you delegate approve/reject to AI, defects sail through whenever the writing agent and the reviewing agent share the same blind spot. The AI builds the tables; the human judges.
Deterministic verification of hallucinated dependencies
Among the Gate 1 checks, "dependency existence verification" can be made deterministic with a script — no LLM required.
#!/usr/bin/env bash
# check-new-deps.sh: verify existence and hygiene of new dependencies
set -euo pipefail

BASE_REF="origin/main"

# Collect keys added to package.json. This matches any added key, not only
# dependencies, so treat it as a coarse filter. "|| true" keeps a diff with
# no matches from aborting the script under pipefail.
NEW_DEPS=$(git diff "$BASE_REF"...HEAD -- package.json \
  | grep '^+ ' | grep -oE '"[@a-z0-9/._-]+"\s*:' \
  | tr -d '": ' | sort -u || true)

for dep in $NEW_DEPS; do
  # npm view exits non-zero when the package is not in the registry
  created=$(npm view "$dep" time.created 2>/dev/null) || {
    echo "FAIL: $dep is not in the registry (hallucination/typo suspected)"
    exit 1
  }
  license=$(npm view "$dep" license 2>/dev/null || echo "UNKNOWN")
  echo "OK: $dep (created: $created, license: $license)"
done
echo "dependency check passed"
Nonexistent packages are blocked deterministically here, while registration dates and license information are passed along as inputs to the human review. Never delegate to an LLM what can be checked deterministically — that is the basic design principle of Gate 1.
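Test weakening detection, another Gate 1 item, lends itself to the same deterministic treatment. The sketch below is a rough heuristic under stated assumptions: it takes a unified diff of Python test files as text, and it recognizes only `assert` statements, `self.assert*` calls, and the `pytest.mark.skip` / `unittest.skip` decorators. The function name and the sample diff are illustrative.

```python
import re

ASSERT_RE = re.compile(r"^\s*(assert\b|self\.assert\w+\()")
SKIP_RE = re.compile(r"^\s*@(pytest\.mark\.skip|unittest\.skip)")


def summarize_test_weakening(diff_text: str) -> dict[str, int]:
    """Count removed asserts, added asserts, and added skips in a unified diff."""
    removed_asserts = added_asserts = added_skips = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not content
        body = line[1:]
        if line.startswith("-") and ASSERT_RE.match(body):
            removed_asserts += 1
        elif line.startswith("+") and ASSERT_RE.match(body):
            added_asserts += 1
        elif line.startswith("+") and SKIP_RE.match(body):
            added_skips += 1
    return {
        "removed_asserts": removed_asserts,
        "added_asserts": added_asserts,
        "added_skips": added_skips,
    }


sample_diff = '''\
--- a/tests/test_billing.py
+++ b/tests/test_billing.py
@@ -10,6 +10,6 @@
+@pytest.mark.skip(reason="flaky")
 def test_total():
-    assert total(items) == 42
-    assert total([]) == 0
+    assert total(items) is not None
'''
print(summarize_test_weakening(sample_diff))
```

The counts are evidence for the human reviewer, not a verdict: two precise asserts replaced by one vague one shows up as a net loss, but only a person can say whether the change was justified.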
Comparing implementation options
There are broadly three options for building the automated first pass.
| Option | Pros | Cons |
|---|---|---|
| Commercial review bot | Instant adoption, zero maintenance | Limited custom checks, code leaves your perimeter |
| Calling an agent directly from CI | Full control over check items | Prompt and cost management burden |
| In-house pipeline | Optimal mix of deterministic checks and LLM | Build and maintenance cost |
Whichever you choose, the core requirement is the same: check results must land on the PR in a structured format, and the human reviewer must be able to use them as input.
Review Checklist — Security, License, Performance
A checklist specialized for AI-generated code, written so you can paste it straight into your team wiki.
Security
- New dependencies: does the package actually exist in the registry, is the registration date suspiciously recent (within weeks), is the maintainer credible
- Input validation: do AI-written regexes carry ReDoS potential
- AuthN/AuthZ: were stale patterns from training data reproduced (weak hashes, hardcoded secrets, outdated TLS settings)
- Error handling: are catch blocks swallowing security-relevant errors
- Prompt provenance: if the code came from an agent that reads external content, compare the intent of the diff against the issue to rule out injection-induced changes
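The "suspiciously recent" check on new dependencies can be made concrete. This is a minimal sketch that assumes the creation timestamp is an ISO 8601 string such as the one printed by `npm view <pkg> time.created`. The 30-day threshold is an assumption standing in for "within weeks", and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone


def is_suspiciously_new(created: str, now: datetime, min_age_days: int = 30) -> bool:
    """Flag a package whose registry creation time is younger than min_age_days.

    `now` must be timezone-aware. The default threshold is an assumption,
    not a recommendation from any registry.
    """
    # Older Python versions do not accept a trailing "Z", so normalize it.
    created_at = datetime.fromisoformat(created.replace("Z", "+00:00"))
    return now - created_at < timedelta(days=min_age_days)


now = datetime(2026, 6, 15, tzinfo=timezone.utc)
print(is_suspiciously_new("2026-06-01T08:00:00.000Z", now))  # two weeks old
print(is_suspiciously_new("2014-03-10T12:00:00.000Z", now))  # long established
```

A young package is not proof of an attack, so a flag here should route the dependency to a human rather than fail the build outright.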
License
- Is the generated code effectively identical to a specific OSS implementation (similarity check at the large-function level)
- Does the license of new dependencies conflict with org policy (e.g., GPL-family banned)
- If internal policy requires provenance, is AI involvement recorded
Performance
- N+1 queries inside loops, repeated sorting, unnecessary deep copies — patterns AI produces often
- Needless checks on hot paths caused by excessive defensive code
- Concurrency: AI tends to take conservatively large lock scopes — check for bottleneck potential
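The N+1 pattern is easier to spot once you have seen it next to its batched equivalent. The sketch below uses a hypothetical in-memory `FakeDb` that counts round trips; the class, method names, and data are invented for illustration and do not correspond to any real ORM.

```python
class FakeDb:
    """In-memory stand-in for a database that counts round trips."""

    def __init__(self, authors: dict[int, str]):
        self.authors = authors
        self.queries = 0

    def get_author(self, author_id: int) -> str:
        self.queries += 1
        return self.authors[author_id]

    def get_authors(self, author_ids: set[int]) -> dict[int, str]:
        self.queries += 1
        return {i: self.authors[i] for i in author_ids}


def names_n_plus_one(db: FakeDb, posts: list[dict]) -> list[str]:
    # One query per post: the pattern to flag in review.
    return [db.get_author(p["author_id"]) for p in posts]


def names_batched(db: FakeDb, posts: list[dict]) -> list[str]:
    # One query for all distinct authors, then an in-memory lookup.
    authors = db.get_authors({p["author_id"] for p in posts})
    return [authors[p["author_id"]] for p in posts]


posts = [{"author_id": 1}, {"author_id": 2}, {"author_id": 1}, {"author_id": 2}]

slow_db = FakeDb({1: "Ada", 2: "Lin"})
fast_db = FakeDb({1: "Ada", 2: "Lin"})
print(names_n_plus_one(slow_db, posts), slow_db.queries)
print(names_batched(fast_db, posts), fast_db.queries)
```

Both functions return the same names, which is why the problem survives functional tests; only the query count differs.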
Redesigning the PR Unit and Commit Hygiene
A reviewable PR is made at writing time. Enforce the following on the agent via the guide file.
- One PR, one intent: never mix feature work and refactoring in one PR. Agents love "while I am at it" edits, so prohibit them explicitly.
- Commits in story order: have the agent structure commits in an order a reviewer can follow, like "add tests, implement, refactor."
- PR description template: what (spec link), how (approach summary), verification (tests run), and AI involvement (full / partial / none) as required fields.
- Separate mechanical from semantic changes: isolate formatting and import-sorting into their own commits to remove review noise.
An example PR description template:
## What (spec)
- Issue 142 — payment webhook retry logic
## How (approach summary)
- Idempotency key validation + exponential backoff retry queue
## Verification
- ./verify.sh passing
- 8 new tests (including 4 boundary cases)
- Manual webhook redelivery check on staging
## AI involvement
- partial (implementation: agent, test design and review: human)
## Notes for the reviewer
- Key judgment point: assumed max 5 retries (spec gap, please confirm)
- Intentionally excluded: dead-letter handling (follow-up PR planned)
The final section, "Notes for the reviewer," matters most. When the author surfaces the points they themselves felt unsure about, the reviewer's attention concentrates on the riskiest spots. The more AI wrote the code, the more important it is that a human writes this section personally.
The OSS Maintainer Perspective — What the jqwik Affair Left Behind
The outline of the jqwik affair: when the maintainer announced a conservative policy on AI contributions, fierce backlash and debate followed, escalating to personal attacks on the maintainer. What the affair reveals is not a technical problem but a governance problem.
For an OSS maintainer, AI contributions are an asymmetric-cost problem. A contributor produces a PR with an agent in ten minutes; the maintainer spends an hour reviewing it. Daniel Stenberg of curl publicly warning about AI-mass-produced fake security reports ("AI slop") belongs to the same context. The person paying the review cost and the person saving the generation cost are different people — that is the economics of the conflict.
The policy direction that healthy projects are converging on looks like this.
- Disclosure, not prohibition: rather than banning AI use, require PRs to state the degree of AI involvement. Sanction concealment when it is discovered.
- Contributor responsibility principle: "you must be able to explain the code you submit." Whatever the tool was, responsibility stays with the submitter.
- Graduated trust: do not accept large PRs from first-time contributors; let them build trust with small contributions before widening scope.
- Maintainer protection: declining to review is a legitimate right. A "right to be reviewed" does not exist.
A Team Policy Template
A template usable as a starting point for an internal team policy.
# AI-generated code policy (team standard v1)
## Allowed
- AI tools may be used for all production code
- However, the submitter must be able to explain every line of the diff
## Required
- State AI involvement in the PR description: full / partial / none
- AI-involved PRs must pass automated first-pass review (Gate 1)
- New dependencies go in a separate commit and must pass
  the dependency verification check
## Forbidden
- Agent edits to test files and CI configuration
- AI auto-replies to review comments (a human reads and answers)
- PRs over 400 lines (must split; exceptions need lead approval)
## Reviewer protection
- Review SLA for AI-involved PRs is 1.5x that of normal PRs
- Reviewers may reject with "unexplainable code" as the reason
Adoption Roadmap
There is no need to roll out the whole redesign at once. The order below starts with the steps that give the most impact for the least friction.
- Week 1 — diff size gate and AI involvement labeling: start with one CI job and one line in the PR template. Least friction, immediate effect.
- Week 2 — the dependency verification script: add check-new-deps.sh to CI. Deterministically blocks hallucinated packages and slopsquatting.
- Weeks 3 to 4 — introduce AI first-pass review (Gate 1): run it in observe-only mode at first — comments but no blocking — and measure precision.
- Month 2 — policy documentation and a metrics dashboard: adapt the team policy template to your team and make it official; review the metrics in the next section weekly.
- Every quarter — policy retrospective: remove checks with low precision, and derive new check items backwards from post-merge defects.
Frequently Asked Questions
- The cost of AI first-pass review is a burden — Run Gate 1 only on PRs that passed Gate 0 (CI), and start with conditional execution that trims check items for small diffs.
- How do we confirm whether the author used AI — Do not try. A disclosure-based policy (mandatory labeling plus sanctions for false labeling) has far lower operating cost and fewer disputes than a detection-based one.
- We are critically short on reviewers — Tightening the diff size cap comes before adding reviewers. Four 100-line PRs tend to get a more thorough review than one 400-line PR in the same review time.
- Should urgent fixes be exempt — Create an exception path, but mandate post-hoc review and track exception usage frequency as a metric. When exceptions become routine, the policy is dead.
Metrics — How Do You Know the Redesign Works
Policy does not improve without measurement. Recommended metrics:
| Metric | Definition | Desired direction |
|---|---|---|
| Review lead time | PR creation to first review | Down |
| Post-merge defect rate | Reverts and hotfixes within N days of merge | Down |
| Median diff size | Changed lines per PR | Hold under 400 |
| AI first-pass precision | Share of AI flags humans judge valid | Above 60 percent |
| Reviewer load balance | Variance of weekly reviewed lines per reviewer | Down |
| Hallucinated API escapes | Nonexistent API calls found after merge | Hold at zero |
Splitting "post-merge defect rate" by AI involvement is especially revealing. If AI-involved PRs have a significantly higher defect rate, strengthen Gate 1; if there is no difference, you can trim excess process.
You do not need to automate metric collection from day one. Counting just these three by hand, weekly, is enough to set direction.
- Share of PRs merged this week exceeding 400 lines
- Share of AI first-pass flags adopted by humans
- Number of merges that led to a revert or hotfix
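Even the hand count benefits from a fixed definition. The sketch below computes these three numbers plus the median diff size from a week of records. The record shape (`changed_lines`, `reverted`) and the function name are assumptions for illustration, and both input lists are assumed to be non-empty.

```python
from statistics import median


def review_metrics(prs: list[dict], ai_flags: list[bool], size_cap: int = 400) -> dict:
    """Summarize one week of review data.

    prs: dicts with 'changed_lines' (int) and 'reverted' (bool, revert or hotfix)
    ai_flags: one bool per AI first-pass flag, True when a human judged it valid
    """
    sizes = [pr["changed_lines"] for pr in prs]
    return {
        "median_diff_size": median(sizes),
        "oversize_share": sum(s > size_cap for s in sizes) / len(sizes),
        "post_merge_defects": sum(pr["reverted"] for pr in prs),
        "ai_flag_precision": sum(ai_flags) / len(ai_flags),
    }


week = [
    {"changed_lines": 120, "reverted": False},
    {"changed_lines": 380, "reverted": False},
    {"changed_lines": 650, "reverted": True},
    {"changed_lines": 90, "reverted": False},
]
flags = [True, True, False, True, False]
print(review_metrics(week, flags))
```

Adding an `ai_involvement` field to each record would allow the defect count to be split by AI involvement, as suggested above.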
Pitfalls and Critical Perspectives
The redesign itself has pitfalls.
First, automation bias. When the AI first pass says "pass," humans tend to skip their review. The principle that AI review is evidence collection, not absolution, must be enforced by process — for example, the human reviewer personally verifies at least two items from the AI report.
Second, the risk of proceduralism. As AI involvement labels, checklists, and diff caps multiply, review can become "procedure followed, nobody thinking." If the metrics improve while incidents increase, suspect proceduralism.
Third, a chilling effect on contribution culture. A side effect behind the jqwik affair was well-meaning contributors pulling back. It is worth stating in the policy document itself that the goal is restoring reviewability, not excluding AI.
Finally, a fundamental question remains. In a world where writing code is nearly free, is line-by-line review itself still valid? In the long run the object of review will likely shift — from "reading the code" to "verifying specs, tests, and properties." The redesign in this post is a strategy for that transition period.
Closing
Code review always did two jobs at once: catching defects, and keeping the team's shared understanding of the codebase alive. AI exploded the load of the first job and shook the premise of the second (that the author understands the code).
So the essence of the redesign is not tooling but a reallocation of responsibility. Give machines what machines can verify (hallucinated APIs, dependencies, diff size, test weakening), and give humans what only humans can do (design judgment, product sense, final accountability). And keep the principle that no matter who wrote it, code the submitter cannot explain does not get merged. That, I believe, is how code review keeps its meaning in the AI era.
References
- Johannes Link — The jqwik Anti-AI Affair: https://blog.johanneslink.net/2026/06/09/the-jqwik-anti-ai-affair/
- GeekNews — jqwik anti-AI affair topic: https://news.hada.io/topic?id=30373
- jqwik — Property-Based Testing for Java: https://jqwik.net/
- Google Engineering Practices — Code Review Guide: https://google.github.io/eng-practices/review/
- We Have a Package for You! (package hallucination study): https://arxiv.org/abs/2406.10279
- Do Users Write More Insecure Code with AI Assistants?: https://arxiv.org/abs/2211.03622
- Daniel Stenberg — The I in LLM stands for intelligence: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/
- GitHub Octoverse — AI and developer ecosystem statistics: https://github.blog/news-insights/octoverse/
- SmartBear — Best Practices for Peer Code Review: https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/
- Hacker News: https://news.ycombinator.com/