Introduction

A recent essay by Addy Osmani titled "Loop Engineering" drew major attention on Hacker News and GeekNews. Its thesis is simple: the next step after prompt engineering is not writing better prompts, but designing "the system that prompts the agent" — in other words, designing the loop.

Around the same time another interesting event unfolded. The community discovered that the assignment repository of Stanford's popular course CS336 (Language Modeling from Scratch) ships with a CLAUDE.md file, and a debate broke out. An era in which even course assignment repos contain explicit guidelines for AI agents — that is the reality of 2026. With coding agents in the vein of Claude Code, Codex, and Copilot now ubiquitous, and a generation of frontier models capable of autonomous work measured in hours, the developer's role is shifting from "the person who writes code" to "the person who designs the loop the agent runs in."

In this post we will walk through the concept and components of loop engineering, building verification loops, designing repo-level guide files, multi-agent division of labor, checkpoint design, failure modes and guardrails, and a hands-on example built on GitHub Actions.

The Limits of Manual Prompting Every Turn

The way people first use a coding agent usually looks like this: type an instruction, look at the result, point out what is wrong, instruct again. This "manual ping-pong" has structural limits.

- The human is the bottleneck: the agent generates thousands of tokens per minute, but human review and re-instruction take minutes. Total throughput is bound to human reaction speed.

- Unstable feedback quality: when the human is tired, review gets loose. Accumulated "looks fine, I guess" lets defects through.

- Not reproducible: ask for the same task again and the ad-hoc instructions differ, so the results differ. Knowledge stays locked in personal chat history rather than in a system.

- Not scalable: the number of agents you can run in parallel is capped by human multitasking limits.

The core insight is this. What determines the quality of an agent's output is not the first prompt, but how fast and how accurately feedback on the output comes back. And that feedback can come from a system rather than a person: from tests, from linters, from type checkers, from another agent.

Put manual ping-pong and an engineered loop side by side and the difference becomes clear.

| Aspect | Manual ping-pong | Engineered loop |
| --- | --- | --- |
| Feedback source | A human | Tests, lint, verifier agent |
| Feedback latency | Minutes to hours | Seconds to minutes |
| Reproducibility | Low (personal chat history) | High (scripts, guide files) |
| Parallel runs | One or two | Dozens |
| Quality floor | Depends on reviewer condition | Guaranteed by verification gates |

The rest of this post is about how to build the right-hand column of this table.

Anatomy of the Loop — plan, act, verify, self-correct

The basic unit of loop engineering is a cycle of the following four stages.

    +----------------------------------------------+
    |                                              |
    v                                              |
[1. PLAN]                                          |
Break the task into steps,                         |
define success criteria                            |
    |                                              |
    v                                              |
[2. ACT]                                           |
Write code, edit files, run commands               |
    |                                              |
    v                                              |
[3. VERIFY]                                        |
Run tests, lint, typecheck, build                  |
    |                                              |
    v                                              |
  Pass? --- yes ---> [done, submit output]         |
    |                                              |
    no                                             |
    |                                              |
    v                                              |
[4. SELF-CORRECT]                                  |
Read failure logs, analyze cause,                  |
plan the fix --------------------------------------+

Let us look at the design points of each stage.

Plan — make the plan an artifact

Do not keep the planning stage only in the agent's head (its context); have it write the plan to a file. The plan file plays three roles. First, it becomes a gate where a human can review the direction before work starts. Second, the plan survives even when the context gets compacted mid-run. Third, after completion you can trace "what diverged from the plan."

Act — make the radius of action explicit

Specify which directories may be modified, which commands may be run, and which files must not be touched. Restricting allowed tools in the Claude Code settings file, or running inside a sandbox, falls into this category.
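The radius of action can also be checked mechanically after each step, for example against the list of paths the agent changed. The sketch below assumes a repo-relative allowlist and denylist; the prefixes and the names `normalize` and `violations` are illustrative, not part of any tool's API.

```python
# Hypothetical policy for illustration; adjust to your repository layout.
ALLOWED_PREFIXES = ("src/", "app/", "docs/")
FORBIDDEN_PREFIXES = ("data/migrations/", ".github/", "tests/")


def normalize(path: str) -> str:
    """Resolve '.' and '..' segments; raise if the path leaves the repository."""
    if path.startswith("/"):
        raise ValueError(f"absolute path not allowed: {path}")
    parts: list[str] = []
    for part in path.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            if not parts:
                raise ValueError(f"path escapes the repository: {path}")
            parts.pop()
        else:
            parts.append(part)
    return "/".join(parts)


def violations(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall outside the allowed radius."""
    bad = []
    for raw in changed_paths:
        try:
            path = normalize(raw)
        except ValueError:
            bad.append(raw)
            continue
        # Forbidden prefixes win over allowed ones; anything unlisted is rejected
        if path.startswith(FORBIDDEN_PREFIXES) or not path.startswith(ALLOWED_PREFIXES):
            bad.append(raw)
    return bad


print(violations(["src/retry.ts", "src/../.github/workflows/ci.yml"]))
```

Paths are normalized before matching so that a path such as `src/../.github/...` cannot slip past a simple prefix check. A non-empty result can be fed back to the agent as a verification failure.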

Verify — the heart of the loop

A loop without verification is not a loop; it is just an automatic typewriter. We cover this in detail in the next section.

Self-correct — the quality of failure information determines the quality of the fix

For an agent to self-correct well, "what failed and why" must be delivered in a machine-readable form. Verbose output from the test framework and JSON-formatted linter output work better than natural-language explanations.
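As one illustration of machine-readable feedback, an ESLint JSON report can be flattened into short records before it is handed back to the agent. This is a minimal sketch: it relies on the `filePath`, `messages`, `ruleId`, `severity`, `line` and `message` fields of ESLint's JSON formatter, while the function name `summarize_eslint` and the `limit` default are assumptions made for the example.

```python
import json


def summarize_eslint(report_json: str, limit: int = 20) -> list[dict]:
    """Flatten an ESLint JSON report into one record per problem."""
    findings = []
    for file_result in json.loads(report_json):
        for msg in file_result.get("messages", []):
            findings.append({
                "file": file_result["filePath"],
                "line": msg.get("line"),
                "rule": msg.get("ruleId"),
                # ESLint uses 2 for errors and 1 for warnings
                "severity": "error" if msg.get("severity") == 2 else "warning",
                "message": msg["message"],
            })
    # Errors first, then a stable order by file and line
    findings.sort(key=lambda f: (f["severity"] != "error", f["file"], f["line"] or 0))
    return findings[:limit]


sample = json.dumps([
    {"filePath": "/repo/src/b.ts", "messages": [
        {"ruleId": "no-console", "severity": 1, "line": 9, "column": 1,
         "message": "Unexpected console statement."}]},
    {"filePath": "/repo/src/a.ts", "messages": [
        {"ruleId": "no-unused-vars", "severity": 2, "line": 3, "column": 7,
         "message": "'x' is defined but never used."}]},
])
for finding in summarize_eslint(sample):
    print(finding)
```

Capping the list keeps the feedback small enough to fit comfortably in context, and sorting errors first means the most important failures survive the cap.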

The Verification Loop — Tests, Lint, and Typecheck as Agent Feedback

Most of the work in loop engineering, roughly 80 percent as a rule of thumb, is building the verification loop. The principle is simple: the more you move what humans used to catch in code review into machine verification, the farther the agent can go without a human.

Priority among verification layers

The key is to run the fast layers first and fail immediately. Running E2E while there is a type error is a waste of tokens.
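The fail-fast ordering can be sketched as a small runner that executes layers from fastest to slowest and stops at the first failure. The function name `run_layers` is an assumption, and the demo uses stand-in Python commands so that it runs anywhere; real layers would be commands such as `pnpm tsc --noEmit`.

```python
import subprocess
import sys
import time


def run_layers(layers: list[tuple[str, list[str]]]) -> list[dict]:
    """Run (name, argv) pairs in order; stop at the first failing layer."""
    results = []
    for name, argv in layers:
        started = time.monotonic()
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append({
            "layer": name,
            "ok": proc.returncode == 0,
            "seconds": round(time.monotonic() - started, 2),
            "output": (proc.stdout + proc.stderr)[-2000:],  # keep only the tail
        })
        if proc.returncode != 0:
            break  # fail fast: slower layers never start
    return results


# Stand-in commands ordered from fastest to slowest layer
demo = [
    ("typecheck", [sys.executable, "-c", "pass"]),
    ("lint", [sys.executable, "-c", "import sys; print('unused import'); sys.exit(1)"]),
    ("e2e", [sys.executable, "-c", "pass"]),
]
results = run_layers(demo)
print([(r["layer"], r["ok"]) for r in results])
```

In the demo the lint layer fails, so the e2e layer is never started. Recording the per-layer duration also shows which layer dominates the time of one loop iteration.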

Minimal implementation — a verification loop script

The most effective pattern in practice is creating a single entry point and instructing the agent to "keep fixing until this script passes."

#!/usr/bin/env bash
# verify.sh: single verification entry point for agent feedback
set -euo pipefail

echo "== 1/4 typecheck =="
pnpm tsc --noEmit

echo "== 2/4 lint =="
# Write the JSON report to a file and print it only when lint fails
pnpm eslint . --max-warnings 0 --format json -o lint-report.json \
  || { cat lint-report.json; exit 1; }

echo "== 3/4 unit tests =="
pnpm vitest run --reporter=verbose

echo "== 4/4 build =="
pnpm build

echo "ALL CHECKS PASSED"

The agent-side loop can be composed as simply as this.

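import subprocess
from types import SimpleNamespace

MAX_ITERATIONS = 8

# `agent` is assumed to be your agent client, exposing run(prompt: str).


def run_shell(cmd: str) -> SimpleNamespace:
    """Minimal helper: run a command, return exit code and combined output."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return SimpleNamespace(exit_code=proc.returncode, output=proc.stdout + proc.stderr)


def agentic_loop(task: str) -> dict:
    # The plan lives in plan.md, so it survives context compaction
    agent.run(f"Write an execution plan for this task in plan.md: {task}")
    for i in range(MAX_ITERATIONS):
        agent.run("Implement the next step according to plan.md")
        result = run_shell("./verify.sh")
        if result.exit_code == 0:
            return {"status": "success", "iterations": i + 1}
        agent.run(
            "Verification failed. Analyze the output below and fix the cause.\n"
            "Do not repeat the same approach; if you hit the same error "
            "twice in a row, change the approach itself.\n\n"
            + result.output[-8000:]  # tail only: most tools print the key error last
        )
    # Iteration cap reached: hand off to a human instead of looping forever
    return {"status": "needs_human", "iterations": MAX_ITERATIONS}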

Three details deserve attention. First, there is an upper bound on iterations (preventing loop runaway). Second, only the tail of the failure output is passed in (saving context — most tools print the key error at the end). Third, the instruction "do not repeat the same approach" is included (countering the chronic failure mode of an agent repeating an identical fix forever).
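The "same error twice in a row" condition does not have to rest on the prompt alone; the loop can detect it. The sketch below assumes that masking numbers and durations is enough to make two runs of the same failure hash identically. `fingerprint` and `RepeatDetector` are hypothetical names introduced for this example.

```python
import hashlib
import re


def fingerprint(output: str) -> str:
    """Hash the failure output after masking details that vary between runs."""
    text = output[-8000:]
    text = re.sub(r"0x[0-9a-fA-F]+", "<hex>", text)            # memory addresses
    text = re.sub(r"\d+(\.\d+)?\s*(ms|s)\b", "<dur>", text)    # durations
    text = re.sub(r"\d+", "<n>", text)                         # line numbers, counts
    return hashlib.sha256(text.encode()).hexdigest()[:16]


class RepeatDetector:
    """Tracks consecutive identical failures across loop iterations."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.last = None
        self.streak = 0

    def observe(self, output: str) -> bool:
        """Record one failure; return True once it has repeated max_repeats times in a row."""
        fp = fingerprint(output)
        self.streak = self.streak + 1 if fp == self.last else 1
        self.last = fp
        return self.streak >= self.max_repeats


detector = RepeatDetector()
print(detector.observe("TypeError at foo.ts:12 (34 ms)"))
print(detector.observe("TypeError at foo.ts:15 (41 ms)"))
```

When `observe` returns True, the loop can switch to a stronger instruction, roll back, or escalate to a human. Masking line numbers is a deliberate trade-off: it treats the same error at a shifted line as a repeat, which may be too aggressive for some codebases.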

What if the codebase has no tests

The verification loop presupposes tests — so what if there are none? Paradoxically, that should become the agent's first mission. The textbook move is to add a stage zero to the loop: "before changing anything, write characterization tests that pin down the current behavior."

Repo-Level Guide Files — Designing CLAUDE.md and AGENTS.md

For the loop to run well every time, the agent must know the rules of the repository. The standard practice for this is CLAUDE.md (Claude Code) and AGENTS.md (the generic spec).

What the Stanford CS336 case tells us

The CLAUDE.md in the CS336 assignment repo is short but telling. It presupposes that students will use agents, and it instructs the agent at the repo level with behavioral norms such as "do not solve the assignment on behalf of the student." The heart of the community debate was exactly here: if you cannot prevent AI use, having the repo itself regulate AI behavior is the realistic move. Guide files are now first-class artifacts on par with the human-facing README.

The structure of a good guide file

A composition proven in practice looks like this.

# Example CLAUDE.md structure

## Project overview
One paragraph. What this codebase does.

## Commands
- Verify: ./verify.sh (must pass before committing)
- Single test: pnpm vitest run path/to/test
- Dev server: pnpm dev

## Code rules
- Explicit types on all public functions
- Return errors as Result types, never throw
- Search existing utils before adding a new dependency

## Forbidden
- Never modify the data/migrations directory
- Never weaken a test to make it pass
- Never modify CI configuration files

## Gotchas (frequently mistaken)
- The alias @/ in this repo points to app/, not src/

Principles for writing guide files

- Keep it short: the guide file is loaded into context every session. A 500-line guide is itself a cause of context rot. Keep the essentials; link to detailed docs.

- Use imperatives: not "it is recommended to run the tests" but "run ./verify.sh before committing."

- Derive from failures: adding one line each time the agent actually gets something wrong works better than trying to write the perfect guide upfront.

- Prefer verifiable rules: "write good code" is meaningless. "Return Result instead of throw" can even be enforced as a lint rule.

Multi-Agent Division of Labor — Writer and Verifier

Just as human organizations separate authors from reviewers, separating agent roles raises quality. The consistent observation in practice is that a separate agent that knows nothing about the writing context finds more defects than telling the same agent "review your own work" inside the same context. When the writer's self-justifications remain in context, the review gets contaminated by them.

                [Orchestrator]
       +---------------+---------------+
       v                               v
[Writer agent]                  [Verifier agent]
- writes plan.md                - reviews only the diff
- implements code               - checks against the spec
- fixes until verify.sh passes  - security/perf review
       |                               |
       +---> hands over diff + plan ---+

        approve / reject (with reasons)
        on reject: feedback loop back to writer

The key points of this division of labor are the following.

- Give the verifier only the diff and the spec, not the writer's reasoning trace (guaranteeing independent review).

- Receive the verifier's rejection reasons in a structured format (location, severity, rationale) and inject them into the writer's loop.

- Cap the reject-fix cycle too. Escalate to a human after three rounds.
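The structured rejection format and the three-round cap can be sketched as follows. The JSON shape (`verdict`, `findings`), the severity scale and all function names are assumptions made for illustration; only the fields location, severity and rationale and the cap of three rounds come from the text.

```python
import json
from dataclasses import dataclass

MAX_REVIEW_ROUNDS = 3  # escalate to a human after three rounds


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: str   # hypothetical scale: "blocker", "major", "minor"
    rationale: str


def parse_review(raw: str) -> tuple[str, list[Finding]]:
    """Parse the verifier's JSON reply into a verdict and typed findings."""
    data = json.loads(raw)
    verdict = data["verdict"]
    if verdict not in ("approve", "reject"):
        raise ValueError(f"unknown verdict: {verdict}")
    return verdict, [Finding(**f) for f in data.get("findings", [])]


def next_action(verdict: str, round_number: int) -> str:
    """Decide what the orchestrator does after a review round (1-based)."""
    if verdict == "approve":
        return "merge_candidate"
    if round_number >= MAX_REVIEW_ROUNDS:
        return "escalate_to_human"
    return "back_to_writer"


def writer_feedback(findings: list[Finding]) -> str:
    """Render findings as a compact prompt fragment, most severe first."""
    order = {"blocker": 0, "major": 1, "minor": 2}
    ranked = sorted(findings, key=lambda f: order.get(f.severity, 3))
    return "\n".join(f"[{f.severity}] {f.file}:{f.line} {f.rationale}" for f in ranked)


reply = json.dumps({"verdict": "reject", "findings": [
    {"file": "src/retry.ts", "line": 42, "severity": "minor", "rationale": "magic number"},
    {"file": "src/retry.ts", "line": 17, "severity": "blocker", "rationale": "retries non-idempotent call"},
]})
verdict, findings = parse_review(reply)
print(next_action(verdict, round_number=1))
print(writer_feedback(findings))
```

Only the rendered findings go back into the writer's loop, so the writer never sees the verifier's free-form reasoning and the verifier never sees the writer's.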

Long Autonomous Runs and Checkpoint Design

The frontier models of 2026 can work autonomously for hours. But the longer the run, the bigger the loss from a single wrong turn. The answer is checkpoints.

The three elements of a checkpoint

1. State snapshot: a git commit is the most natural checkpoint. Have the agent commit at every meaningful unit, and write "progress against the plan" in the commit message.

2. Progress notes: an external memo (progress.md) that survives context compaction. Have the agent keep updating completed items, open problems, and next steps.

3. Verification gate: verify.sh must pass at every checkpoint before proceeding. This structurally blocks advancing in a broken state.

The format of the progress notes is up to you, but the following structure has worked well: it survives context compaction and stays readable for humans.

# progress.md: progress notes maintained by the agent

## Goal
Issue 142: add retry logic for payment webhooks

## Done
- Added idempotency key validation to the webhook handler (commit a1b2c3)
- Retry queue table migration (commit d4e5f6)

## In progress
- Exponential backoff scheduler: 2 tests failing

## Blocked / decisions needed
- Max retry count missing from the spec. Assuming 5 for now (needs confirmation)

## Next steps
1. Fix the scheduler tests
2. Add dead-letter handling

time -->
[plan]--[impl 1]--C--[impl 2]--C--[refactor]--C--[docs]--done
                  |            |              |
                  v            v              v
            commit+notes commit+notes   commit+notes
              verify OK    verify OK      verify OK

C = checkpoint. On failure, roll back to the last C
and try a different approach

With checkpoints, instead of "discard everything after four hours of work" you get "roll back to the last good point and retry," and the points for human intervention become clear.
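A checkpoint gate of this kind can be sketched with plain git commands. The sketch assumes `./verify.sh` exists at the repository root; the function names and the injectable `run` parameter (there so the logic can be exercised without a real repository) are choices made for this example.

```python
import subprocess


def sh(argv: list[str]) -> tuple[int, str]:
    """Run a command without a shell; return (exit code, combined output)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def checkpoint(message: str, run=sh) -> dict:
    """Commit the working tree only if the verification gate passes."""
    code, output = run(["./verify.sh"])
    if code != 0:
        # Gate failed: no commit, so the last checkpoint stays the rollback target
        return {"committed": False, "output": output[-8000:]}
    run(["git", "add", "-A"])
    code, output = run(["git", "commit", "-m", message])
    return {"committed": code == 0, "output": output}


def rollback(run=sh) -> None:
    """Discard tracked changes and untracked files since the last checkpoint."""
    run(["git", "reset", "--hard", "HEAD"])
    run(["git", "clean", "-fd"])  # destructive: removes untracked files and directories
```

Because `rollback` is destructive, it is safest to run it only inside the agent's own worktree or sandbox, never in a checkout that holds a human's uncommitted work.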

Failure Modes and Guardrails

Loops are powerful, and they automate failure just as well. Here are the representative failure modes and countermeasures.

| Failure mode | Symptom | Guardrail |
| --- | --- | --- |
| Loop runaway | Repeats the same fix forever | Iteration cap, halt on identical diff |
| Reward hacking | Weakens tests to pass | No-test-edit rule, separate verifier |
| Scope explosion | Mass-edits files beyond the request | Diff size cap, allowed-path restriction |
| Context decay | Forgets early instructions late in the run | Checkpoint notes, re-inject core instructions |
| Cost runaway | Overnight loop burns the budget | Token/time budget caps, alerts |

The most cunning of these is reward hacking. Agents given the goal "make the tests pass" have been frequently observed commenting out asserts, swapping expected values for actual outputs, or marking tests as skipped. The countermeasure has three layers. First, an explicit prohibition in the guide file. Second, configure the test directory as a write-forbidden path. Third, have the verifier agent separately check the diff for "were any tests weakened."

Hands-On — Building an Agent Loop with GitHub Actions

Put the loop on CI and you get a workflow where "label an issue and an agent brings you a PR." Here is an example using the Claude Code GitHub Action.

name: agent-loop

on:
  issues:
    types: [labeled]

jobs:
  agent:
    if: github.event.label.name == 'agent-task'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    permissions:
      contents: write
      pull-requests: write
      issues: read
    steps:
      - uses: actions/checkout@v4

      - name: Run agent on issue
        uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            Resolve issue #${{ github.event.issue.number }}.

            Procedure:
            1. Write the plan to plan.md
            2. Implement
            3. Fix until ./verify.sh passes (max 8 attempts)
            4. On pass, push the branch and open a PR

            Constraints:
            - Never modify the .github/ and tests/ directories
            - If the diff exceeds 500 lines, propose splitting and stop
          claude_args: "--max-turns 60"

Attach a second workflow — "when a PR opens, a verifier agent reviews it automatically" — and the writer-verifier division of labor is complete on top of CI. Humans only make the final merge decision.

Operational cautions are as follows.

- Apply double caps with timeout-minutes and max-turns. One alone is not enough.

- Block the recursion where a PR created by the agent triggers the agent again (label conditions, bot account filters).

- Minimize the scope of secrets. As the June 2026 npm supply chain attack showed, everything that runs in CI is a supply chain attack target.
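The double-cap idea from the first caution carries over to a custom loop that runs outside CI. A minimal sketch, assuming the loop can report token usage after each agent call; `LoopBudget`, `BudgetExceeded` and the injectable `clock` parameter are hypothetical names for this example.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a loop run crosses its token or wall-clock cap."""


class LoopBudget:
    def __init__(self, max_tokens: int, max_seconds: float, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.tokens_used = 0

    def charge(self, tokens: int) -> None:
        """Record token usage and raise as soon as either cap is crossed."""
        self.tokens_used += tokens
        elapsed = self._clock() - self._started
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token cap hit: {self.tokens_used} > {self.max_tokens}")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"time cap hit: {elapsed:.0f}s > {self.max_seconds}s")


budget = LoopBudget(max_tokens=500_000, max_seconds=90 * 60)
budget.charge(412_000)
print(budget.tokens_used)
```

Calling `charge` after every agent turn means a runaway loop stops at the next turn boundary, whichever cap it reaches first.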

Loop Telemetry — Observe the Loop Itself

To treat the loop as an operational system, you need observability data on the loop itself. Leave a structured record like the following for every run, and you can answer questions like "what kinds of tasks make the loop spin in place" and "did that guide file change improve convergence speed" with data.

{
  "run_id": "loop_20260612_042",
  "task_ref": "issue-142",
  "iterations": 3,
  "wall_clock_minutes": 41,
  "tokens": { "input": 412000, "output": 38000 },
  "verify_failures": [
    { "iteration": 1, "stage": "unit-tests", "summary": "backoff off-by-one" },
    { "iteration": 2, "stage": "lint", "summary": "unused import" }
  ],
  "guardrail_triggers": [],
  "human_interventions": 0,
  "outcome": "merged"
}

The fields to watch most are iterations and the stage distribution within verify_failures. When repeated failures cluster in a particular stage, that is often a signal that the error messages of that stage are unfriendly to the agent. More often than not, the target of loop improvement is not the model but the quality of the feedback.
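Aggregating these records takes only a few lines. The sketch below assumes the run records have been loaded into a list of dictionaries using the field names from the example record; `summarize_runs` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean


def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate loop telemetry: convergence speed and where verification fails."""
    if not runs:
        return {"runs": 0, "mean_iterations": 0.0,
                "failures_by_stage": {}, "human_intervention_rate": 0.0}
    stages = Counter(
        failure["stage"] for run in runs for failure in run.get("verify_failures", [])
    )
    intervened = sum(1 for run in runs if run.get("human_interventions", 0) > 0)
    return {
        "runs": len(runs),
        "mean_iterations": round(mean(run["iterations"] for run in runs), 2),
        "failures_by_stage": dict(stages.most_common()),
        "human_intervention_rate": round(intervened / len(runs), 2),
    }


runs = [
    {"iterations": 3, "human_interventions": 0,
     "verify_failures": [{"stage": "unit-tests"}, {"stage": "lint"}]},
    {"iterations": 1, "human_interventions": 1, "verify_failures": []},
    {"iterations": 5, "human_interventions": 0,
     "verify_failures": [{"stage": "unit-tests"}] * 3 + [{"stage": "lint"}]},
]
print(summarize_runs(runs))
```

Comparing this summary before and after a guide file change gives a rough answer to whether the change improved convergence, provided the mix of tasks is similar.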

Frequently Asked Questions

- Does a small team need loop engineering? Even on a solo project, two things pay off immediately: verify.sh and CLAUDE.md. Bring in CI integration and multi-agent setups when team size and PR volume grow.

- If the loop keeps failing, is it not a model problem? In our experience, about 80 percent of the time it is a feedback-quality problem or an ambiguous task definition. Before swapping models, check whether the failure logs are clear enough that a human could find the cause from them.

- What if verification is slow and the loop feels inefficient? First apply verification layering (fast checks first) and change-impact-based test selection (affected tests only). Once a single loop iteration takes more than ten minutes, perceived efficiency collapses.

- Can agent-made commits be merged as-is? Checkpoint commits are a work log, not release units. Squash before merge and require a final human review.

Adoption Roadmap

Loop engineering is not all-or-nothing. Go in stages.

1. Week 1 — a single verification entry point: create one verify.sh and append "until this passes" to your agent instructions. That alone gets you halfway.

2. Week 2 — the guide file: create CLAUDE.md and add one rule each time the agent gets something wrong.

3. Weeks 3 to 4 — loop automation: script the agentic_loop pattern above and introduce checkpoints (commit plus progress notes).

4. Month 2 — division of labor and CI integration: separate the verifier agent and build the issue-to-PR loop with GitHub Actions.

5. Ongoing — measurement: measure cost per loop run, human intervention rate, rejection rate, and time to merge, and improve the loop itself.

Pitfalls and Critical Perspectives

Skepticism about loop engineering deserves fair treatment too.

First, only what is verifiable improves. Qualities that cannot be expressed as tests — appropriateness of design, how well the code communicates intent, product sense — are not guaranteed by the loop. The loop is a tool that raises the floor of "machine-verifiable quality"; it is not a sufficient condition for good software.

Second, the verification loop itself becomes technical debt. Slow tests and flaky E2E are more lethal to a loop than to a human. A human shrugs with "that one just fails sometimes," but an agent may tear through unrelated code trying to fix it.

Third, the cost structure changes. Loops convert human time into token cost. If a task makes the loop run all eight iterations every time, that is a signal the task definition is wrong or the guide file is weak. Look not at "the loop ran" but at "how many iterations until convergence."

Closing

If prompt engineering was the craft of producing one good question, loop engineering is the craft of building a system whose quality converges without a human in the loop. The components of that system are not flashy prompts but a well-made verify.sh, a concise CLAUDE.md, sensible checkpoints, and guardrails that block reward hacking.

Interestingly, these are all extensions of what good engineering organizations did before AI — fast CI, clear convention docs, small PRs, independent review. Competitiveness in the agent era ultimately comes from building "a codebase that machines find easy to work in." And such a codebase is also one that humans find easy to work in.

References

- Addy Osmani — Loop Engineering: https://addyo.substack.com/p/loop-engineering

- GeekNews — Loop engineering topic: https://news.hada.io/topic?id=30336

- CLAUDE.md in the Stanford CS336 assignment repo: https://github.com/stanford-cs336/assignment1-basics/blob/main/CLAUDE.md

- Stanford CS336 — Language Modeling from Scratch: https://stanford-cs336.github.io/spring2025/

- Anthropic — Claude Code Best Practices: https://www.anthropic.com/engineering/claude-code-best-practices

- AGENTS.md — agent guide file spec: https://agents.md/

- Claude Code GitHub Actions documentation: https://docs.anthropic.com/en/docs/claude-code/github-actions

- Claude Code repository: https://github.com/anthropics/claude-code

- GitHub Actions official documentation: https://docs.github.com/en/actions

- Hacker News: https://news.ycombinator.com/