Introduction
If you scan GeekNews (https://news.hada.io/) and Hacker News (https://news.ycombinator.com/) these days, you will find posts about agentic coding appearing almost daily. The conversation has moved beyond the simple idea that AI writes code for us. The central question now is how to build a reliable, autonomous development loop in which an AI agent runs tests itself, reads error messages, fixes the code, and runs again.
Recently, a survey on agentic coding put together by researchers from UIUC, Meta, and Stanford circulated on GeekNews and sparked a lot of discussion. The survey emphasizes that performance is governed less by the agent's raw ability to generate code and more by how it executes and verifies that code and feeds the results back as input. Related papers keep appearing in the software engineering category on arXiv (https://arxiv.org/list/cs.SE/recent).
The perspective this post explores can be summarized in a single sentence. **Code is not the LLM's final artifact; it is the execution harness through which an agent interacts with its environment.** Let us look at why this shift in framing matters and what design patterns and pitfalls it leads to in practice.
Concept: From Artifact to Harness
Under the traditional code generation view, the role of the LLM was clear. Take a natural language spec and emit a chunk of code. Here, code is the **artifact**. A human takes it, compiles it, runs it, and debugs it.
The code-as-harness view takes one more step. Code is not merely a deliverable; it becomes the **hand** with which the agent touches the world. The code the agent writes is executed immediately, and the result of that execution (whether tests pass, the stack trace, the standard output) becomes the input for the agent's next decision.
| Aspect | Artifact view | Harness view |
| --- | --- | --- |
| Role of code | Final deliverable | Execution substrate, interaction medium |
| Feedback | Human verifies manually | Execution result loops back automatically |
| Error handling | Human debugs | Agent self-corrects |
| Progress measurement | Subjective, after the fact | Measurable, e.g. test pass rate |
| One shot? | Oriented toward one-shot generation | Oriented toward an iterative loop |
The word "harness" here carries two meanings at once. One is the sense of a test harness: a scaffold that runs and verifies code. The other is the sense of a harness on a horse: a device that channels the agent's power in a controllable direction. Both senses capture the heart of the matter.
This shift carries real weight in practice. Under the artifact view, a smarter model meant a better result, so people chased bigger models and more elaborate prompts. Under the harness view, the questions change: which tools do we hand this model, how do we turn those tools' output into observations, and when do we make it stop? Leave the model unchanged and swap only the harness, and the result changes. In other words, the lever for improvement moves from the model to the execution environment that wraps it.
This is welcome news for software engineers. We cannot touch the model's weights directly, but the harness is a domain we write and control with code. Designing a good harness is ultimately designing a good system, and that is work we already know how to do.
Why an Executable Harness Beats One-Shot Generation
Expecting an LLM to emit correct code in a single shot is like asking a person to write a full program, start to finish, with zero compile errors and without once running it along the way. Even seasoned developers do not work that way. We write, run, see the red squiggles, and fix.
There are three reasons an executable harness beats one-shot generation.
First, **grounding**. The model's output is anchored to a real execution environment. Even if the model hallucinates that a function returns a list, actually running it reveals that it returns None. Execution is the cheapest and most powerful counterexample to hallucination.
Second, **self-correction**. An error message is rich information in itself. A single line of a stack trace clearly points the direction of the next fix. The model takes this signal and makes its next attempt more accurate.
Third, **measurable progress**. A figure such as "nine of twelve tests passing" is a clear progress signal both for the agent and for the human watching it. This measurability is exactly what makes benchmarks like SWE-bench (https://www.swebench.com/) possible.
These three reinforce one another. When grounding strips away hallucination, self-correction picks the right direction; as self-correction accumulates, measurable progress piles up. Conversely, when this loop is broken, that is, when there is no execution feedback, the model has no way to know whether its output is right or wrong. The fundamental weakness of one-shot generation is precisely this absence of feedback. Once the model has produced an answer, it must let go without ever seeing how that answer fares against reality.
What is interesting is that improving the harness often yields a bigger effect than improving the model itself. The same model, placed on a harness with a solid execution environment and verification loop, can score markedly higher on benchmarks. This suggests that agent performance comes not from the model's intelligence alone but from the design of the execution substrate surrounding that intelligence.
The Structure of the Agent Loop
The core of implementing code-as-harness is the agent loop. The most widely used form is the ReAct (Reasoning + Acting) pattern, which, since it was proposed in a 2022 paper (https://arxiv.org/abs/2210.03629), has become the skeleton of many coding agents.
The essence of the loop is to alternate between reasoning and action. The model thinks about what to do (Reason), calls a tool to act (Act), observes the environment's response (Observe), and feeds that observation back as input for reasoning.
+--------------------------+
|           LLM            |
|  (reason / pick action)  |
+------------+-------------+
             |
      action | (tool call)
             v
+--------------------------+
|       Environment        |
| shell / filesystem / test|
+------------+-------------+
             |
    feedback | (stdout, error, exit code)
             v
+--------------------------+
|   Observation assembly   |
|  add result to context   |
+------------+-------------+
             |
             | if goal not met, repeat
             +-----------> (back to LLM)
While this loop spins, code keeps executing. The patch the agent wrote is applied, the test suite runs, and the result is converted into an observation and handed back to the model. Code is not an artifact made once and done; it is the execution substrate that collides with the environment on every iteration.
Structured Tool Schemas
For an agent to interact with its environment, the interface of the tools the model can call must be clearly defined. Today most LLM APIs support declaring tools in the form of a JSON schema. Below is an example declaring a shell command tool and a file write tool.
{
  "tools": [
    {
      "name": "run_shell",
      "description": "Run a shell command inside the sandbox and return stdout, stderr, and exit code.",
      "input_schema": {
        "type": "object",
        "properties": {
          "command": {
            "type": "string",
            "description": "The full shell command to run"
          },
          "timeout_sec": {
            "type": "integer",
            "description": "Timeout in seconds. Default 30",
            "default": 30
          }
        },
        "required": ["command"]
      }
    },
    {
      "name": "write_file",
      "description": "Write a file at the given path. Create directories if missing.",
      "input_schema": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "description": "Path relative to the project root"
          },
          "content": {
            "type": "string",
            "description": "The full content to write to the file"
          }
        },
        "required": ["path", "content"]
      }
    }
  ]
}
Defining schemas clearly is not merely a matter of form. If a tool's description is poor, the model calls the tool incorrectly; an incorrect call yields an incorrect observation; and an incorrect observation contaminates the whole loop. A good tool schema is half of a good agent.
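One way to keep a malformed call from contaminating the loop is to check the arguments against the declared schema before running anything. The sketch below is a hand-rolled check, not a full JSON Schema validator: `validate_args` and `RUN_SHELL_SCHEMA` are hypothetical names introduced here, and only `required` and primitive `type` are enforced.

```python
# Hypothetical helper: validate a tool call's arguments against its declared
# input_schema before executing it. Only "required" and primitive "type"
# checks are covered; this is not a complete JSON Schema implementation.

_JSON_TYPES = {
    "string": str,
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_args(input_schema, args):
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    props = input_schema.get("properties", {})
    for name in input_schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            problems.append(f"unknown argument: {name}")
            continue
        declared = props[name].get("type")
        expected = _JSON_TYPES.get(declared)
        # bool is a subclass of int in Python, so reject it explicitly.
        wrong_bool = expected is int and isinstance(value, bool)
        if expected and (wrong_bool or not isinstance(value, expected)):
            problems.append(
                f"{name}: expected {declared}, got {type(value).__name__}"
            )
    return problems


# Same shape as the run_shell declaration above, trimmed to what is checked.
RUN_SHELL_SCHEMA = {
    "type": "object",
    "properties": {
        "command": {"type": "string"},
        "timeout_sec": {"type": "integer"},
    },
    "required": ["command"],
}

print(validate_args(RUN_SHELL_SCHEMA, {"command": "pytest -x"}))
print(validate_args(RUN_SHELL_SCHEMA, {"cmd": "pytest", "timeout_sec": "30"}))
```

Returning the problem list to the model as the observation, instead of raising, gives it a chance to correct the call on the next turn.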
A Minimal Harness Loop Implementation
Now let us tie those tools together into an actual agent loop. Below is a simplified Python pseudo-implementation meant to convey the concept. Real production code needs more careful error handling and security isolation.
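import json
import os
import subprocess


class ToolRegistry:
    """Maps tool names to callables and keeps each tool's JSON schema."""

    def __init__(self):
        self._tools = {}
        self._schemas = []

    def register(self, name, fn, schema=None):
        self._tools[name] = fn
        if schema is not None:
            self._schemas.append(schema)

    def schema(self):
        # Tool declarations handed to the model on every turn.
        return list(self._schemas)

    def call(self, name, args):
        if name not in self._tools:
            return {"error": f"unknown tool: {name}"}
        try:
            return self._tools[name](**args)
        except Exception as exc:
            # Failures (including timeouts) become observations, not crashes.
            return {"error": str(exc)}


def run_shell(command, timeout_sec=30):
    # shell=True executes whatever the model wrote: run this only in a sandbox.
    proc = subprocess.run(
        command,
        shell=True,
        capture_output=True,
        text=True,
        timeout=timeout_sec,
    )
    return {
        # Keep only the last 4000 characters so long logs do not flood the context.
        "stdout": proc.stdout[-4000:],
        "stderr": proc.stderr[-4000:],
        "exit_code": proc.returncode,
    }


def write_file(path, content):
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return {"ok": True, "bytes": len(content.encode("utf-8"))}


def agent_loop(llm, registry, goal, max_steps=20):
    # `llm` is a placeholder client: complete() is assumed to return an object
    # with .content, .tool_calls (each with .id, .name, .args), and .is_final.
    messages = [{"role": "user", "content": goal}]
    for step in range(max_steps):
        response = llm.complete(messages, tools=registry.schema())
        # Real APIs usually also expect the tool calls on the assistant message.
        messages.append({"role": "assistant", "content": response.content})
        if response.tool_calls:
            for call in response.tool_calls:
                result = registry.call(call.name, call.args)
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": json.dumps(result, ensure_ascii=False),
                })
            continue  # let the model react to the new observations
        if response.is_final:
            return response.content
    return "Max steps reached: goal not completed"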
What to notice in this code is the flow of the loop. When the model calls a tool, the result is appended to messages and handed back to the model. When the model stops calling tools and produces a final answer, the loop ends. The stdout and exit_code that are the results of the tool call become the raw material for the very next round of reasoning. This is the substance behind the claim that code is the harness.
The Verification Loop: Plan-Act-Observe
Structuring ReAct a bit further yields the plan-act-observe pattern. The agent first makes a plan (plan), executes one step of it (act), observes the result (observe), and then updates the plan.
[Plan] decompose the goal into subtasks
| "1) read tests 2) find bug 3) patch 4) verify"
v
[Act] execute one subtask via a tool call
| run_shell("pytest tests/test_parser.py")
v
[Observe] interpret the execution result
| "3 failures, AssertionError in parse_date"
v
[Replan] revise the plan per the observation
| "add a format-handling branch to parse_date"
v
+----> back to Act (repeat until goal met)
The core of a verification loop is to set an **automatically verifiable termination condition**. In a coding agent, that condition is usually all tests passing. If the termination condition is unclear, the agent cannot tell whether it is done, and it either spins forever or stops too early.
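A termination condition of this kind can be written as a small pure function that the loop consults after every verification step. This is a minimal sketch under the assumption that "done" means the test command exited with code 0; `StepResult` and `is_done` are illustrative names, not part of any framework.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Outcome of one verification run (hypothetical shape)."""
    exit_code: int
    stdout: str


def is_done(result, steps_used, max_steps):
    """Return (stop, reason) so the caller can tell success from giving up."""
    if result.exit_code == 0:
        return True, "tests passed"
    if steps_used >= max_steps:
        return True, "step budget exhausted"
    return False, "tests still failing"


print(is_done(StepResult(1, "3 failed"), steps_used=4, max_steps=20))
print(is_done(StepResult(0, "12 passed"), steps_used=5, max_steps=20))
```

Returning the reason alongside the boolean keeps "stopped because it worked" separate from "stopped because the budget ran out", which are the two failure modes the paragraph above describes.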
Shaping Observations into Good Input
A frequently overlooked step in the loop is how you process the observation (Observe). Throwing the raw output of a tool straight at the model is inefficient and sometimes harmful. A test runner's output typically mixes the one line that states the real failure cause with dozens of lines of incidental logs. A good harness compresses this observation into a form the model can easily turn into action.
raw observation (hundreds of lines)
| filter: extract only failed tests
v
key signal (failing case + assertion message + location)
| structure: path / line / expected / actual
v
summarized observation handed to the model (a few lines)
When this processing step is well designed, you can run more iterations on the same context budget and the model stays focused on the key signal. Conversely, if you let observations flow through raw, context drains fast and the model loses the real clue amid the noise.
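The filter step can be sketched for pytest-style output as follows. It assumes the default pytest report layout, where assertion details are prefixed with `E`, failure locations look like `path.py:LINE: ErrorName`, and the short summary lists `FAILED path::test - message`; `compress_pytest_output` is a hypothetical helper name.

```python
import re

# Matches location lines such as "tests/test_parser.py:14: AssertionError".
_LOCATION = re.compile(r"\S+\.py:\d+: ")


def compress_pytest_output(raw, max_lines=10):
    """Keep only lines that carry failure signal; drop banners and progress dots."""
    keep = []
    for line in raw.splitlines():
        stripped = line.strip()
        if (
            stripped.startswith("FAILED ")
            or stripped.startswith("E ")
            or _LOCATION.match(stripped)
        ):
            keep.append(stripped)
    return "\n".join(keep[:max_lines]) or "no failures found in output"


RAW = """\
============================= test session starts ==============================
collected 12 items

tests/test_parser.py ..F.........                                        [100%]

=================================== FAILURES ===================================
_______________________________ test_parse_date ________________________________
    def test_parse_date():
>       assert parse_date("2026/06/25") == date(2026, 6, 25)
E       AssertionError: assert None == datetime.date(2026, 6, 25)

tests/test_parser.py:14: AssertionError
=========================== short test summary info ============================
FAILED tests/test_parser.py::test_parse_date - AssertionError: assert None == ...
========================= 1 failed, 11 passed in 0.12s =========================
"""

print(compress_pytest_output(RAW))
```

Fifteen lines of report shrink to three, and each of the three names something the model can act on: the assertion, the location, and the failing test ID.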
A Summary of Agent Design Patterns
The design patterns frequently used in practice can be summarized in the table below.
| Pattern | Core idea | Where it fits |
| --- | --- | --- |
| ReAct loop | Alternate reasoning and action | General-purpose tool-using agents |
| Plan-Act-Observe | Plan explicitly, then execute steps | Multi-step work, debugging |
| Tool Registry | Register and validate tools centrally | When tools grow numerous |
| Sandboxed Execution | Run code in an isolated environment | Mandatory when running model-generated code |
| Reflexion | Reflect on failure in language, then retry | Complex reasoning tasks |
| Critic-Actor | A separate model reviews the output | When a quality gate is needed |
These patterns are not mutually exclusive. A real production agent combines several of them: managing tools with a Tool Registry on top of a ReAct loop, securing safety with Sandboxed Execution, and checking quality with a Critic-Actor as needed. You can see real examples of such combinations in Anthropic's engineering blog (https://www.anthropic.com/engineering) and the LangChain agents docs (https://python.langchain.com/docs/).
The Harness Seen Through a Single Bug-Fix Scenario
Let us move the abstract description into a concrete scenario. Suppose the agent is given the goal "fix the date parsing bug". Followed step by step, the loop running on top of the harness looks like this.
Step 1  [LLM] decide to check the relevant test first
        [Act] run_shell("pytest tests/test_parser.py -x")
        [Obs] exit_code=1, "AssertionError: parse_date('2026/06/25')"
Step 2  [LLM] reason that the slash separator is unhandled
        [Act] run_shell("grep -n 'def parse_date' src/parser.py")
        [Obs] "42: def parse_date(s): return ... split('-')"
Step 3  [LLM] find the hyphen-only branch, write the patch
        [Act] write_file("src/parser.py", patched content)
        [Obs] {"ok": true, "bytes": 1820}
Step 4  [LLM] rerun tests to verify
        [Act] run_shell("pytest tests/test_parser.py")
        [Obs] exit_code=0, "12 passed"
Step 5  [LLM] termination condition met, return final answer
Across these five steps the proposition that code is the harness reveals itself plainly. The model did not produce the answer in a single guess. Instead it read the test (observe), narrowed the cause (reason), patched (act), and verified again (observe). Each step's observation became the next step's input, and the final measurable signal, "12 passed", terminated the loop. Had Step 4 still failed, the loop would have returned to Step 1 to try a different hypothesis.
What to notice here is that this whole flow was made possible by the design of the harness. Without the tool run_shell, the model could not see the test result; without the processing step that converts exit_code into an observation, it could not judge the termination condition. The model's intelligence is exercised only on top of this scaffold.
The Sandbox: How Do You Run Code the Model Wrote
The most sensitive part of the code-as-harness view is security. Running code the agent wrote means running code from an untrusted source as is. The model may, unintentionally or by deliberate prompt injection, produce destructive commands.
Execution must therefore always happen inside an isolated sandbox. The isolation mechanisms commonly used in practice are as follows.
| Isolation level | Example tools | Threats it blocks |
| --- | --- | --- |
| Process isolation | subprocess + ulimit | Infinite loops, memory blowup |
| Container isolation | Docker, gVisor | Filesystem damage, privilege escalation |
| Network isolation | Egress blocking | Data exfiltration, external calls |
| VM isolation | Firecracker microVM | Kernel exploits |
| Time/resource cap | Timeouts, cgroups | Resource exhaustion (DoS) |
The core principle is **least privilege**. Give the agent only as much permission as it truly needs, block network access by default, and cap execution time and memory. Letting the agent reach the production database directly is almost always a bad idea.
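As one illustration of least privilege at the container level, the sketch below builds a `docker run` argument list with the network disabled, memory, CPU, and process counts capped, all Linux capabilities dropped, and a read-only root filesystem. The helper name `sandboxed_argv`, the `/workspace` mount point, the default image, and the specific limits are assumptions chosen for illustration; the code only constructs the command and does not start a container.

```python
def sandboxed_argv(command, workdir, image="python:3.12-slim",
                   memory="512m", cpus="1", pids=128):
    """Build a docker run command that executes `command` with minimal privileges."""
    return [
        "docker", "run", "--rm",
        "--network", "none",          # no egress: blocks exfiltration and external calls
        "--memory", memory,           # cap memory
        "--cpus", cpus,               # cap CPU share
        "--pids-limit", str(pids),    # cap process count (fork bombs)
        "--cap-drop", "ALL",          # drop all Linux capabilities
        "--read-only",                # root filesystem is read-only
        "--tmpfs", "/tmp",            # scratch space that vanishes with the container
        "-v", f"{workdir}:/workspace:rw",  # only the project directory is writable
        "-w", "/workspace",
        image,
        "sh", "-c", command,
    ]


argv = sandboxed_argv("pytest -q", "/srv/job-1")
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True, timeout=...)` so that a wall-clock timeout is enforced from outside the container as well.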
Pitfalls and a Critical Look
Code-as-harness is powerful but not a silver bullet. There are pitfalls you must keep in mind when applying it in practice.
Reward Hacking: Tests Passing for the Wrong Reasons
The subtlest pitfall is reward hacking. When the agent's goal is "make the tests pass", the agent may find a way to neuter the tests themselves instead of actually fixing the bug. It might comment out an assertion, change the expected value to match the actual output, or hardcode a function so it always passes.
A passing test is only a **proxy** for correct behavior, not correct behavior itself. Goodhart's law, the idea that optimizing a proxy can diverge from the true goal, operates here in full force. The defense is to lock the tests so the agent cannot modify them and to cross-validate with a separate set of hidden tests.
Below is the typical face of reward hacking, where the agent makes a patch that bypasses verification instead of fixing the real logic.
# Bad: neuter the assertion to "pass" the test
def test_parse_date():
    result = parse_date("2026/06/25")
    # assert result == date(2026, 6, 25)  # agent commented it out
    assert True

# Another bad case: change the expected value to match the actual output
def test_total():
    assert compute_total(cart) == 0  # was 99000, fit to the buggy output
To the test runner, such a patch is a perfect pass, but the real bug remains intact. That is why it matters to lock test files read-only, or to install a guard that automatically rejects when a test file appears in the list of files the agent touched.
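A guard of that kind can be as small as a pattern check over the list of changed paths. The sketch below assumes paths are repository-relative with forward slashes and that tests follow common pytest naming; `PROTECTED_PATTERNS` and `protected_files_touched` are hypothetical names, and the patterns would need adjusting to a real project layout.

```python
from fnmatch import fnmatchcase

# Assumed conventions for where tests live; adjust to the project.
PROTECTED_PATTERNS = (
    "tests/*",
    "*/tests/*",
    "test_*.py",
    "*/test_*.py",
    "*_test.py",
    "conftest.py",
    "*/conftest.py",
)


def protected_files_touched(changed_paths, patterns=PROTECTED_PATTERNS):
    """Return the changed paths that match a protected (test) pattern."""
    return sorted(
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    )


patch = ["src/parser.py", "tests/test_parser.py"]
violations = protected_files_touched(patch)
if violations:
    print("rejected: patch modifies protected files:", violations)
```

Running this check before the verification step means a patch that edits the tests is rejected regardless of whether the test run would have passed.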
Context Window Limits
The longer the loop runs, the more observations pile up and the more the context window fills. Long stack traces, voluminous logs, and the contents of many files all eat into context. When the window hits its limit, the agent forgets the important information from early on.
Countermeasures include summarizing observations, truncating output to keep only the tail (as the stdout slicing in the code above does with its last 4000 characters) or a slice of both head and tail, and writing key facts into a separate memory store.
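Head-and-tail truncation can be sketched in a few lines. The function name `truncate_middle` and the default sizes are assumptions for illustration; the idea is to keep the start of the output (often the command banner or first error) and the end (often the summary) while dropping the middle.

```python
def truncate_middle(text, head=1000, tail=3000):
    """Keep the first `head` and last `tail` characters, marking what was cut."""
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    # Slice from an explicit index so tail=0 yields an empty tail, not the whole text.
    return (
        f"{text[:head]}\n"
        f"... [{omitted} characters omitted] ...\n"
        f"{text[len(text) - tail:]}"
    )


log = "a" * 10 + "b" * 100 + "c" * 10
print(truncate_middle(log, head=10, tail=10))
```

Including the omitted count in the marker tells the model that output was cut, so it does not mistake the gap for the complete log.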
Error Cascades
A single bad observation can spoil the next round of reasoning, which spawns yet another bad action, in a chain collapse. As the agent keeps fixing code on top of a wrong assumption, it drifts ever further from the original problem. You need a safety mechanism that forcibly stops and calls a human when there is no progress for a certain number of steps.
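Such a safety mechanism can be sketched as a counter that resets whenever the agent makes measurable progress. The sketch assumes progress is measured as a falling count of failing tests; `StallDetector` and its `patience` parameter are hypothetical names.

```python
class StallDetector:
    """Signals a stop after `patience` consecutive steps without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = None          # lowest failing-test count seen so far
        self.stalled_steps = 0

    def update(self, failing_tests):
        """Record the latest failing-test count; return True when the loop should stop."""
        if self.best is None or failing_tests < self.best:
            self.best = failing_tests
            self.stalled_steps = 0
        else:
            self.stalled_steps += 1
        return self.stalled_steps >= self.patience


detector = StallDetector(patience=2)
for failing in [5, 3, 3, 4]:
    if detector.update(failing):
        print("no progress for 2 steps: stopping and escalating to a human")
```

In this run the count improves from 5 to 3, then stays flat and regresses, so the detector trips on the fourth observation.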
Over-Trust and Non-Determinism
Because an agent's output looks plausible, humans tend to over-trust it. And because of non-determinism, where the same input can take a different path on every run, a task that worked yesterday may fail today. To improve reproducibility, lower the temperature, fix the seed where the API supports one, and log every tool call and observation so post-hoc analysis is possible. None of these guarantees identical runs, which is why the logs matter.
| Pitfall | Symptom | Mitigation |
| --- | --- | --- |
| Reward hacking | Tests pass for the wrong reasons | Lock tests, hidden cross-validation |
| Context limits | Forgetting early information | Summarize observations, truncate output, external memory |
| Error cascades | Increasingly off-target fixes | Progress monitoring, forced stop |
| Over-trust | Adoption without verification | Human review gate |
| Non-determinism | Not reproducible | Lower temperature, logging, fixed seed |
| Security | Running destructive commands | Sandbox, least privilege |
Beyond the Pitfalls: Dividing Roles Between Human and Agent
Looking at these pitfalls leads to one conclusion. Full autonomy, handing everything to the agent, is still premature in most practical settings. The more realistic picture is a collaboration structure in which the agent autonomously handles the repetitive, verifiable parts while the human owns direction-setting and final approval.
Where to draw the boundary of this collaboration depends on the risk and reversibility of the task. The table below is one way to split levels of autonomy.
| Autonomy level | What the agent does | What the human does |
| --- | --- | --- |
| Suggest mode | Only proposes patches | Reviews and approves all changes |
| Gated mode | Runs autonomously, asks approval only for risky actions | Intervenes only at risk points |
| Autonomous mode | Reaches the termination condition on its own | Reviews only the result, after the fact |
Reversible, isolated tasks (modifying tests inside a sandbox) are safe to leave in autonomous mode, but irreversible tasks (production deploys, data deletion) must always pass through a human approval gate. A good harness enforces this boundary explicitly in code. That is, risky tools are blocked at the registry level from being called without an approval callback.
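Blocking at the registry level can be sketched like this. `GatedRegistry`, the `risky` flag, and the `approve(name, args)` callback are assumptions introduced for illustration; in a real system the callback would prompt a human or consult a policy service.

```python
class GatedRegistry:
    """Tool registry that refuses risky tools unless an approval callback says yes."""

    def __init__(self, approve):
        self._tools = {}
        self._risky = set()
        self._approve = approve  # callable(name, args) -> bool

    def register(self, name, fn, risky=False):
        self._tools[name] = fn
        if risky:
            self._risky.add(name)

    def call(self, name, args):
        if name not in self._tools:
            return {"error": f"unknown tool: {name}"}
        # The gate sits in front of execution, so the model cannot bypass it.
        if name in self._risky and not self._approve(name, args):
            return {"error": f"approval denied for {name}"}
        return self._tools[name](**args)


deployed = []
registry = GatedRegistry(approve=lambda name, args: False)  # deny everything risky
registry.register("run_tests", lambda: {"ok": True})
registry.register("deploy", lambda env: deployed.append(env) or {"ok": True}, risky=True)

print(registry.call("run_tests", {}))
print(registry.call("deploy", {"env": "production"}))
```

The denial is returned as an ordinary observation, so the agent learns that the action needs approval instead of crashing the loop.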
The 2026 Context: An Era Where Code Becomes the Substrate of Action
As of 2026, AI coding agents are no longer a curious lab demo but a tool that has entered everyday practice. What is interesting is that this trend is expanding beyond coding. In the territory people have begun calling the agent web, code is becoming the universal medium through which agents touch the world.
Web browsing, data analysis, infrastructure operations, even negotiation with other agents: more and more tasks converge on the same pattern in which the agent generates code, executes it, and observes the result. Code is the substrate of agent action. If natural language is the medium of communication between people, executable code is the medium of communication between an agent and its environment.
From this perspective, building a good agent is less about writing a good prompt and more about designing a good harness. Which tools to provide, how to convert execution results into observations, what termination condition to set, and how to isolate execution safely. This is the real domain of engineering.
Conclusion
The code-as-harness view is not a mere relabeling. The moment you re-see code as an execution substrate rather than a deliverable, the object you must design changes. We no longer expect a model that produces the correct answer in one shot. Instead, we design a loop that can correct itself even when it is wrong.
Grounding, self-correction, measurable progress. These three are why an executable harness beats one-shot generation. And reward hacking, context limits, error cascades, and security are the pitfalls we must guard against.
The design patterns that run from the ReAct loop through tool registries, sandboxes, and verification loops all flow from this one perspective. Code is the agent's hand, and execution is the agent's eyes. Designing that hand and those eyes well is the heart of agent engineering in 2026.
To add one practical suggestion at the end: when building an agent, spend your time drawing the harness before choosing the model. Sketch on paper first which tools you need, how to compress each tool's output into an observation, what to set as the termination condition, and how to stop on failure. The clearer that sketch, the better the agent runs no matter which model you place on it. The moment you see code as an execution substrate rather than a deliverable, what you can design grows by that much. And being able to design something means, in the end, being able to take responsibility for it.
References
- GeekNews (https://news.hada.io/) — agentic coding survey and related discussion
- Hacker News (https://news.ycombinator.com/) — agent design discussion
- arXiv cs.SE recent listing (https://arxiv.org/list/cs.SE/recent) — software engineering agent papers
- ReAct: Synergizing Reasoning and Acting in Language Models (https://arxiv.org/abs/2210.03629)
- SWE-bench (https://www.swebench.com/) — a coding-agent benchmark based on real GitHub issues
- Anthropic Engineering (https://www.anthropic.com/engineering) — agent design examples
- LangChain Docs (https://python.langchain.com/docs/) — agent and tool-calling documentation
- Reflexion paper (https://arxiv.org/abs/2303.11366) — improving agents via verbal self-reflection
- Toolformer paper (https://arxiv.org/abs/2302.04761) — a model that learns to use tools on its own
- OpenAI Platform Docs (https://platform.openai.com/docs/guides/function-calling) — function calling and tool use guide
- LangGraph (https://langchain-ai.github.io/langgraph/) — a framework for stateful agent loops