Anatomy of an AI Harness — The Scaffolding That Turns a Model Into an Agent (Loop, Tools, Context, and Building Your Own)

Prologue — Same Model, Different Harness, Different Result

A scene you see all the time in 2026: two teams use the same frontier model. One team's agent picks up a ticket and opens a PR; the other team's agent has been spinning on the same file for 30 minutes.

The difference isn't the model. It's the harness.

The model is the engine. The harness is the car. Drop the same engine into a racing chassis versus a broken-down truck, and the driving experience is completely different.

AI engineering in 2023–2024 was about "which model do you use." AI engineering in 2026 is mostly about "how do you wrap that model." Frontier models have converged to a handful of comparable options, but harnesses have exploded in variety — Claude Code, Cursor, Codex CLI, Aider, OpenClaw… they all call the same model APIs. What's different is the harness.

This article dissects the harness: the agent loop, the tool execution layer, context management, the System Prompt, control flow, and failure modes. Then we'll build a ~40-line minimal harness ourselves and look at how to evaluate one.


Chapter 1 · What Is a Harness

Harness = all the code that wraps an LLM and turns it into an "agent."

The LLM itself is a stateless function — text goes in, text comes out. That's it. It can't read files, can't run commands, can't remember. The harness does all of that.

        ┌──────────────────────────────────────┐
        │              Harness                 │
        │  ┌────────────┐   ┌────────────────┐ │
        │  │ System     │   │ Context mgmt   │ │
        │  │ Prompt     │   │(window/compact)│ │
        │  └────────────┘   └────────────────┘ │
        │  ┌────────────┐   ┌────────────────┐ │
        │  │ Agent      │──▶│  Model (LLM)   │ │
        │  │ loop       │◀──│  = engine      │ │
        │  └─────┬──────┘   └────────────────┘ │
        │        ▼                             │
        │  ┌────────────┐   ┌────────────────┐ │
        │  │ Tool exec  │   │ Control flow / │ │
        │  │ layer      │   │ hooks / perms  │ │
        │  └────────────┘   └────────────────┘ │
        └──────────────────────────────────────┘

What the harness owns:

  • The loop — it calls the model repeatedly, not just once.
  • Tools — it lets the model "act."
  • Context — it decides what the model gets to see.
  • System Prompt — the model's identity and rules.
  • Control flow — when to stop, when to call a human, when to parallelize.
  • Safeguards — it prevents infinite loops, runaway cost, and malformed tool calls.

Change the model and the quality of reasoning changes. Change the harness and the agent's behavior itself changes. That's why you have to think about the two separately.


Chapter 2 · The Agent Loop — The Heart of the Harness

The core of a harness is astonishingly simple. It's a while loop.

1. Call the model (pass it the current messages)
2. Look at the model's response:
     - Is it a final answer?  → end the loop
     - Is it a tool call?     → go to 3
3. Execute the tool
4. Append the tool result to the messages
5. Go back to 1

This is the essence of the ReAct pattern — it repeats think → act → observe. The model "thinks" (reasons), "acts" (calls a tool), and the harness feeds it the "observation" (injects the result).
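
In Python-shaped pseudocode, the whole thing looks like this (a sketch of the pattern only: call_model, execute_tool, and tool_result are hypothetical stand-ins for provider-specific calls; Chapter 9 builds a real version):

def agent_loop(task, max_steps=40):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                    # step cap: the loop must stop
        response = call_model(messages)           # 1. call the model
        messages.append(response)
        if not response.tool_calls:               # 2. text only -> final answer
            return response.text
        for call in response.tool_calls:          # 3. execute the tool
            result = execute_tool(call)
            messages.append(tool_result(call, result))  # 4. inject the observation
    return "step cap reached"                     # 5. otherwise, back to the top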

Termination Conditions — The Loop Must Stop

The most important part of the while is the exit. If the termination condition is weak, the agent runs forever.

Termination condition  Meaning
Final answer           The model returns text only, with no tool call
Step cap               Force-stop after N iterations (e.g. 40)
Token budget           Stop when cumulative tokens exceed a limit
Explicit signal        The model calls a termination tool like done
Error                  An unrecoverable failure
Human intervention     The user aborts

A step cap and a token budget are not optional. A harness without these two is a cost bomb.
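
The two mandatory guards fit in a few lines. A minimal sketch (the class name and limits are illustrative):

class Budget:
    def __init__(self, max_steps=40, max_tokens=500_000):
        self.max_steps, self.max_tokens = max_steps, max_tokens
        self.steps, self.tokens = 0, 0

    def charge(self, used_tokens):
        # Call once per iteration, with the usage the API reported.
        self.steps += 1
        self.tokens += used_tokens

    def exhausted(self):
        return self.steps >= self.max_steps or self.tokens >= self.max_tokens

The loop charges the budget after every model call and stops cleanly the moment exhausted() returns True, instead of letting tokens quietly pile up.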


Chapter 3 · The Tool Execution Layer

The model can't "act." What the model does is produce structured output that says "I want to call this tool with these arguments." The actual execution is done by the harness.

The Lifecycle of a Tool Call

1. The harness tells the model the "tool list" (name, description, argument schema)
2. The model emits a tool-call block (tool name + arguments)
3. The harness parses that block
4. The harness validates the arguments (schema violation → return an error to the model)
5. The harness executes the tool (often inside a sandbox)
6. The harness captures the result (stdout, stderr, return value, error)
7. The harness wraps the result in a "tool result" message and appends it to the conversation
8. Back to the loop — call the model again

The key point: the model only knows a tool exists through its "description." If the tool description is poor, the model uses the tool wrong. Tool definitions are effectively part of the prompt.

What the Harness Is Responsible For in Tool Execution

  • Validation — do the arguments the model sent match the schema? If not, don't execute — return the error so the model can self-correct.
  • Sandboxing — tool execution happens in an isolated environment, separated from the host and production.
  • Error capture — even if a tool dies, the harness doesn't. It catches the error and feeds it to the model as an "observation."
  • Timeouts — every tool call gets a timeout. Don't let a stuck tool stop the loop.
  • Result formatting — drop a huge output (100,000 lines of logs) in as-is and the context blows up. Truncate or summarize it.
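
A sketch of these responsibilities wired together (the jsonschema library does the validation; fn is a hypothetical tool callable that enforces its own timeout, like the subprocess call in Chapter 9):

from jsonschema import ValidationError, validate   # pip install jsonschema

MAX_OUTPUT = 4000   # characters of tool output allowed into the context

def run_tool(fn, args, schema):
    # Validation: a schema violation is returned, not executed,
    # so the model can self-correct.
    try:
        validate(instance=args, schema=schema)
    except ValidationError as e:
        return f"Invalid arguments: {e.message}"
    # Error capture: the tool may die, the harness may not.
    try:
        output = fn(args)
    except Exception as e:
        output = f"Tool failed: {e}"
    # Result formatting: truncate so one huge output can't blow up the context.
    return str(output)[:MAX_OUTPUT]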

MCP — The Standard for the Tool Layer

In 2026 you don't reimplement tools from scratch in every harness. An MCP (Model Context Protocol) server exposes a "bundle of tools" through a standard interface, and any harness connects to it. A harness's tool layer is increasingly becoming an "MCP client."
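
A sketch of what that client side looks like with the official MCP Python SDK (the mcp package); the server command here is a made-up placeholder, and any MCP server exposes the same calls:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def discover_tools():
    server = StdioServerParameters(command="my-mcp-server")   # hypothetical server
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            listing = await session.list_tools()
            for tool in listing.tools:                   # name, description, schema:
                print(tool.name, "-", tool.description)  # the same shape everywhere
            # At runtime the harness forwards the model's tool calls with
            # session.call_tool(name, arguments)

asyncio.run(discover_tools())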


Chapter 4 · Context Management — The Hardest Part

A model's context window is finite. And the agent loop stacks messages on every step — tool calls, tool results, reasoning. A 100-step task overwhelms the context.

Deciding what to put in and what to leave out is the harness's hardest responsibility.

Context Rot — Context Decays

As context grows longer, model performance doesn't simply "hold" — it degrades. When stale tool results, abandoned plans, and irrelevant exploration trails pile up, the model struggles to separate signal from noise. This is called context rot.

Not everything that "fits" in the window is equal: models use the front and the back of the window better than the middle (the "lost in the middle" effect). So the harness has to design what goes where, not just "put everything in."

Techniques for Handling Context

Technique            Description
Pruning              Cut stale, irrelevant messages
Compaction           Compress old conversation into a summary
Retrieval            Fetch relevant chunks from outside only when needed (RAG)
Sub-agent isolation  Run a subtask in a separate context so it doesn't pollute the main one
Structuring          Return tool results in structured, summarized form rather than raw text
External memory      Keep state in files/DB, keep only pointers in the context

Key insight: the context window is not RAM — it's a workbench. It's not a place to pile things up infinitely; it's where you lay out only what the current task needs. The harness's job is to keep that workbench clean.
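
As one concrete example, compaction can be very small. A sketch reusing the Anthropic client from Chapter 9 (the threshold and prompt are illustrative; a real harness also keeps tool_use/tool_result pairs intact when it cuts):

def compact(client, messages, keep_recent=10):
    # Replace old messages with a model-written summary; keep the tail verbatim.
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(str(m) for m in old)
    summary = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        messages=[{"role": "user", "content":
                   "Summarize this agent transcript: decisions made, current "
                   "state, open tasks. Be terse.\n\n" + transcript}],
    )
    return [{"role": "user", "content":
             "[Summary of earlier work]\n" + summary.content[0].text}] + recent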


Chapter 5 · The System Prompt — The Harness's Constitution

The System Prompt is the constitution the harness hands to the model. It's the unchanging foundation the model sees on every loop.

What goes in it:

  • Identity — "what you are, and what you exist for."
  • Rules and constraints — what to never do, what to always do.
  • Tool usage guidance — when and how to use tools.
  • Output format — the structure of the response.
  • Safety boundaries — requests to refuse, situations where a human must be called.

On top of this come context files (CLAUDE.md, AGENTS.md, and the like) — project-specific rules, architecture, conventions. If the System Prompt is "the harness's constitution," the context files are "this project's local law."

The key point: same model plus same tools, but a different System Prompt means a completely different agent. The System Prompt is where the harness shapes the model's "personality."
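
A sketch of how a harness might assemble the two layers (the prompt text and file handling are illustrative):

from pathlib import Path

# The constitution: identity, rules, tool guidance, safety boundaries.
BASE_PROMPT = (
    "You are a coding agent.\n"                      # identity
    "Never modify files outside the repository.\n"   # rules and constraints
    "Prefer reading files over guessing.\n"          # tool usage guidance
    "If an action is irreversible, stop and ask.\n"  # safety boundary
)

def build_system_prompt(repo_root="."):
    # Layer the project's local law on top of the constitution.
    prompt = BASE_PROMPT
    for name in ("CLAUDE.md", "AGENTS.md"):
        path = Path(repo_root) / name
        if path.exists():
            prompt += f"\n## Project rules ({name})\n{path.read_text()}"
    return prompt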


Chapter 6 · Control Flow — Orchestration on Top of the Loop

On top of the basic loop (Chapter 2), a mature harness layers control flow.

Sub-agent — Context Isolation

A separate agent that runs a subtask in a fresh context. Hand an exploration task like "investigate this directory structure" to a sub-agent, and dozens of file reads don't pollute the main context. The sub-agent returns only a summary of the result. It's the key tool for defending against context rot.
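
A sketch of the idea, assuming an agent_loop like the one we build in Chapter 9 that returns final text (the prompt suffix is illustrative):

def run_subagent(task):
    # Fresh message list: the sub-agent's dozens of file reads live and die
    # in its own context, never in the main agent's.
    return agent_loop(task + "\n\nReply with a concise summary of findings only.")

# In the main loop, a whole exploration becomes one small observation:
# summary = run_subagent("Investigate the directory structure of src/")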

Hook — Deterministic Interception

Deterministic code that runs at specific points in the agent loop (before tool execution, after a response, and so on). Not the model's judgment — fixed rules. For example: "run the linter before any tool executes," "force tests before a commit."
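
A minimal sketch of one hook point (the names and the example rule are illustrative; real harnesses expose several such points):

PRE_TOOL_HOOKS = []

def pre_tool_hook(fn):
    PRE_TOOL_HOOKS.append(fn)   # register deterministic rules, not model judgment
    return fn

@pre_tool_hook
def block_force_push(tool_name, args):
    if tool_name == "run_bash" and "push --force" in args.get("command", ""):
        raise PermissionError("force-push is blocked by policy")

# The loop runs every registered hook before execute_tool(); an exception
# vetoes the call, and the message goes back to the model as an observation.
try:
    for hook in PRE_TOOL_HOOKS:
        hook("run_bash", {"command": "git push --force origin main"})
except PermissionError as veto:
    print("Vetoed:", veto)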

Permission Gate — Stopping Before You Act

High-risk tools (deleting files, deploying, external transmission) go through a policy check or human approval before execution. The harness enforces a policy like "read tools are automatic, write tools are gated."
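
"Read tools are automatic, write tools are gated" is a few lines of policy (a sketch; the tool names are illustrative):

POLICY = {"read_file": "allow", "run_bash": "ask",
          "delete_file": "ask", "deploy": "deny"}

def gate(tool_name, args):
    decision = POLICY.get(tool_name, "ask")   # unknown tools default to a human
    if decision == "deny":
        return False
    if decision == "ask":
        return input(f"Allow {tool_name}({args})? [y/N] ").strip().lower() == "y"
    return True                               # "allow": no interruption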

Parallelization

Tool calls with no dependency on each other run at the same time. There's no reason to do three independent file reads serially. But calls that touch state run serially.
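
A sketch using a thread pool around the execute_tool we build in Chapter 9 (fine for I/O-bound tools; state-touching calls bypass this path and run serially):

from concurrent.futures import ThreadPoolExecutor

def execute_parallel(tool_uses):
    # Three independent reads finish in the time of the slowest one.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(execute_tool, tu.name, tu.input) for tu in tool_uses]
        return [f.result() for f in futures]   # results in the original call order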

Human-in-the-loop

In front of an irreversible decision, the harness stops and asks a human. The dial between autonomy and safety is held by the harness.


Chapter 7 · Failure Modes and Recovery

The difference between a mature harness and a demo harness comes down to how it handles failure.

Failure mode       Symptom                               The harness's defense
Infinite loop      Repeats the same action, no progress  Step cap, loop detection (spotting repeated actions)
Tool error         A tool dies, a 400 response           Catch the error, pass it to the model as an "observation", retry cap
Hallucinated tool  Calls a tool name that doesn't exist  Check against the registry, return "no such tool" to the model
Context rot        Quality degrades in a long context    Compaction, sub-agent isolation, pruning
Runaway cost       Tokens quietly pile up                Token budget, cap on self-correction attempts
False confidence   Confidently wrong answers             Put verification tools (tests, type-checks) in the loop
Scope creep        Does more than it was asked           System Prompt constraints, permission gates

The core principle: agents fail — the harness has to absorb that failure, recover if recovery is possible, and otherwise stop cleanly. The harness's job isn't to prevent failure; it's to handle it.
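
One of those defenses, loop detection, is short enough to show in full (a sketch; the window and threshold are illustrative):

from collections import deque

class LoopDetector:
    def __init__(self, window=8, threshold=3):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, tool_name, args):
        # The same (tool, args) pair repeated inside a short window means
        # the agent is spinning, not progressing.
        action = (tool_name, repr(sorted(args.items())))
        self.recent.append(action)
        return self.recent.count(action) >= self.threshold

The loop calls record() on every tool call; a True return means stop cleanly (or escalate to a human) rather than burn more tokens.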


Chapter 8 · Real Harnesses — What's Different

They call the same model APIs, yet the experience differs. Because the harness differs.

Harness           Surface        Characteristics
Claude Code       CLI            Sub-agents, hooks, skills, MCP, permission modes, CLAUDE.md
Cursor            IDE            The editor is the harness surface, background agents, diff-centric
Codex CLI         CLI            Sandboxed execution, AGENTS.md
Aider             CLI            git-native, repo map, commit-centric
OpenClaw          Messaging app  Local gateway, Skills/ClawHub, multi-channel
Claude Agent SDK  Library        A tool for building the harness itself

Almost all of these differences are harness design decisions:

  • Surface (CLI vs IDE vs messaging)
  • Context file convention (CLAUDE.md vs AGENTS.md)
  • Tool execution model (the strength of the sandbox)
  • Control flow (whether sub-agents and hooks are supported)
  • Permission model (how autonomous it is)

The Claude Agent SDK is especially interesting — it isn't a finished harness, it's a library for building a harness. It provides the loop, tools, and context management, and you build your own harness on top of it.


Chapter 9 · Build Your Own — A 40-Line Minimal Harness

The essence of a harness is simple. Building a minimal one yourself makes it clear.

import subprocess

import anthropic

client = anthropic.Anthropic()

# Tool definition — the model only knows the tool through this "description"
TOOLS = [{
    "name": "run_bash",
    "description": "Run a bash command and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

def execute_tool(name, args):
    # Actual execution — in the real world, inside a sandbox
    if name != "run_bash":
        return f"Unknown tool: {name}"          # defense against hallucinated tools
    try:
        r = subprocess.run(args["command"], shell=True,
                           capture_output=True, text=True, timeout=30)
        output = r.stdout + r.stderr
    except subprocess.TimeoutExpired:
        output = "Command timed out after 30s"  # a stuck tool must not kill the loop
    except Exception as e:
        output = f"Tool failed: {e}"            # the harness survives the tool's death
    return output[:4000]                        # truncate the result to protect context

def agent_loop(task, max_steps=20):           # step cap = guaranteed termination
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        resp = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            system="You are a coding agent. Use tools to complete the task.",
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})

        # No tool call → final answer → end the loop
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return resp.content

        # Execute the tools and inject the results into the messages
        results = []
        for tu in tool_uses:
            output = execute_tool(tu.name, tu.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": output,
            })
        messages.append({"role": "user", "content": results})

    return "Step limit reached"               # fail-closed

That's all of it: a while loop + tool execution + message injection. Claude Code and Cursor are built on this same skeleton, with context management, sub-agents, hooks, permissions, better tools, and a better System Prompt layered on top.

What's missing from those 40 lines is exactly the job of a "production harness": context compaction (Chapter 4), permission gates (Chapter 6), failure recovery (Chapter 7), streaming UX, observability, cost tracking.


Chapter 10 · Evaluating a Harness — It's Different From Model Eval

"Is this model good?" is a model eval. "Is this harness good?" is a different question.

The core of harness eval: hold the model fixed and change only the harness to measure.

Model eval:   harness fixed, model A vs model B
Harness eval: model fixed, harness A vs harness B

Metrics to measure:

  • Task completion rate — the success rate when you run the same task set through harness A vs B.
  • Step count — how many loops to completion? Fewer is more efficient.
  • Cost per task — tokens and time. If the harness manages context well, it gets cheaper.
  • Error recovery rate — the rate at which the harness recovers when a tool fails.
  • Drift rate — the rate of going out of scope or falling into an infinite loop.
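
A sketch of the shape of such an eval (every name here is illustrative: harness(task) runs one task and returns an answer, a step count, and a cost; check verifies the answer):

def eval_harness(harness, tasks, check):
    completed, total_steps, total_cost = 0, 0, 0.0
    for task in tasks:
        answer, steps, cost = harness(task)
        completed += check(task, answer)   # model held fixed across harnesses
        total_steps += steps
        total_cost += cost
    n = len(tasks)
    return {"completion_rate": completed / n,
            "avg_steps": total_steps / n,
            "avg_cost": total_cost / n}

# Same tasks, same model, two harnesses:
# eval_harness(harness_a, tasks, check) vs eval_harness(harness_b, tasks, check)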

A frontier model wrapped in a bad harness loses, on real-world tasks, to a weaker model wrapped in a good harness. That's why harness eval is as important as model eval — and yet most people don't measure it.


Chapter 11 · Why the Harness Matters as Much as the Model

Let's sum it up. Most of what we call "AI engineering" in 2026 is harness engineering.

  • Model providers have converged to a few — the model has effectively become close to a commodity.
  • Differentiation comes from how you wrap that model — the loop, tools, context, control flow.
  • Same model, different harness → one opens a PR and the other spins.
  • Frontier model + bad harness < weaker model + good harness (on real-world tasks).

This goes back to the analogy from Chapter 1. The engine (the model) is getting more and more similar. The race is decided by the car (the harness).

And the good news: the harness is engineerable. You can't change the model. But the loop, the termination conditions, the tool descriptions, context management, permission gates, failure recovery — these are all code you design. The harness is the surface an AI engineer can actually control.


Epilogue — Be Conscious of the Harness

The one-sentence summary of this article: be as conscious of the harness as you are when choosing a model.

When you use a tool, look at "how does this harness run the loop, how does it manage context, how does it handle failure." When you build a harness — on top of the 40-line skeleton — you layer on context management, permissions, and recovery. When you evaluate a harness, you hold the model fixed and measure only the harness.

The AI engineering of the next decade isn't about waiting for a bigger model. It's about building a better harness.

12-Item Checklist

  1. Does the harness's agent loop have a clear termination condition (step cap, token budget)?
  2. Do you validate tool arguments before execution?
  3. Does tool execution run inside a sandbox?
  4. Does every tool call have a timeout?
  5. Do you truncate or summarize huge tool outputs?
  6. Do you have a context compaction/pruning strategy?
  7. Do you isolate exploratory subtasks into sub-agents?
  8. Do high-risk tools have a permission gate?
  9. Is there infinite-loop detection (spotting repeated actions)?
  10. Do you filter hallucinated tool names against a registry?
  11. Does the self-correction loop have a cap on attempts?
  12. Do you evaluate the harness separately (with the model fixed)?

10 Anti-Patterns

  1. A loop with no termination condition — a cost bomb.
  2. Executing tool arguments without validation.
  3. Running tool execution directly on the host (no sandbox).
  4. Injecting a huge output into the context as-is.
  5. A "put everything in" context strategy — context rot.
  6. Doing exploration work in the main context — pollution.
  7. No permission gate on high-risk tools.
  8. The harness dying along with a tool error.
  9. Evaluating only the model, never the harness.
  10. Treating the harness as just a "model-call wrapper" — and stopping there.

Next Article Preview

Candidates for the next article: Sub-agent Orchestration — Designing a Multi-Agent Harness, Context Compaction Deep Dive — What to Throw Away and What to Keep, Building a Harness Eval Suite — Holding the Model Fixed and Measuring the Harness.

"The model is the engine. But the race is decided by the car. The center of gravity in 2026 AI engineering isn't the model — it's the harness."

— Anatomy of an AI Harness, end.