Production Agent Design Patterns 2026 — How to Actually Combine Supervisor, CodeAct, Plan-Execute, Self-RAG, Handoff, and Subagent Isolation

Prologue — Intro Patterns vs Production Patterns

The "LLM agent design pattern" posts that flooded 2023–2024 mostly catalog the same five boxes: ReAct, Chain-of-Thought, Reflection, Plan-and-Solve, Tree of Thoughts, with "Multi-Agent" as the closing chapter. Demos run fine.

The problem is that most of those patterns were validated on toy benchmark environments: HotpotQA two-hop questions, GSM8K arithmetic, ALFWorld text games. What worked there breaks differently in production — modifying a 5,000-line codebase, pulling internal data from 30 systems, calling a payments API, and surviving an OOM 30 minutes into a task.

The one-line truth of production agents in 2026: almost no real system runs on a single pattern. You ship Supervisor + CodeAct + Subagent isolation + Hooks together, or Plan-Execute + Self-RAG + Reflexion + Handoff together. Responsibility is split, and each pattern does a different job at its station.

This post inventories nine patterns that survived 2024–2026 production workloads. For each: (1) what it is, (2) when it shines, (3) how it breaks, (4) a 30–50 line minimal sketch. And at the end — how real teams actually compose them.


1. Supervisor / Orchestrator

One-line definition: one upper-level (supervisor) agent receives a task, dispatches it to specialist agents, collects results, and decides the next step.

This is the topology LangGraph's official "supervisor" guide recommends, and the most common multi-agent shape you'll meet. Unlike a free-form swarm, a single supervisor decides who runs next, which keeps the whole thing controllable.

Topology

                    ┌────────────┐
   user task ────►  │ Supervisor │
                    └─────┬──────┘
                          │ route by intent / state
              ┌───────────┼───────────┐
              ▼           ▼           ▼
       ┌──────────┐ ┌──────────┐ ┌──────────┐
       │ Research │ │ Coder    │ │ Reviewer │
       │ agent    │ │ agent    │ │ agent    │
       └────┬─────┘ └────┬─────┘ └────┬─────┘
            └──────┬─────┴──────┬─────┘
                   ▼            ▼
              return to Supervisor (loop)

When it shines

  • The stages of the workflow are heterogeneous — research is long and expensive, coding uses different tools, review is short and deterministic. Cram them into one agent and your system prompt becomes a quilt.
  • You need state-aware routing. The supervisor sees the whole state every step and decides "who works next."
  • You want a cost-asymmetric model mix: Opus for the supervisor, Haiku for specialists.

How it breaks

  • The supervisor becomes a context black hole. Every specialist's output accumulates in the supervisor's context; by round 5 the window is full. Fix: force-summarize specialist outputs before returning, or treat them as tool returns and drop the raw chain.
  • Infinite routing loops. Supervisor swings "Research → Coder → Research → Coder" six times and lands in the same place. Fix: a deterministic hop budget (think LangGraph's recursion_limit).
  • Specialist output shape mismatch. The coder returns a raw diff; the supervisor expected a PR description. Fix: pin each specialist's return type to a schema, as in the sketch below.
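
A minimal version of that schema pin, assuming nothing framework-specific: a plain dataclass plus a validation boundary. CoderResult and its fields are illustrative, not from any particular system:

# specialist_schema.py — pin a specialist's return shape (hypothetical sketch)
from dataclasses import dataclass

@dataclass
class CoderResult:
    pr_title: str
    pr_description: str
    diff_summary: str  # summarized output only, never the raw diff

def validate_coder_output(raw: dict) -> CoderResult:
    try:
        return CoderResult(**raw)
    except TypeError as e:
        # a shape mismatch surfaces here, not three hops later inside the supervisor
        raise ValueError(f"coder returned wrong shape: {e}") from e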

Minimal sketch

# supervisor.py — minimal LangGraph-style supervisor
from typing import Literal, TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    task: str
    messages: list
    next: Literal["research", "code", "review", "done"]
    hop: int

def supervisor(state: State) -> dict:
    if state["hop"] >= 6:
        return {"next": "done"}
    prompt = f"Task: {state['task']}\nHistory: {state['messages'][-3:]}\nWho works next? research/code/review/done"
    decision = llm(prompt)  # forced to return one of the enum values
    return {"next": decision, "hop": state["hop"] + 1}

def specialist(name):
    def run(state: State) -> dict:
        result = run_agent(name, state["task"], state["messages"])
        # critical: don't stash the raw output, only a summary
        return {"messages": state["messages"] + [{"agent": name, "summary": summarize(result)}]}
    return run

g = StateGraph(State)
g.add_node("supervisor", supervisor)
g.add_node("research", specialist("research"))
g.add_node("code", specialist("code"))
g.add_node("review", specialist("review"))
g.set_entry_point("supervisor")
g.add_conditional_edges("supervisor", lambda s: s["next"],
    {"research": "research", "code": "code", "review": "review", "done": END})
for s in ["research", "code", "review"]:
    g.add_edge(s, "supervisor")
graph = g.compile()

The point: the supervisor node is a decision function — it does no work itself, only picks who works next.


2. CodeAct — Python as Action Language

One-line definition: make the agent's "Action" output an executable Python block instead of a tool_name(arg) JSON object. One step can assign variables, branch, loop, and call multiple tools at once.

The starting point is Wang et al., "Executable Code Actions Elicit Better LLM Agents" (CodeAct, ICML 2024). The Manus team publicly disclosed in 2025 that they use it as the action layer for their agent, and Hugging Face's smolagents.CodeAgent is built around the same philosophy.

Comparison

[Classical ReAct + JSON tool calls]
Action: {"tool": "search", "args": {"q": "kubernetes pod oom"}}
Observation: [50 results]
Action: {"tool": "search", "args": {"q": "kubernetes pod oom limits"}}
Observation: [30 results]
Action: {"tool": "summarize", "args": {"text": "..."}}
→ 3 LLM calls, 3 round-trips

[CodeAct]
Action:
results_a = search("kubernetes pod oom")
results_b = search("kubernetes pod oom limits")
merged = list(set(results_a + results_b))[:20]
print(summarize("\n".join(r.snippet for r in merged)))
Observation: [one summary]
→ 1 LLM call, 1 round-trip


When it shines

  • Tasks that compose multiple tools. Fetch → transform → aggregate → chart fits into one block.
  • Tasks with natural repetition. 100 files to process: JSON tool calls would mean 100 round-trips; CodeAct uses a single for f in files: loop.
  • Tasks with arithmetic. The LLM doesn't compute by hand; one numpy line does it.
  • Tasks with accumulating state — variables let the next step naturally consume the previous result.

How it breaks

  • The moment your sandbox leaks, security is over. Executing Python means arbitrary code execution. Fix: Docker, Firecracker, E2B, gVisor, or at minimum a restricted exec environment plus syscall filtering.
  • A single import os; os.system('rm -rf /') ends the day. There is no guarantee the LLM won't write that. Fix: import allowlist (a static gate is sketched after the executor below), network isolation, ephemeral filesystem.
  • Infinite loops. One while True: search("x") blows your token bill apart. Fix: execution timeout plus output-token cap.
  • Debugging is harder. If a block calls 7 tools and the 5th fails, the model has to re-derive what went wrong. Fix: return the execution trace verbatim (line-by-line stdout).

Minimal sketch

# codeact_executor.py — runs model-generated code in an isolated environment
import subprocess, textwrap

ALLOWED_IMPORTS = {"json", "re", "math", "datetime", "tools"}

def execute_codeact(code: str, tools: dict, timeout_s: int = 30) -> str:
    # tools is a dict of callables the model can invoke — pre-injected in the container
    wrapper = (
        "import json, sys\n"
        f"from tools import {', '.join(tools.keys())}\n"
        "try:\n"
        + textwrap.indent(code, "    ")
        + "\nexcept Exception as e:\n"
        "    print(f'__ERROR__: {type(e).__name__}: {e}', file=sys.stderr)\n"
    )
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", "--network=none",
         "--memory=512m", "--cpus=1.0", "codeact-sandbox:latest", "python", "-"],
        input=wrapper.encode(),
        capture_output=True,
        timeout=timeout_s,
    )
    stdout = result.stdout.decode()[-4000:]  # output cap
    stderr = result.stderr.decode()[-1000:]
    return f"STDOUT:\n{stdout}\n\nSTDERR:\n{stderr}"

# Model loop
while not done:
    code = llm_extract_codeact(messages)  # pull the ```python ... ``` block
    obs = execute_codeact(code, tools)
    messages.append({"role": "user", "content": f"Execution result:\n{obs}"})

Key details: output cap (last 4 KB only), network isolation (--network=none then allowlist domains), memory/CPU limits, timeout.
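
One more layer worth sketching: the ALLOWED_IMPORTS set above is declared but never enforced. A pre-execution gate with the standard ast module can reject disallowed imports before the container even starts. This won't catch __import__ or importlib tricks, which is exactly what the sandbox is for:

# import_gate.py — static import allowlist check (sketch)
import ast

def check_imports(code: str, allowed: set[str]) -> list[str]:
    violations = []
    # a SyntaxError from ast.parse is also a useful early rejection
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        violations += [n for n in names if n not in allowed]
    return violations

# gate before executing
bad = check_imports(code, ALLOWED_IMPORTS)
obs = f"BLOCKED: disallowed imports: {bad}" if bad else execute_codeact(code, tools)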


3. Plan-Execute — Plan First, Then Execute

One-line definition: one big LLM call produces an N-step plan; the plan is then executed deterministically (or by a smaller model) step by step. ReAct interleaves "one thought, one action"; Plan-Execute does "one thought of N steps, then N actions."

LangGraph's official tutorials include "plan-and-execute"; Wang et al.'s "Plan-and-Solve Prompting" (2023) made it a stable structure.

Topology

[Plan phase]                       [Execute phase]
┌──────────────┐                   ┌──────────────┐
│  Planner LLM │  →  step list  →  │ Executor LLM │  →  results
│  (one shot)  │                   │  (per step)  │
└──────────────┘                   └──────┬───────┘
                  optional re-plan  ◄─────┘
                  (if step fails)

When it shines

  • Tasks where step count is predictable up front — "write a report" almost always decomposes into (research → outline → draft → review → polish).
  • Independent steps that can be executed with a cheaper model. Cost often drops by more than half.
  • Workflows that need reproducibility. Save the plan and you can rerun the same path on the same input (with temperature=0); a caching sketch closes this section.

How it breaks

  • Hallucinated plans. The planner says "Step 3: query the customers table" but no such table exists. You only learn during execution. Fix: hand the planner the schemas of the available tools/resources, or run a "discover" phase before planning.
  • The world doesn't follow your plan. If the third step hits a 401, the fourth is meaningless. Fix: define re-planning triggers (failure or unexpected observation).
  • Over-decomposition. Four steps stretched into twelve. Fix: prompt the planner with "make plans of at most N steps unless absolutely necessary."

Minimal sketch

# plan_execute.py
import json

class UnexpectedObservation(Exception):
    """Raised by an executor step whose observation contradicts the plan."""

def plan(task: str, tools_schema: dict) -> list[dict]:
    prompt = f"""You will be given a task. Decompose it into 3-7 ordered steps.
Each step: {{"id": int, "goal": str, "tool": str | null, "depends_on": [int]}}
Available tools: {list(tools_schema.keys())}
Task: {task}
Return a JSON array."""
    return json.loads(llm(prompt, temperature=0))

def execute_step(step: dict, prior_results: dict) -> dict:
    context = "\n".join(f"step{i}: {prior_results[i]}" for i in step["depends_on"])
    return run_executor_agent(step["goal"], context, tool=step["tool"])

def plan_execute(task: str, tools_schema: dict, max_replans: int = 2):
    steps = plan(task, tools_schema)
    results, replans = {}, 0
    i = 0
    while i < len(steps):
        try:
            results[steps[i]["id"]] = execute_step(steps[i], results)
            i += 1
        except UnexpectedObservation as e:
            if replans >= max_replans: raise
            steps = plan(f"{task}\nSo far: {results}\nProblem: {e}", tools_schema)
            replans += 1
            i = 0  # restart with the new plan
    return results

You typically see LLM-call count drop to one-third or one-fifth of ReAct. Cost is the biggest win.
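
The reproducibility point from "When it shines" is cheap to get: persist the plan keyed by a hash of the task and replay it on repeat inputs. A sketch, assuming the plan() helper above:

# plan_cache.py — replayable plans (sketch)
import hashlib, json, pathlib

CACHE = pathlib.Path(".plan_cache")
CACHE.mkdir(exist_ok=True)

def cached_plan(task: str, tools_schema: dict) -> list[dict]:
    key = hashlib.sha256(task.encode()).hexdigest()[:16]
    path = CACHE / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # same input, same plan, same path
    steps = plan(task, tools_schema)  # already temperature=0
    path.write_text(json.dumps(steps))
    return steps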


4. Self-RAG / Corrective RAG — Decide Whether to Retrieve at All

One-line definition: unlike vanilla RAG that retrieves on every query, the agent decides "should I retrieve?", "are the results enough?", "should I retrieve again?"

Asai et al.'s "Self-RAG" (2023) is the origin. Yan et al.'s "Corrective Retrieval Augmented Generation" (CRAG, 2024) extended it: if retrieved docs are bad, supplement with web search. By 2025–2026, LangGraph's "Adaptive RAG" tutorial became the de facto reference.

Graph

                        ┌─────────────┐
   query  ───────────►  │ Route LLM   │
                        └──────┬──────┘
                  retrieve?    │     direct?
              ┌────────────────┴────────────────┐
              ▼                                 ▼
        ┌──────────┐                      ┌──────────┐
        │ Retrieve │                      │ Answer   │
        └────┬─────┘                      │ directly │
             ▼                            └──────────┘
        ┌──────────┐
        │ Grade    │  ──→  irrelevant  ──→  rewrite query, retry
        │ docs LLM │  ──→  partial     ──→  web search supplement
        └────┬─────┘  ──→  relevant    ──→  answer
        ┌──────────┐
        │ Answer   │  ──→  hallucinated?  ──→  retry
        │ + check  │  ──→  grounded      ──→  return
        └──────────┘

When it shines

  • Systems with heterogeneous questions. Some need no retrieval ("what's 2 + 2?"), some need several ("compare our refund policy to the 2025 EU consumer law").
  • Environments with uneven retrieval quality — wiki plus Slack plus Notion plus GitHub issues. It rarely all comes back at once.
  • Domains where hallucination is expensive — legal, medical, financial. Being able to say "the retrieval wasn't enough" is the whole point.

How it breaks

  • Loops get long. "Retrieve again", "retrieve again", "retrieve again" → token bill explodes. Fix: retrieval hop cap (usually 3).
  • The grader itself hallucinates. The docs do answer the question but the grader says "irrelevant." Fix: split the grader into a small classifier with a confidence threshold (sketched after the code below).
  • Query rewriting drifts. "Refund policy" rewritten to "step-by-step refund procedure" misses the exception clauses. Fix: ship the original query alongside the rewrite.

Minimal sketch

# self_rag.py
def adaptive_rag(query: str, max_hops: int = 3):
    route = llm_classify(query, choices=["direct", "retrieve"])
    if route == "direct":
        return llm_answer(query)

    for hop in range(max_hops):
        docs = retrieve(query, k=5)
        grade = llm_grade(query, docs)  # one of relevant/partial/irrelevant
        if grade == "relevant":
            answer = llm_answer_with_context(query, docs)
            if llm_check_grounding(answer, docs) == "grounded":
                return answer
            # hallucinated → retry
            continue
        if grade == "partial":
            web_docs = web_search(query, k=3)
            return llm_answer_with_context(query, docs + web_docs)
        # irrelevant → rewrite
        query = llm_rewrite(query, hint=f"prior docs were off-topic: {summarize(docs)}")
    return llm_answer_with_context(query, docs)  # last attempt
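
And the grader fix from "How it breaks", sketched: a classifier that returns a confidence, where a shaky verdict is downgraded to partial instead of trusted either way. small_classifier is an assumed helper (any cheap model returning a label and a confidence):

# grader.py — confidence-thresholded doc grader (sketch)
def grade_with_confidence(query: str, docs: list, threshold: float = 0.75) -> str:
    snippets = "\n".join(d.snippet for d in docs)
    label, conf = small_classifier(f"Q: {query}\nDocs: {snippets}")
    if conf < threshold:
        return "partial"  # low confidence: route to the web-search supplement path
    return label  # relevant / partial / irrelevant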

5. Reflexion / Self-Critique — Verbal-RL Retry

One-line definition: try the task, fail, generate a natural-language reflection on why you failed, store it in memory, and inject the reflection into the system prompt on the next attempt. Weights don't change — this is "verbal reinforcement learning."

Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (NeurIPS 2023). It pushed GPT-4's HumanEval coding success rate from 80% to 91%.

Cycle

     ┌──────────────────────────────────────┐
     │                                      │
     ▼                                      │
  Actor  ──► Action ──► Env ──► Reward      │
  (LLM)                                     │
     ▲                                      │
     │                                      │
  Evaluator (success/fail)                  │
     │                                      │
     ▼                                      │
  Self-Reflection LLM ──► "why I failed" ───┘
                          (memory buffer)

When it shines

  • Tasks with clear pass/fail signals. Code tests, numeric answers, user rejection. With a signal, the reflection has something to bite into (a pytest-based evaluator is sketched after this list).
  • Cheap, fast attempt environments — patch a line, run a unit test. Doesn't fit if each attempt is a 100-dollar external API call.
  • Repeating task families. Reflection memory accumulates and you get a learning curve over time.
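
For the code-test case the evaluator can be fully deterministic. A sketch that shells out to pytest, assuming the attempt is written to a scratch directory that already holds its tests:

# evaluator.py — deterministic pass/fail signal via pytest (sketch)
import pathlib, subprocess

def evaluator(result: str, task: str) -> dict:
    pathlib.Path("scratch/solution.py").write_text(result)
    proc = subprocess.run(["pytest", "scratch/tests/", "-x", "-q"],
                          capture_output=True, text=True, timeout=120)
    if proc.returncode == 0:
        return {"status": "success", "reason": ""}
    # the tail of the pytest output is the raw material for the reflection
    return {"status": "fail", "reason": proc.stdout[-2000:]}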

How it breaks

  • Wrong reflection poisons the next attempt. "Last time I failed because of X" — if X is wrong, the next attempt heads in the wrong direction. Fix: attach confidence to reflections; if the same reflection repeats twice, force a different direction.
  • Infinite trying. "Failed again, more reflection, try again." Fix: hard retry cap (3–5).
  • Memory bloat. Reflections accumulate and eat the context window. Fix: keep only the last N, or dedupe by semantic similarity.

Minimal sketch

# reflexion.py
def reflexion_loop(task: str, max_trials: int = 5):
    reflections = []
    for trial in range(max_trials):
        sys_prompt = base_prompt + "\n\nPast reflections:\n" + "\n".join(reflections[-3:])
        result = actor_llm(task, system=sys_prompt)
        evaluation = evaluator(result, task)  # status: success|fail, plus reason
        if evaluation["status"] == "success":
            return result
        reflection = reflect_llm(task=task, attempt=result, why_failed=evaluation["reason"])
        # dedupe
        if reflection not in reflections:
            reflections.append(reflection)
        else:
            # repeating insight → force a different angle
            reflection = reflect_llm(task=task, attempt=result, hint="try a completely different approach")
            reflections.append(reflection)
    return result  # return the last attempt

6. Handoff — Swarm-Style Multi-Agent

One-line definition: no supervisor. Each agent, when its job is done, hands off directly to the next agent. This is the core pattern of OpenAI's Swarm library (October 2024), and Anthropic's computer-use demos effectively look the same.

Supervisor vs Handoff

[Supervisor]                       [Handoff]
   ┌─────────┐                        A
   │   Sup   │                        │ handoff_to(B)
   └──┬──┬──┬┘                        ▼
      ▼  ▼  ▼                         B
      A  B  C                         │ handoff_to(C)
centralized decision                  C
"who's next?"                         distributed decision
                                      "I know who to pass to"

In OpenAI Swarm, a handoff is just a tool call: a function that returns the next agent. Call transfer_to_billing() and the conversation context transfers to the billing agent.

When it shines

  • Workflows with clear stage transitions — "triage agent receives → if refund, billing; if technical, tech-support → and we're done."
  • Agents with narrow, well-defined responsibilities. Narrow scope means a focused system prompt.
  • Conversations with small accumulated state. The handoff carries context with it; if it's huge, costs explode.

How it breaks

  • Handoff loops. A → B → A → B. Fix: handoff hop counter and escalation if the same pair pings back twice.
  • Cost blowup because context follows everywhere. Long conversations mean each handoff re-processes the same context. Fix: at handoff time, summarize and pass a small variable packet (sketched after the minimal sketch below).
  • System prompt clash. Tech-support says "be empathetic"; billing says "stick to numbers." Tone jumps on handoff. Fix: common base prompt plus role-specific overlay.

Minimal sketch

# swarm.py — OpenAI Swarm style
def make_agent(name, instructions, tools, handoffs):
    return {
        "name": name,
        "instructions": instructions,
        "tools": tools + [{"name": f"transfer_to_{h.name}", "fn": lambda h=h: h} for h in handoffs],
    }

triage = make_agent("triage",
    "Route user to billing or tech_support based on intent.",
    tools=[], handoffs=[billing, tech])

def run(agent, user_msg, history, max_hops: int = 5):
    history.append({"role": "user", "content": user_msg})  # append once, not per hop
    for _ in range(max_hops):  # hop cap: no endless A → B → A ping-pong
        out = llm(agent["instructions"], history)
        if out.tool_call and out.tool_call.name.startswith("transfer_to_"):
            agent = out.tool_call.fn()  # handoff!
            continue
        return out, history
    raise RuntimeError("handoff hop budget exhausted")
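
The context-cost fix from "How it breaks", sketched: summarize at the handoff boundary and hand the next agent a small packet, not the raw log. extract_entities is an assumed helper that pulls structured facts (order ID, user tier, and the like):

# handoff_packet.py — summarize-at-handoff (sketch)
import json

def handoff(from_agent, to_agent, history):
    packet = {
        "from": from_agent["name"],
        "summary": summarize(history),           # a few hundred tokens, not the raw log
        "variables": extract_entities(history),  # structured facts the next agent needs
    }
    fresh_history = [{"role": "user", "content": json.dumps(packet)}]
    return to_agent, fresh_history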

7. Specialist Routing — Intent Classifier Dispatches

One-line definition: when input comes in, a cheap, fast classifier grabs the intent and routes to the matching specialist agent. Looks similar to Supervisor, but here the router is a small classification model, a deterministic rule, or at most a cheap LLM, not a frontier model reasoning over the whole state at every step.

Most production chatbots already do this. Anthropic's customer-facing agents, OpenAI ChatGPT's GPTs routing, Slack and Discord bot command routing.

Topology

user msg
┌─────────────┐
│ Intent      │   ← small BERT classifier, regex rules, or a cheap LLM like Haiku
│ classifier  │
└──┬─────┬────┘
   │     │
   ▼     ▼
agent_a  agent_b  ... agent_n
(each: own system prompt, own tools)

When it shines

  • Intent is enumerable ("refund / shipping / account / other").
  • Each intent's response is wildly different — one prompt can't do it.
  • Latency is critical. A BERT-base classifier finishes in under 50 ms.

How it breaks

  • A misclassified input goes to the wrong specialist forever. "My payment failed and I'm furious you won't refund" is billing? complaint? Fix: low-confidence falls through to a generalist; the classifier should always have an "uncertain" option.
  • A new intent requires retraining. Fix: use a small LLM to do zero-shot "existing N classes plus other," and monitor the "other" rate (sketched at the end of this section).
  • Multi-intent. "Cancel my order and place a new one" = cancel + create. Fix: multi-label classification, or let the generalist handle it.

Minimal sketch

# routing.py
INTENT_AGENTS = {
    "billing": billing_agent,
    "shipping": shipping_agent,
    "account": account_agent,
    "general": general_agent,
}

def classify(text: str) -> tuple[str, float]:
    # small classifier — bert-base finetune or Haiku zero-shot
    logits = small_classifier(text)
    intent, conf = top1(logits)
    return intent, conf

def route(user_msg: str):
    intent, conf = classify(user_msg)
    if conf < 0.7 or intent == "uncertain":
        return general_agent(user_msg)  # fallback
    return INTENT_AGENTS[intent](user_msg)
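
And the "new intent without retraining" fix, sketched as zero-shot classification against a cheap LLM; llm() is the same assumed helper as in the other sketches:

# zero_shot_router.py — sketch
INTENTS = ["billing", "shipping", "account", "other"]

def classify_zero_shot(text: str) -> str:
    prompt = (f"Classify the message into exactly one of {INTENTS}.\n"
              f"Message: {text}\nAnswer with the label only.")
    label = llm(prompt).strip().lower()
    return label if label in INTENTS else "other"  # monitor the "other" rate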

8. Subagent Isolation — Claude Code's Context Isolation Pattern

One-line definition: in the middle of a big task, the main agent spawns a child agent with a completely fresh context window, gives it a narrow sub-task, and only receives a summary back. The child's context never accumulates in the parent.

This is one of Claude Code's signature patterns; Anthropic's docs cover it under "subagents / Task tool." Similar ideas show up in OpenAI's Agents SDK subagents and Cursor's background agents.

Picture

[main agent context]
  ┌────────────────────────────────────────────────┐
  │ user task, system prompt, history (240k tok)   │
  │                                                │
  │   ── spawn(subagent, "summarize 18 PRs") ──►   │
  │                                                │
  │            [subagent context]                  │
  │            ┌───────────────────┐               │
  │            │ fresh window      │               │
  │            │ subtask + tools   │               │
  │            │ (returns: 400 tok)│               │
  │            └─────────┬─────────┘               │
  │                      │                         │
  │   ◄────── result summary (400 tok) ──          │
  │                                                │
  │ continues main task (context grew by 400 tok)  │
  └────────────────────────────────────────────────┘

When it shines

  • Large search/exploration where the parent doesn't need raw results. "Read these 18 PRs and find the stale ones" goes to a subagent; the parent only gets a list back.
  • Preventing context rot in long sessions. Delegate before parent context exceeds 80%.
  • Parallelizable independent sub-tasks. Run three subagents at once (a fan-out sketch closes this section).

How it breaks

  • The child doesn't know the parent's intent. "Find the stale PRs" carries no "why." Fix: the parent packs enough context into the subtask prompt explicitly.
  • The child grows the job indefinitely. "18 PRs" → child spawns subagents per PR → recursion. Fix: depth limit plus token budget.
  • Result integration is hard. Three children return three different shapes. Fix: spell the output schema in the subagent's prompt.

Minimal sketch

# subagent.py
def spawn_subagent(parent_state, subtask: str, tools: list,
                   max_tokens: int = 8000, depth: int = 0):
    if depth >= 2:
        raise RuntimeError("subagent depth limit reached")
    sub_context = {
        "system": SUB_AGENT_SYSTEM,  # different system prompt than parent
        "messages": [{"role": "user", "content": subtask}],
    }
    while not done(sub_context) and tokens(sub_context) < max_tokens:
        sub_context = step(sub_context, tools, depth=depth + 1)
    summary = final_summary(sub_context)  # child summarizes its own result
    return summary  # the parent only sees this

# inside the parent loop
if parent_decides_to_delegate():
    summary = spawn_subagent(state, "summarize the 18 open PRs in the repo", tools=[git_tool, gh_tool])
    state["messages"].append({"role": "tool", "name": "subagent", "content": summary})

The point: the child's raw chain never lands in the parent's messages. An 8,000-token child task is folded into a 400-token summary.
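
For the parallel case from "When it shines", a fan-out sketch with concurrent.futures; the parent still receives only summaries:

# parallel_subagents.py — fan out independent sub-tasks (sketch)
from concurrent.futures import ThreadPoolExecutor

subtasks = [
    "summarize the 18 open PRs in the repo",
    "list modules with no test coverage",
    "collect TODO/FIXME comments older than 90 days",
]
with ThreadPoolExecutor(max_workers=3) as pool:
    summaries = list(pool.map(
        lambda t: spawn_subagent(state, t, tools=[git_tool, gh_tool]), subtasks))
for s in summaries:
    state["messages"].append({"role": "tool", "name": "subagent", "content": s})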


9. Hooks + Permission Gates — Deterministic Guardrails Between LLM Steps

One-line definition: before/after every step of the agent loop, deterministic code (a hook) runs to inspect, transform, or block. LLMs are flexible but untrustworthy; hooks are rigid but 100 percent deterministic. They alternate.

Claude Code's hook system (PreToolUse, PostToolUse, UserPromptSubmit) is canonical; LangGraph's interrupt_before / interrupt_after, OpenAI Agents SDK guardrails, and Anthropic Agent Skills permission gates all fit the same family.

Where they sit

       ┌──────────────────────────────────────┐
       │              Agent loop              │
       │                                      │
 user  │  ┌──── UserPromptSubmit hook ───┐    │
 msg ──┼─►│ check policy, strip secrets   │   │
       │  └──┬────────────────────────────┘   │
       │     ▼                                │
       │  ┌──── LLM step ──────────┐          │
       │  │ propose tool call      │          │
       │  └──┬─────────────────────┘          │
       │     ▼                                │
       │  ┌── PreToolUse hook ────┐           │
       │  │ allow? deny? mutate?  │           │
       │  └──┬────────────────────┘           │
       │     ▼                                │
       │  ┌── Tool execution ─────┐           │
       │  └──┬────────────────────┘           │
       │     ▼                                │
       │  ┌── PostToolUse hook ───┐           │
       │  │ redact, audit         │           │
       │  └──┬────────────────────┘           │
       │     ▼                                │
       │  loop                                │
       └──────────────────────────────────────┘

When it shines

  • Irreversible actions that need human approval. rm -rf, payments, sending external email. The LLM can say it's fine, the hook says no.
  • Secret-leak prevention — every LLM input scanned with regex for API keys, tokens, PII.
  • Policy enforcement — "don't touch anything outside this directory", "don't fetch URLs outside this domain list".
  • Audit log — every tool call written to durable storage in the PostHook. Incident traceable.

How it breaks

  • Too tight and the agent stops working. "Refuse all file writes" → agent can't do anything. Fix: allowlist with "ask-user" for borderline cases instead of deny-by-default (a three-way verdict is sketched at the end of this section).
  • Invisible hooks → repeated attempts. Rejected, the model never learns why and tries again. Fix: feed the hook's reject message into the next LLM input.
  • LLM calls inside hooks = non-determinism. "Ask another LLM if this is dangerous" defeats the point. Fix: hooks are static rules; ambiguity goes to a human.

Minimal sketch

# hooks.py
def pre_tool_hook(tool: str, args: dict, state) -> tuple[bool, str | None]:
    if tool == "bash":
        cmd = args.get("command", "")
        if "rm -rf" in cmd or "curl http" in cmd:
            return False, "blocked: dangerous command pattern"
        if "$SECRET" in cmd or re.search(r"sk-[a-zA-Z0-9]{20,}", cmd):
            return False, "blocked: secret leak detected"
    if tool == "write_file":
        if not args["path"].startswith(state["allowed_root"]):
            return False, f"blocked: outside allowed root {state['allowed_root']}"
    return True, None

def agent_step(state):
    proposal = llm_propose_action(state)
    allowed, reason = pre_tool_hook(proposal.tool, proposal.args, state)
    if not allowed:
        state["messages"].append({"role": "tool", "content": f"BLOCKED: {reason}"})
        return state
    result = execute(proposal)
    audit_log(proposal, result)  # post hook
    state["messages"].append({"role": "tool", "content": result})
    return state
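
Extending the sketch to the three-way verdict from "How it breaks": allow, deny, or ask a human for borderline cases. human_approves is an assumed UI hook:

# three-way verdict: deterministic rules decide, ambiguity goes to a human
def pre_tool_hook_v2(tool: str, args: dict, state) -> tuple[str, str | None]:
    allowed, reason = pre_tool_hook(tool, args, state)
    if not allowed:
        return "deny", reason
    if tool in state.get("ask_user_tools", {"bash", "send_email"}):
        return "ask", f"{tool} requires approval: {args}"
    return "allow", None

verdict, reason = pre_tool_hook_v2(proposal.tool, proposal.args, state)
if verdict == "ask":
    verdict = "allow" if human_approves(reason) else "deny"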

10. How Real Systems Actually Compose These

Here's the real point. Almost no production agent runs on a single pattern. A well-built 2026 agent typically composes 3–4 patterns.

Case 1 · Coding agents (Claude Code, Cursor, Devin family)

[user task]
┌────────────────────────────────────────────────┐
│  Main agent (loop) — Plan-Execute              │
│   - Reads task, builds a 3-7 step plan         │
│   - Per step:                                  │
│      ┌─── PreToolUse hook ────┐                │
│      │ permission + policy    │                │
│      └─────────┬──────────────┘                │
│                ▼                               │
│      ┌─── Tool / CodeAct exec ────┐            │
│      └─────────┬──────────────────┘            │
│                ▼                               │
│      ┌─── PostToolUse hook ───┐                │
│      │ audit, redact          │                │
│      └─────────┬──────────────┘                │
│                ▼                               │
│      need deep exploration? ──► spawn subagent │
│                                  (isolated ctx)│
│                ▼                               │
│   - On step failure: Reflexion-style retry     │
└────────────────────────────────────────────────┘

Combination: Plan-Execute + Subagent isolation + Hooks + Reflexion (4 patterns).

Case 2 · Customer support agents (Anthropic, Intercom Fin, Linear Asks)

[user msg]
┌──── Specialist routing (intent classifier) ────┐
│       ▼          ▼         ▼          ▼        │
│   billing    shipping   account    general     │
│      │          │         │           │        │
└──────┼──────────┼─────────┼───────────┼────────┘
       ▼          ▼         ▼           ▼
     [each specialist: Self-RAG over internal KB]
     can resolve? ──── yes ──► reply
       no
     Handoff to human OR another specialist

Combination: Specialist routing + Self-RAG + Handoff (3 patterns).

Case 3 · Data analysis agents (Manus, OpenAI Code Interpreter)

[user question over CSV / DB]
Supervisor (Manus-style)
    ├─► Researcher subagent  (subagent isolation)
    │     └─ Self-RAG over schemas/docs
    ├─► Analyst subagent
    │     └─ CodeAct (writes pandas / SQL in Python blocks)
    │     └─ executes in sandbox with Hooks (network=none, fs scoped)
    └─► Reporter subagent
          └─ writes markdown summary

Combination: Supervisor + Subagent isolation + CodeAct + Self-RAG + Hooks (5 patterns).

Case 4 · OS automation agents (Anthropic computer use)

[user goal: "book a flight"]
Plan phase: high-level plan (find site → search → fill form → confirm)
Execute phase: action loop
   - screenshot
   - decide click/type/key
   - PreActionHook: dangerous zone? (banking, settings) → halt + ask user
   - execute
   - PostActionHook: screenshot diff
   - on unexpected screen → Reflexion-style replan

Combination: Plan-Execute + Hooks + Reflexion.

Combination matrix

                  simple               mid-complex                very complex
Intro books       ReAct                ReAct + Reflexion          (the benchmarks stop here)
Real production   Routing + Self-RAG   Plan-Exec + Hooks +        Supervisor + Subagent +
                                       Subagent                   CodeAct + Hooks + Reflexion

11. Anti-patterns

  1. Combine all the patterns. "A supervisor inside a supervisor, a plan-execute inside a subagent..." Debugging becomes impossible. The golden rule: before adding a pattern, try removing one.
  2. CodeAct without hooks. Within days, someone writes rm -rf or leaks a secret. CodeAct's prerequisite is sandbox plus hooks.
  3. Accumulating every specialist's raw output in the supervisor. Five rounds in, 240k tokens gone.
  4. Reflexion without a hard cap. Twelve retries → a 3,000-dollar bill.
  5. Worshipping the plan in Plan-Execute. A plan is a hypothesis. If execution shows otherwise, replanning has to be allowed.
  6. Passing the full context on handoff. Each new agent re-reads the same context; cost goes N×. Pass a summary plus a variable packet only.
  7. Using the same model for Self-RAG's grader. The grader is cheaper as a separate small model.
  8. Ignoring routing confidence. Confidence 0.51 still gets routed to a specialist. A fallback track is non-negotiable.

12. Pattern Selection Checklist

When building a new agent, run these questions:

  • Is the step count predictable up front? → If yes, Plan-Execute; otherwise ReAct + Reflexion.
  • Are there heterogeneous tools/knowledge domains? → Specialist routing or Supervisor.
  • Are there compute-heavy or composable actions? → CodeAct (only with sandbox + hooks).
  • Is external knowledge always required? → Don't RAG every query; Self-RAG retrieves only when needed.
  • Are there clear success/failure signals? → Consider Reflexion.
  • Are there irreversible actions? → Hooks + permission gates, non-negotiable.
  • Are there long sessions or large explorations? → Subagent isolation.
  • Does the workflow have natural agent-to-agent handoffs? → Handoff (with a hop cap).

Draw your own composition matrix. Patterns are tools.

Epilogue — System Thinking, Not Pattern Cataloging

If the 2024 articles said "use ReAct," the 2026 answer is quieter: "which failures dominate your workload?" Context blowups → Subagent. Spotty retrieval → Self-RAG. Scattered steps → Plan-Execute. Irreversible actions → Hooks. Too-broad responsibility → Specialist routing or Supervisor.

Patterns are a card deck; you play a hand. The hand depends on the system. And add one card at a time — that is the biggest lesson.

The next post will cover evaluating these compositions. How Inspect AI, Phoenix, and LangSmith catch regressions in multi-pattern agents, and how to define "correctness" for non-deterministic topologies like swarms.
