Agent Evaluation Systems in 2026 — Inspect AI vs Promptfoo vs Phoenix vs LangSmith vs OpenAI Evals (You're Measuring the Agent, Not the Model)
Prologue — "It works well" is not an evaluation
A conversation that happens at least once on every team in 2026.
PM: "How is the agent doing?" Engineer: "Pretty well. Want a demo?" PM: "Better than last week?" Engineer: "Uh... feels like it."
This is still the normal scene. We know how to build agents, but we cannot say in numbers how well the agent actually works. Model companies refresh MMLU, GPQA, and SWE-bench numbers every month, while our own agent sits there with nothing but a vibe check.
The root issue is that agent evaluation is a different problem from LLM evaluation. LLM eval asks "what output does this model produce for this input." Agent eval asks "does this model plus this harness plus these tools carry this task to completion." The latter is non-deterministic, multi-step, and environment-dependent. It is not a one-shot benchmark.
This post is a map of the 2026 agent evaluation landscape. Inspect AI (UK AISI), Promptfoo, Arize Phoenix, LangSmith, OpenAI Evals, Braintrust, Helicone, Langfuse — where each one sits, what taxonomy of evals you actually have to think about, and how a team should build its first eval suite.
1. Model evals vs agent evals — different problems
Words first. The two are routinely conflated, but they are different problems.
| Dimension | Model (LLM) eval | Agent eval |
|---|---|---|
| What you measure | Model weights | Model + harness + tools + env |
| Input | A single prompt | A task spec + tools + state |
| Output | A single response | A multi-step action sequence |
| Determinism | Near-deterministic at temp 0 | Non-deterministic (tool side effects, env) |
| Scoring | Reference comparison | Task completion, trajectory, efficiency |
| Canonical benchmarks | MMLU, GPQA, HumanEval | SWE-bench, WebArena, OSWorld, tau-bench |
| Who owns it | The model provider | Your team |
Key insight: models are commoditizing; the differentiation is in the agent system. And the agent system is your responsibility. The model company can bump their SWE-bench number all they want; that does not guarantee a better outcome on your domain's tickets. Whether it works on your domain — you measure that.
Second point: agent eval is also harness eval. Same model, different harness, different result. So "model comparison" only means something when you fix the harness, and "harness comparison" only means something when you fix the model. Both require the same evaluation infrastructure underneath.
2. The eval taxonomy — what can we actually measure
The first question when designing an agent eval is "what are we measuring." Six common axes.
1) Deterministic eval
A golden test set: input maps to expected output. Comparison is mechanical (exact match, regex, structural compare, run code and check the result).
- Strengths: fast, cheap, reproducible.
- Weaknesses: poor fit for free-form natural language, expensive to construct.
- When: classification, extraction, computation, code generation — anywhere a "correct answer" exists.
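For illustration, a few scorers of this kind. This is a minimal sketch; the helper names are hypothetical, not from any particular framework:

```python
import json
import re


def score_exact(output: str, expected: str) -> bool:
    # Exact match after trivial normalization
    return output.strip().lower() == expected.strip().lower()


def score_json(output: str, expected: dict) -> bool:
    # Structural compare: parse, then compare only the fields we care about
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(parsed.get(k) == v for k, v in expected.items())


def score_regex(output: str, pattern: str) -> bool:
    # Pattern match for outputs with a known shape (IDs, dates, flags)
    return re.search(pattern, output) is not None
```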
2) LLM-as-judge
Another (usually stronger) model scores the output against a rubric in the prompt.
- Strengths: scales to free-form output, fast.
- Weaknesses: judge biases (positional bias, verbosity bias, self-preference), cost, score drift when the judge model updates.
- When: summarization, explanation, creative work, anywhere the criterion is qualitative.
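A minimal judge sketch, assuming an OpenAI-style client; the rubric, the model choice, and the 1-to-5 scale are illustrative:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a customer support reply.
Rubric: 1 = unhelpful or wrong, 3 = correct but curt, 5 = correct, empathetic, cites policy.
Reply with a single integer 1-5 and nothing else.

Ticket: {ticket}
Reply: {reply}"""


def judge(ticket: str, reply: str) -> int:
    # Use a stronger model than the one under test to reduce self-preference bias
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(ticket=ticket, reply=reply)}],
    )
    return int(response.choices[0].message.content.strip())
```

Pin the judge model version in config; otherwise score drift from judge updates will masquerade as regressions in the system under test.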
3) Human eval
People score directly. Usually pairwise (A vs B) or a 1-to-5 scale.
- Strengths: closest to ground truth.
- Weaknesses: slow, expensive, hard to keep consistent.
- When: final verification for major model/harness changes, calibrating LLM-as-judge.
4) Task-completion
Did the agent finish the task — a binary result. SWE-bench's "do the tests pass," WebArena's "did the page reach the goal state."
- Strengths: most meaningful signal — does it actually work.
- Weaknesses: environment setup cost, hard to award partial credit.
5) Trajectory eval
Not just whether the agent finished, but how. The order of tool calls, the logical flow, the intermediate reasoning.
- Strengths: distinguishes lucky completions from real competence.
- Weaknesses: the scoring rubric itself is hard.
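One cheap trajectory check is an ordered-subsequence assertion over tool calls. A sketch, assuming trajectory steps are dicts with `type` and `name` fields (the step shape is illustrative):

```python
def tools_in_order(trajectory: list[dict], required: list[str]) -> bool:
    """Check that the required tool calls appear in order; other calls
    may be interleaved between them."""
    calls = [step["name"] for step in trajectory if step.get("type") == "tool_call"]
    it = iter(calls)
    # Each `in` consumes the iterator, so this enforces ordering
    return all(name in it for name in required)


# Example: a refund agent should look the ticket up before issuing a refund
# assert tools_in_order(trajectory, ["lookup_ticket", "check_policy", "issue_refund"])
```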
6) Efficiency eval
Did the agent finish the same task using fewer resources. Steps, tokens, wall-clock time, tool-call count, cost.
- Strengths: directly maps to cost and latency in production.
- Weaknesses: confounded with task difficulty — harder tasks naturally take more steps.
Practical combination: task-completion (binary) plus trajectory (sampled) plus efficiency (continuous). Use LLM-as-judge only on free-form output. Use human only as the regression gate for major changes. Reading any single axis in isolation will fool you.
3. Inspect AI — becoming the gold standard
Inspect AI is an OSS framework from the UK AI Security Institute (AISI, launched as the AI Safety Institute). Released May 2024. By 2026 it is the de facto standard for agent safety evals — Apollo Research, METR, Anthropic, and OpenAI all publish risk evaluations on top of it.
Why the acceleration:
- Designed for agentic eval from day one — tool use, multi-step, sandboxed environments are first-class.
- Python-first — built around the `@task`, `@solver`, and `@scorer` decorators.
- Reproducible logs — one evaluation run produces one serialized log file with both scores and trajectories.
- AISI provenance — being a government-built framework matters in governance and policy contexts.
A typical Inspect task:
```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash, python


@task
def ctf_challenge():
    return Task(
        dataset=[
            Sample(
                input="Find the flag in /flag.txt",
                target="flag{example}",
            ),
        ],
        solver=[
            use_tools([bash(), python()]),  # give the agent shell and Python tools
            generate(),                     # run the model loop until it stops calling tools
        ],
        scorer=includes(),  # pass if the target string appears in the output
        sandbox="docker",   # isolate tool execution in a container
    )
```
A one-liner runs it: `inspect eval ctf_challenge.py --model anthropic/claude-3-7-sonnet`. Scoring, logs, and trajectory all fall out into a single file. Inspect View, the web UI, lets you inspect which tools the model called in which order for any task.
Where Inspect lives in 2026:
- Model safety evals — frontier labs use it for external evaluations.
- AISI's official evaluations — the UK government's pre-deployment model assessments run on Inspect.
- Research labs measuring agentic capability — there is a noticeable consolidation around the Inspect format for SWE-bench, CTF, biosecurity, and autonomy evals.
Caveat: there is a learning curve. You have to internalize the decorators, the solver pipeline, the scorer model. And it leans more "rigorous evaluation campaign" than "daily regression on a production app."
4. Promptfoo — OSS and CLI-first
Promptfoo is OSS and treats the CLI as a first-class citizen. Write prompts, test cases, and assertions in a YAML file; run promptfoo eval. Results come back as a table, viewable in a web UI.
A typical config:
```yaml
description: "Customer support agent regression"

providers:
  - id: anthropic:messages:claude-3-7-sonnet
  - id: openai:gpt-4o

prompts:
  - file://prompts/support_agent.txt

tests:
  - vars:
      ticket: "I want a refund"
    assert:
      - type: contains
        value: "refund policy"
      - type: llm-rubric
        value: "Empathetic tone and policy limits are stated"
      - type: latency
        threshold: 4000
  - vars:
      ticket: "Please reset my password"
    assert:
      - type: javascript
        value: "output.includes('reset_link')"
```
Selling points:
- Easy on-ramp — one CLI, one YAML.
- Provider plurality — OpenAI, Anthropic, Bedrock, local models behind a single interface.
- Rich assertion types — exact match, regex, LLM-rubric, JavaScript, Python, external tools.
- CI-friendly — exit codes and JSON output drop straight into GitHub Actions.
- Red-teaming — `promptfoo redteam` auto-generates jailbreak and PII-leak tests.
Through 2026, Promptfoo has been expanding beyond pure prompt-eval into agent evaluation — tool-calling support, multi-turn scenarios, an "agent simulator" mode where an LLM plays the user. That said, the "directly orchestrate an agent run" angle is stronger on the Inspect AI side. Promptfoo shines brightest when you want prompts and responses laid out in a table for comparison.
When to pick Promptfoo: fast regression suite, A/B model comparison, daily CI checks.
5. Arize Phoenix — OSS observability plus eval
Arize Phoenix is OSS and unifies tracing plus eval for LLM and agent systems on top of OpenTelemetry. Arize AI built it as the OSS sibling of their commercial product, and they ship OpenInference — a trace spec with GenAI-specific semantics — alongside.
Core idea: agents are distributed systems, so instrument and debug them the way an APM handles microservices. Each task produces a trace tree of LLM calls, tool calls, retrievals, and sub-agent invocations. Phoenix captures these traces and runs evaluators (LLM-as-judge, regex, custom) on top of them.
```python
import pandas as pd

from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    RelevanceEvaluator,
    run_evals,
)

df = pd.DataFrame({
    "input": ["What is the refund policy?"],
    "reference": ["30-day refunds available"],
    "output": ["You can refund within 30 days"],
})

# Evaluators take a model wrapper, not a bare model-name string
eval_model = OpenAIModel(model="gpt-4o")
hallucination = HallucinationEvaluator(eval_model)
relevance = RelevanceEvaluator(eval_model)

# Returns one scored DataFrame per evaluator
results = run_evals(
    dataframe=df,
    evaluators=[hallucination, relevance],
    provide_explanation=True,
)
```
Selling points:
- The trace itself is the eval input — you can run evals over production traces directly.
- OpenInference standard — no vendor lock-in; auto-instrumentation for LangChain, LlamaIndex, Haystack.
- Self-host — `phoenix serve` or a container.
- Built-in RAG evaluators — relevance, hallucination, QA correctness ship out of the box.
When to pick Phoenix: you need RAG/agent observability in production, you like the "eval on top of trace" framing, and you prefer to self-host.
6. LangSmith — LangChain's hosted offering
LangSmith is LangChain Inc.'s hosted tracing/eval platform. It binds tightest to LangChain and LangGraph but accepts traces from arbitrary frameworks through its SDK.
The feature set:
- Trace UI — multi-step agent runs in tree, timeline, and message views.
- Dataset management — promote interesting production traces into a dataset.
- Eval runner — run LLM-as-judge or custom evaluators across a dataset.
- A/B experiments — run different prompts, models, or chains against the same dataset and compare.
- Prompt hub — versioned prompts.
Through 2026 LangSmith has been intentionally shedding the "LangChain-only" image, strengthening arbitrary trace ingestion via SDK, an HTTP API, and OTel compatibility. For LangGraph users it is effectively the default option.
When to pick LangSmith: you are on the LangChain/LangGraph stack, you prefer hosted, and you like having datasets, experiments, and production monitoring under one roof.
Caveats: SaaS pricing, data-governance constraints (especially in regulated industries), and a LangChain dependence that some users find burdensome.
7. OpenAI Evals — the original
OpenAI Evals is the OSS evaluation framework OpenAI released in March 2023, well before "agent eval" was a common phrase. You declare evals in YAML and run them against OpenAI models.
Through 2024 and 2025, faster-moving newcomers like Promptfoo and Inspect AI overtook it, but OpenAI Evals is still used for OpenAI's own model evaluation, and the GitHub repository is broad enough to function as a reference library — a place to look at how other people structure evals.
Where OpenAI Evals sits in 2026:
- It is no longer the default first pick even for OpenAI users — Promptfoo and Inspect are the common choices.
- The value is as a reference collection — 100-plus eval definitions are public.
- The OpenAI Platform Evals hosted web UI is a separate product from the CLI/SDK Evals and lives in the OpenAI console.
When to pick OpenAI Evals: OpenAI-centric setup and you want to reuse the existing public evals. For a greenfield project, the reasons to start here are thinner now.
8. The rest — Braintrust, Helicone, Langfuse
Braintrust — paid SaaS, strong at integrated UX across eval, monitoring, and a playground. The dataset to experiment to production-regression loop is fast. Dataset diffs, side-by-side comparison, and automatic regression detection are well polished. Popular with AI-native startups.
Helicone — OSS plus hosted, started life as an LLM API gateway that proxies all model calls for caching, rate limiting, logging, and cost tracking. Eval features were grafted on later. Closer to request/response monitoring than to deep tracing.
Langfuse — OSS, covers trace plus eval plus prompt management. Similar shape to Phoenix and LangSmith, but OSS-first and self-hosting is a primary deployment model. Popular with Europe-based teams (GDPR-friendly).
None of the three leads with "agent evaluation" in their pitch, yet all three capture production traces and let you run evals on top of them. Which one you pick is largely decided by "what observability infrastructure are you already on" and "self-host or SaaS."
Framework comparison matrix
| Tool | License | Hosting | Strengths | Weaknesses | Best 2026 fit |
|---|---|---|---|---|---|
| Inspect AI | OSS (MIT) | self | agentic, sandboxed, reproducible | learning curve | safety evals, rigorous campaigns, AISI compatibility |
| Promptfoo | OSS (MIT) | self / SaaS | CLI, YAML, CI-friendly, red-team | weak at multi-step orchestration | regression suites, model A/B, prompt-change verification |
| Phoenix | OSS (Elastic 2.0) | self | OTel, OpenInference, RAG | UI less mature than SaaS | self-host observability, RAG eval |
| LangSmith | Closed | SaaS (self option) | LangChain integration, smooth UX | SaaS pricing, LangChain dependence | LangGraph stack, hosted preference |
| OpenAI Evals | OSS (MIT) | self | rich reference set | overtaken by newer tools | OpenAI models reusing existing evals |
| Braintrust | Closed | SaaS | integrated UX, experiment workflow | pricing, lock-in | AI-native startups doing daily evals |
| Helicone | OSS + SaaS | self / SaaS | gateway, cache, cost tracking | trace depth lower than Phoenix | cost/request monitoring first |
| Langfuse | OSS (MIT) | self / SaaS | trace plus prompt management OSS-first | some features SaaS-only | EU/regulated friendly, OSS self-host |
9. Building your first agent eval suite — 7 steps
Enough theory. Here is how a team puts evals on its agent for the first time.
1) Define the task — what does "success" mean, in one sentence
The most commonly skipped step. "A customer support agent" is not a task. "Given a refund request ticket, output a policy-compliant decision and a user-facing reply — the decision is one of four options, the reply is free form" — that is a task. Pin down the input format, the output format, and what "success" means.
2) A golden dataset of 30 to 100 items
If you have production traffic, sample from it; otherwise hand-write. Diversity (easy, medium, hard, edge cases) matters more than volume. Fifty items is a fine start if labeling is expensive.
The minimum structure of an entry:
- Input (task spec plus context)
- Expected outcome (a reference answer or a scoring rubric)
- Metadata (difficulty, category, tags)
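A sketch of one possible entry shape (the field names are illustrative), typically stored one entry per line as JSONL:

```python
entry = {
    "input": {
        "ticket": "I bought this 45 days ago and it broke. Refund?",
        "context": {"purchase_date": "2025-11-20", "product": "SKU-1042"},
    },
    "expected": {
        "decision": "deny_refund",  # reference answer for the binary scorer
        "rubric": "Reply must cite the 30-day window and offer repair options",
    },
    "metadata": {"difficulty": "medium", "category": "refund", "tags": ["edge:past-window"]},
}
```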
3) Scoring function — start with binary task-completion
Start with the simplest scorer: success/failure. Binary. Save partial credit for later. "Do the tests pass," or "is the JSON output schema-valid and the decision field equal to the reference."
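A minimal binary scorer against the entry shape sketched above:

```python
import json


def task_completed(raw_output: str, expected: dict) -> bool:
    # Binary: schema-valid JSON AND the decision field matches the reference
    try:
        output = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not {"decision", "reply"}.issubset(output):
        return False
    return output["decision"] == expected["decision"]
```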
4) Run once, get a baseline
Run the current system over the dataset. The score will almost always be lower than you expected. That is normal. This number is your starting line.
5) Sprinkle in trajectory and efficiency
Looking at only the binary score will miss "lucky completions." So sample some trajectories (tool-call sequences), and record average steps, tokens, and cost alongside the score. Watch how these numbers move together across regressions.
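A sketch of recording the three axes side by side; the record shape is illustrative:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    passed: bool
    steps: int
    tokens: int
    cost_usd: float


def summarize(records: list[RunRecord]) -> dict:
    # Report completion rate and efficiency together, never in isolation
    return {
        "pass_rate": mean(r.passed for r in records),
        "avg_steps": mean(r.steps for r in records),
        "avg_tokens": mean(r.tokens for r in records),
        "avg_cost_usd": mean(r.cost_usd for r in records),
    }
```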
6) Wire it into CI
Hook the eval into CI. When a PR opens, run 30 to 50 items from the dataset and post the table back as a comment. Block merge on regressions. Run the full set nightly because of cost; on PRs, run a fast subset.
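A minimal merge-gate sketch, assuming the eval runner writes a JSON summary with a `pass_rate` field and a baseline is committed to the repo; the file names and tolerance are illustrative:

```python
import json
import sys

TOLERANCE = 0.02  # allow two points of run-to-run noise before blocking


def load_pass_rate(path: str) -> float:
    with open(path) as f:
        return json.load(f)["pass_rate"]


def main() -> int:
    baseline = load_pass_rate("eval/baseline.json")  # committed with the repo
    current = load_pass_rate("eval/results.json")    # written by the PR's eval run
    print(f"baseline={baseline:.3f} current={current:.3f}")
    if current < baseline - TOLERANCE:
        print("Regression detected, blocking merge")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```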
7) Grow the dataset from production traces
Capture interesting cases from production (failures, low scores, user complaints) and add them to the dataset. A month in and your golden set has doubled. An eval is not a one-shot — it is alive.
Anti-patterns
- Run a benchmark once and call it done — an eval is a regression tool.
- One LLM-as-judge for everything — judge drift and bias contaminate every signal.
- Not wired to CI — nobody reads it.
- Dataset that does not reflect production — passes the eval, breaks in production.
10. Real cases
Case 1 — SWE-bench Verified as an agent eval
SWE-bench started as a "coding ability benchmark," but the OpenAI-curated SWE-bench Verified (500 validated tasks) released in 2024 functions in practice as an agent eval — you cannot solve it with the model alone; you need model plus harness plus tools.
Look at the leaderboard and the same model gets different scores under different harnesses (SWE-agent, Aider, Claude Code, Devin). That is agent evaluation in the wild. As of early 2026 the top entries hit 60 to 70 percent — double the 30 percent from a year before. And a significant chunk of that gain came from the harness, not the model.
Practical implication for your team: you should be able to fix the harness and swap models, and fix the model and swap harnesses. Both measurements.
Case 2 — Anthropic's Inspect-based pre-deployment safety evals
Anthropic, under its Responsible Scaling Policy (RSP), runs pre-deployment evaluations before shipping a new model. A meaningful portion is written in Inspect AI format so external evaluators (METR, Apollo, AISI) can reproduce the same evaluations.
Typical evaluation categories:
- Autonomy (autonomous replication, long-horizon agentic tasks)
- Cyber (CTF challenges)
- Bio (specialized knowledge, synthesis pathways)
- Deception
Each is an Inspect task, and results are published with both trajectory and logs preserved. Reproducibility is the heart of governance, and Inspect's serialized log format makes it possible.
Case 3 — Daily eval on a production agent
A common (hypothetical) setup. A team runs a customer support agent in production.
- Nightly: 200 golden items plus 50 production samples, run through Promptfoo.
- Slack alert on regression.
- Weekly: a human scores 100 production traces by hand, and we check correlation with our LLM-as-judge.
- Monthly: dataset refresh — add 30 newly discovered failures from production.
- Quarterly: model/harness change campaign — full dataset, 30 human evals, cost and latency regression check.
When all four loops are in place, "feels better, I think" disappears from standups.
11. The reproducibility problem — measuring non-deterministic agents
The hardest part of agent eval is reproducibility. Models can be made nearly deterministic at temperature 0, but agents cannot.
Sources of non-determinism:
- Model sampling (temperature, top-p).
- Tool side effects (current time, external APIs, file system state).
- Concurrency (the order of parallel tool calls).
- External services (search-result index drift).
Mitigations:
- Average over multiple seeds — run each input N times (typically 3 to 5) and look at the mean and variance. A single-run score will fool you (see the sketch after this list).
- Pin the environment — Docker or sandbox containers for tool runtime. This is exactly why Inspect AI made sandbox first-class from day one.
- Pin time — mock tools like `date` to return a fixed time.
- Reduce external dependence — mock or cache external APIs for evaluation runs.
- Trajectory distribution — how widely trajectories scatter for the same input is itself a signal. High scatter means you need other levers (lower temperature, more seeds).
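A minimal multi-seed sketch; `run_once(sample, seed)` is an assumed callable that executes the agent once and returns a float score in [0, 1]:

```python
from statistics import mean, stdev


def run_with_seeds(run_once, sample, n: int = 5) -> dict:
    # Run one eval sample N times and report the score as a distribution
    scores = [run_once(sample, seed=seed) for seed in range(n)]
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if n > 1 else 0.0,
        "scores": scores,  # keep the raw distribution, not just the summary
    }
```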
Core mindset: an agent score is a distribution, not a number. Single numbers lie.
12. When not to evaluate
People forget: evals are not free. Sometimes the evaluation costs more than it saves.
- The system changes too fast — if the prompt changes weekly and the dataset has to chase, the cost of building a golden set never amortizes. Vibe check plus 5 to 10 unit tests is enough at this stage.
- The usage is too small — a 10-requests-a-day agent does not earn a 200-item dataset.
- The definition of "success" is not nailed down — eval gives a false sense of stability when the task itself is still moving. Lock the definition first.
- Humans are faster — for small, rarely-run evaluations, a person doing it by hand in five minutes is more accurate.
Evals pay off for systems that need regression and comparison. They are overkill for a one-shot prototype. So the first question is always the same: how often will we change this system, and how will we know what the change did? No answer to those questions, no reason to build evals yet.
13. Picking a tool — one-line guide
- Already on LangGraph — LangSmith. Least friction.
- OSS-first, serious about agentic eval — Inspect AI.
- Want a fast CI regression suite — Promptfoo.
- Want eval on top of production trace, self-hosted — Arize Phoenix or Langfuse.
- Prompt regression is the focus, you want SaaS UX — Braintrust.
- Cost and request monitoring first — Helicone.
- Safety and governance evals, external reproducibility required — Inspect AI.
- Multiple tools at once — common. Trace ingestion (Phoenix/Langfuse) plus CI regression (Promptfoo) plus safety campaigns (Inspect) is a typical 2026 stack.
Epilogue — what gets measured is what can be improved
One-sentence summary: measure the agent, not the model. And measure daily, not once.
In 2024 AI engineering meant "which model is smarter." In 2026 it means "how well does our agent perform on our domain." Models are converging. The difference comes from the system you build on top of the model, and that system is yours to own.
Evals are how you own it. When the score is visible every day, "it works" is replaced by "this week's regression is 0.7 percent, average trajectory steps fell from 4.2 to 3.9." The PM's next question changes. The priorities change. What gets fixed and what gets left alone changes.
The next decade of AI engineering is systems engineering, not model engineering. And the first tool of systems engineering is measurement.
A 12-item checklist
- Is the task definition written down (input, output, success)?
- Is there a golden dataset (at least 30 to 50 items)?
- Does the dataset reflect production (difficulty distribution)?
- Did the scorer start at binary task-completion?
- Are trajectories sampled and inspected?
- Are efficiency numbers (steps, tokens, cost) recorded together?
- Is the eval wired to CI?
- Are regressions caught on every PR?
- Is there a loop that grows the dataset from production?
- Is non-determinism handled with multiple seeds?
- If you depend on LLM-as-judge, is it calibrated with human eval?
- Can the eval tool be self-hosted (or is the SaaS choice deliberate)?
Ten anti-patterns
- No eval at all, just "it works" — no measurement, no improvement.
- Run a benchmark once and call it done — eval is a regression tool.
- One LLM-as-judge as the only scorer — bias and drift.
- Dataset that does not reflect production — passes locally, breaks live.
- Never inspect trajectories — lucky completions hide.
- Never measure cost or latency — the bill explodes in production.
- Single-seed scores — noise read as signal.
- Eval not in CI — regressions reach production.
- Only model changes are evaluated, prompts and harness are not — the biggest change axis is missed.
- Conflating "agent eval" with "model eval" — different problems.
Next post
Candidates for the next post:
- Going deep on LLM-as-judge — judge bias and calibration.
- Building eval datasets from production traces — the flow and the tooling.
- Agent regression alerting — wiring evals to Slack/PagerDuty.
"You cannot improve what you do not measure. And the agent is not the model — it is a system. Measure the system."
— Agent evaluation systems in 2026, end.
References
- Inspect AI — UK AI Security Institute
- Inspect AI GitHub — UKGovernmentBEIS/inspect_ai
- Inspect AI Documentation
- Promptfoo — Open-source LLM eval
- Promptfoo GitHub — promptfoo/promptfoo
- Promptfoo Red Teaming
- Arize Phoenix — OSS observability
- Phoenix GitHub — Arize-ai/phoenix
- OpenInference — trace spec
- LangSmith — LangChain Inc.
- LangSmith Documentation
- OpenAI Evals GitHub — openai/evals
- OpenAI Platform — Evals
- Braintrust — eval/observability SaaS
- Helicone — LLM observability/gateway
- Langfuse — OSS LLM engineering
- Langfuse GitHub — langfuse/langfuse
- SWE-bench Verified — OpenAI announcement
- SWE-bench leaderboard
- Anthropic — Responsible Scaling Policy
- METR — Model evaluations
- Apollo Research
- WebArena benchmark
- OSWorld benchmark
- tau-bench — Sierra AI