OpenAI AgentKit and the New Agent Evaluation Workflow: A Practical Guide to Datasets, Trace Grading, and Prompt Optimization
- Why AgentKit matters now
- What AgentKit changes for teams
- Why agent evals matter more than single-model benchmarks
- How datasets, trace grading, and prompt optimization fit together
- A practical agent evaluation loop
- A realistic rollout checklist for product teams
- What each team should pay attention to
- Final thoughts
Why AgentKit matters now
On October 6, 2025, OpenAI introduced AgentKit, describing it as a complete set of tools to build, deploy, and optimize agents. The launch brings Agent Builder, Connector Registry, and ChatKit, and adds new evaluation capabilities including datasets, trace grading, automated prompt optimization, and support for third-party models.
The practical takeaway is bigger than a product launch. These tools build on the Responses API and Agents SDK, which means the workflow for creating an agent is becoming more tightly connected to the workflow for evaluating and improving it. Teams no longer need to treat agent quality as a loose collection of prompts, demos, and manual spot checks.
This article focuses on what that means in practice for engineers, PMs, and platform teams. The goal is not to repeat marketing copy, but to explain how the new evaluation workflow can help teams ship agents more reliably.
What AgentKit changes for teams
Many AI projects still operate with a model-centric mindset. Teams compare models, tune prompts, and run a few examples manually. That approach breaks down once an agent has to perform real work across tools, policies, external systems, and user-facing flows.
AgentKit shifts the center of gravity from isolated model output to end-to-end task execution.
That matters because agent quality depends on more than whether a single response sounds good. It depends on questions like these:
- Did the agent choose the right tool at the right time?
- Did it gather enough context before taking action?
- Did it follow policy and permission boundaries?
- Did it recover cleanly when a step failed?
- Did the final answer actually help the user complete the task?
Agent Builder helps teams structure and iterate on agents. Connector Registry helps manage system access and integrations. ChatKit supports the user-facing interaction layer. The new evaluation features matter because they make the whole system measurable instead of anecdotal.
In other words, AgentKit pushes teams toward managing agent performance as an operational system, not just as prompt quality.
Why agent evals matter more than single-model benchmarks
Single-model benchmarks are still useful. They can help with model selection, cost planning, and baseline capability checks. But they do not capture how a production agent behaves across a full task path.
In real products, many failures happen in the glue between steps rather than in the model alone.
- A capable model may still choose the wrong tool.
- A tool call may succeed, but the wrong intermediate state may be passed downstream.
- The final answer may look reasonable while the execution path is too slow or expensive.
- A workflow may appear successful while violating approval, security, or audit requirements.
This is why agent evaluation should be treated as a workflow problem, not just a model scoring problem.
For production teams, the more important questions are usually these:
- Which task types fail most often?
- Do failures come from reasoning, routing, tool use, or missing context?
- Does a change actually fix the same class of failures?
- Do success rate, latency, and cost improve together, or do they fight each other?
That is where the new AgentKit evaluation workflow becomes useful. It gives teams a structure for measuring the behavior of the full agent system.
How datasets, trace grading, and prompt optimization fit together
The most useful way to understand the new workflow is not as a set of separate features, but as a loop.
1. Datasets define what good looks like
For agent teams, a dataset is not just a pile of examples. It is a structured set of tasks the product must handle well.
A strong dataset usually includes:
- common user requests that represent real production demand
- high-value workflows with clear business impact
- edge cases such as ambiguity, missing information, or permission limits
- explicit success criteria or grading rules
The key idea is simple: the dataset should reflect work the agent truly needs to complete, not just prompts that look impressive in a demo.
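To make that concrete, here is a minimal sketch of what a task-oriented dataset entry could look like. All field names and values are illustrative assumptions, not an AgentKit schema:

```python
# A minimal, illustrative dataset: each entry pairs a realistic task with
# explicit success criteria. Field names are hypothetical, not an AgentKit schema.
dataset = [
    {
        "task": "Refund order #1042 placed less than 30 days ago",
        "kind": "common_request",
        "success_criteria": ["refund issued", "policy window checked"],
    },
    {
        "task": "Refund an order with no order number given",
        "kind": "edge_case_ambiguity",
        "success_criteria": ["agent asks for the order number before acting"],
    },
]

def coverage(entries):
    """Count how many distinct task kinds the dataset covers."""
    return len({e["kind"] for e in entries})

print(coverage(dataset))  # distinct task kinds represented
```

A simple coverage check like this is one way to verify the dataset mixes common requests with edge cases before it is used for grading.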
2. Trace grading shows where the workflow breaks
If you only score final answers, you learn that something failed but not why. Trace grading expands evaluation to the steps inside the run.
That makes it easier to answer questions such as:
- Was the first tool choice appropriate?
- Did the agent inspect enough evidence before acting?
- Did it make unnecessary repeated calls?
- Did it follow the right guardrails before accessing sensitive data?
This is especially valuable for debugging. Instead of treating agent quality as a black box, teams can inspect which stage of the execution path caused the failure.
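The idea can be sketched with plain functions over a recorded trace. The step shapes, tool names, and grader names below are assumptions for illustration, not a real trace format:

```python
# Hypothetical trace: an ordered list of steps recorded during one agent run.
# Step shapes and tool names are illustrative assumptions only.
trace = [
    {"step": "tool_call", "tool": "search_orders", "ok": True},
    {"step": "tool_call", "tool": "search_orders", "ok": True},  # repeated call
    {"step": "tool_call", "tool": "issue_refund", "ok": True},
    {"step": "final_answer", "text": "Refund issued."},
]

def grade_no_repeated_calls(trace):
    """Fail if the same tool is called twice in a row."""
    tools = [s["tool"] for s in trace if s["step"] == "tool_call"]
    return all(a != b for a, b in zip(tools, tools[1:]))

def grade_acted_after_evidence(trace):
    """Require at least one lookup before any state-changing tool call."""
    seen_lookup = False
    for s in trace:
        if s["step"] != "tool_call":
            continue
        if s["tool"] == "issue_refund" and not seen_lookup:
            return False
        if s["tool"] == "search_orders":
            seen_lookup = True
    return True

graders = {
    "no_repeated_calls": grade_no_repeated_calls,
    "acted_after_evidence": grade_acted_after_evidence,
}
report = {name: fn(trace) for name, fn in graders.items()}
print(report)
```

Even this toy report shows the point: the run looks fine at the final-answer level, yet step-level grading surfaces the wasteful repeated call.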
3. Automated prompt optimization improves iteration speed
Once teams have representative datasets and trace-level grading, prompt optimization becomes much more useful. It is no longer just about trying alternative wording. It becomes part of a controlled loop for testing whether prompt changes improve real task outcomes.
Two operating rules matter here:
- optimization should always be evaluated against a representative dataset
- score gains should be checked against product goals, not treated as wins by default
A prompt that increases one benchmark score while hurting safety, latency, or user trust is not a meaningful improvement. The value of optimization comes from running it inside a broader evaluation system.
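The controlled-loop idea can be sketched as a side-by-side comparison on one dataset. `run_agent` here is a toy stand-in for whatever actually executes the agent, an assumption rather than a real API:

```python
# Compare two prompt variants on the same dataset; a score gain only counts
# if it holds on the tasks the product cares about. run_agent is a toy
# stand-in for real agent execution -- an assumption, not a real API.
def run_agent(prompt, task):
    # Toy behavior: only a prompt that mentions clarification
    # handles ambiguous tasks correctly.
    if "ambiguous" in task and "clarify" not in prompt:
        return "wrong"
    return "correct"

dataset = ["plain task", "ambiguous task", "plain task"]

def success_rate(prompt):
    results = [run_agent(prompt, t) for t in dataset]
    return sum(r == "correct" for r in results) / len(results)

baseline = success_rate("answer directly")
candidate = success_rate("answer directly; clarify when ambiguous")
print(round(baseline, 2), round(candidate, 2))
```

The same harness is where the second rule applies: before accepting the candidate, a team would also check latency, cost, and safety metrics, not just the headline rate.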
A practical agent evaluation loop
For most product teams, a realistic rollout looks like this:
- Build a dataset from real user requests and target workflows.
- Define success and failure in business terms, not only model terms.
- Capture execution traces and attach grading criteria to important steps.
- Group recurring failure patterns.
- Adjust prompts, tool descriptions, routing logic, or model configuration.
- Re-run the same dataset and compare results.
- Expand production exposure only after improvements hold up.
This loop replaces vague confidence with repeatable evidence. Engineers get clearer debugging signals. PMs get more defensible launch criteria. Platform teams get a shared operating model instead of ad hoc evaluation habits across teams.
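The steps above can be compressed into one function. Everything here is a deliberately tiny stand-in: `agent`, `grade`, and `adjust` would wrap real prompts, traces, and graders in practice:

```python
# A compact sketch of the evaluation loop above. All callables are toy
# stand-ins; in practice they wrap real prompts, traces, and graders.
def evaluation_loop(dataset, agent, grade, adjust, max_rounds=5, target=0.9):
    """Grade every case, adjust the agent, and re-run the SAME dataset."""
    rate = 0.0
    for rounds in range(1, max_rounds + 1):
        scores = [grade(agent, case) for case in dataset]
        rate = sum(scores) / len(scores)
        if rate >= target:
            break                      # improvements held up; stop adjusting
        agent = adjust(agent)          # prompts, tools, routing, config...
    return rounds, rate

# Toy example: cases are difficulty levels; each adjustment raises skill by 1.
cases = [1, 2, 3, 4, 5]
rounds, rate = evaluation_loop(
    cases,
    agent={"skill": 2},
    grade=lambda a, c: c <= a["skill"],
    adjust=lambda a: {"skill": a["skill"] + 1},
)
print(rounds, rate)
```

The essential property is that the dataset never changes between rounds, so a score movement reflects the change to the agent, not a change in the test.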
A realistic rollout checklist for product teams
Teams do not need to deploy everything at once. A phased rollout is usually safer and faster.
Start with one high-value workflow
Pick a workflow that matters enough to justify evaluation effort, but narrow enough to control risk.
Good candidates often include:
- support operations with measurable resolution quality
- internal knowledge tasks with recurring patterns
- sales or research workflows where output quality can be reviewed
The best first workflow is one where success is visible and failure is manageable.
Build a dataset that reflects real production pressure
Early datasets do not need to be huge. They do need to be representative.
- include recent real requests if possible
- mix straightforward cases with ambiguous ones
- include examples where the agent should ask for clarification
- label outcomes that matter to the product, not just the model
For example, a support agent should not only be scored on answer quality. It may also need to be evaluated on time to resolution, escalation rate, or whether it avoided unnecessary follow-up.
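One way to encode that is a composite score over product outcomes. The weights, field names, and thresholds below are illustrative assumptions, not recommended values:

```python
# Score a support-agent run on product outcomes, not only answer quality.
# Weights, field names, and thresholds are illustrative assumptions.
def product_score(run):
    score = 0.0
    score += 0.5 if run["answer_correct"] else 0.0        # answer quality
    score += 0.3 if run["resolution_minutes"] <= 10 else 0.0  # time to resolution
    score += 0.2 if not run["needed_followup"] else 0.0   # avoided follow-up
    return score

run = {"answer_correct": True, "resolution_minutes": 14, "needed_followup": False}
print(product_score(run))  # correct but slow: partial credit only
```

A run like this one, correct but slow, is exactly the kind of case that a pure answer-quality grader would over-score.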
Keep trace grading narrow at first
Trying to grade every internal step from day one often creates overhead without enough signal. Start with the steps that matter most.
- tool choice
- evidence quality
- policy compliance
- final action safety
This gives teams useful debugging leverage without turning evaluation into a heavyweight process.
Use automated prompt optimization with guardrails
Optimization should accelerate learning, not create hidden overfitting.
- separate tuning and validation sets
- watch cost and latency along with quality
- keep policy and safety checks outside the optimization score when needed
- preserve human review before broad rollout
Let platform teams provide shared standards
As more teams build agents, quality will drift unless someone owns the common layer. Platform teams should usually provide:
- dataset templates
- example trace grading rubrics
- pre-release acceptance thresholds
- common reporting for cost, latency, and quality
- a library of known failure patterns
That shared foundation is what makes the evaluation workflow scale across the organization.
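A pre-release acceptance threshold, one item on that list, might be published as a small shared check. The metric names and numbers here are invented for illustration:

```python
# Shared pre-release acceptance thresholds a platform team might publish.
# Metric names and numbers are illustrative assumptions.
THRESHOLDS = {"success_rate": 0.85, "p95_latency_s": 8.0, "cost_per_task_usd": 0.05}

def meets_release_bar(metrics):
    """Return the list of failed checks; an empty list means release-ready."""
    failures = []
    if metrics["success_rate"] < THRESHOLDS["success_rate"]:
        failures.append("success_rate")
    if metrics["p95_latency_s"] > THRESHOLDS["p95_latency_s"]:
        failures.append("p95_latency_s")
    if metrics["cost_per_task_usd"] > THRESHOLDS["cost_per_task_usd"]:
        failures.append("cost_per_task_usd")
    return failures

print(meets_release_bar({"success_rate": 0.9, "p95_latency_s": 9.5,
                         "cost_per_task_usd": 0.04}))  # ['p95_latency_s']
```

Publishing the check as code, rather than as a wiki page, is one way platform teams keep every agent team measuring quality, latency, and cost the same way.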
What each team should pay attention to
Engineers
- Inspect execution paths, not just final outputs.
- Design tools and state transitions with observability in mind.
- Re-run the same dataset after every meaningful change.
PMs
- Tie evaluation scores to product outcomes.
- Separate impressive demos from dependable operations.
- Define which failures are acceptable and which are launch blockers.
Platform teams
- Treat evaluation assets as shared infrastructure.
- Standardize measurement without blocking team-level experimentation.
- Use the data from systems built on the Responses API and Agents SDK to build durable operating standards.
Final thoughts
AgentKit matters because it connects agent creation and agent improvement in a more coherent workflow. The October 6, 2025 announcement is important not only for Agent Builder, Connector Registry, or ChatKit, but for the message behind the evaluation layer: datasets, trace grading, and automated prompt optimization are becoming first-class parts of shipping agents well.
As agent products mature, the competitive edge will likely come less from model choice alone and more from how quickly teams can measure failures, understand execution paths, and ship safer improvements. That is the operational shift AgentKit signals.