OpenAI RFT with Custom Graders: A Practical Guide for Product and Platform Teams
Contents
- What RFT is
- When to use RFT instead of prompt tuning
- Why custom graders matter
- The five-step loop
- Data preparation that holds up in practice
- Evaluation discipline for training and rollout
- A decision checklist
- Official links
What RFT is
OpenAI reinforcement fine-tuning, or RFT, adapts a reasoning model using a feedback signal defined through a programmable grader. The point is not only to improve style or compliance. The point is to aim for expert-level performance inside a specific domain.
That makes RFT different from prompt tuning alone. Prompting is great for clearer instructions, better formatting, and quick iteration. RFT is better when the task has a clear success criterion and the model can be trained against a repeatable signal.
Current public docs describe RFT as supported on o-series reasoning models, and OpenAI's guide names o4-mini and o4-mini-2025-04-16 specifically. OpenAI's October 6, 2025 AgentKit launch post says RFT reached general availability on o4-mini, with GPT-5 support in private beta and custom graders in beta.
When to use RFT instead of prompt tuning
Prompt tuning should come first when the problem is mainly about wording, tone, or a small behavior adjustment. RFT becomes attractive when you need the model to get reliably better at a task that is hard to specify in prose but easy to score.
RFT is usually a good fit when:
- the task is unambiguous
- success criteria are stable
- experts can agree on what good looks like
- the workflow depends on stronger reasoning in a narrow domain
- failures are expensive enough to justify the extra setup work
That makes RFT a strong option for policy review, security analysis, legal passage selection, structured decision support, and similar workflows where a rubric can be applied consistently.
Why custom graders matter
Custom graders are the center of the RFT workflow. They convert business rules into training signal.
For product teams, the grader can encode what matters in the product itself:
- did the model choose the right action
- did it follow policy
- did it produce the expected structured output
- did it avoid unsupported claims
- did it stay in scope
For platform teams, the grader is the bridge between evaluation and training. It forces the team to define success before the run starts, and it makes checkpoint comparisons much more useful because the same rubric can be reused across experiments.
If a human expert can score the output reliably and the task can be reduced to a clear numeric reward, that is often a strong candidate for a custom grader.
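To make that concrete, here is a minimal sketch of such a grader for a hypothetical policy-review task. The rubric fields, weights, and schema below are illustrative assumptions, not part of any OpenAI API; the only contract that matters is that the function maps one model response to a numeric reward.

```python
import json

# Hypothetical rubric for a policy-review task: the field names and
# weights are illustrative, not taken from OpenAI's docs.
REQUIRED_FIELDS = {"decision", "policy_id", "rationale"}
ALLOWED_DECISIONS = {"approve", "reject", "escalate"}

def grade(response_text: str, expected: dict) -> float:
    """Map one model response to a numeric reward in [0.0, 1.0]."""
    try:
        out = json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # unparseable output earns no reward

    score = 0.0
    if REQUIRED_FIELDS <= out.keys():
        score += 0.3  # structure: all required fields present
    if out.get("decision") in ALLOWED_DECISIONS:
        score += 0.2  # scope: decision drawn from the allowed set
    if out.get("decision") == expected["decision"]:
        score += 0.3  # correctness: matches the expert label
    if out.get("policy_id") == expected["policy_id"]:
        score += 0.2  # grounding: cites the right policy
    return round(score, 2)
```

Note the design choice: partial credit for structure and scope keeps the reward signal smooth, so the model gets gradient even on responses that are wrong but well-formed.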
The five-step loop
OpenAI's docs describe RFT as a five-step loop:
- Implement a grader that assigns a numeric reward to each model response.
- Upload your prompt dataset and designate a validation split.
- Start the fine-tune job.
- Monitor and evaluate checkpoints, then revise the data or grader if needed.
- Deploy the resulting model through the standard API.
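The first three steps of the loop can be sketched as a single job payload: the grader object is step 1, the file IDs are step 2, and submitting the payload is step 3. The field names below follow the shape of OpenAI's public RFT docs as I read them; treat them as assumptions and verify against the current API reference before use. The file IDs are placeholders.

```python
# Sketch of an RFT job payload; field names are assumptions based on
# OpenAI's published docs and may differ from the current API.
grader = {
    "type": "string_check",           # simplest built-in grader type
    "name": "decision_match",
    "input": "{{sample.output_text}}",
    "reference": "{{item.decision}}",
    "operation": "eq",                # reward 1 on exact match, else 0
}

job_payload = {
    "model": "o4-mini-2025-04-16",    # RFT-capable model named in the guide
    "training_file": "file-TRAIN",    # placeholder upload IDs
    "validation_file": "file-VALID",
    "method": {
        "type": "reinforcement",
        "reinforcement": {"grader": grader},
    },
}
```

Steps 4 and 5, monitoring checkpoints and deploying, happen against the job this payload creates, using the same fine-tuning endpoints as supervised jobs.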
That loop is important because RFT is not just a training command. You are building a measurement system first, then using it to shape model behavior. If the grader is weak, the model can learn the wrong lesson. If the dataset is too narrow, it can overfit to an unhelpful pattern.
Data preparation that holds up in practice
Most RFT projects succeed or fail on the quality of the data.
Start with real production-like tasks. Include easy examples, borderline examples, and genuinely hard cases. Add situations where the model should refuse, ask for more information, or follow a policy instead of guessing.
A practical dataset usually includes:
- representative prompts from real workflows
- separate training and validation splits
- edge cases, not only happy paths
- labeling rules experts can apply consistently
- enough variety to avoid brittle shortcuts
If the task uses structured output, define the schema clearly and evaluate that structure directly. If the task is text based, make sure the grader does not reward polished wording at the expense of correctness.
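One practical way to keep the splits above clean is to route examples deterministically rather than sampling at random. The sketch below assumes each dataset record carries a stable `id` field, which is an assumption about your schema, not an RFT requirement.

```python
import hashlib

def split_example(example: dict, validation_pct: int = 10) -> str:
    """Deterministically route one example to 'train' or 'validation'.

    Hashing a stable ID (instead of random sampling) keeps the split
    reproducible across reruns, so validation prompts never leak into
    training when the dataset is regenerated.
    """
    digest = hashlib.sha256(example["id"].encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "validation" if bucket < validation_pct else "train"
```

Because the routing depends only on the ID, adding new examples later never reshuffles old ones between splits, which keeps checkpoint comparisons honest.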
Evaluation discipline for training and rollout
RFT works best when evaluation is treated as part of the product process.
Before training, define what success means on the validation set and what production failure would look like. During training, inspect checkpoints instead of waiting only for the final model. After training, review representative prompts separately to check for overfitting, reward hacking, or behavior drift.
The safest rollout habits are:
- keep validation data separate from training data
- compare against a prompt-only baseline
- track quality, cost, and latency together
- test edge cases and refusal cases
- keep human review for high-impact releases
That discipline matters because a better grader score does not always mean a better product. A model that wins the rubric but slows the workflow down or creates new risk is not a real improvement.
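A simple way to enforce that discipline is a release gate that a checkpoint must pass before rollout. This is a sketch under assumed inputs, lists of per-prompt grader scores, and the threshold is illustrative, not from OpenAI's docs.

```python
from statistics import mean

def passes_release_gate(baseline_scores: list[float],
                        checkpoint_scores: list[float],
                        edge_case_scores: list[float],
                        min_edge_score: float = 0.8) -> bool:
    """Gate a checkpoint: it must beat the prompt-only baseline on the
    overall validation set AND hold a floor on edge/refusal cases."""
    beats_baseline = mean(checkpoint_scores) > mean(baseline_scores)
    holds_edges = mean(edge_case_scores) >= min_edge_score
    return beats_baseline and holds_edges
```

The two conditions are deliberately separate: a checkpoint that improves the average while regressing on refusal cases fails the gate, which is exactly the reward-hacking pattern the section above warns about.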
A decision checklist
Before starting an RFT project, ask:
- Can we describe the task clearly enough for a grader to score it?
- Do domain experts agree on what a correct answer looks like?
- Is the task important enough to justify data prep and training overhead?
- Has prompt tuning already reached diminishing returns?
- Do we have enough data for clean training and validation splits?
- Can we compare checkpoints against a baseline?
- Are we willing to gate rollout if the model regresses on edge cases?
If most of those answers are yes, RFT is likely worth testing. If the task is still broad, vague, or mostly about wording, prompt tuning is usually the simpler and cheaper choice.
Official links