OpenAI RFT with Custom Graders: A Practical Guide for Product and Platform Teams
Contents
- What RFT is
- When to use RFT instead of prompt tuning
- Why custom graders matter
- The five-step loop
- Data preparation that holds up in practice
- Evaluation discipline for training and rollout
- A decision checklist
- Official links
What RFT is
OpenAI reinforcement fine-tuning, or RFT, adapts a reasoning model using a feedback signal defined through a programmable grader. The point is not only to improve style or compliance. The point is to aim for expert-level performance inside a specific domain.
That makes RFT different from prompt tuning alone. Prompting is great for clearer instructions, better formatting, and quick iteration. RFT is better when the task has a clear success criterion and the model can be trained against a repeatable signal.
Current public docs describe RFT as supported on o-series reasoning models, and OpenAI's guide names o4-mini and o4-mini-2025-04-16 specifically. OpenAI's October 6, 2025 AgentKit launch post says RFT reached general availability on o4-mini, with GPT-5 support in private beta and custom graders in beta.
When to use RFT instead of prompt tuning
Prompt tuning should come first when the problem is mainly about wording, tone, or a small behavior adjustment. RFT becomes attractive when you need the model to get reliably better at a task that is hard to specify in prose but easy to score.
RFT is usually a good fit when:
- the task is unambiguous
- success criteria are stable
- experts can agree on what good looks like
- the workflow depends on stronger reasoning in a narrow domain
- failures are expensive enough to justify the extra setup work
That makes RFT a strong option for policy review, security analysis, legal passage selection, structured decision support, and similar workflows where a rubric can be applied consistently.
Why custom graders matter
Custom graders are the center of the RFT workflow. They convert business rules into training signal.
For product teams, the grader can encode what matters in the product itself:
- did the model choose the right action
- did it follow policy
- did it produce the expected structured output
- did it avoid unsupported claims
- did it stay in scope
For platform teams, the grader is the bridge between evaluation and training. It forces the team to define success before the run starts, and it makes checkpoint comparisons much more useful because the same rubric can be reused across experiments.
If a human expert can score the output reliably and the task can be reduced to a clear numeric reward, that is often a strong candidate for a custom grader.
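To make that concrete, here is a minimal sketch of such a grader for a hypothetical policy-review task. The rubric fields, weights, and schema below are illustrative assumptions, not part of any OpenAI API; the only contract that matters is that the function maps one model response to a numeric reward.

```python
import json

# Hypothetical rubric for a policy-review task: the field names and
# weights are illustrative, not taken from OpenAI's docs.
REQUIRED_FIELDS = {"decision", "policy_id", "rationale"}
ALLOWED_DECISIONS = {"approve", "reject", "escalate"}

def grade(response_text: str, expected: dict) -> float:
    """Map one model response to a numeric reward in [0.0, 1.0]."""
    try:
        out = json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # unparseable output earns no reward

    score = 0.0
    if REQUIRED_FIELDS <= out.keys():
        score += 0.3  # structure: all required fields present
    if out.get("decision") in ALLOWED_DECISIONS:
        score += 0.2  # scope: decision drawn from the allowed set
    if out.get("decision") == expected["decision"]:
        score += 0.3  # correctness: matches the expert label
    if out.get("policy_id") == expected["policy_id"]:
        score += 0.2  # grounding: cites the right policy
    return round(score, 2)
```

Note the design choice: partial credit for structure and scope keeps the reward signal smooth, so the model gets gradient even on responses that are wrong but well-formed.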
The five-step loop
OpenAI's docs describe RFT as a five-step loop:
- Implement a grader that assigns a numeric reward to each model response.
- Upload your prompt dataset and designate a validation split.
- Start the fine-tune job.
- Monitor and evaluate checkpoints, then revise the data or grader if needed.
- Deploy the resulting model through the standard API.
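The first three steps of the loop can be sketched as a single job payload: the grader object is step 1, the file IDs are step 2, and submitting the payload is step 3. The field names below follow the shape of OpenAI's public RFT docs as I read them; treat them as assumptions and verify against the current API reference before use. The file IDs are placeholders.

```python
# Sketch of an RFT job payload; field names are assumptions based on
# OpenAI's published docs and may differ from the current API.
grader = {
    "type": "string_check",           # simplest built-in grader type
    "name": "decision_match",
    "input": "{{sample.output_text}}",
    "reference": "{{item.decision}}",
    "operation": "eq",                # reward 1 on exact match, else 0
}

job_payload = {
    "model": "o4-mini-2025-04-16",    # RFT-capable model named in the guide
    "training_file": "file-TRAIN",    # placeholder upload IDs
    "validation_file": "file-VALID",
    "method": {
        "type": "reinforcement",
        "reinforcement": {"grader": grader},
    },
}
```

Steps 4 and 5, monitoring checkpoints and deploying, happen against the job this payload creates, using the same fine-tuning endpoints as supervised jobs.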
That loop is important because RFT is not just a training command. You are building a measurement system first, then using it to shape model behavior. If the grader is weak, the model can learn the wrong lesson. If the dataset is too narrow, it can overfit to an unhelpful pattern.
Data preparation that holds up in practice
Most RFT projects succeed or fail on the quality of the data.
Start with real production-like tasks. Include easy examples, borderline examples, and genuinely hard cases. Add situations where the model should refuse, ask for more information, or follow a policy instead of guessing.
A practical dataset usually includes:
- representative prompts from real workflows
- separate training and validation splits
- edge cases, not only happy paths
- labeling rules experts can apply consistently
- enough variety to avoid brittle shortcuts
If the task uses structured output, define the schema clearly and evaluate that structure directly. If the task is text based, make sure the grader does not reward polished wording at the expense of correctness.
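One practical way to keep the splits above clean is to route examples deterministically rather than sampling at random. The sketch below assumes each dataset record carries a stable `id` field, which is an assumption about your schema, not an RFT requirement.

```python
import hashlib

def split_example(example: dict, validation_pct: int = 10) -> str:
    """Deterministically route one example to 'train' or 'validation'.

    Hashing a stable ID (instead of random sampling) keeps the split
    reproducible across reruns, so validation prompts never leak into
    training when the dataset is regenerated.
    """
    digest = hashlib.sha256(example["id"].encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "validation" if bucket < validation_pct else "train"
```

Because the routing depends only on the ID, adding new examples later never reshuffles old ones between splits, which keeps checkpoint comparisons honest.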
Evaluation discipline for training and rollout
RFT works best when evaluation is treated as part of the product process.
Before training, define what success means on the validation set and what production failure would look like. During training, inspect checkpoints instead of waiting only for the final model. After training, review representative prompts separately to check for overfitting, reward hacking, or behavior drift.
The safest rollout habits are:
- keep validation data separate from training data
- compare against a prompt-only baseline
- track quality, cost, and latency together
- test edge cases and refusal cases
- keep human review for high-impact releases
That discipline matters because a better grader score does not always mean a better product. A model that wins the rubric but slows the workflow down or creates new risk is not a real improvement.
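A simple way to enforce that discipline is a release gate that a checkpoint must pass before rollout. This is a sketch under assumed inputs, lists of per-prompt grader scores, and the threshold is illustrative, not from OpenAI's docs.

```python
from statistics import mean

def passes_release_gate(baseline_scores: list[float],
                        checkpoint_scores: list[float],
                        edge_case_scores: list[float],
                        min_edge_score: float = 0.8) -> bool:
    """Gate a checkpoint: it must beat the prompt-only baseline on the
    overall validation set AND hold a floor on edge/refusal cases."""
    beats_baseline = mean(checkpoint_scores) > mean(baseline_scores)
    holds_edges = mean(edge_case_scores) >= min_edge_score
    return beats_baseline and holds_edges
```

The two conditions are deliberately separate: a checkpoint that improves the average while regressing on refusal cases fails the gate, which is exactly the reward-hacking pattern the section above warns about.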
A decision checklist
Before starting an RFT project, ask:
- Can we describe the task clearly enough for a grader to score it?
- Do domain experts agree on what a correct answer looks like?
- Is the task important enough to justify data prep and training overhead?
- Has prompt tuning already reached diminishing returns?
- Do we have enough data for clean training and validation splits?
- Can we compare checkpoints against a baseline?
- Are we willing to gate rollout if the model regresses on edge cases?
If most of those answers are yes, RFT is likely worth testing. If the task is still broad, vague, or mostly about wording, prompt tuning is usually the simpler and cheaper choice.
Official links