From DPO to KTO: Latest Human Feedback Alignment Techniques Paper Review and Practical Implementation


Why Alignment Techniques Beyond RLHF Are Needed

Alignment -- the task of steering LLMs toward human preferences -- has been an essential part of every production LLM pipeline since ChatGPT. The RLHF (Reinforcement Learning from Human Feedback) pipeline established by OpenAI in InstructGPT (Ouyang et al., 2022, arxiv:2203.02155) consists of three stages: instill basic capabilities through SFT (Supervised Fine-Tuning), train a separate Reward Model, and optimize the policy with PPO (Proximal Policy Optimization).

This 3-stage pipeline is powerful but operationally expensive. Training the reward model requires separate GPU resources, and during PPO training, four models -- policy model, reference model, reward model, and value model -- must be loaded in memory simultaneously. For a 70B model, this means at least 8 A100 80GB GPUs. Training stability is also an issue. With many sensitive hyperparameters such as PPO clipping ratio, KL penalty coefficient, and GAE lambda, convergence on the first attempt is rare.

Starting in the second half of 2023, research aimed at fundamentally reducing this complexity surged. DPO showed that a policy can be optimized directly from preference data without a reward model, IPO pointed out the risks of the Bradley-Terry assumption, and KTO proved that alignment is possible using only binary signals, without pairwise preference data. In 2024-2025, SimPO, ORPO, GRPO, and others emerged, rapidly expanding the range of alignment technique options.

This article reviews papers from RLHF through DPO, IPO, KTO, and the latest techniques, and covers practical implementation code using the Hugging Face TRL library, hyperparameter tuning strategies, failure cases, and recovery procedures.

Structure and Limitations of the RLHF Pipeline

3-Stage Pipeline Details

The full RLHF flow can be summarized mathematically as follows.

Stage 1 - SFT: Fine-tune the base model with high-quality instruction-response pairs.

Stage 2 - Reward Model Training: Generate two responses for the same prompt, and have human evaluators label the preferred response (chosen) and the non-preferred response (rejected). Train a reward function r(x, y) based on the Bradley-Terry model.

L_RM = -E[log sigma(r(x, y_w) - r(x, y_l))]

Here, y_w is the chosen response and y_l is the rejected response.
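The reward-model loss above can be sketched in a few lines of PyTorch. This is an illustrative toy, not production reward-model code: r_chosen and r_rejected stand in for the reward head's scalar outputs for y_w and y_l.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry loss: -E[log sigma(r(x, y_w) - r(x, y_l))]
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_w = torch.tensor([1.2, 0.8, 2.0])   # rewards for chosen responses
r_l = torch.tensor([0.3, -0.5, 1.9])  # rewards for rejected responses
loss = reward_model_loss(r_w, r_l)
```

Note that only the margin r(x, y_w) - r(x, y_l) matters: a pair with a wide margin (e.g. 5.0 vs -5.0) contributes almost no loss, while a near-tie contributes close to log 2.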

Stage 3 - PPO Optimization: Maximize the reward model score while constraining KL divergence from the reference policy.

max E[r(x, y)] - beta * KL(pi_theta || pi_ref)

Practical Limitations of RLHF

| Limitation | Description |
| --- | --- |
| Memory Cost | Simultaneous loading of 4 models: Policy, Reference, Reward, Value |
| Training Instability | Many sensitive hyperparameters: PPO clipping, KL coefficient, learning rate |
| Reward Hacking | Policy can learn to exploit weaknesses in the reward model |
| Data Cost | Human evaluator costs for collecting pairwise comparison data |
| Reproducibility | Results vary significantly by seed even with identical settings |

These limitations are the background for the emergence of RL-free alignment techniques like DPO.

DPO: Direct Preference Optimization Without a Reward Model

Key Paper Review

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023, NeurIPS 2023, arxiv:2305.18290)

The core insight of DPO is concise. The reward function optimization problem in RLHF has a closed-form solution, and by leveraging this, policies can be directly optimized from preference data without explicitly training a reward model.

The optimal policy of RLHF takes the following form:

pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp(r(x, y) / beta)

Solving this inversely allows expressing the reward function as a ratio of policies:

r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) + beta * log Z(x)

Substituting this relationship into the Bradley-Terry model cancels the Z(x) term, yielding the final DPO loss function:

L_DPO = -E[log sigma(beta * (log(pi_theta(y_w|x)/pi_ref(y_w|x)) - log(pi_theta(y_l|x)/pi_ref(y_l|x))))]

The important point is that DPO does not "remove" the reward model but "implicitly includes" it. Since the policy itself serves as the reward model, separate reward model training and PPO stages become unnecessary.
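Assuming the per-response log-probabilities have already been summed over response tokens, the DPO loss can be computed directly from the formula above. This is a minimal illustration, not TRL's internal implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument has shape (batch,): log pi(y|x) summed over response tokens
    chosen_logratio = policy_chosen_logps - ref_chosen_logps      # log(pi/pi_ref) for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log(pi/pi_ref) for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy example: policy slightly prefers chosen relative to the reference
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.0]))
```

When the policy equals the reference, the logits are zero and the loss is log 2; training pushes the chosen log-ratio up and the rejected log-ratio down.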

Advantages and Limitations of DPO

Advantages: Implementation is simple. Since the loss resembles cross-entropy, it can be implemented with nearly the same training code as SFT. Memory usage is also lower: only two models (policy + reference) are needed, roughly half that of RLHF. Using Hugging Face TRL's DPOTrainer, the entire pipeline can be configured in a few dozen lines of code.

Limitations: DPO depends on the Bradley-Terry model assumption. Human preferences may not always follow this model. Also, applying DPO directly to a base model without SFT has been reported to cause rambling responses or increased hallucinations. If the quality of chosen/rejected pair data is low, training may not converge or performance may actually degrade.

TRL-Based DPO Implementation

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# 1. Load model and tokenizer
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Load preference data (prompt, chosen, rejected structure)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:5000]")

# 3. DPO training configuration
training_args = DPOConfig(
    output_dir="./dpo-qwen2.5-1.5b",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,       # Very low learning rate is essential for DPO
    beta=0.1,                 # KL constraint strength
    max_length=1024,
    max_prompt_length=512,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
    warmup_ratio=0.1,
    gradient_checkpointing=True,
)

# 4. Create DPO Trainer and train
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("./dpo-qwen2.5-1.5b-final")

The most important hyperparameters in the code above are beta and learning_rate. If beta is too high, the model stays too close to the reference model and alignment effects are minimal; if too low, the policy becomes unstable. The learning rate is typically set at least 10x lower than SFT.

IPO: The Risks of the Bradley-Terry Assumption

Key Paper Review

A General Theoretical Paradigm to Understand Learning from Human Preferences (Azar et al., 2024, AISTATS 2024, arxiv:2310.12036)

IPO (Identity Preference Optimization) questions DPO's core assumption -- the Bradley-Terry model. The Bradley-Terry model converts pairwise preferences into pointwise reward values, and IPO's key claim is that this conversion process causes information loss and increases the risk of overfitting.

IPO proposes a more general framework called PsiPO (Psi-Preference Optimization) and derives IPO as a special case using the identity function. IPO's loss function is as follows:

L_IPO = E[(log(pi_theta(y_w|x)/pi_ref(y_w|x)) - log(pi_theta(y_l|x)/pi_ref(y_l|x)) - 1/(2*beta))^2]

Unlike DPO, which uses a log-sigmoid loss, IPO uses a squared loss. This difference improves robustness to overfitting. In DPO, optimization can push the chosen probability up and the rejected probability down without bound, but in IPO the log-ratio margin is constrained to converge to a target of 1/(2*beta).

IPO's recommended beta value is around 0.01, much lower than DPO's (0.1-0.5). Because IPO's loss structure differs fundamentally from DPO's, the same beta values should not be carried over.
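The contrast with DPO can be seen in a minimal sketch of the IPO loss: with the squared objective, the log-ratio margin is pulled toward the target 1/(2*beta) rather than pushed without bound. Variable names here are illustrative:

```python
import torch

def ipo_loss(chosen_logratio, rejected_logratio, beta=0.01):
    # IPO squared loss: (margin - 1/(2*beta))^2, margin = log-ratio difference
    margin = chosen_logratio - rejected_logratio
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()

# At the target margin (1/(2*0.01) = 50), the loss is zero
at_target = ipo_loss(torch.tensor([50.0]), torch.tensor([0.0]))
# Overshooting the target is penalized, unlike DPO's monotone loss
overshoot = ipo_loss(torch.tensor([200.0]), torch.tensor([0.0]))
```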

KTO: Alignment with Binary Signals Without Pairwise Data

Key Paper Review

KTO: Model Alignment as Prospect Theoretic Optimization (Ethayarajh & Jurafsky, 2024, ICML 2024, arxiv:2402.01306)

KTO's greatest innovation is the change in data requirements. While DPO and IPO require chosen/rejected pairs for the same prompt, KTO only needs binary labels ("good" or "bad") for individual responses. This makes an enormous practical difference, as the cost of collecting pairwise comparison data is 5-10x that of binary labeling.

KTO's theoretical foundation is Kahneman and Tversky's Prospect Theory. It directly incorporates loss aversion -- the phenomenon where humans are more sensitive to losses than gains -- into the alignment objective function.

KTO proposes the HALO (Human-Aware Loss Objective) framework and shows that existing alignment techniques implicitly contain biases from prospect theory. The analysis suggests that DPO's success is partially due to reflecting these human cognitive biases.

KTO's loss function is defined asymmetrically for desirable and undesirable responses:

L_KTO = E_desirable[1 - sigma(beta * (log(pi/pi_ref) - z_ref))]
      + lambda * E_undesirable[1 - sigma(beta * (z_ref - log(pi/pi_ref)))]

Here, z_ref is an estimate of KL divergence, and lambda is the loss aversion coefficient (default approximately 1.33, derived from prospect theory experimental results).
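A minimal sketch of this asymmetric loss follows, treating z_ref as a fixed constant for illustration (in practice it is estimated from the batch):

```python
import torch

def kto_loss(logratios, labels, z_ref=0.0, beta=0.1, lam=1.33):
    # logratios: log(pi/pi_ref) per example, shape (batch,)
    # labels: bool tensor, True = desirable, False = undesirable
    sig = torch.sigmoid
    desirable_term = 1 - sig(beta * (logratios - z_ref))
    undesirable_term = 1 - sig(beta * (z_ref - logratios))
    # lam > 1 penalizes undesirable responses more strongly (loss aversion)
    loss = torch.where(labels, desirable_term, lam * undesirable_term)
    return loss.mean()

loss = kto_loss(torch.tensor([2.0, -2.0]), torch.tensor([True, False]))
```

The asymmetry is visible directly: a bad response the policy likes costs more than a good response the policy dislikes, by a factor of lam.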

KTO Experimental Results

KTO showed performance equal to or better than DPO from 1B to 30B scale. Notably, the rambling phenomenon that appears in DPO when applied directly to base models without SFT did not occur with KTO. This is interpreted as KTO's asymmetric loss structure imposing stronger penalties on poor responses.

TRL-Based KTO Implementation

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl.experimental.kto import KTOConfig, KTOTrainer

# 1. Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Load KTO data (prompt, completion, label structure)
# label: True(desirable) / False(undesirable)
dataset = load_dataset("trl-lib/kto-mix-14k", split="train")

# 3. KTO training configuration
training_args = KTOConfig(
    output_dir="./kto-qwen2.5-1.5b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    beta=0.1,                       # KL constraint strength
    desirable_weight=1.0,           # Desirable response weight
    undesirable_weight=1.0,         # Undesirable response weight
    max_length=1024,
    max_prompt_length=512,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
)

# 4. Create KTO Trainer and train
trainer = KTOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("./kto-qwen2.5-1.5b-final")

The key to KTO data format is that it uses individual response labels rather than pairwise data. If you are already collecting thumbs-up/thumbs-down feedback, you can use it directly for KTO training without data conversion.

Latest Alignment Techniques: SimPO, ORPO, GRPO

SimPO (Simple Preference Optimization)

SimPO: Simple Preference Optimization with a Reference-Free Reward (Meng et al., 2024, arxiv:2405.14734)

SimPO removes the reference model entirely. In DPO, the reference model constrains the policy from drifting too far, but SimPO instead uses the length-normalized (per-token average) log probability of a response as the reward, combined with a target reward margin gamma. It showed performance significantly exceeding DPO on AlpacaEval 2 and Arena-Hard.

Since no reference model is needed, memory usage is halved compared to DPO. This makes alignment training of 7B models possible on a single GPU.
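A sketch of the SimPO objective under these assumptions (summed log-probs and token lengths per response; gamma is the paper's target reward margin):

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, chosen_lens, rejected_logps, rejected_lens,
               beta=2.0, gamma=0.5):
    # Reference-free reward: length-normalized log-prob of each response
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Bradley-Terry style loss with a target reward margin gamma
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

# Chosen: 10 tokens at avg logp -2.0; rejected: 20 tokens at avg logp -3.0
loss = simpo_loss(torch.tensor([-20.0]), torch.tensor([10.0]),
                  torch.tensor([-60.0]), torch.tensor([20.0]))
```

Length normalization is what prevents the trivial length bias: without it, longer responses would accumulate lower summed log-probs regardless of quality.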

ORPO (Odds-Ratio Preference Optimization)

ORPO integrates SFT and alignment into a single objective function. By adding an odds-ratio-based preference loss to the SFT loss, it achieves both instruction following and preference alignment in a single training pass without a separate SFT stage. Since KL penalty is also removed, beta tuning is unnecessary.
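The combined objective can be sketched as the SFT negative log-likelihood on the chosen response plus an odds-ratio preference term (lam is an assumed name for the weighting coefficient):

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_avg_logps, rejected_avg_logps, lam=0.1):
    # chosen_avg_logps / rejected_avg_logps: length-averaged log pi(y|x), shape (batch,)
    def log_odds(avg_logps):
        # log(p / (1 - p)) computed stably from log p (requires log p < 0)
        return avg_logps - torch.log1p(-torch.exp(avg_logps))

    sft_loss = -chosen_avg_logps.mean()  # standard SFT NLL on the chosen response
    odds_ratio_loss = -F.logsigmoid(
        log_odds(chosen_avg_logps) - log_odds(rejected_avg_logps)
    ).mean()
    return sft_loss + lam * odds_ratio_loss

loss = orpo_loss(torch.tensor([-0.5]), torch.tensor([-2.0]))
```

Because the preference term is built from the policy's own odds rather than a ratio against a reference model, no reference model or KL/beta term appears.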

GRPO (Group Relative Policy Optimization)

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024, arxiv:2402.03300)

GRPO, proposed by DeepSeek, removes PPO's value network (critic model) to reduce RLHF memory requirements by approximately 50%. It samples multiple responses as a group for a single prompt and estimates advantage using relative rewards within the group. No separate critic model is needed, simplifying implementation and training.

GRPO also played a key role in training DeepSeek-R1, showing particularly strong performance on verifiable tasks such as math and coding.
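The group-relative advantage estimate at the heart of GRPO reduces to a per-group normalization (a sketch; the full algorithm adds the clipped PPO-style update and KL term):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: shape (num_prompts, group_size) -- one row per prompt,
    # one column per sampled response. The group mean serves as the
    # baseline, so no value network is needed.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled responses with rule-based rewards (1 = correct)
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]]))
```

Advantages within each group sum to (approximately) zero: correct responses in the group are pushed up, incorrect ones pushed down, relative to each other.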

Algorithm Comparison Table

| Item | RLHF (PPO) | DPO | IPO | KTO | SimPO | ORPO | GRPO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reward Model | Explicitly trained | Implicit (in policy) | Implicit | Implicit | Not needed | Not needed | Explicit/Rule-based |
| Reference Model | Required | Required | Required | Required | Not needed | Not needed | Required |
| Data Format | Pairwise | Pairwise | Pairwise | Binary | Pairwise | Pairwise | Rule-based reward |
| Memory Usage | Very high (4 models) | High (2 models) | High (2 models) | High (2 models) | Medium (1 model) | Medium (1 model) | High (2 models) |
| Key Hyperparameters | KL coeff, clip ratio, GAE lambda | beta, lr | beta | beta, lambda | gamma, beta | lambda | clip ratio, KL coeff |
| Training Stability | Low | Medium | High | Medium-High | High | High | Medium |
| Implementation Difficulty | High | Low | Low | Low | Very low | Very low | Medium |
| Recommended beta | - | 0.1-0.5 | 0.01 | 0.1-0.3 | 2.0-2.5 | - | - |
| SFT Pre-training Required | Yes | Strongly recommended | Recommended | Optional | Recommended | Not needed (integrated) | Yes |

Hyperparameter Tuning Guide

Beta Parameter Tuning

Beta is the most important hyperparameter in all DPO-family techniques. It determines the strength of the KL constraint that controls distance from the reference policy.

# Script to analyze training behavior by beta value
import torch
import matplotlib.pyplot as plt

def dpo_loss_landscape(beta_values, log_ratio_range):
    """Visualize DPO loss surface by beta value"""
    fig, axes = plt.subplots(1, len(beta_values), figsize=(5*len(beta_values), 4))

    for idx, beta in enumerate(beta_values):
        log_ratios = torch.linspace(-log_ratio_range, log_ratio_range, 200)
        # DPO loss: -log(sigma(beta * (log_ratio_w - log_ratio_l)))
        # Here we treat log_ratio_w - log_ratio_l as a single variable
        loss = -torch.log(torch.sigmoid(beta * log_ratios))
        gradient = -beta * (1 - torch.sigmoid(beta * log_ratios))

        ax = axes[idx]
        ax.plot(log_ratios.numpy(), loss.numpy(), label="Loss", color="blue")
        ax.plot(log_ratios.numpy(), gradient.numpy(), label="Gradient", color="red")
        ax.set_title(f"beta = {beta}")
        ax.set_xlabel("log(pi/pi_ref)_w - log(pi/pi_ref)_l")
        ax.legend()
        ax.grid(True)

    plt.tight_layout()
    plt.savefig("dpo_beta_analysis.png", dpi=150)
    plt.show()

# As beta increases, the loss surface becomes steeper and training becomes more aggressive
dpo_loss_landscape([0.05, 0.1, 0.3, 0.5], log_ratio_range=5.0)

Practical Tuning Strategies:

  • Set beta lower (0.05-0.1) for smaller models (1B-3B). Smaller models tend to diverge from the reference policy more easily.
  • Larger models (7B-70B) remain stable even with higher beta values (0.1-0.5).
  • Higher quality preference data allows lowering beta for more aggressive training.
  • When data quality is uncertain, start with beta=0.1 and adjust based on validation loss.

Learning Rate Tuning

In DPO, the standard practice is to set the learning rate 5-20x lower than SFT.

| Model Size | SFT LR | Recommended DPO LR Range | Notes |
| --- | --- | --- | --- |
| 1B-3B | 2e-5 | 1e-6 to 5e-7 | High overfitting risk |
| 7B-13B | 1e-5 | 5e-7 to 1e-7 | Most stable range |
| 30B-70B | 5e-6 | 1e-7 to 5e-8 | Gradient accumulation required |

If the learning rate is too high, the model rapidly diverges from the reference policy and out-of-distribution responses increase. If too low, the ability to distinguish chosen/rejected is not sufficiently learned, resulting in minimal win rate improvement.

Practical Guide to Building Preference Data

DPO Data Format Conversion

def convert_to_dpo_format(raw_data):
    """Convert various raw data to DPO training format

    DPO requires three fields: prompt, chosen, rejected.
    Each field should ideally be in conversation format (list of dict).
    """
    dpo_dataset = []

    for item in raw_data:
        prompt = item["instruction"]

        # Method 1: Direct comparison by human evaluators
        if "chosen_response" in item and "rejected_response" in item:
            entry = {
                "prompt": [{"role": "user", "content": prompt}],
                "chosen": [{"role": "assistant", "content": item["chosen_response"]}],
                "rejected": [{"role": "assistant", "content": item["rejected_response"]}],
            }
            dpo_dataset.append(entry)

        # Method 2: Automatic conversion based on scores (rating >= 4: chosen, rating <= 2: rejected)
        elif "responses" in item:
            responses = sorted(item["responses"], key=lambda x: x["rating"], reverse=True)
            if len(responses) >= 2 and responses[0]["rating"] >= 4 and responses[-1]["rating"] <= 2:
                entry = {
                    "prompt": [{"role": "user", "content": prompt}],
                    "chosen": [{"role": "assistant", "content": responses[0]["text"]}],
                    "rejected": [{"role": "assistant", "content": responses[-1]["text"]}],
                }
                dpo_dataset.append(entry)

    return dpo_dataset

KTO Data Format Conversion

def convert_to_kto_format(raw_data):
    """Convert existing feedback data to KTO format

    KTO requires three fields: prompt, completion, label.
    Label is True (desirable) or False (undesirable).
    Since pairwise data is not needed, thumbs-up/down data can be used directly.
    """
    kto_dataset = []

    for item in raw_data:
        prompt = item["instruction"]

        # Method 1: Direct use of thumbs-up/down feedback
        if "response" in item and "thumbs_up" in item:
            entry = {
                "prompt": [{"role": "user", "content": prompt}],
                "completion": [{"role": "assistant", "content": item["response"]}],
                "label": item["thumbs_up"],  # True or False
            }
            kto_dataset.append(entry)

        # Method 2: Binary conversion based on rating
        elif "response" in item and "rating" in item:
            entry = {
                "prompt": [{"role": "user", "content": prompt}],
                "completion": [{"role": "assistant", "content": item["response"]}],
                "label": item["rating"] >= 4,  # 4+ points: desirable
            }
            kto_dataset.append(entry)

        # Method 3: Generate KTO data from DPO pairwise data
        if "chosen_response" in item and "rejected_response" in item:
            kto_dataset.append({
                "prompt": [{"role": "user", "content": prompt}],
                "completion": [{"role": "assistant", "content": item["chosen_response"]}],
                "label": True,
            })
            kto_dataset.append({
                "prompt": [{"role": "user", "content": prompt}],
                "completion": [{"role": "assistant", "content": item["rejected_response"]}],
                "label": False,
            })

    return kto_dataset

Converting DPO data to KTO format is straightforward, but the reverse is impossible. This is KTO's practical advantage. For organizations that already have binary feedback data collected without pairwise comparisons, KTO may be the only option.

Practical Troubleshooting: Common Failure Cases and Recovery

Failure Case 1: Response Quality Degradation After DPO Training

Symptoms: After DPO training, the model generates short and unfaithful responses, or conversely, extremely verbose responses.

Root Cause Analysis: This occurs when the data contains many cases where the quality difference between chosen and rejected is unclear. Especially when rejected responses are actually reasonable answers that are only "slightly less good" than chosen, the model fails to learn the correct direction.

Recovery Procedure:

  1. Data filtering: Remove pairs where the reward model score difference between chosen and rejected is below a threshold (e.g., 0.5).
  2. Increase beta (0.3-0.5) to keep closer to the reference policy.
  3. Further reduce the learning rate (half the current value).
  4. Verify that the SFT checkpoint was sufficiently trained.
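Recovery step 1 can be sketched as a simple filter, assuming reward-model scores have already been attached to each pair under hypothetical "chosen_score"/"rejected_score" keys:

```python
def filter_preference_pairs(pairs, min_margin=0.5):
    # Drop ambiguous pairs whose chosen/rejected score gap is below min_margin.
    # "chosen_score" / "rejected_score" are assumed field names for
    # precomputed reward-model scores.
    return [p for p in pairs
            if p["chosen_score"] - p["rejected_score"] >= min_margin]

pairs = [
    {"chosen_score": 2.1, "rejected_score": 0.4},  # clear preference: kept
    {"chosen_score": 1.0, "rejected_score": 0.8},  # ambiguous: dropped
]
filtered = filter_preference_pairs(pairs)
```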

Failure Case 2: Desirable/Undesirable Ratio Imbalance in KTO

Symptoms: Training does not converge, or only the probability of desirable responses increases while undesirable response probabilities remain unchanged.

Root Cause Analysis: When desirable data outnumbers undesirable data by more than 5x, gradient signals for undesirable samples get diluted. The KTO paper recommends a desirable:undesirable ratio between 1:1 and 4:1.

Recovery Procedure:

  1. Increase undesirable_weight (e.g., from 1.0 to 2.0).
  2. Oversample undesirable data.
  3. If the ratio is extreme (over 10:1), consider switching to DPO.
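Recovery step 2 (oversampling) can be sketched as follows, assuming the KTO example format shown earlier with a boolean "label" field; the 4:1 ceiling follows the ratio range recommended in the KTO paper:

```python
import random

def rebalance_kto_dataset(examples, max_ratio=4.0, seed=0):
    # Oversample undesirable examples until desirable:undesirable <= max_ratio.
    rng = random.Random(seed)
    desirable = [e for e in examples if e["label"]]
    undesirable = [e for e in examples if not e["label"]]
    if not undesirable:
        return examples
    # Smallest undesirable count that satisfies the ratio (ceiling division)
    target = max(len(undesirable), int(len(desirable) / max_ratio + 0.999))
    extra = [rng.choice(undesirable) for _ in range(target - len(undesirable))]
    return desirable + undesirable + extra

data = [{"label": True}] * 10 + [{"label": False}] * 1
balanced = rebalance_kto_dataset(data)  # 10:1 rebalanced to 10:3
```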

Failure Case 3: Loss Explosion in Early Training

Symptoms: Loss spikes within the first few dozen steps of training and eventually becomes NaN.

Root Cause Analysis: This occurs when the initial difference between the reference model and training model is too large, or the learning rate is excessively high. This issue is particularly common when using a base model instead of an SFT checkpoint as the reference.

Recovery Procedure:

  1. Reduce the learning rate by 5x.
  2. Set warmup_ratio to 0.1 or higher.
  3. Start both the reference model and training model from the same SFT checkpoint.
  4. Switch from bf16 to fp32 to verify numerical stability.
  5. Set gradient clipping to 1.0.

Failure Case 4: Overfitting - Validation Loss Rises While Train Loss Decreases

Symptoms: Train loss decreases but validation loss starts rising from mid-epoch. Generation quality shows a pattern of directly copying chosen responses from the training data.

Recovery Procedure:

  1. Stop training at the checkpoint with the lowest validation loss (early stopping).
  2. Verify that the number of epochs is appropriate for the data volume. Usually 1-3 epochs is sufficient.
  3. Consider switching to IPO. IPO is more robust against overfitting due to its squared loss structure.

UNA: A Framework Unifying RLHF/DPO/KTO

Among recent research, the UNA (Unifying Alignments) framework (arxiv:2408.15339) is noteworthy. UNA unifies RLHF (PPO), DPO, and KTO through a generalized implicit reward function. The core insight is that all three techniques can be reinterpreted as "supervised learning that minimizes the difference between implicit and explicit rewards."

From this perspective, pairwise feedback (DPO), binary feedback (KTO), and scalar feedback (RLHF) correspond to different special cases of the same objective function. In practice, you can choose the appropriate technique based on the type of data you have, and when multiple types of feedback data coexist, the UNA framework can leverage them in a unified manner.

Alignment Technique Selection Checklist

Use this checklist in order when selecting an alignment technique.

Data Format Verification

  • Do you have pairwise comparison data (A is better than B)? -> DPO, IPO, SimPO available
  • Do you only have binary feedback data (good/bad)? -> Use KTO
  • Do you have numerical scores (1-5)? -> Convert to DPO or KTO format based on thresholds
  • Is it a task with verifiable correct answers? -> Consider GRPO

Infrastructure Verification

  • Is GPU memory 48GB or more? -> DPO, IPO, KTO all possible
  • Is GPU memory 24GB or less? -> SimPO or ORPO recommended (no reference model needed)
  • Do you have the capacity for separate SFT training? -> DPO + SFT pipeline
  • Do you want to finish SFT and alignment in one pass? -> ORPO

Quality Requirements Verification

  • Is overfitting a concern? -> IPO (prevents overfitting with squared loss)
  • Must you apply directly to a base model without SFT? -> KTO (less rambling)
  • Is data quality uncertain? -> Set beta high and use IPO
  • Is maximum performance the goal? -> DPO + high-quality pairwise data combination

Operational Considerations

  • Have you saved the reference model checkpoint separately before training?
  • Have you separated a validation set? (at least 5% of total data)
  • Have you enabled gradient checkpointing? (memory savings)
  • Have you configured training logging with wandb or similar?
  • Have you manually inspected the quality of chosen/rejected (or desirable/undesirable) data?

Evaluation Pipeline: Validating Alignment Results

Quantitatively evaluating model quality after alignment training is essential. Simply having lower loss does not mean alignment was successful.

# Win rate calculation script for alignment model evaluation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def calculate_win_rate(model_path, ref_model_path, eval_prompts, tokenizer_name, judge):
    """Evaluate the win rate of an aligned model against a reference model.

    judge(prompt, aligned_text, ref_text) should return 1 if the aligned
    response wins and 0 otherwise (e.g. an LLM judge or a trained reward model).
    """
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    ref_model = AutoModelForCausalLM.from_pretrained(ref_model_path, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model.eval()
    ref_model.eval()

    wins, total = 0, 0

    for prompt in eval_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")

        # Generate one response from each model for the same prompt
        with torch.no_grad():
            aligned_output = model.generate(
                **inputs, max_new_tokens=512,
                temperature=0.7, do_sample=True
            )
            ref_output = ref_model.generate(
                **inputs, max_new_tokens=512,
                temperature=0.7, do_sample=True
            )

        aligned_text = tokenizer.decode(aligned_output[0], skip_special_tokens=True)
        ref_text = tokenizer.decode(ref_output[0], skip_special_tokens=True)

        # Delegate the comparison to the judge (e.g. GPT-4 as judge,
        # or a trained reward model)
        wins += judge(prompt, aligned_text, ref_text)
        total += 1

    return wins / total if total > 0 else 0.0

An important point during evaluation is to avoid self-evaluation (the aligned model evaluating itself). Using a separate judge model or human evaluation in parallel is the way to obtain reliable results. Using standard benchmarks such as AlpacaEval, MT-Bench, and Arena-Hard is also recommended.

Directions of Alignment Research in 2025-2026

Current alignment research is rapidly evolving in several directions.

Verifier-driven RL: Following the success of GRPO, the direction of using rule-based rewards for verifiable tasks such as math and coding is being strengthened. The domain where alignment is possible without human feedback is expanding.

Online DPO / Iterative DPO: Online methods where the model being trained directly generates responses and updates preference data based on them are showing performance improvements over offline DPO. However, there is a trade-off of increased training costs.

Multi-objective alignment: Research on simultaneously optimizing multiple axes such as helpfulness, harmlessness, and honesty, rather than simply "good responses," is active. Extensions like Mo-KTO (Multi-Objective KTO) have been proposed.

Synthetic preference data: Techniques for generating preference data using powerful LLMs (GPT-4, Claude, etc.) instead of human evaluators are becoming mainstream. While costs are greatly reduced, caution is needed as biases from the judge model can transfer directly.

Conclusion

The evolution from RLHF to DPO, and from DPO to KTO, is not merely algorithmic improvement but a process of finding answers to the fundamental question: "What are the minimum conditions needed for alignment?" RLHF required an explicit reward model and RL loop, DPO removed the reward model, and KTO made even pairwise comparison data unnecessary.

In practice, the choice is determined not by theoretical superiority but by the type of available data, infrastructure constraints, and the team's experience level. If you have pairwise comparison data and sufficient GPUs, DPO is a proven choice. If you only have binary feedback data, KTO is the only option. If memory is limited, consider SimPO or ORPO. For verifiable tasks, GRPO is a strong alternative.

Regardless of the technique chosen, data quality determines everything. 10,000 high-quality preference data points are better than 100,000 low-quality ones. Not skipping data inspection before training, validation monitoring during training, and systematic evaluation after training is the key to successful alignment.

References

  1. DPO - Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", NeurIPS 2023. arxiv:2305.18290
  2. KTO - Ethayarajh & Jurafsky, "KTO: Model Alignment as Prospect Theoretic Optimization", ICML 2024. arxiv:2402.01306
  3. IPO - Azar et al., "A General Theoretical Paradigm to Understand Learning from Human Preferences", AISTATS 2024. arxiv:2310.12036
  4. GRPO - Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", 2024. arxiv:2402.03300
  5. SimPO - Meng et al., "SimPO: Simple Preference Optimization with a Reference-Free Reward", 2024. arxiv:2405.14734
  6. UNA - "UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function", 2024. arxiv:2408.15339
  7. DPO Comprehensive Survey - "A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications", 2024. arxiv:2410.15595
  8. Hugging Face TRL - DPO Trainer Documentation. https://huggingface.co/docs/trl/main/en/dpo_trainer
  9. Hugging Face TRL - KTO Trainer Documentation. https://huggingface.co/docs/trl/main/en/kto_trainer
  10. InstructGPT - Ouyang et al., "Training language models to follow instructions with human feedback", NeurIPS 2022. arxiv:2203.02155