From RLHF to DPO: A Deep Dive into LLM Alignment Techniques


Introduction

Large language models (LLMs) acquire remarkable linguistic capabilities from massive pretraining corpora, but pretraining alone cannot guarantee outputs aligned with human intentions and values. Models frequently generate harmful content, ignore user instructions, or confidently state falsehoods. The research field dedicated to bridging this gap is LLM Alignment.

Since OpenAI's InstructGPT paper (2022) formalized the RLHF (Reinforcement Learning from Human Feedback) pipeline, numerous successors have emerged: Anthropic's Constitutional AI, Stanford's DPO (Direct Preference Optimization), and newer methods like KTO, IPO, and ORPO. This article systematically analyzes the mathematical foundations, implementation details, and practical considerations of these key alignment techniques.

Defining the Alignment Problem

Why Alignment Matters

Pretrained LLMs optimize next-token prediction over internet text. This objective does not directly correspond to generating outputs that are "Helpful, Honest, and Harmless" (the HHH criteria). Three core challenges arise:

  1. Helpfulness: The ability to accurately understand and follow user instructions
  2. Honesty: Acknowledging uncertainty and minimizing hallucinations
  3. Harmlessness: Refusing to generate harmful or biased content

Mathematical Framework

The alignment problem is formalized as KL-constrained reward maximization: given a prompt x, we maximize the expected reward of responses y sampled from the policy pi, while penalizing divergence from a pretrained reference policy.

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)} \left[ r(x, y) \right] - \beta \, \mathrm{KL}\left[\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right]$$

Here, beta controls the strength of the KL divergence penalty. If beta is too small, reward hacking occurs; if too large, learning stalls.
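In PPO-based RLHF implementations, this objective is typically realized by folding the KL term into the per-token reward: every generated token incurs a small KL penalty, and the reward-model score is added at the final token. A minimal sketch (the function name and tensor shapes are illustrative, not from any particular library):

```python
import torch

def kl_shaped_rewards(log_probs_policy, log_probs_ref, terminal_reward, beta=0.1):
    """Combine a scalar sequence reward with a per-token KL penalty.

    log_probs_policy / log_probs_ref: (seq_len,) token log probs of the
    sampled response under the policy and the frozen reference model.
    terminal_reward: reward-model score for the full response.
    """
    # Per-token KL estimate: log pi(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t)
    kl = log_probs_policy - log_probs_ref
    rewards = -beta * kl                          # KL penalty at every token
    rewards[-1] = rewards[-1] + terminal_reward   # RM score at the final token
    return rewards

# Toy example: policy slightly diverged from the reference
policy_lp = torch.tensor([-1.0, -0.5, -2.0])
ref_lp = torch.tensor([-1.2, -0.6, -1.8])
r = kl_shaped_rewards(policy_lp, ref_lp, terminal_reward=1.0, beta=0.1)
```

With this shaping, tokens where the policy is more confident than the reference are taxed, which is exactly how the beta trade-off above manifests during rollouts.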

Deep Analysis of the RLHF Pipeline

InstructGPT's Three-Stage Pipeline

Ouyang et al. (2022) formalized RLHF into three stages:

Stage 1: Supervised Fine-Tuning (SFT)

Human labelers write demonstration data showing desired model behavior. The base model is fine-tuned on these demonstrations, learning basic instruction-following capability.

Stage 2: Reward Model (RM) Training

Multiple responses are generated for the same prompt, then human labelers rank them by preference. A reward model is trained on this ranking data.

Stage 3: Reinforcement Learning via PPO

The reward model provides feedback signals for PPO to optimize the policy.

Reward Model Training Implementation

The reward model is trained using the Bradley-Terry preference model, maximizing the log-probability difference between chosen and rejected response pairs.

import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, model_name="gpt2"):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=1
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits.squeeze(-1)

def reward_model_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    """Bradley-Terry preference model loss for reward model training"""
    r_chosen = reward_model(chosen_ids, chosen_mask)
    r_rejected = reward_model(rejected_ids, rejected_mask)

    # Train the reward for chosen responses to exceed rejected responses;
    # logsigmoid is the numerically stable form of log(sigmoid(x))
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    accuracy = (r_chosen > r_rejected).float().mean()

    return loss, accuracy

Preference Dataset Construction

The cornerstone of the RLHF pipeline is high-quality preference data. Each sample consists of a prompt, a preferred (chosen) response, and a dispreferred (rejected) response.

from datasets import Dataset

def build_preference_dataset(prompts, model, tokenizer, num_samples=4):
    """Pipeline to generate multiple responses per prompt and construct preference pairs"""
    preference_data = []

    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        responses = []

        # Generate diverse responses via temperature sampling
        for _ in range(num_samples):
            output = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.8,
                do_sample=True,
            )
            response = tokenizer.decode(output[0], skip_special_tokens=True)
            responses.append(response)

        # In production, human labelers rank the responses; here we
        # substitute a (hypothetical) reward_model_score helper
        scored = [(r, reward_model_score(prompt, r)) for r in responses]
        scored.sort(key=lambda x: x[1], reverse=True)

        # Construct preference pairs from best and worst
        preference_data.append({
            "prompt": prompt,
            "chosen": scored[0][0],
            "rejected": scored[-1][0],
        })

    return Dataset.from_list(preference_data)

Constitutional AI: Alignment from AI Feedback

Anthropic's Approach

Bai et al. (2022) introduced Constitutional AI to reduce RLHF's dependence on human labeling. The core idea is that AI critiques and improves its own outputs, guided by a set of predefined principles (the "constitution").

Two-Phase Process

Supervised Learning Phase (Self-Critique and Revision):

  1. The model generates a potentially harmful response
  2. It critiques its own response based on constitutional principles
  3. It generates a revised response reflecting the critique
  4. SFT is performed on the revised responses

RL Phase (RLAIF - RL from AI Feedback):

  1. The AI judges which of two responses better adheres to the principles
  2. A reward model is trained on these AI preference labels
  3. PPO optimizes the policy using this reward model

The greatest advantage of Constitutional AI is dramatically reducing human labeling costs while achieving transparent, interpretable, principle-based alignment.
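The supervised phase can be sketched as a prompt-driven critique/revision loop. The templates below are illustrative stand-ins, not Anthropic's actual constitution prompts, and `generate` is any prompt-to-text callable:

```python
CRITIQUE_TEMPLATE = (
    "Consider the following response:\n{response}\n\n"
    "Critique the response according to this principle: {principle}"
)
REVISION_TEMPLATE = (
    "Original response:\n{response}\n\n"
    "Critique:\n{critique}\n\n"
    "Rewrite the response so that it satisfies the principle: {principle}"
)

def critique_and_revise(generate, response, principle):
    """One self-critique / revision step of the Constitutional AI SL phase."""
    critique = generate(
        CRITIQUE_TEMPLATE.format(response=response, principle=principle)
    )
    revised = generate(
        REVISION_TEMPLATE.format(
            response=response, critique=critique, principle=principle
        )
    )
    return critique, revised

# Usage with a stub callable standing in for a real LLM
stub = lambda prompt: "[model output for: " + prompt.splitlines()[0] + "]"
critique, revised = critique_and_revise(
    stub, "Sure, here is how to ...", "avoid giving harmful instructions"
)
```

The revised responses collected from this loop become the SFT targets of step 4 above; in the real pipeline, a different principle is typically sampled from the constitution at each iteration.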

DPO: Mathematical Foundations and Implementation

The Analytical Solution to RLHF

Rafailov et al. (2023) introduced DPO as the most elegant alternative to RLHF. The key insight is that the KL-constrained reward maximization problem has an analytical solution.

The optimal policy takes the form:

$$\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

Inverting this, the reward function can be expressed as a ratio of policies:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry model yields the DPO loss, which directly optimizes the policy without a reward model:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

DPO Training Loop Implementation

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_dpo_loss(
    policy_model,
    reference_model,
    chosen_ids,
    chosen_mask,
    rejected_ids,
    rejected_mask,
    beta=0.1,
):
    """DPO loss function: direct policy optimization without a reward model"""
    # Compute log probabilities under the policy model
    policy_chosen_logps = get_log_probs(policy_model, chosen_ids, chosen_mask)
    policy_rejected_logps = get_log_probs(policy_model, rejected_ids, rejected_mask)

    # Compute log probabilities under the reference model (no gradients needed)
    with torch.no_grad():
        ref_chosen_logps = get_log_probs(reference_model, chosen_ids, chosen_mask)
        ref_rejected_logps = get_log_probs(reference_model, rejected_ids, rejected_mask)

    # Compute DPO log ratios
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    # Metric: fraction where implicit reward for chosen exceeds rejected
    reward_accuracy = (logits > 0).float().mean()

    return loss, reward_accuracy

def get_log_probs(model, input_ids, attention_mask):
    """Sum token-level log probabilities over the sequence"""
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits[:, :-1, :]
    labels = input_ids[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    selected_log_probs = torch.gather(
        log_probs, dim=-1, index=labels.unsqueeze(-1)
    ).squeeze(-1)
    mask = attention_mask[:, 1:]
    return (selected_log_probs * mask).sum(dim=-1)

Strengths and Weaknesses of DPO

Strengths:

  • No reward model training needed, simplifying the pipeline
  • Lower memory footprint than PPO (2 models vs. 4 models)
  • Relatively straightforward hyperparameter tuning

Weaknesses:

  • Vulnerable to distribution shift from the reference model
  • Relies on offline data, lacking exploration
  • Highly sensitive to preference data quality

PPO Training Stability

PPO in RLHF: Core Implementation

Proximal Policy Optimization (PPO) is the most widely used RL algorithm in RLHF. Its core mechanism is clipping, which constrains the magnitude of policy updates to ensure training stability.

import torch
import torch.nn.functional as F

def ppo_step(
    policy_model,
    value_model,
    old_log_probs,
    states,
    actions,
    rewards,
    advantages,
    clip_epsilon=0.2,
    kl_coeff=0.1,
    ref_log_probs=None,
):
    """PPO update step for LLM RLHF training"""
    # Current policy log probabilities
    new_log_probs = policy_model.get_log_probs(states, actions)

    # Importance sampling ratio
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value function loss (rewards serve as regression targets in this
    # sketch; full implementations regress on bootstrapped returns)
    new_values = value_model(states)
    value_loss = F.mse_loss(new_values, rewards)

    # KL divergence penalty (against reference model), computed from the
    # current policy's log probs so the gradient flows through the penalty
    kl_penalty = 0.0
    if ref_log_probs is not None:
        kl_div = (new_log_probs - ref_log_probs).mean()
        kl_penalty = kl_coeff * kl_div

    total_loss = policy_loss + 0.5 * value_loss + kl_penalty

    return total_loss, {
        "policy_loss": policy_loss.item(),
        "value_loss": value_loss.item(),
        "kl_penalty": kl_penalty if isinstance(kl_penalty, float) else kl_penalty.item(),
        "mean_ratio": ratio.mean().item(),
    }

Key Sources of PPO Instability

  1. Reward Hacking: Exploiting reward model weaknesses to achieve high reward for genuinely poor outputs
  2. KL Divergence Explosion: Rapid divergence from the reference model, resulting in degenerate text
  3. Value Function Estimation Errors: Abnormally large reward estimates destabilizing training
  4. Long Sequence Problem: Credit assignment difficulty increases with token count
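A common countermeasure to KL explosion is the adaptive KL controller from the InstructGPT paper, which nudges the KL coefficient up when the measured KL exceeds a target and down when it falls below. A sketch (hyperparameter values are illustrative):

```python
class AdaptiveKLController:
    """Adjust the KL coefficient to keep measured KL near a target value."""

    def __init__(self, init_kl_coeff=0.1, target_kl=6.0, horizon=10000):
        self.kl_coeff = init_kl_coeff
        self.target = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Proportional feedback, clipped to +/-20% to avoid oscillation
        error = max(min(current_kl / self.target - 1.0, 0.2), -0.2)
        self.kl_coeff *= 1.0 + error * n_steps / self.horizon
        return self.kl_coeff

# Measured KL above target -> the penalty coefficient grows
ctrl = AdaptiveKLController(init_kl_coeff=0.1, target_kl=6.0)
new_coeff = ctrl.update(current_kl=12.0, n_steps=256)
```

Calling `ctrl.update(...)` after each batch and feeding the result back in as `kl_coeff` gives a simple closed loop that damps runaway divergence without manual re-tuning.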

Recent Alignment Methods: KTO, IPO, ORPO

KTO (Kahneman-Tversky Optimization)

Ethayarajh et al. (2024) introduced KTO, which performs alignment without pairwise preference data. Drawing from behavioral economics' Prospect Theory, it learns from binary signals indicating whether each response is "desirable" or "undesirable."

KTO's key innovation is incorporating loss aversion: the penalty for undesirable responses is larger than the reward for desirable ones, mirroring how humans weigh losses more heavily than gains.
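A simplified sketch of how such a loss can be written. Here `kl_baseline` stands in for the detached batch KL estimate the paper uses as a reference point, and the formulation below omits several of the paper's details (it is an illustration of the loss-aversion structure, not a faithful reimplementation):

```python
import torch

def kto_loss(policy_logps, ref_logps, labels, beta=0.1,
             lambda_d=1.0, lambda_u=1.0, kl_baseline=0.0):
    """Simplified KTO-style loss from unpaired binary feedback.

    policy_logps / ref_logps: (batch,) sequence log probabilities.
    labels: (batch,) bool tensor, True = desirable response.
    """
    log_ratio = policy_logps - ref_logps  # implicit reward, as in DPO
    # Value of a desirable example grows with its implicit reward;
    # value of an undesirable example grows as its reward falls
    desirable = torch.sigmoid(beta * (log_ratio - kl_baseline))
    undesirable = torch.sigmoid(beta * (kl_baseline - log_ratio))
    # Loss aversion enters via lambda_u > lambda_d weighting in practice
    values = torch.where(labels, lambda_d * desirable, lambda_u * undesirable)
    return (1.0 - values).mean()
```

Setting `lambda_u` larger than `lambda_d` penalizes undesirable outputs more than it rewards desirable ones, which is the Prospect Theory asymmetry KTO is named for.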

IPO (Identity Preference Optimization)

Azar et al. (2024) introduced IPO to address DPO's overfitting problem by adding a regularization term. When DPO overfits, it tends to push chosen response probabilities to extremes; IPO prevents this.

$$\mathcal{L}_{\text{IPO}} = \left(\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta}\right)^2$$
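The loss translates directly into code: rather than pushing the log-ratio margin to infinity as DPO's sigmoid objective does, IPO regresses it toward the fixed target 1/(2β). A minimal sketch using summed sequence log probabilities:

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """IPO loss: squared regression of the log-ratio margin to 1/(2*beta)."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratios - rejected_logratios
    # The squared target keeps the margin bounded, unlike the DPO objective
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()
```

The loss reaches zero exactly when the margin equals 1/(2β), so well-separated pairs stop contributing gradient instead of being pushed further apart.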

ORPO (Odds Ratio Preference Optimization)

Hong et al. (2024) introduced ORPO, which unifies SFT and preference optimization into a single stage. It requires no reference model, significantly reducing memory and compute costs. The key is an odds-ratio-based penalty term.

import torch
import torch.nn.functional as F

def compute_orpo_loss(model, chosen_ids, chosen_mask, rejected_ids, rejected_mask, alpha=1.0):
    """ORPO loss: unified SFT + odds-ratio preference optimization"""
    # SFT loss (NLL on chosen responses)
    chosen_logits = model(input_ids=chosen_ids, attention_mask=chosen_mask).logits
    sft_loss = F.cross_entropy(
        chosen_logits[:, :-1, :].reshape(-1, chosen_logits.size(-1)),
        chosen_ids[:, 1:].reshape(-1),
        reduction="mean",
    )

    # Length-normalized (average per-token) log probabilities, as used in
    # the ORPO paper; assumes a get_sequence_log_probs helper analogous to
    # get_log_probs above but averaged over the sequence length
    chosen_logps = get_sequence_log_probs(model, chosen_ids, chosen_mask)
    rejected_logps = get_sequence_log_probs(model, rejected_ids, rejected_mask)

    # Log-odds: log(p) - log(1 - p), computed in log space for stability
    chosen_odds = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    rejected_odds = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    log_odds_ratio = chosen_odds - rejected_odds

    # Final ORPO loss
    orpo_loss = sft_loss - alpha * F.logsigmoid(log_odds_ratio).mean()

    return orpo_loss

Comparative Analysis

A comparison of the key characteristics of each alignment technique:

| Method | Data Requirements | Num Models | Compute Cost | Stability | Performance |
|---|---|---|---|---|---|
| RLHF (PPO) | Pairwise prefs | 4 (policy, ref, reward, value) | Very High | Low (tuning needed) | High (when well-tuned) |
| DPO | Pairwise prefs | 2 (policy, ref) | Medium | High | High |
| KTO | Binary feedback (unpaired) | 2 (policy, ref) | Medium | High | Medium-High |
| IPO | Pairwise prefs | 2 (policy, ref) | Medium | Very High | Medium-High |
| ORPO | Pairwise prefs | 1 (policy only) | Low | High | Medium-High |

Key Trade-offs:

  • RLHF: Highest performance ceiling, but high instability and engineering complexity
  • DPO: Best balance of simplicity and performance. Currently the most widely adopted method
  • KTO: Strong alternative when constructing pairwise data is difficult
  • IPO: Best suited for small datasets where DPO overfitting is severe
  • ORPO: Efficient choice when compute resources are limited

Practical Application Guide

Selection Criteria

When choosing an alignment technique in practice, consider:

  1. Data format: Do you have pairwise preference data, or only binary feedback?
  2. Compute resources: What are the GPU memory and training time constraints?
  3. Team expertise: Does your team have PPO tuning experience?
  4. Model size: For 7B and below, DPO is recommended; for 70B+, consider RLHF

Evaluation Metrics Implementation

Key metrics for evaluating alignment quality:

import numpy as np
from typing import List, Dict

def compute_alignment_metrics(
    model_responses: List[str],
    reference_responses: List[str],
    reward_model,
    tokenizer,
) -> Dict[str, float]:
    """Comprehensive metrics for alignment quality evaluation"""
    metrics = {}

    # 1. Reward model scores
    rewards = []
    for response in model_responses:
        tokens = tokenizer(response, return_tensors="pt")
        score = reward_model(**tokens).item()
        rewards.append(score)
    metrics["mean_reward"] = np.mean(rewards)
    metrics["reward_std"] = np.std(rewards)

    # 2. Win rate against reference model
    wins = 0
    for model_resp, ref_resp in zip(model_responses, reference_responses):
        model_tokens = tokenizer(model_resp, return_tensors="pt")
        ref_tokens = tokenizer(ref_resp, return_tensors="pt")
        model_score = reward_model(**model_tokens).item()
        ref_score = reward_model(**ref_tokens).item()
        if model_score > ref_score:
            wins += 1
    metrics["win_rate"] = wins / len(model_responses)

    # 3. Response length distribution (length hacking detection)
    lengths = [len(r.split()) for r in model_responses]
    metrics["mean_length"] = np.mean(lengths)
    metrics["length_std"] = np.std(lengths)

    # 4. Diversity metric (mode collapse detection)
    unique_bigrams = set()
    total_bigrams = 0
    for response in model_responses:
        words = response.split()
        for i in range(len(words) - 1):
            bigram = (words[i], words[i + 1])
            unique_bigrams.add(bigram)
            total_bigrams += 1
    metrics["distinct_2"] = len(unique_bigrams) / max(total_bigrams, 1)

    return metrics

Failure Cases and Lessons Learned

1. Reward Hacking

The most common failure mode in RLHF. When the reward model learns spurious correlations like "longer response = better response," the policy generates excessively long outputs regardless of content. The InstructGPT paper reported this issue and partially addressed it with length normalization and stronger KL penalties.

Lesson: Continuously monitor reward model biases and cross-validate with diverse proxy metrics.
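A crude illustration of a length-based correction (the threshold and penalty values are invented for this example; InstructGPT's actual mitigation is more involved):

```python
def length_normalized_reward(raw_reward, response_len,
                             target_len=200, penalty_per_token=0.001):
    """Subtract a small penalty for each token beyond a target length,
    blunting the 'longer = better' spurious correlation."""
    excess = max(0, response_len - target_len)
    return raw_reward - penalty_per_token * excess
```

Tracking the mean response length alongside the mean reward (as in the evaluation metrics above) reveals whether the policy is earning reward through content or through verbosity.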

2. Mode Collapse

During PPO training, the policy may converge to generating only a few patterns that score highly with the reward model. It produces identically structured responses for diverse questions or repeats specific phrases excessively.

Lesson: Maintain appropriate KL divergence constraints and monitor response diversity metrics during training.

3. DPO Distribution Shift

Since DPO trains on offline preference data, the distribution of the current policy progressively diverges from the training data distribution as learning proceeds. This can cause performance degradation in later stages of training.

Lesson: Apply iterative DPO, generating new preference data from the current policy each round, or use online DPO variants.
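The iterative recipe can be sketched as a generic loop, with the sampling, labeling, and training steps passed in as callables (all names here are illustrative):

```python
def iterative_dpo(policy, rounds, generate_pairs, label_preferences, dpo_train):
    """Iterative DPO: each round samples fresh responses from the current
    policy, labels them, and runs DPO against a snapshot of that policy.

    generate_pairs(policy)      -> candidate response pairs (on-policy)
    label_preferences(pairs)    -> preference dataset (human or AI judge)
    dpo_train(policy, ref, ds)  -> updated policy
    """
    for _ in range(rounds):
        pairs = generate_pairs(policy)       # on-policy samples: no stale data
        dataset = label_preferences(pairs)   # relabel each round
        reference = policy                   # snapshot becomes the new reference
        policy = dpo_train(policy, reference, dataset)
    return policy

# Smoke test with stub callables (an int stands in for model weights)
final = iterative_dpo(
    0, 3,
    generate_pairs=lambda p: ["pair"],
    label_preferences=lambda pairs: pairs,
    dpo_train=lambda p, ref, ds: p + 1,
)
```

Resetting the reference to the current policy each round keeps the KL anchor close to the data-generating distribution, which is exactly what offline DPO loses over time.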

4. Constitutional AI Principle Conflicts

Principles included in the constitution can conflict with one another. For example, "be helpful" and "do not provide harmful information" may clash in certain contexts.

Lesson: Clearly define priority orderings among principles and establish separate guidelines for edge cases.

References

  1. Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022. https://arxiv.org/abs/2203.02155

  2. Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023. https://arxiv.org/abs/2305.18290

  3. Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic. https://arxiv.org/abs/2212.08073

  4. Ethayarajh, K., et al. (2024). "KTO: Model Alignment as Prospect Theoretic Optimization." https://arxiv.org/abs/2402.01306

  5. Hong, J., et al. (2024). "ORPO: Monolithic Preference Optimization without Reference Model." https://arxiv.org/abs/2403.07691

  6. Azar, M. G., et al. (2024). "A General Theoretical Paradigm to Understand Learning from Human Feedback." (IPO) https://arxiv.org/abs/2310.12036

  7. Schulman, J., et al. (2017). "Proximal Policy Optimization Algorithms." https://arxiv.org/abs/1707.06347

  8. Hugging Face Blog. "Preference Tuning LLMs with Direct Preference Optimization Methods." https://huggingface.co/blog/pref-tuning