
Reinforcement Learning Complete Guide: From DQN, PPO to RLHF and DPO for LLM Alignment


Introduction

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to maximize reward by interacting with an environment. From AlphaGo conquering the world of Go to ChatGPT's RLHF alignment, it has become a core technology in modern AI.

This post covers the field in one place, from the mathematical foundations of MDPs to DPO, the latest approach to LLM alignment.


1. RL Fundamentals: MDP

Markov Decision Process

RL is formalized as an MDP, a 5-tuple $(S, A, P, R, \gamma)$:

  • $S$: State space
  • $A$: Action space
  • $P(s'|s, a)$: Transition probability
  • $R(s, a, s')$: Reward function
  • $\gamma \in [0, 1)$: Discount factor

Markov Property: The next state depends only on the current state and action, not on the earlier history.

$$P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t)$$

Policy

A policy $\pi$ maps states to actions:

  • Deterministic policy: $a = \pi(s)$
  • Stochastic policy: $a \sim \pi(a|s)$

Value Functions

State value function: Expected cumulative discounted reward from state $s$ following policy $\pi$

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s \right]$$

Action value function (Q-function): Expected return after taking action $a$ in state $s$, then following $\pi$

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s, a_0 = a \right]$$

Bellman Equations

Recursive decomposition of value functions:

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$

Bellman Optimality Equations:

$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$

$$Q^*(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s', a') \right]$$
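The Bellman optimality equation is a fixed-point condition, which suggests value iteration: repeatedly apply the Bellman backup until the values stop changing. A minimal NumPy sketch on a made-up two-state, two-action MDP (the transition tensor `P` and expected rewards `R` are illustrative, not from the text):

```python
import numpy as np

# Toy MDP (illustrative): 2 states, 2 actions.
# P[s, a, s'] = transition probability; R[s, a] = expected reward
# (i.e., R(s, a, s') already averaged over s').
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) = max_a sum_s' P [R + gamma V(s')]
    Q = R + gamma * P @ V            # shape (2, 2): Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy w.r.t. Q*
```

At convergence, `V` satisfies the Bellman optimality equation up to numerical tolerance, and the greedy policy extracted from `Q` is optimal for this toy MDP.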


2. Model-Free RL

Q-Learning (Off-Policy)

Q-Learning is an off-policy method. TD update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

Target: $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ (greedy next action)

SARSA (On-Policy)

SARSA is an on-policy method:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

Target: $r_t + \gamma Q(s_{t+1}, a_{t+1})$ (the action $a_{t+1}$ actually taken)

| Property | Q-Learning | SARSA |
|---|---|---|
| Policy type | Off-policy | On-policy |
| Update target | Greedy action | Actual action |
| Cliff exploration | Optimal (risky) path | Safe (longer) path |
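The two update rules differ only in the bootstrap target. A tabular sketch (the function names are mine; the $\epsilon$-greedy helper is the usual behavior policy for both algorithms):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy (max) next action."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action actually taken next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random):
    """Behavior policy used by both: random with prob. epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randint(Q.shape[1])
    return int(np.argmax(Q[s]))
```

Swapping one line in the target is all that separates the two algorithms, which is exactly why their cliff-walking behavior in the table above diverges.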

DQN (Deep Q-Network)

DQN approximates the Q-function with a neural network. Two key stabilization techniques:

  1. Experience Replay: Sample mini-batches from a replay buffer
  2. Target Network: Separate network copied periodically for stable targets
import torch
import torch.nn as nn
import numpy as np
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.FloatTensor(dones)
        )

    def __len__(self):
        return len(self.buffer)

def train_dqn_step(online_net, target_net, optimizer, buffer,
                   batch_size=64, gamma=0.99):
    if len(buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    # Current Q-values
    current_q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target Q-values (using target network)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(1)[0]
        target_q = rewards + gamma * max_next_q * (1 - dones)

    loss = nn.MSELoss()(current_q, target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Double DQN

Fixes the overestimation problem in DQN by decoupling action selection and value evaluation:

$$Q(s_t, a_t) \leftarrow r_t + \gamma\, Q_{\text{target}}\!\left(s_{t+1},\; \arg\max_{a'} Q_{\text{online}}(s_{t+1}, a')\right)$$
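As a sketch, only the target computation in the `train_dqn_step` function above changes (same `online_net`/`target_net` naming assumed):

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones,
                      gamma=0.99):
    """Double DQN target: online net selects the action, target net evaluates it."""
    with torch.no_grad():
        # Action selection with the online network
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Value evaluation with the target network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1 - dones)
```

Compare with vanilla DQN, where `target_net(next_states).max(1)[0]` both selects and evaluates, which is the source of the overestimation bias.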

Dueling DQN

Decomposes the Q-value into a state value $V(s)$ and an advantage $A(s,a)$:

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$$

class DuelingDQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.feature = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU()
        )
        self.value_stream = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1)
        )
        self.advantage_stream = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, action_dim)
        )

    def forward(self, x):
        feat = self.feature(x)
        value = self.value_stream(feat)
        advantage = self.advantage_stream(feat)
        return value + advantage - advantage.mean(dim=1, keepdim=True)

3. Policy Gradient Methods

REINFORCE

Directly parameterize the policy and estimate gradients:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.
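A minimal sketch of the corresponding loss, given the per-step log-probabilities and rewards of one episode (normalizing the returns is a common variance-reduction trick, not part of the formula above):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE: weight each log-prob by the return-to-go G_t."""
    returns = []
    G = 0.0
    for r in reversed(rewards):       # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    # Normalize returns across the episode (variance reduction)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Negated objective: gradient descent on this loss ascends J(theta)
    return -(torch.stack(log_probs) * returns).sum()
```

Calling `.backward()` on this loss produces exactly the policy-gradient estimate above, which is why no explicit gradient formula appears in the code.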

Actor-Critic

Uses a value function as a baseline to reduce the high variance of REINFORCE:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A(s_t, a_t) \right]$$

Advantage function: $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$

Approximated by the TD error: $A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$
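A one-step actor-critic update using the TD error as the advantage estimate might look like this (a sketch for a discrete action space; the network names are my assumptions):

```python
import torch
import torch.nn as nn

def actor_critic_losses(policy_net, value_net, state, action, reward,
                        next_state, done, gamma=0.99):
    """One-step actor-critic: the TD error serves as the advantage estimate."""
    value = value_net(state).squeeze(-1)
    with torch.no_grad():
        next_value = value_net(next_state).squeeze(-1)
        td_target = reward + gamma * next_value * (1 - done)
    advantage = (td_target - value).detach()   # stop-gradient on the baseline

    log_prob = torch.distributions.Categorical(
        logits=policy_net(state)).log_prob(action)
    actor_loss = -(log_prob * advantage).mean()
    critic_loss = nn.functional.mse_loss(value, td_target)
    return actor_loss, critic_loss
```

Detaching the advantage keeps the critic's gradient flowing only through the MSE term, so actor and critic can share one optimizer step without interfering.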

A3C / A2C

A3C (Asynchronous Advantage Actor-Critic): Multiple workers learn asynchronously in parallel

A2C (Advantage Actor-Critic): Synchronously averages gradients from all workers before updating

PPO (Proximal Policy Optimization)

Currently the most widely used policy gradient algorithm. Core idea: constrain the magnitude of policy updates for stable learning.

Clipped objective:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t(\theta) A_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio.

Full PPO objective:

$$L(\theta) = L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 H[\pi_\theta]$$

  • $L^{VF}$: Value function loss (MSE)
  • $H[\pi_\theta]$: Entropy bonus (encourages exploration)
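The clipped surrogate itself fits in a few lines. A sketch (negated so it can be minimized with gradient descent; the function name is mine):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP, negated for gradient descent."""
    ratio = torch.exp(log_probs_new - log_probs_old)   # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # min() removes the incentive to push the ratio outside [1-eps, 1+eps]
    return -torch.min(unclipped, clipped).mean()
```

Stable-Baselines3's PPO, shown next, implements this same objective together with the value and entropy terms: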
from stable_baselines3 import PPO
import gymnasium as gym

env = gym.make("CartPole-v1")
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,       # epsilon: clip ratio
    ent_coef=0.01,        # entropy bonus coefficient
    verbose=1
)
model.learn(total_timesteps=100_000)

obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()

4. Advanced Methods

SAC (Soft Actor-Critic)

SAC is a maximum entropy RL framework that simultaneously maximizes reward and entropy:

$$\pi^* = \arg\max_\pi \mathbb{E} \left[ \sum_t \gamma^t \left( r_t + \alpha \mathcal{H}[\pi(\cdot|s_t)] \right) \right]$$

$\alpha$ is the temperature parameter controlling the degree of exploration.

Soft Bellman equation:

$$Q(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1}} \left[ V(s_{t+1}) \right]$$

$$V(s_t) = \mathbb{E}_{a \sim \pi} \left[ Q(s_t, a) - \alpha \log \pi(a|s_t) \right]$$

SAC excels at continuous action space problems and is less sensitive to hyperparameters via automatic temperature tuning.

from stable_baselines3 import SAC
import gymnasium as gym

env = gym.make("HalfCheetah-v4")
model = SAC(
    "MlpPolicy", env,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    learning_starts=10_000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    verbose=1
)
model.learn(total_timesteps=1_000_000)

TD3 (Twin Delayed Deep Deterministic)

Three techniques to fix DDPG's overestimation:

  1. Twin Critics: Take the minimum of two Q-networks
  2. Delayed Policy Updates: Update actor less frequently than critic
  3. Target Policy Smoothing: Add noise to target actions

$$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a}'), \quad \tilde{a}' = \text{clip}\left(\pi_{\phi'}(s') + \epsilon,\; a_{low},\; a_{high}\right)$$
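A sketch of that target computation, showing target policy smoothing and the twin-critic minimum (delayed actor updates happen in the outer training loop; the network arguments are assumed callables, and the hyperparameters follow common TD3 defaults):

```python
import torch

def td3_target(critic1_target, critic2_target, actor_target,
               rewards, next_states, dones,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0):
    """TD3 target: smoothed target action, min over the twin target critics."""
    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action
        noise = (torch.randn_like(actor_target(next_states)) * noise_std
                 ).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(next_states) + noise).clamp(a_low, a_high)
        # Twin critics: take the pessimistic (minimum) Q estimate
        q1 = critic1_target(next_states, a_next)
        q2 = critic2_target(next_states, a_next)
        q_min = torch.min(q1, q2).squeeze(-1)
        return rewards + gamma * q_min * (1 - dones)
```

Taking the minimum of the two critics is the direct counter to the overestimation that plagues single-critic DDPG.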

HER (Hindsight Experience Replay)

Reuses failed experiences in sparse reward environments by retroactively replacing the goal with the state actually reached:

  • Original: $(s_t, a_t, r_t, s_{t+1}, g)$, where the agent failed to reach goal $g$
  • HER: $(s_t, a_t, r'_t, s_{t+1}, g')$ with $g' = s_T$ (the final state actually reached) and the reward $r'_t$ recomputed for the new goal

Goal replacement strategies:

  • future: Random future state from same episode (default)
  • episode: Any random state from same episode
  • final: Last state of the episode
from stable_baselines3 import HerReplayBuffer, SAC
import gymnasium as gym  # Fetch environments also require the gymnasium-robotics package

env = gym.make("FetchReach-v2")
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
)
model.learn(total_timesteps=100_000)

5. RLHF: Aligning LLMs with RL

InstructGPT Pipeline

RLHF (Reinforcement Learning from Human Feedback), the core of ChatGPT, consists of 3 stages:

Stage 1: SFT (Supervised Fine-Tuning)

  • Fine-tune the base model on high-quality demonstration data
  • Human experts write ideal responses

Stage 2: Reward Model Training

  • Humans choose the better of two responses (collecting preference data)
  • Train reward model on preference data

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]$$

$y_w$ is the preferred response, $y_l$ is the rejected response.

Stage 3: PPO Fine-Tuning

  • Apply PPO with reward model as environment and LLM as agent
  • KL penalty keeps the model from drifting too far from the reference

$$\text{reward}(x, y) = r_\phi(x, y) - \beta \cdot \text{KL}\!\left[\pi_\theta(y|x) \,\|\, \pi_{ref}(y|x)\right]$$

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model_name="gpt2"):
        super().__init__()
        from transformers import AutoModel
        self.backbone = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state[:, -1, :]
        return self.reward_head(last_hidden).squeeze(-1)

def compute_reward_model_loss(reward_model, chosen_ids, chosen_mask,
                               rejected_ids, rejected_mask):
    """Bradley-Terry model preference loss"""
    chosen_reward = reward_model(chosen_ids, chosen_mask)
    rejected_reward = reward_model(rejected_ids, rejected_mask)
    loss = -torch.log(torch.sigmoid(chosen_reward - rejected_reward)).mean()
    return loss

PPO fine-tuning with TRL library:

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

ppo_config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=16,
    gradient_accumulation_steps=4,
    target_kl=0.1,
    kl_penalty="kl",
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(ppo_config.model_name)
tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)
ppo_trainer = PPOTrainer(ppo_config, model, ref_model=None, tokenizer=tokenizer)

for batch in dataloader:  # dataloader yields batches of tokenized prompts
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=200)
    # Sketch: in practice, concatenate prompt + response, tokenize, and pass
    # input_ids/attention_mask to the reward model defined above
    rewards = [reward_model(q, r) for q, r in zip(query_tensors, response_tensors)]
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

DPO (Direct Preference Optimization)

DPO aligns LLMs directly from preference data without a separate reward model. It replaces the complex RL loop with a simple classification-style loss:

$$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]$$

import torch.nn.functional as F

def dpo_loss(pi_logps_chosen, pi_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """
    DPO loss function
    Args:
        pi_logps_chosen:    Policy log-prob for preferred response
        pi_logps_rejected:  Policy log-prob for rejected response
        ref_logps_chosen:   Reference model log-prob for preferred response
        ref_logps_rejected: Reference model log-prob for rejected response
        beta: KL penalty strength
    """
    pi_log_ratio = pi_logps_chosen - pi_logps_rejected
    ref_log_ratio = ref_logps_chosen - ref_logps_rejected
    # Implicit reward difference
    logits = beta * (pi_log_ratio - ref_log_ratio)
    loss = -F.logsigmoid(logits).mean()
    return loss

DPO vs RLHF comparison:

| Aspect | RLHF + PPO | DPO |
|---|---|---|
| Reward model | Separate training required | Not needed |
| Models in memory | Actor + Critic + RM + Reference | Policy + Reference |
| Training stability | RL instability present | Supervised-learning level |
| Implementation complexity | High | Low |

6. Multi-Agent Reinforcement Learning

Environment Types

| Type | Description | Examples |
|---|---|---|
| Fully cooperative | Shared reward, team goal | Robot team tasks |
| Fully competitive | Zero-sum game | Chess, Go |
| Mixed | Cooperation + competition | Soccer, MOBA |

MADDPG (Multi-Agent DDPG)

Centralized Training with Decentralized Execution (CTDE) paradigm:

  • At execution: Each agent acts using only its own observation
  • At training: Critics leverage all agents' observations and actions

$$Q_i^\mu(x, a_1, \ldots, a_N)$$

where $x = (o_1, \ldots, o_N)$ is the global state and $a_i$ is agent $i$'s action.

class MADDPGCritic(nn.Module):
    """Centralized critic using all agents' information"""
    def __init__(self, n_agents, obs_dim, action_dim):
        super().__init__()
        input_dim = n_agents * (obs_dim + action_dim)
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, all_obs, all_actions):
        x = torch.cat([all_obs, all_actions], dim=-1)
        return self.net(x)

OpenSpiel

Google DeepMind's multi-agent RL framework supporting chess, Go, poker, and more:

import pyspiel

game = pyspiel.load_game("tic_tac_toe")
state = game.new_initial_state()

while not state.is_terminal():
    legal_actions = state.legal_actions()
    action = legal_actions[0]  # In practice, choose via policy
    state.apply_action(action)

returns = state.returns()
print(f"Player 0 reward: {returns[0]}")
print(f"Player 1 reward: {returns[1]}")

7. Training Environments

Gymnasium (OpenAI Gym successor)

import gymnasium as gym
import numpy as np

class NormalizedObsWrapper(gym.ObservationWrapper):
    """Normalizes observations to [-1, 1]"""
    def __init__(self, env):
        super().__init__(env)
        self.obs_low = env.observation_space.low
        self.obs_high = env.observation_space.high

    def observation(self, obs):
        normalized = (
            2.0 * (obs - self.obs_low)
            / (self.obs_high - self.obs_low + 1e-8)
            - 1.0
        )
        return normalized.astype(np.float32)

class RewardShapingWrapper(gym.RewardWrapper):
    """Scales rewards"""
    def __init__(self, env, scale=0.01):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return reward * self.scale

base_env = gym.make("LunarLander-v2")
env = RewardShapingWrapper(NormalizedObsWrapper(base_env))
obs, info = env.reset(seed=42)

MuJoCo

Physics-based continuous control environments essential for robotics research:

  • HalfCheetah-v4: Cheetah robot running (17-dim state, 6-dim action)
  • Humanoid-v4: Humanoid robot walking (376-dim state, 17-dim action)
  • Ant-v4: Quadruped robot (111-dim state, 8-dim action)

Approximate average return after 1M training steps:

| Environment | SAC | TD3 | PPO |
|---|---|---|---|
| HalfCheetah | ~12000 | ~9000 | ~3000 |
| Ant | ~5500 | ~4000 | ~1500 |
| Humanoid | ~5000 | ~4500 | ~600 |

Isaac Gym / Isaac Lab

NVIDIA's GPU-accelerated physics simulator running thousands of environments in parallel:

  • 2000x faster training than CPU simulation
  • Domain Randomization for better sim-to-real transfer
  • Train robots in hours rather than weeks
# Isaac Lab example (simplified)
from isaaclab.envs import DirectRLEnvCfg

class CartpoleEnvCfg(DirectRLEnvCfg):
    num_envs = 4096          # 4096 parallel environments
    episode_length_s = 5.0
    decimation = 2
    action_scale = 100.0

# 4096 CartPole environments running in parallel on GPU
# obs.shape: [4096, obs_dim]

Real-World RL Deployment Considerations

  1. Safe RL: Prevent dangerous actions during training — Constrained Policy Optimization (CPO)
  2. Sample efficiency: Real-world data is expensive and slow — use offline RL or model-based RL
  3. Sim-to-Real transfer: Domain Randomization and adaptation layers (RMA) to close the Reality Gap
  4. Offline RL: Learn from pre-collected datasets only — Conservative Q-Learning (CQL), IQL

Quiz: Test Your RL Knowledge

Q1. Explain the on-policy vs off-policy difference between Q-Learning and SARSA.

Answer: Q-Learning is off-policy; SARSA is on-policy.

Explanation: Q-Learning's update target is $r + \gamma \max_{a'} Q(s', a')$, using the greedy maximum regardless of the action actually taken. SARSA's target $r + \gamma Q(s', a')$ uses $a'$, the action actually taken. In the CliffWalking problem, Q-Learning learns the optimal edge path but often falls off during exploration; SARSA learns a safer, longer detour.

Q2. What role does the epsilon clip ratio hyperparameter play in PPO's training stability?

Answer: Epsilon controls the conservatism of policy updates.

Explanation: When $r_t(\theta) = \pi_\theta / \pi_{old}$ falls outside $[1-\epsilon, 1+\epsilon]$, the objective is clipped and the gradient is zeroed. If $\epsilon$ is too small, learning slows and can get stuck in local optima. If it is too large, the policy changes drastically and becomes unstable. $\epsilon = 0.2$ is a reliable default; annealing it toward zero during training is also effective.

Q3. How is training data for the reward model collected in RLHF?

Answer: Human annotators compare pairs of responses and indicate which is preferred.

Explanation: Annotators are shown two responses $(y_1, y_2)$ to the same prompt and select the better one. Pairwise comparison is more consistent and reliable than absolute ratings (e.g., 1-5 stars). The collected preference data $(x, y_w, y_l)$ is used to train a Bradley-Terry reward model. Anthropic's Constitutional AI uses RLAIF, where an AI model itself plays the role of the preference annotator.

Q4. Why is DPO simpler to implement than PPO-based RLHF?

Answer: DPO requires neither a separate reward model nor an RL training loop.

Explanation: RLHF+PPO requires three stages (SFT, reward model training, PPO fine-tuning) and simultaneously loads multiple models (actor, critic, reward model, reference model) into GPU memory. DPO integrates the implicit reward directly into the loss formula from preference data, enabling fine-tuning with a single cross-entropy-style loss. Only two models are needed (policy + reference), and there is no RL-specific instability.

Q5. What role does entropy regularization play in Soft Actor-Critic?

Answer: Encourages exploration and learns robust policies across multiple optima.

Explanation: Adding the entropy term $\alpha \mathcal{H}[\pi(\cdot|s)]$ to SAC's objective makes the agent maximize reward while maintaining action randomness. This provides an automatic exploration mechanism, helping escape local optima. When multiple actions are equally good, the policy learns to choose among them uniformly, resulting in more robust behavior. The temperature $\alpha$ is automatically tuned to match a target entropy level.


Algorithm Comparison Summary

| Algorithm | Policy Type | Action Space | Key Feature |
|---|---|---|---|
| Q-Learning | Off-policy | Discrete | Simple, table-based |
| DQN | Off-policy | Discrete | Neural net + experience replay + target network |
| Double DQN | Off-policy | Discrete | Reduces overestimation |
| Dueling DQN | Off-policy | Discrete | V + A decomposition |
| REINFORCE | On-policy | Discrete/Cont. | High variance |
| A2C | On-policy | Discrete/Cont. | Actor-Critic |
| PPO | On-policy | Discrete/Cont. | Stable, general-purpose |
| SAC | Off-policy | Continuous | Max entropy |
| TD3 | Off-policy | Continuous | DDPG successor, twin critics |
| HER | Off-policy | Goal-based | Sparse rewards |

Conclusion

Reinforcement learning is expanding explosively beyond game-playing AI into robotics, LLM alignment, autonomous driving, and drug discovery. RLHF and DPO in particular are core alignment technologies for LLMs like ChatGPT, Claude, and Gemini, making RL literacy an essential skill for modern AI practitioners.

Recommended learning path:

  1. Implement Q-Learning and DQN from scratch with Gymnasium
  2. Experiment with PPO and SAC via Stable-Baselines3
  3. Practice RLHF/DPO with the TRL library
  4. Explore robotics RL with Isaac Lab

References:

  • Sutton & Barto, "Reinforcement Learning: An Introduction" (2nd ed.)
  • Spinning Up in Deep RL (OpenAI)
  • Stable-Baselines3 documentation
  • TRL (Transformer Reinforcement Learning) by Hugging Face
  • Isaac Lab documentation