Author: Youngju Kim (@fjvbn20031)
## Overview
Throughout this series, we have explored a range of deep reinforcement learning algorithms. In this final post, we organize all of these methods systematically, offer guidance on which algorithm to choose in which situation, and introduce actively researched frontiers along with learning resources.
## Algorithm Classification System
### Overall Taxonomy
Deep RL algorithms fall broadly into four families:
1. Value-Based
   - Learn the value of states or actions
   - Derive the policy from those values (typically greedy selection)
   - Representative: DQN, Double DQN, Dueling DQN, Rainbow
2. Policy-Based
   - Directly parameterize and learn the policy
   - Representative: REINFORCE, Evolution Strategies, Genetic Algorithms
3. Actor-Critic
   - Learn a policy (actor) and a value function (critic) simultaneously
   - Representative: A2C, A3C, PPO, TRPO, ACKTR, DDPG, SAC
4. Model-Based
   - Learn a model of the environment and use it for planning
   - Representative: I2A, World Models, Dreamer, MuZero
## Value-Based Methods Summary
### DQN Family
```python
# Pseudocode showing key differences in the DQN family
import torch

def dqn_target(reward, next_state, done, gamma, q_network, target_network):
    """Basic DQN: compute the max Q-value with the target network"""
    with torch.no_grad():
        max_q = target_network(next_state).max(dim=-1)[0]
        target = reward + gamma * (1 - done) * max_q
    return target

def double_dqn_target(reward, next_state, done, gamma,
                      q_network, target_network):
    """Double DQN: separate action selection and evaluation"""
    with torch.no_grad():
        # Select the action with the main network
        best_actions = q_network(next_state).argmax(dim=-1)
        # Evaluate its value with the target network
        q_values = target_network(next_state)
        max_q = q_values.gather(1, best_actions.unsqueeze(1)).squeeze()
        target = reward + gamma * (1 - done) * max_q
    return target

def dueling_network_forward(features, advantage_stream, value_stream):
    """Dueling DQN: separate value and advantage"""
    value = value_stream(features)          # V(s)
    advantage = advantage_stream(features)  # A(s, a)
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    q_values = value + advantage - advantage.mean(dim=-1, keepdim=True)
    return q_values
```
### Value-Based Methods Comparison
| Algorithm | Key Improvement | Discrete/Continuous | Main Advantage |
|---|---|---|---|
| DQN | Experience replay + target network | Discrete | Stable learning |
| Double DQN | Reduced overestimation bias | Discrete | Accurate Q-values |
| Dueling DQN | Separated V and A | Discrete | State value learning efficiency |
| Prioritized ER | Priority learning of important experiences | Discrete | Sample efficiency |
| Noisy DQN | Parameter noise exploration | Discrete | Adaptive exploration |
| Categorical DQN | Return distribution learning | Discrete | Stability, rich signal |
| Rainbow | Integration of all above techniques | Discrete | Best performance |
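The overestimation bias that Double DQN addresses can be seen in a toy experiment (our own sketch, not from the original post): when Q-estimates are noisy, taking a max over them is biased upward, while selecting with one estimate and evaluating with an independent one removes most of the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_actions = 10_000, 4

true_q = np.zeros(num_actions)  # all actions equally worthless: true max is 0
noise = rng.normal(0.0, 1.0, size=(n, num_actions))    # noisy Q-estimates
noise2 = rng.normal(0.0, 1.0, size=(n, num_actions))   # independent estimates

# Plain max (DQN-style): max over noisy estimates overestimates the true max.
dqn_estimate = (true_q + noise).max(axis=1).mean()

# Double estimator: select the argmax with one set of estimates,
# evaluate it with an independent set (Double DQN's trick).
best = (true_q + noise).argmax(axis=1)
double_estimate = (true_q + noise2)[np.arange(n), best].mean()

print(f"single-max estimate: {dqn_estimate:.3f}")   # well above 0
print(f"double estimate:     {double_estimate:.3f}")  # close to 0
```

The single-max estimator lands noticeably above the true value of zero, while the double estimator stays near it, which is exactly the "accurate Q-values" advantage listed in the table.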
## Policy-Based Methods Summary
### REINFORCE and Variants
```python
import torch

def reinforce_loss(log_probs, returns):
    """Basic REINFORCE: high variance"""
    return -(log_probs * returns).mean()

def reinforce_with_baseline(log_probs, returns, values):
    """REINFORCE with baseline: reduced variance"""
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + 0.5 * value_loss

def ppo_clipped_loss(log_probs, old_log_probs, advantages,
                     clip_epsilon=0.2):
    """PPO: stable policy updates"""
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_epsilon,
                        1 + clip_epsilon) * advantages
    return -torch.min(surr1, surr2).mean()
```
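The variance reduction a baseline provides is easy to verify numerically (a toy sketch with our own made-up numbers): when returns cluster around a large constant, subtracting the mean return shrinks the variance of the gradient estimate dramatically without changing its expectation.

```python
import torch

torch.manual_seed(0)

returns = 10.0 + torch.randn(10_000)          # returns clustered around 10
grad_dirs = torch.sign(torch.randn(10_000))   # stand-in for per-sample grad of log-prob

plain = grad_dirs * returns                           # raw REINFORCE term
baselined = grad_dirs * (returns - returns.mean())    # mean-return baseline

print(f"variance without baseline: {plain.var():.1f}")
print(f"variance with baseline:    {baselined.var():.1f}")
```

Both estimators have the same mean (near zero here), but the baselined one has roughly 100x lower variance in this setup, which is why REINFORCE with a baseline converges so much faster in practice.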
## Actor-Critic Methods Summary
### On-Policy vs Off-Policy
| Property | On-Policy | Off-Policy |
|---|---|---|
| Data usage | Only from current policy | Can reuse past data |
| Sample efficiency | Low | High |
| Stability | High | Relatively lower |
| Representative algorithms | A2C, PPO, TRPO | DDPG, SAC, TD3 |
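The sample-efficiency advantage of off-policy methods in the table comes from the replay buffer, which lets them reuse past transitions. A minimal sketch (class and method names are ours, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions so off-policy learners can reuse past data."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # Transpose the list of transitions into tuples of states, actions, ...
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=1000)
for t in range(100):
    buffer.push(t, 0, 1.0, t + 1, False)

states, actions, rewards, next_states, dones = buffer.sample(32)
print(len(states))  # 32
```

On-policy methods like A2C and PPO cannot use such a buffer directly, because their gradient estimates assume the data was generated by the current policy.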
### SAC (Soft Actor-Critic)
SAC balances exploration and exploitation automatically by maximizing policy entropy alongside the reward:
```python
import copy

import torch
import torch.nn as nn

class SACAgent:
    """SAC: Maximum entropy reinforcement learning"""

    def __init__(self, obs_size, act_size, hidden=256,
                 lr=3e-4, gamma=0.99, tau=0.005):
        self.gamma = gamma
        self.tau = tau
        # Twin Q-networks with Polyak-averaged target copies
        self.q1 = self._make_q(obs_size, act_size, hidden)
        self.q2 = self._make_q(obs_size, act_size, hidden)
        self.q1_target = copy.deepcopy(self.q1)
        self.q2_target = copy.deepcopy(self.q2)
        self.actor = self._make_actor(obs_size, act_size, hidden)
        # Learnable temperature for automatic entropy tuning
        self.log_alpha = torch.zeros(1, requires_grad=True)
        self.target_entropy = -act_size
        self.q1_opt = torch.optim.Adam(self.q1.parameters(), lr=lr)
        self.q2_opt = torch.optim.Adam(self.q2.parameters(), lr=lr)
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.alpha_opt = torch.optim.Adam([self.log_alpha], lr=lr)

    def _make_q(self, obs_size, act_size, hidden):
        return nn.Sequential(
            nn.Linear(obs_size + act_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def _make_actor(self, obs_size, act_size, hidden):
        return GaussianActor(obs_size, act_size, hidden)

    @property
    def alpha(self):
        return self.log_alpha.exp()

    def update(self, batch):
        states, actions, rewards, next_states, dones = batch

        # Critic update: entropy-regularized TD target with clipped double-Q
        with torch.no_grad():
            next_actions, next_log_probs = self.actor.sample(next_states)
            q1_next = self.q1_target(torch.cat([next_states, next_actions], -1))
            q2_next = self.q2_target(torch.cat([next_states, next_actions], -1))
            q_next = torch.min(q1_next, q2_next)
            target = rewards + self.gamma * (1 - dones) * (
                q_next - self.alpha * next_log_probs
            )
        q1_val = self.q1(torch.cat([states, actions], -1))
        q2_val = self.q2(torch.cat([states, actions], -1))
        q1_loss = (q1_val - target).pow(2).mean()
        q2_loss = (q2_val - target).pow(2).mean()
        self.q1_opt.zero_grad(); q1_loss.backward(); self.q1_opt.step()
        self.q2_opt.zero_grad(); q2_loss.backward(); self.q2_opt.step()

        # Actor update: maximize Q plus entropy
        new_actions, log_probs = self.actor.sample(states)
        q1_new = self.q1(torch.cat([states, new_actions], -1))
        q2_new = self.q2(torch.cat([states, new_actions], -1))
        q_new = torch.min(q1_new, q2_new)
        actor_loss = (self.alpha.detach() * log_probs - q_new).mean()
        self.actor_opt.zero_grad(); actor_loss.backward(); self.actor_opt.step()

        # Temperature update: drive policy entropy toward the target
        alpha_loss = -(self.log_alpha
                       * (log_probs.detach() + self.target_entropy)).mean()
        self.alpha_opt.zero_grad(); alpha_loss.backward(); self.alpha_opt.step()

        self._soft_update(self.q1, self.q1_target)
        self._soft_update(self.q2, self.q2_target)

    def _soft_update(self, source, target):
        for s, t in zip(source.parameters(), target.parameters()):
            t.data.copy_(self.tau * s.data + (1 - self.tau) * t.data)

class GaussianActor(nn.Module):
    def __init__(self, obs_size, act_size, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_size)
        self.log_std = nn.Linear(hidden, act_size)

    def forward(self, obs):
        features = self.net(obs)
        mu = self.mu(features)
        log_std = self.log_std(features).clamp(-20, 2)
        return mu, log_std

    def sample(self, obs):
        mu, log_std = self.forward(obs)
        std = log_std.exp()
        dist = torch.distributions.Normal(mu, std)
        z = dist.rsample()          # reparameterization trick
        action = torch.tanh(z)      # squash into [-1, 1]
        # Correct the log-density for the tanh squashing
        log_prob = (dist.log_prob(z)
                    - torch.log(1 - action.pow(2) + 1e-6)).sum(dim=-1, keepdim=True)
        return action, log_prob
```
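The `_soft_update` step above is a Polyak average; its effect can be checked in isolation (a toy sketch on two small linear layers):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tau = 0.005

source, target = nn.Linear(4, 4), nn.Linear(4, 4)
before = [p.detach().clone() for p in target.parameters()]

with torch.no_grad():
    for s, t in zip(source.parameters(), target.parameters()):
        t.copy_(tau * s + (1 - tau) * t)

# After one update the target has moved only a tiny (tau-sized) step toward
# the source, which keeps the bootstrapped Q-targets slowly moving and stable.
drift = max((t - b).abs().max().item()
            for b, t in zip(before, target.parameters()))
print(f"max parameter drift: {drift:.5f}")
```

With tau = 0.005, each parameter moves exactly 0.5% of the way toward the source network per update, so the targets lag the online networks by design.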
## Comprehensive Comparison Table
### Algorithm Selection Guide
| Scenario | Recommended Algorithm | Reason |
|---|---|---|
| Discrete actions, offline data available | DQN/Rainbow | Leverages replay buffer |
| Discrete actions, fast prototyping | A2C | Simple, fast experiments |
| Continuous actions, stability priority | PPO | Easy and stable |
| Continuous actions, sample efficiency priority | SAC | Off-policy + auto exploration |
| Continuous actions, deterministic policy needed | DDPG/TD3 | Deterministic policy |
| Board games | AlphaZero | MCTS + self-play |
| Non-differentiable reward | ES/GA | No gradient needed |
| Simulator available, limited samples | Dreamer/MuZero | Model-based efficiency |
### Hyperparameter Sensitivity
```python
# Key hyperparameters and typical values for each algorithm
hyperparams = {
    'DQN': {
        'lr': 1e-4,
        'batch_size': 32,
        'buffer_size': 1000000,
        'target_update_freq': 1000,
        'epsilon_decay': 'linear to 0.01 over 1M steps',
        'sensitivity': 'medium',
    },
    'PPO': {
        'lr': 3e-4,
        'clip_epsilon': 0.2,
        'num_epochs': 10,
        'batch_size': 64,
        'gae_lambda': 0.95,
        'entropy_coef': 0.01,
        'sensitivity': 'low',
    },
    'SAC': {
        'lr': 3e-4,
        'batch_size': 256,
        'buffer_size': 1000000,
        'tau': 0.005,
        'auto_alpha': True,
        'sensitivity': 'low',
    },
    'DDPG': {
        'lr_actor': 1e-4,
        'lr_critic': 1e-3,
        'batch_size': 256,
        'tau': 0.005,
        'noise_type': 'OU or Gaussian',
        'sensitivity': 'high',
    },
}
```
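The "linear to 0.01 over 1M steps" schedule listed for DQN can be written as a small helper (a sketch; the function name and defaults are ours):

```python
def linear_epsilon(step, start=1.0, end=0.01, decay_steps=1_000_000):
    """Linearly anneal the exploration rate from start to end, then hold."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

print(linear_epsilon(0))          # starts at 1.0 (fully random)
print(linear_epsilon(500_000))    # halfway through the anneal: 0.505
print(linear_epsilon(2_000_000))  # clamped at the final value of 0.01
```

The agent then takes a random action with probability `linear_epsilon(step)` and the greedy action otherwise.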
## Current Research Frontiers
### Offline RL (Batch Reinforcement Learning)
Offline RL learns exclusively from a fixed, pre-collected dataset, extracting as much value as possible from existing data without any additional environment interaction.
```python
import torch

class ConservativeQLearning:
    """CQL: Core idea of Conservative Q-Learning"""

    def compute_cql_loss(self, q_network, states, actions, alpha=1.0):
        td_loss = self.compute_td_loss(q_network, states, actions)
        # Penalize Q-values of random (likely out-of-distribution) actions
        random_actions = torch.rand_like(actions) * 2 - 1
        random_q = q_network(states, random_actions)
        data_q = q_network(states, actions)
        cql_penalty = (
            torch.logsumexp(random_q, dim=0).mean() - data_q.mean()
        )
        return td_loss + alpha * cql_penalty
```
Key algorithms: CQL, IQL, Decision Transformer, Diffusion Policy
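The CQL penalty can be sanity-checked on toy tensors (our own toy values, not the full CQL loss): a critic that assigns inflated values to out-of-distribution random actions incurs a much larger penalty than a conservative one.

```python
import torch

# Q-values the critic assigns to random (out-of-distribution) actions
# versus to actions actually present in the dataset.
random_q_optimistic = torch.tensor([5.0, 6.0, 5.5])    # inflates unseen actions
random_q_conservative = torch.tensor([0.1, 0.2, 0.0])
data_q = torch.tensor([1.0, 1.2, 0.9])

def cql_penalty(random_q, data_q):
    # Pushes down Q on sampled actions, pushes up Q on dataset actions.
    return torch.logsumexp(random_q, dim=0) - data_q.mean()

print(cql_penalty(random_q_optimistic, data_q))    # large positive penalty
print(cql_penalty(random_q_conservative, data_q))  # small penalty
```

Minimizing this term therefore drives the critic to stay pessimistic about actions the dataset never demonstrates, which is the core idea behind CQL's safety in the offline setting.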
### Multi-Agent RL
Environments where multiple agents learn simultaneously:
- Cooperative: Teammates pursue a common goal
- Competitive: Agents compete against each other
- Mixed: Cooperation and competition coexist
Key challenges: Non-stationarity, communication, credit assignment
### Safe RL
Methods that satisfy safety constraints while maximizing rewards:
```python
class SafeRLObjective:
    """Objective for constraint-based safe RL"""

    def compute_objective(self, policy, states):
        expected_reward = self.estimate_reward(policy, states)
        expected_cost = self.estimate_cost(policy, states)
        cost_limit = 25.0
        # Lagrangian relaxation: trade reward against constraint violation
        lagrangian = (expected_reward
                      - self.lambda_multiplier
                      * (expected_cost - cost_limit))
        return lagrangian
```
Key algorithms: CPO (Constrained Policy Optimization), WCSAC, SafeOpt
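The multiplier in the Lagrangian above is typically adjusted by dual ascent; a minimal sketch with our own toy values (the function name and learning rate are ours):

```python
def update_lambda(lam, expected_cost, cost_limit, lr=0.01):
    """Dual ascent: raise the multiplier when the constraint is violated,
    lower it (but never below zero) when the policy is safely within limits."""
    return max(0.0, lam + lr * (expected_cost - cost_limit))

lam0 = 0.0
# Constraint violated: cost 40 exceeds the limit of 25, so lambda grows,
# making the safety term weigh more heavily in the objective.
lam1 = update_lambda(lam0, expected_cost=40.0, cost_limit=25.0)
print(lam1)  # grows above zero
# Constraint satisfied: lambda shrinks back toward zero.
lam2 = update_lambda(lam1, expected_cost=10.0, cost_limit=25.0)
print(lam2)  # back to zero
```

This alternation between policy updates and multiplier updates is the basic pattern behind Lagrangian-based safe RL methods.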
### Other Research Directions
- Meta-RL: Agents that quickly adapt to new tasks
- Hierarchical RL: Separate high-level/low-level policies for long-term planning
- Representation learning: Automatically learning good state representations
- LLM + RL: Leveraging LLM reasoning capabilities in RL
## Learning Roadmap
### Recommended Learning Order
1. Foundations (1-2 weeks)
   - Understanding MDPs, Bellman equations
   - Dynamic programming (policy/value iteration)
   - Exploration-exploitation dilemma
2. Value-Based (2-3 weeks)
   - Implement Q-learning
   - Implement DQN and run Atari experiments
   - Understand Double/Dueling DQN
3. Policy-Based (2-3 weeks)
   - Implement REINFORCE
   - Understand and implement A2C/A3C
   - Implement PPO (most important)
4. Continuous Actions (1-2 weeks)
   - Implement DDPG
   - Implement SAC
   - MuJoCo/PyBullet experiments
5. Advanced (2-4 weeks)
   - Model-based RL (Dreamer)
   - Multi-agent RL
   - Offline RL
   - Practical project
### Key Implementation Frameworks
```python
# Major RL libraries

# 1. Stable-Baselines3: verified implementations, fast experiments
# pip install stable-baselines3
from stable_baselines3 import PPO, SAC, DQN

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100000)

# 2. CleanRL: single-file implementations, best for education
#    Each algorithm is fully implemented in one file

# 3. RLlib (Ray): distributed training, production level
# pip install "ray[rllib]"

# 4. Tianshou: PyTorch-based, flexible structure
# pip install tianshou
```
## Series Retrospective
A chronological summary of what was covered in this series:
| Post # | Topic | Key Algorithms/Concepts |
|---|---|---|
| 11 | A3C | Async parallel learning, data/gradient parallelism |
| 12 | Chatbot RL | Seq2Seq, SCST, reward design |
| 13 | Web Navigation | MiniWoB, grid action space, human demonstrations |
| 14 | Continuous Actions | DDPG, OU noise, distributional policy (D4PG) |
| 15 | Trust Region | PPO, TRPO, ACKTR |
| 16 | Black-Box | Evolution Strategies (ES), Genetic Algorithms (GA) |
| 17 | Model-Based | I2A, environment model, rollout encoder |
| 18 | AlphaGo Zero | MCTS, self-play, Connect4 implementation |
| 19 | Applications | Robotics, autonomous driving, recommendations, RLHF |
| 20 | Summary | Algorithm comparison, selection guide |
## Conclusion
Deep reinforcement learning is a rapidly evolving field. The fundamental algorithms covered in this series form the essential foundation for understanding current cutting-edge research.
Three most important pieces of advice:
- Implement it yourself: You can only truly understand an algorithm by writing the code yourself
- Start with simple environments: What does not work on CartPole will not work in complex environments either
- Always compare baselines: Verify that new methods are truly better than simple ones
The journey of reinforcement learning is never-ending. We hope this series serves as a starting point.