Author: Youngju Kim (@fjvbn20031)
## Overview
Throughout this series, we have explored a range of deep reinforcement learning algorithms. In this final post, we organize all of these methods systematically, offer guidance on which algorithm to choose in which situation, and introduce actively researched frontiers along with learning resources.
## Algorithm Classification System
### Overall Taxonomy
Deep RL algorithms fall broadly into four families:
1. Value-Based
   - Learn the value of states or actions
   - Derive the policy from those values (typically greedy selection)
   - Representative: DQN, Double DQN, Dueling DQN, Rainbow
2. Policy-Based
   - Directly parameterize and learn the policy
   - Representative: REINFORCE, Evolution Strategies, Genetic Algorithms
3. Actor-Critic
   - Learn a policy (actor) and a value function (critic) simultaneously
   - Representative: A2C, A3C, PPO, TRPO, ACKTR, DDPG, SAC
4. Model-Based
   - Learn a model of the environment and use it for planning
   - Representative: I2A, World Models, Dreamer, MuZero
## Value-Based Methods Summary
### DQN Family
```python
# Pseudocode showing key differences in the DQN family
import torch

def dqn_target(reward, next_state, done, gamma, q_network, target_network):
    """Basic DQN: compute the max Q-value with the target network"""
    with torch.no_grad():
        max_q = target_network(next_state).max(dim=-1)[0]
        target = reward + gamma * (1 - done) * max_q
    return target

def double_dqn_target(reward, next_state, done, gamma,
                      q_network, target_network):
    """Double DQN: separate action selection and evaluation"""
    with torch.no_grad():
        # Select the action with the main network
        best_actions = q_network(next_state).argmax(dim=-1)
        # Evaluate its value with the target network
        q_values = target_network(next_state)
        max_q = q_values.gather(1, best_actions.unsqueeze(1)).squeeze()
        target = reward + gamma * (1 - done) * max_q
    return target

def dueling_network_forward(features, advantage_stream, value_stream):
    """Dueling DQN: separate value and advantage"""
    value = value_stream(features)          # V(s)
    advantage = advantage_stream(features)  # A(s, a)
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    q_values = value + advantage - advantage.mean(dim=-1, keepdim=True)
    return q_values
```
### Value-Based Methods Comparison
| Algorithm | Key Improvement | Discrete/Continuous | Main Advantage |
|---|---|---|---|
| DQN | Experience replay + target network | Discrete | Stable learning |
| Double DQN | Reduced overestimation bias | Discrete | Accurate Q-values |
| Dueling DQN | Separated V and A | Discrete | State value learning efficiency |
| Prioritized ER | Priority learning of important experiences | Discrete | Sample efficiency |
| Noisy DQN | Parameter noise exploration | Discrete | Adaptive exploration |
| Categorical DQN | Return distribution learning | Discrete | Stability, rich signal |
| Rainbow | Integration of all above techniques | Discrete | Best performance |
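The overestimation bias that Double DQN addresses can be seen in a toy experiment (our own sketch, not from the original post): when Q-estimates are noisy, taking a max over them is biased upward, while selecting with one estimate and evaluating with an independent one removes most of the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_actions = 10_000, 4

true_q = np.zeros(num_actions)  # all actions equally worthless: true max is 0
noise = rng.normal(0.0, 1.0, size=(n, num_actions))    # noisy Q-estimates
noise2 = rng.normal(0.0, 1.0, size=(n, num_actions))   # independent estimates

# Plain max (DQN-style): max over noisy estimates overestimates the true max.
dqn_estimate = (true_q + noise).max(axis=1).mean()

# Double estimator: select the argmax with one set of estimates,
# evaluate it with an independent set (Double DQN's trick).
best = (true_q + noise).argmax(axis=1)
double_estimate = (true_q + noise2)[np.arange(n), best].mean()

print(f"single-max estimate: {dqn_estimate:.3f}")   # well above 0
print(f"double estimate:     {double_estimate:.3f}")  # close to 0
```

The single-max estimator lands noticeably above the true value of zero, while the double estimator stays near it, which is exactly the "accurate Q-values" advantage listed in the table.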
## Policy-Based Methods Summary
### REINFORCE and Variants
```python
import torch

def reinforce_loss(log_probs, returns):
    """Basic REINFORCE: high variance"""
    return -(log_probs * returns).mean()

def reinforce_with_baseline(log_probs, returns, values):
    """REINFORCE with baseline: reduced variance"""
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + 0.5 * value_loss

def ppo_clipped_loss(log_probs, old_log_probs, advantages,
                     clip_epsilon=0.2):
    """PPO: stable policy updates"""
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_epsilon,
                        1 + clip_epsilon) * advantages
    return -torch.min(surr1, surr2).mean()
```
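The variance reduction a baseline provides is easy to verify numerically (a toy sketch with our own made-up numbers): when returns cluster around a large constant, subtracting the mean return shrinks the variance of the gradient estimate dramatically without changing its expectation.

```python
import torch

torch.manual_seed(0)

returns = 10.0 + torch.randn(10_000)          # returns clustered around 10
grad_dirs = torch.sign(torch.randn(10_000))   # stand-in for per-sample grad of log-prob

plain = grad_dirs * returns                           # raw REINFORCE term
baselined = grad_dirs * (returns - returns.mean())    # mean-return baseline

print(f"variance without baseline: {plain.var():.1f}")
print(f"variance with baseline:    {baselined.var():.1f}")
```

Both estimators have the same mean (near zero here), but the baselined one has roughly 100x lower variance in this setup, which is why REINFORCE with a baseline converges so much faster in practice.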
## Actor-Critic Methods Summary
### On-Policy vs Off-Policy
| Property | On-Policy | Off-Policy |
|---|---|---|
| Data usage | Only from current policy | Can reuse past data |
| Sample efficiency | Low | High |
| Stability | High | Relatively lower |
| Representative algorithms | A2C, PPO, TRPO | DDPG, SAC, TD3 |
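The sample-efficiency advantage of off-policy methods in the table comes from the replay buffer, which lets them reuse past transitions. A minimal sketch (class and method names are ours, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions so off-policy learners can reuse past data."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # Transpose the list of transitions into tuples of states, actions, ...
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=1000)
for t in range(100):
    buffer.push(t, 0, 1.0, t + 1, False)

states, actions, rewards, next_states, dones = buffer.sample(32)
print(len(states))  # 32
```

On-policy methods like A2C and PPO cannot use such a buffer directly, because their gradient estimates assume the data was generated by the current policy.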
### SAC (Soft Actor-Critic)
SAC balances exploration and exploitation automatically by maximizing policy entropy alongside the reward:
```python
import copy

import torch
import torch.nn as nn

class SACAgent:
    """SAC: Maximum entropy reinforcement learning"""

    def __init__(self, obs_size, act_size, hidden=256,
                 lr=3e-4, gamma=0.99, tau=0.005):
        self.gamma = gamma
        self.tau = tau
        # Twin Q-networks with Polyak-averaged target copies
        self.q1 = self._make_q(obs_size, act_size, hidden)
        self.q2 = self._make_q(obs_size, act_size, hidden)
        self.q1_target = copy.deepcopy(self.q1)
        self.q2_target = copy.deepcopy(self.q2)
        self.actor = self._make_actor(obs_size, act_size, hidden)
        # Learnable temperature for automatic entropy tuning
        self.log_alpha = torch.zeros(1, requires_grad=True)
        self.target_entropy = -act_size
        self.q1_opt = torch.optim.Adam(self.q1.parameters(), lr=lr)
        self.q2_opt = torch.optim.Adam(self.q2.parameters(), lr=lr)
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.alpha_opt = torch.optim.Adam([self.log_alpha], lr=lr)

    def _make_q(self, obs_size, act_size, hidden):
        return nn.Sequential(
            nn.Linear(obs_size + act_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def _make_actor(self, obs_size, act_size, hidden):
        return GaussianActor(obs_size, act_size, hidden)

    @property
    def alpha(self):
        return self.log_alpha.exp()

    def update(self, batch):
        states, actions, rewards, next_states, dones = batch

        # Critic update: entropy-regularized TD target with clipped double-Q
        with torch.no_grad():
            next_actions, next_log_probs = self.actor.sample(next_states)
            q1_next = self.q1_target(torch.cat([next_states, next_actions], -1))
            q2_next = self.q2_target(torch.cat([next_states, next_actions], -1))
            q_next = torch.min(q1_next, q2_next)
            target = rewards + self.gamma * (1 - dones) * (
                q_next - self.alpha * next_log_probs
            )
        q1_val = self.q1(torch.cat([states, actions], -1))
        q2_val = self.q2(torch.cat([states, actions], -1))
        q1_loss = (q1_val - target).pow(2).mean()
        q2_loss = (q2_val - target).pow(2).mean()
        self.q1_opt.zero_grad(); q1_loss.backward(); self.q1_opt.step()
        self.q2_opt.zero_grad(); q2_loss.backward(); self.q2_opt.step()

        # Actor update: maximize Q plus entropy
        new_actions, log_probs = self.actor.sample(states)
        q1_new = self.q1(torch.cat([states, new_actions], -1))
        q2_new = self.q2(torch.cat([states, new_actions], -1))
        q_new = torch.min(q1_new, q2_new)
        actor_loss = (self.alpha.detach() * log_probs - q_new).mean()
        self.actor_opt.zero_grad(); actor_loss.backward(); self.actor_opt.step()

        # Temperature update: drive policy entropy toward the target
        alpha_loss = -(self.log_alpha
                       * (log_probs.detach() + self.target_entropy)).mean()
        self.alpha_opt.zero_grad(); alpha_loss.backward(); self.alpha_opt.step()

        self._soft_update(self.q1, self.q1_target)
        self._soft_update(self.q2, self.q2_target)

    def _soft_update(self, source, target):
        for s, t in zip(source.parameters(), target.parameters()):
            t.data.copy_(self.tau * s.data + (1 - self.tau) * t.data)

class GaussianActor(nn.Module):
    def __init__(self, obs_size, act_size, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_size)
        self.log_std = nn.Linear(hidden, act_size)

    def forward(self, obs):
        features = self.net(obs)
        mu = self.mu(features)
        log_std = self.log_std(features).clamp(-20, 2)
        return mu, log_std

    def sample(self, obs):
        mu, log_std = self.forward(obs)
        std = log_std.exp()
        dist = torch.distributions.Normal(mu, std)
        z = dist.rsample()          # reparameterization trick
        action = torch.tanh(z)      # squash into [-1, 1]
        # Correct the log-density for the tanh squashing
        log_prob = (dist.log_prob(z)
                    - torch.log(1 - action.pow(2) + 1e-6)).sum(dim=-1, keepdim=True)
        return action, log_prob
```
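The `_soft_update` step above is a Polyak average; its effect can be checked in isolation (a toy sketch on two small linear layers):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tau = 0.005

source, target = nn.Linear(4, 4), nn.Linear(4, 4)
before = [p.detach().clone() for p in target.parameters()]

with torch.no_grad():
    for s, t in zip(source.parameters(), target.parameters()):
        t.copy_(tau * s + (1 - tau) * t)

# After one update the target has moved only a tiny (tau-sized) step toward
# the source, which keeps the bootstrapped Q-targets slowly moving and stable.
drift = max((t - b).abs().max().item()
            for b, t in zip(before, target.parameters()))
print(f"max parameter drift: {drift:.5f}")
```

With tau = 0.005, each parameter moves exactly 0.5% of the way toward the source network per update, so the targets lag the online networks by design.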
## Comprehensive Comparison Table
### Algorithm Selection Guide
| Scenario | Recommended Algorithm | Reason |
|---|---|---|
| Discrete actions, offline data available | DQN/Rainbow | Leverages replay buffer |
| Discrete actions, fast prototyping | A2C | Simple, fast experiments |
| Continuous actions, stability priority | PPO | Easy and stable |
| Continuous actions, sample efficiency priority | SAC | Off-policy + auto exploration |
| Continuous actions, deterministic policy needed | DDPG/TD3 | Deterministic policy |
| Board games | AlphaZero | MCTS + self-play |
| Non-differentiable reward | ES/GA | No gradient needed |
| Simulator available, limited samples | Dreamer/MuZero | Model-based efficiency |
### Hyperparameter Sensitivity
```python
# Key hyperparameters and typical values for each algorithm
hyperparams = {
    'DQN': {
        'lr': 1e-4,
        'batch_size': 32,
        'buffer_size': 1000000,
        'target_update_freq': 1000,
        'epsilon_decay': 'linear to 0.01 over 1M steps',
        'sensitivity': 'medium',
    },
    'PPO': {
        'lr': 3e-4,
        'clip_epsilon': 0.2,
        'num_epochs': 10,
        'batch_size': 64,
        'gae_lambda': 0.95,
        'entropy_coef': 0.01,
        'sensitivity': 'low',
    },
    'SAC': {
        'lr': 3e-4,
        'batch_size': 256,
        'buffer_size': 1000000,
        'tau': 0.005,
        'auto_alpha': True,
        'sensitivity': 'low',
    },
    'DDPG': {
        'lr_actor': 1e-4,
        'lr_critic': 1e-3,
        'batch_size': 256,
        'tau': 0.005,
        'noise_type': 'OU or Gaussian',
        'sensitivity': 'high',
    },
}
```
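The "linear to 0.01 over 1M steps" schedule listed for DQN can be written as a small helper (a sketch; the function name and defaults are ours):

```python
def linear_epsilon(step, start=1.0, end=0.01, decay_steps=1_000_000):
    """Linearly anneal the exploration rate from start to end, then hold."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

print(linear_epsilon(0))          # starts at 1.0 (fully random)
print(linear_epsilon(500_000))    # halfway through the anneal: 0.505
print(linear_epsilon(2_000_000))  # clamped at the final value of 0.01
```

The agent then takes a random action with probability `linear_epsilon(step)` and the greedy action otherwise.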
## Current Research Frontiers
### Offline RL (Batch Reinforcement Learning)
Offline RL learns exclusively from a fixed, pre-collected dataset, extracting as much value as possible from existing data without any additional environment interaction.
```python
import torch

class ConservativeQLearning:
    """CQL: Core idea of Conservative Q-Learning"""

    def compute_cql_loss(self, q_network, states, actions, alpha=1.0):
        td_loss = self.compute_td_loss(q_network, states, actions)
        # Penalize Q-values of random (likely out-of-distribution) actions
        random_actions = torch.rand_like(actions) * 2 - 1
        random_q = q_network(states, random_actions)
        data_q = q_network(states, actions)
        cql_penalty = (
            torch.logsumexp(random_q, dim=0).mean() - data_q.mean()
        )
        return td_loss + alpha * cql_penalty
```
Key algorithms: CQL, IQL, Decision Transformer, Diffusion Policy
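The CQL penalty can be sanity-checked on toy tensors (our own toy values, not the full CQL loss): a critic that assigns inflated values to out-of-distribution random actions incurs a much larger penalty than a conservative one.

```python
import torch

# Q-values the critic assigns to random (out-of-distribution) actions
# versus to actions actually present in the dataset.
random_q_optimistic = torch.tensor([5.0, 6.0, 5.5])    # inflates unseen actions
random_q_conservative = torch.tensor([0.1, 0.2, 0.0])
data_q = torch.tensor([1.0, 1.2, 0.9])

def cql_penalty(random_q, data_q):
    # Pushes down Q on sampled actions, pushes up Q on dataset actions.
    return torch.logsumexp(random_q, dim=0) - data_q.mean()

print(cql_penalty(random_q_optimistic, data_q))    # large positive penalty
print(cql_penalty(random_q_conservative, data_q))  # small penalty
```

Minimizing this term therefore drives the critic to stay pessimistic about actions the dataset never demonstrates, which is the core idea behind CQL's safety in the offline setting.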
### Multi-Agent RL
Environments where multiple agents learn simultaneously:
- Cooperative: Teammates pursue a common goal
- Competitive: Agents compete against each other
- Mixed: Cooperation and competition coexist
Key challenges: Non-stationarity, communication, credit assignment
### Safe RL
Methods that satisfy safety constraints while maximizing rewards:
```python
class SafeRLObjective:
    """Objective for constraint-based safe RL"""

    def compute_objective(self, policy, states):
        expected_reward = self.estimate_reward(policy, states)
        expected_cost = self.estimate_cost(policy, states)
        cost_limit = 25.0
        # Lagrangian relaxation: trade reward against constraint violation
        lagrangian = (expected_reward
                      - self.lambda_multiplier
                      * (expected_cost - cost_limit))
        return lagrangian
```
Key algorithms: CPO (Constrained Policy Optimization), WCSAC, SafeOpt
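The multiplier in the Lagrangian above is typically adjusted by dual ascent; a minimal sketch with our own toy values (the function name and learning rate are ours):

```python
def update_lambda(lam, expected_cost, cost_limit, lr=0.01):
    """Dual ascent: raise the multiplier when the constraint is violated,
    lower it (but never below zero) when the policy is safely within limits."""
    return max(0.0, lam + lr * (expected_cost - cost_limit))

lam0 = 0.0
# Constraint violated: cost 40 exceeds the limit of 25, so lambda grows,
# making the safety term weigh more heavily in the objective.
lam1 = update_lambda(lam0, expected_cost=40.0, cost_limit=25.0)
print(lam1)  # grows above zero
# Constraint satisfied: lambda shrinks back toward zero.
lam2 = update_lambda(lam1, expected_cost=10.0, cost_limit=25.0)
print(lam2)  # back to zero
```

This alternation between policy updates and multiplier updates is the basic pattern behind Lagrangian-based safe RL methods.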
### Other Research Directions
- Meta-RL: Agents that quickly adapt to new tasks
- Hierarchical RL: Separate high-level/low-level policies for long-term planning
- Representation learning: Automatically learning good state representations
- LLM + RL: Leveraging LLM reasoning capabilities in RL
## Learning Roadmap
### Recommended Learning Order
1. Foundations (1-2 weeks)
   - Understanding MDPs, Bellman equations
   - Dynamic programming (policy/value iteration)
   - Exploration-exploitation dilemma
2. Value-Based (2-3 weeks)
   - Implement Q-learning
   - Implement DQN and run Atari experiments
   - Understand Double/Dueling DQN
3. Policy-Based (2-3 weeks)
   - Implement REINFORCE
   - Understand and implement A2C/A3C
   - Implement PPO (most important)
4. Continuous Actions (1-2 weeks)
   - Implement DDPG
   - Implement SAC
   - MuJoCo/PyBullet experiments
5. Advanced (2-4 weeks)
   - Model-based RL (Dreamer)
   - Multi-agent RL
   - Offline RL
   - Practical project
### Key Implementation Frameworks
```python
# Major RL libraries

# 1. Stable-Baselines3: verified implementations, fast experiments
# pip install stable-baselines3
from stable_baselines3 import PPO, SAC, DQN

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100000)

# 2. CleanRL: single-file implementations, best for education
#    Each algorithm is fully implemented in one file

# 3. RLlib (Ray): distributed training, production level
# pip install "ray[rllib]"

# 4. Tianshou: PyTorch-based, flexible structure
# pip install tianshou
```
## Series Retrospective
A chronological summary of what was covered in this series:
| Post # | Topic | Key Algorithms/Concepts |
|---|---|---|
| 11 | A3C | Async parallel learning, data/gradient parallelism |
| 12 | Chatbot RL | Seq2Seq, SCST, reward design |
| 13 | Web Navigation | MiniWoB, grid action space, human demonstrations |
| 14 | Continuous Actions | DDPG, OU noise, distributional policy (D4PG) |
| 15 | Trust Region | PPO, TRPO, ACKTR |
| 16 | Black-Box | Evolution Strategies (ES), Genetic Algorithms (GA) |
| 17 | Model-Based | I2A, environment model, rollout encoder |
| 18 | AlphaGo Zero | MCTS, self-play, Connect4 implementation |
| 19 | Applications | Robotics, autonomous driving, recommendations, RLHF |
| 20 | Summary | Algorithm comparison, selection guide |
## Conclusion
Deep reinforcement learning is a rapidly evolving field. The fundamental algorithms covered in this series form the essential foundation for understanding current cutting-edge research.
Three most important pieces of advice:
- Implement it yourself: You can only truly understand an algorithm by writing the code yourself
- Start with simple environments: What does not work on CartPole will not work in complex environments either
- Always compare baselines: Verify that new methods are truly better than simple ones
The journey of reinforcement learning is never-ending. We hope this series serves as a starting point.