[Deep RL] 20. Deep RL Summary: Algorithm Comparison and Selection Guide

Overview

Throughout this series, we have explored various deep reinforcement learning algorithms. In this final post, we systematically organize all methods and provide a guide on which algorithm to choose in which situation. We also introduce actively researched frontiers and learning resources.


Algorithm Classification System

Overall Taxonomy

Deep RL algorithms can be broadly grouped into four families:

1. Value-Based

  • Learn the value of states or actions
  • Derive policy from values (typically greedy selection)
  • Representative: DQN, Double DQN, Dueling DQN, Rainbow

2. Policy-Based

  • Directly parameterize and learn the policy
  • REINFORCE, Evolution Strategies, Genetic Algorithms

3. Actor-Critic

  • Simultaneously learn policy (Actor) and value (Critic)
  • A2C, A3C, PPO, TRPO, ACKTR, DDPG, SAC

4. Model-Based

  • Learn an environment model and use it for planning
  • I2A, World Models, Dreamer, MuZero

Value-Based Methods Summary

DQN Family

# Pseudocode showing key differences in DQN family algorithms
import torch

def dqn_target(reward, next_state, done, gamma, q_network, target_network):
    """Basic DQN: compute max Q-value with target network"""
    with torch.no_grad():
        max_q = target_network(next_state).max(dim=-1)[0]
        target = reward + gamma * (1 - done) * max_q
    return target

def double_dqn_target(reward, next_state, done, gamma,
                       q_network, target_network):
    """Double DQN: separate action selection and evaluation"""
    with torch.no_grad():
        # Select action with main network
        best_actions = q_network(next_state).argmax(dim=-1)
        # Evaluate value with target network
        q_values = target_network(next_state)
        max_q = q_values.gather(1, best_actions.unsqueeze(1)).squeeze()
        target = reward + gamma * (1 - done) * max_q
    return target

def dueling_network_forward(features, advantage_stream, value_stream):
    """Dueling DQN: separate value and advantage"""
    value = value_stream(features)        # V(s)
    advantage = advantage_stream(features) # A(s,a)
    # Q(s,a) = V(s) + A(s,a) - mean(A)
    q_values = value + advantage - advantage.mean(dim=-1, keepdim=True)
    return q_values
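
The overestimation bias that Double DQN removes can be seen with a toy numeric check: taking the max of the target network's own noisy estimates is biased upward, while selecting the action with an independently noisy main network largely cancels that bias. A minimal illustration (values are synthetic):

```python
import torch

torch.manual_seed(0)

# One state, 3 actions; the true Q-value of every action is 0.
# Main and target networks see independent noisy estimates.
true_q = torch.zeros(3)
q_main = true_q + 0.1 * torch.randn(3)
q_target = true_q + 0.1 * torch.randn(3)

# Standard DQN target: max over the target network's own noisy estimates.
# A max over noise is biased upward.
dqn_estimate = q_target.max()

# Double DQN: select with the main network, evaluate with the target
# network; independent noise sources make the upward bias mostly cancel.
best_action = q_main.argmax()
double_dqn_estimate = q_target[best_action]
```

Since the max over a vector is at least as large as any single entry, the Double DQN estimate can never exceed the plain DQN estimate here.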

Value-Based Methods Comparison

| Algorithm | Key Improvement | Discrete/Continuous | Main Advantage |
| --- | --- | --- | --- |
| DQN | Experience replay + target network | Discrete | Stable learning |
| Double DQN | Reduced overestimation bias | Discrete | Accurate Q-values |
| Dueling DQN | Separated V and A | Discrete | State-value learning efficiency |
| Prioritized ER | Prioritizes important experiences | Discrete | Sample efficiency |
| Noisy DQN | Parameter-noise exploration | Discrete | Adaptive exploration |
| Categorical DQN | Learns the return distribution | Discrete | Stability, richer signal |
| Rainbow | Combines all of the above | Discrete | Best overall performance |
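
Prioritized ER in the table above samples transitions in proportion to their TD-error magnitude and corrects the resulting bias with importance weights. A minimal proportional-prioritization sketch (class and parameter names are illustrative, not from a specific library):

```python
import numpy as np

class PrioritizedReplaySketch:
    """Sketch of proportional prioritized replay.

    Transition i is sampled with probability p_i^alpha / sum_j p_j^alpha,
    where p_i is its last TD-error magnitude; importance weights
    w_i = (N * P(i))^-beta correct the non-uniform sampling.
    """

    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.alpha, self.beta = alpha, beta
        self.capacity = capacity
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(abs(td_error) + 1e-6)  # keep p_i > 0

    def sample(self, batch_size, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        p = np.array(self.priorities) ** self.alpha
        probs = p / p.sum()
        idx = rng.choice(len(self.buffer), size=batch_size, p=probs)
        weights = (len(self.buffer) * probs[idx]) ** (-self.beta)
        weights /= weights.max()  # normalize to at most 1 for stability
        return [self.buffer[i] for i in idx], idx, weights
```

Production implementations replace the lists with a sum-tree so sampling is O(log N) instead of O(N).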

Policy-Based Methods Summary

REINFORCE and Variants

import torch

def reinforce_loss(log_probs, returns):
    """Basic REINFORCE: high variance"""
    return -(log_probs * returns).mean()

def reinforce_with_baseline(log_probs, returns, values):
    """REINFORCE with baseline: reduced variance"""
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + 0.5 * value_loss

def ppo_clipped_loss(log_probs, old_log_probs, advantages,
                     clip_epsilon=0.2):
    """PPO: stable policy updates"""
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_epsilon,
                         1 + clip_epsilon) * advantages
    return -torch.min(surr1, surr2).mean()
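
The `advantages` fed into the PPO loss above are typically computed with Generalized Advantage Estimation (GAE), which the hyperparameter table later references as `gae_lambda`. A minimal sketch, assuming `values` carries one extra bootstrap entry for the final state:

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t),
    A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}, computed backward."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

With `gamma=1, lam=1` and zero values this reduces to the Monte Carlo return-to-go, and with `lam=0` to the one-step TD error, which is the usual way to sanity-check an implementation.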

Actor-Critic Methods Summary

On-Policy vs Off-Policy

| Property | On-Policy | Off-Policy |
| --- | --- | --- |
| Data usage | Only from the current policy | Can reuse past data |
| Sample efficiency | Low | High |
| Stability | High | Relatively lower |
| Representative algorithms | A2C, PPO, TRPO | DDPG, SAC, TD3 |
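
The sample-efficiency gap in this table comes down to the replay buffer: off-policy methods can resample old transitions many times, while on-policy methods discard data after each update. A minimal uniform buffer sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform replay buffer: the storage off-policy methods draw from."""

    def __init__(self, capacity=100_000):
        # deque evicts the oldest transition once capacity is reached
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        # transpose list of transitions into per-field lists
        return tuple(map(list, zip(*batch)))
```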

SAC (Soft Actor-Critic)

SAC automatically balances exploration and exploitation by maximizing entropy:

import torch
import torch.nn as nn
import copy

class SACAgent:
    """SAC: Maximum entropy reinforcement learning"""

    def __init__(self, obs_size, act_size, hidden=256,
                 lr=3e-4, gamma=0.99, tau=0.005):
        self.gamma = gamma
        self.tau = tau
        self.q1 = self._make_q(obs_size, act_size, hidden)
        self.q2 = self._make_q(obs_size, act_size, hidden)
        self.q1_target = copy.deepcopy(self.q1)
        self.q2_target = copy.deepcopy(self.q2)
        self.actor = self._make_actor(obs_size, act_size, hidden)
        self.log_alpha = torch.zeros(1, requires_grad=True)
        self.target_entropy = -act_size
        self.q1_opt = torch.optim.Adam(self.q1.parameters(), lr=lr)
        self.q2_opt = torch.optim.Adam(self.q2.parameters(), lr=lr)
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.alpha_opt = torch.optim.Adam([self.log_alpha], lr=lr)

    def _make_q(self, obs_size, act_size, hidden):
        return nn.Sequential(
            nn.Linear(obs_size + act_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def _make_actor(self, obs_size, act_size, hidden):
        return GaussianActor(obs_size, act_size, hidden)

    @property
    def alpha(self):
        return self.log_alpha.exp()

    def update(self, batch):
        states, actions, rewards, next_states, dones = batch

        with torch.no_grad():
            next_actions, next_log_probs = self.actor.sample(next_states)
            q1_next = self.q1_target(torch.cat([next_states, next_actions], -1))
            q2_next = self.q2_target(torch.cat([next_states, next_actions], -1))
            q_next = torch.min(q1_next, q2_next)
            target = rewards + self.gamma * (1 - dones) * (
                q_next - self.alpha * next_log_probs
            )

        q1_val = self.q1(torch.cat([states, actions], -1))
        q2_val = self.q2(torch.cat([states, actions], -1))
        q1_loss = (q1_val - target).pow(2).mean()
        q2_loss = (q2_val - target).pow(2).mean()

        self.q1_opt.zero_grad(); q1_loss.backward(); self.q1_opt.step()
        self.q2_opt.zero_grad(); q2_loss.backward(); self.q2_opt.step()

        new_actions, log_probs = self.actor.sample(states)
        q1_new = self.q1(torch.cat([states, new_actions], -1))
        q2_new = self.q2(torch.cat([states, new_actions], -1))
        q_new = torch.min(q1_new, q2_new)
        actor_loss = (self.alpha.detach() * log_probs - q_new).mean()

        self.actor_opt.zero_grad(); actor_loss.backward(); self.actor_opt.step()

        alpha_loss = -(self.log_alpha * (log_probs.detach() + self.target_entropy)).mean()
        self.alpha_opt.zero_grad(); alpha_loss.backward(); self.alpha_opt.step()

        self._soft_update(self.q1, self.q1_target)
        self._soft_update(self.q2, self.q2_target)

    def _soft_update(self, source, target):
        for s, t in zip(source.parameters(), target.parameters()):
            t.data.copy_(self.tau * s.data + (1 - self.tau) * t.data)

class GaussianActor(nn.Module):
    def __init__(self, obs_size, act_size, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_size)
        self.log_std = nn.Linear(hidden, act_size)

    def forward(self, obs):
        features = self.net(obs)
        mu = self.mu(features)
        log_std = self.log_std(features).clamp(-20, 2)
        return mu, log_std

    def sample(self, obs):
        mu, log_std = self.forward(obs)
        std = log_std.exp()
        dist = torch.distributions.Normal(mu, std)
        z = dist.rsample()
        action = torch.tanh(z)
        log_prob = (dist.log_prob(z) - torch.log(1 - action.pow(2) + 1e-6)).sum(dim=-1, keepdim=True)
        return action, log_prob
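
The log-probability correction in `GaussianActor.sample` is the change-of-variables formula for the `tanh` squashing: the density of `a = tanh(z)` picks up a `-log(1 - a^2)` term per dimension. A standalone numeric check of that step (synthetic inputs):

```python
import torch

torch.manual_seed(0)

# Squashed-Gaussian sampling as in the SAC actor: sample z ~ N(mu, std),
# squash a = tanh(z), then correct the log-density by the log-derivative
# of tanh (the 1e-6 guards against log(0) at the action bounds).
mu, std = torch.zeros(2), torch.ones(2)
dist = torch.distributions.Normal(mu, std)
z = dist.rsample()               # reparameterized, so gradients flow
action = torch.tanh(z)           # bounded to (-1, 1)
log_prob = (dist.log_prob(z) - torch.log(1 - action.pow(2) + 1e-6)).sum()
```

Without this correction the entropy term in the SAC objective would be computed against the wrong density, and the learned temperature `alpha` drifts accordingly.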

Algorithm Selection Guide

| Scenario | Recommended Algorithm | Reason |
| --- | --- | --- |
| Discrete actions, offline data available | DQN / Rainbow | Leverages replay buffer |
| Discrete actions, fast prototyping | A2C | Simple, fast experiments |
| Continuous actions, stability priority | PPO | Easy and stable |
| Continuous actions, sample-efficiency priority | SAC | Off-policy + automatic exploration |
| Continuous actions, deterministic policy needed | DDPG / TD3 | Deterministic policy |
| Board games | AlphaZero | MCTS + self-play |
| Non-differentiable reward | ES / GA | No gradient needed |
| Simulator available, limited samples | Dreamer / MuZero | Model-based efficiency |

Hyperparameter Sensitivity

# Key hyperparameters and typical values for each algorithm

hyperparams = {
    'DQN': {
        'lr': 1e-4,
        'batch_size': 32,
        'buffer_size': 1000000,
        'target_update_freq': 1000,
        'epsilon_decay': 'linear to 0.01 over 1M steps',
        'sensitivity': 'medium',
    },
    'PPO': {
        'lr': 3e-4,
        'clip_epsilon': 0.2,
        'num_epochs': 10,
        'batch_size': 64,
        'gae_lambda': 0.95,
        'entropy_coef': 0.01,
        'sensitivity': 'low',
    },
    'SAC': {
        'lr': 3e-4,
        'batch_size': 256,
        'buffer_size': 1000000,
        'tau': 0.005,
        'auto_alpha': True,
        'sensitivity': 'low',
    },
    'DDPG': {
        'lr_actor': 1e-4,
        'lr_critic': 1e-3,
        'batch_size': 256,
        'tau': 0.005,
        'noise_type': 'OU or Gaussian',
        'sensitivity': 'high',
    },
}

Current Research Frontiers

Offline RL (Batch Reinforcement Learning)

Offline RL learns exclusively from a fixed, pre-collected dataset, extracting as much as possible from existing data without any additional environment interaction.

import torch

class ConservativeQLearning:
    """CQL: core idea of Conservative Q-Learning (sketch)"""

    def compute_cql_loss(self, q_network, states, actions, alpha=1.0):
        # Standard TD loss on in-dataset transitions
        td_loss = self.compute_td_loss(q_network, states, actions)
        # Sample out-of-distribution actions uniformly in [-1, 1]
        random_actions = torch.rand_like(actions) * 2 - 1
        random_q = q_network(states, random_actions)
        data_q = q_network(states, actions)
        # Push Q down on out-of-distribution actions, up on dataset actions
        cql_penalty = (
            torch.logsumexp(random_q, dim=0).mean() - data_q.mean()
        )
        return td_loss + alpha * cql_penalty

Key algorithms: CQL, IQL, Decision Transformer, Diffusion Policy

Multi-Agent RL

Environments where multiple agents learn simultaneously:

  • Cooperative: Teammates pursue a common goal
  • Competitive: Agents compete against each other
  • Mixed: Cooperation and competition coexist

Key challenges: Non-stationarity, communication, credit assignment

Safe RL

Methods that satisfy safety constraints while maximizing rewards:

class SafeRLObjective:
    """Objective for constraint-based safe RL"""

    def compute_objective(self, policy, states):
        expected_reward = self.estimate_reward(policy, states)
        expected_cost = self.estimate_cost(policy, states)
        cost_limit = 25.0
        lagrangian = (expected_reward
                      - self.lambda_multiplier
                      * (expected_cost - cost_limit))
        return lagrangian

Key algorithms: CPO (Constrained Policy Optimization), WCSAC, SafeOpt
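
The multiplier `lambda_multiplier` in the objective above is itself learned by dual ascent on the constraint violation. A minimal sketch (function name and step size are illustrative):

```python
def update_lambda(lmbda, expected_cost, cost_limit, lr=0.01):
    """Dual ascent on the Lagrange multiplier: lambda grows while the
    policy exceeds the cost limit and shrinks toward zero otherwise;
    it is projected back to [0, inf) because the multiplier of an
    inequality constraint must stay non-negative."""
    return max(0.0, lmbda + lr * (expected_cost - cost_limit))
```

In practice the multiplier is often parameterized as `softplus(param)` and updated by gradient descent, which achieves the same non-negativity without the explicit projection.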

Other Research Directions

  • Meta-RL: Agents that quickly adapt to new tasks
  • Hierarchical RL: Separate high-level/low-level policies for long-term planning
  • Representation learning: Automatically learning good state representations
  • LLM + RL: Leveraging LLM reasoning capabilities in RL

Learning Roadmap

  1. Foundations (1-2 weeks)

    • Understanding MDPs, Bellman equations
    • Dynamic programming (policy/value iteration)
    • Exploration-exploitation dilemma
  2. Value-Based (2-3 weeks)

    • Implement Q-learning
    • Implement DQN and run Atari experiments
    • Understand Double/Dueling DQN
  3. Policy-Based (2-3 weeks)

    • Implement REINFORCE
    • Understand and implement A2C/A3C
    • Implement PPO (most important)
  4. Continuous Actions (1-2 weeks)

    • Implement DDPG
    • Implement SAC
    • MuJoCo/PyBullet experiments
  5. Advanced (2-4 weeks)

    • Model-based RL (Dreamer)
    • Multi-agent RL
    • Offline RL
    • Practical project

Key Implementation Frameworks

# Major RL libraries

# 1. Stable-Baselines3: Verified implementations, fast experiments
# pip install stable-baselines3
from stable_baselines3 import PPO, SAC, DQN

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100000)

# 2. CleanRL: Single-file implementations, best for education
# Each algorithm is fully implemented in one file

# 3. RLlib (Ray): Distributed training, production level
# pip install ray[rllib]

# 4. Tianshou: PyTorch-based, flexible structure
# pip install tianshou

Series Retrospective

A chronological summary of what was covered in this series:

| Post # | Topic | Key Algorithms/Concepts |
| --- | --- | --- |
| 11 | A3C | Async parallel learning, data/gradient parallelism |
| 12 | Chatbot RL | Seq2Seq, SCST, reward design |
| 13 | Web Navigation | MiniWoB, grid action space, human demonstrations |
| 14 | Continuous Actions | DDPG, OU noise, distributional policy (D4PG) |
| 15 | Trust Region | PPO, TRPO, ACKTR |
| 16 | Black-Box | Evolution Strategies (ES), Genetic Algorithms (GA) |
| 17 | Model-Based | I2A, environment model, rollout encoder |
| 18 | AlphaGo Zero | MCTS, self-play, Connect4 implementation |
| 19 | Applications | Robotics, autonomous driving, recommendations, RLHF |
| 20 | Summary | Algorithm comparison, selection guide |

Conclusion

Deep reinforcement learning is a rapidly evolving field. The fundamental algorithms covered in this series form the essential foundation for understanding current cutting-edge research.

Three most important pieces of advice:

  1. Implement it yourself: You can only truly understand an algorithm by writing the code yourself
  2. Start with simple environments: What does not work on CartPole will not work in complex environments either
  3. Always compare baselines: Verify that new methods are truly better than simple ones

The journey of reinforcement learning is never-ending. We hope this series serves as a starting point.