[Deep RL] 20. Deep RL Summary: Algorithm Comparison and Selection Guide

Overview

Throughout this series, we have explored various deep reinforcement learning algorithms. In this final post, we systematically organize all methods and provide a guide on which algorithm to choose in which situation. We also introduce actively researched frontiers and learning resources.


Algorithm Classification System

Overall Taxonomy

Deep RL algorithms can be broadly grouped into four families:

1. Value-Based

  • Learn the value of states or actions
  • Derive policy from values (typically greedy selection)
  • Representative: DQN, Double DQN, Dueling DQN, Rainbow

2. Policy-Based

  • Directly parameterize and learn the policy
  • REINFORCE, Evolution Strategies, Genetic Algorithms

3. Actor-Critic

  • Simultaneously learn policy (Actor) and value (Critic)
  • A2C, A3C, PPO, TRPO, ACKTR, DDPG, SAC

4. Model-Based

  • Learn an environment model and use it for planning
  • I2A, World Models, Dreamer, MuZero

Value-Based Methods Summary

DQN Family

# Pseudocode showing key differences in DQN family algorithms

import torch

def dqn_target(reward, next_state, done, gamma, q_network, target_network):
    """Basic DQN: compute max Q-value with target network"""
    with torch.no_grad():
        max_q = target_network(next_state).max(dim=-1)[0]
        target = reward + gamma * (1 - done) * max_q
    return target

def double_dqn_target(reward, next_state, done, gamma,
                       q_network, target_network):
    """Double DQN: separate action selection and evaluation"""
    with torch.no_grad():
        # Select action with main network
        best_actions = q_network(next_state).argmax(dim=-1)
        # Evaluate value with target network
        q_values = target_network(next_state)
        max_q = q_values.gather(1, best_actions.unsqueeze(1)).squeeze()
        target = reward + gamma * (1 - done) * max_q
    return target

def dueling_network_forward(features, advantage_stream, value_stream):
    """Dueling DQN: separate value and advantage"""
    value = value_stream(features)        # V(s)
    advantage = advantage_stream(features) # A(s,a)
    # Q(s,a) = V(s) + A(s,a) - mean(A)
    q_values = value + advantage - advantage.mean(dim=-1, keepdim=True)
    return q_values
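
A quick sanity check of the two target functions with dummy tensors (the shapes and the tiny linear networks below are purely illustrative, not from the original experiments):

import torch
import torch.nn as nn

batch, obs_dim, n_actions = 4, 8, 6
q_net = nn.Linear(obs_dim, n_actions)        # stand-in for the online Q-network
target_net = nn.Linear(obs_dim, n_actions)   # stand-in for the target Q-network

reward = torch.randn(batch)
next_state = torch.randn(batch, obs_dim)
done = torch.zeros(batch)

t_dqn = dqn_target(reward, next_state, done, 0.99, q_net, target_net)
t_double = double_dqn_target(reward, next_state, done, 0.99, q_net, target_net)
print(t_dqn.shape, t_double.shape)  # both are [batch]-shaped TD targets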

Value-Based Methods Comparison

Algorithm | Key Improvement | Discrete/Continuous | Main Advantage
DQN | Experience replay + target network | Discrete | Stable learning
Double DQN | Reduced overestimation bias | Discrete | Accurate Q-values
Dueling DQN | Separated V and A streams | Discrete | State-value learning efficiency
Prioritized ER | Prioritized sampling of important experiences | Discrete | Sample efficiency
Noisy DQN | Parameter-noise exploration | Discrete | Adaptive exploration
Categorical DQN | Learns the return distribution | Discrete | Stability, rich signal
Rainbow | Combines all of the above | Discrete | Best performance
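
Of the rows above, Prioritized ER is the one most easily clarified with a little code. Below is a minimal sketch of proportional prioritized replay, assuming a simple list-backed buffer; the class name, `priority_eps`, and the default values are illustrative choices, not taken from the original series.

import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized replay (sketch).

    Transitions are sampled with probability p_i^alpha / sum_j p_j^alpha,
    and importance-sampling weights correct the induced bias.
    """

    def __init__(self, capacity, alpha=0.6, priority_eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha
        self.priority_eps = priority_eps
        self.buffer = []
        self.priorities = []

    def add(self, transition):
        # New transitions get the current maximum priority so they are replayed at least once
        max_p = max(self.priorities, default=1.0)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size, beta=0.4):
        probs = np.array(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights, normalized by the maximum weight
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()
        batch = [self.buffer[i] for i in idx]
        return batch, idx, weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(float(err)) + self.priority_eps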

Policy-Based Methods Summary

REINFORCE and Variants

import torch

def reinforce_loss(log_probs, returns):
    """Basic REINFORCE: high variance"""
    return -(log_probs * returns).mean()

def reinforce_with_baseline(log_probs, returns, values):
    """REINFORCE with baseline: reduced variance"""
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + 0.5 * value_loss

def ppo_clipped_loss(log_probs, old_log_probs, advantages,
                     clip_epsilon=0.2):
    """PPO: stable policy updates"""
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_epsilon,
                         1 + clip_epsilon) * advantages
    return -torch.min(surr1, surr2).mean()
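
The PPO loss above consumes advantage estimates. A common way to compute them (and the source of the gae_lambda hyperparameter listed later in this post) is Generalized Advantage Estimation; a minimal sketch, assuming rollout data is already collected as flat tensors:

import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (sketch).

    rewards, dones: tensors of shape [T]; values: tensor of shape [T + 1],
    where the last entry is the bootstrap value of the final next state.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error, cut off at episode boundaries
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns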

Actor-Critic Methods Summary

On-Policy vs Off-Policy

Property | On-Policy | Off-Policy
Data usage | Collected only from the current policy | Can reuse past data
Sample efficiency | Low | High
Stability | High | Relatively lower
Representative algorithms | A2C, PPO, TRPO | DDPG, SAC, TD3

SAC (Soft Actor-Critic)

SAC automatically balances exploration and exploitation by maximizing entropy:

import torch
import torch.nn as nn
import copy

class SACAgent:
    """SAC: Maximum entropy reinforcement learning"""

    def __init__(self, obs_size, act_size, hidden=256,
                 lr=3e-4, gamma=0.99, tau=0.005):
        self.gamma = gamma
        self.tau = tau

        # Two Q-networks (twin critics), each with its own target copy
        self.q1 = self._make_q(obs_size, act_size, hidden)
        self.q2 = self._make_q(obs_size, act_size, hidden)
        self.q1_target = copy.deepcopy(self.q1)
        self.q2_target = copy.deepcopy(self.q2)

        # Policy network
        self.actor = self._make_actor(obs_size, act_size, hidden)

        # Automatic temperature (alpha) tuning
        self.log_alpha = torch.zeros(1, requires_grad=True)
        self.target_entropy = -act_size  # target entropy heuristic: -dim(action space)

        self.q1_opt = torch.optim.Adam(self.q1.parameters(), lr=lr)
        self.q2_opt = torch.optim.Adam(self.q2.parameters(), lr=lr)
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.alpha_opt = torch.optim.Adam([self.log_alpha], lr=lr)

    def _make_q(self, obs_size, act_size, hidden):
        return nn.Sequential(
            nn.Linear(obs_size + act_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def _make_actor(self, obs_size, act_size, hidden):
        return GaussianActor(obs_size, act_size, hidden)

    @property
    def alpha(self):
        return self.log_alpha.exp()

    def update(self, batch):
        states, actions, rewards, next_states, dones = batch

        # === Critic update ===
        with torch.no_grad():
            next_actions, next_log_probs = self.actor.sample(next_states)
            q1_next = self.q1_target(torch.cat([next_states, next_actions], -1))
            q2_next = self.q2_target(torch.cat([next_states, next_actions], -1))
            q_next = torch.min(q1_next, q2_next)
            # Entropy-regularized TD target
            target = rewards + self.gamma * (1 - dones) * (
                q_next - self.alpha * next_log_probs
            )

        q1_val = self.q1(torch.cat([states, actions], -1))
        q2_val = self.q2(torch.cat([states, actions], -1))
        q1_loss = (q1_val - target).pow(2).mean()
        q2_loss = (q2_val - target).pow(2).mean()

        self.q1_opt.zero_grad()
        q1_loss.backward()
        self.q1_opt.step()

        self.q2_opt.zero_grad()
        q2_loss.backward()
        self.q2_opt.step()

        # === Actor update ===
        new_actions, log_probs = self.actor.sample(states)
        q1_new = self.q1(torch.cat([states, new_actions], -1))
        q2_new = self.q2(torch.cat([states, new_actions], -1))
        q_new = torch.min(q1_new, q2_new)
        actor_loss = (self.alpha.detach() * log_probs - q_new).mean()

        self.actor_opt.zero_grad()
        actor_loss.backward()
        self.actor_opt.step()

        # === Automatic temperature adjustment ===
        alpha_loss = -(self.log_alpha * (
            log_probs.detach() + self.target_entropy
        )).mean()

        self.alpha_opt.zero_grad()
        alpha_loss.backward()
        self.alpha_opt.step()

        # Soft update of the target networks
        self._soft_update(self.q1, self.q1_target)
        self._soft_update(self.q2, self.q2_target)

    def _soft_update(self, source, target):
        for s, t in zip(source.parameters(), target.parameters()):
            t.data.copy_(self.tau * s.data + (1 - self.tau) * t.data)

class GaussianActor(nn.Module):
    def __init__(self, obs_size, act_size, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_size)
        self.log_std = nn.Linear(hidden, act_size)

    def forward(self, obs):
        features = self.net(obs)
        mu = self.mu(features)
        log_std = self.log_std(features).clamp(-20, 2)
        return mu, log_std

    def sample(self, obs):
        mu, log_std = self.forward(obs)
        std = log_std.exp()
        dist = torch.distributions.Normal(mu, std)
        # Reparameterization trick: rsample() keeps the sample differentiable
        z = dist.rsample()
        action = torch.tanh(z)
        # Log-probability correction for the tanh squashing
        log_prob = (dist.log_prob(z) - torch.log(
            1 - action.pow(2) + 1e-6
        )).sum(dim=-1, keepdim=True)
        return action, log_prob
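
As a rough usage sketch (the batch of random transitions below merely stands in for samples drawn from a replay buffer), a single SAC training step looks like:

obs_size, act_size, batch_size = 8, 2, 256
agent = SACAgent(obs_size, act_size)

# Fake batch purely for illustration; in practice this comes from a replay buffer
states = torch.randn(batch_size, obs_size)
actions = torch.rand(batch_size, act_size) * 2 - 1   # squashed actions lie in [-1, 1]
rewards = torch.randn(batch_size, 1)
next_states = torch.randn(batch_size, obs_size)
dones = torch.zeros(batch_size, 1)

agent.update((states, actions, rewards, next_states, dones))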

Comprehensive Comparison Table

Algorithm Selection Guide

Scenario | Recommended Algorithm | Reason
Discrete actions, offline data available | DQN / Rainbow | Leverages a replay buffer
Discrete actions, fast prototyping | A2C | Simple implementation, fast experiments
Continuous actions, stability first | PPO | Easy to implement and stable
Continuous actions, sample efficiency first | SAC | Off-policy + automatic exploration
Continuous actions, deterministic policy needed | DDPG / TD3 | Deterministic policy
Board games | AlphaZero | MCTS + self-play
Non-differentiable reward | ES / GA | No gradients needed
Simulator available, limited samples | Dreamer / MuZero | Model-based efficiency
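
For quick lookups the table can be condensed into a tiny helper; this is just a mirror of the rows above (the function and key names are mine, not part of any library):

def suggest_algorithm(action_space, priority):
    """Rough lookup mirroring the selection guide table above."""
    table = {
        ('discrete', 'offline_data'): 'DQN / Rainbow',
        ('discrete', 'fast_prototyping'): 'A2C',
        ('continuous', 'stability'): 'PPO',
        ('continuous', 'sample_efficiency'): 'SAC',
        ('continuous', 'deterministic_policy'): 'DDPG / TD3',
        ('discrete', 'board_game'): 'AlphaZero',
        ('any', 'non_differentiable_reward'): 'ES / GA',
        ('any', 'limited_samples_with_simulator'): 'Dreamer / MuZero',
    }
    return table.get((action_space, priority), 'PPO (common default)')

print(suggest_algorithm('continuous', 'sample_efficiency'))  # -> SAC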

Hyperparameter Sensitivity

# Key hyperparameters and typical values for each algorithm

hyperparams = {
    'DQN': {
        'lr': 1e-4,
        'batch_size': 32,
        'buffer_size': 1000000,
        'target_update_freq': 1000,
        'epsilon_decay': 'linear to 0.01 over 1M steps',
        'sensitivity': 'medium',
    },
    'PPO': {
        'lr': 3e-4,
        'clip_epsilon': 0.2,
        'num_epochs': 10,
        'batch_size': 64,
        'gae_lambda': 0.95,
        'entropy_coef': 0.01,
        'sensitivity': 'low',
    },
    'SAC': {
        'lr': 3e-4,
        'batch_size': 256,
        'buffer_size': 1000000,
        'tau': 0.005,
        'auto_alpha': True,
        'sensitivity': 'low',
    },
    'DDPG': {
        'lr_actor': 1e-4,
        'lr_critic': 1e-3,
        'batch_size': 256,
        'tau': 0.005,
        'noise_type': 'OU or Gaussian',
        'sensitivity': 'high',
    },
}

Current Research Frontiers

Offline RL (Batch Reinforcement Learning)

Offline RL learns exclusively from a fixed, pre-collected dataset, extracting as much as possible from existing data without any additional environment interaction.

import torch

class ConservativeQLearning:
    """CQL: Core idea of Conservative Q-Learning"""

    def compute_cql_loss(self, q_network, states, actions, alpha=1.0):
        # Standard TD loss of Q-learning
        td_loss = self.compute_td_loss(q_network, states, actions)

        # CQL regularizer: push down Q-values of out-of-distribution actions
        # so that actions absent from the dataset are not overestimated
        random_actions = torch.rand_like(actions) * 2 - 1
        random_q = q_network(states, random_actions)
        data_q = q_network(states, actions)

        cql_penalty = (
            torch.logsumexp(random_q, dim=0).mean()
            - data_q.mean()
        )
        return td_loss + alpha * cql_penalty

Key algorithms: CQL, IQL, Decision Transformer, Diffusion Policy

Multi-Agent RL

Environments where multiple agents learn simultaneously:

  • Cooperative: Teammates pursue a common goal
  • Competitive: Agents compete against each other
  • Mixed: Cooperation and competition coexist

Key challenges: Non-stationarity, communication, credit assignment
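
Non-stationarity in particular is easy to see in a toy experiment: two independent Q-learners in repeated matching pennies each treat the other as part of the environment, so neither faces stationary dynamics. A self-contained sketch (all names and constants are illustrative):

import random

ACTIONS = [0, 1]  # heads, tails

def epsilon_greedy(q, eps=0.1):
    return random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda a: q[a])

q1 = {a: 0.0 for a in ACTIONS}   # agent 1 wants to match
q2 = {a: 0.0 for a in ACTIONS}   # agent 2 wants to mismatch
lr = 0.1

for step in range(10000):
    a1, a2 = epsilon_greedy(q1), epsilon_greedy(q2)
    r1 = 1.0 if a1 == a2 else -1.0   # zero-sum (competitive) payoff
    r2 = -r1
    # Independent updates: each agent ignores the other's learning process
    q1[a1] += lr * (r1 - q1[a1])
    q2[a2] += lr * (r2 - q2[a2])

print(q1, q2)  # the value estimates tend to keep chasing each other instead of settling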

Safe RL

Methods that satisfy safety constraints while maximizing rewards:

class SafeRLObjective:
    """Objective for constraint-based safe RL"""

    def compute_objective(self, policy, states):
        # Reward maximization term
        expected_reward = self.estimate_reward(policy, states)

        # Safety constraint: expected cost of risky behavior must stay below a limit
        expected_cost = self.estimate_cost(policy, states)
        cost_limit = 25.0  # maximum admissible cost

        # Lagrangian relaxation of the constrained objective
        lagrangian = (expected_reward
                      - self.lambda_multiplier
                      * (expected_cost - cost_limit))
        return lagrangian
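
The multiplier itself is usually learned by a dual update: increase lambda when the expected cost exceeds the limit, decrease it otherwise. A minimal sketch of that update (the optimizer setup and the log-lambda parameterization are assumptions, not from the original post):

import torch

log_lambda = torch.zeros(1, requires_grad=True)   # parameterize lambda = exp(log_lambda) >= 0
lambda_opt = torch.optim.Adam([log_lambda], lr=3e-4)
cost_limit = 25.0

def update_lambda(expected_cost):
    """Gradient ascent on lambda * (cost - limit), written as a loss to minimize."""
    lam = log_lambda.exp()
    lambda_loss = -(lam * (expected_cost - cost_limit)).mean()
    lambda_opt.zero_grad()
    lambda_loss.backward()
    lambda_opt.step()
    return log_lambda.exp().detach()

update_lambda(torch.tensor(30.0))  # cost above the limit -> lambda grows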

Key algorithms: CPO (Constrained Policy Optimization), WCSAC, SafeOpt

Other Research Directions

  • Meta-RL: Agents that quickly adapt to new tasks
  • Hierarchical RL: Separate high-level/low-level policies for long-term planning
  • Representation learning: Automatically learning good state representations
  • LLM + RL: Leveraging LLM reasoning capabilities in RL

Learning Roadmap

  1. Foundations (1-2 weeks)

    • Understanding MDPs, Bellman equations
    • Dynamic programming (policy/value iteration)
    • Exploration-exploitation dilemma
  2. Value-Based (2-3 weeks)

    • Implement Q-learning
    • Implement DQN and run Atari experiments
    • Understand Double/Dueling DQN
  3. Policy-Based (2-3 weeks)

    • Implement REINFORCE
    • Understand and implement A2C/A3C
    • Implement PPO (most important)
  4. Continuous Actions (1-2 weeks)

    • Implement DDPG
    • Implement SAC
    • MuJoCo/PyBullet experiments
  5. Advanced (2-4 weeks)

    • Model-based RL (Dreamer)
    • Multi-agent RL
    • Offline RL
    • Practical project

Key Implementation Frameworks

# Major RL libraries

# 1. Stable-Baselines3: Verified implementations, fast experiments
# pip install stable-baselines3
from stable_baselines3 import PPO, SAC, DQN

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100000)

# 2. CleanRL: Single-file implementations, best for education
# Each algorithm is fully implemented in one file

# 3. RLlib (Ray): Distributed training, production level
# pip install ray[rllib]

# 4. Tianshou: PyTorch-based, flexible structure
# pip install tianshou

Series Retrospective

A chronological summary of what was covered in this series:

Post # | Topic | Key Algorithms/Concepts
11 | A3C | Asynchronous parallel training, data/gradient parallelism
12 | Chatbot RL | Seq2Seq, SCST, reward design
13 | Web Navigation | MiniWoB, grid action space, human demonstrations
14 | Continuous Actions | DDPG, OU noise, distributional policy (D4PG)
15 | Trust Region | PPO, TRPO, ACKTR
16 | Black-Box | Evolution Strategies (ES), Genetic Algorithms (GA)
17 | Model-Based | I2A, environment model, rollout encoder
18 | AlphaGo Zero | MCTS, self-play, Connect4 implementation
19 | Applications | Robotics, autonomous driving, recommendation, RLHF
20 | Summary | Algorithm comparison, selection guide

Conclusion

Deep reinforcement learning is a rapidly evolving field. The fundamental algorithms covered in this series form the essential foundation for understanding current cutting-edge research.

Three most important pieces of advice:

  1. Implement it yourself: You can only truly understand an algorithm by writing the code yourself
  2. Start with simple environments: What does not work on CartPole will not work in complex environments either
  3. Always compare baselines: Verify that new methods are truly better than simple ones

The journey of reinforcement learning is never-ending. We hope this series serves as a starting point.