[Deep RL] 10. Actor-Critic Methods: A2C and Hyperparameter Tuning

Review of REINFORCE Variance Problem

In the previous article, we examined the REINFORCE algorithm. Its core problem is the high variance of its gradient estimates.

REINFORCE can only update after an entire episode ends (Monte Carlo), and the gradient computed from a single episode is very noisy.

A baseline can reduce this variance, but a more fundamental solution is needed.


Actor-Critic Architecture

Actor-Critic combines two components:

  • Actor (policy): Selects actions from states. pi(a|s; theta)
  • Critic (value function): Evaluates the value of the current state. V(s; phi)

The key idea is to use TD (Temporal Difference) estimates instead of Monte Carlo returns to reduce variance.

REINFORCE vs Actor-Critic

REINFORCE:     grad = log pi(a|s) * G_t                          (waits until the episode ends)
Actor-Critic:  grad = log pi(a|s) * (r + gamma * V(s') - V(s))   (needs only one step)

r + gamma * V(s') - V(s) is called the TD error or advantage estimate. V(s) serves as the baseline while simultaneously providing an estimate of the return.
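As a concrete illustration, the one-step TD error needs only a single transition (the numbers below are made up for illustration):

```python
# One-step TD error used as an advantage estimate:
# delta = r + gamma * V(s') - V(s)
gamma = 0.99
r = 1.0          # reward for this transition (illustrative)
v_s = 5.0        # critic's estimate V(s)
v_s_next = 5.5   # critic's estimate V(s')

td_error = r + gamma * v_s_next - v_s
# Positive delta -> the action did better than the critic expected,
# so the policy gradient pushes pi(a|s) up; negative delta pushes it down.
```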


A2C (Advantage Actor-Critic) Implementation

A2C is the synchronous variant of advantage Actor-Critic. It steps multiple environments in parallel so that each update uses a batch of diverse, simultaneously collected experience.

Network Architecture

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class A2CNetwork(nn.Module):
    """Shared network for A2C (Actor + Critic)"""
    def __init__(self, obs_size, n_actions, hidden_size=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden_size, n_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        features = self.shared(x)
        logits = self.actor(features)
        value = self.critic(features)
        return logits, value

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, value.squeeze(-1), entropy

CNN A2C for Atari

class A2CCNN(nn.Module):
    """CNN-based A2C network for Atari"""
    def __init__(self, input_channels, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        conv_out_size = self._get_conv_out(input_channels)
        self.fc = nn.Sequential(nn.Linear(conv_out_size, 512), nn.ReLU())
        self.actor = nn.Linear(512, n_actions)
        self.critic = nn.Linear(512, 1)

    def _get_conv_out(self, channels):
        # Assumes 84x84 preprocessed Atari frames
        o = self.conv(torch.zeros(1, channels, 84, 84))
        return int(np.prod(o.size()))

    def forward(self, x):
        x = x.float() / 255.0
        conv_out = self.conv(x).view(x.size(0), -1)
        features = self.fc(conv_out)
        return self.actor(features), self.critic(features)

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action, dist.log_prob(action), value.squeeze(-1), dist.entropy()

N-step Advantage Computation

A2C uses rewards from multiple steps rather than a single step to compute advantages, balancing bias and variance.

def compute_advantages(rewards, values, dones, next_value, gamma=0.99):
    """Compute n-step advantages"""
    n_steps = len(rewards)
    returns = []
    advantages = []
    R = next_value
    for t in reversed(range(n_steps)):
        if dones[t]:
            R = 0.0
        R = rewards[t] + gamma * R
        returns.insert(0, R)
        advantages.insert(0, R - values[t])
    returns = torch.tensor(returns, dtype=torch.float32)
    advantages = torch.tensor(advantages, dtype=torch.float32)
    return returns, advantages
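
A quick numeric walkthrough of the same backward recursion (reimplemented inline so it runs standalone), using an illustrative 3-step rollout with no episode terminals:

```python
gamma = 0.9
rewards = [1.0, 1.0, 1.0]
values = [9.0, 9.5, 10.0]   # critic estimates V(s_t) (illustrative)
next_value = 10.0           # bootstrap value V(s_{t+n})

# Same backward recursion as compute_advantages, no terminals
R = next_value
returns = []
for t in reversed(range(len(rewards))):
    R = rewards[t] + gamma * R
    returns.insert(0, R)

# returns == [10.0, 10.0, 10.0]: each step's reward of 1 exactly offsets
# the discounting of the bootstrap value in this contrived example
advantages = [ret - v for ret, v in zip(returns, values)]
# advantages == [1.0, 0.5, 0.0]: earlier states are undervalued by the critic
```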

GAE (Generalized Advantage Estimation)

GAE estimates the advantage as an exponentially weighted sum of n-step TD errors, with the weighting controlled by lambda.

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Compute GAE (Generalized Advantage Estimation)"""
    n_steps = len(rewards)
    advantages = np.zeros(n_steps)
    last_gae = 0.0

    for t in reversed(range(n_steps)):
        if t == n_steps - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]

        if dones[t]:
            next_val = 0.0
            last_gae = 0.0

        delta = rewards[t] + gamma * next_val - values[t]
        advantages[t] = last_gae = delta + gamma * gae_lambda * last_gae

    returns = advantages + np.array(values)
    return torch.tensor(returns, dtype=torch.float32), \
           torch.tensor(advantages, dtype=torch.float32)

GAE's lambda parameter controls the bias-variance tradeoff:

  • lambda = 0: 1-step TD (low variance, high bias)
  • lambda = 1: Monte Carlo return (high variance, low bias)
  • lambda = 0.95: Commonly used value in practice
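
The two limiting cases can be checked numerically with a standalone reimplementation of the recursion above (illustrative numbers, no episode terminals):

```python
import numpy as np

def gae(rewards, values, next_value, gamma, lam):
    """GAE backward recursion, assuming no episode terminals."""
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_val = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_val - values[t]
        adv[t] = last = delta + gamma * lam * last
    return adv

rewards = [1.0, 0.0, 2.0]
values = [3.0, 2.5, 4.0]
next_value = 1.0
gamma = 0.99

# lambda = 0: advantages collapse to the 1-step TD errors
adv0 = gae(rewards, values, next_value, gamma, 0.0)
deltas = [1.0 + 0.99 * 2.5 - 3.0,
          0.0 + 0.99 * 4.0 - 2.5,
          2.0 + 0.99 * 1.0 - 4.0]

# lambda = 1: the deltas telescope into the discounted
# Monte Carlo return minus the baseline V(s_t)
adv1 = gae(rewards, values, next_value, gamma, 1.0)
mc0 = 1.0 + 0.99 * (0.0 + 0.99 * (2.0 + 0.99 * 1.0)) - 3.0
```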

A2C Training Loop

CartPole A2C

import gymnasium as gym

def train_a2c_cartpole():
    """Train CartPole with A2C"""
    N_ENVS = 8
    N_STEPS = 5
    GAMMA = 0.99
    LEARNING_RATE = 7e-4
    VALUE_LOSS_COEF = 0.5
    ENTROPY_COEF = 0.01
    MAX_GRAD_NORM = 0.5
    TOTAL_STEPS = 200000

    envs = gym.make_vec("CartPole-v1", num_envs=N_ENVS)
    obs_size = envs.single_observation_space.shape[0]
    n_actions = envs.single_action_space.n
    device = torch.device("cpu")
    model = A2CNetwork(obs_size, n_actions).to(device)
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

    obs, _ = envs.reset()
    episode_rewards = np.zeros(N_ENVS)
    completed_rewards = []
    global_step = 0

    while global_step < TOTAL_STEPS:
        batch_obs = []
        batch_actions = []
        batch_log_probs = []
        batch_values = []
        batch_rewards = []
        batch_dones = []
        batch_entropies = []

        for step in range(N_STEPS):
            obs_t = torch.tensor(obs, dtype=torch.float32).to(device)
            # Keep the graph: the stored log_probs, values, and entropies must
            # remain differentiable for policy_loss and value_loss to backprop
            actions, log_probs, values, entropies = model.get_action_and_value(obs_t)

            next_obs, rewards, terminateds, truncateds, infos = envs.step(actions.numpy())
            dones = np.logical_or(terminateds, truncateds)

            batch_obs.append(obs_t)
            batch_actions.append(actions)
            batch_log_probs.append(log_probs)
            batch_values.append(values)
            batch_rewards.append(rewards)
            batch_dones.append(dones)
            batch_entropies.append(entropies)

            episode_rewards += rewards
            for i, done in enumerate(dones):
                if done:
                    completed_rewards.append(episode_rewards[i])
                    episode_rewards[i] = 0
            obs = next_obs
            global_step += N_ENVS

        with torch.no_grad():
            _, next_value = model(torch.tensor(obs, dtype=torch.float32).to(device))
            next_value = next_value.squeeze(-1)

        values_list = [v.detach().numpy() for v in batch_values]
        returns_list = []
        advantages_list = []

        for env_idx in range(N_ENVS):
            env_rewards = [batch_rewards[t][env_idx] for t in range(N_STEPS)]
            env_values = [values_list[t][env_idx] for t in range(N_STEPS)]
            env_dones = [batch_dones[t][env_idx] for t in range(N_STEPS)]
            env_next_val = next_value[env_idx].item()
            rets, advs = compute_gae(env_rewards, env_values, env_dones, env_next_val, GAMMA)
            returns_list.append(rets)
            advantages_list.append(advs)

        all_log_probs = torch.stack(batch_log_probs).view(-1)
        all_values = torch.stack(batch_values).view(-1)
        all_entropies = torch.stack(batch_entropies).view(-1)
        all_returns = torch.stack(returns_list, dim=1).view(-1)
        all_advantages = torch.stack(advantages_list, dim=1).view(-1)
        all_advantages = (all_advantages - all_advantages.mean()) / (all_advantages.std() + 1e-8)

        policy_loss = -(all_log_probs * all_advantages.detach()).mean()
        value_loss = F.mse_loss(all_values, all_returns.detach())
        entropy_loss = all_entropies.mean()
        total_loss = policy_loss + VALUE_LOSS_COEF * value_loss - ENTROPY_COEF * entropy_loss

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()

        if len(completed_rewards) >= 10 and global_step % 1000 < N_ENVS * N_STEPS:
            mean_reward = np.mean(completed_rewards[-10:])
            print(f"Step {global_step}: mean reward={mean_reward:.1f}, policy loss={policy_loss.item():.4f}")
            if mean_reward >= 475:
                print(f"Solved at step {global_step}!")
                break

    envs.close()
    return model, completed_rewards

# model, rewards = train_a2c_cartpole()

Hyperparameter Tuning

A2C performance is sensitive to hyperparameters. Here is a guide for each parameter:

Learning Rate

  • Too large (1e-2): Training becomes unstable and may diverge
  • Appropriate (7e-4 ~ 1e-3): Fast and stable training
  • Too small (1e-5): Training is very slow, takes long to converge

Entropy Coefficient

  • Entropy coefficient of 0: No exploration, may get stuck in local optima
  • Entropy coefficient of 0.01: Good balance of exploration and exploitation
  • Entropy coefficient of 0.5: Excessive exploration, very slow learning
  • Recommended range: 0.001 ~ 0.05
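
For intuition, the entropy bonus is largest for a uniform policy and zero for a deterministic one; a quick check with NumPy (illustrative action probabilities):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy -sum p * log p, skipping zero-probability actions."""
    probs = np.asarray(probs)
    nz = probs[probs > 0]
    return float(-(nz * np.log(nz)).sum())

uniform = entropy([0.25, 0.25, 0.25, 0.25])    # ln(4) ~ 1.386, the max for 4 actions
peaked = entropy([0.97, 0.01, 0.01, 0.01])     # close to 0
deterministic = entropy([1.0, 0.0, 0.0, 0.0])  # exactly 0

# The term  - ENTROPY_COEF * entropy  in the A2C loss therefore
# penalizes policies that collapse to a single action too early.
```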

Number of Parallel Environments

More parallel environments lead to more diverse data per update:

  • 1 environment: High variance, slow learning
  • 8 environments: Good balance (recommended for CartPole)
  • 16 environments: Suitable for Atari games
  • 32 environments: More stable but increased memory usage
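
With the rollout scheme above, each gradient update consumes n_envs * n_steps transitions, so adding environments grows the batch rather than the rollout length (settings taken from the training loop in this article):

```python
# Transitions collected per gradient update = parallel envs * rollout length
N_ENVS = 8    # CartPole setting from the training loop above
N_STEPS = 5
batch_size = N_ENVS * N_STEPS   # 40 transitions per update

# Atari setting from the summary table: 16 envs * 5 steps
atari_batch = 16 * 5            # 80 transitions per update
```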

Hyperparameter Summary

Parameter              | CartPole recommended | Pong recommended | Role
-----------------------|----------------------|------------------|--------------------------
Learning rate          | 7e-4                 | 7e-4             | Parameter update size
Gamma                  | 0.99                 | 0.99             | Future reward discount
Entropy coefficient    | 0.01                 | 0.01             | Exploration intensity
Value loss coefficient | 0.5                  | 0.5              | Critic learning strength
Gradient clipping      | 0.5                  | 0.5              | Training stability
N-steps                | 5                    | 5                | Update interval
Parallel environments  | 8                    | 16               | Data diversity
GAE lambda             | 0.95                 | 0.95             | Bias-variance tradeoff

A2C vs A3C

A3C (Asynchronous Advantage Actor-Critic) is the asynchronous counterpart of A2C: each worker computes gradients and updates the shared parameters on its own schedule, without synchronizing.

In practice, A2C is used more often than A3C: synchronous batches map well onto GPUs, the implementation is simpler, and performance is equal or better.


Debugging Tips

Common problems and solutions when training A2C:

  • Reward not changing: Check if learning rate is too small, check if entropy converges to 0 (early convergence), check for vanishing gradients
  • Reward dropping sharply: Check if learning rate is too large, ensure gradient clipping is applied, check if value loss is exploding
  • Entropy converging to 0: Increase entropy coefficient, decrease learning rate, verify action space is correct
  • Value loss not decreasing: Increase value loss coefficient, verify return computation is correct, check if gamma is appropriate

Metrics to Monitor

  1. Reward (most important)
  2. Policy loss (should decrease stably)
  3. Value loss (should decrease stably)
  4. Entropy (should decrease gradually but never reach 0)
  5. Advantage statistics (mean near 0, appropriate variance)
  6. Gradient norm (should not explode)
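
A minimal sketch of monitoring the gradient norm: `clip_grad_norm_` returns the total norm *before* clipping, so the clipping call itself doubles as the monitoring metric (toy model for illustration):

```python
import torch
import torch.nn as nn

# Toy model and loss standing in for the A2C network and total_loss
model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Returns the pre-clip total gradient norm; log it every update
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
# A sudden spike in grad_norm usually precedes the
# "reward dropping sharply" failure mode described above.
```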

Complete Series Summary

Summary of the core topics covered in this deep reinforcement learning series:

Part | Topic            | Key Concepts
-----|------------------|------------------------------------------------
01   | What is RL       | MDP, agent-environment interaction, reward
02   | OpenAI Gym       | Environment API, wrappers, vector environments
03   | PyTorch basics   | Tensors, autograd, neural networks
04   | Cross-Entropy    | Elite episode selection, CartPole
05   | Bellman equation | Value functions, value iteration, Q-learning
06   | DQN              | Experience replay, target network
07   | DQN extensions   | Double, Dueling, Rainbow
08   | Stock trading    | Financial environment design, reward function
09   | Policy Gradient  | REINFORCE, variance reduction
10   | Actor-Critic     | A2C, hyperparameter tuning

Next Steps

Advanced topics not covered in this series:

  • PPO (Proximal Policy Optimization): Currently the most widely used policy-based algorithm
  • SAC (Soft Actor-Critic): Off-policy actor-critic with entropy regularization
  • Model-based RL: Learning environment models for improved sample efficiency
  • Multi-agent RL: Environments where multiple agents cooperate/compete
  • RLHF: Reinforcement learning from human feedback (used in LLM training)

Summary

  1. Actor-Critic: Simultaneously learns policy (Actor) and value (Critic) to reduce variance
  2. A2C: Improves data collection efficiency with synchronized parallel environments
  3. GAE: Controls bias-variance tradeoff with lambda for advantage estimation
  4. Hyperparameters: Learning rate, entropy coefficient, number of environments, and N-step are key
  5. Debugging: Continuously monitor reward, loss, entropy, and gradient norm

Actor-Critic methods are the foundation of modern reinforcement learning. State-of-the-art algorithms like PPO and SAC are all based on the Actor-Critic architecture.