Author: Youngju Kim (@fjvbn20031)
Review of REINFORCE Variance Problem
In the previous article, we examined the REINFORCE algorithm. Its core problem was the high variance of its gradient estimates: REINFORCE can only update after an entire episode ends (Monte Carlo), so the gradient computed from a single episode is extremely noisy. A baseline can reduce the variance, but a more fundamental solution is needed.
Actor-Critic Architecture
Actor-Critic combines two components:
- Actor (policy): selects actions given the state, pi(a|s; theta)
- Critic (value function): evaluates the value of the current state, V(s; phi)
The key idea is to use TD (Temporal Difference) estimates instead of Monte Carlo returns to reduce variance.
REINFORCE vs Actor-Critic
REINFORCE: grad_theta log pi(a|s) * G_t (must wait until the episode ends)
Actor-Critic: grad_theta log pi(a|s) * (r + gamma * V(s') - V(s)) (needs only a single step)
r + gamma * V(s') - V(s) is called the TD error or advantage estimate. V(s) serves as the baseline while simultaneously providing an estimate of the return.
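With made-up numbers, the TD error is a one-line computation. The reward and critic estimates below are hypothetical, chosen only to illustrate the sign of the advantage:

```python
# Hypothetical transition (s, a, r, s') with made-up critic estimates.
gamma = 0.99
r = 1.0           # observed reward
v_s = 2.0         # critic's estimate V(s)
v_s_next = 1.5    # critic's estimate V(s')

td_error = r + gamma * v_s_next - v_s  # advantage estimate
print(td_error)  # ~0.485 (positive: the action did better than expected)
```

A positive TD error pushes the policy to take the action more often; a negative one pushes it to take the action less often.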
A2C (Advantage Actor-Critic) Implementation
A2C is the synchronous version of Actor-Critic. It runs multiple environments in parallel to collect diverse experience simultaneously.
Network Architecture
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class A2CNetwork(nn.Module):
    """Shared network for A2C (Actor + Critic)."""

    def __init__(self, obs_size, n_actions, hidden_size=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden_size, n_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        features = self.shared(x)
        logits = self.actor(features)
        value = self.critic(features)
        return logits, value

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, value.squeeze(-1), entropy
```
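The sampling step inside `get_action_and_value` can be isolated into a self-contained sketch. The logits below are made-up values for a single state with three actions:

```python
import torch
import torch.nn.functional as F

# Made-up logits for one state with three actions.
logits = torch.tensor([[2.0, 0.5, -1.0]])
probs = F.softmax(logits, dim=-1)             # probabilities, sum to 1
dist = torch.distributions.Categorical(probs)
action = dist.sample()                        # stochastic action, shape (1,)
log_prob = dist.log_prob(action)              # log pi(a|s), used in the policy loss
entropy = dist.entropy()                      # exploration bonus term
print(action.shape, log_prob.shape, entropy.shape)
```

Sampling from `Categorical` (rather than taking the argmax) is what keeps the policy stochastic during training.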
CNN A2C for Atari
```python
class A2CCNN(nn.Module):
    """CNN-based A2C network for Atari."""

    def __init__(self, input_channels, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        conv_out_size = self._get_conv_out(input_channels)
        self.fc = nn.Sequential(nn.Linear(conv_out_size, 512), nn.ReLU())
        self.actor = nn.Linear(512, n_actions)
        self.critic = nn.Linear(512, 1)

    def _get_conv_out(self, channels):
        o = self.conv(torch.zeros(1, channels, 84, 84))
        return int(np.prod(o.size()))

    def forward(self, x):
        x = x.float() / 255.0  # scale pixel values to [0, 1]
        conv_out = self.conv(x).view(x.size(0), -1)
        features = self.fc(conv_out)
        return self.actor(features), self.critic(features)

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action, dist.log_prob(action), value.squeeze(-1), dist.entropy()
```
N-step Advantage Computation
A2C uses rewards from multiple steps rather than a single step to compute advantages, balancing bias and variance.
```python
def compute_advantages(rewards, values, dones, next_value, gamma=0.99):
    """Compute n-step returns and advantages."""
    n_steps = len(rewards)
    returns = []
    advantages = []
    R = next_value  # bootstrap from the critic's estimate of the last state
    for t in reversed(range(n_steps)):
        if dones[t]:
            R = 0.0  # episode boundary: do not bootstrap across it
        R = rewards[t] + gamma * R
        returns.insert(0, R)
        advantages.insert(0, R - values[t])
    returns = torch.tensor(returns, dtype=torch.float32)
    advantages = torch.tensor(advantages, dtype=torch.float32)
    return returns, advantages
```
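A hand-checkable run of this backward recursion, using made-up numbers (three steps, no episode end, bootstrap value 10):

```python
gamma = 0.99
rewards = [1.0, 1.0, 1.0]
dones = [False, False, False]
next_value = 10.0  # critic's bootstrap estimate V(s_{t+n})

R = next_value
returns = []
for t in reversed(range(len(rewards))):
    if dones[t]:
        R = 0.0  # episode boundary: no bootstrapping across it
    R = rewards[t] + gamma * R
    returns.insert(0, R)
print(returns)  # roughly [12.673, 11.791, 10.9]
```

Each return is its reward plus the discounted return that follows it, so earlier steps accumulate more of the bootstrap value.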
GAE (Generalized Advantage Estimation)
GAE estimates the advantage as an exponentially weighted average of multi-step TD errors.
```python
def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Compute GAE (Generalized Advantage Estimation)."""
    n_steps = len(rewards)
    advantages = np.zeros(n_steps)
    last_gae = 0.0
    for t in reversed(range(n_steps)):
        if t == n_steps - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]
        if dones[t]:
            next_val = 0.0   # no value beyond the episode boundary
            last_gae = 0.0   # reset the accumulated advantage as well
        delta = rewards[t] + gamma * next_val - values[t]
        advantages[t] = last_gae = delta + gamma * gae_lambda * last_gae
    returns = advantages + np.array(values)
    return torch.tensor(returns, dtype=torch.float32), \
        torch.tensor(advantages, dtype=torch.float32)
```
GAE's lambda parameter controls the bias-variance tradeoff:
- lambda = 0: 1-step TD (low variance, high bias)
- lambda = 1: Monte Carlo return (high variance, low bias)
- lambda = 0.95: Commonly used value in practice
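The two limiting cases can be verified with a stripped-down version of the GAE recursion over made-up numbers (two steps, no episode ends):

```python
def gae(rewards, values, next_value, gamma, lam):
    # Backward pass over TD errors, as in compute_gae (done flags omitted).
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nv = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * nv - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

rewards, values, next_value, gamma = [1.0, 1.0], [2.0, 1.5], 1.0, 0.99
print(gae(rewards, values, next_value, gamma, 0.0))  # ~[0.485, 0.49]: the raw 1-step TD errors
print(gae(rewards, values, next_value, gamma, 1.0))  # ~[0.9701, 0.49]: discounted return minus V(s)
```

At lambda = 0 each advantage is just its own TD error; at lambda = 1 the TD errors telescope into the full Monte Carlo return minus the baseline V(s).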
A2C Training Loop
CartPole A2C
```python
import gymnasium as gym

def train_a2c_cartpole():
    """Train A2C on CartPole."""
    N_ENVS = 8
    N_STEPS = 5
    GAMMA = 0.99
    LEARNING_RATE = 7e-4
    VALUE_LOSS_COEF = 0.5
    ENTROPY_COEF = 0.01
    MAX_GRAD_NORM = 0.5
    TOTAL_STEPS = 200000

    envs = gym.make_vec("CartPole-v1", num_envs=N_ENVS)
    obs_size = envs.single_observation_space.shape[0]
    n_actions = envs.single_action_space.n
    device = torch.device("cpu")
    model = A2CNetwork(obs_size, n_actions).to(device)
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

    obs, _ = envs.reset()
    episode_rewards = np.zeros(N_ENVS)
    completed_rewards = []
    global_step = 0

    while global_step < TOTAL_STEPS:
        batch_obs = []
        batch_actions = []
        batch_log_probs = []
        batch_values = []
        batch_rewards = []
        batch_dones = []
        batch_entropies = []

        for step in range(N_STEPS):
            obs_t = torch.tensor(obs, dtype=torch.float32).to(device)
            # No torch.no_grad() here: the update at the end of the rollout
            # backpropagates through these log-probs, values, and entropies.
            actions, log_probs, values, entropies = model.get_action_and_value(obs_t)
            next_obs, rewards, terminateds, truncateds, infos = envs.step(actions.numpy())
            dones = np.logical_or(terminateds, truncateds)

            batch_obs.append(obs_t)
            batch_actions.append(actions)
            batch_log_probs.append(log_probs)
            batch_values.append(values)
            batch_rewards.append(rewards)
            batch_dones.append(dones)
            batch_entropies.append(entropies)

            episode_rewards += rewards
            for i, done in enumerate(dones):
                if done:
                    completed_rewards.append(episode_rewards[i])
                    episode_rewards[i] = 0
            obs = next_obs
            global_step += N_ENVS

        # Bootstrap value for the state reached after the rollout.
        with torch.no_grad():
            _, next_value = model(torch.tensor(obs, dtype=torch.float32).to(device))
            next_value = next_value.squeeze(-1)

        values_list = [v.detach().numpy() for v in batch_values]
        returns_list = []
        advantages_list = []
        for env_idx in range(N_ENVS):
            env_rewards = [batch_rewards[t][env_idx] for t in range(N_STEPS)]
            env_values = [values_list[t][env_idx] for t in range(N_STEPS)]
            env_dones = [batch_dones[t][env_idx] for t in range(N_STEPS)]
            env_next_val = next_value[env_idx].item()
            rets, advs = compute_gae(env_rewards, env_values, env_dones, env_next_val, GAMMA)
            returns_list.append(rets)
            advantages_list.append(advs)

        all_log_probs = torch.stack(batch_log_probs).view(-1)
        all_values = torch.stack(batch_values).view(-1)
        all_entropies = torch.stack(batch_entropies).view(-1)
        all_returns = torch.stack(returns_list, dim=1).view(-1)
        all_advantages = torch.stack(advantages_list, dim=1).view(-1)
        all_advantages = (all_advantages - all_advantages.mean()) / (all_advantages.std() + 1e-8)

        policy_loss = -(all_log_probs * all_advantages.detach()).mean()
        value_loss = F.mse_loss(all_values, all_returns.detach())
        entropy_loss = all_entropies.mean()
        total_loss = policy_loss + VALUE_LOSS_COEF * value_loss - ENTROPY_COEF * entropy_loss

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()

        if len(completed_rewards) >= 10 and global_step % 1000 < N_ENVS * N_STEPS:
            mean_reward = np.mean(completed_rewards[-10:])
            print(f"Step {global_step}: mean reward={mean_reward:.1f}, policy loss={policy_loss.item():.4f}")
            if mean_reward >= 475:
                print(f"Solved at step {global_step}!")
                break

    envs.close()
    return model, completed_rewards

# model, rewards = train_a2c_cartpole()
```
Hyperparameter Tuning
A2C performance is sensitive to hyperparameters. Here is a guide for each parameter:
Learning Rate
- Too large (1e-2): Training becomes unstable and may diverge
- Appropriate (7e-4 ~ 1e-3): Fast and stable training
- Too small (1e-5): Training is very slow, takes long to converge
Entropy Coefficient
- Entropy coefficient of 0: No exploration, may get stuck in local optima
- Entropy coefficient of 0.01: Good balance of exploration and exploitation
- Entropy coefficient of 0.5: Excessive exploration, very slow learning
- Recommended range: 0.001 ~ 0.05
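What the entropy term actually measures can be seen directly. The probability vectors below are made-up examples of a uniform and a nearly deterministic policy:

```python
import math

def policy_entropy(probs):
    """Shannon entropy of a discrete action distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(policy_entropy([0.5, 0.5]))    # ~0.693 = ln 2, maximal for two actions
print(policy_entropy([0.99, 0.01]))  # ~0.056, nearly deterministic policy
```

Subtracting `ENTROPY_COEF * entropy` from the loss rewards the agent for keeping this value high, delaying premature convergence to a deterministic policy.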
Number of Parallel Environments
More parallel environments lead to more diverse data per update:
- 1 environment: High variance, slow learning
- 8 environments: Good balance (recommended for CartPole)
- 16 environments: Suitable for Atari games
- 32 environments: More stable but increased memory usage
Hyperparameter Summary
| Parameter | CartPole Recommended | Pong Recommended | Role |
|---|---|---|---|
| Learning rate | 7e-4 | 7e-4 | Parameter update size |
| Gamma | 0.99 | 0.99 | Future reward discount |
| Entropy coefficient | 0.01 | 0.01 | Exploration intensity |
| Value loss coefficient | 0.5 | 0.5 | Critic learning strength |
| Gradient clipping | 0.5 | 0.5 | Training stability |
| N-steps | 5 | 5 | Update interval |
| Parallel environments | 8 | 16 | Data diversity |
| GAE lambda | 0.95 | 0.95 | Bias-variance tradeoff |
A2C vs A3C
A3C (Asynchronous Advantage Actor-Critic) is the asynchronous version of A2C.
In practice, A2C is used more often than A3C: it allows batched processing on a GPU, is simpler to implement, and performs as well as or better than A3C.
Debugging Tips
Common problems and solutions when training A2C:
- Reward not changing: Check if learning rate is too small, check if entropy converges to 0 (early convergence), check for vanishing gradients
- Reward dropping sharply: Check if learning rate is too large, ensure gradient clipping is applied, check if value loss is exploding
- Entropy converging to 0: Increase entropy coefficient, decrease learning rate, verify action space is correct
- Value loss not decreasing: Increase value loss coefficient, verify return computation is correct, check if gamma is appropriate
Metrics to Monitor
- Reward (most important)
- Policy loss (should decrease stably)
- Value loss (should decrease stably)
- Entropy (should decrease gradually but never reach 0)
- Advantage statistics (mean near 0, appropriate variance)
- Gradient norm (should not explode)
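Monitoring the gradient norm costs nothing extra, because `clip_grad_norm_` returns the total norm as it was *before* clipping. A minimal sketch with a hypothetical stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the A2C network
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# clip_grad_norm_ clips in place and returns the pre-clip norm,
# so the same call doubles as a monitoring hook.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
print(f"gradient norm before clipping: {float(grad_norm):.4f}")
```

Logging this value each update makes gradient explosions visible long before the reward curve collapses.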
Complete Series Summary
Summary of the core topics covered in this deep reinforcement learning series:
| Part | Topic | Key Concepts |
|---|---|---|
| 01 | What is RL | MDP, agent-environment interaction, reward |
| 02 | OpenAI Gym | Environment API, wrappers, vector environments |
| 03 | PyTorch basics | Tensors, autograd, neural networks |
| 04 | Cross-Entropy | Elite episode selection, CartPole |
| 05 | Bellman equation | Value functions, value iteration, Q-learning |
| 06 | DQN | Experience replay, target network |
| 07 | DQN extensions | Double, Dueling, Rainbow |
| 08 | Stock trading | Financial environment design, reward function |
| 09 | Policy Gradient | REINFORCE, variance reduction |
| 10 | Actor-Critic | A2C, hyperparameter tuning |
Next Steps
Advanced topics not covered in this series:
- PPO (Proximal Policy Optimization): Currently the most widely used policy-based algorithm
- SAC (Soft Actor-Critic): Off-policy actor-critic with entropy regularization
- Model-based RL: Learning environment models for improved sample efficiency
- Multi-agent RL: Environments where multiple agents cooperate/compete
- RLHF: Reinforcement learning from human feedback (used in LLM training)
Summary
- Actor-Critic: Simultaneously learns policy (Actor) and value (Critic) to reduce variance
- A2C: Improves data collection efficiency with synchronized parallel environments
- GAE: Controls bias-variance tradeoff with lambda for advantage estimation
- Hyperparameters: Learning rate, entropy coefficient, number of environments, and N-step are key
- Debugging: Continuously monitor reward, loss, entropy, and gradient norm
Actor-Critic methods are the foundation of modern reinforcement learning. State-of-the-art algorithms like PPO and SAC are all based on the Actor-Critic architecture.