Author: Youngju Kim (@fjvbn20031)
Review of REINFORCE Variance Problem
In the previous article, we examined the REINFORCE algorithm. Its core problem was the high variance of its gradient estimates: REINFORCE can only update after an entire episode ends (Monte Carlo), so the gradient computed from a single episode is extremely noisy. A baseline can reduce the variance, but a more fundamental solution is needed.
Actor-Critic Architecture
Actor-Critic combines two components:
- Actor (policy): selects actions given the state, pi(a|s; theta)
- Critic (value function): evaluates the value of the current state, V(s; phi)
The key idea is to use TD (Temporal Difference) estimates instead of Monte Carlo returns to reduce variance.
REINFORCE vs Actor-Critic
REINFORCE: grad_theta log pi(a|s) * G_t (must wait until the episode ends)
Actor-Critic: grad_theta log pi(a|s) * (r + gamma * V(s') - V(s)) (needs only a single step)
r + gamma * V(s') - V(s) is called the TD error or advantage estimate. V(s) serves as the baseline while simultaneously providing an estimate of the return.
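With made-up numbers, the TD error is a one-line computation. The reward and critic estimates below are hypothetical, chosen only to illustrate the sign of the advantage:

```python
# Hypothetical transition (s, a, r, s') with made-up critic estimates.
gamma = 0.99
r = 1.0           # observed reward
v_s = 2.0         # critic's estimate V(s)
v_s_next = 1.5    # critic's estimate V(s')

td_error = r + gamma * v_s_next - v_s  # advantage estimate
print(td_error)  # ~0.485 (positive: the action did better than expected)
```

A positive TD error pushes the policy to take the action more often; a negative one pushes it to take the action less often.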
A2C (Advantage Actor-Critic) Implementation
A2C is the synchronous version of Actor-Critic. It runs multiple environments in parallel to collect diverse experience simultaneously.
Network Architecture
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class A2CNetwork(nn.Module):
    """Shared network for A2C (Actor + Critic)."""

    def __init__(self, obs_size, n_actions, hidden_size=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden_size, n_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        features = self.shared(x)
        logits = self.actor(features)
        value = self.critic(features)
        return logits, value

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, value.squeeze(-1), entropy
```
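The sampling step inside `get_action_and_value` can be isolated into a self-contained sketch. The logits below are made-up values for a single state with three actions:

```python
import torch
import torch.nn.functional as F

# Made-up logits for one state with three actions.
logits = torch.tensor([[2.0, 0.5, -1.0]])
probs = F.softmax(logits, dim=-1)             # probabilities, sum to 1
dist = torch.distributions.Categorical(probs)
action = dist.sample()                        # stochastic action, shape (1,)
log_prob = dist.log_prob(action)              # log pi(a|s), used in the policy loss
entropy = dist.entropy()                      # exploration bonus term
print(action.shape, log_prob.shape, entropy.shape)
```

Sampling from `Categorical` (rather than taking the argmax) is what keeps the policy stochastic during training.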
CNN A2C for Atari
```python
class A2CCNN(nn.Module):
    """CNN-based A2C network for Atari."""

    def __init__(self, input_channels, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        conv_out_size = self._get_conv_out(input_channels)
        self.fc = nn.Sequential(nn.Linear(conv_out_size, 512), nn.ReLU())
        self.actor = nn.Linear(512, n_actions)
        self.critic = nn.Linear(512, 1)

    def _get_conv_out(self, channels):
        o = self.conv(torch.zeros(1, channels, 84, 84))
        return int(np.prod(o.size()))

    def forward(self, x):
        x = x.float() / 255.0  # scale pixel values to [0, 1]
        conv_out = self.conv(x).view(x.size(0), -1)
        features = self.fc(conv_out)
        return self.actor(features), self.critic(features)

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action, dist.log_prob(action), value.squeeze(-1), dist.entropy()
```
N-step Advantage Computation
A2C uses rewards from multiple steps rather than a single step to compute advantages, balancing bias and variance.
```python
def compute_advantages(rewards, values, dones, next_value, gamma=0.99):
    """Compute n-step returns and advantages."""
    n_steps = len(rewards)
    returns = []
    advantages = []
    R = next_value  # bootstrap from the critic's estimate of the last state
    for t in reversed(range(n_steps)):
        if dones[t]:
            R = 0.0  # episode boundary: do not bootstrap across it
        R = rewards[t] + gamma * R
        returns.insert(0, R)
        advantages.insert(0, R - values[t])
    returns = torch.tensor(returns, dtype=torch.float32)
    advantages = torch.tensor(advantages, dtype=torch.float32)
    return returns, advantages
```
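A hand-checkable run of this backward recursion, using made-up numbers (three steps, no episode end, bootstrap value 10):

```python
gamma = 0.99
rewards = [1.0, 1.0, 1.0]
dones = [False, False, False]
next_value = 10.0  # critic's bootstrap estimate V(s_{t+n})

R = next_value
returns = []
for t in reversed(range(len(rewards))):
    if dones[t]:
        R = 0.0  # episode boundary: no bootstrapping across it
    R = rewards[t] + gamma * R
    returns.insert(0, R)
print(returns)  # roughly [12.673, 11.791, 10.9]
```

Each return is its reward plus the discounted return that follows it, so earlier steps accumulate more of the bootstrap value.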
GAE (Generalized Advantage Estimation)
GAE estimates the advantage as an exponentially weighted average of multi-step TD errors.
```python
def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Compute GAE (Generalized Advantage Estimation)."""
    n_steps = len(rewards)
    advantages = np.zeros(n_steps)
    last_gae = 0.0
    for t in reversed(range(n_steps)):
        if t == n_steps - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]
        if dones[t]:
            next_val = 0.0   # no value beyond the episode boundary
            last_gae = 0.0   # reset the accumulated advantage as well
        delta = rewards[t] + gamma * next_val - values[t]
        advantages[t] = last_gae = delta + gamma * gae_lambda * last_gae
    returns = advantages + np.array(values)
    return torch.tensor(returns, dtype=torch.float32), \
        torch.tensor(advantages, dtype=torch.float32)
```
GAE's lambda parameter controls the bias-variance tradeoff:
- lambda = 0: 1-step TD (low variance, high bias)
- lambda = 1: Monte Carlo return (high variance, low bias)
- lambda = 0.95: Commonly used value in practice
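The two limiting cases can be verified with a stripped-down version of the GAE recursion over made-up numbers (two steps, no episode ends):

```python
def gae(rewards, values, next_value, gamma, lam):
    # Backward pass over TD errors, as in compute_gae (done flags omitted).
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nv = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * nv - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

rewards, values, next_value, gamma = [1.0, 1.0], [2.0, 1.5], 1.0, 0.99
print(gae(rewards, values, next_value, gamma, 0.0))  # ~[0.485, 0.49]: the raw 1-step TD errors
print(gae(rewards, values, next_value, gamma, 1.0))  # ~[0.9701, 0.49]: discounted return minus V(s)
```

At lambda = 0 each advantage is just its own TD error; at lambda = 1 the TD errors telescope into the full Monte Carlo return minus the baseline V(s).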
A2C Training Loop
CartPole A2C
```python
import gymnasium as gym

def train_a2c_cartpole():
    """Train A2C on CartPole."""
    N_ENVS = 8
    N_STEPS = 5
    GAMMA = 0.99
    LEARNING_RATE = 7e-4
    VALUE_LOSS_COEF = 0.5
    ENTROPY_COEF = 0.01
    MAX_GRAD_NORM = 0.5
    TOTAL_STEPS = 200000

    envs = gym.make_vec("CartPole-v1", num_envs=N_ENVS)
    obs_size = envs.single_observation_space.shape[0]
    n_actions = envs.single_action_space.n
    device = torch.device("cpu")
    model = A2CNetwork(obs_size, n_actions).to(device)
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

    obs, _ = envs.reset()
    episode_rewards = np.zeros(N_ENVS)
    completed_rewards = []
    global_step = 0

    while global_step < TOTAL_STEPS:
        batch_obs = []
        batch_actions = []
        batch_log_probs = []
        batch_values = []
        batch_rewards = []
        batch_dones = []
        batch_entropies = []

        for step in range(N_STEPS):
            obs_t = torch.tensor(obs, dtype=torch.float32).to(device)
            # No torch.no_grad() here: the update at the end of the rollout
            # backpropagates through these log-probs, values, and entropies.
            actions, log_probs, values, entropies = model.get_action_and_value(obs_t)
            next_obs, rewards, terminateds, truncateds, infos = envs.step(actions.numpy())
            dones = np.logical_or(terminateds, truncateds)

            batch_obs.append(obs_t)
            batch_actions.append(actions)
            batch_log_probs.append(log_probs)
            batch_values.append(values)
            batch_rewards.append(rewards)
            batch_dones.append(dones)
            batch_entropies.append(entropies)

            episode_rewards += rewards
            for i, done in enumerate(dones):
                if done:
                    completed_rewards.append(episode_rewards[i])
                    episode_rewards[i] = 0
            obs = next_obs
            global_step += N_ENVS

        # Bootstrap value for the state reached after the rollout.
        with torch.no_grad():
            _, next_value = model(torch.tensor(obs, dtype=torch.float32).to(device))
            next_value = next_value.squeeze(-1)

        values_list = [v.detach().numpy() for v in batch_values]
        returns_list = []
        advantages_list = []
        for env_idx in range(N_ENVS):
            env_rewards = [batch_rewards[t][env_idx] for t in range(N_STEPS)]
            env_values = [values_list[t][env_idx] for t in range(N_STEPS)]
            env_dones = [batch_dones[t][env_idx] for t in range(N_STEPS)]
            env_next_val = next_value[env_idx].item()
            rets, advs = compute_gae(env_rewards, env_values, env_dones, env_next_val, GAMMA)
            returns_list.append(rets)
            advantages_list.append(advs)

        all_log_probs = torch.stack(batch_log_probs).view(-1)
        all_values = torch.stack(batch_values).view(-1)
        all_entropies = torch.stack(batch_entropies).view(-1)
        all_returns = torch.stack(returns_list, dim=1).view(-1)
        all_advantages = torch.stack(advantages_list, dim=1).view(-1)
        all_advantages = (all_advantages - all_advantages.mean()) / (all_advantages.std() + 1e-8)

        policy_loss = -(all_log_probs * all_advantages.detach()).mean()
        value_loss = F.mse_loss(all_values, all_returns.detach())
        entropy_loss = all_entropies.mean()
        total_loss = policy_loss + VALUE_LOSS_COEF * value_loss - ENTROPY_COEF * entropy_loss

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()

        if len(completed_rewards) >= 10 and global_step % 1000 < N_ENVS * N_STEPS:
            mean_reward = np.mean(completed_rewards[-10:])
            print(f"Step {global_step}: mean reward={mean_reward:.1f}, policy loss={policy_loss.item():.4f}")
            if mean_reward >= 475:
                print(f"Solved at step {global_step}!")
                break

    envs.close()
    return model, completed_rewards

# model, rewards = train_a2c_cartpole()
```
Hyperparameter Tuning
A2C performance is sensitive to hyperparameters. Here is a guide for each parameter:
Learning Rate
- Too large (1e-2): Training becomes unstable and may diverge
- Appropriate (7e-4 ~ 1e-3): Fast and stable training
- Too small (1e-5): Training is very slow, takes long to converge
Entropy Coefficient
- Entropy coefficient of 0: No exploration, may get stuck in local optima
- Entropy coefficient of 0.01: Good balance of exploration and exploitation
- Entropy coefficient of 0.5: Excessive exploration, very slow learning
- Recommended range: 0.001 ~ 0.05
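What the entropy term actually measures can be seen directly. The probability vectors below are made-up examples of a uniform and a nearly deterministic policy:

```python
import math

def policy_entropy(probs):
    """Shannon entropy of a discrete action distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(policy_entropy([0.5, 0.5]))    # ~0.693 = ln 2, maximal for two actions
print(policy_entropy([0.99, 0.01]))  # ~0.056, nearly deterministic policy
```

Subtracting `ENTROPY_COEF * entropy` from the loss rewards the agent for keeping this value high, delaying premature convergence to a deterministic policy.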
Number of Parallel Environments
More parallel environments lead to more diverse data per update:
- 1 environment: High variance, slow learning
- 8 environments: Good balance (recommended for CartPole)
- 16 environments: Suitable for Atari games
- 32 environments: More stable but increased memory usage
Hyperparameter Summary
| Parameter | CartPole Recommended | Pong Recommended | Role |
|---|---|---|---|
| Learning rate | 7e-4 | 7e-4 | Parameter update size |
| Gamma | 0.99 | 0.99 | Future reward discount |
| Entropy coefficient | 0.01 | 0.01 | Exploration intensity |
| Value loss coefficient | 0.5 | 0.5 | Critic learning strength |
| Gradient clipping | 0.5 | 0.5 | Training stability |
| N-steps | 5 | 5 | Update interval |
| Parallel environments | 8 | 16 | Data diversity |
| GAE lambda | 0.95 | 0.95 | Bias-variance tradeoff |
A2C vs A3C
A3C (Asynchronous Advantage Actor-Critic) is the asynchronous version of A2C.
In practice, A2C is used more often than A3C: it allows batched processing on a GPU, is simpler to implement, and performs as well as or better than A3C.
Debugging Tips
Common problems and solutions when training A2C:
- Reward not changing: Check if learning rate is too small, check if entropy converges to 0 (early convergence), check for vanishing gradients
- Reward dropping sharply: Check if learning rate is too large, ensure gradient clipping is applied, check if value loss is exploding
- Entropy converging to 0: Increase entropy coefficient, decrease learning rate, verify action space is correct
- Value loss not decreasing: Increase value loss coefficient, verify return computation is correct, check if gamma is appropriate
Metrics to Monitor
- Reward (most important)
- Policy loss (should decrease stably)
- Value loss (should decrease stably)
- Entropy (should decrease gradually but never reach 0)
- Advantage statistics (mean near 0, appropriate variance)
- Gradient norm (should not explode)
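Monitoring the gradient norm costs nothing extra, because `clip_grad_norm_` returns the total norm as it was *before* clipping. A minimal sketch with a hypothetical stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the A2C network
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# clip_grad_norm_ clips in place and returns the pre-clip norm,
# so the same call doubles as a monitoring hook.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
print(f"gradient norm before clipping: {float(grad_norm):.4f}")
```

Logging this value each update makes gradient explosions visible long before the reward curve collapses.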
Complete Series Summary
Summary of the core topics covered in this deep reinforcement learning series:
| Part | Topic | Key Concepts |
|---|---|---|
| 01 | What is RL | MDP, agent-environment interaction, reward |
| 02 | OpenAI Gym | Environment API, wrappers, vector environments |
| 03 | PyTorch basics | Tensors, autograd, neural networks |
| 04 | Cross-Entropy | Elite episode selection, CartPole |
| 05 | Bellman equation | Value functions, value iteration, Q-learning |
| 06 | DQN | Experience replay, target network |
| 07 | DQN extensions | Double, Dueling, Rainbow |
| 08 | Stock trading | Financial environment design, reward function |
| 09 | Policy Gradient | REINFORCE, variance reduction |
| 10 | Actor-Critic | A2C, hyperparameter tuning |
Next Steps
Advanced topics not covered in this series:
- PPO (Proximal Policy Optimization): Currently the most widely used policy-based algorithm
- SAC (Soft Actor-Critic): Off-policy actor-critic with entropy regularization
- Model-based RL: Learning environment models for improved sample efficiency
- Multi-agent RL: Environments where multiple agents cooperate/compete
- RLHF: Reinforcement learning from human feedback (used in LLM training)
Summary
- Actor-Critic: Simultaneously learns policy (Actor) and value (Critic) to reduce variance
- A2C: Improves data collection efficiency with synchronized parallel environments
- GAE: Controls bias-variance tradeoff with lambda for advantage estimation
- Hyperparameters: Learning rate, entropy coefficient, number of environments, and N-step are key
- Debugging: Continuously monitor reward, loss, entropy, and gradient norm
Actor-Critic methods are the foundation of modern reinforcement learning. State-of-the-art algorithms like PPO and SAC are all based on the Actor-Critic architecture.