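[Deep RL] 10. Actor-Critic Methods: A2C and Hyperparameter Tuning

Review: The Variance Problem in REINFORCE

The previous article covered the REINFORCE algorithm. Its core problem was the high variance of the gradient estimate.

REINFORCE can update only after an entire episode has finished (Monte Carlo), and a gradient computed from a single episode is very noisy.

A baseline reduces this variance, but a more fundamental fix is needed.

Actor-Critic Architecture

Actor-Critic combines two components.

Actor (policy): selects an action given a state. pi(a|s; theta)
Critic (value function): estimates the value of the current state. V(s; phi)

The key idea is to replace the Monte Carlo return with a TD (Temporal Difference) estimate, which lowers variance at the cost of some bias from the learned critic.

REINFORCE vs Actor-Critic

REINFORCE:     grad = grad_theta log pi(a|s) * G_t         (must wait until the episode ends)
Actor-Critic:  grad = grad_theta log pi(a|s) * (r + gamma * V(s') - V(s))  (needs only one step)

r + gamma * V(s') - V(s) is the TD error, which serves as an estimate of the advantage. V(s) acts as the baseline, and r + gamma * V(s') stands in for the return.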

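A2C (Advantage Actor-Critic) Implementation

A2C is the synchronous variant of A3C (Asynchronous Advantage Actor-Critic). It steps several environments in parallel and updates the shared model once on the combined batch of experience.

Network Architecture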

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class A2CNetwork(nn.Module):
    """Shared network for A2C (actor + critic)."""
    def __init__(self, obs_size, n_actions, hidden_size=256):
        super().__init__()

        # Shared feature extractor
        self.shared = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
        )

        # Actor head: outputs action logits (softmax is applied when sampling)
        self.actor = nn.Linear(hidden_size, n_actions)

        # Critic head: outputs the state value
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        features = self.shared(x)
        logits = self.actor(features)
        value = self.critic(features)
        return logits, value

    def get_action_and_value(self, state):
        """Sample an action and evaluate the state value in a single forward pass."""
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, value.squeeze(-1), entropy

CNN A2C for Atari

class A2CCNN(nn.Module):
    """CNN-based A2C network for Atari."""
    def __init__(self, input_channels, n_actions):
        super().__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )

        conv_out_size = self._get_conv_out(input_channels)

        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
        )

        self.actor = nn.Linear(512, n_actions)
        self.critic = nn.Linear(512, 1)

    def _get_conv_out(self, channels):
        o = self.conv(torch.zeros(1, channels, 84, 84))
        return int(np.prod(o.size()))

    def forward(self, x):
        x = x.float() / 255.0
        conv_out = self.conv(x).view(x.size(0), -1)
        features = self.fc(conv_out)
        logits = self.actor(features)
        value = self.critic(features)
        return logits, value

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, value.squeeze(-1), entropy

N-step Advantage Computation

A2C computes advantages from several steps of rewards rather than a single step, which balances bias against variance.

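def compute_advantages(rewards, values, dones, next_value, gamma=0.99):
    """
    Compute N-step returns and advantages.

    rewards: list of per-step rewards (length N)
    values: list of per-step value estimates (length N)
    dones: list of per-step episode-end flags (length N)
    next_value: value estimate of the state that follows the last step
    """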
    n_steps = len(rewards)
    returns = []
    advantages = []

    # Compute returns backwards, starting from the last step
    R = next_value
    for t in reversed(range(n_steps)):
        if dones[t]:
            R = 0.0
        R = rewards[t] + gamma * R
        returns.insert(0, R)
        advantages.insert(0, R - values[t])

    returns = torch.tensor(returns, dtype=torch.float32)
    advantages = torch.tensor(advantages, dtype=torch.float32)

    return returns, advantages

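GAE (Generalized Advantage Estimation)

GAE estimates the advantage as an exponentially weighted sum of TD errors, which is equivalent to an exponentially weighted average of n-step advantage estimates.

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Compute returns and advantages with GAE (Generalized Advantage Estimation)."""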
    n_steps = len(rewards)
    advantages = np.zeros(n_steps)
    last_gae = 0.0

    for t in reversed(range(n_steps)):
        if t == n_steps - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]

        if dones[t]:
            next_val = 0.0
            last_gae = 0.0

        # TD error
        delta = rewards[t] + gamma * next_val - values[t]

        # GAE: exponentially weighted sum of TD errors
        advantages[t] = last_gae = delta + gamma * gae_lambda * last_gae

    returns = advantages + np.array(values)
    return torch.tensor(returns, dtype=torch.float32), \
           torch.tensor(advantages, dtype=torch.float32)

The lambda parameter of GAE controls the bias-variance tradeoff.

lambda = 0: 1-step TD (low variance, high bias)
lambda = 1: Monte Carlo return (high variance, low bias)
lambda = 0.95: a value commonly used in practice

A2C Training Loop

CartPole A2C

import gymnasium as gym

def train_a2c_cartpole():
    """Train CartPole with A2C."""
    # Hyperparameters
    N_ENVS = 8          # number of parallel environments
    N_STEPS = 5          # rollout length between updates (steps)
    GAMMA = 0.99
    LEARNING_RATE = 7e-4
    VALUE_LOSS_COEF = 0.5
    ENTROPY_COEF = 0.01
    MAX_GRAD_NORM = 0.5
    TOTAL_STEPS = 200000

    # Create the vectorized environment
    envs = gym.make_vec("CartPole-v1", num_envs=N_ENVS)
    obs_size = envs.single_observation_space.shape[0]
    n_actions = envs.single_action_space.n

    device = torch.device("cpu")
    model = A2CNetwork(obs_size, n_actions).to(device)
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

    obs, _ = envs.reset()
    episode_rewards = np.zeros(N_ENVS)
    completed_rewards = []
    global_step = 0

    while global_step < TOTAL_STEPS:
        # Collect N steps of data
        batch_obs = []
        batch_actions = []
        batch_log_probs = []
        batch_values = []
        batch_rewards = []
        batch_dones = []
        batch_entropies = []

        for step in range(N_STEPS):
            obs_t = torch.tensor(obs, dtype=torch.float32).to(device)

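            # Track gradients here: log_probs, values and entropies feed the loss below.
            # Under torch.no_grad() the later backward() call would fail.
            actions, log_probs, values, entropies = model.get_action_and_value(obs_t)

            # Step the environments
            next_obs, rewards, terminateds, truncateds, infos = envs.step(actions.cpu().numpy())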
            dones = np.logical_or(terminateds, truncateds)

            batch_obs.append(obs_t)
            batch_actions.append(actions)
            batch_log_probs.append(log_probs)
            batch_values.append(values)
            batch_rewards.append(rewards)
            batch_dones.append(dones)
            batch_entropies.append(entropies)

            # Track episode rewards
            episode_rewards += rewards
            for i, done in enumerate(dones):
                if done:
                    completed_rewards.append(episode_rewards[i])
                    episode_rewards[i] = 0

            obs = next_obs
            global_step += N_ENVS

        # Value of the last state (used for bootstrapping)
        with torch.no_grad():
            _, next_value = model(torch.tensor(obs, dtype=torch.float32).to(device))
            next_value = next_value.squeeze(-1)

        # Compute returns and advantages
        values_list = [v.detach().numpy() for v in batch_values]
        returns_list = []
        advantages_list = []

        for env_idx in range(N_ENVS):
            env_rewards = [batch_rewards[t][env_idx] for t in range(N_STEPS)]
            env_values = [values_list[t][env_idx] for t in range(N_STEPS)]
            env_dones = [batch_dones[t][env_idx] for t in range(N_STEPS)]
            env_next_val = next_value[env_idx].item()

            rets, advs = compute_gae(env_rewards, env_values, env_dones, env_next_val, GAMMA)
            returns_list.append(rets)
            advantages_list.append(advs)

        # Convert to tensors
        all_log_probs = torch.stack(batch_log_probs).view(-1)
        all_values = torch.stack(batch_values).view(-1)
        all_entropies = torch.stack(batch_entropies).view(-1)
        all_returns = torch.stack(returns_list, dim=1).view(-1)
        all_advantages = torch.stack(advantages_list, dim=1).view(-1)

        # Normalize advantages
        all_advantages = (all_advantages - all_advantages.mean()) / (all_advantages.std() + 1e-8)

        # Compute losses
        policy_loss = -(all_log_probs * all_advantages.detach()).mean()
        value_loss = F.mse_loss(all_values, all_returns.detach())
        entropy_loss = all_entropies.mean()

        total_loss = policy_loss + VALUE_LOSS_COEF * value_loss - ENTROPY_COEF * entropy_loss

        # Update
        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()

        # Logging
        if len(completed_rewards) >= 10 and global_step % 1000 < N_ENVS * N_STEPS:
            mean_reward = np.mean(completed_rewards[-10:])
            print(
                f"step {global_step}: "
                f"mean reward={mean_reward:.1f}, "
                f"policy loss={policy_loss.item():.4f}, "
                f"value loss={value_loss.item():.4f}, "
                f"entropy={entropy_loss.item():.4f}"
            )

            if mean_reward >= 475:
                print(f"Solved at step {global_step}!")
                break

    envs.close()
    return model, completed_rewards

# model, rewards = train_a2c_cartpole()

A2C on Pong

The skeleton below shows how A2C is set up for Pong.

def train_a2c_pong():
    """Train Pong with A2C (structural example)."""
    N_ENVS = 16
    N_STEPS = 5
    GAMMA = 0.99
    LEARNING_RATE = 7e-4
    VALUE_LOSS_COEF = 0.5
    ENTROPY_COEF = 0.01
    MAX_GRAD_NORM = 0.5

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Create the Atari environments (preprocessing wrappers applied to each one)
    # envs = make_vec_atari_envs("ALE/Pong-v5", N_ENVS)

    model = A2CCNN(input_channels=4, n_actions=6).to(device)
    optimizer = optim.RMSprop(model.parameters(), lr=LEARNING_RATE, alpha=0.99, eps=1e-5)

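    # The training loop has the same structure as the CartPole one
    # Main differences:
    # 1. CNN network
    # 2. RMSprop optimizer (often reported as more stable on Atari)
    # 3. More parallel environments (16)
    # 4. Longer training (millions of frames)

    print("Pong A2C training setup:")
    print(f"  parallel environments: {N_ENVS}")
    print(f"  update interval: {N_STEPS} steps")
    print(f"  batch size: N_ENVS * N_STEPS = {N_ENVS * N_STEPS} transitions")
    print("  expected training time: ~10M frames (a few hours on a GPU)")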

# train_a2c_pong()

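Hyperparameter Tuning

A2C performance is sensitive to hyperparameters. This section looks at the role of each one and how to tune it.

Learning Rate

def experiment_learning_rate():
    """Experiment with the effect of the learning rate."""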
    learning_rates = [1e-2, 7e-4, 1e-4, 1e-5]

    for lr in learning_rates:
        print(f"\nlearning rate: {lr}")

        env = gym.make("CartPole-v1")
        obs_size = env.observation_space.shape[0]
        n_actions = env.action_space.n

        model = A2CNetwork(obs_size, n_actions)
        optimizer = optim.Adam(model.parameters(), lr=lr)

        episode_rewards = []
        obs, _ = env.reset()

        for episode in range(300):
            log_probs = []
            values = []
            rewards = []
            entropies = []

            while True:
                obs_t = torch.tensor([obs], dtype=torch.float32)
                action, log_prob, value, entropy = model.get_action_and_value(obs_t)

                next_obs, reward, terminated, truncated, _ = env.step(action.item())
                log_probs.append(log_prob)
                values.append(value)
                rewards.append(reward)
                entropies.append(entropy)

                obs = next_obs
                if terminated or truncated:
                    break

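            # Compute discounted returns
            returns = []
            G = 0
            for r in reversed(rewards):
                G = r + 0.99 * G
                returns.insert(0, G)

            returns = torch.tensor(returns, dtype=torch.float32)
            # Each per-step tensor has shape (1,); cat yields shape (T,) so that
            # it lines up with returns instead of broadcasting to (T, T)
            values_t = torch.cat(values)
            log_probs_t = torch.cat(log_probs)
            advantages = returns - values_t.detach()

            # Update
            policy_loss = -(log_probs_t * advantages).mean()
            value_loss = F.mse_loss(values_t, returns)
            entropy_bonus = torch.cat(entropies).mean()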

            loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            optimizer.step()

            episode_rewards.append(sum(rewards))
            obs, _ = env.reset()

        env.close()
        final_avg = np.mean(episode_rewards[-50:])
        print(f"  mean of last 50 episodes: {final_avg:.1f}")

# experiment_learning_rate()

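Guidelines for tuning the learning rate:

Too large (1e-2): training is unstable and may diverge
Appropriate (7e-4 ~ 1e-3): fast and stable training
Too small (1e-5): training is very slow and takes a long time to converge

Entropy Coefficient (Entropy Beta)

The entropy coefficient controls the balance between exploration and exploitation.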

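def experiment_entropy_coef():
    """Experiment with the effect of the entropy coefficient."""
    entropy_coefs = [0.0, 0.001, 0.01, 0.1, 0.5]

    for entropy_coef in entropy_coefs:
        print(f"\nentropy coefficient: {entropy_coef}")
        # Training code (same structure as above)
        # entropy_coef = 0: no exploration bonus; converges fast but may get stuck in a local optimum
        # entropy_coef = 0.01: a reasonable balance of exploration and exploitation
        # entropy_coef = 0.5: excessive exploration, very slow learning
        pass

# Recommended range: 0.001 ~ 0.05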

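Number of Parallel Environments

More parallel environments make the data in each update more diverse.

def experiment_n_envs():
    """Effect of the number of parallel environments."""
    configs = {
        1: "single environment: high variance, slow learning",
        4: "4 environments: moderate diversity",
        8: "8 environments: good balance (recommended for CartPole)",
        16: "16 environments: suitable for Atari games",
        32: "32 environments: more stable but higher memory use",
    }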

    for n_envs, description in configs.items():
        print(f"N_ENVS={n_envs}: {description}")

# experiment_n_envs()

Batch Size and Update Interval (N-steps)

def experiment_n_steps():
    """Effect of N-step."""
    n_steps_configs = {
        1: "1-step: frequent updates, high bias, low variance",
        5: "5-step: common choice (the A3C paper uses t_max = 5)",
        20: "20-step: closer to Monte Carlo, low bias, high variance",
        128: "128-step: often used with PPO",
    }

    for n_steps, description in n_steps_configs.items():
        print(f"N_STEPS={n_steps}: {description}")
        # Actual batch size = N_ENVS * N_STEPS
        # N_ENVS=8, N_STEPS=5 -> batch of 40 transitions

# experiment_n_steps()

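Hyperparameter Summary

Parameter	CartPole Recommended	Pong Recommended	Role
Learning rate	7e-4	7e-4	Parameter update size
Gamma	0.99	0.99	Future reward discount
Entropy coefficient	0.01	0.01	Exploration intensity
Value loss coefficient	0.5	0.5	Critic learning strength
Gradient clipping (max norm)	0.5	0.5	Training stability
N-steps	5	5	Update interval
Parallel environments	8	16	Data diversity
GAE lambda	0.95	0.95	Bias-variance tradeoff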

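A2C vs A3C

A3C (Asynchronous Advantage Actor-Critic) is the asynchronous predecessor of A2C.

# Differences between A3C and A2C

class A3CvsA2C:
    """Comparison of A3C and A2C."""

    def a2c_description(self):
        """
        A2C (Synchronous):
        - All workers collect N steps of data at the same time
        - The data is gathered and used for a single update
        - Efficient use of the GPU
        - Simple to implement
        """
        pass

    def a3c_description(self):
        """
        A3C (Asynchronous):
        - Each worker collects data and computes gradients independently
        - The global model is updated asynchronously
        - Well suited to multi-core CPUs
        - Communication overhead between workers
        """
        pass

    def comparison(self):
        results = {
            "performance": "A2C matches or beats A3C",
            "implementation": "A2C is much simpler",
            "GPU utilization": "A2C is more efficient (batched processing)",
            "practical use": "A2C is used more often",
        }
        return results

In practice, A2C is used more often than A3C: it allows batched processing on a GPU, is simpler to implement, and performs as well or better.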

Debugging Tips

Common problems when training A2C and how to address them.

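def debug_checklist():
    """A2C debugging checklist."""
    checks = {
        "Reward not changing": [
            "Check whether the learning rate is too small",
            "Check whether entropy has collapsed to 0 (premature convergence)",
            "Check whether gradients are 0 (vanishing gradient)",
        ],
        "Reward dropping sharply": [
            "Check whether the learning rate is too large",
            "Check that gradient clipping is applied",
            "Check whether the value loss is exploding",
        ],
        "Entropy converging to 0": [
            "Increase the entropy coefficient",
            "Decrease the learning rate",
            "Verify that the action space is correct",
        ],
        "Value loss not decreasing": [
            "Increase the value loss coefficient",
            "Verify that the return computation is correct",
            "Check whether gamma is appropriate",
        ],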
    }

    for problem, solutions in checks.items():
        print(f"\nProblem: {problem}")
        for i, solution in enumerate(solutions, 1):
            print(f"  {i}. {solution}")

# debug_checklist()

Metrics to Monitor

def log_training_metrics(writer, step, policy_loss, value_loss, entropy,
                          mean_reward, advantages):
    """Key metrics to monitor during training."""
    # 1. Reward (most important)
    # writer.add_scalar("reward/mean", mean_reward, step)

    # 2. Policy loss (noisy by nature; watch for sudden spikes rather than a steady decrease)
    # writer.add_scalar("loss/policy", policy_loss, step)

    # 3. Value loss (should trend down; may rise temporarily as returns grow)
    # writer.add_scalar("loss/value", value_loss, step)

    # 4. Entropy (should decrease gradually but must not reach 0)
    # writer.add_scalar("entropy", entropy, step)

    # 5. Advantage statistics (mean near 0, reasonable spread)
    # writer.add_scalar("advantage/mean", advantages.mean().item(), step)
    # writer.add_scalar("advantage/std", advantages.std().item(), step)

    # 6. Gradient norm (should not explode)
    pass

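Complete Series Summary

The core topics covered in this deep reinforcement learning series:

Part	Topic	Key Concepts
01	What is RL	MDP, agent-environment interaction, reward
02	OpenAI Gym	Environment API, wrappers, vector environments
03	PyTorch basics	Tensors, autograd, neural networks
04	Cross-Entropy	Elite episode selection, CartPole
05	Bellman equation	Value functions, value iteration, Q-learning
06	DQN	Experience replay, target network
07	DQN extensions	Double, Dueling, Rainbow
08	Stock trading	Financial environment design, reward function
09	Policy Gradient	REINFORCE, variance reduction
10	Actor-Critic	A2C, hyperparameter tuning

Next Steps

Advanced topics not covered in this series:

PPO (Proximal Policy Optimization): currently one of the most widely used policy-based algorithms
SAC (Soft Actor-Critic): off-policy actor-critic with entropy regularization
Model-based RL: learning a model of the environment to improve sample efficiency
Multi-agent RL: environments where multiple agents cooperate or compete
RLHF: reinforcement learning from human feedback (used in LLM training)

Summary

Actor-Critic: learns the policy (Actor) and the value function (Critic) together to reduce variance
A2C: improves data collection efficiency with synchronized parallel environments
GAE: advantage estimation whose bias-variance tradeoff is controlled by lambda
Hyperparameters: learning rate, entropy coefficient, number of environments, and N-step are the key ones
Debugging: continuously monitor reward, losses, entropy, and gradient norm

Actor-Critic methods are a foundation of modern reinforcement learning. Recent algorithms such as PPO and SAC are also built on the Actor-Critic structure.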

[Deep RL] 10. Actor-Critic Methods: A2C and Hyperparameter Tuning

Review of REINFORCE Variance Problem

In the previous article, we examined the REINFORCE algorithm. The core problem was the high variance of gradient estimation.

REINFORCE can only update after the entire episode ends (Monte Carlo), and the gradient computed from a single episode has very large noise.

While baselines can reduce variance, a more fundamental solution is needed.

Actor-Critic Architecture

Actor-Critic combines two components:

Actor (policy): Selects actions from states. pi(a|s; theta)
Critic (value function): Evaluates the value of the current state. V(s; phi)

The key idea is to use TD (Temporal Difference) estimates instead of Monte Carlo returns to reduce variance.

REINFORCE vs Actor-Critic

REINFORCE:     grad = grad_theta log pi(a|s) * G_t         (must wait until the episode ends)
Actor-Critic:  grad = grad_theta log pi(a|s) * (r + gamma * V(s') - V(s))  (needs only one step)

r + gamma * V(s') - V(s) is called the TD error or advantage estimate. V(s) serves as the baseline while simultaneously providing an estimate of the return.
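To make the TD error concrete, here is a minimal sketch that evaluates it for a single transition. The reward and value numbers are made up for illustration, and `td_error` is a hypothetical helper name rather than part of the article's code.

```python
def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error r + gamma * V(s') - V(s); no bootstrap after a terminal step."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

# Hypothetical critic outputs for one transition (illustrative numbers only)
delta = td_error(reward=1.0, value_s=10.0, value_next=10.5)
delta_terminal = td_error(reward=1.0, value_s=10.0, value_next=10.5, done=True)
print(delta, delta_terminal)
```

A positive value means the action turned out better than the critic expected, so its log-probability is pushed up; a negative value pushes it down.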

A2C (Advantage Actor-Critic) Implementation

A2C is the synchronous variant of A3C (Asynchronous Advantage Actor-Critic). It steps several environments in parallel and updates the shared model once on the combined batch of experience.

Network Architecture

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class A2CNetwork(nn.Module):
    """Shared network for A2C (actor + critic)."""
    def __init__(self, obs_size, n_actions, hidden_size=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden_size, n_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        features = self.shared(x)
        logits = self.actor(features)
        value = self.critic(features)
        return logits, value

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, value.squeeze(-1), entropy
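The method above applies a softmax and passes probabilities to `Categorical`. As a small aside, `torch.distributions.Categorical` also accepts `logits=` directly, which skips the separate softmax. The sketch below, on made-up random logits, checks that both constructions give matching log-probabilities and entropies; it is an illustration, not a change to the article's network.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)  # batch of 4 states, 3 actions (illustrative shapes)

# Construction used in the article: softmax first, then probabilities
dist_from_probs = torch.distributions.Categorical(probs=F.softmax(logits, dim=-1))
# Alternative: hand the raw logits to the distribution
dist_from_logits = torch.distributions.Categorical(logits=logits)

actions = dist_from_logits.sample()
same_log_prob = torch.allclose(
    dist_from_probs.log_prob(actions), dist_from_logits.log_prob(actions), atol=1e-5
)
same_entropy = torch.allclose(
    dist_from_probs.entropy(), dist_from_logits.entropy(), atol=1e-5
)
print(same_log_prob, same_entropy)
```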

CNN A2C for Atari

class A2CCNN(nn.Module):
    """CNN-based A2C network for Atari."""
    def __init__(self, input_channels, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        conv_out_size = self._get_conv_out(input_channels)
        self.fc = nn.Sequential(nn.Linear(conv_out_size, 512), nn.ReLU())
        self.actor = nn.Linear(512, n_actions)
        self.critic = nn.Linear(512, 1)

    def _get_conv_out(self, channels):
        o = self.conv(torch.zeros(1, channels, 84, 84))
        return int(np.prod(o.size()))

    def forward(self, x):
        x = x.float() / 255.0
        conv_out = self.conv(x).view(x.size(0), -1)
        features = self.fc(conv_out)
        return self.actor(features), self.critic(features)

    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action, dist.log_prob(action), value.squeeze(-1), dist.entropy()
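`_get_conv_out` finds the flattened feature size by pushing a dummy 84x84 input through the convolutions. The same number can be worked out by hand; the sketch below uses the standard output-size formula for a convolution without padding or dilation. `conv_out` is a hypothetical helper introduced only for this check.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output width/height of a square convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1

size = 84  # input height and width assumed by _get_conv_out
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # the three Conv2d layers
    size = conv_out(size, kernel, stride)

flat_features = 64 * size * size  # 64 output channels in the last conv layer
print(size, flat_features)
```

The spatial size shrinks 84 -> 20 -> 9 -> 7, so the fully connected layer receives 64 * 7 * 7 = 3136 features.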

N-step Advantage Computation

A2C uses rewards from multiple steps rather than a single step to compute advantages, balancing bias and variance.

def compute_advantages(rewards, values, dones, next_value, gamma=0.99):
    """Compute N-step returns and advantages."""
    n_steps = len(rewards)
    returns = []
    advantages = []
    R = next_value
    for t in reversed(range(n_steps)):
        if dones[t]:
            R = 0.0
        R = rewards[t] + gamma * R
        returns.insert(0, R)
        advantages.insert(0, R - values[t])
    returns = torch.tensor(returns, dtype=torch.float32)
    advantages = torch.tensor(advantages, dtype=torch.float32)
    return returns, advantages
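A short worked example helps check the backward recursion and the episode-boundary reset. The sketch below reimplements only the return part in plain Python with made-up rewards and `gamma=0.5` so the arithmetic is easy to follow by hand; `n_step_returns` is a hypothetical name used for this illustration.

```python
def n_step_returns(rewards, dones, next_value, gamma=0.99):
    """Bootstrapped returns computed backwards; the bootstrap is dropped at episode ends."""
    returns = []
    R = next_value
    for r, done in zip(reversed(rewards), reversed(dones)):
        if done:
            R = 0.0  # nothing to bootstrap from after a finished episode
        R = r + gamma * R
        returns.append(R)
    returns.reverse()  # append + reverse avoids repeated insert(0, ...)
    return returns

rewards = [1.0, 1.0, 1.0]
dones = [False, True, False]  # the episode ends at the second step
returns = n_step_returns(rewards, dones, next_value=2.0, gamma=0.5)
print(returns)
```

Step 2 bootstraps from `next_value` (1 + 0.5 * 2 = 2.0), step 1 ends an episode (1.0), and step 0 only sees the reward up to that boundary (1 + 0.5 * 1 = 1.5).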

GAE (Generalized Advantage Estimation)

GAE estimates the advantage as an exponentially weighted sum of TD errors, which is equivalent to an exponentially weighted average of n-step advantage estimates.

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Compute returns and advantages with GAE (Generalized Advantage Estimation)."""
    n_steps = len(rewards)
    advantages = np.zeros(n_steps)
    last_gae = 0.0

    for t in reversed(range(n_steps)):
        if t == n_steps - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]

        if dones[t]:
            next_val = 0.0
            last_gae = 0.0

        delta = rewards[t] + gamma * next_val - values[t]
        advantages[t] = last_gae = delta + gamma * gae_lambda * last_gae

    returns = advantages + np.array(values)
    return torch.tensor(returns, dtype=torch.float32), \
           torch.tensor(advantages, dtype=torch.float32)

GAE's lambda parameter controls the bias-variance tradeoff:

lambda = 0: 1-step TD (low variance, high bias)
lambda = 1: Monte Carlo return (high variance, low bias)
lambda = 0.95: Commonly used value in practice
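The two extremes can be verified numerically. The following sketch is a simplified GAE in plain Python (no episode boundaries, made-up rewards and critic values) and compares it with the TD errors and with the bootstrapped return minus the value estimate. Note the assumption: within a finite rollout, lambda = 1 matches the full Monte Carlo return only when the rollout runs to the end of the episode; otherwise it is the n-step return bootstrapped from `next_value`.

```python
def gae(rewards, values, next_value, gamma, lam):
    """GAE over one rollout without episode boundaries (kept simple for illustration)."""
    n = len(rewards)
    advantages = [0.0] * n
    last = 0.0
    for t in reversed(range(n)):
        next_val = next_value if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_val - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]  # made-up critic estimates
next_value, gamma = 0.2, 0.9

adv_td = gae(rewards, values, next_value, gamma, lam=0.0)
adv_mc = gae(rewards, values, next_value, gamma, lam=1.0)

# Reference values computed directly from the definitions
td_errors = [
    rewards[t] + gamma * (values[t + 1] if t < 2 else next_value) - values[t]
    for t in range(3)
]
R, boot_returns = next_value, []
for r in reversed(rewards):
    R = r + gamma * R
    boot_returns.insert(0, R)
return_minus_value = [boot_returns[t] - values[t] for t in range(3)]
print(adv_td, adv_mc)
```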

A2C Training Loop

CartPole A2C

import gymnasium as gym

def train_a2c_cartpole():
    """Train CartPole with A2C."""
    N_ENVS = 8
    N_STEPS = 5
    GAMMA = 0.99
    LEARNING_RATE = 7e-4
    VALUE_LOSS_COEF = 0.5
    ENTROPY_COEF = 0.01
    MAX_GRAD_NORM = 0.5
    TOTAL_STEPS = 200000

    envs = gym.make_vec("CartPole-v1", num_envs=N_ENVS)
    obs_size = envs.single_observation_space.shape[0]
    n_actions = envs.single_action_space.n
    device = torch.device("cpu")
    model = A2CNetwork(obs_size, n_actions).to(device)
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

    obs, _ = envs.reset()
    episode_rewards = np.zeros(N_ENVS)
    completed_rewards = []
    global_step = 0

    while global_step < TOTAL_STEPS:
        batch_obs = []
        batch_actions = []
        batch_log_probs = []
        batch_values = []
        batch_rewards = []
        batch_dones = []
        batch_entropies = []

        for step in range(N_STEPS):
            obs_t = torch.tensor(obs, dtype=torch.float32).to(device)
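            # Track gradients here: log_probs, values and entropies feed the loss below.
            # Under torch.no_grad() the later backward() call would fail.
            actions, log_probs, values, entropies = model.get_action_and_value(obs_t)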

            next_obs, rewards, terminateds, truncateds, infos = envs.step(actions.numpy())
            dones = np.logical_or(terminateds, truncateds)

            batch_obs.append(obs_t)
            batch_actions.append(actions)
            batch_log_probs.append(log_probs)
            batch_values.append(values)
            batch_rewards.append(rewards)
            batch_dones.append(dones)
            batch_entropies.append(entropies)

            episode_rewards += rewards
            for i, done in enumerate(dones):
                if done:
                    completed_rewards.append(episode_rewards[i])
                    episode_rewards[i] = 0
            obs = next_obs
            global_step += N_ENVS

        with torch.no_grad():
            _, next_value = model(torch.tensor(obs, dtype=torch.float32).to(device))
            next_value = next_value.squeeze(-1)

        values_list = [v.detach().numpy() for v in batch_values]
        returns_list = []
        advantages_list = []

        for env_idx in range(N_ENVS):
            env_rewards = [batch_rewards[t][env_idx] for t in range(N_STEPS)]
            env_values = [values_list[t][env_idx] for t in range(N_STEPS)]
            env_dones = [batch_dones[t][env_idx] for t in range(N_STEPS)]
            env_next_val = next_value[env_idx].item()
            rets, advs = compute_gae(env_rewards, env_values, env_dones, env_next_val, GAMMA)
            returns_list.append(rets)
            advantages_list.append(advs)

        all_log_probs = torch.stack(batch_log_probs).view(-1)
        all_values = torch.stack(batch_values).view(-1)
        all_entropies = torch.stack(batch_entropies).view(-1)
        all_returns = torch.stack(returns_list, dim=1).view(-1)
        all_advantages = torch.stack(advantages_list, dim=1).view(-1)
        all_advantages = (all_advantages - all_advantages.mean()) / (all_advantages.std() + 1e-8)

        policy_loss = -(all_log_probs * all_advantages.detach()).mean()
        value_loss = F.mse_loss(all_values, all_returns.detach())
        entropy_loss = all_entropies.mean()
        total_loss = policy_loss + VALUE_LOSS_COEF * value_loss - ENTROPY_COEF * entropy_loss

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()

        if len(completed_rewards) >= 10 and global_step % 1000 < N_ENVS * N_STEPS:
            mean_reward = np.mean(completed_rewards[-10:])
            print(f"step {global_step}: mean reward={mean_reward:.1f}, policy loss={policy_loss.item():.4f}")
            if mean_reward >= 475:
                print(f"Solved at step {global_step}!")
                break

    envs.close()
    return model, completed_rewards

# model, rewards = train_a2c_cartpole()
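The three loss terms in the loop can be checked in isolation on toy data. The sketch below uses two small `nn.Linear` layers as stand-ins for the actor and critic heads and random placeholder targets instead of real returns, so it only demonstrates how the terms are combined and that gradients reach both heads. One point worth keeping in mind: the log-probabilities and values must come from a forward pass that tracks gradients, because tensors produced under `torch.no_grad()` cannot be backpropagated through.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
obs = torch.randn(6, 4)   # 6 transitions, 4 observation features (made up)
actor = nn.Linear(4, 2)   # stand-in for the actor head
critic = nn.Linear(4, 1)  # stand-in for the critic head

dist = torch.distributions.Categorical(logits=actor(obs))
actions = dist.sample()
values = critic(obs).squeeze(-1)
returns = torch.randn(6)  # placeholder targets, not real returns
advantages = (returns - values).detach()  # no gradient through the advantage

policy_loss = -(dist.log_prob(actions) * advantages).mean()
value_loss = F.mse_loss(values, returns)
entropy = dist.entropy().mean()
total_loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
total_loss.backward()

actor_has_grad = actor.weight.grad is not None and bool(actor.weight.grad.abs().sum() > 0)
critic_has_grad = critic.weight.grad is not None and bool(critic.weight.grad.abs().sum() > 0)
print(actor_has_grad, critic_has_grad)
```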

Hyperparameter Tuning

A2C performance is sensitive to hyperparameters. Here is a guide for each parameter:

Learning Rate

Too large (1e-2): Training becomes unstable and may diverge
Appropriate (7e-4 ~ 1e-3): Fast and stable training
Too small (1e-5): Training is very slow, takes long to converge

Entropy Coefficient

Entropy coefficient of 0: No exploration, may get stuck in local optima
Entropy coefficient of 0.01: Good balance of exploration and exploitation
Entropy coefficient of 0.5: Excessive exploration, very slow learning
Recommended range: 0.001 ~ 0.05
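For intuition about what the entropy term measures, the sketch below computes the Shannon entropy of a few hand-picked action distributions for a two-action task such as CartPole. The probabilities are illustrative and `entropy` is a hypothetical helper, not the article's code.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = entropy([0.5, 0.5])        # maximum for two actions: ln(2)
peaked = entropy([0.99, 0.01])       # nearly deterministic policy
deterministic = entropy([1.0, 0.0])  # no exploration left
print(uniform, peaked, deterministic)
```

With two actions the entropy can never exceed ln(2), about 0.693, so a logged entropy that falls toward 0 early in training suggests the policy has stopped exploring.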

Number of Parallel Environments

More parallel environments lead to more diverse data per update:

1 environment: High variance, slow learning
8 environments: Good balance (recommended for CartPole)
16 environments: Suitable for Atari games
32 environments: More stable but increased memory usage
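The number of environments also fixes how many transitions each update sees and how many updates fit in a step budget. The sketch below does that arithmetic for the CartPole settings used earlier (200000 total steps, 5-step rollouts); `rollout_budget` is a hypothetical helper written for this illustration.

```python
def rollout_budget(total_steps, n_envs, n_steps):
    """Transitions per update and number of updates for a given environment-step budget."""
    batch = n_envs * n_steps
    updates = -(-total_steps // batch)  # ceiling division
    return batch, updates

cartpole = rollout_budget(total_steps=200_000, n_envs=8, n_steps=5)
more_envs = rollout_budget(total_steps=200_000, n_envs=16, n_steps=5)
print(cartpole, more_envs)
```

Doubling the environments doubles the batch but halves the number of gradient updates for the same step budget, which is one reason the learning rate may need retuning when the environment count changes.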

Hyperparameter Summary

Parameter	CartPole Recommended	Pong Recommended	Role
Learning rate	7e-4	7e-4	Parameter update size
Gamma	0.99	0.99	Future reward discount
Entropy coefficient	0.01	0.01	Exploration intensity
Value loss coefficient	0.5	0.5	Critic learning strength
Gradient clipping	0.5	0.5	Training stability
N-steps	5	5	Update interval
Parallel environments	8	16	Data diversity
GAE lambda	0.95	0.95	Bias-variance tradeoff

A2C vs A3C

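A3C (Asynchronous Advantage Actor-Critic) is the asynchronous predecessor of A2C: each worker collects data and computes gradients independently, then updates a shared model without waiting for the others.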

In practice, A2C is used more than A3C. It allows batch processing with GPUs, implementation is simpler, and performance is equal or better.

Debugging Tips

Common problems and solutions when training A2C:

Reward not changing: Check if learning rate is too small, check if entropy converges to 0 (early convergence), check for vanishing gradients
Reward dropping sharply: Check if learning rate is too large, ensure gradient clipping is applied, check if value loss is exploding
Entropy converging to 0: Increase entropy coefficient, decrease learning rate, verify action space is correct
Value loss not decreasing: Increase value loss coefficient, verify return computation is correct, check if gamma is appropriate

Metrics to Monitor

Reward (most important)
Policy loss (noisy by nature; watch for sudden spikes rather than a steady decrease)
Value loss (should trend down; may rise temporarily as returns grow)
Entropy (should decrease gradually but never reach 0)
Advantage statistics (mean near 0, appropriate variance)
Gradient norm (should not explode)
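The gradient norm in the last item is the L2 norm over all parameter gradients taken together, which is the quantity that max-norm clipping acts on. The sketch below reproduces that rule in plain Python on made-up gradients. It mirrors, as an assumption about the behavior rather than a copy of the implementation, what `torch.nn.utils.clip_grad_norm_` does; that function also returns the norm measured before clipping, which makes it convenient to log.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient entries, treating every tensor as one long vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor when the global norm exceeds max_norm."""
    total = global_grad_norm(grads)
    scale = min(1.0, max_norm / (total + eps))
    return [[g * scale for g in tensor] for tensor in grads], total

grads = [[3.0, 0.0], [0.0, 4.0]]  # made-up gradients for two parameter tensors
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.5)
norm_after = global_grad_norm(clipped)
print(norm_before, norm_after)
```

If the logged pre-clipping norm sits far above the clipping threshold for long stretches, every update is being rescaled, which is worth knowing when tuning the learning rate.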

Complete Series Summary

Summary of the core topics covered in this deep reinforcement learning series:

Part	Topic	Key Concepts
01	What is RL	MDP, agent-environment interaction, reward
02	OpenAI Gym	Environment API, wrappers, vector environments
03	PyTorch basics	Tensors, autograd, neural networks
04	Cross-Entropy	Elite episode selection, CartPole
05	Bellman equation	Value functions, value iteration, Q-learning
06	DQN	Experience replay, target network
07	DQN extensions	Double, Dueling, Rainbow
08	Stock trading	Financial environment design, reward function
09	Policy Gradient	REINFORCE, variance reduction
10	Actor-Critic	A2C, hyperparameter tuning

Next Steps

Advanced topics not covered in this series:

PPO (Proximal Policy Optimization): Currently the most widely used policy-based algorithm
SAC (Soft Actor-Critic): Off-policy actor-critic with entropy regularization
Model-based RL: Learning environment models for improved sample efficiency
Multi-agent RL: Environments where multiple agents cooperate/compete
RLHF: Reinforcement learning from human feedback (used in LLM training)

Summary

Actor-Critic: Simultaneously learns policy (Actor) and value (Critic) to reduce variance
A2C: Improves data collection efficiency with synchronized parallel environments
GAE: Controls bias-variance tradeoff with lambda for advantage estimation
Hyperparameters: Learning rate, entropy coefficient, number of environments, and N-step are key
Debugging: Continuously monitor reward, loss, entropy, and gradient norm

Actor-Critic methods are the foundation of modern reinforcement learning. State-of-the-art algorithms like PPO and SAC are all based on the Actor-Critic architecture.