[Deep RL] 09. Policy Gradient: Policy-Based Reinforcement Learning

Value-Based vs Policy-Based

The DQN family of methods covered so far takes a value-based approach: learn a Q function, then derive the policy indirectly by selecting the action with the highest Q value.

Policy-based methods directly parameterize and optimize the policy. The policy network pi(a|s; theta) outputs a probability distribution over actions for each state.

Limitations of Value-Based Methods

  1. Limited to discrete actions: DQN is difficult to directly apply to continuous action spaces
  2. Deterministic policy: Taking the argmax of Q values makes it hard to naturally represent stochastic policies
  3. Convergence instability: Small changes in the value function can cause drastic policy changes

Advantages of Policy-Based Methods

  1. Continuous action spaces: Naturally handles continuous actions through Gaussian policies, etc.
  2. Stochastic policies: Exploration is built into the policy, eliminating the need for separate epsilon schedules
  3. Convergence guarantee: Convergence to a local optimum is theoretically guaranteed (with appropriate learning rate)
  4. Partially observable environments: Stochastic policies are more natural for partial observability

Policy Representation

Discrete Actions: Softmax Policy

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class DiscretePolicyNetwork(nn.Module):
    """이산 행동 공간을 위한 정책 네트워크"""
    def __init__(self, obs_size, n_actions, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, x):
        return self.net(x)

    def get_action_prob(self, state):
        logits = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        return probs

    def select_action(self, state):
        probs = self.get_action_prob(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob

Continuous Actions: Gaussian Policy

class ContinuousPolicyNetwork(nn.Module):
    """연속 행동 공간을 위한 가우시안 정책 네트워크"""
    def __init__(self, obs_size, action_size, hidden_size=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden_size, action_size)
        self.log_std_head = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        features = self.shared(x)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features).clamp(-20, 2)
        std = log_std.exp()
        return mean, std

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action.detach().numpy(), log_prob

Policy Gradient Derivation

Objective Function

The goal of policy-based methods is to maximize the expected cumulative reward:

J(theta) = E_pi[ sum_t gamma^t * r_t ]

We need to differentiate this objective with respect to theta to obtain the gradient.

Policy Gradient Theorem

The key result is:

grad J(theta) = E_pi[ sum_t grad log pi(a_t | s_t; theta) * G_t ]

where G_t is the discounted cumulative reward from time step t.

Intuitively, this means:

  • Actions that received high reward: Update parameters in the direction of the log pi gradient, increasing the probability of that action
  • Actions that received low reward: Update in the opposite direction, decreasing the probability

Core of the Derivation

The key trick is the "log-derivative trick":

grad pi(a|s; theta) = pi(a|s; theta) * grad log pi(a|s; theta)

This enables transformation to an expectation form, making approximation through sampling possible.
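This identity is easy to check numerically. The sketch below (illustrative, not from the original article) uses PyTorch autograd on a toy softmax policy with three actions to verify that grad pi(a) equals pi(a) * grad log pi(a):

```python
import torch
import torch.nn.functional as F

# Toy policy parameters (logits) with gradient tracking
theta = torch.tensor([0.5, -0.2, 0.1], requires_grad=True)
a = 1  # an arbitrary chosen action index

# Left-hand side: differentiate pi(a) directly
pi = F.softmax(theta, dim=-1)
grad_pi = torch.autograd.grad(pi[a], theta, retain_graph=True)[0]

# Right-hand side: pi(a) * grad log pi(a)
log_pi = F.log_softmax(theta, dim=-1)
grad_log_pi = torch.autograd.grad(log_pi[a], theta)[0]

print(torch.allclose(grad_pi, pi[a].detach() * grad_log_pi))  # True
```

Because the right-hand side is an expectation under pi, it can be estimated by sampling actions from the current policy, which is exactly what REINFORCE does.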


REINFORCE Algorithm

REINFORCE is the most basic Policy Gradient algorithm. It collects complete episodes in a Monte Carlo fashion before updating.

Algorithm Pseudocode

1. Initialize the policy network pi(a|s; theta)
2. Repeat:
   a. Collect one episode with the current policy
      - Record (s_t, a_t, r_t, log pi(a_t|s_t)) at each step
   b. Compute the discounted return G_t for each time step
   c. Compute the policy gradient loss:
      loss = -sum_t log pi(a_t|s_t) * G_t
   d. Update theta via backpropagation

CartPole REINFORCE Implementation

import gymnasium as gym
import numpy as np
import torch
import torch.optim as optim

def train_reinforce_cartpole():
    """REINFORCE로 CartPole 학습"""
    env = gym.make("CartPole-v1")
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    policy = DiscretePolicyNetwork(obs_size, n_actions, hidden_size=128)
    optimizer = optim.Adam(policy.parameters(), lr=0.001)
    gamma = 0.99
    rewards_history = []

    for episode in range(1000):
        log_probs = []
        rewards = []
        obs, _ = env.reset()

        while True:
            obs_tensor = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            action, log_prob = policy.select_action(obs_tensor)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            obs = next_obs
            if terminated or truncated:
                break

        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns, dtype=torch.float32)
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        log_probs_tensor = torch.stack(log_probs)
        policy_loss = -(log_probs_tensor * returns).sum()

        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        total_reward = sum(rewards)
        rewards_history.append(total_reward)

        if episode % 50 == 0:
            mean_reward = np.mean(rewards_history[-50:])
            print(f"에피소드 {episode}: 보상={total_reward:.0f}, 평균={mean_reward:.1f}")
            if mean_reward >= 475:
                print(f"에피소드 {episode}에서 해결!")
                break

    env.close()
    return policy, rewards_history

# policy, history = train_reinforce_cartpole()

Variance Reduction with Baselines

Problem: High Variance

The gradient estimate in basic REINFORCE has high variance: a single episode yields a very noisy gradient, which makes training unstable.

Solution: Baseline Function

Subtracting a baseline b from the return does not change the expected gradient but reduces its variance (the baseline may depend on the state, but not on the action):

grad J(theta) = E_pi[ sum_t grad log pi(a_t|s_t; theta) * (G_t - b) ]

The most common baseline is the state value function V(s).

class PolicyWithBaseline(nn.Module):
    """베이스라인이 있는 정책 네트워크"""
    def __init__(self, obs_size, n_actions, hidden_size=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_size, hidden_size), nn.ReLU())
        self.policy_head = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(), nn.Linear(hidden_size, n_actions))
        self.value_head = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(), nn.Linear(hidden_size, 1))

    def forward(self, x):
        features = self.shared(x)
        logits = self.policy_head(features)
        value = self.value_head(features)
        return logits, value

    def select_action(self, state):
        logits, value = self.forward(state)
        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob, value
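With this network, one update step replaces G_t by the advantage G_t - V(s_t) and adds a regression loss for the value head. The following is a minimal sketch of that loss computation (the function name and argument layout are illustrative; `log_probs`, `values`, and `returns` are assumed to be collected per episode as in the REINFORCE loop above):

```python
import torch
import torch.nn.functional as F

def compute_losses(log_probs, values, returns):
    """REINFORCE-with-baseline losses for one finished episode.

    log_probs: list of per-step log pi(a_t|s_t) tensors
    values:    list of per-step V(s_t) tensors from the value head
    returns:   1-D tensor of discounted returns G_t
    """
    log_probs = torch.stack(log_probs)
    values = torch.stack(values).squeeze(-1)

    # Advantage G_t - V(s_t); detach so the policy loss does not
    # backpropagate through the value head
    advantages = returns - values.detach()

    policy_loss = -(log_probs * advantages).sum()
    value_loss = F.mse_loss(values, returns)  # train V(s) toward G_t
    return policy_loss, value_loss
```

The detach on `values` is the important detail: the policy gradient treats the baseline as a constant, while the value head is trained by its own regression loss.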

Exploration and Entropy Bonus

Early Convergence Problem

Policy-based methods can converge to a suboptimal policy before sufficient exploration because they quickly increase the probability of actions that received high rewards.

Entropy Bonus

Adding policy entropy to the loss function encourages exploration:

total_loss = policy_loss + value_loss_coef * value_loss - entropy_coef * entropy

High entropy means action probabilities are uniform, so maximizing entropy promotes exploration.

def compute_entropy_loss(logits):
    """정책 엔트로피 계산"""
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)
    return entropy.mean()
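To see why this term encourages exploration, compare a uniform policy with a near-deterministic one. The small check below (illustrative; it re-derives the same entropy formula so it runs standalone) shows that equal logits attain the maximum entropy ln(n), while peaked logits drive entropy toward zero:

```python
import math
import torch
import torch.nn.functional as F

def policy_entropy(logits):
    # Same formula as compute_entropy_loss, for a single logit vector
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(probs * log_probs).sum(dim=-1)

uniform = torch.zeros(4)                       # equal logits -> uniform policy
peaked = torch.tensor([10.0, 0.0, 0.0, 0.0])   # near-deterministic policy

print(policy_entropy(uniform).item())  # ln(4) ~ 1.386, the maximum for 4 actions
print(policy_entropy(peaked).item())   # close to 0
```

Since the entropy term enters the total loss with a negative sign, gradient descent pushes the policy away from the peaked case until the reward signal justifies committing to an action.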

Variance Reduction Techniques Comparison

| Technique | Description | Variance Reduction |
| --- | --- | --- |
| Return normalization | Normalize G_t to mean 0, variance 1 | Medium |
| Baseline | Use advantage G_t - V(s_t) | High |
| Time-dependent baseline | Consider only future rewards | High |
| GAE | Weighted average of multi-step advantages | Very High |
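The GAE row of the table can be sketched as follows. This is an illustrative implementation, not code from the article: it computes the TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) at each step and accumulates an exponentially weighted (by gamma * lambda) sum of future TD errors:

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    rewards: list of per-step rewards r_t
    values:  list of per-step V(s_t) estimates
    Returns a tensor of advantages A_t.
    """
    advantages = []
    gae = 0.0
    next_value = 0.0  # episode has ended, so V(s_T) = 0
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v   # TD error delta_t
        gae = delta + gamma * lam * gae      # weighted sum of future deltas
        advantages.insert(0, gae)
        next_value = v
    return torch.tensor(advantages, dtype=torch.float32)
```

With lam=1 this reduces to the Monte Carlo advantage G_t - V(s_t) (low bias, high variance); with lam=0 it is the one-step TD advantage (high bias, low variance), so lambda trades bias for variance.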

Summary

  1. Policy-based methods: Directly parameterize and optimize the policy
  2. Policy Gradient Theorem: Express the gradient of expected reward as the product of log probability and return
  3. REINFORCE: The most basic Monte Carlo Policy Gradient algorithm
  4. Baseline: Use value function as baseline to greatly reduce variance
  5. Entropy bonus: Increase policy entropy to prevent early convergence
  6. Limitations: High variance, low data efficiency due to on-policy learning

REINFORCE works well on simple environments like CartPole, but training is very slow on complex environments like Pong. In the next article, we will cover Actor-Critic methods that address this problem.