Author: Youngju Kim (@fjvbn20031)
Value-Based vs Policy-Based
The DQN family of methods covered so far are value-based approaches: they learn a Q function and select actions indirectly by taking the one with the highest Q value.
Policy-based methods instead parameterize and optimize the policy directly. The policy network pi(a|s; theta) outputs a probability distribution over actions for each state.
Limitations of Value-Based Methods
- Limited to discrete actions: DQN is difficult to directly apply to continuous action spaces
- Deterministic policy: Taking the argmax of Q values makes it hard to naturally represent stochastic policies
- Convergence instability: Small changes in the value function can cause drastic policy changes
Advantages of Policy-Based Methods
- Continuous action spaces: Naturally handles continuous actions through Gaussian policies, etc.
- Stochastic policies: Exploration is built into the policy, eliminating the need for separate epsilon schedules
- Convergence guarantee: Convergence to a local optimum is theoretically guaranteed (with appropriate learning rate)
- Partially observable environments: Stochastic policies are more natural for partial observability
Policy Representation
Discrete Actions: Softmax Policy
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class DiscretePolicyNetwork(nn.Module):
    """Policy network for discrete action spaces."""

    def __init__(self, obs_size, n_actions, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, x):
        return self.net(x)

    def get_action_prob(self, state):
        logits = self.forward(state)
        return F.softmax(logits, dim=-1)

    def select_action(self, state):
        probs = self.get_action_prob(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)  # kept for the gradient computation
        return action.item(), log_prob
```
Continuous Actions: Gaussian Policy
```python
class ContinuousPolicyNetwork(nn.Module):
    """Gaussian policy network for continuous action spaces."""

    def __init__(self, obs_size, action_size, hidden_size=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden_size, action_size)
        self.log_std_head = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        features = self.shared(x)
        mean = self.mean_head(features)
        # Clamp log-std to keep the standard deviation in a numerically safe range
        log_std = self.log_std_head(features).clamp(-20, 2)
        return mean, log_std.exp()

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        # Sum per-dimension log-probs to get the joint log-prob of the action vector
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action.detach().numpy(), log_prob
```
Policy Gradient Derivation
Objective Function
The goal of policy-based methods is to maximize the expected cumulative reward:
J(theta) = E_pi[ sum_t gamma^t * r_t ]
We need to differentiate this objective with respect to theta to obtain the gradient.
Policy Gradient Theorem
The key result is:
grad J(theta) = E_pi[ sum_t grad log pi(a_t | s_t; theta) * G_t ]
where G_t is the discounted cumulative reward from time step t.
Intuitively, this means:
- Actions that received high reward: Update parameters in the direction of the log pi gradient, increasing the probability of that action
- Actions that received low reward: Update in the opposite direction, decreasing the probability
Core of the Derivation
The key trick is the "log-derivative trick":
grad pi(a|s; theta) = pi(a|s; theta) * grad log pi(a|s; theta)
This enables transformation to an expectation form, making approximation through sampling possible.
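As a sanity check on the derivation, the resulting score-function estimator can be verified numerically on a toy problem. The snippet below is an illustration (not from the original text): it estimates grad_theta E_{a~Bernoulli(theta)}[f(a)] by averaging f(a) * grad log p(a; theta) over samples, and the result matches the analytic gradient.

```python
import numpy as np

# Toy check of the log-derivative trick: for a ~ Bernoulli(theta) and
# f(a) = 3a + 1, we have E[f] = 3*theta + 1, so d/dtheta E[f] = 3.
rng = np.random.default_rng(0)
theta = 0.3
a = (rng.random(200_000) < theta).astype(np.float64)  # Bernoulli samples
f = 3 * a + 1
score = a / theta - (1 - a) / (1 - theta)  # d/dtheta log p(a; theta)
grad_estimate = np.mean(f * score)         # score-function estimator
print(grad_estimate)                       # close to the analytic gradient 3.0
```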
REINFORCE Algorithm
REINFORCE is the most basic Policy Gradient algorithm. It collects complete episodes in a Monte Carlo fashion before updating.
Algorithm Pseudocode
1. Initialize the policy network pi(a|s; theta)
2. Repeat:
   a. Collect one episode with the current policy
      - At each step, record (s_t, a_t, r_t, log pi(a_t|s_t))
   b. Compute the discounted cumulative reward G_t for each time step
   c. Compute the policy gradient loss:
      loss = -sum_t log pi(a_t|s_t) * G_t
   d. Update theta via backpropagation
CartPole REINFORCE Implementation
```python
import gymnasium as gym
import torch
import torch.optim as optim


def train_reinforce_cartpole():
    """Train CartPole with REINFORCE."""
    env = gym.make("CartPole-v1")
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    policy = DiscretePolicyNetwork(obs_size, n_actions, hidden_size=128)
    optimizer = optim.Adam(policy.parameters(), lr=0.001)
    gamma = 0.99
    rewards_history = []

    for episode in range(1000):
        log_probs = []
        rewards = []
        obs, _ = env.reset()

        # Collect one full episode with the current policy
        while True:
            obs_tensor = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            action, log_prob = policy.select_action(obs_tensor)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            obs = next_obs
            if terminated or truncated:
                break

        # Compute the discounted return G_t for each step, working backwards
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns, dtype=torch.float32)

        # Normalize returns to reduce gradient variance
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        log_probs_tensor = torch.stack(log_probs)
        policy_loss = -(log_probs_tensor * returns).sum()

        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        total_reward = sum(rewards)
        rewards_history.append(total_reward)
        if episode % 50 == 0:
            mean_reward = np.mean(rewards_history[-50:])
            print(f"Episode {episode}: reward={total_reward:.0f}, mean={mean_reward:.1f}")
            if mean_reward >= 475:
                print(f"Solved at episode {episode}!")
                break

    env.close()
    return policy, rewards_history

# policy, history = train_reinforce_cartpole()
```
Variance Reduction with Baselines
Problem: High Variance
The gradient estimate of basic REINFORCE has very high variance: a gradient computed from a single episode is noisy, which makes training unstable.
Solution: Baseline Function
Subtracting a baseline b that does not depend on the action leaves the expected gradient unchanged while reducing its variance. The expectation is unchanged because E_pi[grad log pi(a|s; theta)] = grad sum_a pi(a|s; theta) = grad 1 = 0, so the subtracted term averages to zero:
grad J(theta) = E_pi[ sum_t grad log pi(a_t|s_t; theta) * (G_t - b) ]
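A quick numeric illustration of this (not from the original text): for a toy Gaussian policy a ~ N(theta, 1) with theta = 1 and reward f(a) = a + 10, both estimators are unbiased for the true gradient 1.0, but subtracting the baseline b = E[f] shrinks the variance dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0
a = rng.normal(theta, 1.0, 200_000)   # samples from the policy
score = a - theta                      # grad_theta log N(a; theta, 1)
f = a + 10                             # reward with a large constant offset
b = theta + 10                         # baseline: E[f] under the policy

g_plain = score * f                    # plain REINFORCE estimator
g_base = score * (f - b)               # estimator with baseline subtracted

print(g_plain.mean(), g_base.mean())   # both close to the true gradient 1.0
print(g_plain.var(), g_base.var())     # variance drops by roughly two orders of magnitude
```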
The most common baseline is the state value function V(s).
```python
class PolicyWithBaseline(nn.Module):
    """Policy network with a value-function baseline: shared trunk, two heads."""

    def __init__(self, obs_size, n_actions, hidden_size=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_size, hidden_size), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )
        self.value_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, x):
        features = self.shared(x)
        return self.policy_head(features), self.value_head(features)

    def select_action(self, state):
        logits, value = self.forward(state)
        dist = torch.distributions.Categorical(F.softmax(logits, dim=-1))
        action = dist.sample()
        return action.item(), dist.log_prob(action), value
```
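Putting the pieces together, one update step might look like the sketch below (an illustration, not prescribed by the text; the helper name and the 0.5 value-loss coefficient are assumptions). The advantage G_t - V(s_t) replaces the raw return in the policy loss, and the value head is regressed toward the observed returns.

```python
import torch
import torch.nn.functional as F

def update_with_baseline(optimizer, log_probs, values, returns):
    """One REINFORCE-with-baseline update (sketch).

    log_probs: list of per-step log pi(a_t|s_t) tensors (with grad)
    values:    list of per-step V(s_t) tensors of shape (1, 1) (with grad)
    returns:   list of per-step discounted returns G_t (floats)
    """
    returns_t = torch.tensor(returns, dtype=torch.float32)
    values_t = torch.cat(values).reshape(-1)
    log_probs_t = torch.stack(log_probs).reshape(-1)

    # Advantage G_t - V(s_t); detach so the policy loss does not train the value head
    advantages = returns_t - values_t.detach()

    policy_loss = -(log_probs_t * advantages).sum()
    value_loss = F.mse_loss(values_t, returns_t)  # regress V(s_t) toward G_t
    loss = policy_loss + 0.5 * value_loss         # 0.5: illustrative coefficient

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```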
Exploration and Entropy Bonus
Early Convergence Problem
Policy-based methods can converge to a suboptimal policy before sufficient exploration because they quickly increase the probability of actions that received high rewards.
Entropy Bonus
Adding policy entropy to the loss function encourages exploration:
total_loss = policy_loss + value_loss_coef * value_loss - entropy_coef * entropy
High entropy means the action probabilities are close to uniform, so subtracting an entropy term from the loss (i.e., maximizing entropy) promotes exploration.
```python
def compute_entropy_loss(logits):
    """Mean entropy of the policy distribution."""
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)
    return entropy.mean()
```
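For reference, `torch.distributions.Categorical` also exposes the same quantity via its `entropy()` method. The quick check below (an illustration) confirms the claim above: a uniform policy over n actions attains the maximum entropy log(n), while a near-deterministic one has entropy close to zero.

```python
import math
import torch

# Uniform over 4 actions vs. a heavily peaked distribution
uniform = torch.distributions.Categorical(logits=torch.zeros(4))
peaked = torch.distributions.Categorical(logits=torch.tensor([10.0, 0.0, 0.0, 0.0]))
print(uniform.entropy().item())  # log(4) ≈ 1.386, the maximum for 4 actions
print(peaked.entropy().item())   # close to 0
```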
Variance Reduction Techniques Comparison
| Technique | Description | Variance Reduction |
|---|---|---|
| Return normalization | Normalize G_t to mean 0, variance 1 | Medium |
| Baseline | Use advantage G_t - V(s_t) | High |
| Time-dependent baseline | Consider only future rewards | High |
| GAE | Weighted average of multi-step advantages | Very High |
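GAE itself is beyond the scope of this article, but a minimal sketch conveys the idea: exponentially weight the one-step TD errors delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) with factor (gamma * lambda). The snippet below is an assumption-laden illustration (it assumes the episode terminates at the last step, so the bootstrap value after it is 0; the function name is hypothetical).

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one terminated episode (sketch)."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    next_value = 0.0  # no bootstrap after a terminal step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                      # discounted running sum
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With lam=1 this recovers the Monte Carlo advantage G_t - V(s_t); with lam=0 it reduces to the one-step TD error, trading variance for bias.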
Summary
- Policy-based methods: Directly parameterize and optimize the policy
- Policy Gradient Theorem: Express the gradient of the expected reward as an expectation of the log-probability gradient weighted by the return
- REINFORCE: The most basic Monte Carlo Policy Gradient algorithm
- Baseline: Use value function as baseline to greatly reduce variance
- Entropy bonus: Increase policy entropy to prevent early convergence
- Limitations: High variance, low data efficiency due to on-policy learning
REINFORCE works well on simple environments like CartPole, but training is very slow on complex environments like Pong. In the next article, we will cover Actor-Critic methods that address this problem.