- Author: Youngju Kim (@fjvbn20031)
Taxonomy of Reinforcement Learning Methods
Reinforcement learning algorithms can be classified by various criteria. Understanding the big picture makes it easy to identify where each algorithm fits.
Model-Based vs Model-Free
- Model-Based: Knows or learns the environment's transition probabilities and reward function. Uses them for planning.
- Model-Free: Learns directly from experience without an environment model. Most practical deep RL falls here.
Value-Based vs Policy-Based
- Value-Based: Estimates the value of states or actions, then selects actions based on these values. DQN is representative.
- Policy-Based: Directly parameterizes and optimizes the policy itself. REINFORCE, PPO, etc.
- Actor-Critic: Combines the advantages of both value-based and policy-based methods.
On-Policy vs Off-Policy
- On-Policy: Learns only from data collected by the current policy. Lower data efficiency but more stable.
- Off-Policy: Can also use data collected by past or different policies. Higher data efficiency.
| Classification | Algorithm Examples |
|---|---|
| Value-based, Off-policy | DQN, Double DQN |
| Policy-based, On-policy | REINFORCE, PPO |
| Actor-Critic, On-policy | A2C, A3C |
| Actor-Critic, Off-policy | SAC, DDPG, TD3 |
Core Idea of the Cross-Entropy Method
The Cross-Entropy method is one of the simplest algorithms in reinforcement learning. The core idea is:
- Run multiple episodes with the current policy
- Select only the top-performing episodes (elite episodes) based on reward
- Update the policy using state-action pairs from elite episodes
- Repeat
This method can be viewed as a simple evolutionary-style approach: the policy improves by imitating its own best episodes.
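The elite-selection step above can be sketched in a few lines of NumPy (the reward values here are made up for illustration):

```python
import numpy as np

# Hypothetical total rewards from 6 episodes (made-up numbers)
rewards = np.array([10., 50., 30., 80., 20., 60.])

# Keep only episodes at or above the 70th percentile of reward
threshold = np.percentile(rewards, 70)   # 55.0 for these values
elite = rewards[rewards >= threshold]    # [80., 60.]
print(threshold, elite)
```

The full implementation below applies exactly this idea, except that it collects the (state, action) pairs of the surviving episodes rather than just their rewards.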
Algorithm Pseudocode
1. Initialize the policy network
2. Repeat:
   a. Collect N episodes with the current policy
   b. Compute each episode's total reward
   c. Select the top-p% elite episodes (reward above the percentile threshold)
   d. Train the policy network on (state, action) pairs from the elite episodes
      - Loss function: Cross-Entropy loss (treating it as a classification problem)
   e. Stop when the mean reward is high enough
CartPole Implementation
Policy Network Definition
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from collections import namedtuple

# Episode data structures
Episode = namedtuple('Episode', ['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', ['observation', 'action'])

class PolicyNetwork(nn.Module):
    """Policy network for the Cross-Entropy method."""

    def __init__(self, obs_size, hidden_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, x):
        return self.net(x)
```
Episode Collection
```python
def generate_batch(env, net, batch_size, device="cpu"):
    """Generate a batch of episodes with the current policy."""
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs, _ = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_tensor = torch.tensor(np.array([obs]), dtype=torch.float32).to(device)
        action_probs = sm(net(obs_tensor))
        action_probs_np = action_probs.data.cpu().numpy()[0]
        # Sample an action according to the policy's probabilities
        action = np.random.choice(len(action_probs_np), p=action_probs_np)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if terminated or truncated:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs, _ = env.reset()
            if len(batch) >= batch_size:
                break
        obs = next_obs
    return batch
```
Elite Episode Filtering
```python
def filter_elite_episodes(batch, percentile):
    """Select elite episodes at or above the given reward percentile."""
    rewards = [episode.reward for episode in batch]
    reward_threshold = np.percentile(rewards, percentile)
    elite_observations = []
    elite_actions = []
    for episode in batch:
        if episode.reward >= reward_threshold:
            for step in episode.steps:
                elite_observations.append(step.observation)
                elite_actions.append(step.action)
    return (
        torch.tensor(np.array(elite_observations), dtype=torch.float32),
        torch.tensor(np.array(elite_actions), dtype=torch.long),
        reward_threshold,
        np.mean(rewards),
    )
```
Training Loop
```python
def train_cross_entropy_cartpole():
    """Train CartPole with the Cross-Entropy method."""
    # Hyperparameters
    HIDDEN_SIZE = 128
    BATCH_SIZE = 16
    PERCENTILE = 70
    LEARNING_RATE = 0.01
    GOAL_REWARD = 475  # reward at which CartPole-v1 is considered solved

    env = gym.make("CartPole-v1")
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = PolicyNetwork(obs_size, HIDDEN_SIZE, n_actions)
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
    loss_fn = nn.CrossEntropyLoss()

    for iteration in range(100):
        # 1. Generate a batch of episodes
        batch = generate_batch(env, net, BATCH_SIZE)
        # 2. Select the elite episodes
        elite_obs, elite_actions, reward_threshold, mean_reward = \
            filter_elite_episodes(batch, PERCENTILE)
        # 3. Train the policy network on the elite (state, action) pairs
        optimizer.zero_grad()
        action_logits = net(elite_obs)
        loss = loss_fn(action_logits, elite_actions)
        loss.backward()
        optimizer.step()

        print(
            f"Iter {iteration:3d}: "
            f"mean reward={mean_reward:6.1f}, "
            f"threshold={reward_threshold:6.1f}, "
            f"loss={loss.item():.4f}"
        )
        if mean_reward >= GOAL_REWARD:
            print(f"Goal reached! Mean reward: {mean_reward:.1f}")
            break

    env.close()
    return net

# Run training
trained_net = train_cross_entropy_cartpole()
```
Expected Output
```
Iter   0: mean reward=  21.3, threshold=  23.0, loss=0.6897
Iter   1: mean reward=  25.1, threshold=  27.0, loss=0.6753
...
Iter  20: mean reward= 152.4, threshold= 185.0, loss=0.5832
...
Iter  40: mean reward= 421.8, threshold= 500.0, loss=0.5106
Iter  45: mean reward= 487.2, threshold= 500.0, loss=0.4923
Goal reached! Mean reward: 487.2
```
Evaluating the Trained Agent
```python
def evaluate_agent(env_name, net, n_episodes=100, render=False):
    """Evaluate the trained agent."""
    render_mode = "human" if render else None
    env = gym.make(env_name, render_mode=render_mode)
    rewards = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        total_reward = 0
        while True:
            obs_tensor = torch.tensor(np.array([obs]), dtype=torch.float32)
            with torch.no_grad():
                action_logits = net(obs_tensor)
            # Act greedily at evaluation time
            action = action_logits.argmax(dim=1).item()
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        rewards.append(total_reward)
    env.close()
    print(f"Evaluation over {n_episodes} episodes:")
    print(f"  mean reward: {np.mean(rewards):.1f} +/- {np.std(rewards):.1f}")
    print(f"  min/max: {np.min(rewards):.1f} / {np.max(rewards):.1f}")

# evaluate_agent("CartPole-v1", trained_net)
```
FrozenLake Implementation
FrozenLake is a problem where the agent navigates from the start to the goal on a 4x4 ice surface. Due to the slippery ice, the agent may not move in the intended direction.
FrozenLake Characteristics
- States: 16 (each position in the 4x4 grid)
- Actions: 4 (left, down, right, up)
- Reward: 1 for reaching the goal, 0 otherwise; the episode ends when the agent reaches the goal or falls into a hole
- Slipperiness: the agent moves in the intended direction with probability 1/3, and slips to each of the two perpendicular directions with probability 1/3
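These slippery dynamics can be checked with a tiny standalone simulation. The `slip` function below is a hypothetical model that mirrors the transition rule described above (it does not call gymnasium itself):

```python
import random

random.seed(0)

def slip(intended):
    """Slippery-ice model: intended direction with prob 1/3,
    each perpendicular direction with prob 1/3.
    Directions are indexed 0..3: left, down, right, up."""
    perpendicular = [(intended - 1) % 4, (intended + 1) % 4]
    return random.choice([intended] + perpendicular)

# Estimate how often a "right" command actually moves right
n = 30_000
hits = sum(slip(2) == 2 for _ in range(n))
print(hits / n)  # close to 1/3
```

This is why even a good policy succeeds only stochastically on FrozenLake, and why episode returns are so noisy.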
One-Hot Encoding for State Representation
Since FrozenLake's observation is a single integer (current position), we use one-hot encoding as input to the neural network.
```python
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    """Wrapper converting discrete observations into one-hot vectors."""

    def __init__(self, env):
        super().__init__(env)
        n = env.observation_space.n
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=(n,), dtype=np.float32
        )

    def observation(self, observation):
        result = np.zeros(self.observation_space.shape[0], dtype=np.float32)
        result[observation] = 1.0
        return result
```
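The encoding itself can be sanity-checked without creating an environment. This standalone sketch reproduces what the wrapper's `observation` method does for a 16-state space:

```python
import numpy as np

def one_hot(state: int, n_states: int = 16) -> np.ndarray:
    """Encode a discrete state index as a one-hot float vector."""
    vec = np.zeros(n_states, dtype=np.float32)
    vec[state] = 1.0
    return vec

v = one_hot(5)
print(v.shape, v[5], v.sum())  # (16,) 1.0 1.0
```

Each state thus becomes a 16-dimensional input vector, so the same `PolicyNetwork` used for CartPole works unchanged with `obs_size=16`.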
Applying Cross-Entropy to FrozenLake
FrozenLake requires some modifications because rewards are very sparse (only 1 upon reaching the goal).
```python
def train_cross_entropy_frozenlake():
    """Train FrozenLake with the Cross-Entropy method."""
    HIDDEN_SIZE = 128
    BATCH_SIZE = 100   # more episodes needed per batch
    PERCENTILE = 30    # lower percentile, since successful episodes are rare
    LEARNING_RATE = 0.001

    env = DiscreteOneHotWrapper(gym.make("FrozenLake-v1"))
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = PolicyNetwork(obs_size, HIDDEN_SIZE, n_actions)
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
    loss_fn = nn.CrossEntropyLoss()

    for iteration in range(200):
        batch = generate_batch(env, net, BATCH_SIZE)
        elite_obs, elite_actions, reward_threshold, mean_reward = \
            filter_elite_episodes(batch, PERCENTILE)
        # Skip the update when there are no elite episodes
        if len(elite_obs) == 0:
            continue

        optimizer.zero_grad()
        action_logits = net(elite_obs)
        loss = loss_fn(action_logits, elite_actions)
        loss.backward()
        optimizer.step()

        if iteration % 10 == 0:
            print(
                f"Iter {iteration:3d}: "
                f"mean reward={mean_reward:.3f}, "
                f"threshold={reward_threshold:.3f}, "
                f"loss={loss.item():.4f}"
            )
        if mean_reward > 0.8:
            print(f"Goal reached! Mean reward: {mean_reward:.3f}")
            break

    env.close()
    return net

# trained_net_fl = train_cross_entropy_frozenlake()
```
Theoretical Background of the Cross-Entropy Method
Why "Cross-Entropy"?
The name comes from the loss function. Elite episode actions are treated as ground truth labels in supervised learning, and Cross-Entropy loss is used to compare them with the policy network output.
Mathematically, the Cross-Entropy loss measures how far apart two probability distributions are:
H(p, q) = -sum( p(x) * log(q(x)) )
Here, p is the action distribution from elite episodes (ground truth) and q is the action probability output by the policy network.
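As a concrete check (with made-up numbers): when p is one-hot, as it is for a single elite (state, action) pair, the sum collapses to one term, the negative log-probability the policy assigns to the labeled action:

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])   # one-hot "ground truth": the elite action was 1
q = np.array([0.2, 0.7, 0.1])   # hypothetical policy output (softmax probabilities)

H = -np.sum(p * np.log(q))      # cross-entropy H(p, q)
# With one-hot p this is just -log q[1] = -log 0.7 ≈ 0.357
print(H)
```

Minimizing this loss pushes the policy to assign higher probability to the actions the elite episodes took, which is exactly the imitation step of the algorithm. PyTorch's `nn.CrossEntropyLoss` computes the same quantity, except that it takes raw logits and applies the softmax internally.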
Advantages and Disadvantages
Advantages
- Very simple to implement
- Few hyperparameters
- Converges quickly on simple problems like CartPole
- Stable training
Disadvantages
- Can only learn after episodes complete (no online learning)
- Not effective in sparse reward environments
- Lacks scalability for complex environments
- Difficult to directly apply to continuous action spaces
Impact of Hyperparameters
```python
def experiment_hyperparameters():
    """Try several hyperparameter combinations."""
    configs = [
        {"batch_size": 8, "percentile": 70, "hidden": 64},
        {"batch_size": 16, "percentile": 70, "hidden": 128},
        {"batch_size": 32, "percentile": 80, "hidden": 128},
        {"batch_size": 16, "percentile": 50, "hidden": 256},
    ]
    for i, config in enumerate(configs):
        print(f"\n=== Config {i+1}: batch={config['batch_size']}, "
              f"percentile={config['percentile']}, hidden={config['hidden']} ===")
        env = gym.make("CartPole-v1")
        obs_size = env.observation_space.shape[0]
        n_actions = env.action_space.n
        net = PolicyNetwork(obs_size, config["hidden"], n_actions)
        optimizer = optim.Adam(net.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()
        for iteration in range(50):
            batch = generate_batch(env, net, config["batch_size"])
            elite_obs, elite_actions, threshold, mean_reward = \
                filter_elite_episodes(batch, config["percentile"])
            if len(elite_obs) == 0:
                continue
            optimizer.zero_grad()
            loss = loss_fn(net(elite_obs), elite_actions)
            loss.backward()
            optimizer.step()
            if mean_reward >= 475:
                print(f"  Goal reached at iteration {iteration}!")
                break
        env.close()

# experiment_hyperparameters()
```
Summary
- RL taxonomy: Classified by model-based/free, value/policy-based, and on/off-policy
- Cross-Entropy method: Selects top episodes and updates the policy in a supervised learning fashion
- CartPole: Can reach maximum reward in about 40-50 iterations
- FrozenLake: Requires more episodes and lower percentile thresholds in sparse reward environments
- Theory: Uses Cross-Entropy loss to imitate the action distribution of elite episodes
The Cross-Entropy method is simple but has limitations for more complex environments. In the next article, we will build the foundations for more powerful algorithms through the Bellman equation and value iteration.