[Deep RL] 04. Solving CartPole with the Cross-Entropy Method

Taxonomy of Reinforcement Learning Methods

Reinforcement learning algorithms can be classified by various criteria. Understanding the big picture makes it easy to identify where each algorithm fits.

Model-Based vs Model-Free

  • Model-Based: Knows or learns the environment's transition probabilities and reward function. Uses them for planning.
  • Model-Free: Learns directly from experience without an environment model. Most practical deep RL falls here.

Value-Based vs Policy-Based

  • Value-Based: Estimates the value of states or actions, then selects actions based on these values. DQN is representative.
  • Policy-Based: Directly parameterizes and optimizes the policy itself. REINFORCE, PPO, etc.
  • Actor-Critic: Combines the advantages of both value-based and policy-based methods.

On-Policy vs Off-Policy

  • On-Policy: Learns only from data collected by the current policy. Lower data efficiency but more stable.
  • Off-Policy: Can also use data collected by past or different policies. Higher data efficiency.

Classification             Algorithm Examples
-------------------------  ------------------
Value-based, Off-policy    DQN, Double DQN
Policy-based, On-policy    REINFORCE, PPO
Actor-Critic, On-policy    A2C, A3C
Actor-Critic, Off-policy   SAC, DDPG, TD3

Core Idea of the Cross-Entropy Method

The Cross-Entropy method is one of the simplest algorithms in reinforcement learning. The core idea is:

  1. Run multiple episodes with the current policy
  2. Select only the top-performing episodes (elite episodes) based on reward
  3. Update the policy using state-action pairs from elite episodes
  4. Repeat

This method is a type of evolutionary strategy that improves the policy by imitating good episodes.
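Stripped of the neural network, the same select-the-elite loop can be sketched on a toy one-dimensional optimization problem: sample candidates, keep the best, refit the sampling distribution. Everything below (the reward function and all constants) is illustrative, not part of the CartPole setup:

```python
import numpy as np

def toy_cem(reward_fn, n_iters=50, pop_size=100, elite_frac=0.2, seed=0):
    """Cross-entropy method on a single parameter: fit a Gaussian to the elites."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 5.0  # initial sampling distribution
    n_elite = int(pop_size * elite_frac)
    for _ in range(n_iters):
        candidates = rng.normal(mu, sigma, size=pop_size)   # 1. run a "batch"
        rewards = reward_fn(candidates)
        elite = candidates[np.argsort(rewards)[-n_elite:]]  # 2. keep top performers
        mu, sigma = elite.mean(), elite.std() + 1e-3        # 3. refit to the elites
    return mu

# Maximize a simple concave reward peaked at x = 3
best = toy_cem(lambda x: -(x - 3.0) ** 2)
```

The policy-network version below replaces "refit a Gaussian" with "run a supervised-learning update on elite (state, action) pairs", but the loop is the same.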

Algorithm Pseudocode

1. Initialize the policy network
2. Repeat:
   a. Collect N episodes with the current policy
   b. Compute each episode's total reward
   c. Select the top p% of episodes as elites (threshold at the given reward percentile)
   d. Train the policy network on the (state, action) pairs from the elite episodes
      - Loss function: cross-entropy loss (the update is treated as a classification problem)
   e. Stop when the mean reward is high enough

CartPole Implementation

Policy Network Definition

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from collections import namedtuple

# Episode data structures
Episode = namedtuple('Episode', ['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', ['observation', 'action'])

class PolicyNetwork(nn.Module):
    """Cross-Entropy 방법을 위한 정책 네트워크"""
    def __init__(self, obs_size, hidden_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, x):
        return self.net(x)

Episode Collection

def generate_batch(env, net, batch_size, device="cpu"):
    """현재 정책으로 에피소드 배치를 생성"""
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs, _ = env.reset()
    sm = nn.Softmax(dim=1)

    while True:
        obs_tensor = torch.tensor(np.array([obs]), dtype=torch.float32).to(device)
        action_probs = sm(net(obs_tensor))
        action_probs_np = action_probs.data.cpu().numpy()[0]

        # Sample an action according to the policy's probabilities
        action = np.random.choice(len(action_probs_np), p=action_probs_np)

        next_obs, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))

        if terminated or truncated:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs, _ = env.reset()

            if len(batch) >= batch_size:
                break

        obs = next_obs

    return batch
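The `np.random.choice(..., p=...)` call above samples actions in proportion to the network's softmax output rather than always taking the argmax, which is what keeps exploration alive early in training. A quick sanity check that empirical frequencies match the given probabilities (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
probs = np.array([0.8, 0.2])  # e.g., softmax output for two actions
actions = rng.choice(len(probs), size=10_000, p=probs)

freq_action_0 = np.mean(actions == 0)  # should be close to 0.8
```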

Elite Episode Filtering

def filter_elite_episodes(batch, percentile):
    """상위 percentile%의 엘리트 에피소드 선별"""
    rewards = [episode.reward for episode in batch]
    reward_threshold = np.percentile(rewards, percentile)

    elite_observations = []
    elite_actions = []

    for episode in batch:
        if episode.reward >= reward_threshold:
            for step in episode.steps:
                elite_observations.append(step.observation)
                elite_actions.append(step.action)

    return (
        torch.tensor(np.array(elite_observations), dtype=torch.float32),
        torch.tensor(np.array(elite_actions), dtype=torch.long),
        reward_threshold,
        np.mean(rewards),
    )
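To see what the percentile threshold does, here is `np.percentile` on a toy list of episode rewards (the numbers are made up for illustration): with PERCENTILE = 70, only episodes at or above the 70th-percentile reward survive.

```python
import numpy as np

rewards = [10.0, 20.0, 30.0, 40.0, 50.0]        # toy episode rewards
threshold = np.percentile(rewards, 70)           # 70th percentile (linear interpolation)
elite = [r for r in rewards if r >= threshold]   # episodes kept for training
# threshold is 38.0, so only the 40.0 and 50.0 episodes are elite
```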

Training Loop

def train_cross_entropy_cartpole():
    """Cross-Entropy 방법으로 CartPole 학습"""
    # 하이퍼파라미터
    HIDDEN_SIZE = 128
    BATCH_SIZE = 16
    PERCENTILE = 70
    LEARNING_RATE = 0.01
    GOAL_REWARD = 475  # CartPole-v1의 목표 보상

    env = gym.make("CartPole-v1")
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = PolicyNetwork(obs_size, HIDDEN_SIZE, n_actions)
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
    loss_fn = nn.CrossEntropyLoss()

    for iteration in range(100):
        # 1. Generate a batch of episodes
        batch = generate_batch(env, net, BATCH_SIZE)

        # 2. Select the elite episodes
        elite_obs, elite_actions, reward_threshold, mean_reward = \
            filter_elite_episodes(batch, PERCENTILE)

        # 3. Train the policy network
        optimizer.zero_grad()
        action_logits = net(elite_obs)
        loss = loss_fn(action_logits, elite_actions)
        loss.backward()
        optimizer.step()

        print(
            f"Iter {iteration:3d}: "
            f"mean reward={mean_reward:6.1f}, "
            f"threshold={reward_threshold:6.1f}, "
            f"loss={loss.item():.4f}"
        )

        if mean_reward >= GOAL_REWARD:
            print(f"Goal reached! Mean reward: {mean_reward:.1f}")
            break

    env.close()
    return net

# Run training
trained_net = train_cross_entropy_cartpole()

Expected Output

Iter   0: mean reward=  21.3, threshold=  23.0, loss=0.6897
Iter   1: mean reward=  25.1, threshold=  27.0, loss=0.6753
...
Iter  20: mean reward= 152.4, threshold= 185.0, loss=0.5832
...
Iter  40: mean reward= 421.8, threshold= 500.0, loss=0.5106
Iter  45: mean reward= 487.2, threshold= 500.0, loss=0.4923
Goal reached! Mean reward: 487.2

Evaluating the Trained Agent

def evaluate_agent(env_name, net, n_episodes=100, render=False):
    """학습된 에이전트의 성능을 평가"""
    render_mode = "human" if render else None
    env = gym.make(env_name, render_mode=render_mode)
    rewards = []

    for _ in range(n_episodes):
        obs, _ = env.reset()
        total_reward = 0

        while True:
            obs_tensor = torch.tensor(np.array([obs]), dtype=torch.float32)
            with torch.no_grad():
                action_logits = net(obs_tensor)
            action = action_logits.argmax(dim=1).item()

            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward

            if terminated or truncated:
                break

        rewards.append(total_reward)

    env.close()
    print(f"평가 결과 ({n_episodes}회):")
    print(f"  평균 보상: {np.mean(rewards):.1f} +/- {np.std(rewards):.1f}")
    print(f"  최소/최대: {np.min(rewards):.1f} / {np.max(rewards):.1f}")

# evaluate_agent("CartPole-v1", trained_net)

FrozenLake Implementation

FrozenLake is a problem where the agent navigates from the start to the goal on a 4x4 ice surface. Due to the slippery ice, the agent may not move in the intended direction.

FrozenLake Characteristics

  • States: 16 (one per cell of the 4x4 grid)
  • Actions: 4 (left, down, right, up)
  • Reward: 1 on reaching the goal, 0 otherwise; falling into a hole terminates the episode with reward 0
  • Slipperiness: the agent moves in the intended direction with probability 1/3 and slips to each of the two perpendicular directions with probability 1/3
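These transition probabilities are why successful episodes are so rare under an untrained policy: even if the policy picks the ideal action at every step, the chance of actually traversing one specific path shrinks geometrically. A back-of-the-envelope calculation (the 6-step path length is just an example, and slips can still stumble onto the goal by other routes, so this only illustrates the order of magnitude):

```python
# Probability of following one intended 6-step path exactly,
# when each step goes the intended way with probability 1/3
p_step = 1 / 3
path_len = 6
p_path = p_step ** path_len  # 1/729, roughly 0.0014
```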

One-Hot Encoding for State Representation

Since FrozenLake's observation is a single integer (current position), we use one-hot encoding as input to the neural network.

class DiscreteOneHotWrapper(gym.ObservationWrapper):
    """이산 관찰을 원핫 벡터로 변환하는 래퍼"""
    def __init__(self, env):
        super().__init__(env)
        n = env.observation_space.n
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=(n,), dtype=np.float32
        )

    def observation(self, observation):
        result = np.zeros(self.observation_space.shape[0], dtype=np.float32)
        result[observation] = 1.0
        return result
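The transformation itself is easy to check in isolation. This standalone helper mirrors the wrapper's observation() method: for a 16-state grid, position 5 becomes a length-16 vector with a single 1.

```python
import numpy as np

def one_hot(observation: int, n_states: int) -> np.ndarray:
    """Return a one-hot vector, as DiscreteOneHotWrapper.observation does."""
    result = np.zeros(n_states, dtype=np.float32)
    result[observation] = 1.0
    return result

vec = one_hot(5, 16)  # 1.0 at index 5, zeros elsewhere
```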

Applying Cross-Entropy to FrozenLake

FrozenLake requires some modifications because rewards are very sparse (only 1 upon reaching the goal).

def train_cross_entropy_frozenlake():
    """Cross-Entropy 방법으로 FrozenLake 학습"""
    HIDDEN_SIZE = 128
    BATCH_SIZE = 100  # 더 많은 에피소드 필요
    PERCENTILE = 30   # 더 낮은 퍼센타일 (성공 에피소드가 적으므로)
    LEARNING_RATE = 0.001

    env = DiscreteOneHotWrapper(gym.make("FrozenLake-v1"))
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = PolicyNetwork(obs_size, HIDDEN_SIZE, n_actions)
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
    loss_fn = nn.CrossEntropyLoss()

    for iteration in range(200):
        batch = generate_batch(env, net, BATCH_SIZE)
        elite_obs, elite_actions, reward_threshold, mean_reward = \
            filter_elite_episodes(batch, PERCENTILE)

        # Skip the update when no elite episodes were found
        if len(elite_obs) == 0:
            continue

        optimizer.zero_grad()
        action_logits = net(elite_obs)
        loss = loss_fn(action_logits, elite_actions)
        loss.backward()
        optimizer.step()

        if iteration % 10 == 0:
            print(
                f"반복 {iteration:3d}: "
                f"평균 보상={mean_reward:.3f}, "
                f"임계값={reward_threshold:.3f}, "
                f"손실={loss.item():.4f}"
            )

        if mean_reward > 0.8:
            print(f"목표 달성! 평균 보상: {mean_reward:.3f}")
            break

    env.close()
    return net

# trained_net_fl = train_cross_entropy_frozenlake()

Theoretical Background of the Cross-Entropy Method

Why "Cross-Entropy"?

The name comes from the loss function. Elite episode actions are treated as ground truth labels in supervised learning, and Cross-Entropy loss is used to compare them with the policy network output.

The mathematical meaning of Cross-Entropy loss is measuring the distance between two probability distributions:

H(p, q) = -sum( p(x) * log(q(x)) )

Here, p is the action distribution from elite episodes (ground truth) and q is the action probability output by the policy network.
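Because each elite (state, action) pair acts as a one-hot "label", p(x) is 1 for the elite action and 0 elsewhere, so the sum collapses to -log q(a) for the action actually taken. A manual numpy version of what nn.CrossEntropyLoss computes for a single sample (the logits here are made up):

```python
import numpy as np

def cross_entropy(logits: np.ndarray, label: int) -> float:
    """-log softmax(logits)[label]: the per-sample nn.CrossEntropyLoss value."""
    q = np.exp(logits - logits.max())  # softmax with the usual stability shift
    q = q / q.sum()
    return float(-np.log(q[label]))

loss = cross_entropy(np.array([2.0, 1.0]), label=0)  # ≈ 0.3133
```

Minimizing this loss pushes the network to assign higher probability to the actions the elite episodes took in each state.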

Advantages and Disadvantages

Advantages

  • Very simple to implement
  • Few hyperparameters
  • Converges quickly on simple problems like CartPole
  • Stable training

Disadvantages

  • Can only learn after episodes complete (no online learning)
  • Not effective in sparse reward environments
  • Lacks scalability for complex environments
  • Difficult to directly apply to continuous action spaces

Impact of Hyperparameters

def experiment_hyperparameters():
    """다양한 하이퍼파라미터 조합 실험"""
    configs = [
        {"batch_size": 8, "percentile": 70, "hidden": 64},
        {"batch_size": 16, "percentile": 70, "hidden": 128},
        {"batch_size": 32, "percentile": 80, "hidden": 128},
        {"batch_size": 16, "percentile": 50, "hidden": 256},
    ]

    for i, config in enumerate(configs):
        print(f"\n=== 설정 {i+1}: batch={config['batch_size']}, "
              f"percentile={config['percentile']}, hidden={config['hidden']} ===")

        env = gym.make("CartPole-v1")
        obs_size = env.observation_space.shape[0]
        n_actions = env.action_space.n

        net = PolicyNetwork(obs_size, config["hidden"], n_actions)
        optimizer = optim.Adam(net.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()

        for iteration in range(50):
            batch = generate_batch(env, net, config["batch_size"])
            elite_obs, elite_actions, threshold, mean_reward = \
                filter_elite_episodes(batch, config["percentile"])

            if len(elite_obs) == 0:
                continue

            optimizer.zero_grad()
            loss = loss_fn(net(elite_obs), elite_actions)
            loss.backward()
            optimizer.step()

            if mean_reward >= 475:
                print(f"  반복 {iteration}에서 목표 달성!")
                break

        env.close()

# experiment_hyperparameters()

Summary

  1. RL taxonomy: Classified by model-based/free, value/policy-based, and on/off-policy
  2. Cross-Entropy method: Selects top episodes and updates the policy in a supervised learning fashion
  3. CartPole: Can reach maximum reward in about 40-50 iterations
  4. FrozenLake: Requires more episodes and lower percentile thresholds in sparse reward environments
  5. Theory: Uses Cross-Entropy loss to imitate the action distribution of elite episodes

The Cross-Entropy method is simple but has limitations for more complex environments. In the next article, we will build the foundations for more powerful algorithms through the Bellman equation and value iteration.