- Author: Youngju Kim (@fjvbn20031)
## Limitations of Value Iteration in the Real World

The table-based methods covered in the previous article can only be used when the state space is small. In real-world problems, we face the following limitations.
### Explosion of the State Space

- Atari game screens: 210 x 160 pixels x 3 color channels = about 100,000 pixel values
- Go: approximately 10^170 possible states
- Autonomous driving: sensor data combinations are practically infinite

In such environments, even building a Q table is impossible. We must approximate the Q function with a neural network.
## From Table Q-Learning to Deep Q-Learning

### Core Idea

Instead of a Q table, we use a neural network Q(s, a; theta) to approximate Q values, where theta represents the network parameters (weights).

```text
Q-table: Q[state_index][action_index] = value             (stores exact values)
DQN:     Q(state; theta) -> [q_action_0, q_action_1, ...] (function approximation)
```
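The contrast above can be illustrated with a toy sketch. Everything below is invented for illustration, and a linear model stands in for a deep network:

```python
import numpy as np

# Tabular Q-learning: one stored value per (state, action) pair.
q_table = np.zeros((16, 4))      # e.g. 16 discrete states, 4 actions
q_table[3, 1] = 0.7              # an exact value written into one cell

# Function approximation, sketched here as a linear model instead of a
# deep network: the same parameters theta produce Q values for *any*
# state vector, including states never seen before.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))  # illustrative weights: 4 features -> 4 actions

def q_values(state, theta):
    """Approximate Q(s, .; theta): one value per action."""
    return state @ theta

state = np.array([0.1, 0.2, 0.3, 0.4])  # a continuous state
print(q_values(state, theta).shape)     # (4,)
```

The table can only answer queries for states it has a row for; the approximator generalizes across the whole state space at the cost of being approximate.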
### Why Simply Adding a Neural Network Is Not Enough

Naively plugging a neural network into Q-learning makes training very unstable. There are three core problems:

- Correlated data: consecutive experiences are strongly correlated, violating the i.i.d. assumption
- Non-stationary targets: the learning target keeps moving, making convergence difficult
- Markov property violation: a single frame alone cannot fully represent the environment state
## Core Techniques of DQN

### 1. Experience Replay

Agent experiences are stored in a buffer and sampled at random during training. Random sampling breaks the correlation between consecutive data points.
```python
import numpy as np
from collections import deque
import random


class ReplayBuffer:
    """Experience replay buffer."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            np.array(states),
            np.array(actions),
            np.array(rewards, dtype=np.float32),
            np.array(next_states),
            np.array(dones, dtype=np.bool_),
        )

    def __len__(self):
        return len(self.buffer)


# Usage example
buffer = ReplayBuffer(capacity=100000)

# Store an experience
buffer.push(
    state=np.array([0.1, 0.2, 0.3, 0.4]),
    action=1,
    reward=1.0,
    next_state=np.array([0.2, 0.3, 0.4, 0.5]),
    done=False,
)

# Sample a batch
if len(buffer) >= 32:
    states, actions, rewards, next_states, dones = buffer.sample(32)
```
### 2. Target Network

A separate network is used to compute the target for Q-value updates. The target network periodically copies its weights from the online network.

```text
TD target = r + gamma * max_{a'} Q_target(s', a'; theta_target)
Loss      = (Q(s, a; theta) - TD target)^2
```

Using a target network keeps the learning target fixed for a period, stabilizing training.
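As a concrete, self-contained illustration of this formula, the TD target for a small batch can be computed from hypothetical target-network outputs (the numbers below are invented):

```python
import numpy as np

gamma = 0.99

# Hypothetical Q_target(s', a') outputs for a batch of 3 next states, 2 actions.
next_q_target = np.array([[1.0, 2.0],
                          [0.5, 0.1],
                          [3.0, 2.5]])
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([False, True, False])

# TD target = r + gamma * max_a' Q_target(s', a'); terminal states do not bootstrap.
max_next_q = next_q_target.max(axis=1)
max_next_q[dones] = 0.0
td_target = rewards + gamma * max_next_q
print(td_target)  # approximately [2.98, 0.00, 1.97]
```

Note how the second (terminal) transition contributes only its reward: zeroing out bootstrapping at episode boundaries is exactly what `next_q[dones_t] = 0.0` does in the agent code.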
### 3. Frame Stacking

In Atari games, a single frame cannot tell us, for example, which way the ball is moving. We therefore stack 4 consecutive frames into one observation.
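The shape bookkeeping can be sketched with plain numpy on synthetic frames (the real pipeline is implemented in the wrapper code later in this article):

```python
import numpy as np
from collections import deque

# Four synthetic 84x84 grayscale frames, each with a single channel.
frames = deque(maxlen=4)
for t in range(4):
    frames.append(np.full((84, 84, 1), t, dtype=np.uint8))

# Concatenating along the channel axis yields one (84, 84, 4) observation,
# which lets the network infer motion between frames.
stacked = np.concatenate(list(frames), axis=-1)
print(stacked.shape)  # (84, 84, 4)

# A new frame pushes the oldest one out of the fixed-length deque.
frames.append(np.full((84, 84, 1), 4, dtype=np.uint8))
stacked = np.concatenate(list(frames), axis=-1)
print(stacked[0, 0])  # [1 2 3 4]
```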
## DQN Model Implementation

### DQN Network for Atari

```python
import numpy as np
import torch
import torch.nn as nn


class DQN(nn.Module):
    """DQN network for Atari games."""

    def __init__(self, input_channels, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        conv_out_size = self._get_conv_out(input_channels)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def _get_conv_out(self, channels):
        # Pass a dummy 84x84 input through the conv stack to infer the flattened size
        o = self.conv(torch.zeros(1, channels, 84, 84))
        return int(np.prod(o.size()))

    def forward(self, x):
        # Scale uint8 pixels to [0, 1] before the conv layers
        conv_out = self.conv(x.float() / 255.0)
        return self.fc(conv_out.view(conv_out.size(0), -1))
```
### DQN Agent Implementation

```python
import random

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


class DQNAgent:
    """DQN agent."""

    def __init__(self, env, device="cpu", buffer_size=100000,
                 batch_size=32, gamma=0.99, lr=1e-4,
                 epsilon_start=1.0, epsilon_end=0.01,
                 epsilon_decay=100000, target_update=1000):
        self.env = env
        self.device = device
        self.batch_size = batch_size
        self.gamma = gamma
        self.target_update = target_update

        # Epsilon schedule
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

        n_actions = env.action_space.n

        # Online network and target network
        self.online_net = DQN(4, n_actions).to(device)
        self.target_net = DQN(4, n_actions).to(device)
        self.target_net.load_state_dict(self.online_net.state_dict())
        self.target_net.eval()

        self.optimizer = optim.Adam(self.online_net.parameters(), lr=lr)
        self.buffer = ReplayBuffer(buffer_size)
        self.step_count = 0

    def get_epsilon(self):
        """Compute the current epsilon value (linear decay)."""
        epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
            max(0, 1 - self.step_count / self.epsilon_decay)
        return epsilon

    def select_action(self, state):
        """Select an action with the epsilon-greedy policy."""
        epsilon = self.get_epsilon()
        if random.random() < epsilon:
            return self.env.action_space.sample()
        with torch.no_grad():
            state_tensor = torch.tensor(
                np.array([state]), dtype=torch.uint8
            ).to(self.device)
            q_values = self.online_net(state_tensor)
            return q_values.argmax(dim=1).item()

    def update(self):
        """Sample a batch from the replay buffer and update the network."""
        if len(self.buffer) < self.batch_size:
            return None
        states, actions, rewards, next_states, dones = self.buffer.sample(self.batch_size)

        # Convert to tensors
        states_t = torch.tensor(states, dtype=torch.uint8).to(self.device)
        actions_t = torch.tensor(actions, dtype=torch.long).to(self.device)
        rewards_t = torch.tensor(rewards, dtype=torch.float32).to(self.device)
        next_states_t = torch.tensor(next_states, dtype=torch.uint8).to(self.device)
        dones_t = torch.tensor(dones, dtype=torch.bool).to(self.device)

        # Current Q values: Q(s, a)
        current_q = self.online_net(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)

        # Target Q values: r + gamma * max_a' Q_target(s', a')
        with torch.no_grad():
            next_q = self.target_net(next_states_t).max(dim=1)[0]
            next_q[dones_t] = 0.0
            target_q = rewards_t + self.gamma * next_q

        # Compute the loss and backpropagate
        loss = nn.SmoothL1Loss()(current_q, target_q)
        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping (for stable training)
        torch.nn.utils.clip_grad_norm_(self.online_net.parameters(), max_norm=10)
        self.optimizer.step()
        return loss.item()

    def sync_target_network(self):
        """Sync the target network with the online network."""
        self.target_net.load_state_dict(self.online_net.state_dict())
```
## Atari Environment Preprocessing

The Atari environment preprocessing pipeline for DQN training:

```python
from collections import deque

import cv2
import gymnasium as gym
import numpy as np
from gymnasium import ObservationWrapper, Wrapper
from gymnasium.spaces import Box


class FireResetWrapper(Wrapper):
    """Automatically press FIRE at the start of each episode."""

    def __init__(self, env):
        super().__init__(env)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        obs, _, terminated, truncated, info = self.env.step(1)  # FIRE
        if terminated or truncated:
            obs, info = self.env.reset(**kwargs)
        return obs, info


class MaxAndSkipWrapper(Wrapper):
    """Frame skip: pick an action every 4 frames and repeat it in between."""

    def __init__(self, env, skip=4):
        super().__init__(env)
        self.skip = skip
        self.obs_buffer = np.zeros((2,) + env.observation_space.shape, dtype=np.uint8)

    def step(self, action):
        total_reward = 0.0
        for i in range(self.skip):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if i == self.skip - 2:
                self.obs_buffer[0] = obs
            if i == self.skip - 1:
                self.obs_buffer[1] = obs
            if terminated or truncated:
                break
        # Pixel-wise max over the last two frames to remove Atari sprite flicker
        max_frame = self.obs_buffer.max(axis=0)
        return max_frame, total_reward, terminated, truncated, info


class ProcessFrame84Wrapper(ObservationWrapper):
    """Convert frames to 84x84 grayscale."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = Box(
            low=0, high=255, shape=(84, 84, 1), dtype=np.uint8
        )

    def observation(self, obs):
        return self._process(obs)

    def _process(self, frame):
        img = np.mean(frame, axis=2).astype(np.uint8)
        resized = cv2.resize(img, (84, 84), interpolation=cv2.INTER_AREA)
        return resized.reshape(84, 84, 1)


class FrameStackWrapper(ObservationWrapper):
    """Stack n consecutive frames along the channel axis."""

    def __init__(self, env, n_frames=4):
        super().__init__(env)
        self.n_frames = n_frames
        old_shape = env.observation_space.shape
        self.observation_space = Box(
            low=0, high=255,
            shape=(old_shape[0], old_shape[1], n_frames),
            dtype=np.uint8,
        )
        self.frames = deque(maxlen=n_frames)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        for _ in range(self.n_frames):
            self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=-1), info

    def observation(self, obs):
        self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=-1)


class ImageToChannelsFirstWrapper(ObservationWrapper):
    """Convert (H, W, C) to (C, H, W)."""

    def __init__(self, env):
        super().__init__(env)
        old_shape = env.observation_space.shape
        self.observation_space = Box(
            low=0, high=255,
            shape=(old_shape[2], old_shape[0], old_shape[1]),
            dtype=np.uint8,
        )

    def observation(self, obs):
        return np.transpose(obs, (2, 0, 1))


def make_atari_env(env_name="ALE/Pong-v5"):
    """Create an Atari environment and apply the wrappers."""
    env = gym.make(env_name)
    env = MaxAndSkipWrapper(env)
    env = FireResetWrapper(env)
    env = ProcessFrame84Wrapper(env)
    env = FrameStackWrapper(env)
    env = ImageToChannelsFirstWrapper(env)
    return env
```
## DQN Training Loop

```python
from torch.utils.tensorboard import SummaryWriter


def train_dqn_pong():
    """Train DQN on Pong."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    env = make_atari_env("ALE/Pong-v5")
    agent = DQNAgent(
        env=env,
        device=device,
        buffer_size=100000,
        batch_size=32,
        gamma=0.99,
        lr=1e-4,
        epsilon_start=1.0,
        epsilon_end=0.02,
        epsilon_decay=100000,
        target_update=1000,
    )
    writer = SummaryWriter("runs/dqn_pong")

    obs, _ = env.reset()
    episode_reward = 0
    episode_count = 0

    for frame in range(1, 1000001):
        # Select and execute an action
        action = agent.select_action(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)

        # Store the experience
        agent.buffer.push(obs, action, reward, next_obs, terminated or truncated)
        episode_reward += reward
        agent.step_count += 1

        if terminated or truncated:
            episode_count += 1
            writer.add_scalar("reward/episode", episode_reward, episode_count)
            writer.add_scalar("epsilon", agent.get_epsilon(), frame)
            if episode_count % 10 == 0:
                print(
                    f"Frame {frame}, episode {episode_count}: "
                    f"reward={episode_reward:.0f}, "
                    f"epsilon={agent.get_epsilon():.3f}, "
                    f"buffer={len(agent.buffer)}"
                )
            episode_reward = 0
            obs, _ = env.reset()
        else:
            obs = next_obs

        # Update the network
        loss = agent.update()
        if loss is not None and frame % 1000 == 0:
            writer.add_scalar("loss/td", loss, frame)

        # Sync the target network
        if frame % agent.target_update == 0:
            agent.sync_target_network()

        # Save the model
        if frame % 50000 == 0:
            torch.save(agent.online_net.state_dict(), f"dqn_pong_{frame}.pth")

    writer.close()
    env.close()


# train_dqn_pong()
```
## DQN Training Results Analysis

DQN training in the Pong environment typically shows the following pattern:

### Learning Curve Characteristics

- Initial phase (0-100K frames): near-random actions, reward around -21 (losing every point)
- Exploration phase (100K-300K frames): epsilon decreases, the agent occasionally scores
- Learning phase (300K-700K frames): rapid improvement, reward rises to around 0
- Convergence phase (700K+ frames): reward reaches +19 to +21, near-perfect play
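To read curves like this from the logged episode rewards, a simple moving average is the usual tool (a small helper sketch; the 100-episode window is a common choice, not something fixed by DQN):

```python
import numpy as np

def moving_average(rewards, window=100):
    """Smooth a list of episode rewards for plotting a learning curve."""
    rewards = np.asarray(rewards, dtype=np.float64)
    if len(rewards) < window:
        # Not enough episodes yet: average over everything seen so far
        return rewards.cumsum() / (np.arange(len(rewards)) + 1)
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")

print(moving_average([-21, -20, -18, -10], window=2))  # pairwise means
```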
### Impact of Key Hyperparameters

| Parameter | Value | Role |
|---|---|---|
| Learning rate | 1e-4 | Too large = unstable; too small = slow |
| Batch size | 32 | Larger = more stable, but more memory |
| Buffer size | 100K | Small = only recent experience; large = more diversity |
| Target update period | 1000 steps | Too short = unstable; too long = stale targets |
| Epsilon decay range | 100K frames | Speed of the transition from exploration to exploitation |
| Discount factor | 0.99 | Weight given to future rewards |
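The epsilon decay row refers to the linear schedule implemented in `get_epsilon` earlier; evaluating the same formula at a few frame counts makes the exploration-to-exploitation transition concrete:

```python
# Linear epsilon schedule: decays from epsilon_start to epsilon_end over
# epsilon_decay frames, then stays at epsilon_end (same formula as get_epsilon).
def linear_epsilon(step, start=1.0, end=0.01, decay=100_000):
    return end + (start - end) * max(0.0, 1 - step / decay)

for step in (0, 25_000, 50_000, 100_000, 200_000):
    print(step, round(linear_epsilon(step), 3))
```

Halfway through the decay range (50K frames) the agent still explores on roughly half of its steps; beyond 100K frames it exploits almost exclusively.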
## DQN Experiment on a Simple Environment

Training Pong requires significant GPU time. The core concepts of DQN can be verified much faster on CartPole.

```python
class SimpleDQN(nn.Module):
    """A small DQN for CartPole."""

    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)


def train_dqn_cartpole():
    """Train DQN on CartPole."""
    env = gym.make("CartPole-v1")
    device = torch.device("cpu")
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    online_net = SimpleDQN(obs_size, n_actions).to(device)
    target_net = SimpleDQN(obs_size, n_actions).to(device)
    target_net.load_state_dict(online_net.state_dict())
    optimizer = optim.Adam(online_net.parameters(), lr=1e-3)
    buffer = ReplayBuffer(10000)

    epsilon = 1.0
    epsilon_min = 0.01
    epsilon_decay = 0.995
    gamma = 0.99
    batch_size = 64
    target_update = 100

    rewards_history = []
    step = 0

    for episode in range(500):
        obs, _ = env.reset()
        total_reward = 0
        while True:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q = online_net(torch.tensor(obs, dtype=torch.float32).unsqueeze(0))
                    action = q.argmax(dim=1).item()

            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.push(obs, action, reward, next_obs, done)
            total_reward += reward
            obs = next_obs
            step += 1

            # Training step
            if len(buffer) >= batch_size:
                s, a, r, ns, d = buffer.sample(batch_size)
                s_t = torch.tensor(s, dtype=torch.float32)
                a_t = torch.tensor(a, dtype=torch.long)
                r_t = torch.tensor(r, dtype=torch.float32)
                ns_t = torch.tensor(ns, dtype=torch.float32)
                d_t = torch.tensor(d, dtype=torch.bool)

                current_q = online_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    next_q = target_net(ns_t).max(dim=1)[0]
                    next_q[d_t] = 0.0
                    target_q = r_t + gamma * next_q

                loss = nn.MSELoss()(current_q, target_q)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # Sync the target network
            if step % target_update == 0:
                target_net.load_state_dict(online_net.state_dict())

            if done:
                break

        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        rewards_history.append(total_reward)

        if episode % 50 == 0:
            mean_reward = np.mean(rewards_history[-50:])
            print(f"Episode {episode}: mean reward={mean_reward:.1f}, epsilon={epsilon:.3f}")
            if mean_reward >= 475:
                print("CartPole solved!")
                break

    env.close()
    return online_net


# trained_model = train_dqn_cartpole()
```
## Summary

- Limitations of table Q-learning: Q tables are infeasible for large state spaces
- Core DQN techniques: experience replay (breaks data correlation), target network (stabilizes training)
- Atari preprocessing: frame skipping, grayscale conversion, 84x84 resizing, frame stacking
- Training process: human-level performance on Pong after more than a million frames of training
- Epsilon schedule: gradual transition from initial exploration to exploitation

In the next article, we will cover extension techniques that significantly improve DQN performance (Double DQN, Dueling DQN, Prioritized Replay, etc.).