Author: Youngju Kim (@fjvbn20031)
Directions for Improving DQN
While basic DQN showed human-level performance in Atari games, there is still much room for improvement. In this article, we explore six extension techniques that significantly enhance DQN performance.
1. N-step DQN
Problem: Limitations of Single-Step TD Targets
Basic DQN uses 1-step TD targets:
1-step: target = r_t + gamma * max_a Q(s_{t+1}, a)
Only the immediate reward is observed directly; everything beyond it is bootstrapped from the Q-estimate, so any error in that estimate feeds straight into the target.
Solution: Using N-step Rewards
The N-step method uses actual rewards from multiple steps to create more accurate targets:
N-step: target = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1} + gamma^n * max_a Q(s_{t+n}, a)
```python
import torch
import torch.nn as nn
import numpy as np
from collections import deque


class NStepBuffer:
    """Buffer that accumulates transitions for N-step returns."""

    def __init__(self, n_steps, gamma):
        self.n_steps = n_steps
        self.gamma = gamma
        self.buffer = deque(maxlen=n_steps)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def get(self):
        """Return the aggregated N-step transition."""
        # State and action come from the first transition in the window
        state, action, _, _, _ = self.buffer[0]

        # Accumulate the discounted N-step return; stop early if the episode ends
        n_step_return = 0.0
        next_state, done = self.buffer[-1][3], self.buffer[-1][4]
        for i, (_, _, reward, s_next, d) in enumerate(self.buffer):
            n_step_return += (self.gamma ** i) * reward
            if d:
                # Episode terminated inside the window: bootstrap from this transition
                next_state, done = s_next, d
                break

        return state, action, n_step_return, next_state, done

    def is_ready(self):
        return len(self.buffer) == self.n_steps


# Usage in N-step DQN training
def compute_nstep_target(n_step_return, next_state, done, target_net,
                         gamma, n_steps, device):
    """Compute the N-step TD target."""
    if done:
        return n_step_return
    with torch.no_grad():
        next_q = target_net(
            torch.tensor([next_state], dtype=torch.float32).to(device)
        ).max(dim=1)[0].item()
    return n_step_return + (gamma ** n_steps) * next_q
```
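Values of n between 3 and 5 are common; larger n uses more observed reward but increases the variance of the target. As a minimal sketch of how the buffer is used during data collection (assuming a Gymnasium-style `env`, a `select_action` helper, and a main `replay_buffer`, none of which are defined in this article):

```python
# Illustrative collection loop; env, select_action, and replay_buffer are assumed.
n_step_buffer = NStepBuffer(n_steps=3, gamma=0.99)

state, _ = env.reset()
done = False
while not done:
    action = select_action(state)
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated

    n_step_buffer.push(state, action, reward, next_state, done)
    if n_step_buffer.is_ready():
        # Store the aggregated N-step transition in the main replay buffer
        replay_buffer.push(*n_step_buffer.get())
    state = next_state
```

At an episode boundary the partially filled window would normally be flushed as shorter-horizon transitions; that detail is omitted here for brevity.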
2. Double DQN
Problem: Q-Value Overestimation
The max operator in basic DQN target computation tends to systematically overestimate Q values:
Basic DQN target: r + gamma * max_a Q_target(s', a)
Because the max operation performs action selection and value evaluation at the same time, actions with noisily high estimates tend to be selected, which systematically inflates the target.
Solution: Separating Action Selection and Value Evaluation
Double DQN uses the online network for action selection and the target network for value evaluation:
Double DQN target:
a* = argmax_a Q_online(s', a)          # select the action with the online network
target = r + gamma * Q_target(s', a*)  # evaluate it with the target network
```python
def compute_double_dqn_loss(online_net, target_net, states, actions,
                            rewards, next_states, dones, gamma, device):
    """Compute the Double DQN loss."""
    states_t = torch.tensor(states, dtype=torch.float32).to(device)
    actions_t = torch.tensor(actions, dtype=torch.long).to(device)
    rewards_t = torch.tensor(rewards, dtype=torch.float32).to(device)
    next_states_t = torch.tensor(next_states, dtype=torch.float32).to(device)
    dones_t = torch.tensor(dones, dtype=torch.bool).to(device)

    # Q-values of the actions actually taken
    current_q = online_net(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Key step: select the next action with the online network
        best_actions = online_net(next_states_t).argmax(dim=1)
        # Evaluate that action with the target network
        next_q = target_net(next_states_t).gather(1, best_actions.unsqueeze(1)).squeeze(1)
        next_q[dones_t] = 0.0
        target_q = rewards_t + gamma * next_q

    loss = nn.MSELoss()(current_q, target_q)
    return loss
```
Overestimation Reduction Effect
Double DQN is especially effective when the action space is large or rewards are noisy. It shows more stable and better final performance compared to basic DQN.
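As a minimal sketch of how this loss fits into training (the `SYNC_EVERY` value and the surrounding objects are illustrative, not from the article), the only extra bookkeeping is periodically copying the online weights into the target network:

```python
SYNC_EVERY = 1_000  # illustrative: gradient steps between hard target-network syncs

def train_step(online_net, target_net, optimizer, batch, gamma, device, step):
    """One gradient update with the Double DQN loss, plus periodic target sync."""
    states, actions, rewards, next_states, dones = batch
    loss = compute_double_dqn_loss(online_net, target_net, states, actions,
                                   rewards, next_states, dones, gamma, device)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Hard update: copy the online weights into the target network
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())
    return loss.item()
```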
3. Noisy Networks
Problem: Limitations of Epsilon-Greedy
The epsilon-greedy policy explores with the same probability in every state, so it cannot adapt the amount of exploration to individual states.
Solution: Parameter Noise
Learnable noise is added to the network weights, and the agent itself learns how much exploration is appropriate.
```python
import torch
import torch.nn as nn
import math


class NoisyLinear(nn.Module):
    """Linear layer with factorized Gaussian noise."""

    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        # Learnable parameters
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))

        # Buffers for the noise samples (not trained)
        self.register_buffer('weight_epsilon', torch.empty(out_features, in_features))
        self.register_buffer('bias_epsilon', torch.empty(out_features))

        self.sigma_init = sigma_init
        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        mu_range = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-mu_range, mu_range)
        self.weight_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))
        self.bias_mu.data.uniform_(-mu_range, mu_range)
        self.bias_sigma.data.fill_(self.sigma_init / math.sqrt(self.out_features))

    def reset_noise(self):
        """Sample fresh noise."""
        epsilon_in = self._scale_noise(self.in_features)
        epsilon_out = self._scale_noise(self.out_features)
        self.weight_epsilon.copy_(epsilon_out.outer(epsilon_in))
        self.bias_epsilon.copy_(epsilon_out)

    def _scale_noise(self, size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_epsilon
            bias = self.bias_mu + self.bias_sigma * self.bias_epsilon
        else:
            weight = self.weight_mu
            bias = self.bias_mu
        return nn.functional.linear(x, weight, bias)


class NoisyDQN(nn.Module):
    """DQN that uses Noisy Networks for exploration."""

    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.fc1 = nn.Linear(obs_size, 128)
        self.noisy_fc2 = NoisyLinear(128, 128)
        self.noisy_fc3 = NoisyLinear(128, n_actions)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.noisy_fc2(x))
        return self.noisy_fc3(x)

    def reset_noise(self):
        """Resample the noise in every NoisyLinear layer."""
        self.noisy_fc2.reset_noise()
        self.noisy_fc3.reset_noise()
```
With Noisy Networks, no epsilon schedule is needed. The network itself manages exploration.
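A brief usage sketch under these definitions (the surrounding environment loop is assumed): actions are chosen greedily with respect to the noisy Q-values, and fresh noise is resampled after each update so exploration continues.

```python
# Illustrative: greedy action selection with a noisy network (no epsilon schedule).
net = NoisyDQN(obs_size=4, n_actions=2)

def select_action(state):
    with torch.no_grad():
        q_values = net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())

# After each training step (loss.backward(); optimizer.step()),
# resample the noise so the exploration pattern changes:
net.reset_noise()
```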
4. Prioritized Experience Replay
Problem: Inefficiency of Uniform Sampling
Basic experience replay samples all experiences with equal probability. However, some experiences are more useful for learning (those with large TD errors).
Solution: TD Error-Based Priority
Experiences with larger TD errors are sampled more frequently.
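Concretely, following the standard prioritized replay formulation, transition i gets priority p_i = |delta_i| + epsilon (where delta_i is its TD error), is sampled with probability P(i) = p_i^alpha / sum_k p_k^alpha, and its loss is scaled by an importance-sampling weight w_i = (N * P(i))^(-beta), normalized by the largest weight in the batch. In the implementation below, the exponent alpha is applied when a priority is stored, so the sum tree holds p_i^alpha directly.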
```python
class SumTree:
    """Sum tree for efficient priority-proportional sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)
        self.data = [None] * capacity
        self.write_idx = 0
        self.size = 0

    def total(self):
        return self.tree[0]

    def add(self, priority, data):
        idx = self.write_idx + self.capacity - 1
        self.data[self.write_idx] = data
        self._update(idx, priority)
        self.write_idx = (self.write_idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def _update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        # Propagate the change up to the root
        while idx > 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get(self, value):
        """Find the leaf whose cumulative priority range contains `value`."""
        idx = 0
        while idx < self.capacity - 1:
            left = 2 * idx + 1
            right = left + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        data_idx = idx - self.capacity + 1
        return idx, self.tree[idx], self.data[data_idx]


class PrioritizedReplayBuffer:
    """Prioritized experience replay buffer."""

    def __init__(self, capacity, alpha=0.6, beta_start=0.4, beta_frames=100000):
        self.tree = SumTree(capacity)
        self.alpha = alpha  # priority exponent (0 = uniform, 1 = fully prioritized)
        self.beta_start = beta_start
        self.beta_frames = beta_frames
        self.frame = 0
        self.max_priority = 1.0

    def get_beta(self):
        """Importance-sampling exponent, annealed from beta_start toward 1."""
        beta = self.beta_start + (1.0 - self.beta_start) * \
            min(1.0, self.frame / self.beta_frames)
        self.frame += 1
        return beta

    def push(self, state, action, reward, next_state, done):
        data = (state, action, reward, next_state, done)
        # New transitions get the current maximum priority (already alpha-scaled)
        # so that every experience is sampled at least once.
        self.tree.add(self.max_priority, data)

    def sample(self, batch_size):
        beta = self.get_beta()
        batch = []
        indices = []
        priorities = []
        # Split the total priority mass into equal segments, one sample per segment
        segment = self.tree.total() / batch_size
        for i in range(batch_size):
            low = segment * i
            high = segment * (i + 1)
            value = np.random.uniform(low, high)
            idx, priority, data = self.tree.get(value)
            batch.append(data)
            indices.append(idx)
            priorities.append(priority)

        # Importance-sampling weights, normalized by the maximum weight
        probs = np.array(priorities) / self.tree.total()
        weights = (self.tree.size * probs) ** (-beta)
        weights = weights / weights.max()

        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            np.array(states),
            np.array(actions),
            np.array(rewards, dtype=np.float32),
            np.array(next_states),
            np.array(dones, dtype=np.bool_),
            np.array(indices),
            torch.tensor(weights, dtype=torch.float32),
        )

    def update_priorities(self, indices, td_errors):
        """Update priorities from the latest TD errors."""
        for idx, td_error in zip(indices, td_errors):
            priority = (abs(td_error) + 1e-6) ** self.alpha
            self.max_priority = max(self.max_priority, priority)
            self.tree._update(idx, priority)
```
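A minimal sketch of a training step that uses this buffer (the networks and optimizer are assumed): the per-sample squared TD errors are scaled by the importance-sampling weights, and the new TD errors are written back as priorities.

```python
def per_train_step(online_net, target_net, buffer, optimizer, batch_size, gamma, device):
    """Illustrative training step with prioritized replay."""
    states, actions, rewards, next_states, dones, indices, weights = buffer.sample(batch_size)

    states_t = torch.tensor(states, dtype=torch.float32).to(device)
    actions_t = torch.tensor(actions, dtype=torch.long).to(device)
    rewards_t = torch.tensor(rewards, dtype=torch.float32).to(device)
    next_states_t = torch.tensor(next_states, dtype=torch.float32).to(device)
    dones_t = torch.tensor(dones, dtype=torch.bool).to(device)
    weights_t = weights.to(device)

    current_q = online_net(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states_t).max(dim=1)[0]
        next_q[dones_t] = 0.0
        target_q = rewards_t + gamma * next_q

    # Importance-sampling weights correct the bias from non-uniform sampling
    td_errors = current_q - target_q
    loss = (weights_t * td_errors.pow(2)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Feed the new TD errors back as priorities
    buffer.update_priorities(indices, td_errors.detach().cpu().numpy())
    return loss.item()
```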
5. Dueling DQN
Core Idea
Q values are separated into state value V(s) and advantage A(s, a):
Q(s, a) = V(s) + A(s, a) - mean_a(A(s, a))
This separation helps because in many states the outcome is nearly the same regardless of which action is taken; in those states it is enough to estimate V(s) accurately rather than learn every Q(s, a) independently. Subtracting the mean advantage keeps the decomposition identifiable, since otherwise a constant could be shifted freely between V and A.
```python
class DuelingDQN(nn.Module):
    """Dueling DQN network."""

    def __init__(self, obs_size, n_actions):
        super().__init__()
        # Shared feature extractor
        self.feature = nn.Sequential(
            nn.Linear(obs_size, 128),
            nn.ReLU(),
        )
        # Value stream: estimates V(s)
        self.value_stream = nn.Sequential(
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        # Advantage stream: estimates A(s, a)
        self.advantage_stream = nn.Sequential(
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        features = self.feature(x)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        # Q = V + (A - mean(A))
        q_values = value + advantage - advantage.mean(dim=-1, keepdim=True)
        return q_values
```
6. Categorical DQN (C51)
Core Idea
Standard DQN estimates only the expected return, Q(s, a). Categorical DQN instead estimates a full probability distribution over returns, represented by a fixed set of discrete atoms (51 in C51).
```python
class CategoricalDQN(nn.Module):
    """Categorical DQN (C51) network."""

    def __init__(self, obs_size, n_actions, n_atoms=51, v_min=-10, v_max=10):
        super().__init__()
        self.n_actions = n_actions
        self.n_atoms = n_atoms
        self.v_min = v_min
        self.v_max = v_max
        self.delta_z = (v_max - v_min) / (n_atoms - 1)
        # Fixed support of the return distribution
        self.register_buffer(
            'support',
            torch.linspace(v_min, v_max, n_atoms)
        )
        self.network = nn.Sequential(
            nn.Linear(obs_size, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions * n_atoms),
        )

    def forward(self, x):
        logits = self.network(x)
        # (batch, n_actions * n_atoms) -> (batch, n_actions, n_atoms)
        logits = logits.view(-1, self.n_actions, self.n_atoms)
        # Softmax over atoms turns the logits into probability distributions
        probs = torch.softmax(logits, dim=-1)
        return probs

    def get_q_values(self, x):
        """Compute Q-values (expectations) from the distributions."""
        probs = self.forward(x)
        # Q(s, a) = sum_i p_i * z_i
        q_values = (probs * self.support.unsqueeze(0).unsqueeze(0)).sum(dim=-1)
        return q_values
```
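One ingredient needed for training is not shown above: the Bellman-updated distribution over r + gamma * z must be projected back onto the fixed support before a cross-entropy loss can be computed. A minimal sketch of that projection, assuming `next_probs` holds the distributions of the greedy next actions (a batch of shape `(batch, n_atoms)`):

```python
def project_distribution(next_probs, rewards, dones, support, v_min, v_max, gamma):
    """Project the Bellman-updated distribution back onto the fixed support."""
    batch_size, n_atoms = next_probs.shape
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Bellman update of each atom, clamped to the support range
    tz = rewards.unsqueeze(1) + gamma * (1 - dones.float()).unsqueeze(1) * support.unsqueeze(0)
    tz = tz.clamp(v_min, v_max)

    # Fractional position of each updated atom on the support
    b = (tz - v_min) / delta_z
    lower = b.floor().long()
    upper = b.ceil().long()

    # When b lands exactly on an atom, lower == upper and its mass would be dropped;
    # nudge one index so the full probability is preserved.
    lower[(upper > 0) & (lower == upper)] -= 1
    upper[(lower < (n_atoms - 1)) & (lower == upper)] += 1

    # Distribute each atom's probability mass to its two neighbouring support atoms
    projected = torch.zeros_like(next_probs)
    offset = torch.arange(batch_size, device=next_probs.device).unsqueeze(1) * n_atoms
    projected.view(-1).index_add_(
        0, (lower + offset).view(-1), (next_probs * (upper.float() - b)).view(-1))
    projected.view(-1).index_add_(
        0, (upper + offset).view(-1), (next_probs * (b - lower.float())).view(-1))
    return projected
```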
Rainbow: Combining Everything
Rainbow is an algorithm that combines all six techniques above. Each technique contributes independently, and combining them produces synergistic effects.
Rainbow Components
- N-step returns: More accurate targets
- Double DQN: Prevents overestimation
- Noisy Networks: Automatic exploration management
- Prioritized Replay: Efficient experience utilization
- Dueling Architecture: V and A separation
- Categorical DQN: Value distribution learning
```python
class RainbowDQN(nn.Module):
    """Rainbow DQN: all extension techniques combined."""

    def __init__(self, obs_size, n_actions, n_atoms=51, v_min=-10, v_max=10):
        super().__init__()
        self.n_actions = n_actions
        self.n_atoms = n_atoms
        self.v_min = v_min
        self.v_max = v_max
        self.register_buffer('support', torch.linspace(v_min, v_max, n_atoms))

        # Shared feature extractor
        self.feature = nn.Sequential(
            nn.Linear(obs_size, 128),
            nn.ReLU(),
        )
        # Dueling + Noisy + Categorical
        # Value stream (noisy layers)
        self.value_noisy1 = NoisyLinear(128, 128)
        self.value_noisy2 = NoisyLinear(128, n_atoms)
        # Advantage stream (noisy layers)
        self.advantage_noisy1 = NoisyLinear(128, 128)
        self.advantage_noisy2 = NoisyLinear(128, n_actions * n_atoms)

    def forward(self, x):
        features = self.feature(x)
        # Value stream
        value = torch.relu(self.value_noisy1(features))
        value = self.value_noisy2(value).view(-1, 1, self.n_atoms)
        # Advantage stream
        advantage = torch.relu(self.advantage_noisy1(features))
        advantage = self.advantage_noisy2(advantage).view(-1, self.n_actions, self.n_atoms)
        # Dueling combination at the distribution level: Q = V + A - mean(A)
        q_dist = value + advantage - advantage.mean(dim=1, keepdim=True)
        probs = torch.softmax(q_dist, dim=-1)
        return probs

    def get_q_values(self, x):
        probs = self.forward(x)
        q_values = (probs * self.support.unsqueeze(0).unsqueeze(0)).sum(dim=-1)
        return q_values

    def reset_noise(self):
        self.value_noisy1.reset_noise()
        self.value_noisy2.reset_noise()
        self.advantage_noisy1.reset_noise()
        self.advantage_noisy2.reset_noise()
```
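A brief usage sketch with illustrative sizes: actions are chosen greedily from the expected value of each action's distribution, and noise is resampled between updates just as with NoisyDQN.

```python
# Illustrative: acting with the Rainbow network
rainbow = RainbowDQN(obs_size=4, n_actions=2)

state = torch.randn(1, 4)               # dummy observation batch
q_values = rainbow.get_q_values(state)  # (1, n_actions) expected returns
action = int(q_values.argmax(dim=1).item())

rainbow.reset_noise()  # resample noise after each training update
```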
Performance Contribution of Each Technique
The contribution of each technique in Atari games:
| Technique | Primary Effect | Contribution |
|---|---|---|
| Prioritized Replay | Improved learning efficiency | High |
| N-step Returns | Faster convergence | High |
| Categorical DQN | Value distribution learning | Medium-High |
| Dueling | State value separation | Medium |
| Double DQN | Overestimation prevention | Medium |
| Noisy Networks | Adaptive exploration | Medium |
Rainbow, which combines all of these techniques, clearly outperforms any individual extension. The gains are especially pronounced in data efficiency: it reaches much higher scores within the same number of frames.
Summary
- N-step DQN: Uses actual rewards from multiple steps to improve target accuracy
- Double DQN: Separates action selection and value evaluation to prevent overestimation
- Noisy Networks: Adaptive exploration per state through learnable noise
- Prioritized Replay: Learns more frequently from experiences with large TD errors
- Dueling DQN: Separates Q values into V(s) and A(s, a) for improved learning efficiency
- Categorical DQN: Learns the probability distribution of values for richer information
- Rainbow: Achieves best performance through combination of all six techniques
In the next article, we will apply reinforcement learning to a real financial problem and build a stock trading agent.