- Author: Youngju Kim (@fjvbn20031)
The Concept of Value
In reinforcement learning, "value" is a numerical measure of how good a particular state or state-action pair is. Accurately estimating this value enables selecting optimal actions.
State Value Function V(s)
The expected discounted cumulative reward when starting from state s and following policy pi:
V_pi(s) = E_pi[ r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s ]
Action Value Function Q(s, a)
The expected discounted cumulative reward when taking action a in state s, then following policy pi:
Q_pi(s, a) = E_pi[ r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s, a_t = a ]
Relationship Between V and Q
V and Q are closely connected:
V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
That is, the state value is the weighted average of Q values for each action according to the policy.
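As a quick numeric sanity check, the weighted average can be computed as a dot product. The probabilities and Q values below are purely illustrative:

```python
import numpy as np

# One state with two actions; the numbers are illustrative.
pi = np.array([0.25, 0.75])   # pi(a|s): action probabilities under the policy
q = np.array([2.0, 4.0])      # Q_pi(s, a) for each action

v = float(np.dot(pi, q))      # V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
print(v)  # 0.25*2.0 + 0.75*4.0 = 3.5
```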
Optimal Value Functions
The optimal value functions give, for each state (or state-action pair), the highest value achievable by any policy.
Optimal State Value Function
V*(s) = max_pi V_pi(s) = max_a Q*(s, a)
Optimal Action Value Function
Q*(s, a) = max_pi Q_pi(s, a)
If we know the optimal Q function, the optimal policy is simple -- just select the action with the highest Q value in each state:
pi*(s) = argmax_a Q*(s, a)
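With a tabular Q function, this greedy extraction is a single argmax per state. The Q* table below is a made-up example:

```python
import numpy as np

# Hypothetical Q* table for 3 states x 2 actions (values are illustrative).
q_star = np.array([[0.1, 0.5],
                   [0.9, 0.2],
                   [0.4, 0.4]])

pi_star = np.argmax(q_star, axis=1)  # pi*(s) = argmax_a Q*(s, a)
print(pi_star)  # [1 0 0] -- ties break toward the lower-indexed action
```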
Bellman Equation
The Bellman equation represents the recursive relationship of value functions. It is the core principle of dynamic programming.
Bellman Expectation Equation
The recursive relationship of the value function for policy pi:
V_pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V_pi(s') ]
Breaking this down, the value of the current state decomposes into:
- Probability of selecting action a according to current policy pi(a|s)
- Moving to next state s' according to transition probability P(s'|s,a)
- Receiving immediate reward R(s,a,s')
- Adding the discounted value of the next state V_pi(s')
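The decomposition above can be evaluated by hand for a toy MDP. Everything here, including the state and action names, is illustrative, with numbers chosen so the arithmetic is exact:

```python
# Toy MDP with one state s of interest (all numbers are illustrative).
# model maps action -> list of (P(s'|s,a), s', R(s,a,s')).
gamma = 0.5
policy = {"a0": 0.5, "a1": 0.5}                 # pi(a|s)
model = {
    "a0": [(1.0, "s1", 2.0)],
    "a1": [(0.5, "s1", 0.0), (0.5, "s2", 4.0)],
}
values = {"s1": 2.0, "s2": 4.0}                 # current estimates of V_pi(s')

# V_pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V_pi(s')]
v = sum(
    policy[a] * sum(p * (r + gamma * values[s_next]) for p, s_next, r in model[a])
    for a in policy
)
print(v)  # 0.5*3.0 + 0.5*3.5 = 3.25
```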
Bellman Optimality Equation
The recursive relationship for the optimal policy:
V*(s) = max_a sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V*(s') ]
The Bellman optimality equation for Q functions:
Q*(s, a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * max_{a'} Q*(s', a') ]
The key insight is that the current optimal value can be computed from the optimal value of the next state.
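One step of this backup can be sketched for a toy state: compute the expected backed-up value per action, then take the max. All names and numbers below are illustrative:

```python
# One Bellman optimality backup for a toy state (numbers are illustrative).
gamma = 0.5
values = {"s1": 4.0, "s2": 8.0}   # current estimates of V*(s')
# action -> list of (P(s'|s,a), s', R(s,a,s'))
model = {
    "a0": [(1.0, "s1", 1.0)],
    "a1": [(0.5, "s1", 0.0), (0.5, "s2", 2.0)],
}

# V*(s) = max_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V*(s')]
backups = {
    a: sum(p * (r + gamma * values[s_next]) for p, s_next, r in outcomes)
    for a, outcomes in model.items()
}
print(backups["a0"], backups["a1"])  # 3.0 4.0
print(max(backups.values()))         # 4.0
```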
Value of Action
To compute the value of taking action a in state s, we calculate the expected value over all possible next states:
Q(s, a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ]
This is important because if we know V(s'), we can compute Q(s, a), and through Q values we can select optimal actions.
```python
def compute_action_value(transition_probs, rewards, values, gamma, state, action):
    """Compute the action value Q(s, a).

    transition_probs: transition probabilities P(s'|s,a) as a nested dict
    rewards: rewards R(s,a,s')
    values: current state-value estimates V(s)
    """
    q_value = 0.0
    for next_state, prob in transition_probs[state][action].items():
        reward = rewards[state][action][next_state]
        q_value += prob * (reward + gamma * values[next_state])
    return q_value
```
Value Iteration
Value iteration finds the optimal value function by repeatedly applying the Bellman optimality equation.
Algorithm
1. Initialize the value of every state to zero: V(s) = 0
2. Repeat:
   a. For each state s:
      - Compute Q(s, a) for every action a
      - Update V(s) = max_a Q(s, a)
   b. Stop when the maximum change in value falls below a threshold
3. Optimal policy: pi*(s) = argmax_a Q(s, a)
Implementation
```python
import gymnasium as gym
import numpy as np
from collections import defaultdict

class ValueIteration:
    """Value iteration over a model estimated from experience."""

    def __init__(self, env, gamma=0.99):
        self.env = env
        self.gamma = gamma
        self.values = defaultdict(float)  # V(s)
        self.transition_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.reward_sums = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    def collect_experience(self, n_steps=10000):
        """Collect environment dynamics by taking random actions."""
        obs, _ = self.env.reset()
        for _ in range(n_steps):
            action = self.env.action_space.sample()
            next_obs, reward, terminated, truncated, _ = self.env.step(action)
            # Record transition counts and rewards
            self.transition_counts[obs][action][next_obs] += 1
            self.reward_sums[obs][action][next_obs] += reward
            if terminated or truncated:
                obs, _ = self.env.reset()
            else:
                obs = next_obs

    def compute_action_value(self, state, action):
        """Compute Q(s, a) from the estimated model."""
        total_transitions = sum(self.transition_counts[state][action].values())
        if total_transitions == 0:
            return 0.0
        q_value = 0.0
        for next_state, count in self.transition_counts[state][action].items():
            prob = count / total_transitions
            avg_reward = self.reward_sums[state][action][next_state] / count
            q_value += prob * (avg_reward + self.gamma * self.values[next_state])
        return q_value

    def value_iteration_step(self):
        """Run one sweep of value iteration."""
        max_delta = 0.0
        for state in self.transition_counts:
            q_values = [
                self.compute_action_value(state, action)
                for action in range(self.env.action_space.n)
            ]
            new_value = max(q_values) if q_values else 0.0
            delta = abs(new_value - self.values[state])
            max_delta = max(max_delta, delta)
            self.values[state] = new_value
        return max_delta

    def get_best_action(self, state):
        """Best action under the current value function."""
        q_values = [
            self.compute_action_value(state, action)
            for action in range(self.env.action_space.n)
        ]
        return np.argmax(q_values)

    def train(self, n_iterations=100, threshold=1e-4):
        """Run value iteration until convergence or the iteration limit."""
        print("Collecting experience...")
        self.collect_experience()
        print("Starting value iteration...")
        for i in range(n_iterations):
            delta = self.value_iteration_step()
            if i % 10 == 0:
                print(f"Iteration {i}: max delta = {delta:.6f}")
            if delta < threshold:
                print(f"Converged at iteration {i}! (delta: {delta:.6f})")
                break
```
Applying Value Iteration to FrozenLake
```python
def train_value_iteration_frozenlake():
    """Train and evaluate value iteration on FrozenLake."""
    env = gym.make("FrozenLake-v1", is_slippery=True)
    agent = ValueIteration(env, gamma=0.99)
    agent.train(n_iterations=100)

    # Visualize the learned value function
    print("\n=== Learned state values ===")
    for row in range(4):
        values = []
        for col in range(4):
            state = row * 4 + col
            values.append(f"{agent.values[state]:6.3f}")
        print(" | ".join(values))

    # Evaluate the learned policy
    print("\n=== Policy evaluation ===")
    n_eval = 1000
    successes = 0
    for _ in range(n_eval):
        obs, _ = env.reset()
        while True:
            action = agent.get_best_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if reward > 0:
                    successes += 1
                break
    success_rate = successes / n_eval
    print(f"Success rate: {success_rate:.1%} ({successes}/{n_eval})")

    # Visualize the learned policy
    action_symbols = ['L', 'D', 'R', 'U']  # FrozenLake action order: left, down, right, up
    print("\n=== Learned policy ===")
    for row in range(4):
        actions = []
        for col in range(4):
            state = row * 4 + col
            best_action = agent.get_best_action(state)
            actions.append(action_symbols[best_action])
        print(" ".join(actions))
    env.close()

# train_value_iteration_frozenlake()
```
Q-Learning
Q-learning is based on the Bellman optimality equation but learns Q values directly from experience without an environment model (model-free).
Q-Learning Update Rule
Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_{a'} Q(s', a') - Q(s, a) ]
Here alpha is the learning rate, and the bracketed term is called the temporal-difference (TD) error.
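Plugging illustrative numbers into the update rule (chosen so the arithmetic works out exactly):

```python
# One Q-learning update with illustrative numbers (alpha = 0.5, gamma = 0.5).
alpha, gamma = 0.5, 0.5
q_sa = 2.0         # current estimate Q(s, a)
reward = 1.0       # r
max_next_q = 4.0   # max_{a'} Q(s', a')

td_target = reward + gamma * max_next_q  # 1.0 + 0.5*4.0 = 3.0
td_error = td_target - q_sa              # 3.0 - 2.0 = 1.0
q_sa += alpha * td_error                 # 2.0 + 0.5*1.0 = 2.5
print(q_sa)  # 2.5
```

A positive TD error means the outcome was better than the current estimate, so Q(s, a) is nudged upward.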
Q-Learning Implementation
```python
class QLearningAgent:
    """Tabular Q-learning agent."""

    def __init__(self, n_states, n_actions, learning_rate=0.1,
                 gamma=0.99, epsilon=1.0, epsilon_decay=0.999,
                 epsilon_min=0.01):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.n_actions = n_actions

    def select_action(self, state):
        """Select an action with an epsilon-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, done):
        """Apply the Q-learning update."""
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(self.q_table[next_state])
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.lr * td_error
        return td_error

    def decay_epsilon(self):
        """Decay the exploration rate."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
```
Q-Learning on FrozenLake
```python
def train_q_learning_frozenlake():
    """Train Q-learning on FrozenLake."""
    env = gym.make("FrozenLake-v1", is_slippery=True)
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    agent = QLearningAgent(
        n_states=n_states,
        n_actions=n_actions,
        learning_rate=0.1,
        gamma=0.99,
        epsilon=1.0,
        epsilon_decay=0.9995,
        epsilon_min=0.01,
    )

    n_episodes = 20000
    rewards_history = []
    eval_interval = 1000

    for episode in range(n_episodes):
        obs, _ = env.reset()
        total_reward = 0
        while True:
            action = agent.select_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            agent.update(obs, action, reward, next_obs, terminated)
            total_reward += reward
            obs = next_obs
            if terminated or truncated:
                break
        agent.decay_epsilon()
        rewards_history.append(total_reward)
        if (episode + 1) % eval_interval == 0:
            recent_avg = np.mean(rewards_history[-eval_interval:])
            print(
                f"Episode {episode + 1}: "
                f"recent average reward={recent_avg:.3f}, "
                f"epsilon={agent.epsilon:.4f}"
            )

    # Final results
    print("\n=== Q-table (first 4 states) ===")
    action_names = ['left', 'down', 'right', 'up']  # FrozenLake action order
    for s in range(4):
        q_vals = ", ".join(f"{action_names[a]}={agent.q_table[s, a]:.3f}" for a in range(4))
        print(f"State {s}: {q_vals}")

    # Evaluate the learned policy
    agent.epsilon = 0  # disable exploration
    successes = 0
    for _ in range(1000):
        obs, _ = env.reset()
        while True:
            action = agent.select_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if reward > 0:
                    successes += 1
                break
    print(f"\nFinal success rate: {successes / 1000:.1%}")
    env.close()  # close the environment only after evaluation is done
    return agent

# agent = train_q_learning_frozenlake()
```
Comparing Value Iteration and Q-Learning
| Property | Value Iteration | Q-Learning |
|---|---|---|
| Environment model | Required (transition probabilities) | Not required (model-free) |
| Learning method | Sweeps over all states at once | One transition at a time |
| Convergence guarantee | Guaranteed | Guaranteed under conditions (sufficient exploration, decaying learning rate) |
| Large state spaces | Infeasible (table size) | Difficult (table size) |
| Scalability | Low | Medium (extensible with function approximation) |
Summary
- State value V(s): Expected cumulative reward starting from a state
- Action value Q(s, a): Expected cumulative reward after taking a specific action in a state
- Bellman equation: Recursive relationship of value functions, the foundation of dynamic programming
- Value iteration: Converges to optimal value function by repeatedly applying the Bellman optimality equation
- Q-learning: A representative algorithm that learns Q values in a model-free manner
Table-based methods can only be used when the state space is small. In the next article, we will cover Deep Q-Networks (DQN) which approximate Q functions with neural networks.