
[Deep RL] 05. Bellman Equation and Value Iteration


The Concept of Value

In reinforcement learning, "value" is a numerical measure of how good a particular state or state-action pair is. Accurately estimating this value enables selecting optimal actions.

State Value Function V(s)

The expected discounted cumulative reward when starting from state s and following policy pi:

V_pi(s) = E_pi[ r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s ]

Action Value Function Q(s, a)

The expected discounted cumulative reward when taking action a in state s, then following policy pi:

Q_pi(s, a) = E_pi[ r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s, a_t = a ]

Relationship Between V and Q

V and Q are closely connected:

V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)

That is, the state value is the weighted average of Q values for each action according to the policy.
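As a quick numeric sketch of this weighted average (the Q values and policy probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical Q values for one state with two actions
q = np.array([1.0, 3.0])      # Q(s, a0), Q(s, a1)
pi = np.array([0.25, 0.75])   # pi(a|s): action probabilities under the policy

# V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
v = float(np.dot(pi, q))
print(v)  # 2.5
```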


Optimal Value Functions

The optimal value function gives, for each state, the highest value achievable by any policy.

Optimal State Value Function

V*(s) = max_pi V_pi(s) = max_a Q*(s, a)

Optimal Action Value Function

Q*(s, a) = max_pi Q_pi(s, a)

If we know the optimal Q function, the optimal policy is simple -- just select the action with the highest Q value in each state:

pi*(s) = argmax_a Q*(s, a)
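Extracting this greedy policy from a Q table is a one-liner. A minimal sketch with a made-up 3-state, 2-action Q table:

```python
import numpy as np

# Hypothetical optimal Q table: 3 states x 2 actions
q_star = np.array([
    [0.1, 0.9],
    [0.7, 0.2],
    [0.4, 0.4],  # ties resolve to the lowest action index
])

# pi*(s) = argmax_a Q*(s, a), computed for every state at once
greedy_policy = np.argmax(q_star, axis=1)
print(greedy_policy)  # [1 0 0]
```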

Bellman Equation

The Bellman equation represents the recursive relationship of value functions. It is the core principle of dynamic programming.

Bellman Expectation Equation

The recursive relationship of the value function for policy pi:

V_pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V_pi(s') ]

Breaking this down, the value of the current state decomposes into:

  1. Probability of selecting action a according to current policy pi(a|s)
  2. Moving to next state s' according to transition probability P(s'|s,a)
  3. Receiving immediate reward R(s,a,s')
  4. Adding the discounted value of the next state V_pi(s')
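The four steps above can be sketched as a single Bellman expectation backup for one state. The nested-dict layouts for `policy`, `P`, and `R` are hypothetical, chosen only to mirror the formula:

```python
def bellman_expectation_backup(state, policy, P, R, V, gamma=0.99):
    """One Bellman expectation backup: returns the updated V_pi(state).

    policy[s][a]  -> pi(a|s)      (hypothetical dict layouts,
    P[s][a][s']   -> P(s'|s,a)     for illustration only)
    R[s][a][s']   -> R(s,a,s')
    V[s']         -> current value estimate
    """
    v = 0.0
    for a, pi_a in policy[state].items():            # 1. action probability
        for s_next, p in P[state][a].items():        # 2. transition probability
            r = R[state][a][s_next]                  # 3. immediate reward
            v += pi_a * p * (r + gamma * V[s_next])  # 4. discounted next value
    return v

# Tiny one-state example: a self-loop with reward 1 and V(s)=0
policy = {"s": {"a": 1.0}}
P = {"s": {"a": {"s": 1.0}}}
R = {"s": {"a": {"s": 1.0}}}
V = {"s": 0.0}
print(bellman_expectation_backup("s", policy, P, R, V, gamma=0.5))  # 1.0
```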

Bellman Optimality Equation

The recursive relationship for the optimal policy:

V*(s) = max_a sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V*(s') ]

The Bellman optimality equation for Q functions:

Q*(s, a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * max_{a'} Q*(s', a') ]

The key insight is that the current optimal value can be computed from the optimal value of the next state.
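A tiny deterministic example of this backup (all numbers made up): from state s, action a0 reaches s1 with reward 1 and a1 reaches s2 with reward 0, and the next-state optimal values are assumed known.

```python
gamma = 0.9
v_next = {"s1": 2.0, "s2": 5.0}  # assumed optimal values of the next states

# One-step lookahead: Q*(s, a) = R + gamma * V*(s') for a deterministic transition
q_a0 = 1.0 + gamma * v_next["s1"]  # 2.8
q_a1 = 0.0 + gamma * v_next["s2"]  # 4.5

# V*(s) = max_a Q*(s, a)
v_star = max(q_a0, q_a1)
print(v_star)  # 4.5
```

The immediately greedy action (a0, reward 1) loses to a1, because the backup weighs the discounted value of the next state, not just the immediate reward.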


Value of Action

To compute the value of taking action a in state s, we calculate the expected value over all possible next states:

Q(s, a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ]

This is important because if we know V(s'), we can compute Q(s, a), and through Q values we can select optimal actions.

import numpy as np

def compute_action_value(transition_probs, rewards, values, gamma, state, action):
    """
    Computes the action value Q(s, a).

    transition_probs: transition probabilities P(s'|s,a) - nested dict
    rewards: rewards R(s,a,s')
    values: current state-value estimates V(s)
    """
    q_value = 0.0
    for next_state, prob in transition_probs[state][action].items():
        reward = rewards[state][action][next_state]
        q_value += prob * (reward + gamma * values[next_state])
    return q_value

Value Iteration

Value iteration finds the optimal value function by repeatedly applying the Bellman optimality equation.

Algorithm

1. Initialize all state values to zero: V(s) = 0
2. Repeat:
   a. For each state s:
      - Compute Q(s,a) for every action a
      - Update V(s) = max_a Q(s,a)
   b. Stop when the maximum change in values falls below a threshold
3. Optimal policy: pi*(s) = argmax_a Q(s,a)

Implementation

import gymnasium as gym
import numpy as np
from collections import defaultdict

class ValueIteration:
    """가치 반복법 구현"""
    def __init__(self, env, gamma=0.99):
        self.env = env
        self.gamma = gamma
        self.values = defaultdict(float)  # V(s)
        self.transition_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.reward_sums = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    def collect_experience(self, n_steps=10000):
        """무작위 행동으로 환경 동역학을 수집"""
        obs, _ = self.env.reset()
        for _ in range(n_steps):
            action = self.env.action_space.sample()
            next_obs, reward, terminated, truncated, _ = self.env.step(action)

            # Record transition counts and rewards
            self.transition_counts[obs][action][next_obs] += 1
            self.reward_sums[obs][action][next_obs] += reward

            if terminated or truncated:
                obs, _ = self.env.reset()
            else:
                obs = next_obs

    def compute_action_value(self, state, action):
        """Q(s, a) 계산"""
        total_transitions = sum(self.transition_counts[state][action].values())
        if total_transitions == 0:
            return 0.0

        q_value = 0.0
        for next_state, count in self.transition_counts[state][action].items():
            prob = count / total_transitions
            avg_reward = self.reward_sums[state][action][next_state] / count
            q_value += prob * (avg_reward + self.gamma * self.values[next_state])
        return q_value

    def value_iteration_step(self):
        """가치 반복 한 스텝"""
        max_delta = 0.0
        for state in self.transition_counts:
            q_values = [
                self.compute_action_value(state, action)
                for action in range(self.env.action_space.n)
            ]
            new_value = max(q_values) if q_values else 0.0
            delta = abs(new_value - self.values[state])
            max_delta = max(max_delta, delta)
            self.values[state] = new_value
        return max_delta

    def get_best_action(self, state):
        """현재 가치 함수에 기반한 최적 행동"""
        q_values = [
            self.compute_action_value(state, action)
            for action in range(self.env.action_space.n)
        ]
        return np.argmax(q_values)

    def train(self, n_iterations=100, threshold=1e-4):
        """가치 반복 학습"""
        print("경험 수집 중...")
        self.collect_experience()

        print("가치 반복 시작...")
        for i in range(n_iterations):
            delta = self.value_iteration_step()
            if i % 10 == 0:
                print(f"반복 {i}: 최대 변화량 = {delta:.6f}")
            if delta < threshold:
                print(f"반복 {i}에서 수렴! (변화량: {delta:.6f})")
                break

Applying Value Iteration to FrozenLake

def train_value_iteration_frozenlake():
    """FrozenLake에서 가치 반복법 학습 및 평가"""
    env = gym.make("FrozenLake-v1", is_slippery=True)

    agent = ValueIteration(env, gamma=0.99)
    agent.train(n_iterations=100)

    # Visualize the learned value function
    print("\n=== Learned state values ===")
    for row in range(4):
        values = []
        for col in range(4):
            state = row * 4 + col
            values.append(f"{agent.values[state]:6.3f}")
        print(" | ".join(values))

    # Evaluate with the learned policy
    print("\n=== Policy evaluation ===")
    n_eval = 1000
    successes = 0

    for _ in range(n_eval):
        obs, _ = env.reset()
        while True:
            action = agent.get_best_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if reward > 0:
                    successes += 1
                break

    success_rate = successes / n_eval
    print(f"성공률: {success_rate:.1%} ({successes}/{n_eval})")

    # Visualize the learned policy
    action_symbols = ['L', 'D', 'R', 'U']  # FrozenLake action order: 0=left, 1=down, 2=right, 3=up
    print("\n=== Learned policy ===")
    for row in range(4):
        actions = []
        for col in range(4):
            state = row * 4 + col
            best_action = agent.get_best_action(state)
            actions.append(action_symbols[best_action])
        print("  ".join(actions))

    env.close()

# train_value_iteration_frozenlake()

Q-Learning

Q-learning is based on the Bellman optimality equation but learns Q values directly from experience without an environment model (model-free).

Q-Learning Update Rule

Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_{a'} Q(s', a') - Q(s, a) ]

Here alpha is the learning rate, and the bracketed term is the temporal-difference (TD) error: the gap between the bootstrapped target r + gamma * max_{a'} Q(s', a') and the current estimate Q(s, a).
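One update of this rule, worked through with made-up numbers (a 2-state, 2-action Q table and an assumed transition):

```python
import numpy as np

alpha, gamma = 0.1, 0.99
q_table = np.zeros((2, 2))
q_table[1] = [0.5, 1.0]  # hypothetical Q values for the next state s'=1

# Observed transition: state 0, action 0, reward 1, next state 1 (not terminal)
s, a, r, s_next = 0, 0, 1.0, 1
td_target = r + gamma * np.max(q_table[s_next])  # 1 + 0.99 * 1.0 = 1.99
td_error = td_target - q_table[s, a]             # 1.99 - 0.0 = 1.99
q_table[s, a] += alpha * td_error                # move 10% of the way to the target
print(q_table[s, a])
```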

Q-Learning Implementation

class QLearningAgent:
    """테이블 기반 Q-러닝 에이전트"""
    def __init__(self, n_states, n_actions, learning_rate=0.1,
                 gamma=0.99, epsilon=1.0, epsilon_decay=0.999,
                 epsilon_min=0.01):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.n_actions = n_actions

    def select_action(self, state):
        """엡실론-탐욕 정책으로 행동 선택"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, done):
        """Q-러닝 업데이트"""
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(self.q_table[next_state])

        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.lr * td_error

        return td_error

    def decay_epsilon(self):
        """탐색률 감소"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

Q-Learning on FrozenLake

def train_q_learning_frozenlake():
    """Q-러닝으로 FrozenLake 학습"""
    env = gym.make("FrozenLake-v1", is_slippery=True)
    n_states = env.observation_space.n
    n_actions = env.action_space.n

    agent = QLearningAgent(
        n_states=n_states,
        n_actions=n_actions,
        learning_rate=0.1,
        gamma=0.99,
        epsilon=1.0,
        epsilon_decay=0.9995,
        epsilon_min=0.01,
    )

    n_episodes = 20000
    rewards_history = []
    eval_interval = 1000

    for episode in range(n_episodes):
        obs, _ = env.reset()
        total_reward = 0

        while True:
            action = agent.select_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)

            agent.update(obs, action, reward, next_obs, terminated)
            total_reward += reward
            obs = next_obs

            if terminated or truncated:
                break

        agent.decay_epsilon()
        rewards_history.append(total_reward)

        if (episode + 1) % eval_interval == 0:
            recent_avg = np.mean(rewards_history[-eval_interval:])
            print(
                f"Episode {episode + 1}: "
                f"recent mean reward={recent_avg:.3f}, "
                f"epsilon={agent.epsilon:.4f}"
            )

    # Final results
    print("\n=== Q table (first 4 states) ===")
    action_names = ['L', 'D', 'R', 'U']  # FrozenLake action order: 0=left, 1=down, 2=right, 3=up
    for s in range(4):
        q_vals = ", ".join(f"{action_names[a]}={agent.q_table[s, a]:.3f}" for a in range(4))
        print(f"State {s}: {q_vals}")

    # Evaluate the learned policy
    agent.epsilon = 0  # disable exploration
    successes = 0
    for _ in range(1000):
        obs, _ = env.reset()
        while True:
            action = agent.select_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if reward > 0:
                    successes += 1
                break

    env.close()  # close only after evaluation, which still steps the environment

    print(f"\nFinal success rate: {successes / 1000:.1%}")

    return agent

# agent = train_q_learning_frozenlake()

Comparing Value Iteration and Q-Learning

Property              | Value Iteration                | Q-Learning
--------------------- | ------------------------------ | ----------------------------------------------
Environment model     | Yes (transition probabilities) | No
Learning method       | Simultaneous state updates     | One experience at a time
Convergence guarantee | Theoretically guaranteed       | Theoretically guaranteed (under conditions)
Large state spaces    | Infeasible (table size)        | Difficult (table size)
Scalability           | Low                            | Medium (extensible with function approximation)

Summary

  1. State value V(s): Expected cumulative reward starting from a state
  2. Action value Q(s, a): Expected cumulative reward after taking a specific action in a state
  3. Bellman equation: Recursive relationship of value functions, the foundation of dynamic programming
  4. Value iteration: Converges to optimal value function by repeatedly applying the Bellman optimality equation
  5. Q-learning: A representative algorithm that learns Q values in a model-free manner

Table-based methods are practical only when the state space is small. In the next article, we will cover Deep Q-Networks (DQN), which approximate the Q function with a neural network.