Split View: [심층 강화학습] 05. 벨만 방정식과 가치 반복

[심층 강화학습] 05. 벨만 방정식과 가치 반복

가치(Value)의 개념

강화학습에서 "가치"란 특정 상태 또는 상태-행동 쌍이 얼마나 좋은지를 나타내는 수치입니다. 이 값을 정확하게 추정하면 최적의 행동을 선택할 수 있습니다.

상태 가치 함수 V(s)

상태 s에서 시작하여 정책 pi를 따랐을 때 얻을 수 있는 기대 할인 누적 보상입니다.

V_pi(s) = E_pi[ r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s ]

행동 가치 함수 Q(s, a)

상태 s에서 행동 a를 취한 후, 정책 pi를 따랐을 때의 기대 할인 누적 보상입니다.

Q_pi(s, a) = E_pi[ r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s, a_t = a ]

V와 Q의 관계

V와 Q는 서로 밀접하게 연결되어 있습니다.

V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)

즉, 상태 가치는 각 행동의 Q값을 정책에 따라 가중 평균한 것입니다.

최적 가치 함수

모든 가능한 정책 중에서 가장 높은 가치를 제공하는 것이 최적 가치 함수입니다.

최적 상태 가치 함수

V*(s) = max_pi V_pi(s) = max_a Q*(s, a)

최적 행동 가치 함수

Q*(s, a) = max_pi Q_pi(s, a)

최적 Q함수를 알면 최적 정책은 간단합니다. 각 상태에서 Q값이 가장 높은 행동을 선택하면 됩니다.

pi*(s) = argmax_a Q*(s, a)

벨만 방정식 (Bellman Equation)

벨만 방정식은 가치 함수의 재귀적 관계를 나타냅니다. 동적 프로그래밍의 핵심 원리입니다.

벨만 기대 방정식

정책 pi에 대한 가치 함수의 재귀 관계입니다.

V_pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V_pi(s') ]

풀어서 설명하면, 현재 상태의 가치는 다음과 같이 분해됩니다.

현재 정책에 따라 행동 a를 선택할 확률 pi(a|s)
환경의 전이 확률 P(s'|s,a)에 따라 다음 상태 s'로 이동
즉시 보상 R(s,a,s')를 받고
다음 상태의 가치 V_pi(s')에 할인 인자를 곱하여 더함

벨만 최적 방정식

최적 정책에 대한 재귀 관계입니다.

V*(s) = max_a sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V*(s') ]

Q함수에 대한 벨만 최적 방정식은 다음과 같습니다.

Q*(s, a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * max_{a'} Q*(s', a') ]

이 방정식의 핵심은 현재의 최적 가치가 다음 상태의 최적 가치로부터 계산될 수 있다는 점입니다.

행동의 가치 (Value of Action)

상태 s에서 행동 a를 취했을 때의 가치를 구하려면, 가능한 모든 다음 상태에 대해 기대값을 계산합니다.

Q(s, a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ]

이것이 중요한 이유는, V(s')을 알고 있으면 Q(s, a)를 계산할 수 있고, Q값을 통해 최적 행동을 선택할 수 있기 때문입니다.

import numpy as np

def compute_action_value(transition_probs, rewards, values, gamma, state, action):
    """
    행동 가치 Q(s, a)를 계산합니다.

    transition_probs: 전이 확률 P(s'|s,a) - 딕셔너리
    rewards: 보상 R(s,a,s')
    values: 현재 상태 가치 추정치 V(s)
    """
    q_value = 0.0
    for next_state, prob in transition_probs[state][action].items():
        reward = rewards[state][action][next_state]
        q_value += prob * (reward + gamma * values[next_state])
    return q_value

가치 반복법 (Value Iteration)

벨만 최적 방정식을 반복적으로 적용하여 최적 가치 함수를 구하는 방법입니다.

알고리즘

1. 모든 상태의 가치를 0으로 초기화: V(s) = 0
2. 반복:
   a. 각 상태 s에 대해:
      - 모든 행동 a에 대해 Q(s,a) 계산
      - V(s) = max_a Q(s,a)로 업데이트
   b. 가치의 변화량이 임계값 이하이면 종료
3. 최적 정책: pi*(s) = argmax_a Q(s,a)

구현

import gymnasium as gym
import numpy as np
from collections import defaultdict

class ValueIteration:
    """가치 반복법 구현"""
    def __init__(self, env, gamma=0.99):
        self.env = env
        self.gamma = gamma
        self.values = defaultdict(float)  # V(s)
        self.transition_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.reward_sums = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    def collect_experience(self, n_steps=10000):
        """무작위 행동으로 환경 동역학을 수집"""
        obs, _ = self.env.reset()
        for _ in range(n_steps):
            action = self.env.action_space.sample()
            next_obs, reward, terminated, truncated, _ = self.env.step(action)

            # 전이 횟수와 보상 기록
            self.transition_counts[obs][action][next_obs] += 1
            self.reward_sums[obs][action][next_obs] += reward

            if terminated or truncated:
                obs, _ = self.env.reset()
            else:
                obs = next_obs

    def compute_action_value(self, state, action):
        """Q(s, a) 계산"""
        total_transitions = sum(self.transition_counts[state][action].values())
        if total_transitions == 0:
            return 0.0

        q_value = 0.0
        for next_state, count in self.transition_counts[state][action].items():
            prob = count / total_transitions
            avg_reward = self.reward_sums[state][action][next_state] / count
            q_value += prob * (avg_reward + self.gamma * self.values[next_state])
        return q_value

    def value_iteration_step(self):
        """가치 반복 한 스텝"""
        max_delta = 0.0
        for state in self.transition_counts:
            q_values = [
                self.compute_action_value(state, action)
                for action in range(self.env.action_space.n)
            ]
            new_value = max(q_values) if q_values else 0.0
            delta = abs(new_value - self.values[state])
            max_delta = max(max_delta, delta)
            self.values[state] = new_value
        return max_delta

    def get_best_action(self, state):
        """현재 가치 함수에 기반한 최적 행동"""
        q_values = [
            self.compute_action_value(state, action)
            for action in range(self.env.action_space.n)
        ]
        return np.argmax(q_values)

    def train(self, n_iterations=100, threshold=1e-4):
        """가치 반복 학습"""
        print("경험 수집 중...")
        self.collect_experience()

        print("가치 반복 시작...")
        for i in range(n_iterations):
            delta = self.value_iteration_step()
            if i % 10 == 0:
                print(f"반복 {i}: 최대 변화량 = {delta:.6f}")
            if delta < threshold:
                print(f"반복 {i}에서 수렴! (변화량: {delta:.6f})")
                break

FrozenLake에서 가치 반복법 적용

def train_value_iteration_frozenlake():
    """FrozenLake에서 가치 반복법 학습 및 평가"""
    env = gym.make("FrozenLake-v1", is_slippery=True)

    agent = ValueIteration(env, gamma=0.99)
    agent.train(n_iterations=100)

    # 학습된 가치 함수 시각화
    print("\n=== 학습된 상태 가치 ===")
    for row in range(4):
        values = []
        for col in range(4):
            state = row * 4 + col
            values.append(f"{agent.values[state]:6.3f}")
        print(" | ".join(values))

    # 학습된 정책으로 평가
    print("\n=== 정책 평가 ===")
    n_eval = 1000
    successes = 0

    for _ in range(n_eval):
        obs, _ = env.reset()
        while True:
            action = agent.get_best_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if reward > 0:
                    successes += 1
                break

    success_rate = successes / n_eval
    print(f"성공률: {success_rate:.1%} ({successes}/{n_eval})")

    # 학습된 정책 시각화
    action_symbols = ['U', 'D', 'L', 'R']  # 상, 하, 좌, 우
    print("\n=== 학습된 정책 ===")
    for row in range(4):
        actions = []
        for col in range(4):
            state = row * 4 + col
            best_action = agent.get_best_action(state)
            actions.append(action_symbols[best_action])
        print("  ".join(actions))

    env.close()

# train_value_iteration_frozenlake()

Q-러닝 (Q-Learning)

Q-러닝은 벨만 최적 방정식을 기반으로 하지만, 환경 모델(전이 확률) 없이 직접 경험으로부터 Q값을 학습하는 모델 프리 방법입니다.

Q-러닝 업데이트 규칙

Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_{a'} Q(s', a') - Q(s, a) ]

여기서 alpha는 학습률이고, 괄호 안의 값은 TD 오차(Temporal Difference error)라 합니다.

Q-러닝 구현

class QLearningAgent:
    """테이블 기반 Q-러닝 에이전트"""
    def __init__(self, n_states, n_actions, learning_rate=0.1,
                 gamma=0.99, epsilon=1.0, epsilon_decay=0.999,
                 epsilon_min=0.01):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.n_actions = n_actions

    def select_action(self, state):
        """엡실론-탐욕 정책으로 행동 선택"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, done):
        """Q-러닝 업데이트"""
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(self.q_table[next_state])

        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.lr * td_error

        return td_error

    def decay_epsilon(self):
        """탐색률 감소"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

FrozenLake에서 Q-러닝

def train_q_learning_frozenlake():
    """Q-러닝으로 FrozenLake 학습"""
    env = gym.make("FrozenLake-v1", is_slippery=True)
    n_states = env.observation_space.n
    n_actions = env.action_space.n

    agent = QLearningAgent(
        n_states=n_states,
        n_actions=n_actions,
        learning_rate=0.1,
        gamma=0.99,
        epsilon=1.0,
        epsilon_decay=0.9995,
        epsilon_min=0.01,
    )

    n_episodes = 20000
    rewards_history = []
    eval_interval = 1000

    for episode in range(n_episodes):
        obs, _ = env.reset()
        total_reward = 0

        while True:
            action = agent.select_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)

            agent.update(obs, action, reward, next_obs, terminated)
            total_reward += reward
            obs = next_obs

            if terminated or truncated:
                break

        agent.decay_epsilon()
        rewards_history.append(total_reward)

        if (episode + 1) % eval_interval == 0:
            recent_avg = np.mean(rewards_history[-eval_interval:])
            print(
                f"에피소드 {episode + 1}: "
                f"최근 평균 보상={recent_avg:.3f}, "
                f"엡실론={agent.epsilon:.4f}"
            )

    env.close()

    # 최종 결과
    print("\n=== Q 테이블 (첫 4개 상태) ===")
    action_names = ['상', '하', '좌', '우']
    for s in range(4):
        q_vals = ", ".join(f"{action_names[a]}={agent.q_table[s, a]:.3f}" for a in range(4))
        print(f"상태 {s}: {q_vals}")

    # 학습된 정책 평가
    agent.epsilon = 0  # 탐색 비활성화
    successes = 0
    for _ in range(1000):
        obs, _ = env.reset()
        while True:
            action = agent.select_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if reward > 0:
                    successes += 1
                break

    print(f"\n최종 성공률: {successes / 1000:.1%}")

    return agent

# agent = train_q_learning_frozenlake()

가치 반복과 Q-러닝 비교

특성	가치 반복	Q-러닝
환경 모델 필요	예 (전이 확률)	아니오
학습 방식	전체 상태 동시 업데이트	경험 하나씩 업데이트
수렴 보장	이론적으로 보장	이론적으로 보장 (조건 충족 시)
대규모 상태 공간	불가능 (테이블 크기)	어려움 (테이블 크기)
확장성	낮음	중간 (함수 근사로 확장 가능)

정리

상태 가치 V(s): 상태에서 시작하여 얻을 수 있는 기대 누적 보상
행동 가치 Q(s, a): 상태에서 특정 행동 후 얻을 수 있는 기대 누적 보상
벨만 방정식: 가치 함수의 재귀적 관계, 동적 프로그래밍의 기초
가치 반복법: 벨만 최적 방정식을 반복 적용하여 최적 가치 함수 수렴
Q-러닝: 모델 프리 방식으로 Q값을 학습하는 대표적 알고리즘

테이블 기반 방법은 상태 공간이 작을 때만 사용할 수 있습니다. 다음 글에서는 신경망으로 Q함수를 근사하는 Deep Q-Network(DQN)를 다루겠습니다.

[Deep RL] 05. Bellman Equation and Value Iteration

The Concept of Value

In reinforcement learning, "value" is a numerical measure of how good a particular state or state-action pair is. Accurately estimating this value enables selecting optimal actions.

State Value Function V(s)

The expected discounted cumulative reward when starting from state s and following policy pi:

V_pi(s) = E_pi[ r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s ]

Action Value Function Q(s, a)

The expected discounted cumulative reward when taking action a in state s, then following policy pi:

Q_pi(s, a) = E_pi[ r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s, a_t = a ]

Relationship Between V and Q

V and Q are closely connected:

V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)

That is, the state value is the weighted average of Q values for each action according to the policy.

Optimal Value Functions

The optimal value function provides the highest value among all possible policies.

Optimal State Value Function

V*(s) = max_pi V_pi(s) = max_a Q*(s, a)

Optimal Action Value Function

Q*(s, a) = max_pi Q_pi(s, a)

If we know the optimal Q function, the optimal policy is simple -- just select the action with the highest Q value in each state:

pi*(s) = argmax_a Q*(s, a)

Bellman Equation

The Bellman equation represents the recursive relationship of value functions. It is the core principle of dynamic programming.

Bellman Expectation Equation

The recursive relationship of the value function for policy pi:

V_pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V_pi(s') ]

Breaking this down, the value of the current state decomposes into:

Probability of selecting action a according to current policy pi(a|s)
Moving to next state s' according to transition probability P(s'|s,a)
Receiving immediate reward R(s,a,s')
Adding the discounted value of the next state V_pi(s')

Bellman Optimality Equation

The recursive relationship for the optimal policy:

V*(s) = max_a sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V*(s') ]

The Bellman optimality equation for Q functions:

Q*(s, a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * max_{a'} Q*(s', a') ]

The key insight is that the current optimal value can be computed from the optimal value of the next state.

Value of Action

To compute the value of taking action a in state s, we calculate the expected value over all possible next states:

Q(s, a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ]

This is important because if we know V(s'), we can compute Q(s, a), and through Q values we can select optimal actions.

import numpy as np

def compute_action_value(transition_probs, rewards, values, gamma, state, action):
    """
    행동 가치 Q(s, a)를 계산합니다.

    transition_probs: 전이 확률 P(s'|s,a) - 딕셔너리
    rewards: 보상 R(s,a,s')
    values: 현재 상태 가치 추정치 V(s)
    """
    q_value = 0.0
    for next_state, prob in transition_probs[state][action].items():
        reward = rewards[state][action][next_state]
        q_value += prob * (reward + gamma * values[next_state])
    return q_value

Value Iteration

Value iteration finds the optimal value function by repeatedly applying the Bellman optimality equation.

Algorithm

1. 모든 상태의 가치를 0으로 초기화: V(s) = 0
2. 반복:
   a. 각 상태 s에 대해:
      - 모든 행동 a에 대해 Q(s,a) 계산
      - V(s) = max_a Q(s,a)로 업데이트
   b. 가치의 변화량이 임계값 이하이면 종료
3. 최적 정책: pi*(s) = argmax_a Q(s,a)

Implementation

import gymnasium as gym
import numpy as np
from collections import defaultdict

class ValueIteration:
    """가치 반복법 구현"""
    def __init__(self, env, gamma=0.99):
        self.env = env
        self.gamma = gamma
        self.values = defaultdict(float)  # V(s)
        self.transition_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.reward_sums = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    def collect_experience(self, n_steps=10000):
        """무작위 행동으로 환경 동역학을 수집"""
        obs, _ = self.env.reset()
        for _ in range(n_steps):
            action = self.env.action_space.sample()
            next_obs, reward, terminated, truncated, _ = self.env.step(action)

            # 전이 횟수와 보상 기록
            self.transition_counts[obs][action][next_obs] += 1
            self.reward_sums[obs][action][next_obs] += reward

            if terminated or truncated:
                obs, _ = self.env.reset()
            else:
                obs = next_obs

    def compute_action_value(self, state, action):
        """Q(s, a) 계산"""
        total_transitions = sum(self.transition_counts[state][action].values())
        if total_transitions == 0:
            return 0.0

        q_value = 0.0
        for next_state, count in self.transition_counts[state][action].items():
            prob = count / total_transitions
            avg_reward = self.reward_sums[state][action][next_state] / count
            q_value += prob * (avg_reward + self.gamma * self.values[next_state])
        return q_value

    def value_iteration_step(self):
        """가치 반복 한 스텝"""
        max_delta = 0.0
        for state in self.transition_counts:
            q_values = [
                self.compute_action_value(state, action)
                for action in range(self.env.action_space.n)
            ]
            new_value = max(q_values) if q_values else 0.0
            delta = abs(new_value - self.values[state])
            max_delta = max(max_delta, delta)
            self.values[state] = new_value
        return max_delta

    def get_best_action(self, state):
        """현재 가치 함수에 기반한 최적 행동"""
        q_values = [
            self.compute_action_value(state, action)
            for action in range(self.env.action_space.n)
        ]
        return np.argmax(q_values)

    def train(self, n_iterations=100, threshold=1e-4):
        """가치 반복 학습"""
        print("경험 수집 중...")
        self.collect_experience()

        print("가치 반복 시작...")
        for i in range(n_iterations):
            delta = self.value_iteration_step()
            if i % 10 == 0:
                print(f"반복 {i}: 최대 변화량 = {delta:.6f}")
            if delta < threshold:
                print(f"반복 {i}에서 수렴! (변화량: {delta:.6f})")
                break

Applying Value Iteration to FrozenLake

def train_value_iteration_frozenlake():
    """FrozenLake에서 가치 반복법 학습 및 평가"""
    env = gym.make("FrozenLake-v1", is_slippery=True)

    agent = ValueIteration(env, gamma=0.99)
    agent.train(n_iterations=100)

    # 학습된 가치 함수 시각화
    print("\n=== 학습된 상태 가치 ===")
    for row in range(4):
        values = []
        for col in range(4):
            state = row * 4 + col
            values.append(f"{agent.values[state]:6.3f}")
        print(" | ".join(values))

    # 학습된 정책으로 평가
    print("\n=== 정책 평가 ===")
    n_eval = 1000
    successes = 0

    for _ in range(n_eval):
        obs, _ = env.reset()
        while True:
            action = agent.get_best_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if reward > 0:
                    successes += 1
                break

    success_rate = successes / n_eval
    print(f"성공률: {success_rate:.1%} ({successes}/{n_eval})")

    # 학습된 정책 시각화
    action_symbols = ['U', 'D', 'L', 'R']  # 상, 하, 좌, 우
    print("\n=== 학습된 정책 ===")
    for row in range(4):
        actions = []
        for col in range(4):
            state = row * 4 + col
            best_action = agent.get_best_action(state)
            actions.append(action_symbols[best_action])
        print("  ".join(actions))

    env.close()

# train_value_iteration_frozenlake()

Q-Learning

Q-learning is based on the Bellman optimality equation but learns Q values directly from experience without an environment model (model-free).

Q-Learning Update Rule

Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_{a'} Q(s', a') - Q(s, a) ]

Here alpha is the learning rate, and the term in brackets is called the TD error (Temporal Difference error).

Q-Learning Implementation

class QLearningAgent:
    """테이블 기반 Q-러닝 에이전트"""
    def __init__(self, n_states, n_actions, learning_rate=0.1,
                 gamma=0.99, epsilon=1.0, epsilon_decay=0.999,
                 epsilon_min=0.01):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.n_actions = n_actions

    def select_action(self, state):
        """엡실론-탐욕 정책으로 행동 선택"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, done):
        """Q-러닝 업데이트"""
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(self.q_table[next_state])

        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.lr * td_error

        return td_error

    def decay_epsilon(self):
        """탐색률 감소"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

Q-Learning on FrozenLake

def train_q_learning_frozenlake():
    """Q-러닝으로 FrozenLake 학습"""
    env = gym.make("FrozenLake-v1", is_slippery=True)
    n_states = env.observation_space.n
    n_actions = env.action_space.n

    agent = QLearningAgent(
        n_states=n_states,
        n_actions=n_actions,
        learning_rate=0.1,
        gamma=0.99,
        epsilon=1.0,
        epsilon_decay=0.9995,
        epsilon_min=0.01,
    )

    n_episodes = 20000
    rewards_history = []
    eval_interval = 1000

    for episode in range(n_episodes):
        obs, _ = env.reset()
        total_reward = 0

        while True:
            action = agent.select_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)

            agent.update(obs, action, reward, next_obs, terminated)
            total_reward += reward
            obs = next_obs

            if terminated or truncated:
                break

        agent.decay_epsilon()
        rewards_history.append(total_reward)

        if (episode + 1) % eval_interval == 0:
            recent_avg = np.mean(rewards_history[-eval_interval:])
            print(
                f"에피소드 {episode + 1}: "
                f"최근 평균 보상={recent_avg:.3f}, "
                f"엡실론={agent.epsilon:.4f}"
            )

    env.close()

    # 최종 결과
    print("\n=== Q 테이블 (첫 4개 상태) ===")
    action_names = ['상', '하', '좌', '우']
    for s in range(4):
        q_vals = ", ".join(f"{action_names[a]}={agent.q_table[s, a]:.3f}" for a in range(4))
        print(f"상태 {s}: {q_vals}")

    # 학습된 정책 평가
    agent.epsilon = 0  # 탐색 비활성화
    successes = 0
    for _ in range(1000):
        obs, _ = env.reset()
        while True:
            action = agent.select_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if reward > 0:
                    successes += 1
                break

    print(f"\n최종 성공률: {successes / 1000:.1%}")

    return agent

# agent = train_q_learning_frozenlake()

Comparing Value Iteration and Q-Learning

Property	Value Iteration	Q-Learning
Environment model	Yes (transition probabilities)	No
Learning method	Simultaneous state updates	One experience at a time
Convergence guarantee	Theoretically guaranteed	Theoretically guaranteed (with conditions)
Large state spaces	Impossible (table size)	Difficult (table size)
Scalability	Low	Medium (extensible with function approx.)

Summary

State value V(s): Expected cumulative reward starting from a state
Action value Q(s, a): Expected cumulative reward after taking a specific action in a state
Bellman equation: Recursive relationship of value functions, the foundation of dynamic programming
Value iteration: Converges to optimal value function by repeatedly applying the Bellman optimality equation
Q-learning: A representative algorithm that learns Q values in a model-free manner

Table-based methods can only be used when the state space is small. In the next article, we will cover Deep Q-Networks (DQN) which approximate Q functions with neural networks.