[Deep RL] 01. What is Reinforcement Learning: MDP and Fundamental Concepts

Three Paradigms of Machine Learning

Machine learning is broadly divided into three learning paradigms, each suited to a different kind of problem.

Supervised Learning

Supervised learning trains on input-label pairs. Image classification and text translation are typical examples. The key is that "ground truth labels" exist.

Unsupervised Learning

Unsupervised learning discovers hidden structures in data without labels. Clustering, dimensionality reduction, and generative models fall into this category.

Reinforcement Learning

Reinforcement learning is fundamentally different from the previous two approaches. An agent learns through trial and error by interacting with an environment. Correct answers are not explicitly given; instead, the agent indirectly learns which actions are good through reward signals.

Property	Supervised Learning	Unsupervised Learning	Reinforcement Learning
Data	Input-label pairs	Input only	State-action-reward
Feedback	Immediate (labels)	None	Delayed rewards
Goal	Prediction accuracy	Structure discovery	Maximize cumulative reward
Example	Image classification	Clustering	Game playing

Core Components of Reinforcement Learning

A reinforcement learning system consists of the following components.

Agent

The agent is the entity that selects and executes actions within the environment. Players in games and robot control systems are examples of agents.

Environment

The environment is the external world with which the agent interacts. It responds to the agent's actions by providing new states and rewards.

Action

An action is one of the choices available to the agent at each time step. The action space can be discrete (up, down, left, right) or continuous (steering angle, acceleration).

Observation

An observation is the information the agent receives from the environment. The agent may see the entire environment state (full observability) or only part of it (partial observability).
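To make the difference between the two cases concrete, here is a minimal sketch. It assumes a hypothetical 4x4 grid stored as a NumPy array and an agent that, in the partial case, sees only the cells within a fixed radius of its position. The function names, the `radius` parameter, and the `fill` value for out-of-grid cells are illustrative choices, not part of any library.

```python
import numpy as np

def full_observation(grid):
    """Fully observable case: the agent receives the whole grid."""
    return grid.copy()

def partial_observation(grid, agent_pos, radius=1, fill=-1):
    """Partially observable case: only cells within `radius` of the agent.

    Cells that fall outside the grid are reported as `fill`.
    """
    padded = np.pad(grid, radius, mode="constant", constant_values=fill)
    row, col = agent_pos
    # After padding, the agent sits at (row + radius, col + radius),
    # so the window starts at (row, col) in padded coordinates.
    size = 2 * radius + 1
    return padded[row:row + size, col:col + size]

# Hypothetical grid whose cell values are just their indices 0..15.
grid = np.arange(16).reshape(4, 4)
corner_view = partial_observation(grid, (0, 0))
print(full_observation(grid).shape)
print(corner_view)
```

In the partial case two different grids can produce the same window, which is why a single observation may not be enough to identify the underlying state.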

Reward

A scalar feedback signal that the environment provides in response to the agent's action. The ultimate goal of reinforcement learning is to maximize the cumulative reward over time.

The interaction flow between agent and environment can be simply expressed as follows:

Agent --[Action]--> Environment --[Observation, Reward]--> Agent --[Action]--> ...

Basic Code Structure of Agent-Environment Interaction

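# Basic reinforcement learning loop.
# `env` and `agent` are placeholders; the 4-tuple returned by step()
# follows the classic OpenAI Gym interface.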
total_reward = 0
obs = env.reset()

while True:
    action = agent.select_action(obs)
    next_obs, reward, done, info = env.step(action)
    total_reward += reward
    agent.learn(obs, action, reward, next_obs, done)
    obs = next_obs
    if done:
        break

print(f"Episode finished, total reward: {total_reward}")
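The loop above leaves `env` and `agent` undefined. The sketch below fills them in with two hypothetical stand-ins so that the loop can actually be run: `CoinGuessEnv` (a toy task where the agent guesses a coin flip) and `RandomAgent` (which never learns). Both classes are invented for illustration and only mimic the reset/step interface used above.

```python
import random

class CoinGuessEnv:
    """Hypothetical toy environment: guess a coin flip at every step."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # the observation carries no information in this toy task

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return 0, reward, done, {}  # (next_obs, reward, done, info)


class RandomAgent:
    """Picks actions uniformly at random and learns nothing."""

    def __init__(self, n_actions, seed=1):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def select_action(self, obs):
        return self.rng.randrange(self.n_actions)

    def learn(self, obs, action, reward, next_obs, done):
        pass  # a real agent would update its policy or value estimates here


def run_episode(env, agent):
    """Run one episode and return (total_reward, number_of_steps)."""
    total_reward, steps = 0.0, 0
    obs = env.reset()
    done = False
    while not done:
        action = agent.select_action(obs)
        next_obs, reward, done, info = env.step(action)
        agent.learn(obs, action, reward, next_obs, done)
        total_reward += reward
        steps += 1
        obs = next_obs
    return total_reward, steps


total_reward, steps = run_episode(CoinGuessEnv(episode_length=10), RandomAgent(n_actions=2))
print(f"Episode finished after {steps} steps, total reward: {total_reward}")
```

Swapping in a learning agent only requires a different `select_action` and a non-empty `learn`; the loop itself stays the same.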

Deep Understanding of Rewards

Discounted Reward

Future rewards are more uncertain than present rewards. To reflect this, we use a discount factor gamma (a value between 0 and 1).

The discounted cumulative reward G_t at time step t is calculated as follows:

G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + gamma^3 * r_{t+3} + ...

When gamma is close to 0, the agent prioritizes immediate rewards only; when close to 1, it treats distant future rewards nearly equally to present ones.

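def calculate_discounted_return(rewards, gamma=0.99):
    """Compute the discounted return G_t for every time step t."""
    G = 0.0
    returns = []
    # Walk backwards so each step can reuse G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()  # restore chronological order
    return returns

# Example: a reward of 1 at every step
rewards = [1, 1, 1, 1, 1]
returns = calculate_discounted_return(rewards, gamma=0.99)
print(f"Discounted returns: {returns}")
# Output (rounded to two decimals): [4.90, 3.94, 2.97, 1.99, 1.00]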
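To see how gamma changes what the agent cares about, the following sketch evaluates one hypothetical episode under several discount factors. The reward sequence (nine steps of zero reward followed by a single reward of 10) is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum over k of gamma^k * r_k for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical episode: nine steps without reward, then a reward of 10.
delayed = [0.0] * 9 + [10.0]

for gamma in (0.0, 0.5, 0.9, 0.99):
    print(f"gamma={gamma:<4} -> G_0 = {discounted_return(delayed, gamma):.4f}")
```

With gamma = 0 the delayed reward is invisible from the first step, while with gamma close to 1 most of it is still counted. The same reward of 10 is worth 10 * gamma^9 when viewed from nine steps earlier.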

Markov Process

The mathematical foundation of reinforcement learning starts from the Markov property.

Markov Property

The Markov property states that the future state depends only on the current state, not on the history of past states. Mathematically:

P(s_{t+1} | s_t) = P(s_{t+1} | s_1, s_2, ..., s_t)

In other words, knowing the current state s_t is sufficient to predict the future.

Markov Chain

A Markov chain is a sequence of states that satisfies the Markov property. It is defined by a finite set of states S and a transition probability matrix P.

import numpy as np

# Weather Markov chain example
# States: sunny (0), cloudy (1), rainy (2)
transition_matrix = np.array([
    [0.7, 0.2, 0.1],  # sunny  -> sunny, cloudy, rainy
    [0.3, 0.4, 0.3],  # cloudy -> sunny, cloudy, rainy
    [0.2, 0.3, 0.5],  # rainy  -> sunny, cloudy, rainy
])

states = ['sunny', 'cloudy', 'rainy']

def simulate_markov_chain(transition_matrix, initial_state, n_steps):
    """Simulate a Markov chain for n_steps transitions."""
    current = initial_state
    trajectory = [current]
    for _ in range(n_steps):
        # Sample the next state from the row of the current state.
        current = np.random.choice(
            len(transition_matrix),
            p=transition_matrix[current]
        )
        trajectory.append(current)
    return trajectory

# Simulate 10 days starting from sunny
trajectory = simulate_markov_chain(transition_matrix, 0, 10)
print("Weather over time:", [states[s] for s in trajectory])
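Sampling is not the only way to use a transition matrix. Because the chain is Markov, the distribution over states after n steps is the starting distribution multiplied by the n-th power of P. The sketch below applies this to the same weather matrix; the helper name `n_step_distribution` is hypothetical.

```python
import numpy as np

P = np.array([
    [0.7, 0.2, 0.1],  # sunny  -> sunny, cloudy, rainy
    [0.3, 0.4, 0.3],  # cloudy -> sunny, cloudy, rainy
    [0.2, 0.3, 0.5],  # rainy  -> sunny, cloudy, rainy
])

# Every row of a transition matrix must be a probability distribution.
assert np.allclose(P.sum(axis=1), 1.0)

def n_step_distribution(P, start_state, n_steps):
    """Distribution over states after n_steps, starting from start_state."""
    start = np.zeros(len(P))
    start[start_state] = 1.0
    return start @ np.linalg.matrix_power(P, n_steps)

two_days = n_step_distribution(P, 0, 2)    # two days after a sunny day
long_run = n_step_distribution(P, 0, 100)  # far in the future

print("After 2 days:  ", np.round(two_days, 4))
print("After 100 days:", np.round(long_run, 4))
```

For this particular matrix the long-run distribution no longer depends on the starting state; it approaches the stationary distribution (21/46, 13/46, 12/46) for sunny, cloudy, and rainy.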

Markov Reward Process (MRP)

Adding the concept of reward to a Markov chain creates a Markov Reward Process. An MRP is defined by four elements:

S: Finite set of states
P: Transition probability matrix
R: Reward function (expected reward at each state)
Gamma: Discount factor (between 0 and 1)

State Value Function

The value V(s) of state s is the expected discounted cumulative reward starting from that state.

V(s) = E[G_t | s_t = s]
     = E[r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s]

Decomposing this recursively yields the basic form of the Bellman equation:

V(s) = R(s) + gamma * sum_over_s'( P(s' | s) * V(s') )

def compute_mrp_value(transition_matrix, rewards, gamma=0.99, n_iterations=1000):
    """Compute MRP state values by repeatedly applying the Bellman equation."""
    n_states = len(rewards)
    V = np.zeros(n_states)

    for _ in range(n_iterations):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = rewards[s] + gamma * np.sum(
                transition_matrix[s] * V
            )
        V = V_new

    return V

# Weather MRP: sunny = +2, cloudy = 0, rainy = -1
rewards = np.array([2.0, 0.0, -1.0])
values = compute_mrp_value(transition_matrix, rewards, gamma=0.9)
for s, v in zip(states, values):
    print(f"Value of state '{s}': {v:.2f}")
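For a small MRP with gamma < 1, the Bellman equation is a linear system, V = R + gamma * P V, so it can also be solved directly as (I - gamma * P) V = R. The sketch below does this for the weather example as a cross-check on the iterative routine; `solve_mrp_value` is a hypothetical helper name.

```python
import numpy as np

P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])
R = np.array([2.0, 0.0, -1.0])  # sunny = +2, cloudy = 0, rainy = -1
gamma = 0.9

def solve_mrp_value(P, R, gamma):
    """Solve (I - gamma * P) V = R; assumes gamma < 1 so the system is non-singular."""
    n_states = len(R)
    return np.linalg.solve(np.eye(n_states) - gamma * P, R)

V = solve_mrp_value(P, R, gamma)

# The solution should satisfy the Bellman equation up to rounding error.
residual = V - (R + gamma * P @ V)
print("V =", np.round(V, 2))
print("max |residual| =", np.abs(residual).max())
```

The direct solve costs roughly cubic time in the number of states, so iteration is usually preferred once the state space becomes large.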

Markov Decision Process (MDP)

An MDP adds actions to an MRP. Once the agent can choose actions, the model becomes a decision-making problem.

An MDP is defined by five elements:

S: Finite set of states
A: Finite set of actions
P: Transition probability function P(s' | s, a) - probability of transitioning to s' when taking action a in state s
R: Reward function R(s, a) - expected reward when taking action a in state s
Gamma: Discount factor
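One way to hold these five elements in code is a pair of arrays plus a scalar. The sketch below is an illustrative container, not a standard API: the class name `TabularMDP`, its `step` method, and the two-state, two-action numbers are all hypothetical. S and A are implied by the array shapes.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class TabularMDP:
    P: np.ndarray  # shape (S, A, S): P[s, a, s2] = P(s2 | s, a)
    R: np.ndarray  # shape (S, A): expected reward R(s, a)
    gamma: float   # discount factor

    def __post_init__(self):
        n_states, n_actions, n_next = self.P.shape
        if n_next != n_states:
            raise ValueError("P must have shape (S, A, S)")
        if self.R.shape != (n_states, n_actions):
            raise ValueError("R must have shape (S, A)")
        if not np.allclose(self.P.sum(axis=2), 1.0):
            raise ValueError("each P[s, a, :] must sum to 1")
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")

    @property
    def n_states(self):
        return self.P.shape[0]

    @property
    def n_actions(self):
        return self.P.shape[1]

    def step(self, state, action, rng):
        """Sample s2 ~ P(. | s, a) and return (s2, R(s, a))."""
        next_state = int(rng.choice(self.n_states, p=self.P[state, action]))
        return next_state, float(self.R[state, action])

# Hypothetical MDP with two states and two actions.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0 under action 0, action 1
    [[1.0, 0.0], [0.3, 0.7]],  # from state 1 under action 0, action 1
])
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
mdp = TabularMDP(P=P, R=R, gamma=0.9)

rng = np.random.default_rng(0)
next_state, reward = mdp.step(state=0, action=1, rng=rng)
print(next_state, reward)
```

Validating the shapes and row sums up front catches the most common mistake with hand-written transition tables: rows that do not sum to 1.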

Policy

A policy pi is a rule that determines which action to select in each state.

Deterministic policy: Selects one action definitively in each state. a = pi(s)
Stochastic policy: Defines a probability distribution over actions in each state. pi(a | s) = probability of selecting action a in state s

import random

import numpy as np

class RandomPolicy:
    """Random policy: every action is equally likely."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def select_action(self, state):
        # random.randint includes both endpoints.
        return random.randint(0, self.n_actions - 1)

class GreedyPolicy:
    """Greedy policy: select the action with the highest Q-value."""
    def __init__(self, q_table):
        self.q_table = q_table

    def select_action(self, state):
        return int(np.argmax(self.q_table[state]))

class EpsilonGreedyPolicy:
    """Epsilon-greedy policy: balances exploration and exploitation."""
    def __init__(self, q_table, epsilon=0.1):
        self.q_table = q_table
        self.epsilon = epsilon

    def select_action(self, state):
        # Explore with probability epsilon, otherwise act greedily.
        if random.random() < self.epsilon:
            return random.randint(0, len(self.q_table[state]) - 1)
        return int(np.argmax(self.q_table[state]))
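The classes above sample actions directly, but each of them also corresponds to an explicit distribution pi(a | s). As a sketch, the function below writes out the distribution implied by epsilon-greedy for a single state; the function name and the Q-values are hypothetical. Because the random branch can also pick the greedy action, the greedy action receives probability 1 - epsilon + epsilon / |A|.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a | s) implied by epsilon-greedy for one state's Q-values."""
    n_actions = len(q_values)
    # Every action gets an equal share of the exploration probability.
    probs = np.full(n_actions, epsilon / n_actions)
    # The greedy action additionally gets the exploitation probability.
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

q_values = np.array([1.0, 3.0, 2.0, 0.5])  # hypothetical Q(s, .) for one state
probs = epsilon_greedy_probs(q_values, epsilon=0.1)
print(probs)

# Sampling from the explicit distribution is equivalent to the class above.
rng = np.random.default_rng(0)
action = int(rng.choice(len(probs), p=probs))
```

Setting epsilon to 0 yields a one-hot distribution, which is the deterministic greedy policy; setting it to 1 yields the uniform random policy.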

Action-Value Function

The expected discounted cumulative reward obtained by taking action a in state s and following policy pi afterwards is denoted Q_pi(s, a).

Q_pi(s, a) = E_pi[G_t | s_t = s, a_t = a]

The optimal action-value function Q*(s, a) is the highest action value that any policy can achieve for that state-action pair.

Q*(s, a) = max_pi Q_pi(s, a)
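When P and R are known and the state space is small, Q* can be computed without enumerating policies. The sketch below assumes the standard Bellman optimality backup, Q(s, a) <- R(s, a) + gamma * sum_over_s'( P(s' | s, a) * max_a' Q(s', a') ), which this article has not introduced, and applies it repeatedly to a hypothetical deterministic MDP with two states and two actions ("stay" and "move").

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iterations=200):
    """Approximate Q* by repeating the Bellman optimality backup.

    P has shape (S, A, S) and R has shape (S, A).
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iterations):
        V = Q.max(axis=1)      # V(s2) = max over a2 of Q(s2, a2)
        Q = R + gamma * P @ V  # (S, A, S) @ (S,) -> (S, A)
    return Q

# Hypothetical MDP: action 0 = stay, action 1 = move to the other state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0  # stay in state 0
P[0, 1, 1] = 1.0  # move from 0 to 1
P[1, 0, 1] = 1.0  # stay in state 1
P[1, 1, 0] = 1.0  # move from 1 to 0
R = np.array([
    [0.0, 1.0],  # state 0: stay gives 0, move gives 1
    [2.0, 0.0],  # state 1: stay gives 2, move gives 0
])

Q_star = q_value_iteration(P, R, gamma=0.5)
greedy_policy = Q_star.argmax(axis=1)
print(Q_star)
print(greedy_policy)
```

Here the best behavior is to move from state 0 to state 1 and then stay there. Staying in state 1 forever is worth 2 / (1 - 0.5) = 4, so Q*(0, move) = 1 + 0.5 * 4 = 3, and acting greedily with respect to Q* recovers exactly that behavior.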

GridWorld MDP Implementation Example

Let us implement a simple 4x4 GridWorld as an MDP.

import numpy as np

class GridWorldMDP:
    """A simple 4x4 GridWorld MDP."""
    def __init__(self):
        self.size = 4
        self.n_states = self.size * self.size
        self.n_actions = 4  # up, down, left, right
        self.goal = (3, 3)  # goal position
        self.trap = (1, 1)  # trap position

        # (row, column) offsets for up, down, left, right
        self.action_deltas = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.action_names = ['up', 'down', 'left', 'right']

    def state_to_pos(self, state):
        return (state // self.size, state % self.size)

    def pos_to_state(self, row, col):
        return row * self.size + col

    def step(self, state, action):
        row, col = self.state_to_pos(state)
        dr, dc = self.action_deltas[action]
        new_row = max(0, min(self.size - 1, row + dr))
        new_col = max(0, min(self.size - 1, col + dc))

        next_state = self.pos_to_state(new_row, new_col)
        next_pos = (new_row, new_col)

        if next_pos == self.goal:
            return next_state, 10.0, True
        elif next_pos == self.trap:
            return next_state, -10.0, True
        else:
            return next_state, -0.1, False

    def reset(self):
        return 0  # start state: (0, 0)

# Run a random agent in the GridWorld
env = GridWorldMDP()
policy = RandomPolicy(env.n_actions)

n_episodes = 1000
total_rewards = []

for episode in range(n_episodes):
    state = env.reset()
    episode_reward = 0
    for step in range(100):
        action = policy.select_action(state)
        next_state, reward, done = env.step(state, action)
        episode_reward += reward
        state = next_state
        if done:
            break
    total_rewards.append(episode_reward)

avg_reward = np.mean(total_rewards)
print(f"Average reward of the random policy over {n_episodes} episodes: {avg_reward:.2f}")

Exploration vs Exploitation Dilemma

One of the most fundamental problems in reinforcement learning is the balance between exploration and exploitation.

Exploration: Trying new actions to gather information about the environment
Exploitation: Selecting the best known action to maximize reward

Focusing too much on exploration means failing to utilize known good strategies, while focusing too much on exploitation means missing potentially better strategies.

The epsilon-greedy policy is a simple way to manage this trade-off. It selects a random action (exploration) with probability epsilon and the current best action (exploitation) with probability 1 - epsilon.
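The trade-off is easiest to see on a problem with a single state. The sketch below uses a hypothetical three-armed bandit with win probabilities 0.2, 0.5, and 0.8 (numbers invented for illustration) and compares epsilon = 0.1 against pure exploitation (epsilon = 0). The function `run_bandit` and its parameters are illustrative, not taken from a library.

```python
import numpy as np

def run_bandit(true_means, epsilon, n_steps, seed=0):
    """Play a Bernoulli bandit with epsilon-greedy action selection."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    estimates = np.zeros(n_arms)        # running mean reward per arm
    counts = np.zeros(n_arms, dtype=int)
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: uniform random arm
        else:
            arm = int(np.argmax(estimates))   # exploit: best estimate so far
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward

true_means = [0.2, 0.5, 0.8]  # hypothetical win probability of each arm
n_steps = 5000

eps_est, eps_counts, eps_total = run_bandit(true_means, epsilon=0.1, n_steps=n_steps)
greedy_est, greedy_counts, greedy_total = run_bandit(true_means, epsilon=0.0, n_steps=n_steps)

print("epsilon=0.1 pulls per arm:", eps_counts, "total reward:", eps_total)
print("epsilon=0.0 pulls per arm:", greedy_counts, "total reward:", greedy_total)
```

In this setup the purely greedy agent never leaves the first arm: all estimates start at zero, ties are broken toward the lowest index, and the first arm's estimate can never drop below the untouched zeros of the others. A small amount of exploration is enough to discover the better arms.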

Summary

Here is a summary of the core concepts covered in this article:

Reinforcement learning is a learning method in which an agent interacts with an environment to maximize cumulative reward
The Markov property is the assumption that the future depends only on the present; it forms the mathematical foundation of reinforcement learning
An MRP adds rewards to a Markov chain, which lets us define a state value function
An MDP adds actions to an MRP to model the agent's decision-making
A policy maps states to actions (or to distributions over actions), and finding the optimal policy is the goal of reinforcement learning

In the next article, we will use OpenAI Gym to build reinforcement learning environments and train agents.