[Deep RL] 01. What is Reinforcement Learning: MDP and Fundamental Concepts

Three Paradigms of Machine Learning

Machine learning is broadly divided into three learning methods, each addressing problems of a different nature.

Supervised Learning

Supervised learning trains on input-label pairs. Image classification and text translation are typical examples. The key is that "ground truth labels" exist.

Unsupervised Learning

Unsupervised learning discovers hidden structures in data without labels. Clustering, dimensionality reduction, and generative models fall into this category.

Reinforcement Learning

Reinforcement learning is fundamentally different from the previous two approaches. An agent learns through trial and error by interacting with an environment. Correct answers are not explicitly given; instead, the agent indirectly learns which actions are good through reward signals.

| Property | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Data | Input-label pairs | Input only | State-action-reward |
| Feedback | Immediate (labels) | None | Delayed rewards |
| Goal | Prediction accuracy | Structure discovery | Maximize cumulative reward |
| Example | Image classification | Clustering | Game playing |

Core Components of Reinforcement Learning

A reinforcement learning system consists of the following components.

Agent

The agent is the entity that selects and executes actions within the environment. Players in games and robot control systems are examples of agents.

Environment

The environment is the external world with which the agent interacts. It responds to the agent's actions by providing new states and rewards.

Action

The set of choices available to the agent at each time step. The action space can be discrete (up, down, left, right) or continuous (steering angle, acceleration).

Observation

Information the agent receives from the environment. In some cases the agent can observe the entire environment state (full observability); in others, only partial information is available (partial observability).

Reward

A scalar feedback signal provided by the environment in response to the agent's action. The ultimate goal of reinforcement learning is to maximize cumulative reward over time.

The interaction flow between agent and environment can be simply expressed as follows:

Agent --[Action]--> Environment --[Observation, Reward]--> Agent --[Action]--> ...

Basic Code Structure of Agent-Environment Interaction

# Basic loop of reinforcement learning
total_reward = 0
obs = env.reset()

while True:
    action = agent.select_action(obs)
    next_obs, reward, done, info = env.step(action)
    total_reward += reward
    agent.learn(obs, action, reward, next_obs, done)
    obs = next_obs
    if done:
        break

print(f"Episode finished, total reward: {total_reward}")

Deep Understanding of Rewards

Discounted Reward

Future rewards are more uncertain than present rewards. To reflect this, we use a discount factor gamma (a value between 0 and 1).

The discounted cumulative reward G_t at time step t is calculated as follows:

G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + gamma^3 * r_{t+3} + ...

When gamma is close to 0, the agent prioritizes only immediate rewards; when close to 1, it weighs distant future rewards nearly as much as present ones.

def calculate_discounted_return(rewards, gamma=0.99):
    """Compute the discounted cumulative return at each time step."""
    G = 0
    returns = []
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.insert(0, G)
    return returns

# Example: receiving a reward of 1 at every step
rewards = [1, 1, 1, 1, 1]
returns = calculate_discounted_return(rewards, gamma=0.99)
print(f"Discounted returns: {returns}")
# Output (rounded): [4.90, 3.94, 2.97, 1.99, 1.0]
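As a sanity check (this closed form is an addition, not from the original text): when every step yields a constant reward of 1, the discounted return over n steps is a finite geometric series, G = (1 - gamma^n) / (1 - gamma).

```python
# Closed-form return for a constant reward of 1 per step (geometric series)
def geometric_return(n_steps, gamma=0.99):
    return (1 - gamma ** n_steps) / (1 - gamma)

print(geometric_return(5, gamma=0.99))  # approximately 4.90099501
```

This matches the first entry of the list computed above.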

Markov Process

The mathematical foundation of reinforcement learning starts from the Markov property.

Markov Property

The property that future states depend only on the current state and not on the history of past states. Mathematically:

P(s_{t+1} | s_t) = P(s_{t+1} | s_1, s_2, ..., s_t)

In other words, knowing the current state s_t is sufficient to predict the future.

Markov Chain

A Markov chain is a sequence of states that satisfies the Markov property. It is defined by a finite set of states S and a transition probability matrix P.

import numpy as np

# Weather Markov chain example
# States: Sunny(0), Cloudy(1), Rainy(2)
transition_matrix = np.array([
    [0.7, 0.2, 0.1],  # Sunny  -> Sunny, Cloudy, Rainy
    [0.3, 0.4, 0.3],  # Cloudy -> Sunny, Cloudy, Rainy
    [0.2, 0.3, 0.5],  # Rainy  -> Sunny, Cloudy, Rainy
])

states = ['Sunny', 'Cloudy', 'Rainy']

def simulate_markov_chain(transition_matrix, initial_state, n_steps):
    """Simulate a Markov chain for n_steps transitions."""
    current = initial_state
    trajectory = [current]
    for _ in range(n_steps):
        current = np.random.choice(
            len(transition_matrix),
            p=transition_matrix[current]
        )
        trajectory.append(current)
    return trajectory

# Simulate 10 days starting from Sunny
trajectory = simulate_markov_chain(transition_matrix, 0, 10)
print("Weather sequence:", [states[s] for s in trajectory])
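As an illustrative check (not part of the original article), we can also estimate the transition probabilities back from a long simulated trajectory; by the law of large numbers, the empirical transition frequencies should approach the rows of `transition_matrix`.

```python
import numpy as np

np.random.seed(0)  # fixed seed for reproducibility

transition_matrix = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# Count observed transitions over a long trajectory
counts = np.zeros((3, 3))
current = 0
for _ in range(100_000):
    nxt = np.random.choice(3, p=transition_matrix[current])
    counts[current, nxt] += 1
    current = nxt

# Row-normalize counts into empirical transition frequencies
empirical = counts / counts.sum(axis=1, keepdims=True)
print(np.round(empirical, 3))
```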

Markov Reward Process (MRP)

Adding the concept of reward to a Markov chain creates a Markov Reward Process. An MRP is defined by four elements:

  • S: Finite set of states
  • P: Transition probability matrix
  • R: Reward function (expected reward at each state)
  • Gamma: Discount factor (between 0 and 1)

State Value Function

The value V(s) of state s is the expected discounted cumulative reward starting from that state.

V(s) = E[G_t | s_t = s]
     = E[r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s]

Decomposing this recursively yields the basic form of the Bellman equation:

V(s) = R(s) + gamma * sum_over_s'( P(s' | s) * V(s') )

def compute_mrp_value(transition_matrix, rewards, gamma=0.99, n_iterations=1000):
    """Compute MRP state values by iterating the Bellman equation."""
    n_states = len(rewards)
    V = np.zeros(n_states)

    for _ in range(n_iterations):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = rewards[s] + gamma * np.sum(
                transition_matrix[s] * V
            )
        V = V_new

    return V

# Weather MRP: Sunny=+2, Cloudy=0, Rainy=-1
rewards = np.array([2.0, 0.0, -1.0])
values = compute_mrp_value(transition_matrix, rewards, gamma=0.9)
for s, v in zip(states, values):
    print(f"Value of state '{s}': {v:.2f}")
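Because the Bellman equation for an MRP is linear in V, it can also be solved exactly in closed form, V = (I - gamma * P)^{-1} R. This is practical only for small state spaces, but it makes a useful cross-check on the iterative result (this aside is an addition to the original text):

```python
import numpy as np

transition_matrix = np.array([
    [0.7, 0.2, 0.1],  # Sunny
    [0.3, 0.4, 0.3],  # Cloudy
    [0.2, 0.3, 0.5],  # Rainy
])
rewards = np.array([2.0, 0.0, -1.0])
gamma = 0.9

# Solve the linear system (I - gamma * P) V = R directly
V = np.linalg.solve(np.eye(3) - gamma * transition_matrix, rewards)
print(np.round(V, 2))
```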

Markov Decision Process (MDP)

An MDP adds the concept of Action to an MRP. By allowing the agent to choose actions, it becomes a decision-making problem.

An MDP is defined by five elements:

  • S: Finite set of states
  • A: Finite set of actions
  • P: Transition probability function P(s' | s, a) - probability of transitioning to s' when taking action a in state s
  • R: Reward function R(s, a) - expected reward when taking action a in state s
  • Gamma: Discount factor

Policy

A policy pi is a rule that determines which action to select in each state.

  • Deterministic policy: Selects one action definitively in each state. a = pi(s)
  • Stochastic policy: Defines a probability distribution over actions in each state. pi(a | s) = probability of selecting action a in state s

import random
import numpy as np

class RandomPolicy:
    """Random policy: select a uniformly random action."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def select_action(self, state):
        return random.randint(0, self.n_actions - 1)

class GreedyPolicy:
    """Greedy policy: select the action with the highest Q-value."""
    def __init__(self, q_table):
        self.q_table = q_table

    def select_action(self, state):
        return int(np.argmax(self.q_table[state]))

class EpsilonGreedyPolicy:
    """Epsilon-greedy policy: balance exploration and exploitation."""
    def __init__(self, q_table, epsilon=0.1):
        self.q_table = q_table
        self.epsilon = epsilon

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, len(self.q_table[state]) - 1)
        return int(np.argmax(self.q_table[state]))

Action-Value Function

The action-value function Q_pi(s, a) is the expected discounted cumulative reward obtained by taking action a in state s and then following policy pi.

Q_pi(s, a) = E_pi[G_t | s_t = s, a_t = a]

The optimal action-value function Q*(s, a) represents the highest Q value among all possible policies.

Q*(s, a) = max_pi Q_pi(s, a)
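Like the state value function, the optimal action-value function satisfies a recursive Bellman optimality equation; written in the same notation as the equations above (this restatement is an addition for completeness):

Q*(s, a) = R(s, a) + gamma * sum_over_s'( P(s' | s, a) * max_over_a'( Q*(s', a') ) )

Intuitively: the value of the best behavior from (s, a) is the immediate reward plus the discounted value of acting optimally from whichever state comes next.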

GridWorld MDP Implementation Example

Let us implement a simple 4x4 GridWorld as an MDP.

import numpy as np

class GridWorldMDP:
    """A simple 4x4 GridWorld MDP."""
    def __init__(self):
        self.size = 4
        self.n_states = self.size * self.size
        self.n_actions = 4  # up, down, left, right
        self.goal = (3, 3)  # goal position
        self.trap = (1, 1)  # trap position

        # Action directions: up, down, left, right
        self.action_deltas = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.action_names = ['up', 'down', 'left', 'right']

    def state_to_pos(self, state):
        return (state // self.size, state % self.size)

    def pos_to_state(self, row, col):
        return row * self.size + col

    def step(self, state, action):
        row, col = self.state_to_pos(state)
        dr, dc = self.action_deltas[action]
        new_row = max(0, min(self.size - 1, row + dr))
        new_col = max(0, min(self.size - 1, col + dc))

        next_state = self.pos_to_state(new_row, new_col)
        next_pos = (new_row, new_col)

        if next_pos == self.goal:
            return next_state, 10.0, True
        elif next_pos == self.trap:
            return next_state, -10.0, True
        else:
            return next_state, -0.1, False

    def reset(self):
        return 0  # start state: (0, 0)

# Run a random agent in the GridWorld
env = GridWorldMDP()
policy = RandomPolicy(env.n_actions)

n_episodes = 1000
total_rewards = []

for episode in range(n_episodes):
    state = env.reset()
    episode_reward = 0
    for step in range(100):
        action = policy.select_action(state)
        next_state, reward, done = env.step(state, action)
        episode_reward += reward
        state = next_state
        if done:
            break
    total_rewards.append(episode_reward)

avg_reward = np.mean(total_rewards)
print(f"Average reward of the random policy over {n_episodes} episodes: {avg_reward:.2f}")
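As a point of contrast with the random policy (a preview sketch, not something this article formally introduces): because this GridWorld's dynamics are known and deterministic, we can compute Q* by repeatedly applying the Bellman optimality equation (value iteration) and then act greedily. The snippet below re-implements the environment's step logic in a compact standalone form.

```python
import numpy as np

SIZE = 4
GOAL, TRAP = (3, 3), (1, 1)
DELTAS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic GridWorld transition (same rules as GridWorldMDP.step)."""
    row, col = divmod(state, SIZE)
    dr, dc = DELTAS[action]
    nr = max(0, min(SIZE - 1, row + dr))
    nc = max(0, min(SIZE - 1, col + dc))
    ns = nr * SIZE + nc
    if (nr, nc) == GOAL:
        return ns, 10.0, True
    if (nr, nc) == TRAP:
        return ns, -10.0, True
    return ns, -0.1, False

# Value iteration: sweep the Bellman optimality update until (near) convergence
gamma = 0.99
Q = np.zeros((SIZE * SIZE, 4))
for _ in range(1000):
    for s in range(SIZE * SIZE):
        for a in range(4):
            ns, r, done = step(s, a)
            Q[s, a] = r + (0.0 if done else gamma * Q[ns].max())

# Greedy rollout from the start state (0, 0)
state, total = 0, 0.0
for _ in range(100):
    state, reward, done = step(state, int(np.argmax(Q[state])))
    total += reward
    if done:
        break
print(f"Greedy return from the start state: {total:.2f}")
```

The greedy agent reaches the goal in six steps (five step penalties of -0.1 plus the +10 goal reward, a return of 9.5), far above the random policy's average.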

Exploration vs Exploitation Dilemma

One of the most fundamental problems in reinforcement learning is the balance between exploration and exploitation.

  • Exploration: Trying new actions to gather information about the environment
  • Exploitation: Selecting the best known action to maximize reward

Focusing too much on exploration means failing to utilize known good strategies, while focusing too much on exploitation means missing potentially better strategies.

The epsilon-greedy policy is a simple solution to this problem. It selects a random action (exploration) with probability epsilon and the current best action (exploitation) with probability 1 - epsilon.
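In practice, epsilon is often not kept fixed: a common refinement (beyond what this article covers) is to anneal it from a large initial value toward a small floor, so the agent explores heavily early on and exploits more as its value estimates improve. A minimal linear schedule might look like:

```python
def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

for step in [0, 2_500, 5_000, 10_000, 50_000]:
    print(f"step {step:>6}: epsilon = {linear_epsilon(step):.3f}")
```

After `decay_steps`, the schedule holds at the floor value so some exploration always remains.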


Summary

Here is a summary of the core concepts covered in this article:

  1. Reinforcement learning is a learning method where an agent interacts with an environment to maximize rewards
  2. Markov property is the assumption that the future depends only on the present, forming the mathematical foundation of reinforcement learning
  3. MRP adds rewards to a Markov chain, allowing us to define a state value function
  4. MDP adds actions to an MRP to model agent decision-making
  5. Policy is a mapping from states to actions, and finding the optimal policy is the goal of reinforcement learning

In the next article, we will use OpenAI Gym to build reinforcement learning environments and train agents.