Author: Youngju Kim (@fjvbn20031)
## Three Paradigms of Machine Learning

Machine learning is broadly divided into three learning methods. Each approach addresses problems of a different nature.
### Supervised Learning
Supervised learning trains on input-label pairs. Image classification and text translation are typical examples. The key is that "ground truth labels" exist.
### Unsupervised Learning
Unsupervised learning discovers hidden structures in data without labels. Clustering, dimensionality reduction, and generative models fall into this category.
### Reinforcement Learning
Reinforcement learning is fundamentally different from the previous two approaches. An agent learns through trial and error by interacting with an environment. Correct answers are not explicitly given; instead, the agent indirectly learns which actions are good through reward signals.
| Property | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Data | Input-label pairs | Input only | State-action-reward |
| Feedback | Immediate (labels) | None | Delayed rewards |
| Goal | Prediction accuracy | Structure discovery | Maximize cumulative reward |
| Example | Image classification | Clustering | Game playing |
## Core Components of Reinforcement Learning
A reinforcement learning system consists of the following components.
### Agent
The agent is the entity that selects and executes actions within the environment. Players in games and robot control systems are examples of agents.
### Environment
The environment is the external world with which the agent interacts. It responds to the agent's actions by providing new states and rewards.
### Action

An action is a choice the agent makes at each time step. The action space can be discrete (up, down, left, right) or continuous (steering angle, acceleration).
### Observation

An observation is the information the agent receives from the environment. In some cases the agent can see the entire environment state (full observability); in others, only partial information is available (partial observability).
### Reward

A reward is a scalar feedback signal provided by the environment in response to the agent's action. The ultimate goal of reinforcement learning is to maximize the cumulative reward over time.
The interaction flow between agent and environment can be expressed simply as:

```
Agent --[Action]--> Environment --[Observation, Reward]--> Agent --[Action]--> ...
```
## Basic Code Structure of Agent-Environment Interaction

```python
# Basic reinforcement learning loop
total_reward = 0
obs = env.reset()
while True:
    action = agent.select_action(obs)
    next_obs, reward, done, info = env.step(action)
    total_reward += reward
    agent.learn(obs, action, reward, next_obs, done)
    obs = next_obs
    if done:
        break
print(f"Episode finished, total reward: {total_reward}")
```
## Deep Understanding of Rewards

### Discounted Reward
Future rewards are more uncertain than present rewards. To reflect this, we use a discount factor gamma (a value between 0 and 1).
The discounted cumulative reward G_t at time step t is calculated as follows:
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + gamma^3 * r_{t+3} + ...
When gamma is close to 0, the agent prioritizes immediate rewards only; when close to 1, it treats distant future rewards nearly equally to present ones.
```python
def calculate_discounted_return(rewards, gamma=0.99):
    """Compute the discounted cumulative return at each time step."""
    G = 0
    returns = []
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.insert(0, G)
    return returns

# Example: a reward of 1 at every step
rewards = [1, 1, 1, 1, 1]
returns = calculate_discounted_return(rewards, gamma=0.99)
print(f"Discounted returns: {returns}")
# Output (rounded): [4.90, 3.94, 2.97, 1.99, 1.00]
```
## Markov Process
The mathematical foundation of reinforcement learning starts from the Markov property.
### Markov Property
The property that future states depend only on the current state and not on the history of past states. Mathematically:
P(s_{t+1} | s_t) = P(s_{t+1} | s_1, s_2, ..., s_t)
In other words, knowing the current state s_t is sufficient to predict the future.
### Markov Chain
A Markov chain is a sequence of states that satisfies the Markov property. It is defined by a finite set of states S and a transition probability matrix P.
```python
import numpy as np

# Weather Markov chain example
# States: Sunny(0), Cloudy(1), Rainy(2)
transition_matrix = np.array([
    [0.7, 0.2, 0.1],  # Sunny  -> Sunny, Cloudy, Rainy
    [0.3, 0.4, 0.3],  # Cloudy -> Sunny, Cloudy, Rainy
    [0.2, 0.3, 0.5],  # Rainy  -> Sunny, Cloudy, Rainy
])
states = ['Sunny', 'Cloudy', 'Rainy']

def simulate_markov_chain(transition_matrix, initial_state, n_steps):
    """Simulate a Markov chain for n_steps transitions."""
    current = initial_state
    trajectory = [current]
    for _ in range(n_steps):
        current = np.random.choice(
            len(transition_matrix),
            p=transition_matrix[current]
        )
        trajectory.append(current)
    return trajectory

# Simulate 10 days starting from Sunny
trajectory = simulate_markov_chain(transition_matrix, 0, 10)
print("Weather sequence:", [states[s] for s in trajectory])
```
## Markov Reward Process (MRP)
Adding the concept of reward to a Markov chain creates a Markov Reward Process. An MRP is defined by four elements:
- S: Finite set of states
- P: Transition probability matrix
- R: Reward function (expected reward at each state)
- Gamma: Discount factor (between 0 and 1)
### State Value Function
The value V(s) of state s is the expected discounted cumulative reward starting from that state.
V(s) = E[G_t | s_t = s]
= E[r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s]
Decomposing this recursively yields the basic form of the Bellman equation:
V(s) = R(s) + gamma * sum_over_s'( P(s' | s) * V(s') )
```python
def compute_mrp_value(transition_matrix, rewards, gamma=0.99, n_iterations=1000):
    """Compute MRP state values by iterating the Bellman equation."""
    n_states = len(rewards)
    V = np.zeros(n_states)
    for _ in range(n_iterations):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = rewards[s] + gamma * np.sum(
                transition_matrix[s] * V
            )
        V = V_new
    return V

# Weather MRP: Sunny=+2, Cloudy=0, Rainy=-1
rewards = np.array([2.0, 0.0, -1.0])
values = compute_mrp_value(transition_matrix, rewards, gamma=0.9)
for s, v in zip(states, values):
    print(f"Value of state '{s}': {v:.2f}")
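Because the Bellman equation is linear in V, the same values can also be obtained in closed form: rearranging V = R + gamma * P V gives (I - gamma * P) V = R, a linear system. As a cross-check on the iterative method, here is a minimal sketch (restating the weather MRP so the snippet is self-contained):

```python
import numpy as np

# Same weather MRP as above: states Sunny(0), Cloudy(1), Rainy(2)
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])
R = np.array([2.0, 0.0, -1.0])
gamma = 0.9

# Solve (I - gamma * P) V = R directly instead of iterating
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print("Closed-form state values:", np.round(V, 2))
```

The closed form is exact but requires inverting a matrix, so for large state spaces the iterative approach above (or sampling-based methods) is used instead.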
## Markov Decision Process (MDP)
An MDP adds the concept of Action to an MRP. By allowing the agent to choose actions, it becomes a decision-making problem.
An MDP is defined by five elements:
- S: Finite set of states
- A: Finite set of actions
- P: Transition probability function P(s' | s, a) - probability of transitioning to s' when taking action a in state s
- R: Reward function R(s, a) - expected reward when taking action a in state s
- Gamma: Discount factor
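To make the five elements concrete, a finite MDP can be stored as plain arrays: a transition tensor P[s, a, s'] and a reward table R[s, a]. The sketch below uses a made-up 2-state, 2-action MDP (the numbers are purely illustrative) and also shows that fixing a policy pi collapses the MDP back into an MRP:

```python
import numpy as np

# A toy MDP with 2 states and 2 actions, written as plain arrays.
# P[s, a, s'] = probability of moving to s' after taking action a in state s
# R[s, a]    = expected immediate reward for taking action a in state s
n_states, n_actions = 2, 2
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 (action 0, action 1)
    [[0.0, 1.0], [0.6, 0.4]],  # transitions from state 1
])
R = np.array([
    [1.0, 0.0],  # rewards in state 0
    [0.0, 2.0],  # rewards in state 1
])

# Sanity check: every (state, action) row must be a probability distribution
assert np.allclose(P.sum(axis=2), 1.0)

# Fixing a policy pi(a|s) averages out the actions, yielding an MRP
pi = np.full((n_states, n_actions), 0.5)  # uniform random policy
P_pi = np.einsum('sa,sat->st', pi, P)     # P_pi[s, s'] = sum_a pi(a|s) * P[s, a, s']
R_pi = (pi * R).sum(axis=1)               # R_pi[s] = sum_a pi(a|s) * R[s, a]
print("Induced MRP transitions:\n", P_pi)
print("Induced MRP rewards:", R_pi)
```

This MDP-to-MRP reduction is why the state value function defined for MRPs carries over to MDPs once a policy is fixed.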
### Policy
A policy pi is a rule that determines which action to select in each state.
- Deterministic policy: Selects one action definitively in each state. a = pi(s)
- Stochastic policy: Defines a probability distribution over actions in each state. pi(a | s) = probability of selecting action a in state s
```python
import random
import numpy as np

class RandomPolicy:
    """Uniformly random policy."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def select_action(self, state):
        return random.randint(0, self.n_actions - 1)

class GreedyPolicy:
    """Greedy policy: always pick the action with the highest Q-value."""
    def __init__(self, q_table):
        self.q_table = q_table

    def select_action(self, state):
        return int(np.argmax(self.q_table[state]))

class EpsilonGreedyPolicy:
    """Epsilon-greedy policy: balances exploration and exploitation."""
    def __init__(self, q_table, epsilon=0.1):
        self.q_table = q_table
        self.epsilon = epsilon

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, len(self.q_table[state]) - 1)
        return int(np.argmax(self.q_table[state]))
```
### Action-Value Function

Under a policy pi, the expected discounted cumulative reward after taking action a in state s is denoted Q_pi(s, a).
Q_pi(s, a) = E_pi[G_t | s_t = s, a_t = a]
The optimal action-value function Q*(s, a) represents the highest Q value among all possible policies.
Q*(s, a) = max_pi Q_pi(s, a)
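Given Q* (or an estimate of it), the optimal action in each state is simply the argmax over actions, which is exactly what the GreedyPolicy class above computes. A minimal sketch with a hypothetical, made-up Q-table:

```python
import numpy as np

# A hypothetical Q-table for 3 states and 2 actions (values are made up)
q_table = np.array([
    [1.0, 3.0],  # state 0: action 1 looks best
    [2.5, 0.5],  # state 1: action 0 looks best
    [0.0, 0.0],  # state 2: tie -> argmax picks the first action
])

# Acting greedily with respect to Q* yields a deterministic optimal policy
greedy_policy = np.argmax(q_table, axis=1)
print("Greedy policy (one action index per state):", greedy_policy)
```

This is why many algorithms focus on estimating Q* accurately: once it is known, the hard decision-making problem reduces to a table lookup.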
## GridWorld MDP Implementation Example
Let us implement a simple 4x4 GridWorld as an MDP.
```python
import numpy as np

class GridWorldMDP:
    """A simple 4x4 GridWorld MDP."""
    def __init__(self):
        self.size = 4
        self.n_states = self.size * self.size
        self.n_actions = 4  # up, down, left, right
        self.goal = (3, 3)  # goal position
        self.trap = (1, 1)  # trap position
        # Movement deltas: up, down, left, right
        self.action_deltas = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.action_names = ['up', 'down', 'left', 'right']

    def state_to_pos(self, state):
        return (state // self.size, state % self.size)

    def pos_to_state(self, row, col):
        return row * self.size + col

    def step(self, state, action):
        row, col = self.state_to_pos(state)
        dr, dc = self.action_deltas[action]
        new_row = max(0, min(self.size - 1, row + dr))
        new_col = max(0, min(self.size - 1, col + dc))
        next_state = self.pos_to_state(new_row, new_col)
        next_pos = (new_row, new_col)
        if next_pos == self.goal:
            return next_state, 10.0, True
        elif next_pos == self.trap:
            return next_state, -10.0, True
        else:
            return next_state, -0.1, False

    def reset(self):
        return 0  # start state: (0, 0)

# Run a random agent in the GridWorld
env = GridWorldMDP()
policy = RandomPolicy(env.n_actions)
n_episodes = 1000
total_rewards = []
for episode in range(n_episodes):
    state = env.reset()
    episode_reward = 0
    for step in range(100):
        action = policy.select_action(state)
        next_state, reward, done = env.step(state, action)
        episode_reward += reward
        state = next_state
        if done:
            break
    total_rewards.append(episode_reward)

avg_reward = np.mean(total_rewards)
print(f"Average reward of the random policy over {n_episodes} episodes: {avg_reward:.2f}")
```
## Exploration vs Exploitation Dilemma
One of the most fundamental problems in reinforcement learning is the balance between exploration and exploitation.
- Exploration: Trying new actions to gather information about the environment
- Exploitation: Selecting the best known action to maximize reward
Focusing too much on exploration means failing to utilize known good strategies, while focusing too much on exploitation means missing potentially better strategies.
The epsilon-greedy policy is a simple solution to this problem. It selects a random action (exploration) with probability epsilon and the current best action (exploitation) with probability 1 - epsilon.
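To see this trade-off in action, a common toy setting is the multi-armed bandit (not covered further in this article): each arm pays a noisy reward with an unknown mean, and an epsilon-greedy agent must discover the best arm while still collecting reward. A minimal sketch with made-up arm means:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3-armed bandit with hidden mean rewards (made up for illustration);
# arm 2 is the best, but the agent does not know that in advance
true_means = np.array([0.2, 0.5, 0.8])
epsilon = 0.1
n_steps = 5000

q_estimates = np.zeros(3)  # running estimate of each arm's mean reward
counts = np.zeros(3)       # how often each arm was pulled

for _ in range(n_steps):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))          # explore: pick a random arm
    else:
        arm = int(np.argmax(q_estimates))   # exploit: pick the best arm so far
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]  # incremental mean

print("Pull counts per arm:", counts)
print("Estimated means:", np.round(q_estimates, 2))
```

With epsilon = 0.1, roughly 10% of pulls remain exploratory forever, which is what lets the agent correct an early wrong belief about which arm is best.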
## Summary
Here is a summary of the core concepts covered in this article:
- Reinforcement learning is a learning method where an agent interacts with an environment to maximize rewards
- Markov property is the assumption that the future depends only on the present, forming the mathematical foundation of reinforcement learning
- MRP adds rewards to a Markov chain, allowing us to define a state value function
- MDP adds actions to an MRP to model agent decision-making
- Policy is a mapping from states to actions, and finding the optimal policy is the goal of reinforcement learning
In the next article, we will use OpenAI Gym to build reinforcement learning environments and train agents.