Complete Math Guide for AI — From Linear Algebra to Information Theory

Math for AI

Introduction

"How much math do I need to study AI?"

Answer: Linear algebra + calculus + probability/statistics + optimization. These four areas let you read 90% of papers.

This is not a math textbook. It explains why this math is used in AI, with code and intuition. When building nanoGPT from scratch, or training an image generation model (DDPM) — this is where that math shows up.

Part 1: Linear Algebra — The Skeleton of AI

Why Do You Need It?

Nearly every operation in a neural network boils down to matrix multiplication.

import numpy as np

# A single neuron = dot product
weights = np.array([0.5, -0.3, 0.8])  # weights
inputs = np.array([1.0, 2.0, 3.0])     # inputs
bias = 0.1

output = np.dot(weights, inputs) + bias
# 0.5*1.0 + (-0.3)*2.0 + 0.8*3.0 + 0.1 = 2.4

# An entire layer = matrix multiplication
W = np.random.randn(4, 3)  # 3 → 4 neurons
x = np.random.randn(3)      # input vector
h = W @ x                    # matrix-vector product = layer output

Vectors — Representing Data

# Representing words as vectors (Word Embedding)
king = np.array([0.8, 0.2, 0.9, -0.5])
queen = np.array([0.7, 0.8, 0.85, -0.4])
man = np.array([0.9, 0.1, 0.5, -0.6])
woman = np.array([0.8, 0.7, 0.45, -0.5])

# king - man + woman ≈ queen (the famous relationship!)
result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen              = {queen}")
# Identical! (these toy vectors were constructed so the identity holds exactly)

# Cosine similarity — how similar two vectors are
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(result, queen))  # ≈ 1.0 (exact match up to floating-point error)

Matrix Multiplication — The Heart of Neural Networks

# Transformer's Self-Attention is also matrix multiplication!
# Q, K, V = input multiplied by weight matrices
batch_size, seq_len, d_model = 2, 10, 64

X = np.random.randn(batch_size, seq_len, d_model)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = X @ W_Q  # Query: (2, 10, 64)
K = X @ W_K  # Key:   (2, 10, 64)
V = X @ W_V  # Value: (2, 10, 64)

# Attention Score = Q × K^T / √d
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)
# scores shape: (2, 10, 10) — attention each token pays to other tokens
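The score matrix is only half of attention: a softmax over the last axis turns scores into weights, and multiplying by V mixes the value vectors. A minimal sketch completing the computation (seeded random tensors stand in for real inputs):

```python
import numpy as np

# Complete scaled dot-product attention: softmax(QK^T / sqrt(d)) @ V
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerical stability
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 10, 64))
K = rng.standard_normal((2, 10, 64))
V = rng.standard_normal((2, 10, 64))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(64)  # (2, 10, 10)
attn = softmax(scores)                           # each row sums to 1
out = attn @ V                                   # (2, 10, 64): weighted mix of values
print(out.shape)
```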

Eigenvalue Decomposition — PCA, SVD

# PCA: Finding the principal directions of data
from sklearn.decomposition import PCA

# 100-dimensional data → reduced to 2 dimensions
data = np.random.randn(1000, 100)
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# What happens internally:
# 1. Compute covariance matrix: C = X^T X / n
# 2. Eigenvalue decomposition: C = V Λ V^T
# 3. Select eigenvectors corresponding to largest eigenvalues
# → The directions of greatest data variance!

# SVD (Singular Value Decomposition) — Used for LLM compression!
# LoRA is exactly this: decomposing a large matrix into 2 small matrices

W = np.random.randn(768, 768)  # GPT-2's attention weight

# SVD: W = U × Σ × V^T
U, S, Vt = np.linalg.svd(W)

# Keep only the top r values for approximation (the principle behind LoRA!)
r = 16  # rank
W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Original: 768×768 = 589,824 parameters
# LoRA: 768×16 + 16×768 = 24,576 parameters (only ~4%!)
error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"Rank-{r} approximation error: {error:.4f}")
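By the Eckart–Young theorem, keeping more singular values can only shrink the approximation error. A quick sketch sweeping the rank (the specific ranks are illustrative):

```python
import numpy as np

# Rank-r approximation error shrinks monotonically as r grows (Eckart–Young).
rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))
U, S, Vt = np.linalg.svd(W)

errs = []
for r in (16, 64, 256):
    W_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
    errs.append(np.linalg.norm(W - W_r) / np.linalg.norm(W))
    print(f"rank {r:3d}: relative error {errs[-1]:.3f}")
```

For a random Gaussian matrix the error at rank 16 stays high; real weight matrices are much closer to low-rank, which is why LoRA works as well as it does.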

Part 2: Calculus — The Engine of Learning

Why Do You Need It?

Neural network training = minimizing a loss function = computing gradients via differentiation to update parameters

Partial Derivatives

# f(x, y) = x² + 2xy + y²
# ∂f/∂x = 2x + 2y (treating y as constant)
# ∂f/∂y = 2x + 2y (treating x as constant)

def f(x, y):
    return x**2 + 2*x*y + y**2

def df_dx(x, y):
    return 2*x + 2*y  # gradient in x direction

def df_dy(x, y):
    return 2*x + 2*y  # gradient in y direction

# Gradient vector
gradient = np.array([df_dx(3, 2), df_dy(3, 2)])
print(f"∇f(3,2) = {gradient}")  # [10, 10]
# → Moving in the opposite direction decreases the function value!
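Analytic derivatives like these can always be sanity-checked numerically with central differences — the same trick used to verify backprop implementations:

```python
# Central-difference check of the analytic gradient of f(x, y) = x² + 2xy + y²
def f(x, y):
    return x**2 + 2*x*y + y**2

h = 1e-5
num_dx = (f(3 + h, 2) - f(3 - h, 2)) / (2 * h)  # ≈ ∂f/∂x at (3, 2)
num_dy = (f(3, 2 + h) - f(3, 2 - h)) / (2 * h)  # ≈ ∂f/∂y at (3, 2)
print(round(num_dx, 4), round(num_dy, 4))  # 10.0 10.0 — matches the analytic gradient
```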

Chain Rule — The Mathematical Foundation of Backpropagation!

# y = f(g(x)) → dy/dx = df/dg × dg/dx

# In neural networks:
# Loss = CrossEntropy(softmax(Wx + b), target)
# dLoss/dW = dLoss/dsoftmax × dsoftmax/d(Wx+b) × d(Wx+b)/dW

# PyTorch does this automatically!
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
W = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

# Forward
h = W @ x + b
loss = h.sum()

# Backward (chain rule applied automatically!)
loss.backward()

print(f"dLoss/dW = {W.grad}")  # automatic differentiation!
print(f"dLoss/db = {b.grad}")
print(f"dLoss/dx = {x.grad}")
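What autograd automates is exactly this factorization. It can be reproduced by hand on a tiny composite function and checked numerically (the sigmoid inner function here is just an illustrative choice):

```python
import numpy as np

# Chain rule by hand: y = sigmoid(w*x)², so
# dy/dw = dy/ds · ds/dz · dz/dw = 2s · s(1-s) · x
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, x = 0.5, 2.0
s = sigmoid(w * x)
analytic = 2 * s * s * (1 - s) * x

# Numerical check via central differences
h = 1e-6
numeric = (sigmoid((w + h) * x)**2 - sigmoid((w - h) * x)**2) / (2 * h)
print(abs(analytic - numeric) < 1e-6)  # True
```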

Gradient Descent — A Blind Hiker Walking Downhill

# Loss function: L(w) = (w - 3)²
# Minimum: w = 3

def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

# Gradient descent
w = 10.0  # starting point (way off)
lr = 0.1  # learning rate

for step in range(20):
    g = grad(w)
    w = w - lr * g  # move in the opposite direction of the gradient!
    if step % 5 == 0:
        print(f"Step {step}: w={w:.4f}, loss={loss(w):.4f}")

# Step 0:  w=8.6000, loss=31.3600
# Step 5:  w=4.8350, loss=3.3673
# Step 10: w=3.6013, loss=0.3616
# Step 15: w=3.1970, loss=0.0388
# → w converges toward 3!

The Importance of Learning Rate

If lr is too large (e.g. lr = 1.1):
  w: 10 → -5.4 → 13.1 → -9.1 → ... diverges!

If lr is too small (e.g. lr = 0.01):
  w: 10 → 9.86 → 9.72 → ... creeps toward 3 over hundreds of steps

With the right lr (e.g. lr = 0.1):
  w: 10 → 8.6 → 7.48 → ... ≈ 3.08 after 20 steps, ≈ 3.001 after 40
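The three regimes are easy to reproduce with the same toy loss (the specific lr values are illustrative):

```python
# Same descent loop as above, swept over three learning rates.
def run(lr, steps=20, w=10.0):
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # gradient of (w - 3)²
    return w

for lr in (1.1, 0.001, 0.1):
    print(f"lr={lr}: w after 20 steps = {run(lr):.4f}")
# lr=1.1 blows up, lr=0.001 barely moves, lr=0.1 lands near 3
```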

Real-World Optimizer: Adam

# Adam = Momentum + RMSprop (modern deep learning standard)
class Adam:
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1  # momentum (inertia)
        self.beta2 = beta2  # moving average of squared gradients
        self.eps = eps
        self.m = {id(p): 0 for p in params}  # 1st moment
        self.v = {id(p): 0 for p in params}  # 2nd moment
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for p, g in zip(params, grads):
            pid = id(p)
            # Momentum: remember previous gradient direction
            self.m[pid] = self.beta1 * self.m[pid] + (1 - self.beta1) * g
            # Adaptive learning rate: adjust based on gradient magnitude
            self.v[pid] = self.beta2 * self.v[pid] + (1 - self.beta2) * g**2
            # Bias correction
            m_hat = self.m[pid] / (1 - self.beta1**self.t)
            v_hat = self.v[pid] / (1 - self.beta2**self.t)
            # Update
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)  # in-place update; p must be a mutable array (e.g. np.ndarray)
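A scalar version of the same update, run on the earlier toy loss L(w) = (w - 3)², shows the two moments in action:

```python
# Inlined Adam update on L(w) = (w - 3)²; same formulas as the class above.
w, m, v = 10.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = 2 * (w - 3)                      # gradient
    m = beta1 * m + (1 - beta1) * g      # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2   # 2nd moment (squared gradients)
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (v_hat**0.5 + eps)

print(f"w after 500 Adam steps = {w:.3f}")  # ends close to 3
```

Note the early steps have size ≈ lr regardless of gradient magnitude — that is the adaptive scaling by √v̂ at work.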

Part 3: Probability and Statistics — The Language of AI

Why Do You Need It?

The output of AI models is almost always a probability distribution.

# GPT's output = probability distribution of the next token
logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])  # model output (raw)
vocab = ["the", "cat", "sat", "on", "mat"]

# Softmax: logits → probabilities
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # numerical stability
    return exp_x / exp_x.sum()

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"  {word}: {p:.4f}")
# the: 0.2333, cat: 0.0858, sat: 0.0349, on: 0.0116, mat: 0.6343
# → "mat" has the highest probability!

Bayes' Theorem

# P(A|B) = P(B|A) × P(A) / P(B)
# "The probability that the model is correct given the data"

# Example: Spam filter
# P(spam|"free") = P("free"|spam) × P(spam) / P("free")
p_free_given_spam = 0.8    # probability of "free" appearing in spam
p_spam = 0.3               # proportion of all emails that are spam
p_free = 0.35              # probability of "free" appearing in all emails

p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(f"P(spam|'free') = {p_spam_given_free:.2f}")  # 0.69 (69%!)
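The same number can be recovered by brute-force simulation. P("free"|ham) is not stated above, but the given totals imply it: 0.35 = 0.8·0.3 + x·0.7, so x ≈ 0.157 (a derived value, not from the original figures):

```python
import numpy as np

# Monte Carlo sanity check of P(spam | "free") using the numbers above.
rng = np.random.default_rng(0)
n = 1_000_000

is_spam = rng.random(n) < 0.3                 # P(spam) = 0.3
p_free_given_ham = (0.35 - 0.8 * 0.3) / 0.7   # implied by P("free") = 0.35
has_free = np.where(is_spam,
                    rng.random(n) < 0.8,      # P("free" | spam)
                    rng.random(n) < p_free_given_ham)

posterior = is_spam[has_free].mean()          # fraction of "free" mails that are spam
print(f"simulated P(spam | 'free') = {posterior:.3f}")  # ≈ 0.686
```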

Probability Distributions

# Gaussian (Normal) Distribution — The core of Diffusion Models!
def gaussian(x, mu=0, sigma=1):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# DDPM (Image Generation):
# Forward:  clean image → add Gaussian noise (gradually destroy)
# Reverse:  noise → remove noise (neural network learns this) → clean image!

# Noise addition process
def add_noise(image, t, noise_schedule):
    """x_t = √(α_bar_t) × x_0 + √(1 - α_bar_t) × ε"""
    alpha_bar = noise_schedule[t]
    noise = np.random.randn(*image.shape)  # Gaussian noise
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise
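A quick numeric check of the formula: with a toy linearly decreasing ᾱ schedule (an illustrative assumption, not DDPM's actual β schedule), early timesteps barely perturb the image while late ones bury it in noise.

```python
import numpy as np

# As alpha_bar falls from ~1 toward 0, the signal term fades and noise dominates.
def add_noise(image, t, noise_schedule):
    alpha_bar = noise_schedule[t]
    noise = np.random.randn(*image.shape)
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise

np.random.seed(0)
schedule = np.linspace(0.999, 0.01, 100)  # toy alpha_bar schedule (illustrative only)
image = np.ones((8, 8))

early, _ = add_noise(image, 0, schedule)   # t=0:  almost pure signal
late, _ = add_noise(image, 99, schedule)   # t=99: almost pure noise
print(np.abs(early - image).mean() < np.abs(late - image).mean())  # True
```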

Cross-Entropy — The King of Loss Functions

# Measures how different the model's predictions are from the ground truth
def cross_entropy(predictions, targets):
    """H(p, q) = -Σ p(x) log q(x)"""
    return -np.sum(targets * np.log(predictions + 1e-9))

# Ground truth: "cat" (one-hot encoding)
target = np.array([0, 1, 0, 0, 0])  # [the, cat, sat, on, mat]

# Good prediction
good_pred = np.array([0.05, 0.85, 0.03, 0.02, 0.05])
print(f"Good: {cross_entropy(good_pred, target):.4f}")  # 0.1625 (low)

# Bad prediction
bad_pred = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
print(f"Bad:  {cross_entropy(bad_pred, target):.4f}")  # 2.3026 (high)

Part 4: Information Theory — The Mathematical Foundation of LLMs

Entropy — A Measure of Uncertainty

def entropy(probs):
    """H(X) = -Σ p(x) log₂ p(x)"""
    return -np.sum(probs * np.log2(probs + 1e-9))

# Fair coin: H = 1 bit (maximum uncertainty)
fair_coin = np.array([0.5, 0.5])
print(f"Fair coin: {entropy(fair_coin):.2f} bits")  # 1.00

# Biased coin: H is less than 1 bit (predictable)
biased_coin = np.array([0.9, 0.1])
print(f"Biased coin: {entropy(biased_coin):.2f} bits")  # 0.47

# Low entropy in GPT's output → the model is confident
# Raising temperature → entropy increases → more diverse outputs
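The temperature claim can be checked directly: dividing the logits by T before softmax flattens or sharpens the distribution, and the entropy moves accordingly.

```python
import numpy as np

# Entropy of softmax(logits / T) rises with temperature T.
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def entropy(probs):
    return -np.sum(probs * np.log2(probs + 1e-9))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])  # the GPT-style logits from earlier
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: entropy = {entropy(softmax(logits / T)):.3f} bits")
```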

KL Divergence — Difference Between Two Distributions

def kl_divergence(p, q):
    """D_KL(P || Q) = Σ p(x) log(p(x) / q(x))"""
    p, q = p + 1e-9, q + 1e-9  # avoid log(0) and division by zero
    return np.sum(p * np.log(p / q))

# In VAE (Variational Autoencoder):
# Minimize KL(q(z|x) || p(z))
# = Make the encoder's output distribution close to standard normal!

# In RLHF:
# Add KL(π_new || π_ref) as a penalty
# = Prevent the new model from deviating too far from the reference model!
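KL is not symmetric — D(P||Q) ≠ D(Q||P) — which is why the argument order in the VAE and RLHF objectives matters. A two-outcome example:

```python
import numpy as np

# KL divergence is asymmetric: swapping the arguments changes the value.
def kl_divergence(p, q):
    p, q = p + 1e-9, q + 1e-9  # avoid log(0) and division by zero
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])  # fair coin
q = np.array([0.9, 0.1])  # biased coin
print(f"D(p||q) = {kl_divergence(p, q):.4f}")  # 0.5108
print(f"D(q||p) = {kl_divergence(q, p):.4f}")  # 0.3681
```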

Math → AI Mapping Summary

Math Concept          | Role in AI                | Where It Appears
----------------------|---------------------------|---------------------------
Matrix multiplication | Layer computation         | All neural networks
Cosine similarity     | Embedding comparison      | Search, RAG
SVD                   | Model compression         | LoRA, quantization
Partial derivatives   | Gradient computation      | Backpropagation
Chain rule            | Automatic differentiation | PyTorch autograd
Gradient descent      | Parameter optimization    | Adam, SGD
Softmax               | Probability transform     | Classification, Attention
Cross-entropy         | Loss function             | LLMs, classifiers
Gaussian distribution | Noise modeling            | DDPM, VAE
Bayes' theorem        | Posterior inference       | Bayesian ML
KL Divergence         | Distribution difference   | VAE, RLHF
Entropy               | Uncertainty measure       | Temperature, information content

Study Roadmap

[Week 1] Linear Algebra Fundamentals
  → Vectors, matrix multiplication, transpose, inverse
  → Implement from scratch with numpy

[Week 2] Calculus + Optimization
  → Partial derivatives, chain rule, gradient descent
  → Understand PyTorch autograd

[Week 3] Probability + Statistics
  → Conditional probability, Bayes, distributions
  → Implement softmax, cross-entropy

[Week 4] Information Theory + Practice
  → Entropy, KL-divergence
  → Find the math in nanoGPT/DDPM code

Recommended Resources

  • 3Blue1Brown (YouTube) — Intuitive visualizations of linear algebra/calculus
  • Mathematics for Machine Learning (free textbook) — Bridging math to ML
  • Andrej Karpathy's micrograd — Backpropagation from scratch
  • Stanford CS229 — Probability/statistics + ML math
  • Ian Goodfellow's Deep Learning Book — Ch.2–4 (free online)

Quiz — Math for AI (Click to reveal!)

Q1. What role does matrix multiplication play in neural networks? ||It multiplies input vectors by weight matrices to compute the next layer's output. One matrix multiplication = one layer's linear transformation||

Q2. Why is LoRA related to SVD? ||SVD decomposes a large matrix into products of smaller matrices. LoRA approximates the weight change (ΔW) as a product of two low-rank matrices (A, B), dramatically reducing parameters||

Q3. What is the mathematical foundation of backpropagation? ||The chain rule. It decomposes the derivative of a composite function into products of derivatives at each stage. Gradients propagate backward from the loss to each parameter||

Q4. What does the Softmax function do, and what is the numerical stability trick? ||Converts a real-valued vector (logits) into a probability distribution (sums to 1, all positive). Trick: subtract the maximum value from inputs to prevent exp overflow||

Q5. Why is cross-entropy a good loss function? ||When the predicted probability for the correct class approaches 1, loss approaches 0; when it approaches 0, loss approaches infinity. Strong penalties for wrong predictions make training efficient||

Q6. Why is the Gaussian distribution used in Diffusion Models? ||In the forward process, Gaussian noise is gradually added to images. Gaussians are mathematically tractable (closed-form solutions) and provide a natural noise model via the central limit theorem||

Q7. What happens to GPT output entropy when temperature is high? ||It increases. Dividing logits by T flattens the softmax distribution (closer to uniform), increasing uncertainty and generating more diverse outputs||

Q8. What is the purpose of the KL Divergence penalty in RLHF? ||It constrains the RL-updated model (π_new) from straying too far from the original model (π_ref). Prevents reward hacking and preserves existing capabilities||
