Complete Math Guide for AI — From Linear Algebra to Information Theory

Math for AI

Introduction

"How much math do I need to study AI?"

Answer: Linear algebra + calculus + probability/statistics + optimization. These four areas let you read 90% of papers.

This is not a math textbook. It explains why this math is used in AI, with code and intuition. When building nanoGPT from scratch, or training an image generation model (DDPM) — this is where that math shows up.

Part 1: Linear Algebra — The Skeleton of AI

Why Do You Need It?

Nearly every operation in a neural network boils down to matrix multiplication.

import numpy as np

# A single neuron = dot product
weights = np.array([0.5, -0.3, 0.8])  # weights
inputs = np.array([1.0, 2.0, 3.0])     # inputs
bias = 0.1

output = np.dot(weights, inputs) + bias
# 0.5*1.0 + (-0.3)*2.0 + 0.8*3.0 + 0.1 = 2.4

# An entire layer = matrix multiplication
W = np.random.randn(4, 3)  # 3 → 4 neurons
x = np.random.randn(3)      # input vector
h = W @ x                    # matrix-vector product = layer output

Vectors — Representing Data

# Representing words as vectors (Word Embedding)
king = np.array([0.8, 0.2, 0.9, -0.5])
queen = np.array([0.7, 0.8, 0.85, -0.4])
man = np.array([0.9, 0.1, 0.5, -0.6])
woman = np.array([0.8, 0.7, 0.45, -0.5])

# king - man + woman ≈ queen (the famous relationship!)
result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen              = {queen}")
# Identical! (these toy vectors were constructed so the identity holds exactly)

# Cosine similarity — how similar two vectors are
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(result, queen))  # ≈ 1.0 (exact match up to floating-point error)

Matrix Multiplication — The Heart of Neural Networks

# Transformer's Self-Attention is also matrix multiplication!
# Q, K, V = input multiplied by weight matrices
batch_size, seq_len, d_model = 2, 10, 64

X = np.random.randn(batch_size, seq_len, d_model)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = X @ W_Q  # Query: (2, 10, 64)
K = X @ W_K  # Key:   (2, 10, 64)
V = X @ W_V  # Value: (2, 10, 64)

# Attention Score = Q × K^T / √d
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)
# scores shape: (2, 10, 10) — attention each token pays to other tokens
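The score matrix is only half of attention: a softmax over the last axis turns scores into weights, and multiplying by V mixes the value vectors. A minimal sketch completing the computation (seeded random tensors stand in for real inputs):

```python
import numpy as np

# Complete scaled dot-product attention: softmax(QK^T / sqrt(d)) @ V
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerical stability
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 10, 64))
K = rng.standard_normal((2, 10, 64))
V = rng.standard_normal((2, 10, 64))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(64)  # (2, 10, 10)
attn = softmax(scores)                           # each row sums to 1
out = attn @ V                                   # (2, 10, 64): weighted mix of values
print(out.shape)
```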

Eigenvalue Decomposition — PCA, SVD

# PCA: Finding the principal directions of data
from sklearn.decomposition import PCA

# 100-dimensional data → reduced to 2 dimensions
data = np.random.randn(1000, 100)
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# What happens internally:
# 1. Compute covariance matrix: C = X^T X / n
# 2. Eigenvalue decomposition: C = V Λ V^T
# 3. Select eigenvectors corresponding to largest eigenvalues
# → The directions of greatest data variance!

# SVD (Singular Value Decomposition) — Used for LLM compression!
# LoRA is exactly this: decomposing a large matrix into 2 small matrices

W = np.random.randn(768, 768)  # GPT-2's attention weight

# SVD: W = U × Σ × V^T
U, S, Vt = np.linalg.svd(W)

# Keep only the top r values for approximation (the principle behind LoRA!)
r = 16  # rank
W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Original: 768×768 = 589,824 parameters
# LoRA: 768×16 + 16×768 = 24,576 parameters (only ~4%!)
error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"Rank-{r} approximation error: {error:.4f}")
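By the Eckart–Young theorem, keeping more singular values can only shrink the approximation error. A quick sketch sweeping the rank (the specific ranks are illustrative):

```python
import numpy as np

# Rank-r approximation error shrinks monotonically as r grows (Eckart–Young).
rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))
U, S, Vt = np.linalg.svd(W)

errs = []
for r in (16, 64, 256):
    W_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
    errs.append(np.linalg.norm(W - W_r) / np.linalg.norm(W))
    print(f"rank {r:3d}: relative error {errs[-1]:.3f}")
```

For a random Gaussian matrix the error at rank 16 stays high; real weight matrices are much closer to low-rank, which is why LoRA works as well as it does.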

Part 2: Calculus — The Engine of Learning

Why Do You Need It?

Neural network training = minimizing a loss function = computing gradients via differentiation to update parameters

Partial Derivatives

# f(x, y) = x² + 2xy + y²
# ∂f/∂x = 2x + 2y (treating y as constant)
# ∂f/∂y = 2x + 2y (treating x as constant)

def f(x, y):
    return x**2 + 2*x*y + y**2

def df_dx(x, y):
    return 2*x + 2*y  # gradient in x direction

def df_dy(x, y):
    return 2*x + 2*y  # gradient in y direction

# Gradient vector
gradient = np.array([df_dx(3, 2), df_dy(3, 2)])
print(f"∇f(3,2) = {gradient}")  # [10, 10]
# → Moving in the opposite direction decreases the function value!
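Analytic derivatives like these can always be sanity-checked numerically with central differences — the same trick used to verify backprop implementations:

```python
# Central-difference check of the analytic gradient of f(x, y) = x² + 2xy + y²
def f(x, y):
    return x**2 + 2*x*y + y**2

h = 1e-5
num_dx = (f(3 + h, 2) - f(3 - h, 2)) / (2 * h)  # ≈ ∂f/∂x at (3, 2)
num_dy = (f(3, 2 + h) - f(3, 2 - h)) / (2 * h)  # ≈ ∂f/∂y at (3, 2)
print(round(num_dx, 4), round(num_dy, 4))  # 10.0 10.0 — matches the analytic gradient
```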

Chain Rule — The Mathematical Foundation of Backpropagation!

# y = f(g(x)) → dy/dx = df/dg × dg/dx

# In neural networks:
# Loss = CrossEntropy(softmax(Wx + b), target)
# dLoss/dW = dLoss/dsoftmax × dsoftmax/d(Wx+b) × d(Wx+b)/dW

# PyTorch does this automatically!
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
W = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

# Forward
h = W @ x + b
loss = h.sum()

# Backward (chain rule applied automatically!)
loss.backward()

print(f"dLoss/dW = {W.grad}")  # automatic differentiation!
print(f"dLoss/db = {b.grad}")
print(f"dLoss/dx = {x.grad}")
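What autograd automates is exactly this factorization. It can be reproduced by hand on a tiny composite function and checked numerically (the sigmoid inner function here is just an illustrative choice):

```python
import numpy as np

# Chain rule by hand: y = sigmoid(w*x)², so
# dy/dw = dy/ds · ds/dz · dz/dw = 2s · s(1-s) · x
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, x = 0.5, 2.0
s = sigmoid(w * x)
analytic = 2 * s * s * (1 - s) * x

# Numerical check via central differences
h = 1e-6
numeric = (sigmoid((w + h) * x)**2 - sigmoid((w - h) * x)**2) / (2 * h)
print(abs(analytic - numeric) < 1e-6)  # True
```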

Gradient Descent — A Blind Hiker Walking Downhill

# Loss function: L(w) = (w - 3)²
# Minimum: w = 3

def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

# Gradient descent
w = 10.0  # starting point (way off)
lr = 0.1  # learning rate

for step in range(20):
    g = grad(w)
    w = w - lr * g  # move in the opposite direction of the gradient!
    if step % 5 == 0:
        print(f"Step {step}: w={w:.4f}, loss={loss(w):.4f}")

# Step 0:  w=8.6000, loss=31.3600
# Step 5:  w=4.8350, loss=3.3673
# Step 10: w=3.6013, loss=0.3616
# Step 15: w=3.1970, loss=0.0388
# → w converges toward 3!

The Importance of Learning Rate

If lr is too large (e.g. lr = 1.1):
  w: 10 → -5.4 → 13.1 → -9.1 → ... diverges!

If lr is too small (e.g. lr = 0.01):
  w: 10 → 9.86 → 9.72 → ... creeps toward 3 over hundreds of steps

With the right lr (e.g. lr = 0.1):
  w: 10 → 8.6 → 7.48 → ... ≈ 3.08 after 20 steps, ≈ 3.001 after 40
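The three regimes are easy to reproduce with the same toy loss (the specific lr values are illustrative):

```python
# Same descent loop as above, swept over three learning rates.
def run(lr, steps=20, w=10.0):
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # gradient of (w - 3)²
    return w

for lr in (1.1, 0.001, 0.1):
    print(f"lr={lr}: w after 20 steps = {run(lr):.4f}")
# lr=1.1 blows up, lr=0.001 barely moves, lr=0.1 lands near 3
```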

Real-World Optimizer: Adam

# Adam = Momentum + RMSprop (modern deep learning standard)
class Adam:
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1  # momentum (inertia)
        self.beta2 = beta2  # moving average of squared gradients
        self.eps = eps
        self.m = {id(p): 0 for p in params}  # 1st moment
        self.v = {id(p): 0 for p in params}  # 2nd moment
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for p, g in zip(params, grads):
            pid = id(p)
            # Momentum: remember previous gradient direction
            self.m[pid] = self.beta1 * self.m[pid] + (1 - self.beta1) * g
            # Adaptive learning rate: adjust based on gradient magnitude
            self.v[pid] = self.beta2 * self.v[pid] + (1 - self.beta2) * g**2
            # Bias correction
            m_hat = self.m[pid] / (1 - self.beta1**self.t)
            v_hat = self.v[pid] / (1 - self.beta2**self.t)
            # Update
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)  # in-place update; p must be a mutable array (e.g. np.ndarray)
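A scalar version of the same update, run on the earlier toy loss L(w) = (w - 3)², shows the two moments in action:

```python
# Inlined Adam update on L(w) = (w - 3)²; same formulas as the class above.
w, m, v = 10.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = 2 * (w - 3)                      # gradient
    m = beta1 * m + (1 - beta1) * g      # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2   # 2nd moment (squared gradients)
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (v_hat**0.5 + eps)

print(f"w after 500 Adam steps = {w:.3f}")  # ends close to 3
```

Note the early steps have size ≈ lr regardless of gradient magnitude — that is the adaptive scaling by √v̂ at work.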

Part 3: Probability and Statistics — The Language of AI

Why Do You Need It?

The output of AI models is almost always a probability distribution.

# GPT's output = probability distribution of the next token
logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])  # model output (raw)
vocab = ["the", "cat", "sat", "on", "mat"]

# Softmax: logits → probabilities
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # numerical stability
    return exp_x / exp_x.sum()

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"  {word}: {p:.4f}")
# the: 0.2333, cat: 0.0858, sat: 0.0349, on: 0.0116, mat: 0.6343
# → "mat" has the highest probability!

Bayes' Theorem

# P(A|B) = P(B|A) × P(A) / P(B)
# "The probability that the model is correct given the data"

# Example: Spam filter
# P(spam|"free") = P("free"|spam) × P(spam) / P("free")
p_free_given_spam = 0.8    # probability of "free" appearing in spam
p_spam = 0.3               # proportion of all emails that are spam
p_free = 0.35              # probability of "free" appearing in all emails

p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(f"P(spam|'free') = {p_spam_given_free:.2f}")  # 0.69 (69%!)
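The same number can be recovered by brute-force simulation. P("free"|ham) is not stated above, but the given totals imply it: 0.35 = 0.8·0.3 + x·0.7, so x ≈ 0.157 (a derived value, not from the original figures):

```python
import numpy as np

# Monte Carlo sanity check of P(spam | "free") using the numbers above.
rng = np.random.default_rng(0)
n = 1_000_000

is_spam = rng.random(n) < 0.3                 # P(spam) = 0.3
p_free_given_ham = (0.35 - 0.8 * 0.3) / 0.7   # implied by P("free") = 0.35
has_free = np.where(is_spam,
                    rng.random(n) < 0.8,      # P("free" | spam)
                    rng.random(n) < p_free_given_ham)

posterior = is_spam[has_free].mean()          # fraction of "free" mails that are spam
print(f"simulated P(spam | 'free') = {posterior:.3f}")  # ≈ 0.686
```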

Probability Distributions

# Gaussian (Normal) Distribution — The core of Diffusion Models!
def gaussian(x, mu=0, sigma=1):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# DDPM (Image Generation):
# Forward:  clean image → add Gaussian noise (gradually destroy)
# Reverse:  noise → remove noise (neural network learns this) → clean image!

# Noise addition process
def add_noise(image, t, noise_schedule):
    """x_t = √(α_bar_t) × x_0 + √(1 - α_bar_t) × ε"""
    alpha_bar = noise_schedule[t]
    noise = np.random.randn(*image.shape)  # Gaussian noise
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise
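A quick numeric check of the formula: with a toy linearly decreasing ᾱ schedule (an illustrative assumption, not DDPM's actual β schedule), early timesteps barely perturb the image while late ones bury it in noise.

```python
import numpy as np

# As alpha_bar falls from ~1 toward 0, the signal term fades and noise dominates.
def add_noise(image, t, noise_schedule):
    alpha_bar = noise_schedule[t]
    noise = np.random.randn(*image.shape)
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise

np.random.seed(0)
schedule = np.linspace(0.999, 0.01, 100)  # toy alpha_bar schedule (illustrative only)
image = np.ones((8, 8))

early, _ = add_noise(image, 0, schedule)   # t=0:  almost pure signal
late, _ = add_noise(image, 99, schedule)   # t=99: almost pure noise
print(np.abs(early - image).mean() < np.abs(late - image).mean())  # True
```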

Cross-Entropy — The King of Loss Functions

# Measures how different the model's predictions are from the ground truth
def cross_entropy(predictions, targets):
    """H(p, q) = -Σ p(x) log q(x)"""
    return -np.sum(targets * np.log(predictions + 1e-9))

# Ground truth: "cat" (one-hot encoding)
target = np.array([0, 1, 0, 0, 0])  # [the, cat, sat, on, mat]

# Good prediction
good_pred = np.array([0.05, 0.85, 0.03, 0.02, 0.05])
print(f"Good: {cross_entropy(good_pred, target):.4f}")  # 0.1625 (low)

# Bad prediction
bad_pred = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
print(f"Bad:  {cross_entropy(bad_pred, target):.4f}")  # 2.3026 (high)

Part 4: Information Theory — The Mathematical Foundation of LLMs

Entropy — A Measure of Uncertainty

def entropy(probs):
    """H(X) = -Σ p(x) log₂ p(x)"""
    return -np.sum(probs * np.log2(probs + 1e-9))

# Fair coin: H = 1 bit (maximum uncertainty)
fair_coin = np.array([0.5, 0.5])
print(f"Fair coin: {entropy(fair_coin):.2f} bits")  # 1.00

# Biased coin: H is less than 1 bit (predictable)
biased_coin = np.array([0.9, 0.1])
print(f"Biased coin: {entropy(biased_coin):.2f} bits")  # 0.47

# Low entropy in GPT's output → the model is confident
# Raising temperature → entropy increases → more diverse outputs
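The temperature claim can be checked directly: dividing the logits by T before softmax flattens or sharpens the distribution, and the entropy moves accordingly.

```python
import numpy as np

# Entropy of softmax(logits / T) rises with temperature T.
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def entropy(probs):
    return -np.sum(probs * np.log2(probs + 1e-9))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])  # the GPT-style logits from earlier
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: entropy = {entropy(softmax(logits / T)):.3f} bits")
```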

KL Divergence — Difference Between Two Distributions

def kl_divergence(p, q):
    """D_KL(P || Q) = Σ p(x) log(p(x) / q(x))"""
    p, q = p + 1e-9, q + 1e-9  # avoid log(0) and division by zero
    return np.sum(p * np.log(p / q))

# In VAE (Variational Autoencoder):
# Minimize KL(q(z|x) || p(z))
# = Make the encoder's output distribution close to standard normal!

# In RLHF:
# Add KL(π_new || π_ref) as a penalty
# = Prevent the new model from deviating too far from the reference model!
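KL is not symmetric — D(P||Q) ≠ D(Q||P) — which is why the argument order in the VAE and RLHF objectives matters. A two-outcome example:

```python
import numpy as np

# KL divergence is asymmetric: swapping the arguments changes the value.
def kl_divergence(p, q):
    p, q = p + 1e-9, q + 1e-9  # avoid log(0) and division by zero
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])  # fair coin
q = np.array([0.9, 0.1])  # biased coin
print(f"D(p||q) = {kl_divergence(p, q):.4f}")  # 0.5108
print(f"D(q||p) = {kl_divergence(q, p):.4f}")  # 0.3681
```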

Math → AI Mapping Summary

Math Concept          | Role in AI                | Where It Appears
----------------------|---------------------------|---------------------------
Matrix multiplication | Layer computation         | All neural networks
Cosine similarity     | Embedding comparison      | Search, RAG
SVD                   | Model compression         | LoRA, quantization
Partial derivatives   | Gradient computation      | Backpropagation
Chain rule            | Automatic differentiation | PyTorch autograd
Gradient descent      | Parameter optimization    | Adam, SGD
Softmax               | Probability transform     | Classification, Attention
Cross-entropy         | Loss function             | LLMs, classifiers
Gaussian distribution | Noise modeling            | DDPM, VAE
Bayes' theorem        | Posterior inference       | Bayesian ML
KL Divergence         | Distribution difference   | VAE, RLHF
Entropy               | Uncertainty measure       | Temperature, information content

Study Roadmap

[Week 1] Linear Algebra Fundamentals
  → Vectors, matrix multiplication, transpose, inverse
  → Implement from scratch with numpy

[Week 2] Calculus + Optimization
  → Partial derivatives, chain rule, gradient descent
  → Understand PyTorch autograd

[Week 3] Probability + Statistics
  → Conditional probability, Bayes, distributions
  → Implement softmax, cross-entropy

[Week 4] Information Theory + Practice
  → Entropy, KL-divergence
  → Find the math in nanoGPT/DDPM code

Recommended Resources

  • 3Blue1Brown (YouTube) — Intuitive visualizations of linear algebra/calculus
  • Mathematics for Machine Learning (free textbook) — Bridging math to ML
  • Andrej Karpathy's micrograd — Backpropagation from scratch
  • Stanford CS229 — Probability/statistics + ML math
  • Ian Goodfellow's Deep Learning Book — Ch.2–4 (free online)

Quiz — Math for AI (Click to reveal!)

Q1. What role does matrix multiplication play in neural networks? ||It multiplies input vectors by weight matrices to compute the next layer's output. One matrix multiplication = one layer's linear transformation||

Q2. Why is LoRA related to SVD? ||SVD decomposes a large matrix into products of smaller matrices. LoRA approximates the weight change (ΔW) as a product of two low-rank matrices (A, B), dramatically reducing parameters||

Q3. What is the mathematical foundation of backpropagation? ||The chain rule. It decomposes the derivative of a composite function into products of derivatives at each stage. Gradients propagate backward from the loss to each parameter||

Q4. What does the Softmax function do, and what is the numerical stability trick? ||Converts a real-valued vector (logits) into a probability distribution (sums to 1, all positive). Trick: subtract the maximum value from inputs to prevent exp overflow||

Q5. Why is cross-entropy a good loss function? ||When the predicted probability for the correct class approaches 1, loss approaches 0; when it approaches 0, loss approaches infinity. Strong penalties for wrong predictions make training efficient||

Q6. Why is the Gaussian distribution used in Diffusion Models? ||In the forward process, Gaussian noise is gradually added to images. Gaussians are mathematically tractable (closed-form solutions) and provide a natural noise model via the central limit theorem||

Q7. What happens to GPT output entropy when temperature is high? ||It increases. Dividing logits by T flattens the softmax distribution (closer to uniform), increasing uncertainty and generating more diverse outputs||

Q8. What is the purpose of the KL Divergence penalty in RLHF? ||It constrains the RL-updated model (π_new) from straying too far from the original model (π_ref). Prevents reward hacking and preserves existing capabilities||
