
Complete Math Guide for AI: From Linear Algebra to Information Theory

Introduction

"How much math do I need to study AI?"

Answer: **Linear algebra + calculus + probability/statistics + optimization**. Together, these four areas cover the bulk of the math in most AI papers. A short tour of information theory rounds out the guide.

This is not a math textbook. It explains **why this math is used in AI**, with code and intuition. It is the math you run into when building nanoGPT from scratch or training an image generation model such as DDPM.

Part 1: Linear Algebra — The Skeleton of AI

Why Do You Need It?

The core operation in a neural network is **matrix multiplication**; activations and normalization are applied around it.

    import numpy as np

    # A single neuron = dot product
    weights = np.array([0.5, -0.3, 0.8])  # weights
    inputs = np.array([1.0, 2.0, 3.0])    # inputs
    bias = 0.1
    output = np.dot(weights, inputs) + bias
    # 0.5*1.0 + (-0.3)*2.0 + 0.8*3.0 + 0.1 = 2.4

    # An entire layer = matrix multiplication
    W = np.random.randn(4, 3)  # 3 inputs → 4 neurons
    x = np.random.randn(3)     # input vector
    h = W @ x                  # matrix-vector product = layer output
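To connect the two snippets: each row of the weight matrix acts as one neuron's weight vector, so the matrix-vector product is four dot products computed at once. The sketch below is a minimal check of that claim; the fixed seed, the bias vector `b`, and the ReLU activation are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (assumption, for reproducibility)
W = rng.standard_normal((4, 3))  # 4 neurons, each with 3 weights
b = rng.standard_normal(4)       # one bias per neuron (assumption)
x = rng.standard_normal(3)       # input vector

# Neuron by neuron: one dot product per row of W
per_neuron = np.array([np.dot(W[i], x) + b[i] for i in range(4)])

# Whole layer at once: a single matrix-vector product
layer = W @ x + b

# ReLU nonlinearity applied elementwise (assumed activation)
activated = np.maximum(layer, 0.0)

print(np.allclose(per_neuron, layer))
```

Stacking the neurons into one matrix is what lets a GPU compute the whole layer in a single call.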

Vectors — Representing Data

    import numpy as np

    # Representing words as vectors (word embeddings); toy 4-dimensional values
    king = np.array([0.8, 0.2, 0.9, -0.5])
    queen = np.array([0.7, 0.8, 0.85, -0.4])
    man = np.array([0.9, 0.1, 0.5, -0.6])
    woman = np.array([0.8, 0.7, 0.45, -0.5])

    # king - man + woman ≈ queen (the famous relationship)
    result = king - man + woman
    print(f"king - man + woman = {result}")
    print(f"queen = {queen}")
    # With these hand-picked values the two vectors match up to floating-point rounding

    # Cosine similarity: how similar two vectors are
    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine_similarity(result, queen))  # ≈ 1.0 for these toy vectors
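In practice the analogy is answered by searching the vocabulary for the vector closest to `king - man + woman`. The sketch below does that lookup with cosine similarity over the toy embeddings; the `nearest` helper and the extra word `apple` are hypothetical additions for illustration.

```python
import numpy as np

# Toy 4-dimensional embeddings from the example, plus one unrelated word
vocab = {
    "king": np.array([0.8, 0.2, 0.9, -0.5]),
    "queen": np.array([0.7, 0.8, 0.85, -0.4]),
    "man": np.array([0.9, 0.1, 0.5, -0.6]),
    "woman": np.array([0.8, 0.7, 0.45, -0.5]),
    "apple": np.array([-0.6, 0.1, -0.2, 0.9]),  # hypothetical extra word
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(query, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query."""
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(query, candidates[w]))

query = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(query, exclude=("king", "man", "woman")))  # queen
```

The input words are excluded from the search, which is the usual convention for analogy lookups; otherwise the query tends to land on one of its own inputs.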

Matrix Multiplication — The Heart of Neural Networks

    import numpy as np

    # Transformer self-attention is also matrix multiplication
    # Q, K, V = input multiplied by weight matrices
    batch_size, seq_len, d_model = 2, 10, 64
    X = np.random.randn(batch_size, seq_len, d_model)
    W_Q = np.random.randn(d_model, d_model)
    W_K = np.random.randn(d_model, d_model)
    W_V = np.random.randn(d_model, d_model)

    Q = X @ W_Q  # Query: (2, 10, 64)
    K = X @ W_K  # Key:   (2, 10, 64)
    V = X @ W_V  # Value: (2, 10, 64)

    # Attention score = Q K^T / sqrt(d)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)
    # scores shape: (2, 10, 10), the raw score of each token against every other token
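The scores are only the first half of attention. They are then normalized with a softmax over the key axis and used to take a weighted average of the value vectors. The sketch below completes the computation for a single head without masking; the seed and the `1/sqrt(d_model)` scaling of the random weight matrices are assumptions made to keep the numbers in a moderate range.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = x - x.max(axis=axis, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch_size, seq_len, d_model = 2, 10, 64
X = rng.standard_normal((batch_size, seq_len, d_model))
# Scaled random projections (assumption, keeps scores moderate)
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)  # (2, 10, 10)
weights = softmax(scores, axis=-1)                     # each row sums to 1
output = weights @ V                                   # (2, 10, 64)

print(weights.shape, output.shape)
```

Each output row is a convex combination of the value vectors, with the mixing weights given by the corresponding row of `weights`.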

Eigenvalue Decomposition — PCA, SVD

    import numpy as np
    from sklearn.decomposition import PCA

    # PCA: finding the principal directions of data
    # 100-dimensional data → reduced to 2 dimensions
    data = np.random.randn(1000, 100)
    pca = PCA(n_components=2)
    reduced = pca.fit_transform(data)

    # What this is equivalent to:
    # 1. Center X, then compute the covariance matrix: C = X^T X / (n - 1)
    # 2. Eigenvalue decomposition: C = V Lambda V^T
    # 3. Select the eigenvectors corresponding to the largest eigenvalues
    # → the directions of greatest data variance
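The three steps can be written out directly in NumPy. This is a sketch of the eigendecomposition route, not how any particular library implements PCA; the synthetic data with unequal per-axis variance is an assumption chosen so that the leading directions are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with unequal variance per axis (assumption, for illustration)
data = rng.standard_normal((1000, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# 1. Center the data and compute the covariance matrix
X = data - data.mean(axis=0)
C = X.T @ X / (X.shape[0] - 1)

# 2. Eigendecomposition; eigh handles symmetric matrices, eigenvalues ascending
eigvals, eigvecs = np.linalg.eigh(C)

# 3. Keep the eigenvectors with the two largest eigenvalues
order = np.argsort(eigvals)[::-1][:2]
components = eigvecs[:, order]   # (5, 2)
reduced = X @ components         # (1000, 2)

# The variance of each projected coordinate equals the matching eigenvalue
print(reduced.var(axis=0, ddof=1), eigvals[order])
```

The last line makes the "greatest variance" statement concrete: projecting onto an eigenvector yields a coordinate whose variance is exactly that eigenvector's eigenvalue.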

    import numpy as np

    # SVD (Singular Value Decomposition): the basis of low-rank compression
    # LoRA relies on the same low-rank idea: it learns the weight update
    # as a product of two small matrices (it does not run SVD itself)
    W = np.random.randn(768, 768)  # random stand-in with the shape of a 768x768 GPT-2 projection

    # SVD: W = U Sigma V^T
    U, S, Vt = np.linalg.svd(W)

    # Keep only the top r singular values (the best rank-r approximation)
    r = 16  # rank
    W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

    # Original: 768x768 = 589,824 parameters
    # Two rank-16 factors: 768x16 + 16x768 = 24,576 parameters (about 4%)
    error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
    print(f"Rank-{r} approximation error: {error:.4f}")
    # A random Gaussian matrix has no low-rank structure, so this error is close to 1
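How well a rank-16 approximation works depends on whether the matrix actually has low-rank structure. The sketch below contrasts a matrix that is rank 16 by construction with a random Gaussian one, then shows the shape of a LoRA-style update. The helper `rank_r_error` and the 0.01 initialization scale are illustrative assumptions; initializing `B` to zero and `A` randomly follows the convention described in the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 16

# A matrix that really is rank 16: the product of two thin matrices
W_low = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
W_full = rng.standard_normal((d, d))  # random Gaussian, effectively full rank

def rank_r_error(W, r):
    """Relative Frobenius error of the truncated-SVD rank-r approximation."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
    return np.linalg.norm(W - W_approx) / np.linalg.norm(W)

print(f"low-rank matrix: {rank_r_error(W_low, r):.2e}")
print(f"random matrix:   {rank_r_error(W_full, r):.2f}")

# LoRA-style update: frozen W plus a trainable low-rank product B @ A
A = rng.standard_normal((r, d)) * 0.01  # trainable, r x d (scale is an assumption)
B = np.zeros((d, r))                    # trainable, d x r, zero at the start
W_adapted = W_full + B @ A              # equals W_full before any training
print(A.size + B.size)                  # 24576 trainable parameters
```

The low-rank matrix is recovered to numerical precision, while the random one is not. LoRA's bet is that the fine-tuning update is closer to the first case than the second.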

Part 2: Calculus — The Engine of Learning

Why Do You Need It?

Neural network training = **minimizing a loss function** = **computing gradients via differentiation to update parameters**

Partial Derivatives

    import numpy as np

    # f(x, y) = x^2 + 2xy + y^2
    # df/dx = 2x + 2y (treating y as constant)
    # df/dy = 2x + 2y (treating x as constant)
    def f(x, y):
        return x**2 + 2*x*y + y**2

    def df_dx(x, y):
        return 2*x + 2*y  # slope in the x direction

    def df_dy(x, y):
        return 2*x + 2*y  # slope in the y direction

    # Gradient vector
    gradient = np.array([df_dx(3, 2), df_dy(3, 2)])
    print(f"nabla f(3,2) = {gradient}")  # [10 10]
    # → The gradient points uphill; moving in the opposite direction decreases f
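A quick way to check a hand-derived gradient is to compare it against finite differences. The sketch below does that for the same function and then takes one small step against the gradient; the helper names and the step size 0.01 are assumptions for illustration.

```python
import numpy as np

def f(x, y):
    return x**2 + 2*x*y + y**2

def analytic_grad(x, y):
    return np.array([2*x + 2*y, 2*x + 2*y])

def numeric_grad(x, y, h=1e-5):
    # Central differences: (f(x + h) - f(x - h)) / (2h) in each coordinate
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

print(analytic_grad(3.0, 2.0))
print(numeric_grad(3.0, 2.0))

# One small step against the gradient lowers f
step = 0.01
x_new, y_new = np.array([3.0, 2.0]) - step * analytic_grad(3.0, 2.0)
print(f(3.0, 2.0), f(x_new, y_new))
```

The same finite-difference comparison is a common sanity test when implementing backpropagation by hand.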

Chain Rule — The Mathematical Foundation of Backpropagation!

    import torch

    # y = f(g(x)) → dy/dx = df/dg * dg/dx
    # In neural networks:
    # Loss = CrossEntropy(softmax(Wx + b), target)
    # dLoss/dW = dLoss/dsoftmax * dsoftmax/d(Wx+b) * d(Wx+b)/dW

    # PyTorch does this automatically
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    W = torch.randn(2, 3, requires_grad=True)
    b = torch.randn(2, requires_grad=True)

    # Forward
    h = W @ x + b
    loss = h.sum()

    # Backward (chain rule applied automatically)
    loss.backward()

    print(f"dLoss/dW = {W.grad}")  # computed by automatic differentiation
    print(f"dLoss/db = {b.grad}")
    print(f"dLoss/dx = {x.grad}")
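It can help to see what autograd is doing by writing the backward pass by hand once. The NumPy sketch below applies the chain rule factor by factor for a small linear layer and checks one entry against a finite difference. It uses a squared-sum loss rather than the plain sum so that the gradient depends on the forward values; that loss and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)
x = np.array([1.0, 2.0, 3.0])

def loss_fn(W, b, x):
    h = W @ x + b
    return np.sum(h**2)  # squared-sum loss (assumption)

# Backward pass by hand, one chain-rule factor at a time
h = W @ x + b
dL_dh = 2 * h               # d(sum h^2)/dh
dL_dW = np.outer(dL_dh, x)  # dh_i/dW_ij = x_j
dL_db = dL_dh               # dh/db is the identity
dL_dx = W.T @ dL_dh         # dh/dx = W

# Finite-difference check on one weight entry
eps = 1e-6
W_plus = W.copy()
W_plus[0, 1] += eps
numeric = (loss_fn(W_plus, b, x) - loss_fn(W, b, x)) / eps
print(dL_dW[0, 1], numeric)
```

Every gradient here is the upstream gradient `dL_dh` multiplied by a local derivative, which is the pattern autograd repeats at each node of the graph.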

Gradient Descent — A Blind Hiker Walking Downhill

    # Loss function: L(w) = (w - 3)^2
    # Minimum: w = 3
    def loss(w):
        return (w - 3) ** 2

    def grad(w):
        return 2 * (w - 3)

    # Gradient descent
    w = 10.0  # starting point (way off)
    lr = 0.1  # learning rate
    for step in range(20):
        g = grad(w)
        w = w - lr * g  # move in the opposite direction of the gradient
        if step % 5 == 0:
            print(f"Step {step}: w={w:.4f}, loss={loss(w):.4f}")

    # Step 0: w=8.6000, loss=31.3600
    # Step 5: w=4.8350, loss=3.3673
    # Step 10: w=3.6013, loss=0.3616
    # Step 15: w=3.1970, loss=0.0388
    # → w moves steadily toward 3 (the gap to the minimum shrinks by a factor of 0.8 per step)

The Importance of Learning Rate

For the same loss L(w) = (w - 3)^2, starting from w = 10:

    lr too large (lr = 1.5):   w: 10 → -11 → 31 → -53 → ... diverges
    lr too small (lr = 0.01):  w: 10 → 9.86 → 9.72 → ... → about 440 steps to reach 3.001
    reasonable lr (lr = 0.1):  w: 10 → 8.6 → 7.48 → ... → after 20 steps → 3.08
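For this particular loss the effect of the learning rate can be worked out exactly: each update multiplies the distance to the minimum by `1 - 2*lr`, so the iteration converges only when that factor has magnitude below 1. The sketch below runs the three regimes; `run_gd` and the specific learning rates 1.5 and 0.01 are illustrative assumptions.

```python
def run_gd(lr, steps, w0=10.0):
    """Gradient descent on L(w) = (w - 3)^2, returning the trajectory of w."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)  # gradient of (w - 3)^2 is 2 (w - 3)
        path.append(w)
    return path

for lr in (1.5, 0.01, 0.1):
    path = run_gd(lr, steps=20)
    first = [round(v, 2) for v in path[:4]]
    print(f"lr={lr}: first steps {first}, after 20 steps {path[-1]:.4f}")
```

With `lr = 1.5` the factor is -2, so the error doubles and flips sign each step. With `lr = 0.01` it is 0.98, which converges but slowly. With `lr = 0.1` it is 0.8.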

Real-World Optimizer: Adam

    import numpy as np

    # Adam = Momentum + RMSprop (the default optimizer in much of modern deep learning)
    class Adam:
        def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
            self.lr = lr
            self.beta1 = beta1  # momentum (inertia)
            self.beta2 = beta2  # moving average of squared gradients
            self.eps = eps
            self.m = {id(p): 0 for p in params}  # 1st moment
            self.v = {id(p): 0 for p in params}  # 2nd moment
            self.t = 0

        def step(self, params, grads):
            self.t += 1
            for p, g in zip(params, grads):
                pid = id(p)
                # Momentum: remember the previous gradient direction
                self.m[pid] = self.beta1 * self.m[pid] + (1 - self.beta1) * g
                # Adaptive learning rate: adjust based on gradient magnitude
                self.v[pid] = self.beta2 * self.v[pid] + (1 - self.beta2) * g**2
                # Bias correction
                m_hat = self.m[pid] / (1 - self.beta1**self.t)
                v_hat = self.v[pid] / (1 - self.beta2**self.t)
                # Update (in place, so params must be NumPy arrays)
                p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
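To see the update rule in motion, the sketch below applies the same four steps (first moment, second moment, bias correction, update) to the one-dimensional loss L(w) = (w - 3)^2. It is a compact scalar version written for illustration, not the class above. The function name `adam_minimize`, the learning rate 0.1, and the step count 500 are assumptions.

```python
import numpy as np

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run Adam on a scalar parameter and return the final value."""
    w, m, v = float(w0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g     # 1st moment (momentum)
        v = beta2 * v + (1 - beta2) * g**2  # 2nd moment (squared gradients)
        m_hat = m / (1 - beta1**t)          # bias correction
        v_hat = v / (1 - beta2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_final = adam_minimize(lambda w: 2 * (w - 3), w0=10.0)
print(f"w after 500 Adam steps: {w_final:.4f}")
```

Because the update divides by the root of the second moment, the first steps have size close to `lr` regardless of how large the raw gradient is. This is why Adam is less sensitive to gradient scale than plain gradient descent.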

Part 3: Probability and Statistics — The Language of AI

Why Do You Need It?

The output of AI models is almost always a **probability distribution**.

    import numpy as np

    # GPT's output = probability distribution over the next token
    logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])  # model output (raw)
    vocab = ["the", "cat", "sat", "on", "mat"]

    # Softmax: logits → probabilities
    def softmax(x):
        exp_x = np.exp(x - np.max(x))  # numerical stability
        return exp_x / exp_x.sum()

    probs = softmax(logits)
    for word, p in zip(vocab, probs):
        print(f"  {word}: {p:.4f}")

    # the: 0.2333, cat: 0.0858, sat: 0.0349, on: 0.0116, mat: 0.6343
    # → "mat" has the highest probability
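The max-subtraction trick matters as soon as logits get large. The sketch below compares a naive softmax with the stable version on logits around 1000, where `exp` overflows in float64. The function names and the test logits are assumptions for illustration.

```python
import numpy as np

def softmax_naive(x):
    exp_x = np.exp(x)              # overflows to inf for large logits
    return exp_x / exp_x.sum()

def softmax_stable(x):
    exp_x = np.exp(x - np.max(x))  # largest exponent is exp(0) = 1
    return exp_x / exp_x.sum()

big_logits = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow warnings so the nan result is easy to see
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(big_logits))              # nan values from inf / inf
print(softmax_stable(big_logits))
print(softmax_stable(np.array([0.0, 1.0, 2.0])))  # same probabilities
```

The last two lines print the same probabilities because softmax is unchanged when a constant is added to every logit, so subtracting the maximum is free.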

Bayes' Theorem

    # P(A|B) = P(B|A) * P(A) / P(B)
    # "The probability that the model is correct given the data"

    # Example: spam filter
    # P(spam|"free") = P("free"|spam) * P(spam) / P("free")
    p_free_given_spam = 0.8  # probability of "free" appearing in a spam email
    p_spam = 0.3             # proportion of all emails that are spam
    p_free = 0.35            # probability of "free" appearing in any email

    p_spam_given_free = (p_free_given_spam * p_spam) / p_free
    print(f"P(spam|'free') = {p_spam_given_free:.2f}")  # 0.69 (69%)
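The denominator P("free") is usually not given directly; it comes from the law of total probability over spam and non-spam ("ham") emails. The sketch below derives it that way. The value of `p_free_given_ham` is an assumption, back-solved so that the total matches the 0.35 used in the example.

```python
p_spam = 0.3
p_ham = 1 - p_spam
p_free_given_spam = 0.8
p_free_given_ham = 0.11 / 0.7  # assumed value, chosen so that P("free") = 0.35

# Law of total probability:
# P(free) = P(free|spam) P(spam) + P(free|ham) P(ham)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P('free') = {p_free:.2f}")                  # 0.35
print(f"P(spam|'free') = {p_spam_given_free:.2f}")  # 0.69
```

Seeing "free" raises the spam probability from the 30% prior to about 69%, because the word is roughly five times more likely in spam than in ham under these numbers.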

Probability Distributions

    import numpy as np

    # Gaussian (normal) distribution: the core of diffusion models
    def gaussian(x, mu=0, sigma=1):
        return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

    # DDPM (image generation):
    # Forward: clean image → add Gaussian noise (gradually destroy)
    # Reverse: noise → remove noise (the neural network learns this) → clean image

    # Noise addition process
    def add_noise(image, t, noise_schedule):
        """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon"""
        alpha_bar = noise_schedule[t]  # cumulative product of (1 - beta) up to step t
        noise = np.random.randn(*image.shape)  # Gaussian noise
        noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
        return noisy, noise
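The noise function needs a schedule of `alpha_bar` values to index into. The sketch below builds one from a linear beta schedule (1e-4 to 0.02 over 1000 steps, the setting reported in the DDPM paper) and prints how the signal and noise weights shift over time. The 8x8 random array standing in for an image and the explicit `rng` argument are assumptions for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)  # alpha_bar_t = product of (1 - beta_s), s <= t

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for a normalized image

def add_noise(image, t, noise_schedule, rng):
    alpha_bar_t = noise_schedule[t]
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bar_t) * image + np.sqrt(1 - alpha_bar_t) * noise
    return noisy, noise

for t in (0, 250, 999):
    noisy, _ = add_noise(image, t, alpha_bar, rng)
    signal_w = np.sqrt(alpha_bar[t])
    noise_w = np.sqrt(1 - alpha_bar[t])
    print(f"t={t}: signal weight {signal_w:.4f}, noise weight {noise_w:.4f}")
```

Because the squared weights sum to 1, the forward process can jump to any step `t` in one shot instead of adding noise step by step.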

Cross-Entropy — The King of Loss Functions

    import numpy as np

    # Measures how different the model's predictions are from the ground truth
    def cross_entropy(predictions, targets):
        """H(p, q) = -sum p(x) log q(x)"""
        return -np.sum(targets * np.log(predictions + 1e-9))  # 1e-9 guards against log(0)

    # Ground truth: "cat" (one-hot encoding)
    target = np.array([0, 1, 0, 0, 0])  # [the, cat, sat, on, mat]

    # Good prediction
    good_pred = np.array([0.05, 0.85, 0.03, 0.02, 0.05])
    print(f"Good: {cross_entropy(good_pred, target):.4f}")  # 0.1625 (low)

    # Bad prediction
    bad_pred = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
    print(f"Bad: {cross_entropy(bad_pred, target):.4f}")  # 2.3026 (high)
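In training code, cross-entropy is usually computed from logits rather than from probabilities, using a log-softmax for stability. The sketch below shows that form, together with the well-known result that the gradient with respect to the logits is `softmax(logits) - one_hot(target)`. The function names and the choice of "cat" as the target are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy_from_logits(logits, target_index):
    # With a one-hot target, the sum collapses to -log q(target)
    return -log_softmax(logits)[target_index]

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])
target_index = 1  # "cat"

loss = cross_entropy_from_logits(logits, target_index)

# Gradient of the loss with respect to the logits
probs = np.exp(log_softmax(logits))
one_hot = np.eye(len(logits))[target_index]
grad = probs - one_hot

print(f"loss = {loss:.4f}")
print(grad)
```

The gradient is negative only at the target index, so a gradient step raises the target logit and lowers all the others in proportion to their current probability.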

Part 4: Information Theory — The Mathematical Foundation of LLMs

Entropy — A Measure of Uncertainty

    import numpy as np

    def entropy(probs):
        """H(X) = -sum p(x) log2 p(x)"""
        return -np.sum(probs * np.log2(probs + 1e-9))  # 1e-9 guards against log(0)

    # Fair coin: H = 1 bit (maximum uncertainty)
    fair_coin = np.array([0.5, 0.5])
    print(f"Fair coin: {entropy(fair_coin):.2f} bits")  # 1.00

    # Biased coin: H is less than 1 bit (more predictable)
    biased_coin = np.array([0.9, 0.1])
    print(f"Biased coin: {entropy(biased_coin):.2f} bits")  # 0.47

    # Low entropy in GPT's output → the model is confident
    # Raising temperature → entropy increases → more diverse outputs
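The link between temperature and entropy can be checked numerically. The sketch below divides a fixed set of logits by several temperatures and reports the entropy of the resulting distribution. The logits, the temperature values, and the helper `entropy_bits` are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

def entropy_bits(probs):
    probs = probs[probs > 0]  # treat 0 * log(0) as 0
    return -np.sum(probs * np.log2(probs))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])

for T in (0.5, 1.0, 2.0, 10.0):
    probs = softmax(logits / T)  # temperature divides the logits
    print(f"T={T}: max prob {probs.max():.3f}, entropy {entropy_bits(probs):.3f} bits")
```

As `T` grows the distribution approaches uniform, whose entropy for five tokens is log2(5), about 2.32 bits. As `T` shrinks toward zero, sampling approaches greedy decoding.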

KL Divergence — Difference Between Two Distributions

    import numpy as np

    def kl_divergence(p, q):
        """D_KL(P || Q) = sum p(x) log(p(x) / q(x))"""
        return np.sum(p * np.log(p / (q + 1e-9) + 1e-9))  # 1e-9 guards against division by zero and log(0)

    # In a VAE (Variational Autoencoder):
    # Minimize KL(q(z|x) || p(z))
    # = make the encoder's output distribution close to a standard normal

    # In RLHF:
    # Add KL(pi_new || pi_ref) as a penalty
    # = prevent the new model from deviating too far from the reference model
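Two properties of KL divergence are worth seeing in numbers: it is zero only when the distributions match, and it is not symmetric. The sketch below shows both for a discrete case, then gives the closed form of KL(N(mu, sigma^2) || N(0, 1)), which is the per-dimension term in the VAE objective. The function names and the example values are assumptions for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
print(kl_divergence(p, p))                       # 0 when the distributions match
print(kl_divergence(p, q), kl_divergence(q, p))  # not symmetric

def kl_gaussian_to_standard_normal(mu, sigma):
    """Closed form of KL(N(mu, sigma^2) || N(0, 1))."""
    return 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

print(kl_gaussian_to_standard_normal(0.0, 1.0))  # 0: already standard normal
print(kl_gaussian_to_standard_normal(1.0, 0.5))
```

The asymmetry is why the argument order in the RLHF penalty matters: KL(pi_new || pi_ref) penalizes the new policy for putting probability where the reference puts little.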

Math → AI Mapping Summary

| Math Concept | Role in AI | Where It Appears |
| ------------------------- | ------------------------- | ------------------------- |
| Matrix multiplication | Layer computation | All neural networks |
| Cosine similarity | Embedding comparison | Search, RAG |
| SVD | Model compression | LoRA, quantization |
| Partial derivatives | Gradient computation | Backpropagation |
| Chain rule | Automatic differentiation | PyTorch autograd |
| Gradient descent | Parameter optimization | Adam, SGD |
| Softmax | Probability transform | Classification, Attention |
| Cross-entropy | Loss function | LLM, classifiers |
| Gaussian distribution | Noise modeling | DDPM, VAE |
| Bayes' theorem | Posterior inference | Bayesian ML |
| KL Divergence | Distribution difference | VAE, RLHF |
| Entropy | Uncertainty measure | Temperature, information |

Study Roadmap

[Week 1] Linear Algebra Fundamentals

→ Vectors, matrix multiplication, transpose, inverse

→ Implement from scratch with numpy

[Week 2] Calculus + Optimization

→ Partial derivatives, chain rule, gradient descent

→ Understand PyTorch autograd

[Week 3] Probability + Statistics

→ Conditional probability, Bayes, distributions

→ Implement softmax, cross-entropy

[Week 4] Information Theory + Practice

→ Entropy, KL-divergence

→ Find the math in nanoGPT/DDPM code

Recommended Resources

- **3Blue1Brown** (YouTube) — Intuitive visualizations of linear algebra/calculus

- **Mathematics for Machine Learning** (free textbook) — Bridging math to ML

- **Andrej Karpathy's micrograd** — Backpropagation from scratch

- **Stanford CS229** — Probability/statistics + ML math

- **Ian Goodfellow's Deep Learning Book** — Ch.2–4 (free online)

**Q1.** What role does matrix multiplication play in neural networks?

||It multiplies input vectors by weight matrices to compute the next layer's output. One matrix multiplication = one layer's linear transformation||

**Q2.** Why is LoRA related to SVD?

||SVD decomposes a large matrix into products of smaller matrices. LoRA approximates the weight change (delta W) as a product of two low-rank matrices (A, B), dramatically reducing parameters||

**Q3.** What is the mathematical foundation of backpropagation?

||The chain rule. It decomposes the derivative of a composite function into products of derivatives at each stage. Gradients propagate backward from the loss to each parameter||

**Q4.** What does the Softmax function do, and what is the numerical stability trick?

||Converts a real-valued vector (logits) into a probability distribution (sums to 1, all positive). Trick: subtract the maximum value from inputs to prevent exp overflow||

**Q5.** Why is cross-entropy a good loss function?

||When the predicted probability for the correct class approaches 1, loss approaches 0; when it approaches 0, loss approaches infinity. Strong penalties for wrong predictions make training efficient||

**Q6.** Why is the Gaussian distribution used in Diffusion Models?

||In the forward process, Gaussian noise is gradually added to images. Gaussians are mathematically tractable (closed-form solutions) and provide a natural noise model via the central limit theorem||

**Q7.** What happens to GPT output entropy when temperature is high?

||It increases. Dividing logits by T flattens the softmax distribution (closer to uniform), increasing uncertainty and generating more diverse outputs||

**Q8.** What is the purpose of the KL Divergence penalty in RLHF?

||It constrains the RL-updated model (pi_new) from straying too far from the original model (pi_ref). Prevents reward hacking and preserves existing capabilities||

Related Series and Recommended Posts

- [Build Your Own GPT — nanoGPT](/blog/ai/2026-03-03-build-your-own-gpt-from-scratch) — Where this math is actually used

- [GPT Series Paper Analysis](/blog/ai-papers/gpt_series_evolution) — From GPT-1 to GPT-4

- [torchvision Complete Guide](/blog/ai-platform/2026-03-03-torchvision-complete-guide) — CNN/ViT (linear algebra in practice)

- [torchaudio Complete Guide](/blog/ai-platform/2026-03-03-torchaudio-complete-guide) — Fourier Transform in practice

- [Security Fundamentals Guide](/blog/architecture/2026-03-03-security-fundamentals-for-developers) — Cryptography (security applications of math)

References

- [3Blue1Brown — Essence of Linear Algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) — Linear algebra intuition (must watch!)

- [3Blue1Brown — Neural Networks](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) — Neural network visualization

- [StatQuest — Machine Learning](https://www.youtube.com/@statquest) — Easy statistics/ML explanations

- [Mathematics for Machine Learning (book)](https://mml-book.github.io/) — Free textbook

