Introduction
"How much math do I need to study AI?"
Answer: **Linear algebra + calculus + probability/statistics + optimization**, plus a little information theory. Together these cover the math that most deep learning papers lean on.
This is not a math textbook. It explains **why this math is used in AI**, with code and intuition, and points out where each piece shows up when you build nanoGPT from scratch or train an image generation model (DDPM).
Part 1: Linear Algebra — The Skeleton of AI
Why Do You Need It?
Most of the computation in a neural network is **matrix multiplication**; the activation functions between layers are simple elementwise operations.
A single neuron = dot product
import numpy as np
weights = np.array([0.5, -0.3, 0.8]) # weights
inputs = np.array([1.0, 2.0, 3.0]) # inputs
bias = 0.1
output = np.dot(weights, inputs) + bias
0.5*1.0 + (-0.3)*2.0 + 0.8*3.0 + 0.1 = 2.4
An entire layer = matrix multiplication
W = np.random.randn(4, 3) # 3 → 4 neurons
x = np.random.randn(3) # input vector
h = W @ x # matrix-vector product = layer output
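To make the "layer = matrix multiplication" claim concrete, here is a minimal sketch that computes one layer twice: once as a loop of per-neuron dot products and once as a single matrix-vector product. The bias vector `b` and the ReLU activation are assumptions added for illustration; they are not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed so the run is reproducible

W = rng.standard_normal((4, 3))    # 4 neurons, each with 3 input weights
b = rng.standard_normal(4)         # one bias per neuron (assumed for this sketch)
x = rng.standard_normal(3)         # a single input vector

# One dot product per neuron, written as an explicit loop
h_loop = np.array([np.dot(W[i], x) + b[i] for i in range(4)])

# The same computation as one matrix-vector product
h_matmul = W @ x + b

# ReLU: an elementwise nonlinearity applied after the linear step
a = np.maximum(h_matmul, 0.0)

print(np.allclose(h_loop, h_matmul))  # True
print(a.shape)                        # (4,)
```

Row `i` of `W` holds the weights of neuron `i`, which is why the loop and the matrix product agree.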
Vectors — Representing Data
Representing words as vectors (Word Embedding)
king = np.array([0.8, 0.2, 0.9, -0.5])
queen = np.array([0.7, 0.8, 0.85, -0.4])
man = np.array([0.9, 0.1, 0.5, -0.6])
woman = np.array([0.8, 0.7, 0.45, -0.5])
king - man + woman ≈ queen (the famous relationship!)
result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen = {queen}")
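Identical here, up to floating-point rounding: these toy vectors were chosen so the analogy works exactly. Real embeddings only land near queen.
Cosine similarity: how similar two vectors are
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity(result, queen)) # ~1.00 (same direction)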
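In search and RAG systems the same comparison is run against many vectors at once. A rough sketch of that idea, reusing the toy embeddings from above: once every vector is normalized to unit length, a single matrix-vector product yields all cosine similarities. The names `E`, `E_unit`, and `q_unit` are made up for this example.

```python
import numpy as np

words = ["king", "queen", "man", "woman"]
E = np.array([
    [0.8, 0.2, 0.9, -0.5],    # king
    [0.7, 0.8, 0.85, -0.4],   # queen
    [0.9, 0.1, 0.5, -0.6],    # man
    [0.8, 0.7, 0.45, -0.5],   # woman
])

query = E[0] - E[2] + E[3]    # king - man + woman

# After normalizing to unit length, a dot product equals cosine similarity
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
q_unit = query / np.linalg.norm(query)

sims = E_unit @ q_unit        # one matrix-vector product scores every word
best = words[int(np.argmax(sims))]
print(best)                   # queen
```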
Matrix Multiplication — The Heart of Neural Networks
Transformer's Self-Attention is also matrix multiplication!
Q, K, V = input multiplied by weight matrices
batch_size, seq_len, d_model = 2, 10, 64
X = np.random.randn(batch_size, seq_len, d_model)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)
Q = X @ W_Q # Query: (2, 10, 64)
K = X @ W_K # Key: (2, 10, 64)
V = X @ W_V # Value: (2, 10, 64)
Attention Score = Q x K^T / sqrt(d)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)
scores shape: (2, 10, 10): the raw (pre-softmax) score of each token against every token in the sequence
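The scores are only the first half of attention. A minimal single-head sketch of the rest follows: softmax turns each row of scores into weights that sum to 1, and those weights mix the value vectors. Scaling the random weight matrices by `1/sqrt(d_model)` is an assumption made here to keep the numbers well-behaved; masking and multiple heads are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch_size, seq_len, d_model = 2, 10, 64

X = rng.standard_normal((batch_size, seq_len, d_model))
# Assumed initialization scale for this sketch
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)   # (2, 10, 10)
weights = softmax(scores, axis=-1)                      # each row sums to 1
out = weights @ V                                       # (2, 10, 64)

print(weights.shape, out.shape)            # (2, 10, 10) (2, 10, 64)
print(np.allclose(weights.sum(-1), 1.0))   # True
```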
Eigenvalue Decomposition — PCA, SVD
PCA: Finding the principal directions of data
from sklearn.decomposition import PCA
100-dimensional data → reduced to 2 dimensions
data = np.random.randn(1000, 100)
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
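What happens conceptually (scikit-learn itself typically gets the same directions from an SVD of the centered data):
1. Center the data, then compute the covariance matrix: C = X^T X / (n - 1)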
2. Eigenvalue decomposition: C = V Lambda V^T
3. Select eigenvectors corresponding to largest eigenvalues
→ The directions of greatest data variance!
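The steps above can be written out directly in NumPy. This is a sketch on synthetic data invented for the example (one dominant direction along roughly `[3, 1, 0]`), not what any particular library does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = rng.standard_normal(n)
# Synthetic 3-D data whose variance lies mostly along the direction [3, 1, 0]
X = np.column_stack([3.0 * t, 1.0 * t, 0.1 * rng.standard_normal(n)])
X = X + 0.05 * rng.standard_normal(X.shape)

Xc = X - X.mean(axis=0)                  # step 0: center the data
C = Xc.T @ Xc / (n - 1)                  # step 1: sample covariance, shape (3, 3)
eigvals, eigvecs = np.linalg.eigh(C)     # step 2: eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # step 3: sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
reduced = Xc @ eigvecs[:, :k]            # project onto the top-k directions
explained = eigvals[:k].sum() / eigvals.sum()

print(reduced.shape)                     # (500, 2)
print(f"variance explained by {k} components: {explained:.3f}")
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue picks the directions of greatest variance.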
SVD (Singular Value Decomposition) — Used for LLM compression!
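LoRA builds on the same low-rank idea: it represents a weight update as the product of 2 small matrices (it learns them by gradient descent rather than computing an SVD)
W = np.random.randn(768, 768) # random stand-in with the shape of a 768-dimensional projection
SVD: W = U Sigma V^T
U, S, Vt = np.linalg.svd(W)
Keep only the top r singular values for a rank-r approximation
r = 16 # rank
W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
Original: 768x768 = 589,824 parameters
Rank-16 factors: 768x16 + 16x768 = 24,576 parameters (about 4%)
error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"Rank-{r} approximation error: {error:.4f}")
Note: a random Gaussian matrix has no low-rank structure, so this error is large (most of the matrix's energy lies outside the top 16 singular values). Low-rank methods pay off when the matrix, or the update to it, is approximately low-rank.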
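Two things are worth checking by hand here. First, truncated SVD is essentially exact when the matrix really has rank r. Second, a LoRA-style layer never builds the full 768x768 update: it keeps the original weight frozen and adds the product of two small matrices. The sketch below assumes the commonly described initialization (small random `A`, zero `B`, so the update starts at zero); the names and the 0.01 scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 16

# A matrix that truly has rank r: the rank-r truncation recovers it
low_rank = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
U, S, Vt = np.linalg.svd(low_rank)
approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
rel_err = np.linalg.norm(low_rank - approx) / np.linalg.norm(low_rank)
print(f"rank-{r} error on a rank-{r} matrix: {rel_err:.2e}")

# LoRA-style forward pass: W is frozen, only A and B would be trained
W = rng.standard_normal((d, d))
A = 0.01 * rng.standard_normal((r, d))   # assumed small random init
B = np.zeros((d, r))                     # zero init, so the update starts at 0
x = rng.standard_normal(d)
h = W @ x + B @ (A @ x)                  # the d x d update B @ A is never formed

print(W.size, A.size + B.size)           # 589824 24576
```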
Part 2: Calculus — The Engine of Learning
Why Do You Need It?
Neural network training = **minimizing a loss function** = **computing gradients via differentiation to update parameters**
Partial Derivatives
f(x, y) = x^2 + 2xy + y^2
df/dx = 2x + 2y (treating y as constant)
df/dy = 2x + 2y (treating x as constant)
def f(x, y):
    return x**2 + 2*x*y + y**2
def df_dx(x, y):
    return 2*x + 2*y # gradient in x direction
def df_dy(x, y):
    return 2*x + 2*y # gradient in y direction
Gradient vector
gradient = np.array([df_dx(3, 2), df_dy(3, 2)])
print(f"nabla f(3,2) = {gradient}") # [10, 10]
→ Moving in the opposite direction decreases the function value!
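A quick way to trust a hand-derived gradient is to compare it with finite differences. The sketch below checks the partial derivatives of the same f(x, y) numerically and then confirms that a small step against the gradient lowers the function value. The step size 0.01 and `eps` are arbitrary choices for the example.

```python
import numpy as np

def f(x, y):
    return x**2 + 2*x*y + y**2

def analytic_grad(x, y):
    return np.array([2*x + 2*y, 2*x + 2*y])

def numeric_grad(x, y, eps=1e-6):
    # Central differences: nudge one variable at a time
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return np.array([dfdx, dfdy])

g = analytic_grad(3.0, 2.0)
g_num = numeric_grad(3.0, 2.0)
print(g)                                  # [10. 10.]
print(np.allclose(g, g_num, atol=1e-4))   # True

# One small step in the direction opposite to the gradient
step = 0.01
before = f(3.0, 2.0)
after = f(3.0 - step * g[0], 2.0 - step * g[1])
print(after < before)                     # True
```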
Chain Rule — The Mathematical Foundation of Backpropagation!
y = f(g(x)) → dy/dx = df/dg · dg/dx
In neural networks:
Loss = CrossEntropy(softmax(Wx + b), target)
dLoss/dW = dLoss/dsoftmax · dsoftmax/d(Wx+b) · d(Wx+b)/dW
PyTorch does this automatically!
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
W = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)
Forward
h = W @ x + b
loss = h.sum()
Backward (chain rule applied automatically!)
loss.backward()
print(f"dLoss/dW = {W.grad}") # automatic differentiation!
print(f"dLoss/db = {b.grad}")
print(f"dLoss/dx = {x.grad}")
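To see what autograd is doing, it helps to apply the chain rule by hand once. The sketch below uses plain NumPy on a slightly richer toy model than the one above: a `tanh` activation and a squared-sum loss, both chosen only for illustration. Each backward line is one factor of the chain rule, and a finite-difference check confirms one entry of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)

def forward(W, b, x):
    h = W @ x + b                  # linear step
    a = np.tanh(h)                 # nonlinearity (assumed for this sketch)
    return 0.5 * np.sum(a ** 2)    # scalar loss (assumed for this sketch)

# Backward pass by hand, one chain-rule factor per line
h = W @ x + b
a = np.tanh(h)
dL_da = a                          # d(0.5 * sum(a^2)) / da
da_dh = 1.0 - a ** 2               # derivative of tanh
dL_dh = dL_da * da_dh
dL_dW = np.outer(dL_dh, x)         # h_i = sum_j W_ij x_j + b_i
dL_db = dL_dh
dL_dx = W.T @ dL_dh

# Finite-difference check on a single weight
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (forward(W_plus, b, x) - forward(W_minus, b, x)) / (2 * eps)
print(np.isclose(numeric, dL_dW[0, 1], atol=1e-6))   # True
```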
Gradient Descent — A Blind Hiker Walking Downhill
Loss function: L(w) = (w - 3)^2
Minimum: w = 3
def loss(w):
    return (w - 3) ** 2
def grad(w):
    return 2 * (w - 3)
Gradient descent
w = 10.0 # starting point (way off)
lr = 0.1 # learning rate
for step in range(20):
    g = grad(w)
    w = w - lr * g # move in the opposite direction of the gradient!
    if step % 5 == 0:
        print(f"Step {step}: w={w:.4f}, loss={loss(w):.4f}")
Step 0: w=8.6000, loss=31.3600
Step 5: w=4.8350, loss=3.3673
Step 10: w=3.6013, loss=0.3616
Step 15: w=3.1970, loss=0.0388
→ w is converging toward 3 (after all 20 steps, w ≈ 3.08)
The Importance of Learning Rate
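If lr is too large (e.g. lr = 1.1):
w: 10 → -5.4 → 13.08 → -9.10 → ... the distance from 3 grows by 1.2x per step, so it diverges!
If lr is too small (e.g. lr = 0.01):
w: 10 → 9.86 → 9.72 → ... → about 440 steps to reach 3.001
With the right lr (lr = 0.1):
w: 10 → 8.6 → 7.48 → ... → 3.08 after 20 steps, 3.001 after about 40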
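These regimes are easy to reproduce. The sketch below runs plain gradient descent on L(w) = (w - 3)^2 for 20 steps with three learning rates; the specific values 0.01, 0.1, and 1.1 are picked for illustration. For this loss each step multiplies the distance to the minimum by (1 - 2·lr), so anything with |1 - 2·lr| > 1 diverges.

```python
def grad(w):
    return 2 * (w - 3)

def run(lr, steps, w=10.0):
    # Plain gradient descent on L(w) = (w - 3)^2
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {run(lr, 20):.2f}")
# lr=0.01 gives about 7.67 (barely moved)
# lr=0.1  gives about 3.08 (close to the minimum)
# lr=1.1  gives about 271.36 (diverging)
```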
Real-World Optimizer: Adam
Adam = Momentum + RMSprop (modern deep learning standard)
class Adam:
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1 # momentum (inertia)
        self.beta2 = beta2 # moving average of squared gradients
        self.eps = eps
        self.m = {id(p): 0 for p in params} # 1st moment
        self.v = {id(p): 0 for p in params} # 2nd moment
        self.t = 0
    def step(self, params, grads):
        self.t += 1
        for p, g in zip(params, grads):
            pid = id(p)
            # Momentum: remember previous gradient direction
            self.m[pid] = self.beta1 * self.m[pid] + (1 - self.beta1) * g
            # Adaptive learning rate: adjust based on gradient magnitude
            self.v[pid] = self.beta2 * self.v[pid] + (1 - self.beta2) * g**2
            # Bias correction (m and v start at 0, so early averages are too small)
            m_hat = self.m[pid] / (1 - self.beta1**self.t)
            v_hat = self.v[pid] / (1 - self.beta2**self.t)
            # Update in place: each p must be a NumPy float array, not a Python float
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
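To watch Adam work, here is a compact functional sketch of the same update rule applied to the toy loss L(w) = (w - 3)^2. The function name `adam_minimize`, the learning rate of 0.1, and the 500-step budget are choices made for this example, not recommended defaults.

```python
import numpy as np

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)                       # 1st moment (running mean of gradients)
    v = np.zeros_like(w)                       # 2nd moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)             # bias correction
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_final = adam_minimize(lambda w: 2 * (w - 3), w0=[10.0])
print(w_final)   # close to 3
```

Early on, each step has size close to `lr` regardless of how large the gradient is, because the update divides by the gradient's running magnitude.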
Part 3: Probability and Statistics — The Language of AI
Why Do You Need It?
The output of AI models is almost always a **probability distribution**.
GPT's output = probability distribution of the next token
logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0]) # model output (raw)
vocab = ["the", "cat", "sat", "on", "mat"]
Softmax: logits → probabilities
def softmax(x):
    exp_x = np.exp(x - np.max(x)) # numerical stability
    return exp_x / exp_x.sum()
probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f" {word}: {p:.4f}")
the: 0.2333, cat: 0.0858, sat: 0.0349, on: 0.0116, mat: 0.6343
→ "mat" has the highest probability!
Bayes' Theorem
P(A|B) = P(B|A) · P(A) / P(B)
"How likely is hypothesis A once we have observed evidence B?"
Example: Spam filter
P(spam|"free") = P("free"|spam) · P(spam) / P("free")
p_free_given_spam = 0.8 # probability of "free" appearing in spam
p_spam = 0.3 # proportion of all emails that are spam
p_free = 0.35 # probability of "free" appearing in all emails
p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(f"P(spam|'free') = {p_spam_given_free:.2f}") # 0.69 (69%!)
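In practice P("free") is rarely given directly; it comes from the law of total probability over spam and non-spam mail. The sketch below shows that step. The value of `p_free_given_ham` is an assumption, chosen so that the total matches the 0.35 used above.

```python
p_spam = 0.3
p_ham = 1 - p_spam
p_free_given_spam = 0.8
p_free_given_ham = 0.11 / 0.7    # assumed, so that P("free") comes out at 0.35

# Law of total probability: sum over both ways "free" can appear
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(f"P('free') = {p_free:.2f}")                   # 0.35
print(f"P(spam|'free') = {p_spam_given_free:.2f}")   # 0.69
```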
Probability Distributions
Gaussian (Normal) Distribution — The core of Diffusion Models!
def gaussian(x, mu=0, sigma=1):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
DDPM (Image Generation):
Forward: clean image → add Gaussian noise (gradually destroy)
Reverse: noise → remove noise (neural network learns this) → clean image!
Noise addition process
def add_noise(image, t, noise_schedule):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon"""
    alpha_bar = noise_schedule[t]
    noise = np.random.randn(*image.shape) # Gaussian noise
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise
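The function above takes `noise_schedule` as given. One way to build it, sketched here, is the linear beta schedule described in the DDPM paper (1000 steps, beta from 1e-4 to 0.02), with alpha_bar as the running product of (1 - beta). The random array standing in for an image and the explicit `rng` argument are assumptions of this example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear schedule, as in the DDPM paper
alphas = 1.0 - betas
noise_schedule = np.cumprod(alphas)       # alpha_bar_t for t = 0 .. T-1

def add_noise(image, t, noise_schedule, rng):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon"""
    alpha_bar = noise_schedule[t]
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(32, 32))   # stand-in for a normalized image

for t in (0, 500, 999):
    noisy, _ = add_noise(image, t, noise_schedule, rng)
    signal_weight = np.sqrt(noise_schedule[t])
    print(f"t={t}: signal weight = {signal_weight:.4f}")
```

The signal weight starts near 1 and ends near 0, so the last step is almost pure Gaussian noise. That is what lets the reverse process start from random noise.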
Cross-Entropy — The King of Loss Functions
Measures how different the model's predictions are from the ground truth
def cross_entropy(predictions, targets):
    """H(p, q) = -sum p(x) log q(x)"""
    return -np.sum(targets * np.log(predictions + 1e-9))
Ground truth: "cat" (one-hot encoding)
target = np.array([0, 1, 0, 0, 0]) # [the, cat, sat, on, mat]
Good prediction
good_pred = np.array([0.05, 0.85, 0.03, 0.02, 0.05])
print(f"Good: {cross_entropy(good_pred, target):.4f}") # 0.1625 (low)
Bad prediction
bad_pred = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
print(f"Bad: {cross_entropy(bad_pred, target):.4f}") # 2.3026 (high)
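Training code usually computes cross-entropy straight from logits rather than from probabilities, using a log-softmax that avoids taking the log of tiny numbers. A minimal sketch follows, with helper names invented for the example. It also shows a result worth remembering: the gradient of this loss with respect to the logits is simply `softmax(logits) - one_hot(target)`.

```python
import numpy as np

def log_softmax(logits):
    z = logits - np.max(logits)              # shift for numerical stability
    return z - np.log(np.sum(np.exp(z)))

def cross_entropy_from_logits(logits, target_index):
    return -log_softmax(logits)[target_index]

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])
target_index = 1                             # "cat"

loss = cross_entropy_from_logits(logits, target_index)
probs = np.exp(log_softmax(logits))
one_hot = np.eye(len(logits))[target_index]
grad = probs - one_hot                       # d(loss) / d(logits)

print(f"loss = {loss:.4f}")                  # 2.4552
print(np.round(grad, 4))
```

Only the target entry of the gradient is negative, so a gradient step raises the target logit and lowers all the others.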
Part 4: Information Theory — The Mathematical Foundation of LLMs
Entropy — A Measure of Uncertainty
def entropy(probs):
    """H(X) = -sum p(x) log2 p(x)"""
    return -np.sum(probs * np.log2(probs + 1e-9))
Fair coin: H = 1 bit (maximum uncertainty)
fair_coin = np.array([0.5, 0.5])
print(f"Fair coin: {entropy(fair_coin):.2f} bits") # 1.00
Biased coin: H is less than 1 bit (predictable)
biased_coin = np.array([0.9, 0.1])
print(f"Biased coin: {entropy(biased_coin):.2f} bits") # 0.47
Low entropy in GPT's output → the model is confident
Raising temperature → entropy increases → more diverse outputs
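The link between temperature and entropy can be checked numerically. The sketch below divides a fixed set of logits by several temperatures T and reports the entropy of the resulting distribution. The logits and the temperature values are arbitrary examples.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def entropy_bits(p):
    p = p[p > 0]                       # skip zero entries to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])
entropies = []
for T in (0.5, 1.0, 2.0, 10.0):
    p = softmax(logits / T)            # temperature scaling
    entropies.append(entropy_bits(p))
    print(f"T={T}: entropy = {entropies[-1]:.3f} bits, max prob = {p.max():.3f}")

print(f"upper bound log2(5) = {np.log2(5):.3f} bits")
```

As T grows the distribution approaches uniform and the entropy approaches log2(5); as T shrinks toward 0 it approaches a one-hot choice of the largest logit.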
KL Divergence — Difference Between Two Distributions
def kl_divergence(p, q):
    """D_KL(P || Q) = sum p(x) log(p(x) / q(x))"""
    return np.sum(p * np.log(p / (q + 1e-9) + 1e-9))
In VAE (Variational Autoencoder):
Minimize KL(q(z|x) || p(z))
= Make the encoder's output distribution close to standard normal!
In RLHF:
Add KL(pi_new || pi_ref) as penalty
= Prevent the new model from deviating too far from the reference model!
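Two properties of KL divergence matter in both use cases. For a diagonal Gaussian against a standard normal, which is the VAE case, it has a closed form. It is also not symmetric, so the order of the arguments matters for the RLHF penalty. A small sketch of both, with function names made up for the example:

```python
import numpy as np

def kl_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2)))

def kl_discrete(p, q):
    """KL(P || Q) in nats for discrete distributions with q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# VAE case: zero when the encoder output already is a standard normal
print(kl_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0]))   # 0.0
print(kl_gaussian_to_standard_normal([1.0], [1.0]))             # 0.5

# Asymmetry: KL(P || Q) and KL(Q || P) differ
p, q = [0.9, 0.1], [0.5, 0.5]
print(f"{kl_discrete(p, q):.4f} vs {kl_discrete(q, p):.4f}")    # 0.3681 vs 0.5108
```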
Math → AI Mapping Summary
| Math Concept | Role in AI | Where It Appears |
| ----------------- | ----------------------- | ------------------------- |
| Matrix multiply | Layer computation | All neural networks |
| Cosine similarity | Embedding comparison | Search, RAG |
| SVD               | Low-rank approximation  | PCA, LoRA-style updates   |
| Partial deriv. | Gradient computation | Backpropagation |
| Chain rule | Automatic diff. | PyTorch autograd |
| Gradient descent | Parameter optimization | Adam, SGD |
| Softmax | Probability transform | Classification, Attention |
| Cross-entropy | Loss function | LLM, classifiers |
| Gaussian dist. | Noise modeling | DDPM, VAE |
| Bayes' theorem | Posterior inference | Bayesian ML |
| KL Divergence | Distribution difference | VAE, RLHF |
| Entropy | Uncertainty measure | Temperature, information |
Study Roadmap
[Week 1] Linear Algebra Fundamentals
→ Vectors, matrix multiplication, transpose, inverse
→ Implement from scratch with numpy
[Week 2] Calculus + Optimization
→ Partial derivatives, chain rule, gradient descent
→ Understand PyTorch autograd
[Week 3] Probability + Statistics
→ Conditional probability, Bayes, distributions
→ Implement softmax, cross-entropy
[Week 4] Information Theory + Practice
→ Entropy, KL-divergence
→ Find the math in nanoGPT/DDPM code
Recommended Resources
- **3Blue1Brown** (YouTube) — Intuitive visualizations of linear algebra/calculus
- **Mathematics for Machine Learning** (free textbook) — Bridging math to ML
- **Andrej Karpathy's micrograd** — Backpropagation from scratch
- **Stanford CS229** — Probability/statistics + ML math
- **Ian Goodfellow's Deep Learning Book** — Ch.2–4 (free online)
**Q1.** What role does matrix multiplication play in neural networks?
||It multiplies input vectors by weight matrices to compute the next layer's output. One matrix multiplication = one layer's linear transformation||
**Q2.** Why is LoRA related to SVD?
||SVD decomposes a large matrix into products of smaller matrices. LoRA approximates the weight change (delta W) as a product of two low-rank matrices (A, B), dramatically reducing parameters||
**Q3.** What is the mathematical foundation of backpropagation?
||The chain rule. It decomposes the derivative of a composite function into products of derivatives at each stage. Gradients propagate backward from the loss to each parameter||
**Q4.** What does the Softmax function do, and what is the numerical stability trick?
||Converts a real-valued vector (logits) into a probability distribution (sums to 1, all positive). Trick: subtract the maximum value from inputs to prevent exp overflow||
**Q5.** Why is cross-entropy a good loss function?
||When the predicted probability for the correct class approaches 1, loss approaches 0; when it approaches 0, loss approaches infinity. Strong penalties for wrong predictions make training efficient||
**Q6.** Why is the Gaussian distribution used in Diffusion Models?
||In the forward process, Gaussian noise is gradually added to images. Gaussians are mathematically tractable (closed-form solutions) and provide a natural noise model via the central limit theorem||
**Q7.** What happens to GPT output entropy when temperature is high?
||It increases. Dividing logits by T flattens the softmax distribution (closer to uniform), increasing uncertainty and generating more diverse outputs||
**Q8.** What is the purpose of the KL Divergence penalty in RLHF?
||It constrains the RL-updated model (pi_new) from straying too far from the original model (pi_ref). Prevents reward hacking and preserves existing capabilities||
Related Series and Recommended Posts
- [Build Your Own GPT — nanoGPT](/blog/ai/2026-03-03-build-your-own-gpt-from-scratch) — Where this math is actually used
- [GPT Series Paper Analysis](/blog/ai-papers/gpt_series_evolution) — From GPT-1 to GPT-4
- [torchvision Complete Guide](/blog/ai-platform/2026-03-03-torchvision-complete-guide) — CNN/ViT (linear algebra in practice)
- [torchaudio Complete Guide](/blog/ai-platform/2026-03-03-torchaudio-complete-guide) — Fourier Transform in practice
- [Security Fundamentals Guide](/blog/architecture/2026-03-03-security-fundamentals-for-developers) — Cryptography (security applications of math)
References
- [3Blue1Brown — Essence of Linear Algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) — Linear algebra intuition (must watch!)
- [3Blue1Brown — Neural Networks](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) — Neural network visualization
- [StatQuest — Machine Learning](https://www.youtube.com/@statquest) — Easy statistics/ML explanations
- [Mathematics for Machine Learning (book)](https://mml-book.github.io/) — Free textbook