- Introduction
- Part 1: Linear Algebra — The Skeleton of AI
- Part 2: Calculus — The Engine of Learning
- Part 3: Probability and Statistics — The Language of AI
- Part 4: Information Theory — The Mathematical Foundation of LLMs
- Math → AI Mapping Summary
- Study Roadmap
- Recommended Resources
- Related Series and Recommended Posts

Introduction
"How much math do I need to study AI?"
Answer: Linear algebra + calculus + probability/statistics + optimization. These four areas let you read 90% of papers.
This is not a math textbook. It explains why this math is used in AI, with code and intuition. When building nanoGPT from scratch, or training an image generation model (DDPM) — this is where that math shows up.
Part 1: Linear Algebra — The Skeleton of AI
Why Do You Need It?
Every operation in a neural network is matrix multiplication.
```python
import numpy as np

# A single neuron = dot product
weights = np.array([0.5, -0.3, 0.8])  # weights
inputs = np.array([1.0, 2.0, 3.0])    # inputs
bias = 0.1
output = np.dot(weights, inputs) + bias
# 0.5*1.0 + (-0.3)*2.0 + 0.8*3.0 + 0.1 = 2.4

# An entire layer = matrix multiplication
W = np.random.randn(4, 3)  # maps 3 inputs to 4 neurons
x = np.random.randn(3)     # input vector
h = W @ x                  # matrix-vector product = layer output
```
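Stacking two such layers with a nonlinearity in between already gives a small MLP. A minimal sketch (layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers: 3 → 4 → 2, with ReLU in between.
# Without the nonlinearity, W2 @ (W1 @ x) would collapse into a single matrix.
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

h = np.maximum(0, W1 @ x)   # ReLU keeps the stack from collapsing
y = W2 @ h                  # final output: shape (2,)
```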
Vectors — Representing Data
```python
# Representing words as vectors (word embeddings)
king  = np.array([0.8, 0.2, 0.9, -0.5])
queen = np.array([0.7, 0.8, 0.85, -0.4])
man   = np.array([0.9, 0.1, 0.5, -0.6])
woman = np.array([0.8, 0.7, 0.45, -0.5])

# king - man + woman ≈ queen (the famous relationship!)
result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen = {queen}")
# In this toy example the two are exactly equal!

# Cosine similarity — how similar two vectors are
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(result, queen))  # ≈ 1.0 (identical!)
```
Matrix Multiplication — The Heart of Neural Networks
```python
# Transformer self-attention is also matrix multiplication!
# Q, K, V = input multiplied by weight matrices
batch_size, seq_len, d_model = 2, 10, 64
X = np.random.randn(batch_size, seq_len, d_model)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = X @ W_Q  # Query: (2, 10, 64)
K = X @ W_K  # Key:   (2, 10, 64)
V = X @ W_V  # Value: (2, 10, 64)

# Attention score = Q @ K^T / sqrt(d)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)
# scores shape: (2, 10, 10) — how much attention each token pays to every other token
```
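The scores become attention weights after a softmax, and a weighted sum over V completes the mechanism. A self-contained sketch with the same shapes as above:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, d_model = 2, 10, 64
X = rng.standard_normal((batch_size, seq_len, d_model))
W_Q = rng.standard_normal((d_model, d_model))
W_K = rng.standard_normal((d_model, d_model))
W_V = rng.standard_normal((d_model, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)

# Softmax over the last axis: each row becomes a probability distribution
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# Each output token = attention-weighted average of the value vectors
out = weights @ V  # (2, 10, 64)
```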
Eigenvalue Decomposition — PCA, SVD
```python
# PCA: finding the principal directions of data
from sklearn.decomposition import PCA

# 100-dimensional data → reduced to 2 dimensions
data = np.random.randn(1000, 100)
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# What happens internally:
# 1. Center the data, then compute the covariance matrix: C = X^T X / n
# 2. Eigenvalue decomposition: C = V Λ V^T
# 3. Select the eigenvectors with the largest eigenvalues
# → the directions of greatest variance in the data!

# SVD (Singular Value Decomposition) — used for LLM compression!
# LoRA builds on exactly this idea: approximating a large matrix with two small ones
W = np.random.randn(768, 768)  # e.g. a GPT-2 attention weight
# SVD: W = U Σ V^T
U, S, Vt = np.linalg.svd(W)

# Keep only the top r singular values for a low-rank approximation
r = 16  # rank
W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
# Original: 768×768 = 589,824 parameters
# Low-rank: 768×16 + 16×768 = 24,576 parameters (~4%!)
error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"Rank-{r} approximation error: {error:.4f}")
```
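In practice LoRA does not decompose W itself: it freezes W and learns a low-rank update ΔW = B @ A on top. A minimal sketch, following the common convention (A small random, B initialized to zero so ΔW starts at 0):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 16

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init → ΔW = B @ A starts at 0

def lora_forward(x):
    # W stays frozen; only A and B would receive gradients during training
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d)
# At init B = 0, so lora_forward(x) equals W @ x exactly
```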
Part 2: Calculus — The Engine of Learning
Why Do You Need It?
Neural network training = minimizing a loss function = computing gradients via differentiation to update parameters
Partial Derivatives
```python
# f(x, y) = x² + 2xy + y²
# ∂f/∂x = 2x + 2y (treating y as a constant)
# ∂f/∂y = 2x + 2y (treating x as a constant)
def f(x, y):
    return x**2 + 2*x*y + y**2

def df_dx(x, y):
    return 2*x + 2*y  # slope in the x direction

def df_dy(x, y):
    return 2*x + 2*y  # slope in the y direction

# Gradient vector
gradient = np.array([df_dx(3, 2), df_dy(3, 2)])
print(f"∇f(3, 2) = {gradient}")  # [10, 10]
# → moving in the opposite direction decreases the function value!
```
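A quick sanity check: the analytic partial derivative should match a finite-difference approximation.

```python
import numpy as np

def f(x, y):
    return x**2 + 2*x*y + y**2

def df_dx(x, y):
    return 2*x + 2*y

# Central difference: (f(x+h) - f(x-h)) / 2h approximates ∂f/∂x
h = 1e-5
x, y = 3.0, 2.0
numeric = (f(x + h, y) - f(x - h, y)) / (2 * h)
print(numeric, df_dx(x, y))  # both ≈ 10
```

This trick (gradient checking) is how hand-written backpropagation code is traditionally debugged.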
Chain Rule — The Mathematical Foundation of Backpropagation!
```python
# y = f(g(x)) → dy/dx = df/dg × dg/dx
# In neural networks:
#   Loss = CrossEntropy(softmax(Wx + b), target)
#   dLoss/dW = dLoss/dsoftmax × dsoftmax/d(Wx+b) × d(Wx+b)/dW
# PyTorch does this automatically!
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
W = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

# Forward
h = W @ x + b
loss = h.sum()

# Backward (chain rule applied automatically!)
loss.backward()
print(f"dLoss/dW = {W.grad}")  # automatic differentiation!
print(f"dLoss/db = {b.grad}")
print(f"dLoss/dx = {x.grad}")
```
Gradient Descent — A Blind Hiker Walking Downhill
```python
# Loss function: L(w) = (w - 3)²
# Minimum: w = 3
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

# Gradient descent
w = 10.0  # starting point (way off)
lr = 0.1  # learning rate
for step in range(20):
    g = grad(w)
    w = w - lr * g  # move in the opposite direction of the gradient!
    if step % 5 == 0:
        print(f"Step {step}: w={w:.4f}, loss={loss(w):.4f}")
# Step 0:  w=8.6000, loss=31.3600
# Step 5:  w=4.8350, loss=3.3673
# Step 10: w=3.6013, loss=0.3616
# Step 15: w=3.1970, loss=0.0388
# → w converges to 3!
```
The Importance of Learning Rate
If lr is too large (e.g. lr = 1.1 here):
w: 10 → -5.4 → 13.1 → -9.1 → ... diverges!
If lr is too small (e.g. lr = 0.01):
w: 10 → 9.86 → 9.72 → ... → reaches ~3 only after hundreds of steps
With the right lr (lr = 0.1, as above):
w: 10 → 8.6 → 7.48 → ... → ≈3.08 after 20 steps, closing in on 3
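All three regimes are easy to reproduce by re-running the loop above with different learning rates:

```python
def grad(w):
    return 2 * (w - 3)  # gradient of L(w) = (w - 3)²

def run(lr, steps=20):
    w = 10.0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

print(run(0.01))  # too small: still far from 3 after 20 steps
print(run(0.1))   # just right: close to 3
print(run(1.1))   # too large: |w| has exploded
```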
Real-World Optimizer: Adam
```python
import numpy as np

# Adam = Momentum + RMSprop (the modern deep learning default)
class Adam:
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1  # momentum (inertia)
        self.beta2 = beta2  # moving average of squared gradients
        self.eps = eps
        self.m = {id(p): 0 for p in params}  # 1st moment
        self.v = {id(p): 0 for p in params}  # 2nd moment
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for p, g in zip(params, grads):
            pid = id(p)
            # Momentum: remember the previous gradient direction
            self.m[pid] = self.beta1 * self.m[pid] + (1 - self.beta1) * g
            # Adaptive learning rate: scale by gradient magnitude
            self.v[pid] = self.beta2 * self.v[pid] + (1 - self.beta2) * g**2
            # Bias correction
            m_hat = self.m[pid] / (1 - self.beta1**self.t)
            v_hat = self.v[pid] / (1 - self.beta2**self.t)
            # Update (in place, so params must be numpy arrays)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```
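To watch the update rule actually converge, here is a self-contained scalar version of the same steps, applied to the earlier toy loss L(w) = (w - 3)²:

```python
import numpy as np

# Scalar Adam on L(w) = (w - 3)², grad = 2(w - 3)
w, m, v = 10.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = 2 * (w - 3)
    m = beta1 * m + (1 - beta1) * g       # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2    # 2nd moment (scale)
    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # converges toward 3
```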
Part 3: Probability and Statistics — The Language of AI
Why Do You Need It?
The output of AI models is almost always a probability distribution.
```python
# GPT's output = a probability distribution over the next token
logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])  # raw model output
vocab = ["the", "cat", "sat", "on", "mat"]

# Softmax: logits → probabilities
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exp_x / exp_x.sum()

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"  {word}: {p:.4f}")
# the: 0.2333, cat: 0.0858, sat: 0.0349, on: 0.0116, mat: 0.6343
# → "mat" has the highest probability!
```
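Given this distribution, generation is just sampling: greedy decoding always picks the argmax, while stochastic sampling draws in proportion to the probabilities (the values below are the softmax output from above, rounded):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
probs = np.array([0.2333, 0.0858, 0.0349, 0.0116, 0.6343])
probs = probs / probs.sum()  # renormalize against rounding error

# Greedy decoding: always the most likely token
greedy = vocab[int(np.argmax(probs))]
print(greedy)  # "mat"

# Stochastic sampling: "mat" most often, but any token is possible
rng = np.random.default_rng(0)
sampled = rng.choice(vocab, p=probs)
```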
Bayes' Theorem
```python
# P(A|B) = P(B|A) × P(A) / P(B)
# "The probability of a hypothesis given the observed data"
# Example: spam filter
# P(spam|"free") = P("free"|spam) × P(spam) / P("free")
p_free_given_spam = 0.8  # probability of "free" appearing in spam
p_spam = 0.3             # proportion of all email that is spam
p_free = 0.35            # probability of "free" appearing in any email
p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(f"P(spam|'free') = {p_spam_given_free:.2f}")  # 0.69 (69%!)
```
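Bayes' theorem also composes: the posterior after one piece of evidence becomes the prior for the next. A toy two-word update in the same spirit (the likelihoods below are invented for illustration):

```python
# Sequential Bayesian update: posterior after "free" becomes the prior for "winner"
def bayes(p_evidence_given_h, p_h, p_evidence_given_not_h):
    # P(H|E) = P(E|H) P(H) / [P(E|H) P(H) + P(E|~H) P(~H)]
    numerator = p_evidence_given_h * p_h
    return numerator / (numerator + p_evidence_given_not_h * (1 - p_h))

p = 0.3                   # prior: 30% of mail is spam
p = bayes(0.8, p, 0.15)   # saw "free"  (likelihoods invented)
p = bayes(0.6, p, 0.05)   # saw "winner"
print(f"P(spam | 'free', 'winner') = {p:.2f}")  # ≈ 0.96
```

Naive Bayes spam filters are exactly this, applied word by word under an independence assumption.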
Probability Distributions
```python
# The Gaussian (normal) distribution — the core of diffusion models!
def gaussian(x, mu=0, sigma=1):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# DDPM (image generation):
#   Forward: clean image → add Gaussian noise (gradually destroy it)
#   Reverse: noise → remove noise (a neural network learns this) → clean image!

# Noise addition process
def add_noise(image, t, noise_schedule):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon"""
    alpha_bar = noise_schedule[t]
    noise = np.random.randn(*image.shape)  # Gaussian noise
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise
```
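The noise_schedule passed in above can be built from the linear beta schedule of the original DDPM paper: alpha_bar_t is the cumulative product of (1 - beta_t).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear beta schedule (DDPM paper's values)
alphas = 1.0 - betas
noise_schedule = np.cumprod(alphas)  # alpha_bar_t, one value per timestep

# Early steps keep the image almost intact; late steps are nearly pure noise
print(noise_schedule[0])   # close to 1 → x_t ≈ original image
print(noise_schedule[-1])  # close to 0 → x_t ≈ pure Gaussian noise
```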
Cross-Entropy — The King of Loss Functions
```python
# Measures how far the model's predictions are from the ground truth
def cross_entropy(predictions, targets):
    """H(p, q) = -Σ p(x) log q(x)"""
    return -np.sum(targets * np.log(predictions + 1e-9))

# Ground truth: "cat" (one-hot encoded)
target = np.array([0, 1, 0, 0, 0])  # [the, cat, sat, on, mat]

# Good prediction
good_pred = np.array([0.05, 0.85, 0.03, 0.02, 0.05])
print(f"Good: {cross_entropy(good_pred, target):.4f}")  # 0.1625 (low)

# Bad prediction
bad_pred = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
print(f"Bad:  {cross_entropy(bad_pred, target):.4f}")  # 2.3026 (high)
```
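Cross-entropy connects directly to perplexity, the standard LLM evaluation metric: perplexity = exp(average cross-entropy per token), intuitively "how many tokens the model is effectively choosing between".

```python
import numpy as np

# Perplexity = exp(mean cross-entropy per token)
def perplexity(token_probs):
    # token_probs: the probability the model assigned to each correct token
    return np.exp(-np.mean(np.log(token_probs)))

print(perplexity([0.85]))        # ≈ 1.18 — nearly certain
print(perplexity([0.2]))         # 5.0 — like guessing among 5 tokens
print(perplexity([0.25, 0.25]))  # 4.0 — uniform over 4 choices
```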
Part 4: Information Theory — The Mathematical Foundation of LLMs
Entropy — A Measure of Uncertainty
```python
def entropy(probs):
    """H(X) = -Σ p(x) log₂ p(x)"""
    return -np.sum(probs * np.log2(probs + 1e-9))

# Fair coin: H = 1 bit (maximum uncertainty)
fair_coin = np.array([0.5, 0.5])
print(f"Fair coin:   {entropy(fair_coin):.2f} bits")  # 1.00

# Biased coin: H < 1 bit (more predictable)
biased_coin = np.array([0.9, 0.1])
print(f"Biased coin: {entropy(biased_coin):.2f} bits")  # 0.47

# Low entropy in GPT's output → the model is confident
# Raising the temperature → entropy increases → more diverse outputs
```
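Temperature scaling makes this concrete: divide the logits by T before the softmax and watch the entropy change (using the same toy logits as in Part 3).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log2(p + 1e-9))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])

entropies = {}
for T in [0.5, 1.0, 2.0]:
    p = softmax(logits / T)      # higher T → flatter distribution
    entropies[T] = entropy(p)
    print(f"T={T}: entropy = {entropies[T]:.3f} bits")
# Higher T → higher entropy → more diverse sampling
```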
KL Divergence — Difference Between Two Distributions
```python
def kl_divergence(p, q):
    """D_KL(P || Q) = Σ p(x) log(p(x) / q(x))"""
    return np.sum(p * np.log((p + 1e-9) / (q + 1e-9)))

# In a VAE (Variational Autoencoder):
#   minimize KL(q(z|x) || p(z))
#   = push the encoder's output distribution toward a standard normal!

# In RLHF:
#   add KL(π_new || π_ref) as a penalty
#   = keep the new model from drifting too far from the reference model!
```
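For the VAE case, the KL term has a closed form when both distributions are Gaussian: D_KL(N(μ, σ²) || N(0, 1)) = ½(μ² + σ² − log σ² − 1), computed per latent dimension.

```python
import numpy as np

def kl_gaussian_to_standard(mu, sigma):
    """D_KL( N(mu, sigma²) || N(0, 1) ) — the per-dimension VAE KL term"""
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1)

print(kl_gaussian_to_standard(0.0, 1.0))  # 0.0 — already standard normal
print(kl_gaussian_to_standard(2.0, 1.0))  # 2.0 — a shifted mean is penalized
```

This is why VAE encoders output (μ, σ) directly: the KL part of the loss needs no sampling at all.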
Math → AI Mapping Summary
| Math Concept | Role in AI | Where It Appears |
|---|---|---|
| Matrix multiply | Layer computation | All neural networks |
| Cosine similarity | Embedding comparison | Search, RAG |
| SVD | Model compression | LoRA, quantization |
| Partial deriv. | Gradient computation | Backpropagation |
| Chain rule | Automatic diff. | PyTorch autograd |
| Gradient descent | Parameter optimization | Adam, SGD |
| Softmax | Probability transform | Classification, Attention |
| Cross-entropy | Loss function | LLM, classifiers |
| Gaussian dist. | Noise modeling | DDPM, VAE |
| Bayes' theorem | Posterior inference | Bayesian ML |
| KL Divergence | Distribution difference | VAE, RLHF |
| Entropy | Uncertainty measure | Temperature, information |
Study Roadmap
[Week 1] Linear Algebra Fundamentals
→ Vectors, matrix multiplication, transpose, inverse
→ Implement from scratch with numpy
[Week 2] Calculus + Optimization
→ Partial derivatives, chain rule, gradient descent
→ Understand PyTorch autograd
[Week 3] Probability + Statistics
→ Conditional probability, Bayes, distributions
→ Implement softmax, cross-entropy
[Week 4] Information Theory + Practice
→ Entropy, KL-divergence
→ Find the math in nanoGPT/DDPM code
Recommended Resources
- 3Blue1Brown (YouTube) — Intuitive visualizations of linear algebra/calculus
- Mathematics for Machine Learning (free textbook) — Bridging math to ML
- Andrej Karpathy's micrograd — Backpropagation from scratch
- Stanford CS229 — Probability/statistics + ML math
- Ian Goodfellow's Deep Learning Book — Ch.2–4 (free online)
Quiz — Math for AI (Click to reveal!)
Q1. What role does matrix multiplication play in neural networks? ||It multiplies input vectors by weight matrices to compute the next layer's output. One matrix multiplication = one layer's linear transformation||
Q2. Why is LoRA related to SVD? ||SVD decomposes a large matrix into products of smaller matrices. LoRA approximates the weight change (delta W) as a product of two low-rank matrices (A, B), dramatically reducing parameters||
Q3. What is the mathematical foundation of backpropagation? ||The chain rule. It decomposes the derivative of a composite function into products of derivatives at each stage. Gradients propagate backward from the loss to each parameter||
Q4. What does the Softmax function do, and what is the numerical stability trick? ||Converts a real-valued vector (logits) into a probability distribution (sums to 1, all positive). Trick: subtract the maximum value from inputs to prevent exp overflow||
Q5. Why is cross-entropy a good loss function? ||When the predicted probability for the correct class approaches 1, loss approaches 0; when it approaches 0, loss approaches infinity. Strong penalties for wrong predictions make training efficient||
Q6. Why is the Gaussian distribution used in Diffusion Models? ||In the forward process, Gaussian noise is gradually added to images. Gaussians are mathematically tractable (closed-form solutions) and provide a natural noise model via the central limit theorem||
Q7. What happens to GPT output entropy when temperature is high? ||It increases. Dividing logits by T flattens the softmax distribution (closer to uniform), increasing uncertainty and generating more diverse outputs||
Q8. What is the purpose of the KL Divergence penalty in RLHF? ||It constrains the RL-updated model (pi_new) from straying too far from the original model (pi_ref). Prevents reward hacking and preserves existing capabilities||
Related Series and Recommended Posts
- Build Your Own GPT — nanoGPT — Where this math is actually used
- GPT Series Paper Analysis — From GPT-1 to GPT-4
- torchvision Complete Guide — CNN/ViT (linear algebra in practice)
- torchaudio Complete Guide — Fourier Transform in practice
- Security Fundamentals Guide — Cryptography (security applications of math)