Complete Math Guide for AI — From Linear Algebra to Information Theory


Introduction

"How much math do I need to study AI?"

Answer: Linear algebra + calculus + probability/statistics + optimization. These four areas let you read 90% of papers.

This is not a math textbook. It explains why this math is used in AI, with code and intuition. Whether you are building nanoGPT from scratch or training an image generation model (DDPM), this is where that math shows up.

Part 1: Linear Algebra — The Skeleton of AI

Why Do You Need It?

Every operation in a neural network is matrix multiplication.

import numpy as np

# A single neuron = dot product
weights = np.array([0.5, -0.3, 0.8])  # weights
inputs = np.array([1.0, 2.0, 3.0])     # inputs
bias = 0.1

output = np.dot(weights, inputs) + bias
# 0.5*1.0 + (-0.3)*2.0 + 0.8*3.0 + 0.1 = 2.4

# An entire layer = matrix multiplication
W = np.random.randn(4, 3)  # 3 → 4 neurons
x = np.random.randn(3)      # input vector
h = W @ x                    # matrix-vector product = layer output

Vectors — Representing Data

# Representing words as vectors (Word Embedding)
king = np.array([0.8, 0.2, 0.9, -0.5])
queen = np.array([0.7, 0.8, 0.85, -0.4])
man = np.array([0.9, 0.1, 0.5, -0.6])
woman = np.array([0.8, 0.7, 0.45, -0.5])

# king - man + woman ≈ queen (the famous relationship!)
result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen              = {queen}")
# Exactly identical: these toy vectors were constructed to make the analogy exact!

# Cosine similarity — how similar two vectors are
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(result, queen))  # 1.0 (identical in this toy example)

Matrix Multiplication — The Heart of Neural Networks

# Transformer's Self-Attention is also matrix multiplication!
# Q, K, V = input multiplied by weight matrices
batch_size, seq_len, d_model = 2, 10, 64

X = np.random.randn(batch_size, seq_len, d_model)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = X @ W_Q  # Query: (2, 10, 64)
K = X @ W_K  # Key:   (2, 10, 64)
V = X @ W_V  # Value: (2, 10, 64)

# Attention Score = Q x K^T / sqrt(d)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)
# scores shape: (2, 10, 10) — attention each token pays to other tokens
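
A single-head sketch completing the computation above (the row-wise softmax and the weighted sum over V are the missing steps; shapes and random inputs mirror the snippet, but this block is self-contained):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable: subtract the max
    return e / e.sum(axis=axis, keepdims=True)

batch_size, seq_len, d_model = 2, 10, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch_size, seq_len, d_model))
K = rng.standard_normal((batch_size, seq_len, d_model))
V = rng.standard_normal((batch_size, seq_len, d_model))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)  # (2, 10, 10)
attn = softmax(scores, axis=-1)  # each row: a probability distribution over tokens
out = attn @ V                   # (2, 10, 64): weighted mix of value vectors

print(out.shape)
```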

Eigenvalue Decomposition — PCA, SVD

# PCA: Finding the principal directions of data
from sklearn.decomposition import PCA

# 100-dimensional data → reduced to 2 dimensions
data = np.random.randn(1000, 100)
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# What happens internally:
# 1. Center X, then compute covariance matrix: C = X^T X / n
# 2. Eigenvalue decomposition: C = V Lambda V^T
# 3. Select eigenvectors corresponding to largest eigenvalues
# → The directions of greatest data variance!
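
Those three steps can be checked by hand on random data (a sketch; note that `np.linalg.eigh` returns eigenvalues in ascending order, so they are re-sorted):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 100))

# 1. Center the data, then compute the covariance matrix
X = data - data.mean(axis=0)
C = X.T @ X / (len(X) - 1)

# 2. Eigenvalue decomposition (eigh is for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(C)

# 3. Keep the eigenvectors with the largest eigenvalues
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]
reduced = X @ top2  # (1000, 2): same as pca.fit_transform up to sign flips

# Variance along each principal direction equals its eigenvalue
print(reduced.var(axis=0, ddof=1))
print(eigvals[order[:2]])
```
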

# SVD (Singular Value Decomposition) — Used for LLM compression!
# LoRA uses the same idea: approximate a large matrix with the product of 2 small ones

W = np.random.randn(768, 768)  # GPT-2's attention weight

# SVD: W = U x Sigma x V^T
U, S, Vt = np.linalg.svd(W)

# Keep only the top r values for approximation (the principle behind LoRA!)
r = 16  # rank
W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Original: 768x768 = 589,824 parameters
# LoRA: 768x16 + 16x768 = 24,576 parameters (only 4%!)
error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"Rank-{r} approximation error: {error:.4f}")

Part 2: Calculus — The Engine of Learning

Why Do You Need It?

Neural network training = minimizing a loss function = computing gradients via differentiation to update parameters

Partial Derivatives

# f(x, y) = x^2 + 2xy + y^2
# df/dx = 2x + 2y (treating y as constant)
# df/dy = 2x + 2y (treating x as constant)

def f(x, y):
    return x**2 + 2*x*y + y**2

def df_dx(x, y):
    return 2*x + 2*y  # gradient in x direction

def df_dy(x, y):
    return 2*x + 2*y  # gradient in y direction

# Gradient vector
gradient = np.array([df_dx(3, 2), df_dy(3, 2)])
print(f"nabla f(3,2) = {gradient}")  # [10, 10]
# → Moving in the opposite direction decreases the function value!
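
A handy debugging trick for any gradient: compare it against a finite-difference approximation (central differences; nothing here is specific to this particular f):

```python
import numpy as np

def f(x, y):
    return x**2 + 2*x*y + y**2

def numerical_gradient(func, x, y, h=1e-5):
    # central difference: (f(x+h) - f(x-h)) / 2h
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

print(numerical_gradient(f, 3, 2))  # ≈ [10, 10], matching the analytic gradient
```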

Chain Rule — The Mathematical Foundation of Backpropagation!

# y = f(g(x)) → dy/dx = df/dg x dg/dx

# In neural networks:
# Loss = CrossEntropy(softmax(Wx + b), target)
# dLoss/dW = dLoss/dsoftmax x dsoftmax/d(Wx+b) x d(Wx+b)/dW

# PyTorch does this automatically!
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
W = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

# Forward
h = W @ x + b
loss = h.sum()

# Backward (chain rule applied automatically!)
loss.backward()

print(f"dLoss/dW = {W.grad}")  # automatic differentiation!
print(f"dLoss/db = {b.grad}")
print(f"dLoss/dx = {x.grad}")
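
The same gradients can be re-derived by hand in NumPy (a toy check of what backward() computes: for loss = sum(Wx + b), the chain rule gives dLoss/dW[i][j] = x[j], dLoss/db[i] = 1, and dLoss/dx = W^T @ ones):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)

# Forward: loss = sum of h = W @ x + b, so dLoss/dh = ones(2)
ones = np.ones(2)
dW = np.outer(ones, x)  # every row of W receives gradient x
db = ones               # gradient of the bias
dx = W.T @ ones         # column sums of W

# Finite-difference spot check on one entry of W
eps = 1e-6
W2 = W.copy()
W2[0, 1] += eps
num = ((W2 @ x + b).sum() - (W @ x + b).sum()) / eps
print(num, dW[0, 1])  # both ≈ x[1] = 2.0
```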

Gradient Descent — A Blind Hiker Walking Downhill

# Loss function: L(w) = (w - 3)^2
# Minimum: w = 3

def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

# Gradient descent
w = 10.0  # starting point (way off)
lr = 0.1  # learning rate

for step in range(20):
    g = grad(w)
    w = w - lr * g  # move in the opposite direction of the gradient!
    if step % 5 == 0:
        print(f"Step {step}: w={w:.4f}, loss={loss(w):.4f}")

# Step 0:  w=8.6000, loss=31.3600
# Step 5:  w=4.8350, loss=3.3673
# Step 10: w=3.6013, loss=0.3616
# Step 15: w=3.1970, loss=0.0388
# → w converges to 3!

The Importance of Learning Rate

If lr is too large:
  w: 10 → -4 → 18 → -22 → ... → diverges!

If lr is too small (e.g. 0.01):
  w: 10 → 9.86 → 9.72 → ... → converges, but only after hundreds of steps

With the right lr (0.1, as above):
  w: 10 → 8.6 → 7.48 → ... → ≈3.08 after just 20 steps
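
All three regimes are easy to reproduce with the loss above (the learning rates below are illustrative picks):

```python
def run_gd(lr, steps=20, w=10.0):
    # gradient descent on L(w) = (w - 3)^2, whose gradient is 2(w - 3)
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

print(run_gd(1.1))   # huge magnitude: diverging
print(run_gd(0.01))  # ~7.67: barely moved after 20 steps
print(run_gd(0.1))   # ~3.08: almost at the minimum
```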

Real-World Optimizer: Adam

# Adam = Momentum + RMSprop (modern deep learning standard)
# Note: params are NumPy arrays, updated in place by step()
class Adam:
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1  # momentum (inertia)
        self.beta2 = beta2  # moving average of squared gradients
        self.eps = eps
        self.m = {id(p): 0 for p in params}  # 1st moment
        self.v = {id(p): 0 for p in params}  # 2nd moment
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for p, g in zip(params, grads):
            pid = id(p)
            # Momentum: remember previous gradient direction
            self.m[pid] = self.beta1 * self.m[pid] + (1 - self.beta1) * g
            # Adaptive learning rate: adjust based on gradient magnitude
            self.v[pid] = self.beta2 * self.v[pid] + (1 - self.beta2) * g**2
            # Bias correction
            m_hat = self.m[pid] / (1 - self.beta1**self.t)
            v_hat = self.v[pid] / (1 - self.beta2**self.t)
            # Update
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
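
A scalar walk-through of the same update rule, minimizing the L(w) = (w - 3)^2 loss from earlier (the hyperparameters are the usual defaults except lr, chosen here so convergence is visible quickly):

```python
import numpy as np

w, m, v, t = 10.0, 0.0, 0.0, 0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for _ in range(1000):
    t += 1
    g = 2 * (w - 3)                     # gradient of (w - 3)^2
    m = beta1 * m + (1 - beta1) * g     # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2  # 2nd moment
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # close to the minimum at 3
```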

Part 3: Probability and Statistics — The Language of AI

Why Do You Need It?

The output of AI models is almost always a probability distribution.

# GPT's output = probability distribution of the next token
logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])  # model output (raw)
vocab = ["the", "cat", "sat", "on", "mat"]

# Softmax: logits → probabilities
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # numerical stability
    return exp_x / exp_x.sum()

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"  {word}: {p:.4f}")
# the: 0.2333, cat: 0.0858, sat: 0.0349, on: 0.0116, mat: 0.6343
# → "mat" has the highest probability!

Bayes' Theorem

# P(A|B) = P(B|A) x P(A) / P(B)
# "The probability that the model is correct given the data"

# Example: Spam filter
# P(spam|"free") = P("free"|spam) x P(spam) / P("free")
p_free_given_spam = 0.8    # probability of "free" appearing in spam
p_spam = 0.3               # proportion of all emails that are spam
p_free = 0.35              # probability of "free" appearing in all emails

p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(f"P(spam|'free') = {p_spam_given_free:.2f}")  # 0.69 (69%!)

Probability Distributions

# Gaussian (Normal) Distribution — The core of Diffusion Models!
def gaussian(x, mu=0, sigma=1):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# DDPM (Image Generation):
# Forward:  clean image → add Gaussian noise (gradually destroy)
# Reverse:  noise → remove noise (neural network learns this) → clean image!

# Noise addition process
def add_noise(image, t, noise_schedule):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon"""
    alpha_bar = noise_schedule[t]
    noise = np.random.randn(*image.shape)  # Gaussian noise
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise
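
A toy run of that formula, using a hypothetical linear alpha_bar schedule (real DDPMs derive alpha_bar from a beta schedule; this is only for intuition):

```python
import numpy as np

T = 1000
alpha_bar = np.linspace(0.9999, 0.0001, T)  # ~1: clean, ~0: pure noise

rng = np.random.default_rng(0)
image = np.ones((8, 8))  # stand-in for a real image
noise = rng.standard_normal(image.shape)

for t in [0, 500, 999]:
    noisy = np.sqrt(alpha_bar[t]) * image + np.sqrt(1 - alpha_bar[t]) * noise
    # small t: noisy is almost the image; large t: almost pure noise
    print(t, np.abs(noisy - image).mean(), np.abs(noisy - noise).mean())
```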

Cross-Entropy — The King of Loss Functions

# Measures how different the model's predictions are from the ground truth
def cross_entropy(predictions, targets):
    """H(p, q) = -sum p(x) log q(x)"""
    return -np.sum(targets * np.log(predictions + 1e-9))

# Ground truth: "cat" (one-hot encoding)
target = np.array([0, 1, 0, 0, 0])  # [the, cat, sat, on, mat]

# Good prediction
good_pred = np.array([0.05, 0.85, 0.03, 0.02, 0.05])
print(f"Good: {cross_entropy(good_pred, target):.4f}")  # 0.1625 (low)

# Bad prediction
bad_pred = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
print(f"Bad:  {cross_entropy(bad_pred, target):.4f}")  # 2.3026 (high)
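
One reason this pairing is so common: the gradient of cross-entropy over softmax outputs, taken with respect to the logits, collapses to simply probs - target. A numerical check of that shortcut:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ce_from_logits(logits, target):
    return -np.sum(target * np.log(softmax(logits) + 1e-9))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])
target = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

analytic = softmax(logits) - target  # the famous shortcut

# Finite-difference gradient for comparison
numeric = np.zeros_like(logits)
eps = 1e-6
for i in range(len(logits)):
    up, dn = logits.copy(), logits.copy()
    up[i] += eps
    dn[i] -= eps
    numeric[i] = (ce_from_logits(up, target) - ce_from_logits(dn, target)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny: the two agree
```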

Part 4: Information Theory — The Mathematical Foundation of LLMs

Entropy — A Measure of Uncertainty

def entropy(probs):
    """H(X) = -sum p(x) log2 p(x)"""
    return -np.sum(probs * np.log2(probs + 1e-9))

# Fair coin: H = 1 bit (maximum uncertainty)
fair_coin = np.array([0.5, 0.5])
print(f"Fair coin: {entropy(fair_coin):.2f} bits")  # 1.00

# Biased coin: H is less than 1 bit (predictable)
biased_coin = np.array([0.9, 0.1])
print(f"Biased coin: {entropy(biased_coin):.2f} bits")  # 0.47

# Low entropy in GPT's output → the model is confident
# Raising temperature → entropy increases → more diverse outputs
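
That temperature claim is easy to demonstrate: dividing the logits by T before softmax flattens (high T) or sharpens (low T) the distribution. Toy logits, with the entropy function redefined so the snippet runs on its own:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(probs):
    return -np.sum(probs * np.log2(probs + 1e-9))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])
for T in [0.5, 1.0, 2.0]:
    probs = softmax(logits / T)
    print(f"T={T}: entropy = {entropy(probs):.3f} bits")
# entropy rises with T: sharp and confident at T=0.5, flatter at T=2.0
```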

KL Divergence — Difference Between Two Distributions

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p(x) log(p(x) / q(x))"""
    return np.sum(p * np.log((p + 1e-9) / (q + 1e-9)))

# In VAE (Variational Autoencoder):
# Minimize KL(q(z|x) || p(z))
# = Make the encoder's output distribution close to standard normal!

# In RLHF:
# Add KL(pi_new || pi_ref) as penalty
# = Prevent the new model from deviating too far from the reference model!
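
Two properties worth seeing in numbers: KL is (near) zero exactly when the distributions match, and it is not symmetric. A standalone check with toy discrete distributions (the function is redefined here so the snippet runs on its own):

```python
import numpy as np

def kl_divergence(p, q):
    return np.sum(p * np.log((p + 1e-9) / (q + 1e-9)))

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.7, 0.2, 0.1])

print(kl_divergence(p, p))  # ~0: identical distributions
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # a different positive value: KL is asymmetric
```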

Math → AI Mapping Summary

| Math Concept | Role in AI | Where It Appears |
| --- | --- | --- |
| Matrix multiply | Layer computation | All neural networks |
| Cosine similarity | Embedding comparison | Search, RAG |
| SVD | Model compression | LoRA, quantization |
| Partial derivatives | Gradient computation | Backpropagation |
| Chain rule | Automatic differentiation | PyTorch autograd |
| Gradient descent | Parameter optimization | Adam, SGD |
| Softmax | Probability transform | Classification, Attention |
| Cross-entropy | Loss function | LLM, classifiers |
| Gaussian distribution | Noise modeling | DDPM, VAE |
| Bayes' theorem | Posterior inference | Bayesian ML |
| KL Divergence | Distribution difference | VAE, RLHF |
| Entropy | Uncertainty measure | Temperature, information |

Study Roadmap

[Week 1] Linear Algebra Fundamentals
Vectors, matrix multiplication, transpose, inverse
Implement from scratch with numpy

[Week 2] Calculus + Optimization
Partial derivatives, chain rule, gradient descent
Understand PyTorch autograd

[Week 3] Probability + Statistics
Conditional probability, Bayes, distributions
Implement softmax, cross-entropy

[Week 4] Information Theory + Practice
Entropy, KL-divergence
Find the math in nanoGPT/DDPM code

Recommended Resources
  • 3Blue1Brown (YouTube) — Intuitive visualizations of linear algebra/calculus
  • Mathematics for Machine Learning (free textbook) — Bridging math to ML
  • Andrej Karpathy's micrograd — Backpropagation from scratch
  • Stanford CS229 — Probability/statistics + ML math
  • Ian Goodfellow's Deep Learning Book — Ch.2–4 (free online)

Quiz — Math for AI (Click to reveal!)

Q1. What role does matrix multiplication play in neural networks? ||It multiplies input vectors by weight matrices to compute the next layer's output. One matrix multiplication = one layer's linear transformation||

Q2. Why is LoRA related to SVD? ||SVD decomposes a large matrix into products of smaller matrices. LoRA approximates the weight change (delta W) as a product of two low-rank matrices (A, B), dramatically reducing parameters||

Q3. What is the mathematical foundation of backpropagation? ||The chain rule. It decomposes the derivative of a composite function into products of derivatives at each stage. Gradients propagate backward from the loss to each parameter||

Q4. What does the Softmax function do, and what is the numerical stability trick? ||Converts a real-valued vector (logits) into a probability distribution (sums to 1, all positive). Trick: subtract the maximum value from inputs to prevent exp overflow||

Q5. Why is cross-entropy a good loss function? ||When the predicted probability for the correct class approaches 1, loss approaches 0; when it approaches 0, loss approaches infinity. Strong penalties for wrong predictions make training efficient||

Q6. Why is the Gaussian distribution used in Diffusion Models? ||In the forward process, Gaussian noise is gradually added to images. Gaussians are mathematically tractable (closed-form solutions) and provide a natural noise model via the central limit theorem||

Q7. What happens to GPT output entropy when temperature is high? ||It increases. Dividing logits by T flattens the softmax distribution (closer to uniform), increasing uncertainty and generating more diverse outputs||

Q8. What is the purpose of the KL Divergence penalty in RLHF? ||It constrains the RL-updated model (pi_new) from straying too far from the original model (pi_ref). Prevents reward hacking and preserves existing capabilities||
