Complete Math Guide for AI — From Linear Algebra to Information Theory


Introduction

"How much math do I need to study AI?"

Answer: Linear algebra + calculus + probability/statistics + optimization. These four areas let you read 90% of papers.

This is not a math textbook. It explains why this math is used in AI, with code and intuition. Whether you are building nanoGPT from scratch or training an image generation model (DDPM), this is where that math shows up.

Part 1: Linear Algebra — The Skeleton of AI

Why Do You Need It?

Every operation in a neural network is matrix multiplication.

import numpy as np

# A single neuron = dot product
weights = np.array([0.5, -0.3, 0.8])  # weights
inputs = np.array([1.0, 2.0, 3.0])     # inputs
bias = 0.1

output = np.dot(weights, inputs) + bias
# 0.5*1.0 + (-0.3)*2.0 + 0.8*3.0 + 0.1 = 2.4

# An entire layer = matrix multiplication
W = np.random.randn(4, 3)  # 3 → 4 neurons
x = np.random.randn(3)      # input vector
h = W @ x                    # matrix-vector product = layer output

Vectors — Representing Data

# Representing words as vectors (Word Embedding)
king = np.array([0.8, 0.2, 0.9, -0.5])
queen = np.array([0.7, 0.8, 0.85, -0.4])
man = np.array([0.9, 0.1, 0.5, -0.6])
woman = np.array([0.8, 0.7, 0.45, -0.5])

# king - man + woman ≈ queen (the famous relationship!)
result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen              = {queen}")
# Exactly identical: these toy vectors were constructed to make the analogy exact!

# Cosine similarity — how similar two vectors are
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(result, queen))  # 1.0 (identical in this toy example)

Matrix Multiplication — The Heart of Neural Networks

# Transformer's Self-Attention is also matrix multiplication!
# Q, K, V = input multiplied by weight matrices
batch_size, seq_len, d_model = 2, 10, 64

X = np.random.randn(batch_size, seq_len, d_model)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = X @ W_Q  # Query: (2, 10, 64)
K = X @ W_K  # Key:   (2, 10, 64)
V = X @ W_V  # Value: (2, 10, 64)

# Attention Score = Q x K^T / sqrt(d)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)
# scores shape: (2, 10, 10) — attention each token pays to other tokens
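
A single-head sketch completing the computation above (the row-wise softmax and the weighted sum over V are the missing steps; shapes and random inputs mirror the snippet, but this block is self-contained):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable: subtract the max
    return e / e.sum(axis=axis, keepdims=True)

batch_size, seq_len, d_model = 2, 10, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch_size, seq_len, d_model))
K = rng.standard_normal((batch_size, seq_len, d_model))
V = rng.standard_normal((batch_size, seq_len, d_model))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)  # (2, 10, 10)
attn = softmax(scores, axis=-1)  # each row: a probability distribution over tokens
out = attn @ V                   # (2, 10, 64): weighted mix of value vectors

print(out.shape)
```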

Eigenvalue Decomposition — PCA, SVD

# PCA: Finding the principal directions of data
from sklearn.decomposition import PCA

# 100-dimensional data → reduced to 2 dimensions
data = np.random.randn(1000, 100)
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# What happens internally:
# 1. Center X, then compute covariance matrix: C = X^T X / n
# 2. Eigenvalue decomposition: C = V Lambda V^T
# 3. Select eigenvectors corresponding to largest eigenvalues
# → The directions of greatest data variance!
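
Those three steps can be checked by hand on random data (a sketch; note that `np.linalg.eigh` returns eigenvalues in ascending order, so they are re-sorted):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 100))

# 1. Center the data, then compute the covariance matrix
X = data - data.mean(axis=0)
C = X.T @ X / (len(X) - 1)

# 2. Eigenvalue decomposition (eigh is for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(C)

# 3. Keep the eigenvectors with the largest eigenvalues
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]
reduced = X @ top2  # (1000, 2): same as pca.fit_transform up to sign flips

# Variance along each principal direction equals its eigenvalue
print(reduced.var(axis=0, ddof=1))
print(eigvals[order[:2]])
```
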

# SVD (Singular Value Decomposition) — Used for LLM compression!
# LoRA uses the same idea: approximate a large matrix with the product of 2 small ones

W = np.random.randn(768, 768)  # GPT-2's attention weight

# SVD: W = U x Sigma x V^T
U, S, Vt = np.linalg.svd(W)

# Keep only the top r values for approximation (the principle behind LoRA!)
r = 16  # rank
W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Original: 768x768 = 589,824 parameters
# LoRA: 768x16 + 16x768 = 24,576 parameters (only 4%!)
error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"Rank-{r} approximation error: {error:.4f}")

Part 2: Calculus — The Engine of Learning

Why Do You Need It?

Neural network training = minimizing a loss function = computing gradients via differentiation to update parameters

Partial Derivatives

# f(x, y) = x^2 + 2xy + y^2
# df/dx = 2x + 2y (treating y as constant)
# df/dy = 2x + 2y (treating x as constant)

def f(x, y):
    return x**2 + 2*x*y + y**2

def df_dx(x, y):
    return 2*x + 2*y  # gradient in x direction

def df_dy(x, y):
    return 2*x + 2*y  # gradient in y direction

# Gradient vector
gradient = np.array([df_dx(3, 2), df_dy(3, 2)])
print(f"nabla f(3,2) = {gradient}")  # [10, 10]
# → Moving in the opposite direction decreases the function value!
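
A handy debugging trick for any gradient: compare it against a finite-difference approximation (central differences; nothing here is specific to this particular f):

```python
import numpy as np

def f(x, y):
    return x**2 + 2*x*y + y**2

def numerical_gradient(func, x, y, h=1e-5):
    # central difference: (f(x+h) - f(x-h)) / 2h
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

print(numerical_gradient(f, 3, 2))  # ≈ [10, 10], matching the analytic gradient
```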

Chain Rule — The Mathematical Foundation of Backpropagation!

# y = f(g(x)) → dy/dx = df/dg x dg/dx

# In neural networks:
# Loss = CrossEntropy(softmax(Wx + b), target)
# dLoss/dW = dLoss/dsoftmax x dsoftmax/d(Wx+b) x d(Wx+b)/dW

# PyTorch does this automatically!
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
W = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

# Forward
h = W @ x + b
loss = h.sum()

# Backward (chain rule applied automatically!)
loss.backward()

print(f"dLoss/dW = {W.grad}")  # automatic differentiation!
print(f"dLoss/db = {b.grad}")
print(f"dLoss/dx = {x.grad}")
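
The same gradients can be re-derived by hand in NumPy (a toy check of what backward() computes: for loss = sum(Wx + b), the chain rule gives dLoss/dW[i][j] = x[j], dLoss/db[i] = 1, and dLoss/dx = W^T @ ones):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)

# Forward: loss = sum of h = W @ x + b, so dLoss/dh = ones(2)
ones = np.ones(2)
dW = np.outer(ones, x)  # every row of W receives gradient x
db = ones               # gradient of the bias
dx = W.T @ ones         # column sums of W

# Finite-difference spot check on one entry of W
eps = 1e-6
W2 = W.copy()
W2[0, 1] += eps
num = ((W2 @ x + b).sum() - (W @ x + b).sum()) / eps
print(num, dW[0, 1])  # both ≈ x[1] = 2.0
```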

Gradient Descent — A Blind Hiker Walking Downhill

# Loss function: L(w) = (w - 3)^2
# Minimum: w = 3

def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

# Gradient descent
w = 10.0  # starting point (way off)
lr = 0.1  # learning rate

for step in range(20):
    g = grad(w)
    w = w - lr * g  # move in the opposite direction of the gradient!
    if step % 5 == 0:
        print(f"Step {step}: w={w:.4f}, loss={loss(w):.4f}")

# Step 0:  w=8.6000, loss=31.3600
# Step 5:  w=4.8350, loss=3.3673
# Step 10: w=3.6013, loss=0.3616
# Step 15: w=3.1970, loss=0.0388
# → w converges to 3!

The Importance of Learning Rate

If lr is too large:
  w: 10 → -4 → 18 → -22 → ... → diverges!

If lr is too small (e.g. 0.01):
  w: 10 → 9.86 → 9.72 → ... → converges, but only after hundreds of steps

With the right lr (0.1, as above):
  w: 10 → 8.6 → 7.48 → ... → ≈3.08 after just 20 steps
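
All three regimes are easy to reproduce with the loss above (the learning rates below are illustrative picks):

```python
def run_gd(lr, steps=20, w=10.0):
    # gradient descent on L(w) = (w - 3)^2, whose gradient is 2(w - 3)
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

print(run_gd(1.1))   # huge magnitude: diverging
print(run_gd(0.01))  # ~7.67: barely moved after 20 steps
print(run_gd(0.1))   # ~3.08: almost at the minimum
```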

Real-World Optimizer: Adam

# Adam = Momentum + RMSprop (modern deep learning standard)
# Note: params are NumPy arrays, updated in place by step()
class Adam:
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1  # momentum (inertia)
        self.beta2 = beta2  # moving average of squared gradients
        self.eps = eps
        self.m = {id(p): 0 for p in params}  # 1st moment
        self.v = {id(p): 0 for p in params}  # 2nd moment
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for p, g in zip(params, grads):
            pid = id(p)
            # Momentum: remember previous gradient direction
            self.m[pid] = self.beta1 * self.m[pid] + (1 - self.beta1) * g
            # Adaptive learning rate: adjust based on gradient magnitude
            self.v[pid] = self.beta2 * self.v[pid] + (1 - self.beta2) * g**2
            # Bias correction
            m_hat = self.m[pid] / (1 - self.beta1**self.t)
            v_hat = self.v[pid] / (1 - self.beta2**self.t)
            # Update
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
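
A scalar walk-through of the same update rule, minimizing the L(w) = (w - 3)^2 loss from earlier (the hyperparameters are the usual defaults except lr, chosen here so convergence is visible quickly):

```python
import numpy as np

w, m, v, t = 10.0, 0.0, 0.0, 0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for _ in range(1000):
    t += 1
    g = 2 * (w - 3)                     # gradient of (w - 3)^2
    m = beta1 * m + (1 - beta1) * g     # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2  # 2nd moment
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # close to the minimum at 3
```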

Part 3: Probability and Statistics — The Language of AI

Why Do You Need It?

The output of AI models is almost always a probability distribution.

# GPT's output = probability distribution of the next token
logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])  # model output (raw)
vocab = ["the", "cat", "sat", "on", "mat"]

# Softmax: logits → probabilities
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # numerical stability
    return exp_x / exp_x.sum()

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"  {word}: {p:.4f}")
# the: 0.2333, cat: 0.0858, sat: 0.0349, on: 0.0116, mat: 0.6343
# → "mat" has the highest probability!

Bayes' Theorem

# P(A|B) = P(B|A) x P(A) / P(B)
# "The probability that the model is correct given the data"

# Example: Spam filter
# P(spam|"free") = P("free"|spam) x P(spam) / P("free")
p_free_given_spam = 0.8    # probability of "free" appearing in spam
p_spam = 0.3               # proportion of all emails that are spam
p_free = 0.35              # probability of "free" appearing in all emails

p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(f"P(spam|'free') = {p_spam_given_free:.2f}")  # 0.69 (69%!)

Probability Distributions

# Gaussian (Normal) Distribution — The core of Diffusion Models!
def gaussian(x, mu=0, sigma=1):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# DDPM (Image Generation):
# Forward:  clean image → add Gaussian noise (gradually destroy)
# Reverse:  noise → remove noise (neural network learns this) → clean image!

# Noise addition process
def add_noise(image, t, noise_schedule):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon"""
    alpha_bar = noise_schedule[t]
    noise = np.random.randn(*image.shape)  # Gaussian noise
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise
    return noisy, noise
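
A toy run of that formula, using a hypothetical linear alpha_bar schedule (real DDPMs derive alpha_bar from a beta schedule; this is only for intuition):

```python
import numpy as np

T = 1000
alpha_bar = np.linspace(0.9999, 0.0001, T)  # ~1: clean, ~0: pure noise

rng = np.random.default_rng(0)
image = np.ones((8, 8))  # stand-in for a real image
noise = rng.standard_normal(image.shape)

for t in [0, 500, 999]:
    noisy = np.sqrt(alpha_bar[t]) * image + np.sqrt(1 - alpha_bar[t]) * noise
    # small t: noisy is almost the image; large t: almost pure noise
    print(t, np.abs(noisy - image).mean(), np.abs(noisy - noise).mean())
```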

Cross-Entropy — The King of Loss Functions

# Measures how different the model's predictions are from the ground truth
def cross_entropy(predictions, targets):
    """H(p, q) = -sum p(x) log q(x)"""
    return -np.sum(targets * np.log(predictions + 1e-9))

# Ground truth: "cat" (one-hot encoding)
target = np.array([0, 1, 0, 0, 0])  # [the, cat, sat, on, mat]

# Good prediction
good_pred = np.array([0.05, 0.85, 0.03, 0.02, 0.05])
print(f"Good: {cross_entropy(good_pred, target):.4f}")  # 0.1625 (low)

# Bad prediction
bad_pred = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
print(f"Bad:  {cross_entropy(bad_pred, target):.4f}")  # 2.3026 (high)
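
One reason this pairing is so common: the gradient of cross-entropy over softmax outputs, taken with respect to the logits, collapses to simply probs - target. A numerical check of that shortcut:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ce_from_logits(logits, target):
    return -np.sum(target * np.log(softmax(logits) + 1e-9))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])
target = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

analytic = softmax(logits) - target  # the famous shortcut

# Finite-difference gradient for comparison
numeric = np.zeros_like(logits)
eps = 1e-6
for i in range(len(logits)):
    up, dn = logits.copy(), logits.copy()
    up[i] += eps
    dn[i] -= eps
    numeric[i] = (ce_from_logits(up, target) - ce_from_logits(dn, target)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny: the two agree
```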

Part 4: Information Theory — The Mathematical Foundation of LLMs

Entropy — A Measure of Uncertainty

def entropy(probs):
    """H(X) = -sum p(x) log2 p(x)"""
    return -np.sum(probs * np.log2(probs + 1e-9))

# Fair coin: H = 1 bit (maximum uncertainty)
fair_coin = np.array([0.5, 0.5])
print(f"Fair coin: {entropy(fair_coin):.2f} bits")  # 1.00

# Biased coin: H is less than 1 bit (predictable)
biased_coin = np.array([0.9, 0.1])
print(f"Biased coin: {entropy(biased_coin):.2f} bits")  # 0.47

# Low entropy in GPT's output → the model is confident
# Raising temperature → entropy increases → more diverse outputs
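
That temperature claim is easy to demonstrate: dividing the logits by T before softmax flattens (high T) or sharpens (low T) the distribution. Toy logits, with the entropy function redefined so the snippet runs on its own:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(probs):
    return -np.sum(probs * np.log2(probs + 1e-9))

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])
for T in [0.5, 1.0, 2.0]:
    probs = softmax(logits / T)
    print(f"T={T}: entropy = {entropy(probs):.3f} bits")
# entropy rises with T: sharp and confident at T=0.5, flatter at T=2.0
```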

KL Divergence — Difference Between Two Distributions

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p(x) log(p(x) / q(x))"""
    return np.sum(p * np.log((p + 1e-9) / (q + 1e-9)))

# In VAE (Variational Autoencoder):
# Minimize KL(q(z|x) || p(z))
# = Make the encoder's output distribution close to standard normal!

# In RLHF:
# Add KL(pi_new || pi_ref) as penalty
# = Prevent the new model from deviating too far from the reference model!
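
Two properties worth seeing in numbers: KL is (near) zero exactly when the distributions match, and it is not symmetric. A standalone check with toy discrete distributions (the function is redefined here so the snippet runs on its own):

```python
import numpy as np

def kl_divergence(p, q):
    return np.sum(p * np.log((p + 1e-9) / (q + 1e-9)))

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.7, 0.2, 0.1])

print(kl_divergence(p, p))  # ~0: identical distributions
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # a different positive value: KL is asymmetric
```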

Math → AI Mapping Summary

| Math Concept | Role in AI | Where It Appears |
| --- | --- | --- |
| Matrix multiply | Layer computation | All neural networks |
| Cosine similarity | Embedding comparison | Search, RAG |
| SVD | Model compression | LoRA, quantization |
| Partial derivatives | Gradient computation | Backpropagation |
| Chain rule | Automatic differentiation | PyTorch autograd |
| Gradient descent | Parameter optimization | Adam, SGD |
| Softmax | Probability transform | Classification, Attention |
| Cross-entropy | Loss function | LLM, classifiers |
| Gaussian distribution | Noise modeling | DDPM, VAE |
| Bayes' theorem | Posterior inference | Bayesian ML |
| KL Divergence | Distribution difference | VAE, RLHF |
| Entropy | Uncertainty measure | Temperature, information |

Study Roadmap

[Week 1] Linear Algebra Fundamentals
Vectors, matrix multiplication, transpose, inverse
Implement from scratch with numpy

[Week 2] Calculus + Optimization
Partial derivatives, chain rule, gradient descent
Understand PyTorch autograd

[Week 3] Probability + Statistics
Conditional probability, Bayes, distributions
Implement softmax, cross-entropy

[Week 4] Information Theory + Practice
Entropy, KL-divergence
Find the math in nanoGPT/DDPM code

Recommended Resources
  • 3Blue1Brown (YouTube) — Intuitive visualizations of linear algebra/calculus
  • Mathematics for Machine Learning (free textbook) — Bridging math to ML
  • Andrej Karpathy's micrograd — Backpropagation from scratch
  • Stanford CS229 — Probability/statistics + ML math
  • Ian Goodfellow's Deep Learning Book — Ch.2–4 (free online)

Quiz — Math for AI (Click to reveal!)

Q1. What role does matrix multiplication play in neural networks? ||It multiplies input vectors by weight matrices to compute the next layer's output. One matrix multiplication = one layer's linear transformation||

Q2. Why is LoRA related to SVD? ||SVD decomposes a large matrix into products of smaller matrices. LoRA approximates the weight change (delta W) as a product of two low-rank matrices (A, B), dramatically reducing parameters||

Q3. What is the mathematical foundation of backpropagation? ||The chain rule. It decomposes the derivative of a composite function into products of derivatives at each stage. Gradients propagate backward from the loss to each parameter||

Q4. What does the Softmax function do, and what is the numerical stability trick? ||Converts a real-valued vector (logits) into a probability distribution (sums to 1, all positive). Trick: subtract the maximum value from inputs to prevent exp overflow||

Q5. Why is cross-entropy a good loss function? ||When the predicted probability for the correct class approaches 1, loss approaches 0; when it approaches 0, loss approaches infinity. Strong penalties for wrong predictions make training efficient||

Q6. Why is the Gaussian distribution used in Diffusion Models? ||In the forward process, Gaussian noise is gradually added to images. Gaussians are mathematically tractable (closed-form solutions) and provide a natural noise model via the central limit theorem||

Q7. What happens to GPT output entropy when temperature is high? ||It increases. Dividing logits by T flattens the softmax distribution (closer to uniform), increasing uncertainty and generating more diverse outputs||

Q8. What is the purpose of the KL Divergence penalty in RLHF? ||It constrains the RL-updated model (pi_new) from straying too far from the original model (pi_ref). Prevents reward hacking and preserves existing capabilities||
