Mathematical Foundations for AI/ML: Complete Guide - Linear Algebra, Calculus, Probability

To truly understand machine learning and deep learning, you need to grasp the underlying mathematics. Many developers use frameworks without fully understanding why the math works the way it does. This guide aims to fill that gap.

Rather than listing formulas, we focus on why each concept matters and how it is used in deep learning, accompanied by Python/NumPy code you can run to verify the math yourself.


1. Linear Algebra

Linear algebra is the fundamental language of deep learning. The forward pass of a neural network is a sequence of matrix multiplications, and embeddings are points in a high-dimensional vector space.

1.1 Vectors

Definition and Geometric Meaning

A vector is a quantity with both magnitude and direction. Mathematically it is an ordered list of numbers; geometrically it is an arrow pointing to a location in n-dimensional space.

\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}

In deep learning, vectors appear everywhere: input data, hidden-layer activations, embedding representations — all are vectors.

Vector Addition and Scalar Multiplication

Addition is performed element-wise:

\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{bmatrix}

Multiplying by a scalar c scales each component:

c \cdot \mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{bmatrix}

Dot Product

The dot product measures the "similarity" between two vectors:

\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i

Geometrically, it equals the product of the magnitudes and the cosine of the angle \theta between them:

\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta

A positive dot product indicates vectors pointing in the same general direction; negative means opposite; zero means orthogonal.

Cosine Similarity

To measure directional similarity independent of magnitude, normalize by both norms:

\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}

The value ranges from -1 to 1. It is widely used to compare word embeddings and document representations in NLP.

Vector Norms

Norms measure the "length" or "size" of a vector:

  • L_1 norm (Manhattan): \|\mathbf{v}\|_1 = \sum_i |v_i|
  • L_2 norm (Euclidean): \|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}
  • L_\infty norm (max norm): \|\mathbf{v}\|_\infty = \max_i |v_i|

In regularization, the L_1 norm encourages sparsity while the L_2 norm keeps weights uniformly small.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Dot product
dot_product = np.dot(a, b)       # 3*1 + 4*2 = 11
print(f"Dot product: {dot_product}")

# L2 norm
l2_norm_a = np.linalg.norm(a)   # sqrt(9+16) = 5
print(f"L2 norm of a: {l2_norm_a}")

# Cosine similarity
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine_sim:.4f}")

# Various norms
v = np.array([3.0, -4.0, 1.0])
print(f"L1 norm: {np.linalg.norm(v, ord=1)}")         # 8.0
print(f"L2 norm: {np.linalg.norm(v, ord=2):.4f}")     # 5.099
print(f"Linf norm: {np.linalg.norm(v, ord=np.inf)}") # 4.0

1.2 Matrices

Matrix Notation

A matrix is a rectangular array of numbers with m rows and n columns (an m \times n matrix):

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}

In deep learning, the weight matrix W represents the linear transformation between layers.

Matrix Multiplication

If A is m \times k and B is k \times n, then C = AB is m \times n:

C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}

The key rule: the number of columns of the left matrix must equal the number of rows of the right matrix. Matrix multiplication is generally not commutative: AB \neq BA.

Transpose

Swap rows and columns: (A^T)_{ij} = A_{ji}.

Property: (AB)^T = B^T A^T

Inverse Matrix

For a square matrix A, its inverse A^{-1} satisfies AA^{-1} = A^{-1}A = I. An inverse exists only when the determinant is non-zero.

Determinant

For a 2 \times 2 matrix:

\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc

The determinant tells you how much the linear transformation stretches or compresses space. \det(A) = 0 means the transformation collapses dimensions, so no inverse exists.

Rank

The rank of a matrix is the maximum number of linearly independent rows (or columns). A lower rank means the transformation produces a lower-dimensional output space. Low-rank approximation is central to model compression techniques like LoRA.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

B = np.array([[9, 8, 7],
              [6, 5, 4],
              [3, 2, 1]])

# Matrix multiplication
C = np.dot(A, B)
print("Matrix product:\n", C)

# Transpose
print("Transpose of A:\n", A.T)

# Determinant
A2 = np.array([[3.0, 1.0],
               [2.0, 4.0]])
print(f"Determinant: {np.linalg.det(A2):.2f}")  # 3*4 - 1*2 = 10

# Inverse
A_inv = np.linalg.inv(A2)
print("Inverse:\n", A_inv)
print("A @ A_inv (should be identity):\n", np.round(A2 @ A_inv))

# Rank
print(f"Rank of A: {np.linalg.matrix_rank(A)}")  # 2 (rows are linearly dependent)

1.3 Eigenvalues and Eigenvectors

Definition

For a square matrix A, a non-zero vector \mathbf{v} and scalar \lambda satisfying:

A\mathbf{v} = \lambda \mathbf{v}

are called an eigenvector and eigenvalue, respectively.

The meaning is powerful: when A transforms \mathbf{v}, the direction stays the same; only the magnitude changes by a factor of \lambda.

Eigendecomposition

For a symmetric matrix, we can decompose using orthogonal eigenvectors:

A = Q \Lambda Q^T

where Q contains the eigenvectors as columns and \Lambda is the diagonal matrix of eigenvalues.

Relationship to PCA

The eigenvectors of the covariance matrix \Sigma = \frac{1}{n} X^T X are the principal components. The eigenvector corresponding to the largest eigenvalue points in the direction of greatest variance, the most informative feature direction.

SVD (Singular Value Decomposition)

While eigendecomposition applies only to square matrices, SVD applies to any matrix:

A = U \Sigma V^T

  • U: m \times m orthogonal matrix (left singular vectors)
  • \Sigma: m \times n diagonal matrix (singular values, non-negative)
  • V^T: n \times n orthogonal matrix (right singular vectors)

SVD underpins matrix approximation, noise reduction, recommendation systems, LSA, and the LoRA technique for parameter-efficient fine-tuning.

import numpy as np

# Eigenvalues and eigenvectors
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):\n", eigenvectors)

# Verify: A*v = lambda*v
v0 = eigenvectors[:, 0]
lam0 = eigenvalues[0]
print("A*v:", A @ v0)
print("lambda*v:", lam0 * v0)

# SVD
M = np.array([[1, 2, 3],
              [4, 5, 6]])

U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"\nU shape: {U.shape}")
print(f"Singular values: {S}")
print(f"Vt shape: {Vt.shape}")

# Rank-1 approximation
rank1_approx = np.outer(S[0] * U[:, 0], Vt[0, :])
print("Original matrix:\n", M)
print("Rank-1 approximation:\n", rank1_approx.round(2))

# PCA example
np.random.seed(42)
data = np.random.randn(100, 2)
data[:, 1] = data[:, 0] * 0.8 + data[:, 1] * 0.2

cov = np.cov(data.T)
eigenvals, eigenvecs = np.linalg.eig(cov)
idx = np.argsort(eigenvals)[::-1]
print("\nFirst principal component:", eigenvecs[:, idx[0]])
print("Explained variance ratio:", eigenvals[idx[0]] / eigenvals.sum())

1.4 Linear Algebra in Deep Learning

Layers as Linear Transformations

A fully connected layer is fundamentally a linear transformation:

\mathbf{y} = W\mathbf{x} + \mathbf{b}

where \mathbf{x} \in \mathbb{R}^n, W \in \mathbb{R}^{m \times n}, \mathbf{b} \in \mathbb{R}^m, and \mathbf{y} \in \mathbb{R}^m.

Without non-linear activation functions, stacking multiple layers reduces to a single linear transformation, which is why activations like ReLU and Sigmoid are essential.

Batch Processing

For a batch of B samples processed simultaneously:

Y = XW^T + \mathbf{b}

where X \in \mathbb{R}^{B \times n} and Y \in \mathbb{R}^{B \times m}. GPUs are optimized for exactly this type of large-scale matrix multiplication.
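As a quick sanity check on the shapes above, here is a minimal NumPy sketch of a batched linear layer; the dimensions B, n, and m are illustrative:

```python
import numpy as np

B, n, m = 32, 8, 4            # batch size, input dim, output dim
X = np.random.randn(B, n)     # a batch of B input vectors
W = np.random.randn(m, n)     # weight matrix of a fully connected layer
b = np.random.randn(m)        # bias vector

# One matrix multiplication processes the whole batch at once;
# the bias broadcasts across the batch dimension.
Y = X @ W.T + b
print(Y.shape)  # (32, 4)

# Equivalent to applying y = Wx + b to each sample individually
y0 = W @ X[0] + b
assert np.allclose(Y[0], y0)
```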


2. Calculus

Calculus governs how deep learning models learn. Gradient descent — the engine of learning — relies entirely on differentiation.

2.1 Differentiation

Limit Definition of Derivative

f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

This is the instantaneous rate of change: the slope of the tangent line at the point x.

Basic Differentiation Rules

  • Power rule: (x^n)' = nx^{n-1}
  • Sum rule: (f+g)' = f' + g'
  • Product rule: (fg)' = f'g + fg'
  • Quotient rule: (f/g)' = (f'g - fg') / g^2
  • Chain rule: [f(g(x))]' = f'(g(x)) \cdot g'(x)

Derivatives of common deep learning functions:

  • \frac{d}{dx}(e^x) = e^x
  • \frac{d}{dx}(\ln x) = \frac{1}{x}
  • \frac{d}{dx}(\sigma(x)) = \sigma(x)(1 - \sigma(x)) (Sigmoid)
  • \frac{d}{dx}(\tanh(x)) = 1 - \tanh^2(x)
  • \frac{d}{dx}(\text{ReLU}(x)) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}

Partial Derivatives

For a multi-variable function, differentiate with respect to one variable while treating all others as constants:

\frac{\partial f}{\partial x_i}

For f(x, y) = x^2 + 3xy + y^2:

\frac{\partial f}{\partial x} = 2x + 3y, \quad \frac{\partial f}{\partial y} = 3x + 2y

Gradient

The gradient collects all partial derivatives into a vector:

\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

The gradient points in the direction of steepest ascent. To minimize a loss function, move in the opposite direction (gradient descent).

Jacobian Matrix

For a vector-valued function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, the Jacobian collects all first-order partial derivatives:

J_{ij} = \frac{\partial f_i}{\partial x_j}

Hessian Matrix

The matrix of second-order partial derivatives for a scalar function:

H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

Eigenvalues of the Hessian indicate curvature: all positive means local minimum, all negative means local maximum, mixed signs indicate a saddle point.

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Compute gradient numerically via central differences"""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# f(x, y) = x^2 + 2y^2
def f(x):
    return x[0]**2 + 2 * x[1]**2

x0 = np.array([3.0, 4.0])
grad = numerical_gradient(f, x0)
print(f"Numerical gradient: {grad}")          # [6.0, 16.0]
print(f"Analytic gradient: [{2*x0[0]}, {4*x0[1]}]")

# Automatic differentiation with PyTorch
try:
    import torch
    x = torch.tensor([3.0, 4.0], requires_grad=True)
    y = x[0]**2 + 2 * x[1]**2
    y.backward()
    print(f"PyTorch gradient: {x.grad}")
except ImportError:
    print("PyTorch not installed")
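The Hessian discussed above can be checked numerically the same way. This sketch applies central differences to f(x, y) = x^2 + 2y^2, whose true Hessian is the constant matrix diag(2, 4); both eigenvalues are positive, confirming a minimum:

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the Hessian of a scalar function via central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += h; x_pp[j] += h
            x_pm = x.copy(); x_pm[i] += h; x_pm[j] -= h
            x_mp = x.copy(); x_mp[i] -= h; x_mp[j] += h
            x_mm = x.copy(); x_mm[i] -= h; x_mm[j] -= h
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h**2)
    return H

f = lambda x: x[0]**2 + 2 * x[1]**2
H = numerical_hessian(f, np.array([3.0, 4.0]))
print(H.round(3))  # approximately [[2, 0], [0, 4]]
```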

2.2 Chain Rule and Backpropagation

Chain Rule

For composite functions, differentiation follows:

\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}

Generalized to multiple variables:

\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}

Backpropagation

Training a neural network requires computing \frac{\partial L}{\partial w} for every weight w. Since a network is a composition of many functions, we repeatedly apply the chain rule, working from the output layer back to the input layer.

For a 2-layer network:

L = \ell\bigl(\text{softmax}(W_2 \cdot \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2), \mathbf{y}\bigr)

Computing \frac{\partial L}{\partial W_1} requires chaining gradients through each operation:

\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{a}_2} \cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial W_1}

import numpy as np

class SimpleNet:
    def __init__(self):
        self.W1 = np.random.randn(4, 3) * 0.1
        self.b1 = np.zeros(4)
        self.W2 = np.random.randn(2, 4) * 0.1
        self.b2 = np.zeros(2)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(float)

    def forward(self, x):
        self.x = x
        self.z1 = self.W1 @ x + self.b1
        self.h = self.relu(self.z1)
        self.z2 = self.W2 @ self.h + self.b2
        return self.z2

    def backward(self, dL_dz2):
        # Layer 2 gradients
        dL_dW2 = np.outer(dL_dz2, self.h)
        dL_db2 = dL_dz2
        dL_dh = self.W2.T @ dL_dz2

        # ReLU gradient
        dL_dz1 = dL_dh * self.relu_grad(self.z1)

        # Layer 1 gradients
        dL_dW1 = np.outer(dL_dz1, self.x)
        dL_db1 = dL_dz1

        return dL_dW1, dL_db1, dL_dW2, dL_db2

net = SimpleNet()
x = np.array([1.0, 2.0, 3.0])
output = net.forward(x)
dL_dout = np.array([1.0, -1.0])
grads = net.backward(dL_dout)
print(f"W1 gradient shape: {grads[0].shape}")
print(f"W2 gradient shape: {grads[2].shape}")

2.3 Optimization Theory

Optimality Conditions

  • First-order (necessary): at an extremum, \nabla f = \mathbf{0}
  • Second-order (sufficient): the eigenvalues of the Hessian determine whether it is a minimum or a maximum

Convex Functions

A function is convex if:

f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \lambda \in [0,1]

For convex functions, any local minimum is a global minimum. Deep learning loss functions are generally non-convex, so global optimality cannot be guaranteed.

Gradient Descent

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)

where \alpha is the learning rate. Too large and the updates diverge; too small and convergence is slow.

Common variants:

  • SGD (Stochastic Gradient Descent): uses a random mini-batch each step
  • Momentum: accumulates past gradient directions for acceleration
  • Adam: adaptive learning rates per parameter

Adam update rule:

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

import numpy as np

def gradient_descent(grad_f, init_x, lr=0.1, steps=100):
    x = init_x.copy()
    history = [x.copy()]
    for _ in range(steps):
        grad = grad_f(x)
        x -= lr * grad
        history.append(x.copy())
    return x, np.array(history)

def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2*x[0], 8*x[1]])

init_x = np.array([3.0, 3.0])
final_x, history = gradient_descent(grad_f, init_x, lr=0.1, steps=50)
print(f"Start: {init_x}, Converged to: {final_x.round(6)}")

def adam_optimizer(grad_f, init_x, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    x = init_x.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

final_adam = adam_optimizer(grad_f, init_x)
print(f"Adam converged to: {final_adam.round(6)}")

3. Probability and Statistics

Probability and statistics provide the language for handling uncertainty in deep learning. Loss function design, regularization, and model evaluation all rely on probabilistic thinking.

3.1 Probability Basics

Kolmogorov Axioms

  1. P(A) \geq 0 (non-negativity)
  2. P(\Omega) = 1 (total probability equals 1)
  3. For mutually exclusive events: P(A \cup B) = P(A) + P(B) (additivity)

Conditional Probability

The probability of A given that B has occurred:

P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0

Bayes' Theorem

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

Bayes' theorem formalizes "updating beliefs in the light of evidence":

  • P(A): the prior, your belief before seeing evidence
  • P(B|A): the likelihood, the probability of the evidence under hypothesis A
  • P(A|B): the posterior, your updated belief after seeing evidence

This underlies Bayesian classifiers, Bayesian neural networks, and probabilistic model evaluation.
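A classic worked example makes the update concrete. The numbers below are illustrative (not from the text): a disease test with 99% sensitivity and a 5% false-positive rate, in a population where 1% actually has the disease.

```python
# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
# All probabilities below are illustrative assumptions.
p_disease = 0.01             # prior P(A)
p_pos_given_disease = 0.99   # likelihood P(B|A), the test's sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of a positive result, P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B): despite the accurate test, the low prior dominates
posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {posterior:.4f}")  # ~0.1667
```

Even with a highly accurate test, a positive result implies only about a 17% chance of disease, because the prior is so small; this is exactly the belief-updating role of the prior in Bayes' theorem.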

3.2 Probability Distributions

Discrete Distributions

Bernoulli Distribution

Single trial with outcome 0 or 1:

P(X=k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}

E[X] = p, \quad Var(X) = p(1-p)

Binomial Distribution

Number of successes in n Bernoulli trials:

P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}

Poisson Distribution

Event count per unit of time or space:

P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}

E[X] = Var(X) = \lambda

Continuous Distributions

Uniform Distribution

Equal probability density over [a, b]:

f(x) = \frac{1}{b-a}, \quad a \leq x \leq b

Normal (Gaussian) Distribution

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Characterized by mean \mu and standard deviation \sigma. By the Central Limit Theorem, many natural phenomena follow this distribution. Used extensively in weight initialization and noise modeling.

Multivariate Normal Distribution

f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)

where \boldsymbol{\mu} is the mean vector and \Sigma is the covariance matrix. Central to VAE latent-space modeling.

import numpy as np
from scipy import stats

# Normal distribution
mu, sigma = 0, 1
print(f"P(-1 < X < 1) = {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f}")  # 68.27%
print(f"P(-2 < X < 2) = {stats.norm.cdf(2) - stats.norm.cdf(-2):.4f}")  # 95.45%
print(f"P(-3 < X < 3) = {stats.norm.cdf(3) - stats.norm.cdf(-3):.4f}")  # 99.73%

# Multivariate normal sampling
mean = np.array([0, 0])
cov = np.array([[1, 0.7],
                [0.7, 1]])
samples = np.random.multivariate_normal(mean, cov, size=1000)
print(f"\nSample shape: {samples.shape}")
print(f"Sample mean: {samples.mean(axis=0).round(3)}")
print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

# Binomial distribution
n, p = 10, 0.5
print(f"\nBinomial B(10, 0.5): P(X=5) = {stats.binom.pmf(5, n, p):.4f}")

3.3 Expectation and Variance

Expected Value

E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}

E[X] = \int_{-\infty}^{\infty} x f(x)\,dx \quad \text{(continuous)}

Linearity of expectation: E[aX + b] = aE[X] + b

Variance

Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2

Standard deviation: \sigma = \sqrt{Var(X)}

Covariance

Measures the linear relationship between two random variables:

Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]

Correlation

\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}

It ranges over [-1, 1] and measures the strength of a linear relationship independent of scale.

Covariance Matrix

For an n-dimensional random vector:

\Sigma_{ij} = Cov(X_i, X_j)

The covariance matrix encodes the variance structure of the data and is the key input to PCA.
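These quantities are one-liners in NumPy. A small sketch with synthetic correlated data (the coefficients 0.6 and 0.4 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.6 * x + 0.4 * rng.normal(size=1000)  # y is positively correlated with x

# Covariance matrix: Sigma[i, j] = Cov(X_i, X_j)
cov = np.cov(x, y)
print("Covariance matrix:\n", cov.round(3))

# Correlation: covariance normalized by both standard deviations
rho = cov[0, 1] / (np.sqrt(cov[0, 0]) * np.sqrt(cov[1, 1]))
print(f"Correlation: {rho:.3f}")
print(f"np.corrcoef agrees: {np.corrcoef(x, y)[0, 1]:.3f}")
```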

3.4 Maximum Likelihood Estimation (MLE)

Likelihood Function

The probability of observing data \mathcal{D} = \{x_1, \ldots, x_n\} given parameters \theta:

L(\theta; \mathcal{D}) = P(\mathcal{D}|\theta) = \prod_{i=1}^n P(x_i|\theta)

MLE finds the \theta that maximizes this likelihood:

\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; \mathcal{D})

Log-Likelihood

Taking the log converts the product into a sum, simplifying computation:

\log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)

Since log is monotonically increasing, the maximizer of the log-likelihood equals the maximizer of the likelihood.

MLE for the Normal Distribution

Assuming the data follow \mathcal{N}(\mu, \sigma^2):

\log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2

Setting the derivative with respect to \mu to zero gives \hat{\mu}_{MLE} = \bar{x} (the sample mean).

For \sigma^2: \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2

Cross-Entropy and MLE

Minimizing the cross-entropy loss in classification is equivalent to MLE:

\mathcal{L}_{CE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^C y_{ic} \log \hat{p}_{ic}

This maximizes the log-likelihood of the model under the true data distribution.

import numpy as np
from scipy.optimize import minimize_scalar

# MLE for Bernoulli distribution
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n = len(data)
k = data.sum()

p_mle = k / n
print(f"Analytic MLE: p = {p_mle:.2f}")

def neg_log_likelihood(p):
    if p <= 0 or p >= 1:
        return np.inf
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method='bounded')
print(f"Numerical MLE: p = {result.x:.4f}")

# Cross-entropy loss
def cross_entropy(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
print(f"\nCross-Entropy loss: {cross_entropy(y_true, y_pred):.4f}")

3.5 Information Theory

Entropy

Measures the uncertainty of a probability distribution P:

H(X) = -\sum_{x} P(x) \log P(x)

It is maximal for a uniform distribution and zero when a single outcome is certain.

KL Divergence

An (asymmetric) measure of the "distance" between distributions P and Q:

D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

D_{KL}(P \| Q) \geq 0, with equality iff P = Q. Used in VAE loss functions, PPO policy optimization, and knowledge distillation.

Cross-Entropy

H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)

Since H(P) is constant with respect to the model, minimizing cross-entropy is equivalent to minimizing the KL divergence between the true and model distributions.
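This decomposition of cross-entropy is easy to verify numerically. A minimal sketch using natural-log entropies and two illustrative distributions:

```python
import numpy as np

P = np.array([0.4, 0.3, 0.2, 0.1])      # true distribution
Q = np.array([0.25, 0.25, 0.25, 0.25])  # model distribution

cross_entropy = -np.sum(P * np.log(Q))
entropy_P = -np.sum(P * np.log(P))
kl = np.sum(P * np.log(P / Q))

# Check the identity H(P, Q) = H(P) + D_KL(P || Q)
print(f"H(P,Q)      = {cross_entropy:.4f}")
print(f"H(P) + KL   = {entropy_P + kl:.4f}")
assert np.isclose(cross_entropy, entropy_P + kl)
```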

Mutual Information

The information shared between two random variables:

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

It is the theoretical foundation for feature selection and representation learning.

import numpy as np

def entropy(p):
    p = np.array(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q, eps=1e-15):
    p = np.array(p) + eps
    q = np.array(q) + eps
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log(p / q))

# Uniform distribution (maximum entropy)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform distribution entropy: {entropy(uniform):.4f} bits")   # 2.0

# Concentrated distribution (low entropy)
concentrated = [0.97, 0.01, 0.01, 0.01]
print(f"Concentrated distribution entropy: {entropy(concentrated):.4f} bits")

# KL Divergence
true_dist = [0.4, 0.3, 0.2, 0.1]
model_dist = [0.35, 0.35, 0.15, 0.15]
kl = kl_divergence(true_dist, model_dist)
print(f"\nKL Divergence D_KL(P||Q): {kl:.4f}")

4. Numerical Methods

Understanding the gap between mathematical theory and actual computer arithmetic is important for reliable deep learning implementations.

4.1 Floating-Point Representation

FP32 (Single Precision)

  • 32 bits: 1 sign, 8 exponent, 23 mantissa
  • Precision: ~7 significant decimal digits
  • Range: approximately \pm 3.4 \times 10^{38}

FP16 (Half Precision)

  • 16 bits: 1 sign, 5 exponent, 10 mantissa
  • Saves GPU memory, faster computation (Mixed Precision Training)
  • Risk of overflow/underflow with large or small values

BF16 (Brain Float 16)

  • 16 bits: 1 sign, 8 exponent, 7 mantissa
  • Same exponent range as FP32 (lower overflow risk)
  • Introduced by Google for TPUs, now widely used for AI training
import numpy as np

# Floating-point precision issues
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 (float32) = {a + b}")  # may not be exactly 0.3

# FP16 overflow
print(f"FP16 large value: {np.float16(65000)}")
print(f"FP16 overflow: {np.float16(70000)}")  # inf

# Compare dtype ranges
for dtype in [np.float16, np.float32, np.float64]:
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: max={info.max:.3e}, min_positive={info.tiny:.3e}")

4.2 Numerical Stability

The Instability of Naive Softmax

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

Large inputs cause e^{x_i} to overflow (inf). The fix: subtract the maximum value before exponentiating.

\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}

This is mathematically equivalent but numerically stable.

import numpy as np

def naive_softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

def stable_softmax(x):
    x_shifted = x - np.max(x)
    return np.exp(x_shifted) / np.sum(np.exp(x_shifted))

# Normal input
x_normal = np.array([1.0, 2.0, 3.0])
print("Normal input:")
print(f"  Naive:  {naive_softmax(x_normal)}")
print(f"  Stable: {stable_softmax(x_normal)}")

# Large input
x_large = np.array([1000.0, 2000.0, 3000.0])
print("\nLarge input:")
naive_result = naive_softmax(x_large)
print(f"  Naive:  {naive_result}")   # [nan, nan, nan] due to overflow
print(f"  Stable: {stable_softmax(x_large)}")   # [0, 0, 1]

# Log-Softmax (used with NLLLoss)
def log_softmax(x):
    x_shifted = x - np.max(x)
    return x_shifted - np.log(np.sum(np.exp(x_shifted)))

logits = np.array([2.0, 1.0, 0.1])
print(f"\nLog-Softmax: {log_softmax(logits)}")

5. Mathematical Analysis of Linear Regression

Linear regression is the simplest building block of deep learning. A thorough mathematical understanding creates a foundation for understanding more complex models.

5.1 Ordinary Least Squares

For data \{(\mathbf{x}_i, y_i)\}_{i=1}^n, we fit a linear model \hat{y} = \mathbf{w}^T \mathbf{x} + b.

Loss function (MSE):

L(\mathbf{w}) = \frac{1}{n} \|y - X\mathbf{w}\|^2

Normal Equation

Differentiating with respect to \mathbf{w} and setting the result to zero:

X^T X \mathbf{w} = X^T y

\hat{\mathbf{w}} = (X^T X)^{-1} X^T y

This closed-form solution requires X^T X to be invertible (the features must be linearly independent).

5.2 Ridge and Lasso Regression

Ridge Regression (L2 regularization)

L_{ridge}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2

Normal equation: \hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T y

Adding \lambda I guarantees invertibility and shrinks the weights toward zero, preventing overfitting.

Lasso Regression (L1 regularization)

L_{lasso}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1

L1 regularization induces sparse solutions, where some weights become exactly zero, providing automatic feature selection.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.0, 0.0])
y = X @ true_w + np.random.randn(n) * 0.5

# Normal equation (OLS)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS coefficients:", w_ols.round(2))

# Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_.round(2))

# Lasso (sparse solution)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(2))
print("Number of zero coefficients:", (np.abs(lasso.coef_) < 0.01).sum())

# Ridge via normal equation
def ridge_normal_equation(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ridge = ridge_normal_equation(X, y, lam=1.0)
print("\nRidge (normal equation):", w_ridge.round(2))

6. Integration: The Mathematics of Deep Learning

6.1 A Mathematical View of Neural Networks

A fully connected neural network with L layers:

\mathbf{h}^{(0)} = \mathbf{x}

\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}

\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})

\hat{y} = \mathbf{h}^{(L)}

Forward pass: linear algebra (matrix multiplication) + activation functions

Loss computation: information theory (cross-entropy) or statistics (MSE)

Backward pass: calculus (chain rule) to compute gradients

Parameter update: optimization theory (gradient descent, Adam)
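The four stages above can be put together in a minimal sketch of one training loop for a single linear layer under MSE loss. The data, shapes, and learning rate are illustrative; the point is seeing each piece of mathematics appear in turn:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(16, 3))                        # batch of 16 inputs
true_W = np.array([[1.0, -2.0, 0.5],
                   [0.3, 0.8, -1.0]])               # assumed ground truth
y = X @ true_W.T + 0.01 * rng.normal(size=(16, 2))  # targets with small noise

W = rng.normal(size=(2, 3)) * 0.1
b = np.zeros(2)
lr = 0.1

for step in range(200):
    # Forward pass: linear algebra (matrix multiplication)
    y_hat = X @ W.T + b
    # Loss computation: statistics (MSE)
    loss = np.mean((y_hat - y) ** 2)
    # Backward pass: calculus (chain rule)
    dL_dyhat = 2 * (y_hat - y) / y.size
    dL_dW = dL_dyhat.T @ X
    dL_db = dL_dyhat.sum(axis=0)
    # Parameter update: optimization theory (gradient descent)
    W -= lr * dL_dW
    b -= lr * dL_db

print(f"Final MSE: {loss:.6f}")
```

After a couple hundred steps the recovered W should be close to the assumed `true_W`, with the residual loss near the injected noise level.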

6.2 The Attention Mechanism and Matrix Operations

The Transformer's scaled dot-product attention:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

  • Q, K, V: Query, Key, Value matrices (linear algebra)
  • \frac{1}{\sqrt{d_k}} scaling: numerical stability (numerical analysis)
  • Softmax: converts scores into a probability distribution (probability theory)
  • Dot product QK^T: similarity measurement (vector dot product)

import numpy as np

def stable_softmax_2d(x):
    x_max = x.max(axis=-1, keepdims=True)
    x_shifted = x - x_max
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    weights = stable_softmax_2d(scores)
    output = weights @ V
    return output, weights

# 3 tokens, d_k=4
seq_len, d_k = 3, 4
np.random.seed(42)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention output shape: {output.shape}")
print(f"Attention weights (rows sum to 1):\n{weights.round(3)}")
print(f"Row sums: {weights.sum(axis=-1)}")

Conclusion: Next Steps

This guide covered the core mathematical foundations of AI/ML:

  1. Linear Algebra: the language for representing and transforming data and model parameters
  2. Calculus: the tool explaining how models learn
  3. Probability & Statistics: handles uncertainty and provides the theoretical basis for loss functions
  4. Numerical Methods: knowledge for stable computer implementations of mathematical theory

Recommended next steps:

  • Gilbert Strang's Linear Algebra — free on MIT OCW
  • Pattern Recognition and Machine Learning (Bishop) — probabilistic perspective on ML
  • Deep Learning (Goodfellow et al.) — the theoretical bible of deep learning
  • Practice: implement a neural network from scratch using only NumPy

Mathematics is a tool for understanding. You do not need to memorize every formula, but grasping why each concept exists and how it operates within deep learning gives you significantly more powerful practical skills.


References

  • Gilbert Strang, Introduction to Linear Algebra, 6th Edition
  • Ian Goodfellow et al., Deep Learning, MIT Press (2016)
  • Christopher Bishop, Pattern Recognition and Machine Learning, Springer (2006)
  • 3Blue1Brown — Essence of Linear Algebra, Essence of Calculus (YouTube)
  • fast.ai — Practical Deep Learning for Coders