Mathematical Foundations for AI/ML: Complete Guide - Linear Algebra, Calculus, Probability

To truly understand machine learning and deep learning, you need to grasp the underlying mathematics. Many developers use frameworks without fully understanding why the math works the way it does. This guide aims to fill that gap.

Rather than listing formulas, we focus on why each concept matters and how it is used in deep learning, accompanied by Python/NumPy code you can run to verify the math yourself.


1. Linear Algebra

Linear algebra is the fundamental language of deep learning. The forward pass of a neural network is a sequence of matrix multiplications, and embeddings are points in a high-dimensional vector space.

1.1 Vectors

Definition and Geometric Meaning

A vector is a quantity with both magnitude and direction. Mathematically it is an ordered list of numbers; geometrically it is an arrow pointing to a location in n-dimensional space.

\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}

In deep learning, vectors appear everywhere: input data, hidden-layer activations, embedding representations — all are vectors.

Vector Addition and Scalar Multiplication

Addition is performed element-wise:

\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{bmatrix}

Multiplying by a scalar c scales each component:

c \cdot \mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{bmatrix}

Dot Product

The dot product measures the "similarity" between two vectors:

\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i

Geometrically, it equals the product of the magnitudes and the cosine of the angle \theta between them:

\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta

A positive dot product indicates vectors pointing in the same general direction; negative means opposite; zero means orthogonal.

Cosine Similarity

To measure directional similarity independent of magnitude, normalize by both norms:

\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}

The value ranges from -1 to 1. It is widely used to compare word embeddings and document representations in NLP.

Vector Norms

Norms measure the "length" or "size" of a vector:

  • L_1 norm (Manhattan): \|\mathbf{v}\|_1 = \sum_i |v_i|
  • L_2 norm (Euclidean): \|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}
  • L_\infty norm (max norm): \|\mathbf{v}\|_\infty = \max_i |v_i|

In regularization, the L_1 norm encourages sparsity while the L_2 norm keeps weights uniformly small.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Dot product
dot_product = np.dot(a, b)       # 3*1 + 4*2 = 11
print(f"Dot product: {dot_product}")

# L2 norm
l2_norm_a = np.linalg.norm(a)   # sqrt(9+16) = 5
print(f"L2 norm of a: {l2_norm_a}")

# Cosine similarity
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine_sim:.4f}")

# Various norms
v = np.array([3.0, -4.0, 1.0])
print(f"L1 norm: {np.linalg.norm(v, ord=1)}")         # 8.0
print(f"L2 norm: {np.linalg.norm(v, ord=2):.4f}")     # 5.099
print(f"Linf norm: {np.linalg.norm(v, ord=np.inf)}") # 4.0

1.2 Matrices

Matrix Notation

A matrix is a rectangular array of numbers with m rows and n columns (an m \times n matrix):

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}

In deep learning, the weight matrix W represents the linear transformation between layers.

Matrix Multiplication

If A is m \times k and B is k \times n, then C = AB is m \times n:

C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}

The key rule: the number of columns of the left matrix must equal the number of rows of the right matrix. Matrix multiplication is generally not commutative: AB \neq BA.

Transpose

Swap rows and columns: (A^T)_{ij} = A_{ji}.

Property: (AB)^T = B^T A^T

Inverse Matrix

For a square matrix A, its inverse A^{-1} satisfies AA^{-1} = A^{-1}A = I. An inverse exists only when the determinant is non-zero.

Determinant

For a 2 \times 2 matrix:

\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc

The determinant tells you how much the linear transformation stretches or compresses space. \det(A) = 0 means the transformation collapses dimensions, so no inverse exists.

Rank

The rank of a matrix is the maximum number of linearly independent rows (or columns). A lower rank means the transformation produces a lower-dimensional output space. Low-rank approximation is central to model compression techniques like LoRA.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

B = np.array([[9, 8, 7],
              [6, 5, 4],
              [3, 2, 1]])

# Matrix multiplication
C = np.dot(A, B)
print("Matrix product:\n", C)

# Transpose
print("Transpose of A:\n", A.T)

# Determinant
A2 = np.array([[3.0, 1.0],
               [2.0, 4.0]])
print(f"Determinant: {np.linalg.det(A2):.2f}")  # 3*4 - 1*2 = 10

# Inverse
A_inv = np.linalg.inv(A2)
print("Inverse:\n", A_inv)
print("A @ A_inv (should be identity):\n", np.round(A2 @ A_inv))

# Rank
print(f"Rank of A: {np.linalg.matrix_rank(A)}")  # 2 (rows are linearly dependent)

1.3 Eigenvalues and Eigenvectors

Definition

For a square matrix A, a non-zero vector \mathbf{v} and scalar \lambda satisfying:

A\mathbf{v} = \lambda \mathbf{v}

are called an eigenvector and eigenvalue, respectively.

The meaning is powerful: when A transforms \mathbf{v}, the direction stays the same; only the magnitude changes by a factor of \lambda.

Eigendecomposition

For a symmetric matrix, we can decompose using orthogonal eigenvectors:

A = Q \Lambda Q^T

where Q contains the eigenvectors as columns and \Lambda is the diagonal matrix of eigenvalues.

Relationship to PCA

The eigenvectors of the covariance matrix \Sigma = \frac{1}{n} X^T X are the principal components. The eigenvector corresponding to the largest eigenvalue points in the direction of greatest variance, the most informative feature direction.

SVD (Singular Value Decomposition)

While eigendecomposition applies only to square matrices, SVD applies to any matrix:

A = U \Sigma V^T

  • U: m \times m orthogonal matrix (left singular vectors)
  • \Sigma: m \times n diagonal matrix (singular values, non-negative)
  • V^T: n \times n orthogonal matrix (right singular vectors)

SVD underpins matrix approximation, noise reduction, recommendation systems, LSA, and the LoRA technique for parameter-efficient fine-tuning.

import numpy as np

# Eigenvalues and eigenvectors
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):\n", eigenvectors)

# Verify: A*v = lambda*v
v0 = eigenvectors[:, 0]
lam0 = eigenvalues[0]
print("A*v:", A @ v0)
print("lambda*v:", lam0 * v0)

# SVD
M = np.array([[1, 2, 3],
              [4, 5, 6]])

U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"\nU shape: {U.shape}")
print(f"Singular values: {S}")
print(f"Vt shape: {Vt.shape}")

# Rank-1 approximation
rank1_approx = np.outer(S[0] * U[:, 0], Vt[0, :])
print("Original matrix:\n", M)
print("Rank-1 approximation:\n", rank1_approx.round(2))

# PCA example
np.random.seed(42)
data = np.random.randn(100, 2)
data[:, 1] = data[:, 0] * 0.8 + data[:, 1] * 0.2

cov = np.cov(data.T)
eigenvals, eigenvecs = np.linalg.eig(cov)
idx = np.argsort(eigenvals)[::-1]
print("\nFirst principal component:", eigenvecs[:, idx[0]])
print("Explained variance ratio:", eigenvals[idx[0]] / eigenvals.sum())

1.4 Linear Algebra in Deep Learning

Layers as Linear Transformations

A fully connected layer is fundamentally a linear transformation:

\mathbf{y} = W\mathbf{x} + \mathbf{b}

where \mathbf{x} \in \mathbb{R}^n, W \in \mathbb{R}^{m \times n}, \mathbf{b} \in \mathbb{R}^m, and \mathbf{y} \in \mathbb{R}^m.

Without non-linear activation functions, stacking multiple layers reduces to a single linear transformation, which is why activations like ReLU and Sigmoid are essential.

Batch Processing

For a batch of B samples processed simultaneously:

Y = XW^T + \mathbf{b}

where X \in \mathbb{R}^{B \times n} and Y \in \mathbb{R}^{B \times m}. GPUs are optimized for exactly this type of large-scale matrix multiplication.
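As a quick sanity check on the shapes above, here is a minimal NumPy sketch of a batched linear layer; the dimensions B, n, and m are illustrative:

```python
import numpy as np

B, n, m = 32, 8, 4            # batch size, input dim, output dim
X = np.random.randn(B, n)     # a batch of B input vectors
W = np.random.randn(m, n)     # weight matrix of a fully connected layer
b = np.random.randn(m)        # bias vector

# One matrix multiplication processes the whole batch at once;
# the bias broadcasts across the batch dimension.
Y = X @ W.T + b
print(Y.shape)  # (32, 4)

# Equivalent to applying y = Wx + b to each sample individually
y0 = W @ X[0] + b
assert np.allclose(Y[0], y0)
```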


2. Calculus

Calculus governs how deep learning models learn. Gradient descent — the engine of learning — relies entirely on differentiation.

2.1 Differentiation

Limit Definition of Derivative

f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

This is the instantaneous rate of change: the slope of the tangent line at the point x.

Basic Differentiation Rules

  • Power rule: (x^n)' = nx^{n-1}
  • Sum rule: (f+g)' = f' + g'
  • Product rule: (fg)' = f'g + fg'
  • Quotient rule: (f/g)' = (f'g - fg') / g^2
  • Chain rule: [f(g(x))]' = f'(g(x)) \cdot g'(x)

Derivatives of common deep learning functions:

  • \frac{d}{dx}(e^x) = e^x
  • \frac{d}{dx}(\ln x) = \frac{1}{x}
  • \frac{d}{dx}(\sigma(x)) = \sigma(x)(1 - \sigma(x)) (Sigmoid)
  • \frac{d}{dx}(\tanh(x)) = 1 - \tanh^2(x)
  • \frac{d}{dx}(\text{ReLU}(x)) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}

Partial Derivatives

For a multi-variable function, differentiate with respect to one variable while treating all others as constants:

\frac{\partial f}{\partial x_i}

For f(x, y) = x^2 + 3xy + y^2:

\frac{\partial f}{\partial x} = 2x + 3y, \quad \frac{\partial f}{\partial y} = 3x + 2y

Gradient

The gradient collects all partial derivatives into a vector:

\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

The gradient points in the direction of steepest ascent. To minimize a loss function, move in the opposite direction (gradient descent).

Jacobian Matrix

For a vector-valued function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, the Jacobian collects all first-order partial derivatives:

J_{ij} = \frac{\partial f_i}{\partial x_j}

Hessian Matrix

The matrix of second-order partial derivatives for a scalar function:

H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

Eigenvalues of the Hessian indicate curvature: all positive means local minimum, all negative means local maximum, mixed signs indicate a saddle point.

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Compute gradient numerically via central differences"""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# f(x, y) = x^2 + 2y^2
def f(x):
    return x[0]**2 + 2 * x[1]**2

x0 = np.array([3.0, 4.0])
grad = numerical_gradient(f, x0)
print(f"Numerical gradient: {grad}")          # [6.0, 16.0]
print(f"Analytic gradient: [{2*x0[0]}, {4*x0[1]}]")

# Automatic differentiation with PyTorch
try:
    import torch
    x = torch.tensor([3.0, 4.0], requires_grad=True)
    y = x[0]**2 + 2 * x[1]**2
    y.backward()
    print(f"PyTorch gradient: {x.grad}")
except ImportError:
    print("PyTorch not installed")
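The Hessian discussed above can be checked numerically the same way. This sketch applies central differences to f(x, y) = x^2 + 2y^2, whose true Hessian is the constant matrix diag(2, 4); both eigenvalues are positive, confirming a minimum:

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the Hessian of a scalar function via central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += h; x_pp[j] += h
            x_pm = x.copy(); x_pm[i] += h; x_pm[j] -= h
            x_mp = x.copy(); x_mp[i] -= h; x_mp[j] += h
            x_mm = x.copy(); x_mm[i] -= h; x_mm[j] -= h
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h**2)
    return H

f = lambda x: x[0]**2 + 2 * x[1]**2
H = numerical_hessian(f, np.array([3.0, 4.0]))
print(H.round(3))  # approximately [[2, 0], [0, 4]]
```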

2.2 Chain Rule and Backpropagation

Chain Rule

For composite functions, differentiation follows:

\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}

Generalized to multiple variables:

\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}

Backpropagation

Training a neural network requires computing \frac{\partial L}{\partial w} for every weight w. Since a network is a composition of many functions, we repeatedly apply the chain rule, working from the output layer back to the input layer.

For a 2-layer network:

L = \ell\bigl(\text{softmax}(W_2 \cdot \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2), \mathbf{y}\bigr)

Computing \frac{\partial L}{\partial W_1} requires chaining gradients through each operation:

\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{a}_2} \cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial W_1}

import numpy as np

class SimpleNet:
    def __init__(self):
        self.W1 = np.random.randn(4, 3) * 0.1
        self.b1 = np.zeros(4)
        self.W2 = np.random.randn(2, 4) * 0.1
        self.b2 = np.zeros(2)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(float)

    def forward(self, x):
        self.x = x
        self.z1 = self.W1 @ x + self.b1
        self.h = self.relu(self.z1)
        self.z2 = self.W2 @ self.h + self.b2
        return self.z2

    def backward(self, dL_dz2):
        # Layer 2 gradients
        dL_dW2 = np.outer(dL_dz2, self.h)
        dL_db2 = dL_dz2
        dL_dh = self.W2.T @ dL_dz2

        # ReLU gradient
        dL_dz1 = dL_dh * self.relu_grad(self.z1)

        # Layer 1 gradients
        dL_dW1 = np.outer(dL_dz1, self.x)
        dL_db1 = dL_dz1

        return dL_dW1, dL_db1, dL_dW2, dL_db2

net = SimpleNet()
x = np.array([1.0, 2.0, 3.0])
output = net.forward(x)
dL_dout = np.array([1.0, -1.0])
grads = net.backward(dL_dout)
print(f"W1 gradient shape: {grads[0].shape}")
print(f"W2 gradient shape: {grads[2].shape}")

2.3 Optimization Theory

Optimality Conditions

  • First-order (necessary): at an extremum, \nabla f = \mathbf{0}
  • Second-order (sufficient): the eigenvalues of the Hessian determine whether it is a minimum or a maximum

Convex Functions

A function is convex if:

f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \lambda \in [0,1]

For convex functions, any local minimum is a global minimum. Deep learning loss functions are generally non-convex, so global optimality cannot be guaranteed.

Gradient Descent

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)

where \alpha is the learning rate. Too large and the updates diverge; too small and convergence is slow.

Common variants:

  • SGD (Stochastic Gradient Descent): uses a random mini-batch each step
  • Momentum: accumulates past gradient directions for acceleration
  • Adam: adaptive learning rates per parameter

Adam update rule:

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

import numpy as np

def gradient_descent(grad_f, init_x, lr=0.1, steps=100):
    x = init_x.copy()
    history = [x.copy()]
    for _ in range(steps):
        grad = grad_f(x)
        x -= lr * grad
        history.append(x.copy())
    return x, np.array(history)

def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2*x[0], 8*x[1]])

init_x = np.array([3.0, 3.0])
final_x, history = gradient_descent(grad_f, init_x, lr=0.1, steps=50)
print(f"Start: {init_x}, Converged to: {final_x.round(6)}")

def adam_optimizer(grad_f, init_x, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    x = init_x.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

final_adam = adam_optimizer(grad_f, init_x)
print(f"Adam converged to: {final_adam.round(6)}")

3. Probability and Statistics

Probability and statistics provide the language for handling uncertainty in deep learning. Loss function design, regularization, and model evaluation all rely on probabilistic thinking.

3.1 Probability Basics

Kolmogorov Axioms

  1. P(A) \geq 0 (non-negativity)
  2. P(\Omega) = 1 (total probability equals 1)
  3. For mutually exclusive events: P(A \cup B) = P(A) + P(B) (additivity)

Conditional Probability

The probability of A given that B has occurred:

P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0

Bayes' Theorem

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

Bayes' theorem formalizes "updating beliefs in the light of evidence":

  • P(A): the prior, your belief before seeing evidence
  • P(B|A): the likelihood, the probability of the evidence under hypothesis A
  • P(A|B): the posterior, your updated belief after seeing evidence

This underlies Bayesian classifiers, Bayesian neural networks, and probabilistic model evaluation.
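A classic worked example makes the update concrete. The numbers below are illustrative (not from the text): a disease test with 99% sensitivity and a 5% false-positive rate, in a population where 1% actually has the disease.

```python
# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
# All probabilities below are illustrative assumptions.
p_disease = 0.01             # prior P(A)
p_pos_given_disease = 0.99   # likelihood P(B|A), the test's sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of a positive result, P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B): despite the accurate test, the low prior dominates
posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {posterior:.4f}")  # ~0.1667
```

Even with a highly accurate test, a positive result implies only about a 17% chance of disease, because the prior is so small; this is exactly the belief-updating role of the prior in Bayes' theorem.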

3.2 Probability Distributions

Discrete Distributions

Bernoulli Distribution

Single trial with outcome 0 or 1:

P(X=k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}

E[X] = p, \quad Var(X) = p(1-p)

Binomial Distribution

Number of successes in n Bernoulli trials:

P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}

Poisson Distribution

Event count per unit of time or space:

P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}

E[X] = Var(X) = \lambda

Continuous Distributions

Uniform Distribution

Equal probability density over [a, b]:

f(x) = \frac{1}{b-a}, \quad a \leq x \leq b

Normal (Gaussian) Distribution

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Characterized by mean \mu and standard deviation \sigma. By the Central Limit Theorem, many natural phenomena follow this distribution. Used extensively in weight initialization and noise modeling.

Multivariate Normal Distribution

f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)

where \boldsymbol{\mu} is the mean vector and \Sigma is the covariance matrix. Central to VAE latent-space modeling.

import numpy as np
from scipy import stats

# Normal distribution
mu, sigma = 0, 1
print(f"P(-1 < X < 1) = {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f}")  # 68.27%
print(f"P(-2 < X < 2) = {stats.norm.cdf(2) - stats.norm.cdf(-2):.4f}")  # 95.45%
print(f"P(-3 < X < 3) = {stats.norm.cdf(3) - stats.norm.cdf(-3):.4f}")  # 99.73%

# Multivariate normal sampling
mean = np.array([0, 0])
cov = np.array([[1, 0.7],
                [0.7, 1]])
samples = np.random.multivariate_normal(mean, cov, size=1000)
print(f"\nSample shape: {samples.shape}")
print(f"Sample mean: {samples.mean(axis=0).round(3)}")
print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

# Binomial distribution
n, p = 10, 0.5
print(f"\nBinomial B(10, 0.5): P(X=5) = {stats.binom.pmf(5, n, p):.4f}")

3.3 Expectation and Variance

Expected Value

E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}

E[X] = \int_{-\infty}^{\infty} x f(x)\,dx \quad \text{(continuous)}

Linearity of expectation: E[aX + b] = aE[X] + b

Variance

Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2

Standard deviation: \sigma = \sqrt{Var(X)}

Covariance

Measures the linear relationship between two random variables:

Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]

Correlation

\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}

It ranges over [-1, 1] and measures the strength of a linear relationship independent of scale.

Covariance Matrix

For an n-dimensional random vector:

\Sigma_{ij} = Cov(X_i, X_j)

The covariance matrix encodes the variance structure of the data and is the key input to PCA.
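These quantities are one-liners in NumPy. A small sketch with synthetic correlated data (the coefficients 0.6 and 0.4 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.6 * x + 0.4 * rng.normal(size=1000)  # y is positively correlated with x

# Covariance matrix: Sigma[i, j] = Cov(X_i, X_j)
cov = np.cov(x, y)
print("Covariance matrix:\n", cov.round(3))

# Correlation: covariance normalized by both standard deviations
rho = cov[0, 1] / (np.sqrt(cov[0, 0]) * np.sqrt(cov[1, 1]))
print(f"Correlation: {rho:.3f}")
print(f"np.corrcoef agrees: {np.corrcoef(x, y)[0, 1]:.3f}")
```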

3.4 Maximum Likelihood Estimation (MLE)

Likelihood Function

The probability of observing data \mathcal{D} = \{x_1, \ldots, x_n\} given parameters \theta:

L(\theta; \mathcal{D}) = P(\mathcal{D}|\theta) = \prod_{i=1}^n P(x_i|\theta)

MLE finds the \theta that maximizes this likelihood:

\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; \mathcal{D})

Log-Likelihood

Taking the log converts the product into a sum, simplifying computation:

\log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)

Since log is monotonically increasing, the maximizer of the log-likelihood equals the maximizer of the likelihood.

MLE for the Normal Distribution

Assuming the data follow \mathcal{N}(\mu, \sigma^2):

\log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2

Setting the derivative with respect to \mu to zero gives \hat{\mu}_{MLE} = \bar{x} (the sample mean).

For \sigma^2: \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2

Cross-Entropy and MLE

Minimizing the cross-entropy loss in classification is equivalent to MLE:

\mathcal{L}_{CE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^C y_{ic} \log \hat{p}_{ic}

This maximizes the log-likelihood of the model under the true data distribution.

import numpy as np
from scipy.optimize import minimize_scalar

# MLE for Bernoulli distribution
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n = len(data)
k = data.sum()

p_mle = k / n
print(f"Analytic MLE: p = {p_mle:.2f}")

def neg_log_likelihood(p):
    if p <= 0 or p >= 1:
        return np.inf
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method='bounded')
print(f"Numerical MLE: p = {result.x:.4f}")

# Cross-entropy loss
def cross_entropy(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
print(f"\nCross-Entropy loss: {cross_entropy(y_true, y_pred):.4f}")

3.5 Information Theory

Entropy

Measures the uncertainty of a probability distribution P:

H(X) = -\sum_{x} P(x) \log P(x)

It is maximal for a uniform distribution and zero when a single outcome is certain.

KL Divergence

An (asymmetric) measure of the "distance" between distributions P and Q:

D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

D_{KL}(P \| Q) \geq 0, with equality iff P = Q. Used in VAE loss functions, PPO policy optimization, and knowledge distillation.

Cross-Entropy

H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)

Since H(P) is constant with respect to the model, minimizing cross-entropy is equivalent to minimizing the KL divergence between the true and model distributions.
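This decomposition of cross-entropy is easy to verify numerically. A minimal sketch using natural-log entropies and two illustrative distributions:

```python
import numpy as np

P = np.array([0.4, 0.3, 0.2, 0.1])      # true distribution
Q = np.array([0.25, 0.25, 0.25, 0.25])  # model distribution

cross_entropy = -np.sum(P * np.log(Q))
entropy_P = -np.sum(P * np.log(P))
kl = np.sum(P * np.log(P / Q))

# Check the identity H(P, Q) = H(P) + D_KL(P || Q)
print(f"H(P,Q)      = {cross_entropy:.4f}")
print(f"H(P) + KL   = {entropy_P + kl:.4f}")
assert np.isclose(cross_entropy, entropy_P + kl)
```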

Mutual Information

The information shared between two random variables:

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

It is the theoretical foundation for feature selection and representation learning.

import numpy as np

def entropy(p):
    p = np.array(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q, eps=1e-15):
    p = np.array(p) + eps
    q = np.array(q) + eps
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log(p / q))

# Uniform distribution (maximum entropy)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform distribution entropy: {entropy(uniform):.4f} bits")   # 2.0

# Concentrated distribution (low entropy)
concentrated = [0.97, 0.01, 0.01, 0.01]
print(f"Concentrated distribution entropy: {entropy(concentrated):.4f} bits")

# KL Divergence
true_dist = [0.4, 0.3, 0.2, 0.1]
model_dist = [0.35, 0.35, 0.15, 0.15]
kl = kl_divergence(true_dist, model_dist)
print(f"\nKL Divergence D_KL(P||Q): {kl:.4f}")

4. Numerical Methods

Understanding the gap between mathematical theory and actual computer arithmetic is important for reliable deep learning implementations.

4.1 Floating-Point Representation

FP32 (Single Precision)

  • 32 bits: 1 sign, 8 exponent, 23 mantissa
  • Precision: ~7 significant decimal digits
  • Range: approximately \pm 3.4 \times 10^{38}

FP16 (Half Precision)

  • 16 bits: 1 sign, 5 exponent, 10 mantissa
  • Saves GPU memory, faster computation (Mixed Precision Training)
  • Risk of overflow/underflow with large or small values

BF16 (Brain Float 16)

  • 16 bits: 1 sign, 8 exponent, 7 mantissa
  • Same exponent range as FP32 (lower overflow risk)
  • Introduced by Google for TPUs, now widely used for AI training
import numpy as np

# Floating-point precision issues
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 (float32) = {a + b}")  # may not be exactly 0.3

# FP16 overflow
print(f"FP16 large value: {np.float16(65000)}")
print(f"FP16 overflow: {np.float16(70000)}")  # inf

# Compare dtype ranges
for dtype in [np.float16, np.float32, np.float64]:
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: max={info.max:.3e}, min_positive={info.tiny:.3e}")

4.2 Numerical Stability

The Instability of Naive Softmax

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

Large inputs cause e^{x_i} to overflow (inf). The fix: subtract the maximum value before exponentiating.

\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}

This is mathematically equivalent but numerically stable.

import numpy as np

def naive_softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

def stable_softmax(x):
    x_shifted = x - np.max(x)
    return np.exp(x_shifted) / np.sum(np.exp(x_shifted))

# Normal input
x_normal = np.array([1.0, 2.0, 3.0])
print("Normal input:")
print(f"  Naive:  {naive_softmax(x_normal)}")
print(f"  Stable: {stable_softmax(x_normal)}")

# Large input
x_large = np.array([1000.0, 2000.0, 3000.0])
print("\nLarge input:")
naive_result = naive_softmax(x_large)
print(f"  Naive:  {naive_result}")   # [nan, nan, nan] due to overflow
print(f"  Stable: {stable_softmax(x_large)}")   # [0, 0, 1]

# Log-Softmax (used with NLLLoss)
def log_softmax(x):
    x_shifted = x - np.max(x)
    return x_shifted - np.log(np.sum(np.exp(x_shifted)))

logits = np.array([2.0, 1.0, 0.1])
print(f"\nLog-Softmax: {log_softmax(logits)}")

5. Mathematical Analysis of Linear Regression

Linear regression is the simplest building block of deep learning. A thorough mathematical understanding creates a foundation for understanding more complex models.

5.1 Ordinary Least Squares

For data \{(\mathbf{x}_i, y_i)\}_{i=1}^n, we fit a linear model \hat{y} = \mathbf{w}^T \mathbf{x} + b.

Loss function (MSE):

L(\mathbf{w}) = \frac{1}{n} \|y - X\mathbf{w}\|^2

Normal Equation

Differentiating with respect to \mathbf{w} and setting the result to zero:

X^T X \mathbf{w} = X^T y

\hat{\mathbf{w}} = (X^T X)^{-1} X^T y

This closed-form solution requires X^T X to be invertible (the features must be linearly independent).

5.2 Ridge and Lasso Regression

Ridge Regression (L2 regularization)

L_{ridge}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2

Normal equation: \hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T y

Adding \lambda I guarantees invertibility and shrinks the weights toward zero, preventing overfitting.

Lasso Regression (L1 regularization)

L_{lasso}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1

L1 regularization induces sparse solutions, where some weights become exactly zero, providing automatic feature selection.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.0, 0.0])
y = X @ true_w + np.random.randn(n) * 0.5

# Normal equation (OLS)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS coefficients:", w_ols.round(2))

# Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_.round(2))

# Lasso (sparse solution)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(2))
print("Number of zero coefficients:", (np.abs(lasso.coef_) < 0.01).sum())

# Ridge via normal equation
def ridge_normal_equation(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ridge = ridge_normal_equation(X, y, lam=1.0)
print("\nRidge (normal equation):", w_ridge.round(2))

6. Integration: The Mathematics of Deep Learning

6.1 A Mathematical View of Neural Networks

A fully connected neural network with L layers:

\mathbf{h}^{(0)} = \mathbf{x}

\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}

\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})

\hat{y} = \mathbf{h}^{(L)}

Forward pass: linear algebra (matrix multiplication) + activation functions

Loss computation: information theory (cross-entropy) or statistics (MSE)

Backward pass: calculus (chain rule) to compute gradients

Parameter update: optimization theory (gradient descent, Adam)
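The four stages above can be put together in a minimal sketch of one training loop for a single linear layer under MSE loss. The data, shapes, and learning rate are illustrative; the point is seeing each piece of mathematics appear in turn:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(16, 3))                        # batch of 16 inputs
true_W = np.array([[1.0, -2.0, 0.5],
                   [0.3, 0.8, -1.0]])               # assumed ground truth
y = X @ true_W.T + 0.01 * rng.normal(size=(16, 2))  # targets with small noise

W = rng.normal(size=(2, 3)) * 0.1
b = np.zeros(2)
lr = 0.1

for step in range(200):
    # Forward pass: linear algebra (matrix multiplication)
    y_hat = X @ W.T + b
    # Loss computation: statistics (MSE)
    loss = np.mean((y_hat - y) ** 2)
    # Backward pass: calculus (chain rule)
    dL_dyhat = 2 * (y_hat - y) / y.size
    dL_dW = dL_dyhat.T @ X
    dL_db = dL_dyhat.sum(axis=0)
    # Parameter update: optimization theory (gradient descent)
    W -= lr * dL_dW
    b -= lr * dL_db

print(f"Final MSE: {loss:.6f}")
```

After a couple hundred steps the recovered W should be close to the assumed `true_W`, with the residual loss near the injected noise level.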

6.2 The Attention Mechanism and Matrix Operations

The Transformer's scaled dot-product attention:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

  • Q, K, V: Query, Key, Value matrices (linear algebra)
  • \frac{1}{\sqrt{d_k}} scaling: numerical stability (numerical analysis)
  • Softmax: converts scores into a probability distribution (probability theory)
  • Dot product QK^T: similarity measurement (vector dot product)

import numpy as np

def stable_softmax_2d(x):
    x_max = x.max(axis=-1, keepdims=True)
    x_shifted = x - x_max
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    weights = stable_softmax_2d(scores)
    output = weights @ V
    return output, weights

# 3 tokens, d_k=4
seq_len, d_k = 3, 4
np.random.seed(42)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention output shape: {output.shape}")
print(f"Attention weights (rows sum to 1):\n{weights.round(3)}")
print(f"Row sums: {weights.sum(axis=-1)}")

Conclusion: Next Steps

This guide covered the core mathematical foundations of AI/ML:

  1. Linear Algebra: the language for representing and transforming data and model parameters
  2. Calculus: the tool explaining how models learn
  3. Probability & Statistics: handles uncertainty and provides the theoretical basis for loss functions
  4. Numerical Methods: knowledge for stable computer implementations of mathematical theory

Recommended next steps:

  • Gilbert Strang's Linear Algebra — free on MIT OCW
  • Pattern Recognition and Machine Learning (Bishop) — probabilistic perspective on ML
  • Deep Learning (Goodfellow et al.) — the theoretical bible of deep learning
  • Practice: implement a neural network from scratch using only NumPy

Mathematics is a tool for understanding. You do not need to memorize every formula, but grasping why each concept exists and how it operates within deep learning gives you significantly more powerful practical skills.


References

  • Gilbert Strang, Introduction to Linear Algebra, 6th Edition
  • Ian Goodfellow et al., Deep Learning, MIT Press (2016)
  • Christopher Bishop, Pattern Recognition and Machine Learning, Springer (2006)
  • 3Blue1Brown — Essence of Linear Algebra, Essence of Calculus (YouTube)
  • fast.ai — Practical Deep Learning for Coders