Mathematical Foundations for AI/ML: Complete Guide

To truly understand machine learning and deep learning, you need to grasp the underlying mathematics. Many developers use frameworks without fully understanding _why_ the math works the way it does. This guide aims to fill that gap.

Rather than listing formulas, we focus on _why_ each concept matters and _how_ it is used in deep learning, accompanied by Python/NumPy code you can run to verify the math yourself.

1. Linear Algebra

Linear algebra is the fundamental language of deep learning. The forward pass of a neural network is a sequence of matrix multiplications, and embeddings are points in a high-dimensional vector space.

1.1 Vectors

**Definition and Geometric Meaning**

A vector is a quantity with both magnitude and direction. Mathematically it is an ordered list of numbers; geometrically it is an arrow pointing to a location in $n$-dimensional space.

$$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

In deep learning, vectors appear everywhere: input data, hidden-layer activations, embedding representations — all are vectors.

**Vector Addition and Scalar Multiplication**

Addition is performed element-wise:

$$\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{bmatrix}$$

Multiplying by scalar $c$ scales each component:

$$c \cdot \mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{bmatrix}$$

**Dot Product**

The dot product measures the "similarity" between two vectors:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

Geometrically, it equals the product of magnitudes and the cosine of the angle $\theta$ between them:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$$

A positive dot product indicates vectors pointing in the same general direction; negative means opposite; zero means orthogonal.

**Cosine Similarity**

To measure directional similarity independent of magnitude, normalize by both norms:

$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

The value ranges from $-1$ to $1$. Cosine similarity is widely used to compare word embeddings and document representations in NLP.

**Vector Norms**

Norms measure the "length" or "size" of a vector:

- $L_1$ norm (Manhattan): $\|\mathbf{v}\|_1 = \sum_i |v_i|$

- $L_2$ norm (Euclidean): $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$

- $L_\infty$ norm (max norm): $\|\mathbf{v}\|_\infty = \max_i |v_i|$

In regularization, $L_1$ encourages sparsity while $L_2$ keeps weights uniformly small.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Dot product
dot_product = np.dot(a, b)  # 3*1 + 4*2 = 11
print(f"Dot product: {dot_product}")

# L2 norm
l2_norm_a = np.linalg.norm(a)  # sqrt(9 + 16) = 5
print(f"L2 norm of a: {l2_norm_a}")

# Cosine similarity
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine_sim:.4f}")

# Various norms
v = np.array([3.0, -4.0, 1.0])
print(f"L1 norm: {np.linalg.norm(v, ord=1)}")         # 8.0
print(f"L2 norm: {np.linalg.norm(v, ord=2):.4f}")     # 5.0990
print(f"Linf norm: {np.linalg.norm(v, ord=np.inf)}")  # 4.0
```

1.2 Matrices

**Matrix Notation**

A matrix is a rectangular array of numbers with $m$ rows and $n$ columns ($m \times n$ matrix):

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

In deep learning, the weight matrix $W$ represents the linear transformation between layers.

**Matrix Multiplication**

If $A$ is $m \times k$ and $B$ is $k \times n$, then $C = AB$ is $m \times n$:

$$C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}$$

The key rule: **columns of the left matrix must equal rows of the right matrix**. Matrix multiplication is generally not commutative: $AB \neq BA$.

**Transpose**

Swap rows and columns: $(A^T)_{ij} = A_{ji}$.

Property: $(AB)^T = B^T A^T$

**Inverse Matrix**

For a square matrix $A$, its inverse $A^{-1}$ satisfies $AA^{-1} = A^{-1}A = I$. An inverse exists only when the determinant is non-zero.

**Determinant**

For a $2 \times 2$ matrix:

$$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$$

The determinant tells you how much the linear transformation stretches or compresses space. $\det(A) = 0$ means the transformation collapses dimensions — no inverse exists.

**Rank**

The rank of a matrix is the maximum number of linearly independent rows (or columns). A lower rank means the transformation produces a lower-dimensional output space. Low-rank approximation is central to model compression and to parameter-efficient fine-tuning techniques like LoRA.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
B = np.array([[9, 8, 7],
              [6, 5, 4],
              [3, 2, 1]])

# Matrix multiplication
C = A @ B
print("Matrix product:\n", C)

# Transpose
print("Transpose of A:\n", A.T)

# Determinant
A2 = np.array([[3.0, 1.0],
               [2.0, 4.0]])
print(f"Determinant: {np.linalg.det(A2):.2f}")  # 3*4 - 1*2 = 10

# Inverse
A_inv = np.linalg.inv(A2)
print("Inverse:\n", A_inv)
print("A @ A_inv (should be identity):\n", np.round(A2 @ A_inv))

# Rank
print(f"Rank of A: {np.linalg.matrix_rank(A)}")  # 2 (rows are linearly dependent)
```

1.3 Eigenvalues and Eigenvectors

**Definition**

For a square matrix $A$, a non-zero vector $\mathbf{v}$ and scalar $\lambda$ satisfying:

$$A\mathbf{v} = \lambda \mathbf{v}$$

are called the **eigenvector** and **eigenvalue**, respectively.

The meaning is powerful: when $A$ transforms $\mathbf{v}$, the vector stays on the same line through the origin. Only its magnitude changes, by a factor of $|\lambda|$, and its direction flips if $\lambda$ is negative.

**Eigendecomposition**

For a symmetric matrix, we can decompose using orthogonal eigenvectors:

$$A = Q \Lambda Q^T$$

where $Q$ contains eigenvectors as columns and $\Lambda$ is the diagonal matrix of eigenvalues.

**Relationship to PCA**

The eigenvectors of the covariance matrix $\Sigma = \frac{1}{n} X^T X$, where $X$ is the mean-centered data matrix with one sample per row, are the principal components. The eigenvector corresponding to the largest eigenvalue points in the direction of greatest variance, which is the most informative feature direction.

**SVD (Singular Value Decomposition)**

While eigendecomposition only applies to square matrices, SVD applies to any matrix:

$$A = U \Sigma V^T$$

- $U$: $m \times m$ orthogonal matrix (left singular vectors)

- $\Sigma$: $m \times n$ diagonal matrix (singular values, non-negative)

- $V^T$: $n \times n$ orthogonal matrix (right singular vectors)

SVD underpins matrix approximation, noise reduction, recommendation systems, and LSA, and the same low-rank idea motivates the LoRA technique for parameter-efficient fine-tuning.

```python
import numpy as np

# Eigenvalues and eigenvectors
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):\n", eigenvectors)

# Verify: A v = lambda v
v0 = eigenvectors[:, 0]
lam0 = eigenvalues[0]
print("A*v:", A @ v0)
print("lambda*v:", lam0 * v0)

# SVD
M = np.array([[1, 2, 3],
              [4, 5, 6]])
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"\nU shape: {U.shape}")
print(f"Singular values: {S}")
print(f"Vt shape: {Vt.shape}")

# Rank-1 approximation: keep only the largest singular value
rank1_approx = np.outer(S[0] * U[:, 0], Vt[0, :])
print("Original matrix:\n", M)
print("Rank-1 approximation:\n", rank1_approx.round(2))

# PCA example
np.random.seed(42)
data = np.random.randn(100, 2)
data[:, 1] = data[:, 0] * 0.8 + data[:, 1] * 0.2  # make the two features correlated
cov = np.cov(data.T)
# The covariance matrix is symmetric, so eigh is the appropriate routine
eigenvals, eigenvecs = np.linalg.eigh(cov)
idx = np.argsort(eigenvals)[::-1]  # indices of eigenvalues in descending order
print("\nFirst principal component:", eigenvecs[:, idx[0]])
print("Explained variance ratio:", eigenvals[idx[0]] / eigenvals.sum())
```

1.4 Linear Algebra in Deep Learning

**Layers as Linear Transformations**

A fully connected layer is fundamentally a linear transformation:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where $\mathbf{x} \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^m$.

Without non-linear activation functions, stacking multiple layers reduces to a single linear transformation — which is why activations like ReLU and Sigmoid are essential.

**Batch Processing**

For a batch of $B$ samples processed simultaneously:

$$Y = XW^T + \mathbf{b}$$

where $X \in \mathbb{R}^{B \times n}$ and $Y \in \mathbb{R}^{B \times m}$. GPUs are optimized for exactly this type of large-scale matrix multiplication.
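The two claims in this section can be checked numerically. The following is a minimal sketch, assuming small randomly generated weights (the shapes and the seed are arbitrary choices for illustration): it shows that two stacked linear layers without an activation equal one linear layer, and that the batch formula matches a per-sample loop.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Two stacked linear layers with no activation in between
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

y_stacked = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
y_single = W_eff @ x + b_eff
print("Stacked == single layer:", np.allclose(y_stacked, y_single))

# Batch form: each row of X is one sample, Y = X W^T + b
X = rng.normal(size=(5, 3))
Y_batch = X @ W1.T + b1                        # b1 is broadcast across the 5 rows
Y_loop = np.stack([W1 @ xi + b1 for xi in X])  # one sample at a time
print("Batch == per-sample loop:", np.allclose(Y_batch, Y_loop))
```

Writing the batch as rows of $X$ is what makes the transpose $W^T$ appear in the formula; the bias is added to every row by broadcasting.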

2. Calculus

Calculus governs _how_ deep learning models learn. Gradient descent — the engine of learning — relies entirely on differentiation.

2.1 Differentiation

**Limit Definition of Derivative**

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

This is the instantaneous rate of change — the slope of the tangent line at point $x$.

**Basic Differentiation Rules**

- Power rule: $(x^n)' = nx^{n-1}$

- Sum rule: $(f+g)' = f' + g'$

- Product rule: $(fg)' = f'g + fg'$

- Quotient rule: $(f/g)' = (f'g - fg') / g^2$

- Chain rule: $[f(g(x))]' = f'(g(x)) \cdot g'(x)$

Derivatives of common deep learning functions:

- $\frac{d}{dx}(e^x) = e^x$

- $\frac{d}{dx}(\ln x) = \frac{1}{x}$

- $\frac{d}{dx}(\sigma(x)) = \sigma(x)(1 - \sigma(x))$ (Sigmoid)

- $\frac{d}{dx}(\tanh(x)) = 1 - \tanh^2(x)$

- $\frac{d}{dx}(\text{ReLU}(x)) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$

**Partial Derivatives**

For a multi-variable function, differentiate with respect to one variable while treating all others as constants:

$$\frac{\partial f}{\partial x_i}$$

For $f(x, y) = x^2 + 3xy + y^2$:

$$\frac{\partial f}{\partial x} = 2x + 3y, \quad \frac{\partial f}{\partial y} = 3x + 2y$$

**Gradient**

The gradient collects all partial derivatives into a vector:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient points in the direction of **steepest ascent**. To minimize a loss function, move in the opposite direction (gradient descent).

**Jacobian Matrix**

For a vector function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian collects all first-order partial derivatives:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

**Hessian Matrix**

The matrix of second-order partial derivatives for a scalar function:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

At a critical point (where $\nabla f = \mathbf{0}$), the eigenvalues of the Hessian indicate curvature: all positive means a local minimum, all negative means a local maximum, and mixed signs indicate a saddle point.
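As an illustration of the eigenvalue test, here is a small sketch. The helper name `classify_critical_point` and the two example functions are assumptions made for this example; both functions have a critical point at the origin and a constant Hessian, so the Hessians can be written down by hand.

```python
import numpy as np

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)  # eigvalsh assumes a symmetric matrix
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive (some eigenvalues are zero)"

# f(x, y) = x^2 + 2y^2 has Hessian [[2, 0], [0, 4]]
H_bowl = np.array([[2.0, 0.0], [0.0, 4.0]])
# g(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]]
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])

print("f at origin:", classify_critical_point(H_bowl))
print("g at origin:", classify_critical_point(H_saddle))
```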

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Compute the gradient numerically via central differences."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += h
        x_minus = x.copy()
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# f(x, y) = x^2 + 2y^2
def f(x):
    return x[0]**2 + 2 * x[1]**2

x0 = np.array([3.0, 4.0])
grad = numerical_gradient(f, x0)
print(f"Numerical gradient: {grad}")  # approximately [6.0, 16.0]
print(f"Analytic gradient: [{2*x0[0]}, {4*x0[1]}]")

# Automatic differentiation with PyTorch
try:
    import torch

    x = torch.tensor([3.0, 4.0], requires_grad=True)
    y = x[0]**2 + 2 * x[1]**2
    y.backward()
    print(f"PyTorch gradient: {x.grad}")
except ImportError:
    print("PyTorch not installed")
```

2.2 Chain Rule and Backpropagation

**Chain Rule**

For composite functions, differentiation follows:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

Generalized to multiple variables:

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}$$

**Backpropagation**

Training a neural network requires computing $\frac{\partial L}{\partial w}$ for every weight $w$. Since a network is a composition of many functions, we repeatedly apply the chain rule — from the output layer back to the input layer.

For a 2-layer network:

$$L = \ell\bigl(\text{softmax}(W_2 \cdot \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2), \mathbf{y}\bigr)$$

Computing $\frac{\partial L}{\partial W_1}$ requires chaining gradients through each operation:

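$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{a}_2} \cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial W_1}$$

Here $\mathbf{h} = \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1)$ is the hidden activation and $\mathbf{a}_2 = W_2 \mathbf{h} + \mathbf{b}_2$ is the output-layer pre-activation (the logits passed to softmax).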

```python
import numpy as np

class SimpleNet:
    def __init__(self):
        self.W1 = np.random.randn(4, 3) * 0.1
        self.b1 = np.zeros(4)
        self.W2 = np.random.randn(2, 4) * 0.1
        self.b2 = np.zeros(2)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(float)

    def forward(self, x):
        # Cache intermediate values; backward() needs them
        self.x = x
        self.z1 = self.W1 @ x + self.b1
        self.h = self.relu(self.z1)
        self.z2 = self.W2 @ self.h + self.b2
        return self.z2

    def backward(self, dL_dz2):
        # Layer 2 gradients
        dL_dW2 = np.outer(dL_dz2, self.h)
        dL_db2 = dL_dz2
        dL_dh = self.W2.T @ dL_dz2

        # ReLU gradient
        dL_dz1 = dL_dh * self.relu_grad(self.z1)

        # Layer 1 gradients
        dL_dW1 = np.outer(dL_dz1, self.x)
        dL_db1 = dL_dz1
        return dL_dW1, dL_db1, dL_dW2, dL_db2

net = SimpleNet()
x = np.array([1.0, 2.0, 3.0])
output = net.forward(x)
dL_dout = np.array([1.0, -1.0])  # example upstream gradient dL/dz2
grads = net.backward(dL_dout)
print(f"W1 gradient shape: {grads[0].shape}")
print(f"W2 gradient shape: {grads[2].shape}")
```

2.3 Optimization Theory

**Optimality Conditions**

- First-order (necessary): at an extremum, $\nabla f = \mathbf{0}$

- Second-order (sufficient): the Hessian's eigenvalues determine min/max

**Convex Functions**

A function is convex if:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \lambda \in [0,1]$$

For convex functions, any local minimum is a global minimum. Deep learning loss functions are generally non-convex, so global optimality cannot be guaranteed.

**Gradient Descent**

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$

where $\alpha$ is the learning rate. Too large and the updates diverge; too small and convergence is slow.
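The effect of the learning rate can be seen on a simple quadratic. The sketch below assumes $f(x, y) = x^2 + 4y^2$ and a starting point of $(3, 3)$, both chosen for illustration. For this function each update multiplies $y$ by $(1 - 8\alpha)$, so the iterates diverge once $\alpha > 0.25$.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x, y) = x^2 + 4y^2
    return np.array([2.0 * x[0], 8.0 * x[1]])

def run_gd(lr, steps=50):
    """Run plain gradient descent from (3, 3) and return the final point."""
    x = np.array([3.0, 3.0])
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

# Too small, reasonable, and too large learning rates
results = {lr: run_gd(lr) for lr in (0.01, 0.1, 0.3)}
for lr, x_final in results.items():
    print(f"lr={lr}: distance from minimum = {np.linalg.norm(x_final):.3e}")
```

With these settings the smallest rate is still far from the minimum after 50 steps, the middle rate gets very close, and the largest rate moves away from the minimum at every step.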

Common variants:

- **SGD (Stochastic Gradient Descent)**: uses a random mini-batch each step

- **Momentum**: accumulates past gradient directions for acceleration

- **Adam**: adaptive learning rates per parameter

Adam update rule:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$

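$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Here $g_t = \nabla_\theta L(\theta_t)$ is the gradient at step $t$, and $\hat{m}_t$, $\hat{v}_t$ are bias-corrected estimates that compensate for initializing $m_0 = v_0 = 0$.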

```python
import numpy as np

def gradient_descent(grad_f, init_x, lr=0.1, steps=100):
    x = init_x.copy()
    history = [x.copy()]
    for _ in range(steps):
        grad = grad_f(x)
        x -= lr * grad
        history.append(x.copy())
    return x, np.array(history)

# f(x, y) = x^2 + 4y^2, minimized at the origin
def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2*x[0], 8*x[1]])

init_x = np.array([3.0, 3.0])
final_x, history = gradient_descent(grad_f, init_x, lr=0.1, steps=50)
print(f"Start: {init_x}, Converged to: {final_x.round(6)}")

def adam_optimizer(grad_f, init_x, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    x = init_x.copy()
    m = np.zeros_like(x)  # first-moment (mean) estimate
    v = np.zeros_like(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Each Adam step moves a coordinate by roughly alpha at most, so the defaults
# (alpha=0.01, 100 steps) cannot travel from (3, 3) to the origin.
# A larger step size and more steps are used here.
final_adam = adam_optimizer(grad_f, init_x, alpha=0.1, steps=500)
print(f"Adam result after 500 steps: {final_adam.round(6)}")
```

3. Probability and Statistics

Probability and statistics provide the language for handling uncertainty in deep learning. Loss function design, regularization, and model evaluation all rely on probabilistic thinking.

3.1 Probability Basics

**Kolmogorov Axioms**

1. $P(A) \geq 0$ (non-negativity)

2. $P(\Omega) = 1$ (total probability equals 1)

3. For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$ (additivity)

**Conditional Probability**

The probability of $A$ given that $B$ has occurred:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

**Bayes' Theorem**

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Bayes' theorem formalizes "updating beliefs in the light of evidence":

- $P(A)$: Prior — belief before seeing evidence

- $P(B|A)$: Likelihood — probability of evidence under hypothesis $A$

- $P(A|B)$: Posterior — updated belief after seeing evidence

This underlies Bayesian classifiers, Bayesian neural networks, and probabilistic model evaluation.
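A worked example makes the update concrete. The numbers below are hypothetical, chosen only for illustration: a condition with 1% prevalence, a test that detects it 95% of the time, and a 5% false-positive rate. The denominator $P(B)$ is expanded with the law of total probability.

```python
# Hypothetical numbers for a diagnostic test
prior = 0.01           # P(A): prevalence of the condition
sensitivity = 0.95     # P(B|A): positive test given the condition
false_positive = 0.05  # P(B|not A): positive test without the condition

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
evidence = sensitivity * prior + false_positive * (1 - prior)

# Bayes' theorem
posterior = sensitivity * prior / evidence

print(f"P(B)   = {evidence:.4f}")   # 0.0590
print(f"P(A|B) = {posterior:.4f}")  # 0.1610
```

Even with an accurate test, the posterior is only about 16% because the prior is so small: most positive results come from the much larger group without the condition.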

3.2 Probability Distributions

**Discrete Distributions**

_Bernoulli Distribution_

Single trial with outcome 0 or 1:

$$P(X=k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}$$

$E[X] = p$, $Var(X) = p(1-p)$

_Binomial Distribution_

Number of successes in $n$ Bernoulli trials:

$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$

_Poisson Distribution_

Event count in unit time or space:

$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

$E[X] = Var(X) = \lambda$

**Continuous Distributions**

_Uniform Distribution_

Equal probability density over $[a, b]$:

$$f(x) = \frac{1}{b-a}, \quad a \leq x \leq b$$

_Normal (Gaussian) Distribution_

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

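Characterized by mean $\mu$ and standard deviation $\sigma$. By the Central Limit Theorem, sums and averages of many independent random effects are approximately normal, which is why many natural phenomena are well modeled by this distribution. It is used extensively in weight initialization and noise modeling.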

_Multivariate Normal Distribution_

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

where $\boldsymbol{\mu}$ is the mean vector and $\Sigma$ is the covariance matrix. Central to VAE latent space modeling.

```python
import numpy as np
from scipy import stats

# Normal distribution
mu, sigma = 0, 1
normal = stats.norm(loc=mu, scale=sigma)
print(f"P(-1 < X < 1) = {normal.cdf(1) - normal.cdf(-1):.4f}")  # 0.6827
print(f"P(-2 < X < 2) = {normal.cdf(2) - normal.cdf(-2):.4f}")  # 0.9545
print(f"P(-3 < X < 3) = {normal.cdf(3) - normal.cdf(-3):.4f}")  # 0.9973

# Multivariate normal sampling
mean = np.array([0, 0])
cov = np.array([[1, 0.7],
                [0.7, 1]])
samples = np.random.multivariate_normal(mean, cov, size=1000)
print(f"\nSample shape: {samples.shape}")
print(f"Sample mean: {samples.mean(axis=0).round(3)}")
print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

# Binomial distribution
n, p = 10, 0.5
print(f"\nBinomial B(10, 0.5): P(X=5) = {stats.binom.pmf(5, n, p):.4f}")  # 0.2461
```

3.3 Expectation and Variance

**Expected Value**

$$E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}$$

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx \quad \text{(continuous)}$$

Linearity of expectation: $E[aX + b] = aE[X] + b$

**Variance**

$$Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$

Standard deviation: $\sigma = \sqrt{Var(X)}$

**Covariance**

Measures the linear relationship between two random variables:

$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

**Correlation**

$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

Range $[-1, 1]$; measures strength of linear relationship independent of scale.

**Covariance Matrix**

For an $n$-dimensional random vector:

$$\Sigma_{ij} = Cov(X_i, X_j)$$

The covariance matrix encodes the variance structure of data and is the key input to PCA.
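The definitions in this section can be checked with a few lines of NumPy. This is a minimal sketch under two assumptions made for illustration: a fair six-sided die for the discrete formulas, and synthetic data $Y = 2X + \text{noise}$ for the covariance and correlation.

```python
import numpy as np

# Discrete case: a fair six-sided die
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
mean = np.sum(values * probs)              # E[X]
var = np.sum(values**2 * probs) - mean**2  # E[X^2] - (E[X])^2
print(f"E[X] = {mean:.4f}, Var(X) = {var:.4f}")  # 3.5000, 2.9167

# Sample case: two correlated variables
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)  # in theory Cov(X, Y) = 2 and Var(Y) = 5

cov_matrix = np.cov(x, y)  # 2x2 sample covariance matrix
corr = cov_matrix[0, 1] / np.sqrt(cov_matrix[0, 0] * cov_matrix[1, 1])
print("Covariance matrix:\n", cov_matrix.round(3))
print(f"Correlation: {corr:.3f} (theory: {2 / np.sqrt(5):.3f})")
```

The diagonal of the covariance matrix holds the variances, and dividing the off-diagonal entry by the two standard deviations gives the correlation.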

3.4 Maximum Likelihood Estimation (MLE)

**Likelihood Function**

The probability of observing data $\mathcal{D} = \{x_1, \ldots, x_n\}$ given parameters $\theta$, assuming the samples are independent and identically distributed (i.i.d.):

$$L(\theta; \mathcal{D}) = P(\mathcal{D}|\theta) = \prod_{i=1}^n P(x_i|\theta)$$

MLE finds the $\theta$ that maximizes this likelihood:

$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; \mathcal{D})$$

**Log-Likelihood**

Taking the log converts the product to a sum, simplifying computation:

$$\log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)$$

Since log is monotonically increasing, the maximizer of log-likelihood equals the maximizer of likelihood.

**MLE for the Normal Distribution**

Assuming data follows $\mathcal{N}(\mu, \sigma^2)$:

$$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$

Setting the derivative with respect to $\mu$ to zero: $\hat{\mu}_{MLE} = \bar{x}$ (sample mean)

For $\sigma^2$: $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum (x_i - \bar{x})^2$
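The closed-form estimates can be checked against the log-likelihood directly. The sketch below assumes synthetic data drawn from $\mathcal{N}(2, 1.5^2)$ (an arbitrary choice) and confirms that the MLE values score at least as high as nearby parameter values. It also shows that the MLE variance divides by $n$, unlike the unbiased estimator, which divides by $n - 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data for illustration
n = len(x)

# Closed-form MLE
mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)  # divides by n
sigma2_unbiased = x.var(ddof=1)          # divides by n - 1

def log_likelihood(mu, sigma2):
    """Log-likelihood of the data under N(mu, sigma2)."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

best = log_likelihood(mu_mle, sigma2_mle)
print(f"mu_MLE = {mu_mle:.4f}, sigma2_MLE = {sigma2_mle:.4f}")
print(f"Unbiased variance estimate = {sigma2_unbiased:.4f}")
print(f"log L at MLE:             {best:.3f}")
print(f"log L at mu + 0.1:        {log_likelihood(mu_mle + 0.1, sigma2_mle):.3f}")
print(f"log L at 1.2 * sigma2:    {log_likelihood(mu_mle, 1.2 * sigma2_mle):.3f}")
```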

**Cross-Entropy and MLE**

Minimizing cross-entropy loss in classification is equivalent to MLE:

$$\mathcal{L}_{CE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^C y_{ic} \log \hat{p}_{ic}$$

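With one-hot labels $y_{ic}$, $\mathcal{L}_{CE}$ is the negative average log-likelihood that the model assigns to the observed labels, so minimizing it maximizes the log-likelihood of the training data.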

```python
import numpy as np
from scipy.optimize import minimize_scalar

# MLE for Bernoulli distribution
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n = len(data)
k = data.sum()
p_mle = k / n  # closed-form MLE: fraction of successes
print(f"Analytic MLE: p = {p_mle:.2f}")

def neg_log_likelihood(p):
    if p <= 0 or p >= 1:
        return np.inf
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method='bounded')
print(f"Numerical MLE: p = {result.x:.4f}")

# Cross-entropy loss (binary case)
def cross_entropy(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
print(f"\nCross-Entropy loss: {cross_entropy(y_true, y_pred):.4f}")
```

3.5 Information Theory

**Entropy**

Measures the uncertainty of a probability distribution $P$:

$$H(X) = -\sum_{x} P(x) \log P(x)$$

Entropy is maximal for a uniform distribution and zero when a single outcome is certain.

**KL Divergence**

A measure of "distance" between distributions $P$ and $Q$ (asymmetric):

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

$D_{KL}(P \| Q) \geq 0$, with equality iff $P = Q$. Used in VAE loss functions, PPO policy optimization, and knowledge distillation.

**Cross-Entropy**

$$H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)$$

Since $H(P)$ does not depend on the model distribution $Q$, minimizing cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence between the true and model distributions.
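The decomposition $H(P, Q) = H(P) + D_{KL}(P \| Q)$ is easy to verify numerically. The sketch below assumes two small hand-picked distributions and uses natural logarithms throughout, so all quantities are in nats.

```python
import numpy as np

# Two example distributions over four outcomes (chosen for illustration)
P = np.array([0.4, 0.3, 0.2, 0.1])
Q = np.array([0.35, 0.35, 0.15, 0.15])

entropy_P = -np.sum(P * np.log(P))         # H(P)
cross_entropy_PQ = -np.sum(P * np.log(Q))  # H(P, Q)
kl_PQ = np.sum(P * np.log(P / Q))          # D_KL(P || Q)
kl_QP = np.sum(Q * np.log(Q / P))          # D_KL(Q || P), generally different

print(f"H(P)            = {entropy_P:.6f}")
print(f"D_KL(P||Q)      = {kl_PQ:.6f}")
print(f"H(P) + D_KL     = {entropy_P + kl_PQ:.6f}")
print(f"H(P, Q)         = {cross_entropy_PQ:.6f}")
print(f"D_KL(Q||P)      = {kl_QP:.6f}  (KL divergence is asymmetric)")
```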

**Mutual Information**

Information shared between two random variables:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

Theoretical foundation for feature selection and representation learning.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (log base 2)."""
    p = np.array(p)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q, eps=1e-15):
    """KL divergence D_KL(P || Q) in nats (natural log)."""
    p = np.array(p) + eps
    q = np.array(q) + eps
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log(p / q))

# Uniform distribution (maximum entropy)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform distribution entropy: {entropy(uniform):.4f} bits")  # 2.0

# Concentrated distribution (low entropy)
concentrated = [0.97, 0.01, 0.01, 0.01]
print(f"Concentrated distribution entropy: {entropy(concentrated):.4f} bits")

# KL Divergence
true_dist = [0.4, 0.3, 0.2, 0.1]
model_dist = [0.35, 0.35, 0.15, 0.15]
kl = kl_divergence(true_dist, model_dist)
print(f"\nKL Divergence D_KL(P||Q): {kl:.4f} nats")
```

4. Numerical Methods

Understanding the gap between mathematical theory and actual computer arithmetic is important for reliable deep learning implementations.

4.1 Floating-Point Representation

**FP32 (Single Precision)**

- 32 bits: 1 sign, 8 exponent, 23 mantissa

- Precision: ~7 significant decimal digits

- Range: approximately $\pm 3.4 \times 10^{38}$

**FP16 (Half Precision)**

- 16 bits: 1 sign, 5 exponent, 10 mantissa

- Saves GPU memory, faster computation (Mixed Precision Training)

- Risk of overflow/underflow with large or small values

**BF16 (Brain Float 16)**

- 16 bits: 1 sign, 8 exponent, 7 mantissa

- Same exponent range as FP32 (lower overflow risk)

- Introduced by Google for TPUs, now widely used for AI training
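The range versus precision trade-off between FP16 and BF16 can be illustrated in NumPy. NumPy has no built-in BF16 type, so the sketch below emulates it by zeroing the low 16 bits of each FP32 value. The helper `to_bf16_truncated` is an assumption for this example, and it truncates rather than rounding to nearest as real hardware typically does.

```python
import numpy as np

def to_bf16_truncated(x):
    """Emulate BF16 by keeping only the top 16 bits of each FP32 value (truncation)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([70000.0, 1e-8, 3.14159265], dtype=np.float32)

with np.errstate(over="ignore"):  # silence the expected FP16 overflow warning
    as_fp16 = x.astype(np.float16)
as_bf16 = to_bf16_truncated(x)

print("FP32:           ", x)
print("FP16:           ", as_fp16)   # large value overflows, tiny value underflows
print("BF16 (emulated):", as_bf16)   # both stay finite and non-zero, with fewer digits
```

BF16 keeps the 8 exponent bits of FP32, so the large and tiny values survive, but with only 7 mantissa bits each value retains roughly 2 to 3 significant decimal digits.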

```python
import numpy as np

# Floating-point precision issues
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 (float32) = {a + b}")  # may not be exactly 0.3
print(f"0.1 + 0.2 == 0.3 (float64): {0.1 + 0.2 == 0.3}")  # False

# FP16 overflow
print(f"FP16 large value: {np.float16(65000)}")
print(f"FP16 overflow: {np.float16(70000)}")  # inf (largest finite FP16 value is 65504)

# Compare dtype ranges (tiny is the smallest positive normal value)
for dtype in [np.float16, np.float32, np.float64]:
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: max={float(info.max):.3e}, min_positive={float(info.tiny):.3e}")
```

4.2 Numerical Stability

**The Instability of Naive Softmax**

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Large inputs cause $e^{x_i}$ to overflow (inf). The fix: subtract the maximum value before exponentiation.

$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

This is mathematically equivalent but numerically stable.

```python
import numpy as np

def naive_softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

def stable_softmax(x):
    x_shifted = x - np.max(x)  # largest entry becomes 0, so exp never overflows
    return np.exp(x_shifted) / np.sum(np.exp(x_shifted))

# Normal input
x_normal = np.array([1.0, 2.0, 3.0])
print("Normal input:")
print(f" Naive: {naive_softmax(x_normal)}")
print(f" Stable: {stable_softmax(x_normal)}")

# Large input
x_large = np.array([1000.0, 2000.0, 3000.0])
print("\nLarge input:")
naive_result = naive_softmax(x_large)  # emits overflow warnings
print(f" Naive: {naive_result}")  # [nan, nan, nan] due to overflow (inf / inf)
print(f" Stable: {stable_softmax(x_large)}")  # [0, 0, 1]

# Log-Softmax (used with NLLLoss)
def log_softmax(x):
    x_shifted = x - np.max(x)
    return x_shifted - np.log(np.sum(np.exp(x_shifted)))

logits = np.array([2.0, 1.0, 0.1])
print(f"\nLog-Softmax: {log_softmax(logits)}")
```

5. Mathematical Analysis of Linear Regression

Linear regression is the simplest building block of deep learning. A thorough mathematical understanding creates a foundation for understanding more complex models.

5.1 Ordinary Least Squares

For data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, we fit a linear model $\hat{y} = \mathbf{w}^T \mathbf{x} + b$.

**Loss function (MSE)**:

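$$L(\mathbf{w}) = \frac{1}{n} \|y - X\mathbf{w}\|^2$$

Here $X$ stacks the inputs $\mathbf{x}_i^T$ as rows and $y$ is the vector of targets. The bias $b$ is absorbed into $\mathbf{w}$ by appending a constant 1 to each $\mathbf{x}_i$.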

**Normal Equation**

Differentiating with respect to $\mathbf{w}$ and setting to zero:

$$X^T X \mathbf{w} = X^T y$$

$$\hat{\mathbf{w}} = (X^T X)^{-1} X^T y$$

This closed-form solution requires $X^T X$ to be invertible (features must be linearly independent).
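The normal equation can be solved directly in NumPy. The sketch below assumes synthetic data with known coefficients (chosen for illustration) and absorbs the bias into $\mathbf{w}$ by appending a column of ones to $X$. It uses `np.linalg.solve` rather than forming the inverse explicitly, which is the usual numerical practice, and compares against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
# Synthetic targets: y = 1.5*x1 - 2.0*x2 + 0.7 + small noise
y = 1.5 * features[:, 0] - 2.0 * features[:, 1] + 0.7 + 0.1 * rng.normal(size=n)

# Absorb the bias b into w by appending a column of ones
X = np.hstack([features, np.ones((n, 1))])

# Normal equation: solve (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from a least-squares solver
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print("Normal equation:", w_normal.round(3))
print("lstsq:          ", w_lstsq.round(3))

# At the solution the gradient of the loss, (2/n) X^T (Xw - y), vanishes
grad = 2 / n * X.T @ (X @ w_normal - y)
print("Gradient norm at solution:", np.linalg.norm(grad))
```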

5.2 Ridge and Lasso Regression

**Ridge Regression (L2 regularization)**

$$L_{ridge}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2$$

Normal equation: $\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T y$

For $\lambda > 0$, adding $\lambda I$ guarantees invertibility and shrinks weights toward zero, which helps prevent overfitting.

**Lasso Regression (L1 regularization)**

$$L_{lasso}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1$$

L1 regularization induces sparse solutions — some weights become exactly zero — providing automatic feature selection.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.0, 0.0])
y = X @ true_w + np.random.randn(n) * 0.5

# OLS via least squares (solves the normal equation in a numerically stable way)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS coefficients:", w_ols.round(2))

# Ridge (no intercept, so it matches the normal-equation version below)
ridge = Ridge(alpha=1.0, fit_intercept=False)
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_.round(2))

# Lasso (sparse solution)
# Note: scikit-learn scales the squared-error term by 1/(2n), so alpha is not
# directly comparable to lambda in the formula above.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(2))
print("Number of zero coefficients:", (np.abs(lasso.coef_) < 0.01).sum())

# Ridge via normal equation
def ridge_normal_equation(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ridge = ridge_normal_equation(X, y, lam=1.0)
print("\nRidge (normal equation):", w_ridge.round(2))
```

6. Integration: The Mathematics of Deep Learning

6.1 A Mathematical View of Neural Networks

A fully connected neural network with $L$ layers:

$$\mathbf{h}^{(0)} = \mathbf{x}$$

$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$$

$$\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

$$\hat{y} = \mathbf{h}^{(L)}$$

**Forward pass**: linear algebra (matrix multiplication) + activation functions

**Loss computation**: information theory (cross-entropy) or statistics (MSE)

**Backward pass**: calculus (chain rule) to compute gradients

**Parameter update**: optimization theory (gradient descent, Adam)
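The layer recursion above maps directly onto a short loop. This is a minimal sketch, assuming ReLU as the activation $\sigma$ at every layer (including the last, as in the equations) and arbitrary layer sizes 3, 5, 4, 2 chosen for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases, activation=relu):
    """Apply h^(l) = activation(W^(l) h^(l-1) + b^(l)) for each layer in turn."""
    h = x  # h^(0) = x
    for W, b in zip(weights, biases):
        z = W @ h + b      # linear step
        h = activation(z)  # non-linear step
    return h               # h^(L), the network output

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 4, 2]  # input dimension, two hidden layers, output dimension
weights = [rng.normal(size=(n_out, n_in)) * 0.5
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = np.array([1.0, -2.0, 0.5])
y_hat = forward(x, weights, biases)
print("Output shape:", y_hat.shape)
print("Output:", y_hat)
```

In practice the final layer usually uses a task-specific activation (softmax for classification, identity for regression) rather than the same $\sigma$ as the hidden layers.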

6.2 The Attention Mechanism and Matrix Operations

The Transformer's scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

- $Q, K, V$: Query, Key, Value matrices (linear algebra)

- $\frac{1}{\sqrt{d_k}}$ scaling: keeps the variance of the scores near 1 as $d_k$ grows, so the softmax does not saturate (numerical analysis)

- Softmax: converts scores to probability distribution (probability theory)

- Dot product $QK^T$: similarity measurement (vector dot product)

```python
import numpy as np

def stable_softmax_2d(x):
    # Row-wise softmax with the max subtracted for numerical stability
    x_max = x.max(axis=-1, keepdims=True)
    x_shifted = x - x_max
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Positions where mask is False get a very negative score (weight ~ 0)
        scores = np.where(mask, scores, -1e9)
    weights = stable_softmax_2d(scores)
    output = weights @ V
    return output, weights

# 3 tokens, d_k=4
seq_len, d_k = 3, 4
np.random.seed(42)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention output shape: {output.shape}")
print(f"Attention weights (rows sum to 1):\n{weights.round(3)}")
print(f"Row sums: {weights.sum(axis=-1)}")
```

Conclusion: Next Steps

This guide covered the core mathematical foundations of AI/ML:

1. **Linear Algebra**: the language for representing and transforming data and model parameters

2. **Calculus**: the tool explaining _how_ models learn

3. **Probability & Statistics**: handles uncertainty and provides the theoretical basis for loss functions

4. **Numerical Methods**: knowledge for stable computer implementations of mathematical theory

Recommended next steps:

- **Gilbert Strang's Linear Algebra** — free on MIT OCW

- **Pattern Recognition and Machine Learning (Bishop)** — probabilistic perspective on ML

- **Deep Learning (Goodfellow et al.)** — the theoretical bible of deep learning

- **Practice**: implement a neural network from scratch using only NumPy

Mathematics is a tool for understanding. You do not need to memorize every formula, but grasping _why_ each concept exists and _how_ it operates within deep learning gives you significantly more powerful practical skills.

References

- Gilbert Strang, _Introduction to Linear Algebra_, 6th Edition

- Ian Goodfellow et al., _Deep Learning_, MIT Press (2016)

- Christopher Bishop, _Pattern Recognition and Machine Learning_, Springer (2006)

- 3Blue1Brown — Essence of Linear Algebra, Essence of Calculus (YouTube)

- fast.ai — Practical Deep Learning for Coders
