Mathematical Foundations for AI/ML: Complete Guide - Linear Algebra, Calculus, Probability
To truly understand machine learning and deep learning, you need to grasp the underlying mathematics. Many developers use frameworks without fully understanding _why_ the math works the way it does. This guide aims to fill that gap.
Rather than listing formulas, we focus on _why_ each concept matters and _how_ it is used in deep learning, accompanied by Python/NumPy code you can run to verify the math yourself.
1. Linear Algebra
Linear algebra is the fundamental language of deep learning. The forward pass of a neural network is a sequence of matrix multiplications, and embeddings are points in a high-dimensional vector space.
1.1 Vectors
**Definition and Geometric Meaning**
A vector is a quantity with both magnitude and direction. Mathematically it is an ordered list of numbers; geometrically it is an arrow pointing to a location in $n$-dimensional space.
$$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$
In deep learning, vectors appear everywhere: input data, hidden-layer activations, embedding representations — all are vectors.
**Vector Addition and Scalar Multiplication**
Addition is performed element-wise:
$$\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{bmatrix}$$
Multiplying by scalar $c$ scales each component:
$$c \cdot \mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{bmatrix}$$
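As a quick check of the two rules above, here is a minimal NumPy sketch. The example values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Arbitrary example vectors (illustrative values, not from the text)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = 2.0

vec_sum = a + b   # element-wise addition
scaled = c * a    # every component multiplied by c

print("a + b =", vec_sum)
print("c * a =", scaled)
```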
**Dot Product**
The dot product measures the "similarity" between two vectors:
$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$
Geometrically, it equals the product of magnitudes and the cosine of the angle $\theta$ between them:
$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$$
A positive dot product indicates vectors pointing in the same general direction; negative means opposite; zero means orthogonal.
**Cosine Similarity**
To measure directional similarity independent of magnitude, normalize by both norms:
$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$
The value ranges from $-1$ to $1$. It is widely used to compare word embeddings and document representations in NLP.
**Vector Norms**
Norms measure the "length" or "size" of a vector:
- $L_1$ norm (Manhattan): $\|\mathbf{v}\|_1 = \sum_i |v_i|$
- $L_2$ norm (Euclidean): $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$
- $L_\infty$ norm (max norm): $\|\mathbf{v}\|_\infty = \max_i |v_i|$
In regularization, $L_1$ encourages sparsity while $L_2$ keeps weights uniformly small.
```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Dot product
dot_product = np.dot(a, b)  # 3*1 + 4*2 = 11
print(f"Dot product: {dot_product}")

# L2 norm
l2_norm_a = np.linalg.norm(a)  # sqrt(9+16) = 5
print(f"L2 norm of a: {l2_norm_a}")

# Cosine similarity
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine_sim:.4f}")

# Various norms
v = np.array([3.0, -4.0, 1.0])
print(f"L1 norm: {np.linalg.norm(v, ord=1)}")         # 8.0
print(f"L2 norm: {np.linalg.norm(v, ord=2):.4f}")     # 5.099
print(f"Linf norm: {np.linalg.norm(v, ord=np.inf)}")  # 4.0
```
1.2 Matrices
**Matrix Notation**
A matrix is a rectangular array of numbers with $m$ rows and $n$ columns ($m \times n$ matrix):
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$
In deep learning, the weight matrix $W$ represents the linear transformation between layers.
**Matrix Multiplication**
If $A$ is $m \times k$ and $B$ is $k \times n$, then $C = AB$ is $m \times n$:
$$C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}$$
The key rule: **columns of the left matrix must equal rows of the right matrix**. Matrix multiplication is generally not commutative: $AB \neq BA$.
**Transpose**
Swap rows and columns: $(A^T)_{ij} = A_{ji}$.
Property: $(AB)^T = B^T A^T$
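The shape rule, non-commutativity, and the transpose property can all be checked numerically. The sketch below uses small hand-picked matrices (hypothetical values, not from the text).

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

AB = A @ B
BA = B @ A
print("AB:\n", AB)
print("BA:\n", BA)
print("AB == BA?", np.array_equal(AB, BA))

# The transpose of a product reverses the order of the factors
print("(AB)^T == B^T A^T?", np.allclose(AB.T, B.T @ A.T))

# Shape rule: (2 x 3) @ (3 x 4) -> (2 x 4)
P = np.ones((2, 3))
Q = np.ones((3, 4))
print("Shape of PQ:", (P @ Q).shape)
```

Here $B$ swaps columns when applied on the right and swaps rows when applied on the left, which is why the two products differ.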
**Inverse Matrix**
For a square matrix $A$, its inverse $A^{-1}$ satisfies $AA^{-1} = A^{-1}A = I$. An inverse exists only when the determinant is non-zero.
**Determinant**
For a $2 \times 2$ matrix:
$$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$$
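The absolute value of the determinant is the factor by which the linear transformation scales area (or volume in higher dimensions); a negative sign means the orientation is flipped. $\det(A) = 0$ means the transformation collapses at least one dimension, so no inverse exists.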
**Rank**
The rank of a matrix is the maximum number of linearly independent rows (or, equivalently, columns). A rank below $\min(m, n)$ means the transformation maps its input into a lower-dimensional subspace. Low-rank factorization is central to model compression and to parameter-efficient fine-tuning methods such as LoRA.
```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
B = np.array([[9, 8, 7],
              [6, 5, 4],
              [3, 2, 1]])

# Matrix multiplication
C = A @ B
print("Matrix product:\n", C)

# Transpose
print("Transpose of A:\n", A.T)

# Determinant
A2 = np.array([[3.0, 1.0],
               [2.0, 4.0]])
print(f"Determinant: {np.linalg.det(A2):.2f}")  # 3*4 - 1*2 = 10

# Inverse
A_inv = np.linalg.inv(A2)
print("Inverse:\n", A_inv)
print("A @ A_inv (should be identity):\n", np.round(A2 @ A_inv))

# Rank
print(f"Rank of A: {np.linalg.matrix_rank(A)}")  # 2 (rows are linearly dependent)
```
1.3 Eigenvalues and Eigenvectors
**Definition**
For a square matrix $A$, a non-zero vector $\mathbf{v}$ and scalar $\lambda$ satisfying:
$$A\mathbf{v} = \lambda \mathbf{v}$$
are called the **eigenvector** and **eigenvalue**, respectively.
The meaning is powerful: when $A$ transforms $\mathbf{v}$, the vector stays on the same line through the origin and is only scaled by $\lambda$ (stretched, shrunk, or flipped when $\lambda < 0$).
**Eigendecomposition**
For a real symmetric matrix, the eigenvalues are real and the eigenvectors can be chosen orthonormal, which gives the decomposition:
$$A = Q \Lambda Q^T$$
where $Q$ contains eigenvectors as columns and $\Lambda$ is the diagonal matrix of eigenvalues.
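A minimal sketch of this decomposition, assuming a small symmetric matrix picked for illustration. It uses `np.linalg.eigh`, NumPy's routine for symmetric (Hermitian) matrices, and checks that $Q$ is orthogonal and that $Q \Lambda Q^T$ rebuilds $A$.

```python
import numpy as np

# Small symmetric matrix chosen for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and orthonormal eigenvectors as columns
eigvals, Q = np.linalg.eigh(A)
Lam = np.diag(eigvals)

reconstructed = Q @ Lam @ Q.T
print("Eigenvalues:", eigvals)
print("Q^T Q is the identity:", np.allclose(Q.T @ Q, np.eye(2)))
print("Q Lam Q^T equals A:", np.allclose(reconstructed, A))
```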
**Relationship to PCA**
For mean-centered data $X$ (one sample per row), the eigenvectors of the covariance matrix $\Sigma = \frac{1}{n} X^T X$ are the principal components. The eigenvector with the largest eigenvalue points in the direction of greatest variance, and that eigenvalue is the variance of the data along it.
**SVD (Singular Value Decomposition)**
While eigendecomposition only applies to square matrices, SVD applies to any matrix:
$$A = U \Sigma V^T$$
- $U$: $m \times m$ orthogonal matrix (left singular vectors)
- $\Sigma$: $m \times n$ diagonal matrix (singular values, non-negative)
- $V^T$: $n \times n$ orthogonal matrix (right singular vectors)
SVD underpins matrix approximation, noise reduction, recommendation systems, and LSA. Its low-rank view of a matrix is also the idea behind the LoRA technique for parameter-efficient fine-tuning.
```python
import numpy as np

# Eigenvalues and eigenvectors
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):\n", eigenvectors)

# Verify: A*v = lambda*v
v0 = eigenvectors[:, 0]
lam0 = eigenvalues[0]
print("A*v:", A @ v0)
print("lambda*v:", lam0 * v0)

# SVD
M = np.array([[1, 2, 3],
              [4, 5, 6]])
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"\nU shape: {U.shape}")
print(f"Singular values: {S}")
print(f"Vt shape: {Vt.shape}")

# Rank-1 approximation built from the largest singular value
rank1_approx = np.outer(S[0] * U[:, 0], Vt[0, :])
print("Original matrix:\n", M)
print("Rank-1 approximation:\n", rank1_approx.round(2))

# PCA example
np.random.seed(42)
data = np.random.randn(100, 2)
data[:, 1] = data[:, 0] * 0.8 + data[:, 1] * 0.2  # make the two features correlated
cov = np.cov(data.T)  # np.cov centers the data and divides by n - 1
eigenvals, eigenvecs = np.linalg.eigh(cov)  # eigh: solver for symmetric matrices
idx = np.argsort(eigenvals)[::-1]  # sort eigenvalues in descending order
print("\nFirst principal component:", eigenvecs[:, idx[0]])
print("Explained variance ratio:", eigenvals[idx[0]] / eigenvals.sum())
```
1.4 Linear Algebra in Deep Learning
**Layers as Linear Transformations**
A fully connected layer is fundamentally a linear transformation:
$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$
where $\mathbf{x} \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^m$.
Without non-linear activation functions, stacking multiple layers reduces to a single linear transformation — which is why activations like ReLU and Sigmoid are essential.
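The collapse of stacked linear layers can be verified directly. The sketch below uses small hand-picked weights (hypothetical values) and shows that two layers without an activation equal one layer with $W = W_2 W_1$ and $\mathbf{b} = W_2 \mathbf{b}_1 + \mathbf{b}_2$, while inserting ReLU breaks that equivalence.

```python
import numpy as np

# Hand-picked weights for illustration (not from the text)
W1 = np.array([[1.0, -1.0],
               [2.0,  0.0],
               [0.0,  1.0]])
b1 = np.array([0.0, 1.0, -1.0])
W2 = np.array([[1.0, 1.0,  1.0],
               [0.0, 1.0, -1.0]])
b2 = np.array([0.5, 0.0])
x = np.array([1.0, 2.0])

# Two stacked linear layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged
print("Same output without activation:", np.allclose(two_layers, one_layer))

# With ReLU in between, the merged layer no longer reproduces the output
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print("Same output with ReLU:", np.allclose(with_relu, one_layer))
```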
**Batch Processing**
For a batch of $B$ samples processed simultaneously:
$$Y = XW^T + \mathbf{b}$$
where $X \in \mathbb{R}^{B \times n}$ holds one sample per row, $Y \in \mathbb{R}^{B \times m}$, and $\mathbf{b}$ is added to every row of $XW^T$. GPUs are optimized for exactly this type of large-scale matrix multiplication.
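To see that the batched formula is just the per-sample formula applied to every row, here is a small sketch. The sizes `B`, `n`, `m` and the random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, m = 5, 3, 2             # batch size, input dim, output dim (illustrative)
X = rng.normal(size=(B, n))   # one sample per row
W = rng.normal(size=(m, n))
b = rng.normal(size=m)

# Batched form: b is broadcast across the B rows
Y = X @ W.T + b

# Per-sample form y = W x + b, stacked row by row
Y_loop = np.stack([W @ x + b for x in X])

print("Y shape:", Y.shape)
print("Batched equals per-sample:", np.allclose(Y, Y_loop))
```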
2. Calculus
Calculus governs _how_ deep learning models learn. Gradient descent — the engine of learning — relies entirely on differentiation.
2.1 Differentiation
**Limit Definition of Derivative**
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
This is the instantaneous rate of change — the slope of the tangent line at point $x$.
**Basic Differentiation Rules**
- Power rule: $(x^n)' = nx^{n-1}$
- Sum rule: $(f+g)' = f' + g'$
- Product rule: $(fg)' = f'g + fg'$
- Quotient rule: $(f/g)' = (f'g - fg') / g^2$
- Chain rule: $[f(g(x))]' = f'(g(x)) \cdot g'(x)$
Derivatives of common deep learning functions:
- $\frac{d}{dx}(e^x) = e^x$
- $\frac{d}{dx}(\ln x) = \frac{1}{x}$
- $\frac{d}{dx}(\sigma(x)) = \sigma(x)(1 - \sigma(x))$ (Sigmoid)
- $\frac{d}{dx}(\tanh(x)) = 1 - \tanh^2(x)$
- $\frac{d}{dx}(\text{ReLU}(x)) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$
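These derivative identities can be sanity-checked against a central-difference approximation. This is a rough sketch: the helper `central_diff` and the sample points are assumptions made for the example, and the points avoid $x = 0$, where ReLU has a kink.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def central_diff(f, x, h=1e-5):
    """Approximate f'(x) element-wise with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = np.array([-2.0, -0.5, 0.5, 2.0])   # avoids 0, where ReLU is not differentiable
x_pos = np.array([0.5, 1.0, 2.0])      # ln x needs positive inputs

sigmoid_ok = np.allclose(central_diff(sigmoid, x), sigmoid(x) * (1 - sigmoid(x)))
tanh_ok = np.allclose(central_diff(np.tanh, x), 1 - np.tanh(x) ** 2)
relu_ok = np.allclose(central_diff(relu, x), (x > 0).astype(float))
log_ok = np.allclose(central_diff(np.log, x_pos), 1 / x_pos)

print("sigmoid:", sigmoid_ok)
print("tanh:   ", tanh_ok)
print("ReLU:   ", relu_ok)
print("ln:     ", log_ok)
```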
**Partial Derivatives**
For a multi-variable function, differentiate with respect to one variable while treating all others as constants:
$$\frac{\partial f}{\partial x_i}$$
For $f(x, y) = x^2 + 3xy + y^2$:
$$\frac{\partial f}{\partial x} = 2x + 3y, \quad \frac{\partial f}{\partial y} = 3x + 2y$$
**Gradient**
The gradient collects all partial derivatives into a vector:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
The gradient points in the direction of **steepest ascent**. To minimize a loss function, move in the opposite direction (gradient descent).
**Jacobian Matrix**
For a vector function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian collects all first-order partial derivatives:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
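A numerical Jacobian makes the index convention concrete: row $i$ belongs to output $f_i$, column $j$ to input $x_j$. The function `f` below is a hypothetical example, and `numerical_jacobian` is a sketch based on central differences rather than a production routine.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian: J[i, j] approximates d f_i / d x_j."""
    x = np.asarray(x, dtype=float)
    m = len(f(x))
    J = np.zeros((m, len(x)))
    for j in range(len(x)):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (f(x + step) - f(x - step)) / (2 * h)
    return J

# Hypothetical example function f: R^2 -> R^3
def f(x):
    return np.array([x[0] * x[1], x[0] ** 2, np.sin(x[1])])

x0 = np.array([1.0, 2.0])
J_num = numerical_jacobian(f, x0)

# Analytic Jacobian of the same function
J_exact = np.array([[x0[1],     x0[0]],
                    [2 * x0[0], 0.0],
                    [0.0,       np.cos(x0[1])]])

print("Jacobian shape:", J_num.shape)
print("Matches analytic Jacobian:", np.allclose(J_num, J_exact, atol=1e-6))
```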
**Hessian Matrix**
The matrix of second-order partial derivatives for a scalar function:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
At a critical point (where $\nabla f = \mathbf{0}$), the eigenvalues of the Hessian indicate curvature: all positive means a local minimum, all negative means a local maximum, and mixed signs indicate a saddle point.
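The eigenvalue test can be sketched in a few lines. The helper name `classify_critical_point` and the three example Hessians are assumptions for illustration; each Hessian is evaluated at the origin, which is a critical point of the function named in the comment.

```python
import numpy as np

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive (some eigenvalue is zero)"

H_bowl = np.array([[2.0, 0.0], [0.0, 4.0]])     # f = x^2 + 2y^2
H_hill = np.array([[-2.0, 0.0], [0.0, -2.0]])   # f = -(x^2 + y^2)
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])  # f = x^2 - y^2

for name, H in [("bowl", H_bowl), ("hill", H_hill), ("saddle", H_saddle)]:
    print(f"{name}: {classify_critical_point(H)}")
```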
```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Compute gradient numerically via central differences"""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += h
        x_minus = x.copy()
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# f(x, y) = x^2 + 2y^2
def f(x):
    return x[0]**2 + 2 * x[1]**2

x0 = np.array([3.0, 4.0])
grad = numerical_gradient(f, x0)
print(f"Numerical gradient: {grad}")  # approximately [6.0, 16.0]
print(f"Analytic gradient: [{2*x0[0]}, {4*x0[1]}]")

# Automatic differentiation with PyTorch
try:
    import torch
    x = torch.tensor([3.0, 4.0], requires_grad=True)
    y = x[0]**2 + 2 * x[1]**2
    y.backward()
    print(f"PyTorch gradient: {x.grad}")
except ImportError:
    print("PyTorch not installed")
```
2.2 Chain Rule and Backpropagation
**Chain Rule**
For composite functions, differentiation follows:
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
Generalized to multiple variables:
$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}$$
**Backpropagation**
Training a neural network requires computing $\frac{\partial L}{\partial w}$ for every weight $w$. Since a network is a composition of many functions, we repeatedly apply the chain rule — from the output layer back to the input layer.
For a 2-layer network:
$$L = \ell\bigl(\text{softmax}(W_2 \cdot \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2), \mathbf{y}\bigr)$$
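Writing $\mathbf{h} = \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1)$ for the hidden activations and $\mathbf{a}_2 = W_2 \mathbf{h} + \mathbf{b}_2$ for the output logits, computing $\frac{\partial L}{\partial W_1}$ requires chaining gradients through each operation: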
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{a}_2} \cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial W_1}$$
```python
import numpy as np

class SimpleNet:
    """Two-layer network: 3 inputs -> 4 hidden units (ReLU) -> 2 outputs."""

    def __init__(self):
        self.W1 = np.random.randn(4, 3) * 0.1
        self.b1 = np.zeros(4)
        self.W2 = np.random.randn(2, 4) * 0.1
        self.b2 = np.zeros(2)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(float)

    def forward(self, x):
        # Cache intermediate values; backward() needs them
        self.x = x
        self.z1 = self.W1 @ x + self.b1
        self.h = self.relu(self.z1)
        self.z2 = self.W2 @ self.h + self.b2
        return self.z2

    def backward(self, dL_dz2):
        # Layer 2 gradients
        dL_dW2 = np.outer(dL_dz2, self.h)
        dL_db2 = dL_dz2
        dL_dh = self.W2.T @ dL_dz2
        # ReLU gradient
        dL_dz1 = dL_dh * self.relu_grad(self.z1)
        # Layer 1 gradients
        dL_dW1 = np.outer(dL_dz1, self.x)
        dL_db1 = dL_dz1
        return dL_dW1, dL_db1, dL_dW2, dL_db2

net = SimpleNet()
x = np.array([1.0, 2.0, 3.0])
output = net.forward(x)
dL_dout = np.array([1.0, -1.0])  # example upstream gradient dL/dz2
grads = net.backward(dL_dout)
print(f"W1 gradient shape: {grads[0].shape}")
print(f"W2 gradient shape: {grads[2].shape}")
```
2.3 Optimization Theory
**Optimality Conditions**
- First-order (necessary): at an extremum, $\nabla f = \mathbf{0}$
- Second-order (sufficient): the Hessian's eigenvalues determine min/max
**Convex Functions**
A function is convex if:
$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \lambda \in [0,1]$$
For convex functions, any local minimum is a global minimum. Deep learning loss functions are generally non-convex, so global optimality cannot be guaranteed.
**Gradient Descent**
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$
where $\alpha$ is the learning rate. Too large and the updates diverge; too small and convergence is slow.
Common variants:
- **SGD (Stochastic Gradient Descent)**: uses a random mini-batch each step
- **Momentum**: accumulates past gradient directions for acceleration
- **Adam**: adaptive learning rates per parameter
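The text does not spell out the momentum update, so the sketch below assumes one common formulation: keep a velocity $v \leftarrow \beta v + g$ and step with $\theta \leftarrow \theta - \alpha v$. Other variants scale the gradient term differently. The function name `momentum_descent`, the hyperparameters, and the quadratic test function are all illustrative choices.

```python
import numpy as np

def momentum_descent(grad_f, init_x, lr=0.05, beta=0.9, steps=200):
    """Gradient descent with momentum (one common form, assumed here):
    v <- beta * v + grad,  x <- x - lr * v
    """
    x = np.asarray(init_x, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad_f(x)   # accumulate past gradient directions
        x = x - lr * v
    return x

# Illustrative quadratic f(x) = x0^2 + 4*x1^2 with its minimum at the origin
def grad_f(x):
    return np.array([2 * x[0], 8 * x[1]])

x_final = momentum_descent(grad_f, [3.0, 3.0])
print("Momentum result:", x_final)
```

On this quadratic the iterates oscillate while shrinking toward the origin, which is the typical behavior of momentum with a large $\beta$.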
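Adam update rule, where $g_t$ is the gradient at step $t$ and $\hat{m}_t$, $\hat{v}_t$ are the bias-corrected moment estimates:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$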
```python
import numpy as np

def gradient_descent(grad_f, init_x, lr=0.1, steps=100):
    x = init_x.copy()
    history = [x.copy()]
    for _ in range(steps):
        grad = grad_f(x)
        x -= lr * grad
        history.append(x.copy())
    return x, np.array(history)

# f(x, y) = x^2 + 4y^2, minimized at the origin
def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2*x[0], 8*x[1]])

init_x = np.array([3.0, 3.0])
final_x, history = gradient_descent(grad_f, init_x, lr=0.1, steps=50)
print(f"Start: {init_x}, Converged to: {final_x.round(6)}")

def adam_optimizer(grad_f, init_x, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    x = init_x.copy()
    m = np.zeros_like(x)  # first-moment (mean) estimate
    v = np.zeros_like(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Each Adam step moves a coordinate by roughly alpha at most, so 100 steps at
# alpha=0.01 cover only about 1 unit; raise alpha or steps to reach the minimum.
final_adam = adam_optimizer(grad_f, init_x)
print(f"Adam after 100 steps: {final_adam.round(6)}")
```
3. Probability and Statistics
Probability and statistics provide the language for handling uncertainty in deep learning. Loss function design, regularization, and model evaluation all rely on probabilistic thinking.
3.1 Probability Basics
**Kolmogorov Axioms**
1. $P(A) \geq 0$ (non-negativity)
2. $P(\Omega) = 1$ (total probability equals 1)
3. For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$ (additivity)
**Conditional Probability**
The probability of $A$ given that $B$ has occurred:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$
**Bayes' Theorem**
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
Bayes' theorem formalizes "updating beliefs in the light of evidence":
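- $P(A)$: Prior, the belief before seeing evidence
- $P(B|A)$: Likelihood, the probability of the evidence under hypothesis $A$
- $P(B)$: Evidence, the overall probability of the observation, which normalizes the result
- $P(A|B)$: Posterior, the updated belief after seeing evidence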
This underlies Bayesian classifiers, Bayesian neural networks, and probabilistic model evaluation.
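A worked example helps show how strongly the prior shapes the posterior. The numbers below describe a hypothetical diagnostic test and are assumptions made for illustration only.

```python
# Hypothetical numbers for a diagnostic test (illustrative, not from the text)
p_disease = 0.01             # prior P(A)
p_pos_given_disease = 0.95   # likelihood P(B|A)
p_pos_given_healthy = 0.05   # false-positive rate P(B|not A)

# Evidence P(B) via the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(A|B) via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(positive) = {p_pos:.4f}")
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")
```

Even with a 95% sensitive test, the posterior is only about 16% because the prior is so low: most positive results come from the much larger healthy group.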
3.2 Probability Distributions
**Discrete Distributions**
_Bernoulli Distribution_
Single trial with outcome 0 or 1:
$$P(X=k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}$$
$E[X] = p$, $Var(X) = p(1-p)$
_Binomial Distribution_
Number of successes in $n$ Bernoulli trials:
$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$
_Poisson Distribution_
Event count in unit time or space:
$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
$E[X] = Var(X) = \lambda$
**Continuous Distributions**
_Uniform Distribution_
Equal probability density over $[a, b]$:
$$f(x) = \frac{1}{b-a}, \quad a \leq x \leq b$$
_Normal (Gaussian) Distribution_
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
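Characterized by mean $\mu$ and standard deviation $\sigma$. By the Central Limit Theorem, sums and averages of many independent random variables with finite variance are approximately normal, which is why this distribution describes so many natural phenomena well. It is used extensively in weight initialization and noise modeling.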
_Multivariate Normal Distribution_
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
where $\boldsymbol{\mu}$ is the mean vector and $\Sigma$ is the covariance matrix. Central to VAE latent space modeling.
```python
import numpy as np
from scipy import stats

# Normal distribution
mu, sigma = 0, 1
normal = stats.norm(loc=mu, scale=sigma)
print(f"P(-1 < X < 1) = {normal.cdf(1) - normal.cdf(-1):.4f}")  # 68.27%
print(f"P(-2 < X < 2) = {normal.cdf(2) - normal.cdf(-2):.4f}")  # 95.45%
print(f"P(-3 < X < 3) = {normal.cdf(3) - normal.cdf(-3):.4f}")  # 99.73%

# Multivariate normal sampling
np.random.seed(42)  # fixed seed so the sample statistics are reproducible
mean = np.array([0, 0])
cov = np.array([[1, 0.7],
                [0.7, 1]])
samples = np.random.multivariate_normal(mean, cov, size=1000)
print(f"\nSample shape: {samples.shape}")
print(f"Sample mean: {samples.mean(axis=0).round(3)}")
print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

# Binomial distribution
n, p = 10, 0.5
print(f"\nBinomial B(10, 0.5): P(X=5) = {stats.binom.pmf(5, n, p):.4f}")
```
3.3 Expectation and Variance
**Expected Value**
$$E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}$$
$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx \quad \text{(continuous)}$$
Linearity of expectation: $E[aX + b] = aE[X] + b$
**Variance**
$$Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$
Standard deviation: $\sigma = \sqrt{Var(X)}$
**Covariance**
Measures the linear relationship between two random variables:
$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$
**Correlation**
$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$
Range $[-1, 1]$; measures strength of linear relationship independent of scale.
**Covariance Matrix**
For an $n$-dimensional random vector:
$$\Sigma_{ij} = Cov(X_i, X_j)$$
The covariance matrix encodes the variance structure of data and is the key input to PCA.
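The definitions in this section map directly onto a few NumPy lines. The sketch below uses synthetic data (an assumed linear relationship plus noise) and compares the hand-computed values with `np.cov` and `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)  # y depends linearly on x plus noise

# Direct use of the definitions (population form, dividing by n)
var_x = np.mean((x - x.mean()) ** 2)
var_y = np.mean((y - y.mean()) ** 2)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / np.sqrt(var_x * var_y)

# Var(X) = E[X^2] - (E[X])^2 gives the same number
var_x_alt = np.mean(x ** 2) - x.mean() ** 2

# Covariance matrix from NumPy; bias=True also divides by n
Sigma = np.cov(np.stack([x, y]), bias=True)

print(f"Cov(X, Y) = {cov_xy:.4f}")
print(f"Correlation = {rho:.4f}")
print("Covariance matrix:\n", Sigma.round(4))
```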
3.4 Maximum Likelihood Estimation (MLE)
**Likelihood Function**
The probability of observing data $\mathcal{D} = \{x_1, \ldots, x_n\}$ given parameters $\theta$, assuming the samples are independent and identically distributed (i.i.d.):
$$L(\theta; \mathcal{D}) = P(\mathcal{D}|\theta) = \prod_{i=1}^n P(x_i|\theta)$$
MLE finds the $\theta$ that maximizes this likelihood:
$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; \mathcal{D})$$
**Log-Likelihood**
Taking the log converts the product to a sum, simplifying computation:
$$\log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)$$
Since log is monotonically increasing, the maximizer of log-likelihood equals the maximizer of likelihood.
**MLE for the Normal Distribution**
Assuming data follows $\mathcal{N}(\mu, \sigma^2)$:
$$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$
Setting the derivative with respect to $\mu$ to zero: $\hat{\mu}_{MLE} = \bar{x}$ (sample mean)
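For $\sigma^2$: $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$. Note the $\frac{1}{n}$ factor: this estimator is biased, unlike the sample variance that divides by $n-1$.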
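The closed-form estimates can be checked against the log-likelihood itself. This is a sketch on synthetic data; the "true" parameters and the sample size are arbitrary assumptions. Nudging either parameter away from the MLE should never raise the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 1.5   # illustrative ground truth
x = rng.normal(true_mu, true_sigma, size=5000)

# Closed-form MLE
mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)   # divides by n, same as np.var(x)

def log_likelihood(mu, sigma2):
    """Gaussian log-likelihood of the sample x."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

best = log_likelihood(mu_mle, sigma2_mle)
print(f"mu_MLE = {mu_mle:.4f}, sigma^2_MLE = {sigma2_mle:.4f}")
print("Shifting mu lowers the log-likelihood:", best >= log_likelihood(mu_mle + 0.1, sigma2_mle))
print("Scaling sigma^2 lowers the log-likelihood:", best >= log_likelihood(mu_mle, sigma2_mle * 1.1))
```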
**Cross-Entropy and MLE**
Minimizing cross-entropy loss in classification is equivalent to MLE:
$$\mathcal{L}_{CE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^C y_{ic} \log \hat{p}_{ic}$$
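With one-hot labels $y_{ic}$, $-n\,\mathcal{L}_{CE}$ is exactly the log-likelihood of the observed labels under the model's predicted class probabilities, so minimizing the loss maximizes that likelihood.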
```python
import numpy as np
from scipy.optimize import minimize_scalar

# MLE for Bernoulli distribution
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n = len(data)
k = data.sum()
p_mle = k / n  # closed-form MLE: fraction of successes
print(f"Analytic MLE: p = {p_mle:.2f}")

def neg_log_likelihood(p):
    if p <= 0 or p >= 1:
        return np.inf
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method='bounded')
print(f"Numerical MLE: p = {result.x:.4f}")

# Cross-entropy loss (binary form)
def cross_entropy(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
print(f"\nCross-Entropy loss: {cross_entropy(y_true, y_pred):.4f}")
```
3.5 Information Theory
**Entropy**
Measures the uncertainty of a probability distribution $P$:
$$H(X) = -\sum_{x} P(x) \log P(x)$$
Maximum for uniform distributions, zero when a single outcome is certain.
**KL Divergence**
A measure of "distance" between distributions $P$ and $Q$ (asymmetric):
$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
$D_{KL}(P \| Q) \geq 0$, with equality iff $P = Q$. Used in VAE loss functions, PPO policy optimization, and knowledge distillation.
**Cross-Entropy**
$$H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)$$
Since $H(P)$ does not depend on the model distribution $Q$, minimizing cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence between the true and model distributions.
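The decomposition $H(P, Q) = H(P) + D_{KL}(P \| Q)$ and the asymmetry of KL divergence are easy to confirm numerically. The two distributions below are illustrative values, and natural logarithms are used throughout.

```python
import numpy as np

# Illustrative distributions over four outcomes
P = np.array([0.4, 0.3, 0.2, 0.1])
Q = np.array([0.25, 0.25, 0.25, 0.25])

H_P = -np.sum(P * np.log(P))                # entropy of P (nats)
cross_entropy_PQ = -np.sum(P * np.log(Q))   # H(P, Q)
kl_PQ = np.sum(P * np.log(P / Q))           # D_KL(P || Q)
kl_QP = np.sum(Q * np.log(Q / P))           # D_KL(Q || P)

print(f"H(P)             = {H_P:.4f}")
print(f"H(P, Q)          = {cross_entropy_PQ:.4f}")
print(f"H(P) + KL(P||Q)  = {H_P + kl_PQ:.4f}")
print(f"KL(P||Q) = {kl_PQ:.4f}, KL(Q||P) = {kl_QP:.4f}")
```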
**Mutual Information**
Information shared between two random variables:
$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$
Theoretical foundation for feature selection and representation learning.
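For discrete variables, mutual information can be computed from a joint probability table using the equivalent form $I(X; Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}$. The helper `mutual_information` and the two joint tables below are assumptions made for illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table P(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y), shape (1, ny)
    mask = joint > 0                        # skip zero cells (0 * log 0 is taken as 0)
    return np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask]))

# Independent variables share no information
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
# Perfectly dependent variables: knowing X determines Y
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(f"Independent: {mi_independent:.4f} bits")
print(f"Dependent:   {mi_dependent:.4f} bits")
```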
```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (log base 2)."""
    p = np.array(p)
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q, eps=1e-15):
    """KL divergence in nats (natural log); eps guards against log(0)."""
    p = np.array(p) + eps
    q = np.array(q) + eps
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log(p / q))

# Uniform distribution (maximum entropy)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform distribution entropy: {entropy(uniform):.4f} bits")  # 2.0

# Concentrated distribution (low entropy)
concentrated = [0.97, 0.01, 0.01, 0.01]
print(f"Concentrated distribution entropy: {entropy(concentrated):.4f} bits")

# KL Divergence
true_dist = [0.4, 0.3, 0.2, 0.1]
model_dist = [0.35, 0.35, 0.15, 0.15]
kl = kl_divergence(true_dist, model_dist)
print(f"\nKL Divergence D_KL(P||Q): {kl:.4f} nats")
```
4. Numerical Methods
Understanding the gap between mathematical theory and actual computer arithmetic is important for reliable deep learning implementations.
4.1 Floating-Point Representation
**FP32 (Single Precision)**
- 32 bits: 1 sign, 8 exponent, 23 mantissa
- Precision: ~7 significant decimal digits
- Range: approximately $\pm 3.4 \times 10^{38}$
**FP16 (Half Precision)**
- 16 bits: 1 sign, 5 exponent, 10 mantissa
- Saves GPU memory, faster computation (Mixed Precision Training)
- Risk of overflow/underflow with large or small values
**BF16 (Brain Float 16)**
- 16 bits: 1 sign, 8 exponent, 7 mantissa
- Same exponent range as FP32 (lower overflow risk)
- Introduced by Google for TPUs, now widely used for AI training
```python
import numpy as np

# Floating-point precision issues
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 (float32) = {a + b}")  # stored values only approximate 0.1 and 0.2
print(f"0.1 + 0.2 == 0.3 (float64): {0.1 + 0.2 == 0.3}")  # False

# FP16 overflow (NumPy may also emit an overflow RuntimeWarning)
print(f"FP16 large value: {np.float16(65000)}")
print(f"FP16 overflow: {np.float16(70000)}")  # inf

# Compare dtype ranges
for dtype in [np.float16, np.float32, np.float64]:
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: max={info.max:.3e}, min_positive={info.tiny:.3e}")
```
4.2 Numerical Stability
**The Instability of Naive Softmax**
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Large inputs cause $e^{x_i}$ to overflow (inf). The fix: subtract the maximum value before exponentiation.
$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$
This is mathematically equivalent but numerically stable.
```python
import numpy as np

def naive_softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

def stable_softmax(x):
    x_shifted = x - np.max(x)  # largest exponent becomes 0, so exp never overflows
    return np.exp(x_shifted) / np.sum(np.exp(x_shifted))

# Normal input
x_normal = np.array([1.0, 2.0, 3.0])
print("Normal input:")
print(f"  Naive: {naive_softmax(x_normal)}")
print(f"  Stable: {stable_softmax(x_normal)}")

# Large input (the naive version triggers overflow RuntimeWarnings)
x_large = np.array([1000.0, 2000.0, 3000.0])
print("\nLarge input:")
naive_result = naive_softmax(x_large)
print(f"  Naive: {naive_result}")  # [nan, nan, nan] due to overflow
print(f"  Stable: {stable_softmax(x_large)}")  # [0, 0, 1]

# Log-Softmax (used with NLLLoss)
def log_softmax(x):
    x_shifted = x - np.max(x)
    return x_shifted - np.log(np.sum(np.exp(x_shifted)))

logits = np.array([2.0, 1.0, 0.1])
print(f"\nLog-Softmax: {log_softmax(logits)}")
```
5. Mathematical Analysis of Linear Regression
Linear regression is the simplest building block of deep learning. A thorough mathematical understanding creates a foundation for understanding more complex models.
5.1 Ordinary Least Squares
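For data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, we fit a linear model $\hat{y} = \mathbf{w}^T \mathbf{x} + b$. Appending a constant $1$ to each $\mathbf{x}_i$ absorbs $b$ into $\mathbf{w}$, so the derivation below works with $\mathbf{w}$ alone.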
**Loss function (MSE)**:
$$L(\mathbf{w}) = \frac{1}{n} \|y - X\mathbf{w}\|^2$$
**Normal Equation**
Differentiating with respect to $\mathbf{w}$ and setting to zero:
$$X^T X \mathbf{w} = X^T y$$
$$\hat{\mathbf{w}} = (X^T X)^{-1} X^T y$$
This closed-form solution requires $X^T X$ to be invertible (features must be linearly independent).
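The normal equation can be solved directly and compared with a library least-squares routine. The sketch below uses synthetic data with an assumed ground-truth weight vector. It calls `np.linalg.solve` on $X^T X \mathbf{w} = X^T y$ instead of forming the inverse explicitly, which is usually the safer numerical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])        # illustrative ground truth
y = X @ true_w + 0.1 * rng.normal(size=n)  # linear signal plus small noise

# Solve the normal equation X^T X w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# At the optimum the gradient X^T (X w - y) vanishes
grad = X.T @ (X @ w_normal - y)

print("Normal equation:", w_normal.round(3))
print("lstsq:          ", w_lstsq.round(3))
print("Gradient is zero at the solution:", np.allclose(grad, 0, atol=1e-8))
```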
5.2 Ridge and Lasso Regression
**Ridge Regression (L2 regularization)**
$$L_{ridge}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2$$
Normal equation: $\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T y$
Adding $\lambda I$ with $\lambda > 0$ guarantees invertibility, because $X^T X + \lambda I$ is positive definite, and shrinks weights toward zero, which helps prevent overfitting.
**Lasso Regression (L1 regularization)**
$$L_{lasso}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1$$
L1 regularization induces sparse solutions — some weights become exactly zero — providing automatic feature selection.
```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.0, 0.0])
y = X @ true_w + np.random.randn(n) * 0.5

# OLS via least squares (numerically safer than forming the normal equation)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS coefficients:", w_ols.round(2))

# Ridge; fit_intercept=False so the result matches the normal equation below
ridge = Ridge(alpha=1.0, fit_intercept=False)
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_.round(2))

# Lasso (sparse solution)
# scikit-learn minimizes (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1,
# so alpha is scaled differently from lambda in the formula above
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(2))
print("Number of zero coefficients:", (np.abs(lasso.coef_) < 0.01).sum())

# Ridge via normal equation
def ridge_normal_equation(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ridge = ridge_normal_equation(X, y, lam=1.0)
print("\nRidge (normal equation):", w_ridge.round(2))
```
6. Integration: The Mathematics of Deep Learning
6.1 A Mathematical View of Neural Networks
A fully connected neural network with $L$ layers:
$$\mathbf{h}^{(0)} = \mathbf{x}$$
$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$$
$$\hat{y} = \mathbf{h}^{(L)}$$
**Forward pass**: linear algebra (matrix multiplication) + activation functions
**Loss computation**: information theory (cross-entropy) or statistics (MSE)
**Backward pass**: calculus (chain rule) to compute gradients
**Parameter update**: optimization theory (gradient descent, Adam)
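The layer equations above translate into a short loop. This is a minimal sketch under assumptions: the layer sizes are hypothetical, the weights are random, and ReLU is applied on every layer including the last, mirroring the single $\sigma$ in the equations. Real networks often use a different output activation.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, weights, biases, activation=relu):
    """Apply z = W h + b followed by the activation, layer by layer."""
    h = x                                  # h^(0) = x
    for W, b in zip(weights, biases):
        z = W @ h + b                      # z^(l) = W^(l) h^(l-1) + b^(l)
        h = activation(z)                  # h^(l) = sigma(z^(l))
    return h                               # y_hat = h^(L)

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 4, 2]   # hypothetical: 3 inputs, two hidden layers, 2 outputs
weights = [rng.normal(size=(m, n)) * 0.5
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]

y_hat = forward(rng.normal(size=3), weights, biases)
print("Output shape:", y_hat.shape)
print("Output:", y_hat)
```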
6.2 The Attention Mechanism and Matrix Operations
The Transformer's scaled dot-product attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
- $Q, K, V$: Query, Key, Value matrices (linear algebra)
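- $\frac{1}{\sqrt{d_k}}$ scaling: keeps the variance of the dot products from growing with $d_k$, so the softmax does not saturate and gradients stay usable (numerical analysis)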
- Softmax: converts scores to probability distribution (probability theory)
- Dot product $QK^T$: similarity measurement (vector dot product)
```python
import numpy as np

def stable_softmax_2d(x):
    """Row-wise softmax with the max-subtraction trick."""
    x_max = x.max(axis=-1, keepdims=True)
    x_shifted = x - x_max
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Positions where mask is False get a large negative score (weight ~ 0)
        scores = np.where(mask, scores, -1e9)
    weights = stable_softmax_2d(scores)
    output = weights @ V
    return output, weights

# 3 tokens, d_k=4
seq_len, d_k = 3, 4
np.random.seed(42)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention output shape: {output.shape}")
print(f"Attention weights (rows sum to 1):\n{weights.round(3)}")
print(f"Row sums: {weights.sum(axis=-1)}")
```
Conclusion: Next Steps
This guide covered the core mathematical foundations of AI/ML:
1. **Linear Algebra**: the language for representing and transforming data and model parameters
2. **Calculus**: the tool explaining _how_ models learn
3. **Probability & Statistics**: handles uncertainty and provides the theoretical basis for loss functions
4. **Numerical Methods**: knowledge for stable computer implementations of mathematical theory
Recommended next steps:
- **Gilbert Strang's Linear Algebra** — free on MIT OCW
- **Pattern Recognition and Machine Learning (Bishop)** — probabilistic perspective on ML
- **Deep Learning (Goodfellow et al.)** — the theoretical bible of deep learning
- **Practice**: implement a neural network from scratch using only NumPy
Mathematics is a tool for understanding. You do not need to memorize every formula, but grasping _why_ each concept exists and _how_ it operates within deep learning gives you significantly more powerful practical skills.
References
- Gilbert Strang, _Introduction to Linear Algebra_, 6th Edition
- Ian Goodfellow et al., _Deep Learning_, MIT Press (2016)
- Christopher Bishop, _Pattern Recognition and Machine Learning_, Springer (2006)
- 3Blue1Brown — Essence of Linear Algebra, Essence of Calculus (YouTube)
- fast.ai — Practical Deep Learning for Coders