Skip to content

필사 모드: Mathematical Foundations for AI/ML: Complete Guide - Linear Algebra, Calculus, Probability

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Mathematical Foundations for AI/ML: Complete Guide

To truly understand machine learning and deep learning, you need to grasp the underlying mathematics. Many developers use frameworks without fully understanding _why_ the math works the way it does. This guide aims to fill that gap.

Rather than listing formulas, we focus on _why_ each concept matters and _how_ it is used in deep learning, accompanied by Python/NumPy code you can run to verify the math yourself.

1. Linear Algebra

Linear algebra is the fundamental language of deep learning. The forward pass of a neural network is a sequence of matrix multiplications, and embeddings are points in a high-dimensional vector space.

1.1 Vectors

**Definition and Geometric Meaning**

A vector is a quantity with both magnitude and direction. Mathematically it is an ordered list of numbers; geometrically it is an arrow pointing to a location in $n$-dimensional space.

$$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

In deep learning, vectors appear everywhere: input data, hidden-layer activations, embedding representations — all are vectors.

**Vector Addition and Scalar Multiplication**

Addition is performed element-wise:

$$\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{bmatrix}$$

Multiplying by scalar $c$ scales each component:

$$c \cdot \mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{bmatrix}$$

**Dot Product**

The dot product measures the "similarity" between two vectors:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

Geometrically, it equals the product of magnitudes and the cosine of the angle $\theta$ between them:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$$

A positive dot product indicates vectors pointing in the same general direction; negative means opposite; zero means orthogonal.

**Cosine Similarity**

To measure directional similarity independent of magnitude, normalize by both norms:

$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

The value ranges from $-1$ to $1$. Widely used to compare word embeddings and document representations in NLP.

**Vector Norms**

Norms measure the "length" or "size" of a vector:

- $L_1$ norm (Manhattan): $\|\mathbf{v}\|_1 = \sum_i |v_i|$

- $L_2$ norm (Euclidean): $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$

- $L_\infty$ norm (max norm): $\|\mathbf{v}\|_\infty = \max_i |v_i|$

In regularization, $L_1$ encourages sparsity while $L_2$ keeps weights uniformly small.

a = np.array([3.0, 4.0])

b = np.array([1.0, 2.0])

Dot product

dot_product = np.dot(a, b) # 3*1 + 4*2 = 11

print(f"Dot product: {dot_product}")

L2 norm

l2_norm_a = np.linalg.norm(a) # sqrt(9+16) = 5

print(f"L2 norm of a: {l2_norm_a}")

Cosine similarity

cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Cosine similarity: {cosine_sim:.4f}")

Various norms

v = np.array([3.0, -4.0, 1.0])

print(f"L1 norm: {np.linalg.norm(v, ord=1)}") # 8.0

print(f"L2 norm: {np.linalg.norm(v, ord=2):.4f}") # 5.099

print(f"Linf norm: {np.linalg.norm(v, ord=np.inf)}") # 4.0

1.2 Matrices

**Matrix Notation**

A matrix is a rectangular array of numbers with $m$ rows and $n$ columns ($m \times n$ matrix):

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

In deep learning, the weight matrix $W$ represents the linear transformation between layers.

**Matrix Multiplication**

If $A$ is $m \times k$ and $B$ is $k \times n$, then $C = AB$ is $m \times n$:

$$C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}$$

The key rule: **columns of the left matrix must equal rows of the right matrix**. Matrix multiplication is generally not commutative: $AB \neq BA$.

**Transpose**

Swap rows and columns: $(A^T)_{ij} = A_{ji}$.

Property: $(AB)^T = B^T A^T$

**Inverse Matrix**

For a square matrix $A$, its inverse $A^{-1}$ satisfies $AA^{-1} = A^{-1}A = I$. An inverse exists only when the determinant is non-zero.

**Determinant**

For a $2 \times 2$ matrix:

$$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$$

The determinant tells you how much the linear transformation stretches or compresses space. $\det(A) = 0$ means the transformation collapses dimensions — no inverse exists.

**Rank**

The rank of a matrix is the maximum number of linearly independent rows (or columns). A lower rank means the transformation produces a lower-dimensional output space. Low-rank approximation is central to model compression techniques like LoRA.

A = np.array([[1, 2, 3],

[4, 5, 6],

[7, 8, 9]])

B = np.array([[9, 8, 7],

[6, 5, 4],

[3, 2, 1]])

Matrix multiplication

C = np.dot(A, B)

print("Matrix product:\n", C)

Transpose

print("Transpose of A:\n", A.T)

Determinant

A2 = np.array([[3.0, 1.0],

[2.0, 4.0]])

print(f"Determinant: {np.linalg.det(A2):.2f}") # 3*4 - 1*2 = 10

Inverse

A_inv = np.linalg.inv(A2)

print("Inverse:\n", A_inv)

print("A @ A_inv (should be identity):\n", np.round(A2 @ A_inv))

Rank

print(f"Rank of A: {np.linalg.matrix_rank(A)}") # 2 (rows are linearly dependent)

1.3 Eigenvalues and Eigenvectors

**Definition**

For a square matrix $A$, a non-zero vector $\mathbf{v}$ and scalar $\lambda$ satisfying:

$$A\mathbf{v} = \lambda \mathbf{v}$$

are called the **eigenvector** and **eigenvalue**, respectively.

The meaning is powerful: when $A$ transforms $\mathbf{v}$, the direction stays the same — only the magnitude changes by a factor of $\lambda$.

**Eigendecomposition**

For a symmetric matrix, we can decompose using orthogonal eigenvectors:

$$A = Q \Lambda Q^T$$

where $Q$ contains eigenvectors as columns and $\Lambda$ is the diagonal matrix of eigenvalues.

**Relationship to PCA**

The eigenvectors of the covariance matrix $\Sigma = \frac{1}{n} X^T X$ are the principal components. The eigenvector corresponding to the largest eigenvalue points in the direction of greatest variance — the most informative feature direction.

**SVD (Singular Value Decomposition)**

While eigendecomposition only applies to square matrices, SVD applies to any matrix:

$$A = U \Sigma V^T$$

- $U$: $m \times m$ orthogonal matrix (left singular vectors)

- $\Sigma$: $m \times n$ diagonal matrix (singular values, non-negative)

- $V^T$: $n \times n$ orthogonal matrix (right singular vectors)

SVD underpins matrix approximation, noise reduction, recommendation systems, LSA, and the LoRA technique for parameter-efficient fine-tuning.

Eigenvalues and eigenvectors

A = np.array([[4.0, 2.0],

[1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

print("Eigenvalues:", eigenvalues)

print("Eigenvectors (columns):\n", eigenvectors)

Verify: A*v = lambda*v

v0 = eigenvectors[:, 0]

lam0 = eigenvalues[0]

print("A*v:", A @ v0)

print("lambda*v:", lam0 * v0)

SVD

M = np.array([[1, 2, 3],

[4, 5, 6]])

U, S, Vt = np.linalg.svd(M, full_matrices=False)

print(f"\nU shape: {U.shape}")

print(f"Singular values: {S}")

print(f"Vt shape: {Vt.shape}")

Rank-1 approximation

rank1_approx = np.outer(S[0] * U[:, 0], Vt[0, :])

print("Original matrix:\n", M)

print("Rank-1 approximation:\n", rank1_approx.round(2))

PCA example

np.random.seed(42)

data = np.random.randn(100, 2)

data[:, 1] = data[:, 0] * 0.8 + data[:, 1] * 0.2

cov = np.cov(data.T)

eigenvals, eigenvecs = np.linalg.eig(cov)

idx = np.argsort(eigenvals)[::-1]

print("\nFirst principal component:", eigenvecs[:, idx[0]])

print("Explained variance ratio:", eigenvals[idx[0]] / eigenvals.sum())

1.4 Linear Algebra in Deep Learning

**Layers as Linear Transformations**

A fully connected layer is fundamentally a linear transformation:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where $\mathbf{x} \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^m$.

Without non-linear activation functions, stacking multiple layers reduces to a single linear transformation — which is why activations like ReLU and Sigmoid are essential.

**Batch Processing**

For a batch of $B$ samples processed simultaneously:

$$Y = XW^T + \mathbf{b}$$

where $X \in \mathbb{R}^{B \times n}$ and $Y \in \mathbb{R}^{B \times m}$. GPUs are optimized for exactly this type of large-scale matrix multiplication.

2. Calculus

Calculus governs _how_ deep learning models learn. Gradient descent — the engine of learning — relies entirely on differentiation.

2.1 Differentiation

**Limit Definition of Derivative**

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

This is the instantaneous rate of change — the slope of the tangent line at point $x$.

**Basic Differentiation Rules**

- Power rule: $(x^n)' = nx^{n-1}$

- Sum rule: $(f+g)' = f' + g'$

- Product rule: $(fg)' = f'g + fg'$

- Quotient rule: $(f/g)' = (f'g - fg') / g^2$

- Chain rule: $[f(g(x))]' = f'(g(x)) \cdot g'(x)$

Derivatives of common deep learning functions:

- $\frac{d}{dx}(e^x) = e^x$

- $\frac{d}{dx}(\ln x) = \frac{1}{x}$

- $\frac{d}{dx}(\sigma(x)) = \sigma(x)(1 - \sigma(x))$ (Sigmoid)

- $\frac{d}{dx}(\tanh(x)) = 1 - \tanh^2(x)$

- $\frac{d}{dx}(\text{ReLU}(x)) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$

**Partial Derivatives**

For a multi-variable function, differentiate with respect to one variable while treating all others as constants:

$$\frac{\partial f}{\partial x_i}$$

For $f(x, y) = x^2 + 3xy + y^2$:

$$\frac{\partial f}{\partial x} = 2x + 3y, \quad \frac{\partial f}{\partial y} = 3x + 2y$$

**Gradient**

The gradient collects all partial derivatives into a vector:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient points in the direction of **steepest ascent**. To minimize a loss function, move in the opposite direction (gradient descent).

**Jacobian Matrix**

For a vector function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian collects all first-order partial derivatives:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

**Hessian Matrix**

The matrix of second-order partial derivatives for a scalar function:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Eigenvalues of the Hessian indicate curvature: all positive means local minimum, all negative means local maximum, mixed signs indicate a saddle point.

def numerical_gradient(f, x, h=1e-5):

"""Compute gradient numerically via central differences"""

grad = np.zeros_like(x)

for i in range(len(x)):

x_plus = x.copy(); x_plus[i] += h

x_minus = x.copy(); x_minus[i] -= h

grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)

return grad

f(x, y) = x^2 + 2y^2

def f(x):

return x[0]**2 + 2 * x[1]**2

x0 = np.array([3.0, 4.0])

grad = numerical_gradient(f, x0)

print(f"Numerical gradient: {grad}") # [6.0, 16.0]

print(f"Analytic gradient: [{2*x0[0]}, {4*x0[1]}]")

Automatic differentiation with PyTorch

try:

x = torch.tensor([3.0, 4.0], requires_grad=True)

y = x[0]**2 + 2 * x[1]**2

y.backward()

print(f"PyTorch gradient: {x.grad}")

except ImportError:

print("PyTorch not installed")

2.2 Chain Rule and Backpropagation

**Chain Rule**

For composite functions, differentiation follows:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

Generalized to multiple variables:

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}$$

**Backpropagation**

Training a neural network requires computing $\frac{\partial L}{\partial w}$ for every weight $w$. Since a network is a composition of many functions, we repeatedly apply the chain rule — from the output layer back to the input layer.

For a 2-layer network:

$$L = \ell\bigl(\text{softmax}(W_2 \cdot \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2), \mathbf{y}\bigr)$$

Computing $\frac{\partial L}{\partial W_1}$ requires chaining gradients through each operation:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{a}_2} \cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial W_1}$$

class SimpleNet:

def __init__(self):

self.W1 = np.random.randn(4, 3) * 0.1

self.b1 = np.zeros(4)

self.W2 = np.random.randn(2, 4) * 0.1

self.b2 = np.zeros(2)

def relu(self, x):

return np.maximum(0, x)

def relu_grad(self, x):

return (x > 0).astype(float)

def forward(self, x):

self.x = x

self.z1 = self.W1 @ x + self.b1

self.h = self.relu(self.z1)

self.z2 = self.W2 @ self.h + self.b2

return self.z2

def backward(self, dL_dz2):

Layer 2 gradients

dL_dW2 = np.outer(dL_dz2, self.h)

dL_db2 = dL_dz2

dL_dh = self.W2.T @ dL_dz2

ReLU gradient

dL_dz1 = dL_dh * self.relu_grad(self.z1)

Layer 1 gradients

dL_dW1 = np.outer(dL_dz1, self.x)

dL_db1 = dL_dz1

return dL_dW1, dL_db1, dL_dW2, dL_db2

net = SimpleNet()

x = np.array([1.0, 2.0, 3.0])

output = net.forward(x)

dL_dout = np.array([1.0, -1.0])

grads = net.backward(dL_dout)

print(f"W1 gradient shape: {grads[0].shape}")

print(f"W2 gradient shape: {grads[2].shape}")

2.3 Optimization Theory

**Optimality Conditions**

- First-order (necessary): at an extremum, $\nabla f = \mathbf{0}$

- Second-order (sufficient): the Hessian's eigenvalues determine min/max

**Convex Functions**

A function is convex if:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \lambda \in [0,1]$$

For convex functions, any local minimum is a global minimum. Deep learning loss functions are generally non-convex, so global optimality cannot be guaranteed.

**Gradient Descent**

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$

where $\alpha$ is the learning rate. Too large and the updates diverge; too small and convergence is slow.

Common variants:

- **SGD (Stochastic Gradient Descent)**: uses a random mini-batch each step

- **Momentum**: accumulates past gradient directions for acceleration

- **Adam**: adaptive learning rates per parameter

Adam update rule:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

def gradient_descent(grad_f, init_x, lr=0.1, steps=100):

x = init_x.copy()

history = [x.copy()]

for _ in range(steps):

grad = grad_f(x)

x -= lr * grad

history.append(x.copy())

return x, np.array(history)

def f(x):

return x[0]**2 + 4 * x[1]**2

def grad_f(x):

return np.array([2*x[0], 8*x[1]])

init_x = np.array([3.0, 3.0])

final_x, history = gradient_descent(grad_f, init_x, lr=0.1, steps=50)

print(f"Start: {init_x}, Converged to: {final_x.round(6)}")

def adam_optimizer(grad_f, init_x, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):

x = init_x.copy()

m = np.zeros_like(x)

v = np.zeros_like(x)

for t in range(1, steps + 1):

g = grad_f(x)

m = beta1 * m + (1 - beta1) * g

v = beta2 * v + (1 - beta2) * g**2

m_hat = m / (1 - beta1**t)

v_hat = v / (1 - beta2**t)

x -= alpha * m_hat / (np.sqrt(v_hat) + eps)

return x

final_adam = adam_optimizer(grad_f, init_x)

print(f"Adam converged to: {final_adam.round(6)}")

3. Probability and Statistics

Probability and statistics provide the language for handling uncertainty in deep learning. Loss function design, regularization, and model evaluation all rely on probabilistic thinking.

3.1 Probability Basics

**Kolmogorov Axioms**

1. $P(A) \geq 0$ (non-negativity)

2. $P(\Omega) = 1$ (total probability equals 1)

3. For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$ (additivity)

**Conditional Probability**

The probability of $A$ given that $B$ has occurred:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

**Bayes' Theorem**

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Bayes' theorem formalizes "updating beliefs in the light of evidence":

- $P(A)$: Prior — belief before seeing evidence

- $P(B|A)$: Likelihood — probability of evidence under hypothesis $A$

- $P(A|B)$: Posterior — updated belief after seeing evidence

This underlies Bayesian classifiers, Bayesian neural networks, and probabilistic model evaluation.

3.2 Probability Distributions

**Discrete Distributions**

_Bernoulli Distribution_

Single trial with outcome 0 or 1:

$$P(X=k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}$$

$E[X] = p$, $Var(X) = p(1-p)$

_Binomial Distribution_

Number of successes in $n$ Bernoulli trials:

$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$

_Poisson Distribution_

Event count in unit time or space:

$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

$E[X] = Var(X) = \lambda$

**Continuous Distributions**

_Uniform Distribution_

Equal probability density over $[a, b]$:

$$f(x) = \frac{1}{b-a}, \quad a \leq x \leq b$$

_Normal (Gaussian) Distribution_

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Characterized by mean $\mu$ and standard deviation $\sigma$. By the Central Limit Theorem, many natural phenomena follow this distribution. Used extensively in weight initialization and noise modeling.

_Multivariate Normal Distribution_

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

where $\boldsymbol{\mu}$ is the mean vector and $\Sigma$ is the covariance matrix. Central to VAE latent space modeling.

from scipy import stats

Normal distribution

mu, sigma = 0, 1

print(f"P(-1 < X < 1) = {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f}") # 68.27%

print(f"P(-2 < X < 2) = {stats.norm.cdf(2) - stats.norm.cdf(-2):.4f}") # 95.45%

print(f"P(-3 < X < 3) = {stats.norm.cdf(3) - stats.norm.cdf(-3):.4f}") # 99.73%

Multivariate normal sampling

mean = np.array([0, 0])

cov = np.array([[1, 0.7],

[0.7, 1]])

samples = np.random.multivariate_normal(mean, cov, size=1000)

print(f"\nSample shape: {samples.shape}")

print(f"Sample mean: {samples.mean(axis=0).round(3)}")

print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

Binomial distribution

n, p = 10, 0.5

print(f"\nBinomial B(10, 0.5): P(X=5) = {stats.binom.pmf(5, n, p):.4f}")

3.3 Expectation and Variance

**Expected Value**

$$E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}$$

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx \quad \text{(continuous)}$$

Linearity of expectation: $E[aX + b] = aE[X] + b$

**Variance**

$$Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$

Standard deviation: $\sigma = \sqrt{Var(X)}$

**Covariance**

Measures the linear relationship between two random variables:

$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

**Correlation**

$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

Range $[-1, 1]$; measures strength of linear relationship independent of scale.

**Covariance Matrix**

For an $n$-dimensional random vector:

$$\Sigma_{ij} = Cov(X_i, X_j)$$

The covariance matrix encodes the variance structure of data and is the key input to PCA.

3.4 Maximum Likelihood Estimation (MLE)

**Likelihood Function**

The probability of observing data $\mathcal{D} = \{x_1, \ldots, x_n\}$ given parameters $\theta$:

$$L(\theta; \mathcal{D}) = P(\mathcal{D}|\theta) = \prod_{i=1}^n P(x_i|\theta)$$

MLE finds the $\theta$ that maximizes this likelihood:

$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; \mathcal{D})$$

**Log-Likelihood**

Taking the log converts the product to a sum, simplifying computation:

$$\log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)$$

Since log is monotonically increasing, the maximizer of log-likelihood equals the maximizer of likelihood.

**MLE for the Normal Distribution**

Assuming data follows $\mathcal{N}(\mu, \sigma^2)$:

$$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$

Setting the derivative with respect to $\mu$ to zero: $\hat{\mu}_{MLE} = \bar{x}$ (sample mean)

For $\sigma^2$: $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum (x_i - \bar{x})^2$

**Cross-Entropy and MLE**

Minimizing cross-entropy loss in classification is equivalent to MLE:

$$\mathcal{L}_{CE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^C y_{ic} \log \hat{p}_{ic}$$

This maximizes the log-likelihood of the model under the true data distribution.

from scipy.optimize import minimize_scalar

MLE for Bernoulli distribution

data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

n = len(data)

k = data.sum()

p_mle = k / n

print(f"Analytic MLE: p = {p_mle:.2f}")

def neg_log_likelihood(p):

if p <= 0 or p >= 1:

return np.inf

return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method='bounded')

print(f"Numerical MLE: p = {result.x:.4f}")

Cross-entropy loss

def cross_entropy(y_true, y_pred, eps=1e-15):

y_pred = np.clip(y_pred, eps, 1 - eps)

return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])

y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])

print(f"\nCross-Entropy loss: {cross_entropy(y_true, y_pred):.4f}")

3.5 Information Theory

**Entropy**

Measures the uncertainty of a probability distribution $P$:

$$H(X) = -\sum_{x} P(x) \log P(x)$$

Maximum for uniform distributions, zero when a single outcome is certain.

**KL Divergence**

A measure of "distance" between distributions $P$ and $Q$ (asymmetric):

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

$D_{KL}(P \| Q) \geq 0$, with equality iff $P = Q$. Used in VAE loss functions, PPO policy optimization, and knowledge distillation.

**Cross-Entropy**

$$H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)$$

Since $H(P)$ is constant, minimizing cross-entropy is equivalent to minimizing KL divergence between the true and model distributions.

**Mutual Information**

Information shared between two random variables:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

Theoretical foundation for feature selection and representation learning.

def entropy(p):

p = np.array(p)

p = p[p > 0]

return -np.sum(p * np.log2(p))

def kl_divergence(p, q, eps=1e-15):

p = np.array(p) + eps

q = np.array(q) + eps

p = p / p.sum()

q = q / q.sum()

return np.sum(p * np.log(p / q))

Uniform distribution (maximum entropy)

uniform = [0.25, 0.25, 0.25, 0.25]

print(f"Uniform distribution entropy: {entropy(uniform):.4f} bits") # 2.0

Concentrated distribution (low entropy)

concentrated = [0.97, 0.01, 0.01, 0.01]

print(f"Concentrated distribution entropy: {entropy(concentrated):.4f} bits")

KL Divergence

true_dist = [0.4, 0.3, 0.2, 0.1]

model_dist = [0.35, 0.35, 0.15, 0.15]

kl = kl_divergence(true_dist, model_dist)

print(f"\nKL Divergence D_KL(P||Q): {kl:.4f}")

4. Numerical Methods

Understanding the gap between mathematical theory and actual computer arithmetic is important for reliable deep learning implementations.

4.1 Floating-Point Representation

**FP32 (Single Precision)**

- 32 bits: 1 sign, 8 exponent, 23 mantissa

- Precision: ~7 significant decimal digits

- Range: approximately $\pm 3.4 \times 10^{38}$

**FP16 (Half Precision)**

- 16 bits: 1 sign, 5 exponent, 10 mantissa

- Saves GPU memory, faster computation (Mixed Precision Training)

- Risk of overflow/underflow with large or small values

**BF16 (Brain Float 16)**

- 16 bits: 1 sign, 8 exponent, 7 mantissa

- Same exponent range as FP32 (lower overflow risk)

- Introduced by Google for TPUs, now widely used for AI training

Floating-point precision issues

a = np.float32(0.1)

b = np.float32(0.2)

print(f"0.1 + 0.2 (float32) = {a + b}") # may not be exactly 0.3

FP16 overflow

print(f"FP16 large value: {np.float16(65000)}")

print(f"FP16 overflow: {np.float16(70000)}") # inf

Compare dtype ranges

for dtype in [np.float16, np.float32, np.float64]:

info = np.finfo(dtype)

print(f"{dtype.__name__}: max={info.max:.3e}, min_positive={info.tiny:.3e}")

4.2 Numerical Stability

**The Instability of Naive Softmax**

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Large inputs cause $e^{x_i}$ to overflow (inf). The fix: subtract the maximum value before exponentiation.

$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

This is mathematically equivalent but numerically stable.

def naive_softmax(x):

return np.exp(x) / np.sum(np.exp(x))

def stable_softmax(x):

x_shifted = x - np.max(x)

return np.exp(x_shifted) / np.sum(np.exp(x_shifted))

Normal input

x_normal = np.array([1.0, 2.0, 3.0])

print("Normal input:")

print(f" Naive: {naive_softmax(x_normal)}")

print(f" Stable: {stable_softmax(x_normal)}")

Large input

x_large = np.array([1000.0, 2000.0, 3000.0])

print("\nLarge input:")

naive_result = naive_softmax(x_large)

print(f" Naive: {naive_result}") # [nan, nan, nan] due to overflow

print(f" Stable: {stable_softmax(x_large)}") # [0, 0, 1]

Log-Softmax (used with NLLLoss)

def log_softmax(x):

x_shifted = x - np.max(x)

return x_shifted - np.log(np.sum(np.exp(x_shifted)))

logits = np.array([2.0, 1.0, 0.1])

print(f"\nLog-Softmax: {log_softmax(logits)}")

5. Mathematical Analysis of Linear Regression

Linear regression is the simplest building block of deep learning. A thorough mathematical understanding creates a foundation for understanding more complex models.

5.1 Ordinary Least Squares

For data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, we fit a linear model $\hat{y} = \mathbf{w}^T \mathbf{x} + b$.

**Loss function (MSE)**:

$$L(\mathbf{w}) = \frac{1}{n} \|y - X\mathbf{w}\|^2$$

**Normal Equation**

Differentiating with respect to $\mathbf{w}$ and setting to zero:

$$X^T X \mathbf{w} = X^T y$$

$$\hat{\mathbf{w}} = (X^T X)^{-1} X^T y$$

This closed-form solution requires $X^T X$ to be invertible (features must be linearly independent).

5.2 Ridge and Lasso Regression

**Ridge Regression (L2 regularization)**

$$L_{ridge}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2$$

Normal equation: $\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T y$

Adding $\lambda I$ guarantees invertibility and shrinks weights toward zero, preventing overfitting.

**Lasso Regression (L1 regularization)**

$$L_{lasso}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1$$

L1 regularization induces sparse solutions — some weights become exactly zero — providing automatic feature selection.

from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)

n, d = 100, 10

X = np.random.randn(n, d)

true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.0, 0.0])

y = X @ true_w + np.random.randn(n) * 0.5

Normal equation (OLS)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("OLS coefficients:", w_ols.round(2))

Ridge

ridge = Ridge(alpha=1.0)

ridge.fit(X, y)

print("Ridge coefficients:", ridge.coef_.round(2))

Lasso (sparse solution)

lasso = Lasso(alpha=0.1)

lasso.fit(X, y)

print("Lasso coefficients:", lasso.coef_.round(2))

print("Number of zero coefficients:", (np.abs(lasso.coef_) < 0.01).sum())

Ridge via normal equation

def ridge_normal_equation(X, y, lam=1.0):

d = X.shape[1]

return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ridge = ridge_normal_equation(X, y, lam=1.0)

print("\nRidge (normal equation):", w_ridge.round(2))

6. Integration: The Mathematics of Deep Learning

6.1 A Mathematical View of Neural Networks

A fully connected neural network with $L$ layers:

$$\mathbf{h}^{(0)} = \mathbf{x}$$

$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$$

$$\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

$$\hat{y} = \mathbf{h}^{(L)}$$

**Forward pass**: linear algebra (matrix multiplication) + activation functions

**Loss computation**: information theory (cross-entropy) or statistics (MSE)

**Backward pass**: calculus (chain rule) to compute gradients

**Parameter update**: optimization theory (gradient descent, Adam)

6.2 The Attention Mechanism and Matrix Operations

The Transformer's scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

- $Q, K, V$: Query, Key, Value matrices (linear algebra)

- $\frac{1}{\sqrt{d_k}}$ scaling: numerical stability (numerical analysis)

- Softmax: converts scores to probability distribution (probability theory)

- Dot product $QK^T$: similarity measurement (vector dot product)

def stable_softmax_2d(x):

x_max = x.max(axis=-1, keepdims=True)

x_shifted = x - x_max

exp_x = np.exp(x_shifted)

return exp_x / exp_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):

d_k = Q.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)

if mask is not None:

scores = np.where(mask, scores, -1e9)

weights = stable_softmax_2d(scores)

output = weights @ V

return output, weights

3 tokens, d_k=4

seq_len, d_k = 3, 4

np.random.seed(42)

Q = np.random.randn(seq_len, d_k)

K = np.random.randn(seq_len, d_k)

V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)

print(f"Attention output shape: {output.shape}")

print(f"Attention weights (rows sum to 1):\n{weights.round(3)}")

print(f"Row sums: {weights.sum(axis=-1)}")

Conclusion: Next Steps

This guide covered the core mathematical foundations of AI/ML:

1. **Linear Algebra**: the language for representing and transforming data and model parameters

2. **Calculus**: the tool explaining _how_ models learn

3. **Probability & Statistics**: handles uncertainty and provides the theoretical basis for loss functions

4. **Numerical Methods**: knowledge for stable computer implementations of mathematical theory

Recommended next steps:

- **Gilbert Strang's Linear Algebra** — free on MIT OCW

- **Pattern Recognition and Machine Learning (Bishop)** — probabilistic perspective on ML

- **Deep Learning (Goodfellow et al.)** — the theoretical bible of deep learning

- **Practice**: implement a neural network from scratch using only NumPy

Mathematics is a tool for understanding. You do not need to memorize every formula, but grasping _why_ each concept exists and _how_ it operates within deep learning gives you significantly more powerful practical skills.

References

- Gilbert Strang, _Introduction to Linear Algebra_, 6th Edition

- Ian Goodfellow et al., _Deep Learning_, MIT Press (2016)

- Christopher Bishop, _Pattern Recognition and Machine Learning_, Springer (2006)

- 3Blue1Brown — Essence of Linear Algebra, Essence of Calculus (YouTube)

- fast.ai — Practical Deep Learning for Coders

현재 단락 (1/541)

To truly understand machine learning and deep learning, you need to grasp the underlying mathematics...

작성 글자: 0원문 글자: 26,025작성 단락: 0/541