AI/ML을 위한 수학 완전 정복: 선형대수, 미적분, 확률통계

AI/ML을 위한 수학 완전 정복

머신러닝과 딥러닝을 깊이 이해하려면 그 바탕에 깔린 수학을 알아야 합니다. 많은 개발자들이 "왜 이 수식이 이렇게 작동하는가?"라는 질문에 명확한 답을 얻지 못한 채로 프레임워크를 사용합니다. 이 글은 그 공백을 채우기 위한 시도입니다.

단순히 수식을 나열하는 것이 아니라, 각 개념이 왜 중요한지, 딥러닝의 어느 부분에서 어떻게 쓰이는지를 중심으로 설명합니다. Python/NumPy 코드를 함께 제공하여 수식을 직접 확인할 수 있습니다.

1. 선형대수(Linear Algebra)

선형대수는 딥러닝의 가장 기본적인 언어입니다. 신경망의 순전파(Forward Pass)는 사실 행렬 곱셈의 연속이고, 임베딩은 고차원 벡터 공간의 한 점입니다.

1.1 벡터(Vector)

벡터의 정의와 기하학적 의미

벡터는 크기(magnitude)와 방향(direction)을 동시에 가진 양입니다. 수학적으로는 수들의 순서 있는 목록이며, $n$ 차원 공간의 한 점을 가리키는 화살표로 시각화할 수 있습니다.

$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$

딥러닝에서 벡터는 도처에 등장합니다. 입력 데이터, 은닉층의 활성화값, 임베딩 표현 등이 모두 벡터입니다.

벡터 덧셈과 스칼라 곱

두 벡터의 덧셈은 성분별(element-wise)로 이루어집니다.

$\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{bmatrix}$

스칼라 $c$ 와 벡터의 곱은 각 성분에 $c$ 를 곱합니다.

$c \cdot \mathbf{v} = \begin{bmatrix} c \cdot v_1 \\ c \cdot v_2 \\ \vdots \\ c \cdot v_n \end{bmatrix}$

내적(Dot Product)

내적은 두 벡터의 "유사성"을 측정하는 연산입니다.

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$

기하학적으로, 내적은 두 벡터의 크기와 사이각 $\theta$ 의 코사인으로도 표현됩니다.

$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$

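A positive dot product means the angle between the two vectors is less than 90° (they point in broadly the same direction), a negative one means the angle exceeds 90° (broadly opposite directions), and zero means the vectors are orthogonal.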

코사인 유사도(Cosine Similarity)

두 벡터의 방향 유사성만을 측정하려면 크기로 정규화합니다.

$\text{cosine similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$

값의 범위는 $[-1, 1]$ 이며, 1이면 완전히 같은 방향, -1이면 완전히 반대 방향을 의미합니다. 자연어 처리에서 단어 임베딩의 유사도를 측정할 때 자주 사용됩니다.

벡터 노름(Vector Norm)

노름은 벡터의 "크기(길이)"를 측정하는 함수입니다.

$L_1$ 노름(맨해튼 거리): $\|\mathbf{v}\|_1 = \sum_i |v_i|$
$L_2$ 노름(유클리드 거리): $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$
$L_\infty$ 노름(최댓값 노름): $\|\mathbf{v}\|_\infty = \max_i |v_i|$

정규화(Regularization)에서 $L_1$ 은 희소성(sparsity)을, $L_2$ 는 가중치의 전반적 크기 축소를 유도합니다.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# 내적
dot_product = np.dot(a, b)  # 3*1 + 4*2 = 11
print(f"내적: {dot_product}")

# 노름
l2_norm_a = np.linalg.norm(a)  # sqrt(9+16) = 5
print(f"L2 노름: {l2_norm_a}")

# 코사인 유사도
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"코사인 유사도: {cosine_sim:.4f}")

# 다양한 노름
v = np.array([3.0, -4.0, 1.0])
print(f"L1 노름: {np.linalg.norm(v, ord=1)}")   # 8.0
print(f"L2 노름: {np.linalg.norm(v, ord=2)}")   # 5.099...
print(f"Linf 노름: {np.linalg.norm(v, ord=np.inf)}")  # 4.0

1.2 행렬(Matrix)

행렬 표기법

행렬은 수들을 직사각형 형태로 배열한 것입니다. $m \times n$ 행렬은 $m$ 개의 행(row)과 $n$ 개의 열(column)을 가집니다.

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$

딥러닝에서 가중치 행렬 $W$ 는 레이어 간의 선형 변환을 나타냅니다.

행렬 곱셈(Matrix Multiplication)

$A$ 가 $m \times k$ 행렬이고 $B$ 가 $k \times n$ 행렬일 때, 곱 $C = AB$ 는 $m \times n$ 행렬이며:

$C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}$

핵심은 앞 행렬의 열 수 = 뒤 행렬의 행 수가 일치해야 한다는 것입니다. 결과 행렬의 shape은 $(m, n)$ 입니다.

행렬 곱셈은 교환법칙이 성립하지 않습니다: $AB \neq BA$ (일반적으로).

전치행렬(Transpose)

행과 열을 교환한 행렬입니다.

$(A^T)_{ij} = A_{ji}$

성질: $(AB)^T = B^T A^T$

역행렬(Inverse Matrix)

정방행렬 $A$ 의 역행렬 $A^{-1}$ 는 $AA^{-1} = A^{-1}A = I$ 를 만족합니다. 역행렬이 존재하려면 행렬식이 0이 아니어야 합니다.

행렬식(Determinant)

$2 \times 2$ 행렬의 행렬식:

$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$

행렬식은 해당 행렬이 나타내는 선형 변환이 공간을 얼마나 "늘리거나 줄이는지"를 나타냅니다. $\det(A) = 0$ 이면 차원이 축소되어 역행렬이 존재하지 않습니다.

랭크(Rank)

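The rank of a matrix is the maximum number of linearly independent rows (or columns). The lower the rank, the lower the dimension of the output space of the linear transformation it represents. Low-rank approximation is the core idea behind model compression and behind parameter-efficient fine-tuning methods such as LoRA.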

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

B = np.array([[9, 8, 7],
              [6, 5, 4],
              [3, 2, 1]])

# 행렬 곱
C = np.dot(A, B)
print("행렬 곱 결과:\n", C)

# 전치행렬
print("A의 전치행렬:\n", A.T)

# 행렬식
A2 = np.array([[3.0, 1.0],
               [2.0, 4.0]])
print(f"행렬식: {np.linalg.det(A2):.2f}")  # 3*4 - 1*2 = 10

# 역행렬
A_inv = np.linalg.inv(A2)
print("역행렬:\n", A_inv)
print("A * A_inv (항등행렬):\n", np.round(A2 @ A_inv))

# 랭크
print(f"A의 랭크: {np.linalg.matrix_rank(A)}")  # 2 (종속 행이 있으므로)

1.3 고유값과 고유벡터(Eigenvalues & Eigenvectors)

정의

For a square matrix $A$, a non-zero vector $\mathbf{v}$ that satisfies the equation below is called an eigenvector, and the scalar $\lambda$ is the corresponding eigenvalue.

$A\mathbf{v} = \lambda \mathbf{v}$

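The meaning is powerful: when $A$ transforms $\mathbf{v}$, the result stays on the same line as $\mathbf{v}$ and is only scaled by a factor of $\lambda$ (the direction flips if $\lambda$ is negative).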

고유값 분해(Eigendecomposition)

$A$ 가 대칭 행렬이면 직교하는 고유벡터들로 분해할 수 있습니다.

$A = Q \Lambda Q^T$

여기서 $Q$ 는 고유벡터들을 열로 가지는 직교 행렬이고, $\Lambda$ 는 고유값들을 대각에 가지는 대각 행렬입니다.

PCA(주성분 분석)와의 관계

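If the data matrix $X$ (one sample per row) is mean-centered, its covariance matrix is $\Sigma = \frac{1}{n} X^T X$, and the eigenvectors of $\Sigma$ are the principal components. The eigenvector with the largest eigenvalue points in the direction of greatest variance in the data, that is, the most important feature direction.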

SVD(특이값 분해, Singular Value Decomposition)

고유값 분해는 정방행렬에만 적용되지만, SVD는 모든 행렬에 적용됩니다.

$A = U \Sigma V^T$

$U$ : $m \times m$ 직교 행렬 (왼쪽 특이벡터)
$\Sigma$ : $m \times n$ 대각 행렬 (특이값, 0 이상)
$V^T$ : $n \times n$ 직교 행렬 (오른쪽 특이벡터)

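SVD is central to matrix approximation, noise reduction, recommendation systems, and LSA in natural language processing. LoRA (Low-Rank Adaptation) builds on the related idea of low-rank factorization: it learns a weight update as the product of two small matrices rather than computing an SVD.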

import numpy as np

# 고유값과 고유벡터
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("고유값:", eigenvalues)
print("고유벡터 (열 벡터):\n", eigenvectors)

# 검증: A*v = lambda*v
v0 = eigenvectors[:, 0]
lam0 = eigenvalues[0]
print("A*v:", A @ v0)
print("lambda*v:", lam0 * v0)

# SVD
M = np.array([[1, 2, 3],
              [4, 5, 6]])

U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"U shape: {U.shape}")
print(f"S (특이값): {S}")
print(f"Vt shape: {Vt.shape}")

# 저랭크 근사 (rank-1 approximation)
rank1_approx = np.outer(S[0] * U[:, 0], Vt[0, :])
print("원본 행렬:\n", M)
print("rank-1 근사:\n", rank1_approx)

# PCA 예제
from numpy.linalg import eig

# 데이터 생성
np.random.seed(42)
data = np.random.randn(100, 2)
data[:, 1] = data[:, 0] * 0.8 + data[:, 1] * 0.2  # 상관관계 추가

# 공분산 행렬
cov = np.cov(data.T)
eigenvals, eigenvecs = eig(cov)

# 첫 번째 주성분
idx = np.argsort(eigenvals)[::-1]
principal_component = eigenvecs[:, idx[0]]
print("첫 번째 주성분:", principal_component)
print("설명 분산 비율:", eigenvals[idx[0]] / eigenvals.sum())

1.4 딥러닝과 선형대수

선형 변환으로서의 레이어

신경망의 완전 연결 레이어(Fully Connected Layer)는 본질적으로 선형 변환입니다.

$\mathbf{y} = W\mathbf{x} + \mathbf{b}$

여기서:

$\mathbf{x} \in \mathbb{R}^n$ : 입력 벡터
$W \in \mathbb{R}^{m \times n}$ : 가중치 행렬
$\mathbf{b} \in \mathbb{R}^m$ : 편향 벡터
$\mathbf{y} \in \mathbb{R}^m$ : 출력 벡터

여러 레이어를 거치는 것은 연속된 선형 변환입니다. 활성화 함수가 없으면 여러 레이어를 쌓아도 단일 선형 변환과 동일하므로, 비선형 활성화 함수(ReLU, Sigmoid 등)가 필수적입니다.

배치 처리와 행렬 곱셈

$B$ 개의 샘플을 동시에 처리할 때:

$Y = XW^T + \mathbf{b}$

여기서 $X \in \mathbb{R}^{B \times n}$ 이고 $Y \in \mathbb{R}^{B \times m}$ 입니다. GPU는 이런 대규모 행렬 곱셈을 병렬로 처리하는 데 최적화되어 있습니다.
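The two claims above (the batched form matches the per-sample form, and stacked layers without an activation collapse into one) can be checked numerically. This is a minimal sketch; the sizes and the random weights are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, m, k = 5, 3, 4, 2   # batch size and layer widths (arbitrary)

X = rng.normal(size=(B, n))
W1, b1 = rng.normal(size=(m, n)), rng.normal(size=m)
W2, b2 = rng.normal(size=(k, m)), rng.normal(size=k)

# Batched form: Y = X W^T + b (b is broadcast across the rows)
Y_batch = X @ W1.T + b1
# Per-sample form: y = W x + b, applied to each row of X
Y_loop = np.stack([W1 @ x + b1 for x in X])
print("batched == per-sample:", np.allclose(Y_batch, Y_loop))

# Two stacked layers with no activation in between ...
Y_two = (X @ W1.T + b1) @ W2.T + b2
# ... equal a single layer with W = W2 W1 and b = W2 b1 + b2
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
Y_one = X @ W_merged.T + b_merged
print("two linear layers == one layer:", np.allclose(Y_two, Y_one))
```

Inserting a non-linearity such as ReLU between the two layers breaks the second equality, which is why activation functions are needed.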

2. 미적분(Calculus)

미적분은 딥러닝에서 "어떻게 배우는가"를 담당합니다. 손실 함수를 최소화하는 경사 하강법은 전적으로 미분에 의존합니다.

2.1 미분(Differentiation)

극한과 미분의 정의

함수 $f(x)$ 의 $x$ 에서의 미분은 극한으로 정의됩니다.

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

이는 $x$ 에서 함수의 순간 변화율, 즉 그래프의 접선 기울기입니다.

기본 미분 규칙

거듭제곱 법칙: $(x^n)' = nx^{n-1}$
합 법칙: $(f+g)' = f' + g'$
곱 법칙: $(fg)' = f'g + fg'$
몫 법칙: $(f/g)' = (f'g - fg') / g^2$
연쇄 법칙(Chain Rule): $[f(g(x))]' = f'(g(x)) \cdot g'(x)$

딥러닝에서 자주 쓰이는 함수들의 미분:

$\frac{d}{dx}(e^x) = e^x$
$\frac{d}{dx}(\ln x) = \frac{1}{x}$
$\frac{d}{dx}(\sigma(x)) = \sigma(x)(1 - \sigma(x))$ (Sigmoid)
$\frac{d}{dx}(\tanh(x)) = 1 - \tanh^2(x)$
$\frac{d}{dx}(\text{ReLU}(x)) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$
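The activation derivatives listed above can be sanity-checked against a central finite difference. This is a rough sketch: the helper names `sigmoid`, `relu`, and `central_diff` are introduced here for illustration, and the test points deliberately avoid $x = 0$, where ReLU is not differentiable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def central_diff(f, x, h=1e-5):
    # Central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = np.array([-2.0, -0.5, 0.5, 2.0])  # avoids x = 0

checks = {
    "sigmoid": (central_diff(sigmoid, x), sigmoid(x) * (1 - sigmoid(x))),
    "tanh": (central_diff(np.tanh, x), 1 - np.tanh(x) ** 2),
    "relu": (central_diff(relu, x), (x > 0).astype(float)),
}
results = {name: bool(np.allclose(num, ana, atol=1e-6))
           for name, (num, ana) in checks.items()}
for name, ok in results.items():
    print(f"{name}: numeric matches analytic -> {ok}")
```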

편미분(Partial Derivative)

다변수 함수에서 하나의 변수에 대한 미분입니다. 다른 변수들은 상수로 취급합니다.

$\frac{\partial f}{\partial x_i}$

예를 들어, $f(x, y) = x^2 + 3xy + y^2$ 이면:

$\frac{\partial f}{\partial x} = 2x + 3y, \quad \frac{\partial f}{\partial y} = 3x + 2y$

그래디언트(Gradient)

모든 편미분을 모은 벡터입니다.

$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$

그래디언트는 함수값이 가장 빠르게 증가하는 방향을 가리킵니다. 따라서 손실 함수를 최소화하려면 그래디언트의 반대 방향으로 이동합니다(경사 하강법).

야코비안(Jacobian) 행렬

벡터 함수 $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ 의 모든 1차 편미분을 담은 행렬입니다.

$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$

헤시안(Hessian) 행렬

스칼라 함수의 2차 편미분을 담은 정방 행렬입니다.

$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$

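The eigenvalues of the Hessian describe the curvature of the function. At a critical point (where $\nabla f = \mathbf{0}$), all-positive eigenvalues indicate a local minimum, all-negative eigenvalues a local maximum, and mixed signs a saddle point; if any eigenvalue is zero, the test is inconclusive.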
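As a small illustration of the eigenvalue test, the sketch below classifies the critical point at the origin for three simple quadratics. The function `classify_critical_point` and the example Hessians are assumptions made for this example, and the test only applies at points where the gradient is zero.

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from its (symmetric) Hessian H."""
    eigvals = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    if np.any(eigvals > 0) and np.any(eigvals < 0):
        return "saddle point"
    return "inconclusive"  # at least one zero eigenvalue, no mixed signs

# Hessians at the origin (constant for quadratics)
H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])      # f = x^2 + y^2
H_dome = np.array([[-2.0, 0.0], [0.0, -2.0]])    # f = -x^2 - y^2
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # f = x^2 - y^2

for name, H in [("x^2 + y^2", H_bowl), ("-x^2 - y^2", H_dome), ("x^2 - y^2", H_saddle)]:
    print(f"f = {name}: {classify_critical_point(H)}")
```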

import numpy as np

# 수치 미분 (기울기 확인용)
def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += h
        x_minus = x.copy()
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# 예제 함수: f(x, y) = x^2 + 2y^2
def f(x):
    return x[0]**2 + 2 * x[1]**2

x0 = np.array([3.0, 4.0])
grad = numerical_gradient(f, x0)
print(f"수치 그래디언트: {grad}")  # [6.0, 16.0]
print(f"해석적 그래디언트: [{2*x0[0]}, {4*x0[1]}]")

# PyTorch를 이용한 자동 미분
try:
    import torch
    x = torch.tensor([3.0, 4.0], requires_grad=True)
    y = x[0]**2 + 2 * x[1]**2
    y.backward()
    print(f"PyTorch 그래디언트: {x.grad}")
except ImportError:
    print("PyTorch 미설치")

2.2 체인 룰과 역전파(Backpropagation)

체인 룰(Chain Rule)

함수의 합성에서 미분을 계산하는 규칙입니다.

$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$

다변수로 확장하면:

$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}$

역전파 알고리즘

신경망의 훈련에서 손실 함수 $L$ 을 각 가중치 $w$ 에 대해 미분해야 합니다. 네트워크가 여러 레이어의 합성 함수이므로, 체인 룰을 반복 적용합니다.

2-레이어 신경망의 예:

$L = \ell(\text{softmax}(W_2 \cdot \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2), \mathbf{y})$

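Writing $\mathbf{z}_1 = W_1 \mathbf{x} + \mathbf{b}_1$, $\mathbf{h} = \text{ReLU}(\mathbf{z}_1)$, and $\mathbf{a}_2 = W_2 \mathbf{h} + \mathbf{b}_2$, the gradient $\frac{\partial L}{\partial W_1}$ is:

$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{a}_2} \cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial \mathbf{z}_1} \cdot \frac{\partial \mathbf{z}_1}{\partial W_1}$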

역전파는 출력에서 입력 방향으로 이 체인을 효율적으로 계산합니다.

import numpy as np

# 간단한 신경망 역전파 예제
class SimpleNet:
    def __init__(self):
        self.W1 = np.random.randn(4, 3) * 0.1
        self.b1 = np.zeros(4)
        self.W2 = np.random.randn(2, 4) * 0.1
        self.b2 = np.zeros(2)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(float)

    def forward(self, x):
        self.x = x
        self.z1 = self.W1 @ x + self.b1
        self.h = self.relu(self.z1)
        self.z2 = self.W2 @ self.h + self.b2
        return self.z2

    def backward(self, dL_dz2):
        # 레이어 2 그래디언트
        dL_dW2 = np.outer(dL_dz2, self.h)
        dL_db2 = dL_dz2
        dL_dh = self.W2.T @ dL_dz2

        # ReLU 그래디언트
        dL_dz1 = dL_dh * self.relu_grad(self.z1)

        # 레이어 1 그래디언트
        dL_dW1 = np.outer(dL_dz1, self.x)
        dL_db1 = dL_dz1

        return dL_dW1, dL_db1, dL_dW2, dL_db2

net = SimpleNet()
x = np.array([1.0, 2.0, 3.0])
output = net.forward(x)
dL_dout = np.array([1.0, -1.0])  # 손실의 그래디언트
grads = net.backward(dL_dout)
print(f"W1 그래디언트 shape: {grads[0].shape}")
print(f"W2 그래디언트 shape: {grads[2].shape}")

2.3 최적화 이론

극값 조건

1차 조건(필요 조건): 극값에서 $\nabla f = \mathbf{0}$
2차 조건(충분 조건): 헤시안 $H$ 의 부호로 극소/극대 판단

볼록 함수(Convex Function)

A function $f$ is convex if the line segment (chord) connecting any two points on its graph never lies below the graph:

$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \lambda \in [0,1]$

볼록 함수의 극소는 항상 전역 최솟값입니다. 딥러닝의 손실 함수는 일반적으로 볼록하지 않아서 전역 최솟값을 보장할 수 없습니다.

경사 하강법(Gradient Descent)

$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$

여기서 $\alpha$ 는 학습률(learning rate)입니다. 학습률이 너무 크면 발산하고, 너무 작으면 수렴이 느립니다.

변형들:

확률적 경사 하강법(SGD): 매 스텝마다 무작위 샘플 하나(또는 미니배치)를 사용
모멘텀(Momentum): 이전 그래디언트 방향을 누적하여 가속
Adam: 적응적 학습률, 현재 딥러닝에서 가장 널리 쓰임

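$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$

$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$

$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

Here $g_t$ is the gradient at step $t$, and $\hat{m}_t$, $\hat{v}_t$ are the bias-corrected moment estimates.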

import numpy as np

# 경사 하강법 구현
def gradient_descent(grad_f, init_x, learning_rate=0.1, steps=100):
    x = init_x.copy()
    history = [x.copy()]
    for _ in range(steps):
        grad = grad_f(x)
        x -= learning_rate * grad
        history.append(x.copy())
    return x, np.array(history)

# 예제: f(x, y) = x^2 + 4y^2 최소화
def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2*x[0], 8*x[1]])

init_x = np.array([3.0, 3.0])
final_x, history = gradient_descent(grad_f, init_x, learning_rate=0.1, steps=50)
print(f"시작점: {init_x}")
print(f"수렴 지점: {final_x}")
print(f"최솟값: {f(final_x):.6f}")

# Adam 옵티마이저 구현
def adam(grad_f, init_x, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    x = init_x.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

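# With alpha=0.01 each Adam step moves at most roughly alpha, so 100 steps
# cannot reach the minimum from (3, 3); use a larger step size and more steps.
final_adam = adam(grad_f, init_x, alpha=0.1, steps=1000)
print(f"Adam result: {final_adam}")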

3. 확률과 통계(Probability & Statistics)

확률과 통계는 딥러닝의 불확실성을 다루는 언어입니다. 손실 함수의 설계, 정규화, 모델 평가 모두 확률적 사고에 기반합니다.

3.1 확률 기초

확률 공리(Kolmogorov Axioms)

$P(A) \geq 0$ (비음성)
$P(\Omega) = 1$ (전체 확률은 1)
For pairwise disjoint (mutually exclusive) events $A_1, A_2, \dots$: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$ (countable additivity)

조건부 확률

사건 $B$ 가 발생했다는 조건 하에 사건 $A$ 의 확률:

$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$

베이즈 정리(Bayes' Theorem)

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

베이즈 정리는 "증거를 바탕으로 믿음을 업데이트"하는 원리입니다.

$P(A)$ : 사전 확률(Prior) - 증거 이전의 믿음
$P(B|A)$ : 우도(Likelihood) - 가설 A 하에서 증거 B의 확률
$P(A|B)$ : 사후 확률(Posterior) - 증거를 본 후 업데이트된 믿음

이는 분류 모델 평가, 나이브 베이즈 분류기, 베이지안 신경망 등에 광범위하게 사용됩니다.

확률의 전체 법칙

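$P(B) = \sum_i P(B|A_i) P(A_i)$

This holds when the events $A_i$ form a partition of the sample space (they are mutually exclusive and together cover every outcome).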
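A worked example may help tie Bayes' theorem and the law of total probability together. The numbers below (a 1% prior, a 95% true-positive rate, a 5% false-positive rate) are hypothetical and chosen only to illustrate the formulas.

```python
# Hypothetical numbers, for illustration only
p_a = 0.01               # prior P(A): the hypothesis is true
p_b_given_a = 0.95       # likelihood P(B|A): evidence appears when A holds
p_b_given_not_a = 0.05   # P(B|not A): evidence appears although A is false

# Law of total probability over the partition {A, not A}
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: posterior P(A|B)
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B)   = {p_b:.4f}")
print(f"P(A|B) = {p_a_given_b:.4f}")
```

Under these assumed numbers the posterior is only about 16%, even though the likelihood is 95%, because the prior is so small.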

3.2 확률 분포

이산 확률 분포

베르누이 분포(Bernoulli Distribution)

결과가 0 또는 1인 단일 시행:

$P(X=k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}$

$E[X] = p$ , $Var(X) = p(1-p)$

이항 분포(Binomial Distribution)

$n$ 번의 베르누이 시행에서 성공 횟수:

$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$

포아송 분포(Poisson Distribution)

단위 시간/공간에서 사건 발생 횟수:

$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$

$E[X] = Var(X) = \lambda$

연속 확률 분포

균등 분포(Uniform Distribution)

구간 $[a, b]$ 에서 모든 값이 동일한 확률 밀도:

$f(x) = \frac{1}{b-a}, \quad a \leq x \leq b$

정규 분포(Normal/Gaussian Distribution)

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

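It is characterized by the mean $\mu$ and the standard deviation $\sigma$. By the Central Limit Theorem, sums or averages of many independent random variables with finite variance are approximately normal, which is why this distribution describes many natural phenomena well. In deep learning it is often used for weight initialization and noise modeling.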

다변량 정규 분포(Multivariate Normal Distribution)

$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$

여기서 $\boldsymbol{\mu}$ 는 평균 벡터, $\Sigma$ 는 공분산 행렬입니다. VAE(Variational Autoencoder)의 잠재 공간 모델링에 핵심적으로 사용됩니다.

import numpy as np
from scipy import stats

# 정규 분포
mu, sigma = 0, 1
x = np.linspace(-4, 4, 100)
pdf = stats.norm.pdf(x, mu, sigma)

print(f"P(-1 < X < 1) = {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f}")  # 68.27%
print(f"P(-2 < X < 2) = {stats.norm.cdf(2) - stats.norm.cdf(-2):.4f}")  # 95.45%
print(f"P(-3 < X < 3) = {stats.norm.cdf(3) - stats.norm.cdf(-3):.4f}")  # 99.73%

# 다변량 정규 분포 샘플링
mean = np.array([0, 0])
cov = np.array([[1, 0.7],
                [0.7, 1]])
samples = np.random.multivariate_normal(mean, cov, size=1000)
print(f"샘플 형태: {samples.shape}")
print(f"샘플 평균: {samples.mean(axis=0)}")
print(f"샘플 공분산:\n{np.cov(samples.T)}")

# 이항 분포
n, p = 10, 0.5
k = np.arange(0, n+1)
pmf = stats.binom.pmf(k, n, p)
print(f"\n이항 분포 B(10, 0.5):")
print(f"P(X=5) = {stats.binom.pmf(5, n, p):.4f}")

3.3 기댓값과 분산

기댓값(Expected Value)

$E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}$

$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \quad \text{(continuous)}$

기댓값의 선형성: $E[aX + b] = aE[X] + b$

분산(Variance)

$Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$

표준편차: $\sigma = \sqrt{Var(X)}$

공분산(Covariance)

두 확률변수의 선형 관계를 측정합니다.

$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$

상관계수(Correlation)

$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$

값의 범위는 $[-1, 1]$ 이며, 크기와 무관하게 선형 관계의 강도만을 측정합니다.

공분산 행렬(Covariance Matrix)

$n$ 차원 확률벡터의 모든 쌍의 공분산:

$\Sigma_{ij} = Cov(X_i, X_j)$

PCA의 핵심 입력이며, 데이터의 분산 구조를 담고 있습니다.
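The definitions of covariance, correlation, and the covariance matrix can be checked against NumPy's built-ins. This is a sketch on synthetic data; the linear relationship `y = 2x + noise` is an assumption made only to produce correlated samples.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)  # synthetic, correlated with x

# Cov(X, Y) = E[XY] - E[X]E[Y], estimated with sample means
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

# Correlation: covariance normalized by both standard deviations (ddof=0)
rho = cov_xy / (np.std(x) * np.std(y))

# Covariance matrix; rows of `data` are variables, bias=True divides by n
data = np.stack([x, y])
Sigma = np.cov(data, bias=True)

print(f"Cov(X, Y) from the definition: {cov_xy:.4f}")
print(f"Sigma[0, 1] from np.cov:       {Sigma[0, 1]:.4f}")
print(f"Correlation: {rho:.4f}")
```

Note that `np.cov` divides by $n-1$ by default; `bias=True` is used here so the result matches the $E[\cdot]$ based definition.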

3.4 최대우도추정(Maximum Likelihood Estimation, MLE)

우도함수(Likelihood Function)

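Given a parameter $\theta$, the likelihood is the probability (or density) of the observed data $\mathcal{D} = \{x_1, ..., x_n\}$. Assuming the samples are independent and identically distributed (i.i.d.), it factorizes into a product: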

$L(\theta; \mathcal{D}) = P(\mathcal{D}|\theta) = \prod_{i=1}^n P(x_i|\theta)$

MLE는 이 우도를 최대화하는 $\theta$ 를 찾습니다.

$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; \mathcal{D})$

로그 우도(Log-Likelihood)

곱셈을 덧셈으로 변환하여 계산을 쉽게 합니다.

$\log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)$

로그는 단조 증가 함수이므로 로그 우도를 최대화하는 $\theta$ 와 우도를 최대화하는 $\theta$ 는 동일합니다.

MLE 유도 예제: 정규 분포

데이터 $\{x_1, ..., x_n\}$ 이 $\mathcal{N}(\mu, \sigma^2)$ 를 따른다고 가정할 때:

$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$

$\mu$ 에 대해 미분하고 0으로 놓으면: $\hat{\mu}_{MLE} = \bar{x} = \frac{1}{n}\sum x_i$ (표본 평균)

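Differentiating with respect to $\sigma^2$ and setting the result to zero gives: $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum (x_i - \bar{x})^2$. This is the biased sample variance; the unbiased estimator divides by $n-1$ instead.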
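The closed-form estimates can be checked numerically: the log-likelihood evaluated at $(\hat{\mu}_{MLE}, \hat{\sigma}^2_{MLE})$ should not increase when either parameter is perturbed. This is a minimal sketch on synthetic data; the true parameters (mean 2, standard deviation 3) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=500)  # synthetic sample
n = len(x)

mu_mle = x.mean()
var_mle = np.mean((x - mu_mle) ** 2)   # divides by n (MLE)
var_unbiased = x.var(ddof=1)           # divides by n - 1

def log_likelihood(mu, var):
    # Gaussian log-likelihood of the sample x
    return -0.5 * n * np.log(2 * np.pi * var) - np.sum((x - mu) ** 2) / (2 * var)

best = log_likelihood(mu_mle, var_mle)
print(f"mu_MLE = {mu_mle:.4f}, var_MLE = {var_mle:.4f}, unbiased var = {var_unbiased:.4f}")
print("shifting mu lowers the log-likelihood:", best >= log_likelihood(mu_mle + 0.1, var_mle))
print("scaling var lowers the log-likelihood:", best >= log_likelihood(mu_mle, var_mle * 1.1))
```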

딥러닝의 Cross-Entropy와 MLE

분류 문제에서 Cross-Entropy 손실을 최소화하는 것은 MLE와 동치입니다.

$\mathcal{L}_{CE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^C y_{ic} \log \hat{p}_{ic}$

이는 실제 데이터 분포 하에서 모델의 로그 우도를 최대화하는 것과 같습니다.

import numpy as np
from scipy.optimize import minimize_scalar

# MLE 예제: 베르누이 분포
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # 동전 던지기
n = len(data)
k = data.sum()

# 해석적 MLE: p_hat = k/n
p_mle = k / n
print(f"MLE 추정값: p = {p_mle:.2f}")  # 0.7

# 로그 우도 함수
def neg_log_likelihood(p):
    if p <= 0 or p >= 1:
        return np.inf
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method='bounded')
print(f"수치 최적화 MLE: p = {result.x:.4f}")

# Cross-Entropy 손실 구현
def cross_entropy(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# 이진 분류 예제
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
loss = cross_entropy(y_true, y_pred)
print(f"\nCross-Entropy 손실: {loss:.4f}")

3.5 정보 이론(Information Theory)

엔트로피(Entropy)

확률 분포 $P$ 의 불확실성 또는 정보량:

$H(X) = -\sum_{x} P(x) \log P(x) = -E[\log P(X)]$

균등 분포일 때 엔트로피가 최대이고, 하나의 결과만 가능할 때 0입니다.

KL Divergence

두 분포 $P$ 와 $Q$ 사이의 "거리"(단, 비대칭):

$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = E_P\left[\log \frac{P(X)}{Q(X)}\right]$

$D_{KL}(P \| Q) \geq 0$ 이며 $P = Q$ 일 때만 0입니다. VAE의 손실 함수, 정책 최적화(PPO 등)에 사용됩니다.

크로스 엔트로피(Cross-Entropy)

실제 분포 $P$ 와 모델 분포 $Q$ 사이의 크로스 엔트로피:

$H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)$

$H(P)$ 는 상수이므로 Cross-Entropy를 최소화하는 것은 KL Divergence를 최소화하는 것과 동치입니다.

상호 정보량(Mutual Information)

두 확률변수가 공유하는 정보:

$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

특징 선택, 표현 학습의 이론적 기반이 됩니다.
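Mutual information can be computed directly from a joint probability table. The sketch below uses $I(X;Y) = H(X) - H(X|Y)$ together with the standard identity $H(X|Y) = H(X,Y) - H(Y)$; the function names and the two small joint tables are assumptions chosen for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a discrete distribution (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) in bits from a joint table with rows = x, columns = y."""
    joint = np.asarray(joint, dtype=float)
    h_x = entropy_bits(joint.sum(axis=1))   # marginal over y
    h_y = entropy_bits(joint.sum(axis=0))   # marginal over x
    h_xy = entropy_bits(joint)              # joint entropy H(X, Y)
    h_x_given_y = h_xy - h_y                # H(X|Y) = H(X,Y) - H(Y)
    return h_x - h_x_given_y

# Independent fair bits: knowing Y says nothing about X
independent = np.outer([0.5, 0.5], [0.5, 0.5])
# Perfectly coupled bits: Y determines X
identical = np.array([[0.5, 0.0],
                      [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
print(f"I(X;Y), independent: {mi_independent:.4f} bits")
print(f"I(X;Y), identical:   {mi_identical:.4f} bits")
```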

import numpy as np

def entropy(p):
    """이산 분포의 엔트로피 계산"""
    p = np.array(p)
    p = p[p > 0]  # 0 제거 (0*log(0) = 0으로 취급)
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q, eps=1e-15):
    """KL Divergence D_KL(P||Q) 계산"""
    p = np.array(p) + eps
    q = np.array(q) + eps
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log(p / q))

def cross_entropy_distributions(p, q, eps=1e-15):
    """분포 간 Cross-Entropy 계산"""
    p = np.array(p)
    q = np.array(q) + eps
    return -np.sum(p * np.log(q))

# 균등 분포 (최대 엔트로피)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"균등 분포 엔트로피: {entropy(uniform):.4f} bits")  # 2.0 bits

# 집중된 분포 (낮은 엔트로피)
concentrated = [0.97, 0.01, 0.01, 0.01]
print(f"집중 분포 엔트로피: {entropy(concentrated):.4f} bits")

# KL Divergence
true_dist = [0.4, 0.3, 0.2, 0.1]
model_dist = [0.35, 0.35, 0.15, 0.15]
kl = kl_divergence(true_dist, model_dist)
print(f"\nKL Divergence: {kl:.4f}")

# Cross-Entropy = H(P) + KL(P||Q)
ce = cross_entropy_distributions(true_dist, model_dist)
h_p = entropy(true_dist) * np.log(2)  # nats로 변환
print(f"Cross-Entropy: {ce:.4f}")
print(f"H(P) + KL: {h_p + kl:.4f}")

4. 수치 해석(Numerical Methods)

딥러닝 구현에서 수학적 이론과 실제 컴퓨터 연산 사이의 간극을 이해하는 것은 중요합니다.

4.1 부동소수점 표현

컴퓨터에서 실수는 부동소수점으로 표현됩니다.

FP32 (단정밀도)

32비트: 부호 1비트, 지수 8비트, 가수 23비트
정밀도: 약 7자리 십진수
범위: 약 $\pm 3.4 \times 10^{38}$

FP16 (반정밀도)

16비트: 부호 1비트, 지수 5비트, 가수 10비트
정밀도: 약 3자리 십진수
GPU 메모리 절약, 속도 향상 (Mixed Precision Training)
단점: 오버플로우/언더플로우 위험

BF16 (Brain Float 16)

16비트: 부호 1비트, 지수 8비트, 가수 7비트
FP32와 같은 지수 범위 (오버플로우 위험 낮음)
Google TPU에서 도입, 현재 AI 훈련에 널리 사용

import numpy as np

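# Floating-point precision: 0.1 has no exact binary representation
a = np.float32(0.1)
print(f"float32(0.1) is stored as {float(a)!r}")  # 0.10000000149011612
print(f"float64: 0.1 + 0.2 = {0.1 + 0.2!r}")      # 0.30000000000000004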

# 수치 오버플로우
x_large = np.float16(65000)
print(f"FP16 큰 값: {x_large}")
overflow = np.float16(70000)
print(f"FP16 오버플로우: {overflow}")  # inf

# numpy dtype 비교
for dtype in [np.float16, np.float32, np.float64]:
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: max={info.max:.3e}, min_pos={info.tiny:.3e}")

4.2 수치 안정성

Softmax의 수치 불안정성과 해결

Softmax 함수:

$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$

큰 값이 입력되면 $e^{x_i}$ 가 오버플로우를 일으킵니다.

안정적인 Softmax 구현:

$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$

수학적으로 동치이지만 수치적으로 훨씬 안정합니다.

import numpy as np

def naive_softmax(x):
    """불안정한 Softmax"""
    return np.exp(x) / np.sum(np.exp(x))

def stable_softmax(x):
    """수치적으로 안정한 Softmax"""
    x_shifted = x - np.max(x)  # 최댓값을 빼서 안정화
    return np.exp(x_shifted) / np.sum(np.exp(x_shifted))

# 일반적인 경우
x_normal = np.array([1.0, 2.0, 3.0])
print("일반 입력:")
print(f"  Naive: {naive_softmax(x_normal)}")
print(f"  Stable: {stable_softmax(x_normal)}")

# 큰 값의 경우
x_large = np.array([1000.0, 2000.0, 3000.0])
print("\n큰 입력:")
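try:
    # NumPy only emits warnings by default; errstate turns them into exceptions
    with np.errstate(over="raise", invalid="raise"):
        result = naive_softmax(x_large)
    print(f"  Naive: {result}")
except FloatingPointError:
    print("  Naive: overflow!")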
print(f"  Stable: {stable_softmax(x_large)}")

# Log-Softmax (Cross-Entropy와 함께 사용)
def log_softmax(x):
    x_shifted = x - np.max(x)
    return x_shifted - np.log(np.sum(np.exp(x_shifted)))

logits = np.array([2.0, 1.0, 0.1])
print(f"\nLog-Softmax: {log_softmax(logits)}")

5. 선형 회귀의 수학적 분석

선형 회귀는 딥러닝의 기본 블록입니다. 수학적으로 철저히 이해하면 더 복잡한 모델을 이해하는 기초가 됩니다.

5.1 최소제곱법(Ordinary Least Squares)

데이터 $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ 에 대해 선형 모델 $\hat{y} = \mathbf{w}^T \mathbf{x} + b$ 를 학습합니다.

손실 함수(Mean Squared Error):

$L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n} \|y - X\mathbf{w}\|^2$

정규 방정식(Normal Equation)

행렬 형태로 손실 함수를 미분하고 0으로 놓으면:

$\nabla_\mathbf{w} L = -\frac{2}{n} X^T (y - X\mathbf{w}) = 0$

$\Rightarrow X^T X \mathbf{w} = X^T y$

$\Rightarrow \hat{\mathbf{w}} = (X^T X)^{-1} X^T y$

이것이 최소제곱 해입니다. 행렬 $X^T X$ 가 가역(invertible)이어야 하며, 이는 특징들이 선형 독립일 때 성립합니다.

5.2 Ridge와 Lasso 회귀

Ridge 회귀(L2 정규화)

$L_{ridge}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2$

정규 방정식: $\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T y$

Adding $\lambda I$ with $\lambda > 0$ makes $X^T X + \lambda I$ always invertible. It also keeps the weights close to zero, which helps prevent overfitting.

Lasso 회귀(L1 정규화)

$L_{lasso}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1$

L1 정규화는 일부 가중치를 정확히 0으로 만드는 희소 해(sparse solution)를 유도합니다. 이는 자동 특징 선택(feature selection)의 효과가 있습니다.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# 데이터 생성
np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.0, 0.0])
y = X @ true_w + np.random.randn(n) * 0.5

# OLS via least squares (lstsq matches the normal-equation solution when X has full column rank)
X_b = np.column_stack([np.ones(n), X])  # 편향 항 추가
w_ols = np.linalg.lstsq(X_b, y, rcond=None)[0]
print("OLS 계수 (편향 제외):", w_ols[1:].round(2))

# Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge 계수:", ridge.coef_.round(2))

# Lasso (희소 해)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso 계수:", lasso.coef_.round(2))
print("0이 된 계수 수:", (np.abs(lasso.coef_) < 0.01).sum())

# 정규 방정식 직접 구현
def ridge_normal_equation(X, y, lam=1.0):
    n, d = X.shape
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y
    return np.linalg.solve(A, b)

w_ridge = ridge_normal_equation(X, y, lam=1.0)
print("\n정규 방정식 Ridge:", w_ridge.round(2))

5.3 기울기의 기하학적 의미

등고선과 그래디언트의 관계

그래디언트 $\nabla f(\mathbf{x})$ 는 항상 등고선에 수직(직교)합니다. 경사 하강법에서 이 방향의 반대로 이동하면 가장 빠르게 함수값을 줄일 수 있습니다.

정규화의 기하학적 해석

Ridge 회귀는 $\|\mathbf{w}\|_2^2 \leq t$ 구 안에서 MSE를 최소화하는 것과 동치입니다. 원형 제약 조건이므로 희소 해를 만들지 않습니다.

Lasso는 $\|\mathbf{w}\|_1 \leq t$ 다이아몬드 모양 안에서 MSE를 최소화합니다. 다이아몬드의 꼭짓점에서 해가 자주 발생하므로 희소 해가 나타납니다.

6. 종합: 딥러닝 수학 통합

6.1 신경망의 수학적 관점

완전 연결 신경망 $L$ 개 레이어:

$\mathbf{h}^{(0)} = \mathbf{x}$

$\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$

$\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$

$\hat{y} = \mathbf{h}^{(L)}$

순전파: 선형대수(행렬 곱) + 활성화 함수

손실 계산: 정보 이론(Cross-Entropy) or 통계(MSE)

역전파: 미적분(체인 룰) → 그래디언트 계산

파라미터 업데이트: 최적화 이론(경사 하강법, Adam)
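The forward-pass recursion above can be sketched in a few lines of NumPy. The layer sizes, the random initialization, and the choice of `np.tanh` as the activation $\sigma$ are assumptions for this example; as in the equations, the activation is applied at every layer, including the last one.

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Forward pass: h(0) = x, z(l) = W(l) h(l-1) + b(l), h(l) = sigma(z(l))."""
    h = x
    for W, b in zip(weights, biases):
        z = W @ h + b        # linear algebra: matrix-vector product
        h = activation(z)    # element-wise non-linearity
    return h                 # y_hat = h(L)

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]  # input width, two hidden widths, output width (arbitrary)
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

y_hat = forward(np.array([1.0, 2.0, 3.0]), weights, biases)
print(f"Number of layers: {len(weights)}")
print(f"Output shape: {y_hat.shape}")
```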

6.2 Attention 메커니즘과 행렬 연산

트랜스포머의 Attention:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

$Q, K, V$ : 쿼리, 키, 값 행렬 (선형대수)
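$\frac{1}{\sqrt{d_k}}$ scaling: keeps the dot products from growing with $d_k$, so the softmax does not saturate and gradients stay usable (numerical methods)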
Softmax: 확률 분포 변환 (확률 이론)
내적 $QK^T$ : 유사도 측정 (벡터 내적)

import numpy as np

def attention(Q, K, V, mask=None):
    """
    Scaled Dot-Product Attention
    Q, K, V: (seq_len, d_k) 또는 (batch, heads, seq_len, d_k)
    """
    d_k = Q.shape[-1]

    # 스케일된 내적 유사도
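    # Transpose only the last two axes so batched (4-D) inputs also work;
    # K.T would reverse every axis
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (..., seq_len, seq_len)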

    # 마스킹 (선택적)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)

    # Softmax로 확률 분포 변환
    attention_weights = stable_softmax_2d(scores)

    # 가중합
    output = attention_weights @ V

    return output, attention_weights

def stable_softmax_2d(x):
    x_max = x.max(axis=-1, keepdims=True)
    x_shifted = x - x_max
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

# 예제: 3개 토큰, d_k=4
seq_len, d_k = 3, 4
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = attention(Q, K, V)
print(f"Attention 출력 shape: {output.shape}")
print(f"Attention 가중치 (합 = 1):\n{weights.round(3)}")
print(f"각 행의 합: {weights.sum(axis=-1)}")

마무리: 다음 단계

이 가이드에서는 AI/ML의 핵심 수학 기초를 살펴봤습니다. 정리하면:

선형대수: 데이터와 모델 파라미터를 표현하고 변환하는 언어
미적분: 모델이 "어떻게 배우는가"를 설명하는 도구
확률과 통계: 불확실성을 다루고, 손실 함수의 이론적 근거를 제공
수치 해석: 이론을 실제 컴퓨터에서 안정적으로 구현하기 위한 지식

다음 학습 단계를 추천합니다:

Gilbert Strang의 선형대수: MIT OCW에서 무료 제공
Pattern Recognition and Machine Learning (Bishop): 확률론적 관점의 ML
Deep Learning (Goodfellow et al.): 딥러닝 이론의 바이블
실습: NumPy로 신경망 바닥부터 구현해보기

수학은 이해의 도구입니다. 모든 수식을 외울 필요는 없지만, 각 개념이 왜 존재하는지, 딥러닝의 어느 부분에서 어떻게 작동하는지를 이해하면 훨씬 강력한 실무 능력을 갖출 수 있습니다.

참고 자료

Gilbert Strang, Introduction to Linear Algebra, 6th Edition
Ian Goodfellow et al., Deep Learning, MIT Press (2016)
Christopher Bishop, Pattern Recognition and Machine Learning, Springer (2006)
3Blue1Brown YouTube - Essence of Linear Algebra, Essence of Calculus
fast.ai - Practical Deep Learning for Coders (수학 직관 설명)

Mathematical Foundations for AI/ML: Complete Guide - Linear Algebra, Calculus, Probability

Mathematical Foundations for AI/ML: Complete Guide

To truly understand machine learning and deep learning, you need to grasp the underlying mathematics. Many developers use frameworks without fully understanding why the math works the way it does. This guide aims to fill that gap.

Rather than listing formulas, we focus on why each concept matters and how it is used in deep learning, accompanied by Python/NumPy code you can run to verify the math yourself.

1. Linear Algebra

Linear algebra is the fundamental language of deep learning. The forward pass of a neural network is a sequence of matrix multiplications, and embeddings are points in a high-dimensional vector space.

1.1 Vectors

Definition and Geometric Meaning

A vector is a quantity with both magnitude and direction. Mathematically it is an ordered list of numbers; geometrically it is an arrow pointing to a location in $n$ -dimensional space.

$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$

In deep learning, vectors appear everywhere: input data, hidden-layer activations, embedding representations — all are vectors.

Vector Addition and Scalar Multiplication

Addition is performed element-wise:

$\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{bmatrix}$

Multiplying by scalar $c$ scales each component:

$c \cdot \mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{bmatrix}$

Dot Product

The dot product measures the "similarity" between two vectors:

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$

Geometrically, it equals the product of magnitudes and the cosine of the angle $\theta$ between them:

$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$

A positive dot product indicates vectors pointing in the same general direction; negative means opposite; zero means orthogonal.

Cosine Similarity

To measure directional similarity independent of magnitude, normalize by both norms:

$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$

The value ranges from $-1$ to $1$, where $1$ means exactly the same direction and $-1$ means exactly opposite directions. It is widely used to compare word embeddings and document representations in NLP.

Vector Norms

Norms measure the "length" or "size" of a vector:

$L_1$ norm (Manhattan): $\|\mathbf{v}\|_1 = \sum_i |v_i|$
$L_2$ norm (Euclidean): $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$
$L_\infty$ norm (max norm): $\|\mathbf{v}\|_\infty = \max_i |v_i|$

In regularization, $L_1$ encourages sparsity while $L_2$ keeps weights uniformly small.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Dot product
dot_product = np.dot(a, b)       # 3*1 + 4*2 = 11
print(f"Dot product: {dot_product}")

# L2 norm
l2_norm_a = np.linalg.norm(a)   # sqrt(9+16) = 5
print(f"L2 norm of a: {l2_norm_a}")

# Cosine similarity
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine_sim:.4f}")

# Various norms
v = np.array([3.0, -4.0, 1.0])
print(f"L1 norm: {np.linalg.norm(v, ord=1)}")         # 8.0
print(f"L2 norm: {np.linalg.norm(v, ord=2):.4f}")     # 5.099
print(f"Linf norm: {np.linalg.norm(v, ord=np.inf)}") # 4.0

1.2 Matrices

Matrix Notation

A matrix is a rectangular array of numbers with $m$ rows and $n$ columns ( $m \times n$ matrix):

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$

In deep learning, the weight matrix $W$ represents the linear transformation between layers.

Matrix Multiplication

If $A$ is $m \times k$ and $B$ is $k \times n$ , then $C = AB$ is $m \times n$ :

$C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}$

The key rule: columns of the left matrix must equal rows of the right matrix. Matrix multiplication is generally not commutative: $AB \neq BA$ .
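Both points (the shape rule and non-commutativity) are easy to see on a tiny example. The matrices below are hypothetical and chosen only so that the two products differ visibly.

```python
import numpy as np

# Hypothetical example matrices, for illustration only
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

AB = A @ B   # multiplying by B on the right swaps the columns of A
BA = B @ A   # multiplying by B on the left swaps the rows of A
print("AB:\n", AB)
print("BA:\n", BA)
print("AB == BA?", np.array_equal(AB, BA))

# Shape rule: (m x k) @ (k x n) -> (m x n)
P = np.ones((2, 3))
Q = np.ones((3, 4))
print("P @ Q shape:", (P @ Q).shape)

# Mismatched inner dimensions raise an error
try:
    Q @ P   # (3 x 4) @ (2 x 3): inner dimensions 4 and 2 do not match
    shape_error = False
except ValueError:
    shape_error = True
print("Q @ P raises ValueError:", shape_error)
```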

Transpose

Swap rows and columns: $(A^T)_{ij} = A_{ji}$ .

Property: $(AB)^T = B^T A^T$

Inverse Matrix

For a square matrix $A$ , its inverse $A^{-1}$ satisfies $AA^{-1} = A^{-1}A = I$ . An inverse exists only when the determinant is non-zero.

Determinant

For a $2 \times 2$ matrix:

$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$

The determinant tells you how much the linear transformation stretches or compresses space. $\det(A) = 0$ means the transformation collapses dimensions — no inverse exists.

Rank

The rank of a matrix is the maximum number of linearly independent rows (or columns). A lower rank means the transformation produces a lower-dimensional output space. Low-rank approximation is central to model compression and to parameter-efficient fine-tuning methods like LoRA.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

B = np.array([[9, 8, 7],
              [6, 5, 4],
              [3, 2, 1]])

# Matrix multiplication
C = np.dot(A, B)
print("Matrix product:\n", C)

# Transpose
print("Transpose of A:\n", A.T)

# Determinant
A2 = np.array([[3.0, 1.0],
               [2.0, 4.0]])
print(f"Determinant: {np.linalg.det(A2):.2f}")  # 3*4 - 1*2 = 10

# Inverse
A_inv = np.linalg.inv(A2)
print("Inverse:\n", A_inv)
print("A @ A_inv (should be identity):\n", np.round(A2 @ A_inv))

# Rank
print(f"Rank of A: {np.linalg.matrix_rank(A)}")  # 2 (rows are linearly dependent)

1.3 Eigenvalues and Eigenvectors

Definition

For a square matrix $A$ , a non-zero vector $\mathbf{v}$ and scalar $\lambda$ satisfying:

$A\mathbf{v} = \lambda \mathbf{v}$

are called the eigenvector and eigenvalue, respectively.

The meaning is powerful: when $A$ transforms $\mathbf{v}$ , the vector stays on the same line through the origin. It is only scaled by a factor of $\lambda$ (and flipped if $\lambda$ is negative).

Eigendecomposition

For a real symmetric matrix, the eigenvalues are real and the eigenvectors can be chosen to be orthonormal, which gives the decomposition:

$A = Q \Lambda Q^T$

where $Q$ contains eigenvectors as columns and $\Lambda$ is the diagonal matrix of eigenvalues.
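The decomposition can be verified numerically. The sketch below uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, on a small matrix with arbitrary example values.

```python
import numpy as np

# A small symmetric matrix (arbitrary example values)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh returns real eigenvalues in ascending order and orthonormal eigenvectors
eigenvalues, Q = np.linalg.eigh(A)
Lambda = np.diag(eigenvalues)

reconstructed = Q @ Lambda @ Q.T
q_is_orthogonal = np.allclose(Q.T @ Q, np.eye(2))
reconstruction_ok = np.allclose(reconstructed, A)

print("Q^T Q is the identity:", q_is_orthogonal)
print("Q Lambda Q^T equals A:", reconstruction_ok)
```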

Relationship to PCA

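The principal components are the eigenvectors of the covariance matrix $\Sigma = \frac{1}{n} X^T X$ (or $\frac{1}{n-1} X^T X$ for the unbiased estimate), where $X$ is the data matrix with each feature column mean-centered. The eigenvector with the largest eigenvalue points in the direction of greatest variance, and that eigenvalue is the variance captured along it.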

SVD (Singular Value Decomposition)

While eigendecomposition applies only to square (diagonalizable) matrices, SVD exists for any matrix:

$A = U \Sigma V^T$

$U$ : $m \times m$ orthogonal matrix (left singular vectors)
$\Sigma$ : $m \times n$ diagonal matrix (singular values, non-negative)
$V^T$ : $n \times n$ orthogonal matrix (right singular vectors)

SVD underpins low-rank matrix approximation, noise reduction, recommendation systems, and LSA, and it motivates the low-rank updates used by LoRA for parameter-efficient fine-tuning.

import numpy as np

# Eigenvalues and eigenvectors
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):\n", eigenvectors)

# Verify: A*v = lambda*v
v0 = eigenvectors[:, 0]
lam0 = eigenvalues[0]
print("A*v:", A @ v0)
print("lambda*v:", lam0 * v0)

# SVD
M = np.array([[1, 2, 3],
              [4, 5, 6]])

U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"\nU shape: {U.shape}")
print(f"Singular values: {S}")
print(f"Vt shape: {Vt.shape}")

# Rank-1 approximation
rank1_approx = np.outer(S[0] * U[:, 0], Vt[0, :])
print("Original matrix:\n", M)
print("Rank-1 approximation:\n", rank1_approx.round(2))

# PCA example
np.random.seed(42)
data = np.random.randn(100, 2)
data[:, 1] = data[:, 0] * 0.8 + data[:, 1] * 0.2

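cov = np.cov(data.T)
# The covariance matrix is symmetric, so eigh (real eigenvalues,
# orthonormal eigenvectors) is the appropriate routine
eigenvals, eigenvecs = np.linalg.eigh(cov)
idx = np.argsort(eigenvals)[::-1]  # indices sorted by descending eigenvalue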
print("\nFirst principal component:", eigenvecs[:, idx[0]])
print("Explained variance ratio:", eigenvals[idx[0]] / eigenvals.sum())

1.4 Linear Algebra in Deep Learning

Layers as Linear Transformations

A fully connected layer is fundamentally a linear transformation followed by a bias shift (together, an affine map):

$\mathbf{y} = W\mathbf{x} + \mathbf{b}$

where $\mathbf{x} \in \mathbb{R}^n$ , $W \in \mathbb{R}^{m \times n}$ , $\mathbf{b} \in \mathbb{R}^m$ , $\mathbf{y} \in \mathbb{R}^m$ .

Without non-linear activation functions, stacking multiple layers reduces to a single linear transformation — which is why activations like ReLU and Sigmoid are essential.
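The collapse of stacked linear layers can be shown directly: two layers without an activation equal one layer with weight $W_2 W_1$ and bias $W_2 \mathbf{b}_1 + \mathbf{b}_2$. The layer sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with no activation (arbitrary sizes: 3 -> 4 -> 2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

collapses = np.allclose(two_layers, one_layer)
print("Two linear layers equal one:", collapses)

# With ReLU in between, no single (W, b) reproduces the map for all inputs
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print("Output with ReLU:", with_relu)
```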

Batch Processing

For a batch of $B$ samples processed simultaneously:

$Y = XW^T + \mathbf{b}$

where $X \in \mathbb{R}^{B \times n}$ holds one sample per row, $Y \in \mathbb{R}^{B \times m}$ , and $\mathbf{b}$ is added to every row (broadcast across the batch). GPUs are optimized for exactly this type of large-scale matrix multiplication.
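A short sketch confirms that the batched formula gives the same result as applying $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ to each sample separately. The dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n, m = 5, 3, 2                 # batch size, input dim, output dim
X = rng.standard_normal((batch, n))   # one sample per row
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Batched form: b is broadcast across the rows of X @ W.T
Y_batch = X @ W.T + b

# Per-sample form: y = W x + b, one row at a time
Y_loop = np.stack([W @ X[i] + b for i in range(batch)])

print("Y shape:", Y_batch.shape)
print("Batched == per-sample:", np.allclose(Y_batch, Y_loop))
```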

2. Calculus

Calculus governs how deep learning models learn. Gradient descent — the engine of learning — relies entirely on differentiation.

2.1 Differentiation

Limit Definition of Derivative

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

This is the instantaneous rate of change — the slope of the tangent line at point $x$ .
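The limit can be watched numerically by shrinking $h$ in the difference quotient. The function $f(x) = x^3$ is an arbitrary choice for this sketch; its exact derivative is $3x^2$.

```python
def f(x):
    return x ** 3            # analytic derivative: 3x^2

x0 = 2.0
exact = 3 * x0 ** 2          # 12.0

errors = []
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    # Difference quotient from the limit definition, with a finite h
    approx = (f(x0 + h) - f(x0)) / h
    errors.append(abs(approx - exact))
    print(f"h={h:.0e}  approx={approx:.6f}  error={errors[-1]:.2e}")
```

The error shrinks roughly in proportion to $h$. For much smaller $h$, floating-point rounding eventually dominates and the estimate gets worse again.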

Basic Differentiation Rules

Power rule: $(x^n)' = nx^{n-1}$
Sum rule: $(f+g)' = f' + g'$
Product rule: $(fg)' = f'g + fg'$
Quotient rule: $(f/g)' = (f'g - fg') / g^2$
Chain rule: $[f(g(x))]' = f'(g(x)) \cdot g'(x)$

Derivatives of common deep learning functions:

$\frac{d}{dx}(e^x) = e^x$
$\frac{d}{dx}(\ln x) = \frac{1}{x}$
$\frac{d}{dx}(\sigma(x)) = \sigma(x)(1 - \sigma(x))$ (Sigmoid)
$\frac{d}{dx}(\tanh(x)) = 1 - \tanh^2(x)$
$\frac{d}{dx}(\text{ReLU}(x)) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$ (the derivative is undefined at $x = 0$ ; using 0 there is a common convention)
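The closed forms for the sigmoid and tanh derivatives can be checked against central-difference estimates. This is a sketch; the grid of test points and the step `h` are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
h = 1e-5

# Central-difference estimates of the derivatives
num_sigmoid = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
num_tanh = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)

# Closed forms from the list above
ana_sigmoid = sigmoid(x) * (1 - sigmoid(x))
ana_tanh = 1 - np.tanh(x) ** 2

err_sigmoid = np.max(np.abs(num_sigmoid - ana_sigmoid))
err_tanh = np.max(np.abs(num_tanh - ana_tanh))
print(f"max |error| sigmoid: {err_sigmoid:.2e}")
print(f"max |error| tanh:    {err_tanh:.2e}")
```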

Partial Derivatives

For a multi-variable function, differentiate with respect to one variable while treating all others as constants:

$\frac{\partial f}{\partial x_i}$

For $f(x, y) = x^2 + 3xy + y^2$ :

$\frac{\partial f}{\partial x} = 2x + 3y, \quad \frac{\partial f}{\partial y} = 3x + 2y$

Gradient

The gradient collects all partial derivatives into a vector:

$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$

The gradient points in the direction of steepest ascent. To minimize a loss function, move in the opposite direction (gradient descent).

Jacobian Matrix

For a vector function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ , the Jacobian collects all first-order partial derivatives:

$J_{ij} = \frac{\partial f_i}{\partial x_j}$
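As an illustration, the sketch below estimates the Jacobian of a small map $\mathbb{R}^2 \to \mathbb{R}^3$ by central differences and compares it with the hand-derived result. The function `F` and the helper `numerical_jacobian` are made up for this example.

```python
import numpy as np

def F(x):
    # Example map R^2 -> R^3, chosen only for illustration
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def numerical_jacobian(func, x, h=1e-6):
    """Central-difference Jacobian: J[i, j] = d f_i / d x_j."""
    m = len(func(x))
    J = np.zeros((m, len(x)))
    for j in range(len(x)):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (func(x + step) - func(x - step)) / (2 * h)
    return J

x0 = np.array([1.0, 2.0])
J_num = numerical_jacobian(F, x0)

# Analytic Jacobian: one row per output component, one column per input
J_exact = np.array([[x0[1],         x0[0]],
                    [np.cos(x0[0]), 0.0],
                    [0.0,           2 * x0[1]]])

print("Jacobian shape:", J_num.shape)
print("Matches analytic result:", np.allclose(J_num, J_exact, atol=1e-6))
```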

Hessian Matrix

The matrix of second-order partial derivatives for a scalar function:

$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$

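At a critical point (where $\nabla f = \mathbf{0}$ ), the eigenvalues of the Hessian indicate curvature: all positive means a local minimum, all negative means a local maximum, and mixed signs indicate a saddle point. If some eigenvalue is zero, the test is inconclusive.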
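The eigenvalue test can be sketched for two simple functions whose Hessians are constant. Both functions and the helper `classify` are illustrative choices; each function has its critical point at the origin.

```python
import numpy as np

# f(x, y) = x^2 - y^2  ->  Hessian [[2, 0], [0, -2]]
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])

# g(x, y) = x^2 + 2y^2  ->  Hessian [[2, 0], [0, 4]]
H_bowl = np.array([[2.0, 0.0],
                   [0.0, 4.0]])

def classify(H):
    """Classify a critical point from the eigenvalues of its Hessian."""
    eig = np.linalg.eigvalsh(H)      # Hessians are symmetric
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"
    return "inconclusive"

print("x^2 - y^2 :", classify(H_saddle))
print("x^2 + 2y^2:", classify(H_bowl))
```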

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Compute gradient numerically via central differences"""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# f(x, y) = x^2 + 2y^2
def f(x):
    return x[0]**2 + 2 * x[1]**2

x0 = np.array([3.0, 4.0])
grad = numerical_gradient(f, x0)
print(f"Numerical gradient: {grad}")          # [6.0, 16.0]
print(f"Analytic gradient: [{2*x0[0]}, {4*x0[1]}]")

# Automatic differentiation with PyTorch
try:
    import torch
    x = torch.tensor([3.0, 4.0], requires_grad=True)
    y = x[0]**2 + 2 * x[1]**2
    y.backward()
    print(f"PyTorch gradient: {x.grad}")
except ImportError:
    print("PyTorch not installed")

2.2 Chain Rule and Backpropagation

Chain Rule

For composite functions, differentiation follows:

$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$

Generalized to multiple variables:

$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}$

Backpropagation

Training a neural network requires computing $\frac{\partial L}{\partial w}$ for every weight $w$ . Since a network is a composition of many functions, we repeatedly apply the chain rule — from the output layer back to the input layer.

For a 2-layer network:

$L = \ell\bigl(\text{softmax}(W_2 \cdot \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2), \mathbf{y}\bigr)$

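Writing $\mathbf{h} = \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1)$ for the hidden activation and $\mathbf{a}_2 = W_2 \mathbf{h} + \mathbf{b}_2$ for the output pre-activation, computing $\frac{\partial L}{\partial W_1}$ requires chaining gradients through each operation: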

$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{a}_2} \cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial W_1}$

import numpy as np

class SimpleNet:
    def __init__(self):
        self.W1 = np.random.randn(4, 3) * 0.1
        self.b1 = np.zeros(4)
        self.W2 = np.random.randn(2, 4) * 0.1
        self.b2 = np.zeros(2)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(float)

    def forward(self, x):
        self.x = x
        self.z1 = self.W1 @ x + self.b1
        self.h = self.relu(self.z1)
        self.z2 = self.W2 @ self.h + self.b2
        return self.z2

    def backward(self, dL_dz2):
        # Layer 2 gradients
        dL_dW2 = np.outer(dL_dz2, self.h)
        dL_db2 = dL_dz2
        dL_dh = self.W2.T @ dL_dz2

        # ReLU gradient
        dL_dz1 = dL_dh * self.relu_grad(self.z1)

        # Layer 1 gradients
        dL_dW1 = np.outer(dL_dz1, self.x)
        dL_db1 = dL_dz1

        return dL_dW1, dL_db1, dL_dW2, dL_db2

net = SimpleNet()
x = np.array([1.0, 2.0, 3.0])
output = net.forward(x)
dL_dout = np.array([1.0, -1.0])
grads = net.backward(dL_dout)
print(f"W1 gradient shape: {grads[0].shape}")
print(f"W2 gradient shape: {grads[2].shape}")
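A common way to gain confidence in a hand-written backward pass is a gradient check: compare the chain-rule gradient with a finite-difference estimate. The sketch below is standalone and assumes a squared-error loss with a made-up `target`, neither of which appears in the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)) * 0.5, np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)) * 0.5, np.zeros(2)
x = np.array([1.0, 2.0, 3.0])
target = np.array([0.5, -0.5])        # hypothetical regression target

def loss(W1_):
    h = np.maximum(0, W1_ @ x + b1)
    z2 = W2 @ h + b2
    return 0.5 * np.sum((z2 - target) ** 2)

# Analytic gradient via the chain rule (backpropagation)
z1 = W1 @ x + b1
h = np.maximum(0, z1)
z2 = W2 @ h + b2
dL_dz2 = z2 - target                  # derivative of the squared error
dL_dh = W2.T @ dL_dz2
dL_dz1 = dL_dh * (z1 > 0)             # ReLU passes gradient only where z1 > 0
dL_dW1 = np.outer(dL_dz1, x)

# Numerical gradient: perturb one weight at a time
eps = 1e-6
num_grad = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        num_grad[i, j] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

max_diff = np.max(np.abs(num_grad - dL_dW1))
print(f"Max difference between backprop and numerical gradient: {max_diff:.2e}")
```

A difference many orders of magnitude below the gradient values themselves suggests the backward pass is implemented correctly.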

2.3 Optimization Theory

Optimality Conditions

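First-order (necessary): at an interior extremum of a differentiable function, $\nabla f = \mathbf{0}$
Second-order (sufficient): at such a critical point, a positive definite Hessian (all eigenvalues positive) guarantees a strict local minimum, and a negative definite Hessian guarantees a strict local maximum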

Convex Functions

A function is convex if:

$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \lambda \in [0,1]$

For convex functions, any local minimum is a global minimum. Deep learning loss functions are generally non-convex, so global optimality cannot be guaranteed.
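The convexity inequality can be probed numerically by sampling random points. This sketch can find violations but cannot prove convexity. The two test functions and the helper `violates_convexity` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def convex_f(x):
    return x ** 2

def nonconvex_f(x):
    return np.sin(3 * x) + 0.1 * x ** 2   # arbitrary non-convex example

def violates_convexity(f, trials=10_000):
    """Search random (x, y, lambda) triples for a violation of the inequality."""
    x = rng.uniform(-3, 3, trials)
    y = rng.uniform(-3, 3, trials)
    lam = rng.uniform(0, 1, trials)
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    # Small tolerance so floating-point rounding is not reported as a violation
    return bool(np.any(lhs > rhs + 1e-12))

print("x^2 violates the inequality:", violates_convexity(convex_f))
print("sin(3x) + 0.1x^2 violates it:", violates_convexity(nonconvex_f))
```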

Gradient Descent

$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$

where $\alpha$ is the learning rate. Too large and the updates diverge; too small and convergence is slow.

Common variants:

SGD (Stochastic Gradient Descent): uses a random mini-batch each step
Momentum: accumulates past gradient directions for acceleration
Adam: adaptive learning rates per parameter
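As a sketch of the momentum idea, the update below keeps a running `velocity` that accumulates past gradients. It uses the exact gradient of an illustrative quadratic rather than a mini-batch estimate, and the hyperparameters `lr` and `beta` are assumed values, not recommendations.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x, y) = x^2 + 4y^2 (illustrative quadratic)
    return np.array([2 * x[0], 8 * x[1]])

def gd_momentum(grad_f, init_x, lr=0.05, beta=0.9, steps=200):
    x = init_x.copy()
    velocity = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x)
        velocity = beta * velocity + g    # accumulate past gradient directions
        x = x - lr * velocity             # step along the accumulated direction
    return x

x_final = gd_momentum(grad_f, np.array([3.0, 3.0]))
print("Result after 200 steps:", x_final)
print("Distance from the minimum at 0:", np.linalg.norm(x_final))
```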

Adam update rule:

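$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$
$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

Here $g_t$ is the gradient at step $t$ , and $\hat{m}_t$ , $\hat{v}_t$ are bias-corrected estimates that compensate for initializing $m_0 = v_0 = 0$ .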

import numpy as np

def gradient_descent(grad_f, init_x, lr=0.1, steps=100):
    x = init_x.copy()
    history = [x.copy()]
    for _ in range(steps):
        grad = grad_f(x)
        x -= lr * grad
        history.append(x.copy())
    return x, np.array(history)

def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2*x[0], 8*x[1]])

init_x = np.array([3.0, 3.0])
final_x, history = gradient_descent(grad_f, init_x, lr=0.1, steps=50)
print(f"Start: {init_x}, Converged to: {final_x.round(6)}")

def adam_optimizer(grad_f, init_x, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    x = init_x.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

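# Each Adam step moves a parameter by roughly alpha at most, so the defaults
# (alpha=0.01, 100 steps) cannot travel from 3.0 to the minimum at 0.
final_adam = adam_optimizer(grad_f, init_x, alpha=0.1, steps=500)
print(f"Adam result after 500 steps: {final_adam.round(6)}")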

3. Probability and Statistics

Probability and statistics provide the language for handling uncertainty in deep learning. Loss function design, regularization, and model evaluation all rely on probabilistic thinking.

3.1 Probability Basics

Kolmogorov Axioms

$P(A) \geq 0$ (non-negativity)
$P(\Omega) = 1$ (total probability equals 1)
For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$ (additivity)

Conditional Probability

The probability of $A$ given that $B$ has occurred:

$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$

Bayes' Theorem

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

Bayes' theorem formalizes "updating beliefs in the light of evidence":

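$P(A)$ : Prior, the belief before seeing evidence
$P(B|A)$ : Likelihood, the probability of the evidence under hypothesis $A$
$P(B)$ : Evidence (marginal likelihood), the overall probability of the evidence, which normalizes the result
$P(A|B)$ : Posterior, the updated belief after seeing evidence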

This underlies Bayesian classifiers, Bayesian neural networks, and probabilistic model evaluation.
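A worked example makes the update concrete. The numbers describe a hypothetical diagnostic test and are chosen only for illustration.

```python
# Hypothetical numbers, chosen only for illustration
p_disease = 0.01              # prior P(A)
p_pos_given_disease = 0.95    # likelihood P(B|A)
p_pos_given_healthy = 0.05    # false positive rate P(B|not A)

# Evidence P(B) via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B) from Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # 0.161
```

Even with a fairly accurate test, the posterior stays low because the prior is small: most positive results come from the much larger healthy group.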

3.2 Probability Distributions

Discrete Distributions

Bernoulli Distribution

Single trial with outcome 0 or 1:

$P(X=k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}$

$E[X] = p$ , $Var(X) = p(1-p)$

Binomial Distribution

Number of successes in $n$ independent Bernoulli trials, each with success probability $p$ :

$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$

Poisson Distribution

Number of events in a fixed interval of time or space, where $\lambda$ is the average number of events per interval:

$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$

$E[X] = Var(X) = \lambda$

Continuous Distributions

Uniform Distribution

Equal probability density over $[a, b]$ :

$f(x) = \frac{1}{b-a}, \quad a \leq x \leq b$

Normal (Gaussian) Distribution

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

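Characterized by mean $\mu$ and standard deviation $\sigma$ . By the Central Limit Theorem, the sum or average of many independent random variables with finite variance is approximately normal, which is why many aggregate quantities are well modeled by this distribution. It is used extensively in weight initialization and noise modeling.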
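The Central Limit Theorem can be observed in a small simulation: averages of uniform samples, which are not bell-shaped individually, behave like a normal variable. The sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each row holds n draws from Uniform(0, 1)
n, trials = 50, 20_000
samples = rng.uniform(0.0, 1.0, size=(trials, n))
means = samples.mean(axis=1)

# Theory: the sample mean has mean 0.5 and std sqrt(1/12) / sqrt(n)
predicted_std = np.sqrt(1 / 12) / np.sqrt(n)
print(f"Empirical mean of sample means: {means.mean():.4f}")
print(f"Empirical std: {means.std():.4f}, predicted: {predicted_std:.4f}")

# For a normal distribution, about 68% of values lie within one std of the mean
within_one_std = np.mean(np.abs(means - 0.5) < predicted_std)
print(f"Fraction within one std: {within_one_std:.3f}")
```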

Multivariate Normal Distribution

$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$

where $\boldsymbol{\mu}$ is the mean vector and $\Sigma$ is the covariance matrix. Central to VAE latent space modeling.

import numpy as np
from scipy import stats

# Normal distribution
mu, sigma = 0, 1
std_normal = stats.norm(loc=mu, scale=sigma)
print(f"P(-1 < X < 1) = {std_normal.cdf(1) - std_normal.cdf(-1):.4f}")  # 0.6827
print(f"P(-2 < X < 2) = {std_normal.cdf(2) - std_normal.cdf(-2):.4f}")  # 0.9545
print(f"P(-3 < X < 3) = {std_normal.cdf(3) - std_normal.cdf(-3):.4f}")  # 0.9973

# Multivariate normal sampling
mean = np.array([0, 0])
cov = np.array([[1, 0.7],
                [0.7, 1]])
samples = np.random.multivariate_normal(mean, cov, size=1000)
print(f"\nSample shape: {samples.shape}")
print(f"Sample mean: {samples.mean(axis=0).round(3)}")
print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

# Binomial distribution
n, p = 10, 0.5
print(f"\nBinomial B(10, 0.5): P(X=5) = {stats.binom.pmf(5, n, p):.4f}")

3.3 Expectation and Variance

Expected Value

$E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)}$
$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx \quad \text{(continuous)}$

Linearity of expectation: $E[aX + b] = aE[X] + b$

Variance

$Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$

Standard deviation: $\sigma = \sqrt{Var(X)}$
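Both forms of the variance formula can be evaluated exactly for a small discrete distribution. The fair six-sided die below is an illustrative choice.

```python
import numpy as np

# A fair six-sided die
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

mean = np.sum(values * probs)                             # E[X]
var_definition = np.sum((values - mean) ** 2 * probs)     # E[(X - mu)^2]
var_shortcut = np.sum(values ** 2 * probs) - mean ** 2    # E[X^2] - (E[X])^2

print(f"E[X] = {mean:.4f}")
print(f"Var(X) by definition: {var_definition:.4f}")
print(f"Var(X) by shortcut:   {var_shortcut:.4f}")
print(f"Standard deviation:   {np.sqrt(var_definition):.4f}")
```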

Covariance

Measures the linear relationship between two random variables:

$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$

Correlation

$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$

Range $[-1, 1]$ ; measures strength of linear relationship independent of scale.
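The difference between covariance and correlation shows up when one variable is rescaled, for example by a change of units. The synthetic data and the scale factor 100 below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 2.0 * x + rng.standard_normal(1000)    # linearly related, plus noise

cov_xy = np.cov(x, y)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]

# Rescale x: the covariance changes, the correlation does not
x_scaled = 100.0 * x
cov_scaled = np.cov(x_scaled, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(f"Covariance:  {cov_xy:.3f} -> {cov_scaled:.3f}")
print(f"Correlation: {corr_xy:.3f} -> {corr_scaled:.3f}")
```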

Covariance Matrix

For an $n$ -dimensional random vector:

$\Sigma_{ij} = Cov(X_i, X_j)$

The covariance matrix encodes the variance structure of data and is the key input to PCA.

3.4 Maximum Likelihood Estimation (MLE)

Likelihood Function

The probability of observing data $\mathcal{D} = \{x_1, \ldots, x_n\}$ given parameters $\theta$ , assuming the samples are independent and identically distributed (i.i.d.):

$L(\theta; \mathcal{D}) = P(\mathcal{D}|\theta) = \prod_{i=1}^n P(x_i|\theta)$

MLE finds the $\theta$ that maximizes this likelihood:

$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; \mathcal{D})$

Log-Likelihood

Taking the log converts the product to a sum, simplifying computation:

$\log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)$

Since log is monotonically increasing, the maximizer of log-likelihood equals the maximizer of likelihood.

MLE for the Normal Distribution

Assuming data follows $\mathcal{N}(\mu, \sigma^2)$ :

$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$

Setting the derivative with respect to $\mu$ to zero: $\hat{\mu}_{MLE} = \bar{x}$ (sample mean)

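For $\sigma^2$ : $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum (x_i - \bar{x})^2$ . This estimator divides by $n$ rather than $n-1$ , so it is biased (slightly too small on average) for finite samples.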
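These closed-form estimates are easy to compute on synthetic data. In the sketch below, the true parameters (mean 5, standard deviation 2) and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)   # synthetic data

mu_mle = data.mean()
sigma2_mle = np.mean((data - mu_mle) ** 2)           # divides by n

print(f"mu_MLE = {mu_mle:.3f}")
print(f"sigma^2_MLE = {sigma2_mle:.3f}")

# np.var uses ddof=0 by default, which matches the MLE formula
print("Matches np.var(data):", np.isclose(sigma2_mle, np.var(data)))
print(f"Estimate with n-1 in the denominator: {np.var(data, ddof=1):.3f}")
```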

Cross-Entropy and MLE

Minimizing cross-entropy loss in classification is equivalent to MLE:

$\mathcal{L}_{CE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^C y_{ic} \log \hat{p}_{ic}$

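With one-hot labels $y_{ic}$ , this is exactly the negative average log-likelihood of the observed labels under the model's predicted probabilities $\hat{p}_{ic}$ , so minimizing it maximizes the likelihood.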
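The equivalence can be seen on a tiny example with one-hot labels: the cross-entropy formula and the negative mean log-likelihood give the same number. The predictions are made up, and `categorical_cross_entropy` is a helper written for this sketch.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, p_pred, eps=1e-15):
    """Mean over samples of -sum_c y_ic * log(p_ic)."""
    p_pred = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p_pred), axis=1))

# Three samples, three classes (made-up predictions)
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5]])

ce = categorical_cross_entropy(y_onehot, p_pred)

# The same quantity as a negative average log-likelihood of the true labels
true_class = y_onehot.argmax(axis=1)
nll = -np.mean(np.log(p_pred[np.arange(len(true_class)), true_class]))

print(f"Cross-entropy:                {ce:.4f}")
print(f"Negative mean log-likelihood: {nll:.4f}")
```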

import numpy as np
from scipy.optimize import minimize_scalar

# MLE for Bernoulli distribution
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n = len(data)
k = data.sum()

p_mle = k / n
print(f"Analytic MLE: p = {p_mle:.2f}")

def neg_log_likelihood(p):
    if p <= 0 or p >= 1:
        return np.inf
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method='bounded')
print(f"Numerical MLE: p = {result.x:.4f}")

# Binary cross-entropy loss (the two-class special case of the formula above)
def cross_entropy(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
print(f"\nCross-Entropy loss: {cross_entropy(y_true, y_pred):.4f}")

3.5 Information Theory

Entropy

Measures the uncertainty of a probability distribution $P$ :

$H(X) = -\sum_{x} P(x) \log P(x)$

With base-2 logarithms the unit is bits; with natural logarithms it is nats. Entropy is maximal for the uniform distribution and zero when a single outcome is certain.

KL Divergence

A measure of "distance" between distributions $P$ and $Q$ (asymmetric):

$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

$D_{KL}(P \| Q) \geq 0$ , with equality iff $P = Q$ . Used in VAE loss functions, PPO policy optimization, and knowledge distillation.

Cross-Entropy

$H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)$

Since $H(P)$ depends only on the true distribution and not on the model, minimizing cross-entropy with respect to the model is equivalent to minimizing the KL divergence between the true and model distributions.
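The identity $H(P, Q) = H(P) + D_{KL}(P \| Q)$ and the asymmetry of KL divergence can both be checked on two small distributions. The values below are illustrative, and natural logarithms are used, so all results are in nats.

```python
import numpy as np

P = np.array([0.4, 0.3, 0.2, 0.1])        # "true" distribution (illustrative)
Q = np.array([0.35, 0.35, 0.15, 0.15])    # model distribution (illustrative)

entropy_P = -np.sum(P * np.log(P))
cross_entropy_PQ = -np.sum(P * np.log(Q))
kl_PQ = np.sum(P * np.log(P / Q))
kl_QP = np.sum(Q * np.log(Q / P))

print(f"H(P)            = {entropy_P:.4f} nats")
print(f"H(P, Q)         = {cross_entropy_PQ:.4f} nats")
print(f"H(P) + KL(P||Q) = {entropy_P + kl_PQ:.4f} nats")
print(f"KL(P||Q) = {kl_PQ:.4f}, KL(Q||P) = {kl_QP:.4f}")
```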

Mutual Information

Information shared between two random variables:

$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

Theoretical foundation for feature selection and representation learning.
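Mutual information can be computed from a joint probability table using the equivalent form $I(X; Y) = H(X) + H(Y) - H(X, Y)$, which follows from $H(X|Y) = H(X, Y) - H(Y)$. The joint table below is made up for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution P(X, Y) of two binary variables (made-up values)
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p_x = joint.sum(axis=1)     # marginal of X
p_y = joint.sum(axis=0)     # marginal of Y

mi = entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(joint)
print(f"I(X; Y) = {mi:.4f} bits")

# For independent variables the joint is the product of the marginals
independent = np.outer(p_x, p_y)
mi_indep = entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(independent)
print(f"|I(X; Y)| for independent X, Y = {abs(mi_indep):.4f} bits")
```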

import numpy as np

def entropy(p):
    p = np.array(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q, eps=1e-15):
    p = np.array(p) + eps
    q = np.array(q) + eps
    p = p / p.sum()
    q = q / q.sum()
    # Natural log, so the result is in nats (entropy() above reports bits)
    return np.sum(p * np.log(p / q))

# Uniform distribution (maximum entropy)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform distribution entropy: {entropy(uniform):.4f} bits")   # 2.0

# Concentrated distribution (low entropy)
concentrated = [0.97, 0.01, 0.01, 0.01]
print(f"Concentrated distribution entropy: {entropy(concentrated):.4f} bits")

# KL Divergence
true_dist = [0.4, 0.3, 0.2, 0.1]
model_dist = [0.35, 0.35, 0.15, 0.15]
kl = kl_divergence(true_dist, model_dist)
print(f"\nKL Divergence D_KL(P||Q): {kl:.4f}")

4. Numerical Methods

Understanding the gap between mathematical theory and actual computer arithmetic is important for reliable deep learning implementations.

4.1 Floating-Point Representation

FP32 (Single Precision)

32 bits: 1 sign, 8 exponent, 23 mantissa
Precision: ~7 significant decimal digits
Range: approximately $\pm 3.4 \times 10^{38}$

FP16 (Half Precision)

16 bits: 1 sign, 5 exponent, 10 mantissa
Saves GPU memory, faster computation (Mixed Precision Training)
Risk of overflow/underflow with large or small values

BF16 (Brain Float 16)

16 bits: 1 sign, 8 exponent, 7 mantissa
Same exponent range as FP32 (lower overflow risk)
Introduced by Google for TPUs, now widely used for AI training
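NumPy has no built-in BF16 type, but the format can be approximated by keeping only the top 16 bits of an FP32 value. The helper `to_bf16_truncate` below is a simplified sketch: it truncates, whereas real hardware typically rounds to nearest.

```python
import numpy as np

def to_bf16_truncate(x):
    """Simulate BF16 by zeroing the low 16 bits of each FP32 value.

    Truncation is a simplification; it is not how hardware usually rounds.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

values = np.array([1.0, 3.14159265, 70000.0, 1e-3], dtype=np.float32)
bf16 = to_bf16_truncate(values)
fp16 = values.astype(np.float16)   # 70000 exceeds the FP16 range (NumPy may warn)

for v, b, h in zip(values, bf16, fp16):
    print(f"fp32={v:<12.6g} bf16(trunc)={b:<12.6g} fp16={h}")

# Relative error of the truncated values
rel_err = np.abs(bf16 - values) / values
print("Max relative error (BF16 truncation):", rel_err.max())
```

The value 70000 overflows to `inf` in FP16 but stays finite in the simulated BF16, at the cost of a coarser mantissa.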

import numpy as np

# Floating-point precision issues
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 (float32) = {a + b}")  # may not be exactly 0.3

# FP16 overflow
print(f"FP16 large value: {np.float16(65000)}")
print(f"FP16 overflow: {np.float16(70000)}")  # inf

# Compare dtype ranges
for dtype in [np.float16, np.float32, np.float64]:
    info = np.finfo(dtype)
    # finfo.tiny is the smallest positive normal value (subnormals are smaller)
    print(f"{dtype.__name__}: max={info.max:.3e}, min_normal={info.tiny:.3e}")

4.2 Numerical Stability

The Instability of Naive Softmax

$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$

Large inputs cause $e^{x_i}$ to overflow (inf). The fix: subtract the maximum value before exponentiation.

$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$

This is mathematically equivalent but numerically stable.

import numpy as np

def naive_softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

def stable_softmax(x):
    x_shifted = x - np.max(x)
    return np.exp(x_shifted) / np.sum(np.exp(x_shifted))

# Normal input
x_normal = np.array([1.0, 2.0, 3.0])
print("Normal input:")
print(f"  Naive:  {naive_softmax(x_normal)}")
print(f"  Stable: {stable_softmax(x_normal)}")

# Large input
x_large = np.array([1000.0, 2000.0, 3000.0])
print("\nLarge input:")
naive_result = naive_softmax(x_large)
print(f"  Naive:  {naive_result}")   # [nan, nan, nan] due to overflow
print(f"  Stable: {stable_softmax(x_large)}")   # [0, 0, 1]

# Log-Softmax (used with NLLLoss)
def log_softmax(x):
    x_shifted = x - np.max(x)
    return x_shifted - np.log(np.sum(np.exp(x_shifted)))

logits = np.array([2.0, 1.0, 0.1])
print(f"\nLog-Softmax: {log_softmax(logits)}")

5. Mathematical Analysis of Linear Regression

Linear regression is the simplest building block of deep learning. A thorough mathematical understanding creates a foundation for understanding more complex models.

5.1 Ordinary Least Squares

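For data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ , we fit a linear model $\hat{y} = \mathbf{w}^T \mathbf{x} + b$ . To keep the notation compact, the bias $b$ is absorbed into $\mathbf{w}$ by appending a constant 1 to every $\mathbf{x}_i$ ; $X$ is the resulting design matrix with one sample per row, and $y$ is the vector of targets.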

Loss function (MSE):

$L(\mathbf{w}) = \frac{1}{n} \|y - X\mathbf{w}\|^2$

Normal Equation

Differentiating with respect to $\mathbf{w}$ and setting to zero:

$X^T X \mathbf{w} = X^T y$

$\hat{\mathbf{w}} = (X^T X)^{-1} X^T y$

This closed-form solution requires $X^T X$ to be invertible, which holds when the columns of $X$ (the features) are linearly independent.
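The normal equation can be solved directly and compared with NumPy's least-squares routine. The data below is synthetic, and `true_w` is a made-up ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
true_w = np.array([2.0, -1.0, 0.5])              # made-up ground truth
y = X @ true_w + 0.1 * rng.standard_normal(n)

# Solve X^T X w = X^T y; solve() avoids forming the explicit inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: NumPy's least-squares solver
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print("Normal equation:", w_normal.round(3))
print("lstsq:          ", w_lstsq.round(3))
```

Using `np.linalg.solve` rather than `np.linalg.inv` is generally the more numerically robust way to apply the formula.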

5.2 Ridge and Lasso Regression

Ridge Regression (L2 regularization)

$L_{ridge}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2$

Normal equation: $\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T y$

For $\lambda > 0$ , adding $\lambda I$ guarantees invertibility and shrinks the weights toward zero, which reduces overfitting.

Lasso Regression (L1 regularization)

$L_{lasso}(\mathbf{w}) = \|y - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1$

L1 regularization induces sparse solutions — some weights become exactly zero — providing automatic feature selection.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.0, 0.0])
y = X @ true_w + np.random.randn(n) * 0.5

# OLS via least squares (lstsq uses an SVD-based solver instead of forming the normal equation)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS coefficients:", w_ols.round(2))

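# Ridge (scikit-learn fits an intercept by default, so the coefficients can
# differ slightly from the no-intercept normal equation further down)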
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_.round(2))

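# Lasso (sparse solution). scikit-learn scales the squared error by 1/(2n),
# so alpha is not directly comparable to lambda in the formula above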
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(2))
print("Number of zero coefficients:", (np.abs(lasso.coef_) < 0.01).sum())

# Ridge via normal equation
def ridge_normal_equation(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ridge = ridge_normal_equation(X, y, lam=1.0)
print("\nRidge (normal equation):", w_ridge.round(2))

6. Putting It Together: The Mathematics of Deep Learning

6.1 A Mathematical View of Neural Networks

A fully connected neural network with $L$ layers:

$\mathbf{h}^{(0)} = \mathbf{x}$
$\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$
$\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$
$\hat{y} = \mathbf{h}^{(L)}$

Forward pass: linear algebra (matrix multiplication) + activation functions

Loss computation: information theory (cross-entropy) or statistics (MSE)

Backward pass: calculus (chain rule) to compute gradients

Parameter update: optimization theory (gradient descent, Adam)
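The four stages can be seen together in a minimal training loop. The sketch below trains a single-layer model (logistic regression) on synthetic data; the learning rate, step count, and data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (synthetic)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.5
losses = []
for step in range(200):
    # 1. Forward pass: linear map followed by a sigmoid activation
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # 2. Loss computation: binary cross-entropy
    p_safe = np.clip(p, 1e-12, 1 - 1e-12)
    losses.append(-np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)))
    # 3. Backward pass: the chain rule gives dL/dz = (p - y) / n
    grad_z = (p - y) / len(y)
    grad_w, grad_b = X.T @ grad_z, grad_z.sum()
    # 4. Parameter update: plain gradient descent
    w -= lr * grad_w
    b -= lr * grad_b

print(f"Loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

With zero initial weights every prediction is 0.5, so the first loss value is $\ln 2$.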

6.2 The Attention Mechanism and Matrix Operations

The Transformer's scaled dot-product attention:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

$Q, K, V$ : Query, Key, Value matrices (linear algebra)
$\frac{1}{\sqrt{d_k}}$ scaling: keeps the variance of the scores near 1 so the softmax does not saturate (numerical analysis)
Softmax: converts scores to probability distribution (probability theory)
Dot product $QK^T$ : similarity measurement (vector dot product)
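The role of the $\frac{1}{\sqrt{d_k}}$ factor can be seen in a simulation: for query and key vectors with independent unit-variance entries (an assumption made for this sketch), the standard deviation of the raw dot product grows like $\sqrt{d_k}$, while the scaled score stays near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

raw_std, scaled_std = {}, {}
for d_k in [4, 64, 256]:
    q = rng.standard_normal((2000, d_k))
    k = rng.standard_normal((2000, d_k))
    dots = np.sum(q * k, axis=1)              # one dot product per row
    raw_std[d_k] = dots.std()
    scaled_std[d_k] = (dots / np.sqrt(d_k)).std()
    print(f"d_k={d_k:4d}  std(q.k)={raw_std[d_k]:6.2f}  "
          f"std(q.k / sqrt(d_k))={scaled_std[d_k]:.2f}")
```

Without the scaling, large scores would push the softmax toward a near one-hot output, where its gradients become very small.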

import numpy as np

def stable_softmax_2d(x):
    x_max = x.max(axis=-1, keepdims=True)
    x_shifted = x - x_max
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    weights = stable_softmax_2d(scores)
    output = weights @ V
    return output, weights

# 3 tokens, d_k=4
seq_len, d_k = 3, 4
np.random.seed(42)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention output shape: {output.shape}")
print(f"Attention weights (rows sum to 1):\n{weights.round(3)}")
print(f"Row sums: {weights.sum(axis=-1)}")

Conclusion: Next Steps

This guide covered the core mathematical foundations of AI/ML:

Linear Algebra: the language for representing and transforming data and model parameters
Calculus: the tool explaining how models learn
Probability & Statistics: handles uncertainty and provides the theoretical basis for loss functions
Numerical Methods: knowledge for stable computer implementations of mathematical theory

Recommended next steps:

Gilbert Strang's Linear Algebra — free on MIT OCW
Pattern Recognition and Machine Learning (Bishop) — probabilistic perspective on ML
Deep Learning (Goodfellow et al.) — the theoretical bible of deep learning
Practice: implement a neural network from scratch using only NumPy

Mathematics is a tool for understanding. You do not need to memorize every formula, but grasping why each concept exists and how it operates within deep learning gives you significantly more powerful practical skills.

References

Gilbert Strang, Introduction to Linear Algebra, 6th Edition
Ian Goodfellow et al., Deep Learning, MIT Press (2016)
Christopher Bishop, Pattern Recognition and Machine Learning, Springer (2006)
3Blue1Brown — Essence of Linear Algebra, Essence of Calculus (YouTube)
fast.ai — Practical Deep Learning for Coders