Skip to content

Split View: 선형대수학 완전정복: 벡터부터 SVD까지 제로투히어로 가이드

|

선형대수학 완전정복: 벡터부터 SVD까지 제로투히어로 가이드

들어가며

선형대수학(Linear Algebra)은 현대 수학과 공학의 핵심 기초입니다. 데이터 과학, 머신러닝, 컴퓨터 그래픽스, 양자역학에 이르기까지 거의 모든 분야에서 활용됩니다. 이 가이드는 벡터와 행렬의 기초부터 SVD(특이값 분해)까지, 선형대수학의 핵심 개념을 체계적으로 다룹니다.

참고 자료:

  • Gilbert Strang, MIT OCW 18.06 Linear Algebra
  • 3Blue1Brown: Essence of Linear Algebra (YouTube)
  • NumPy 공식 문서 (numpy.org)
  • scikit-learn PCA 문서

1. 벡터 (Vectors)

벡터의 정의와 표현

벡터(vector)는 크기(magnitude)와 방향(direction)을 모두 가진 수학적 객체입니다. nn차원 벡터 v\mathbf{v}는 다음과 같이 표현합니다.

v=(v1v2vn)Rn\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} \in \mathbb{R}^n

벡터 연산

덧셈과 스칼라 곱:

u+v=(u1+v1u2+v2),cv=(cv1cv2)\mathbf{u} + \mathbf{v} = \begin{pmatrix} u_1 + v_1 \\ u_2 + v_2 \end{pmatrix}, \quad c\mathbf{v} = \begin{pmatrix} cv_1 \\ cv_2 \end{pmatrix}

내적(Dot Product):

uv=i=1nuivi=uvcosθ\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta

내적이 0이면 두 벡터는 서로 직교(orthogonal)합니다.

외적(Cross Product) — 3차원에서만 정의:

u×v=ijku1u2u3v1v2v3\mathbf{u} \times \mathbf{v} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix}

노름 (Norm)

  • L1 노름: v1=ivi\|\mathbf{v}\|_1 = \sum_{i} |v_i|
  • L2 노름 (유클리드): v2=ivi2\|\mathbf{v}\|_2 = \sqrt{\sum_{i} v_i^2}
  • Lp 노름: vp=(ivip)1/p\|\mathbf{v}\|_p = \left(\sum_{i} |v_i|^p\right)^{1/p}

단위벡터 (Unit Vector):

v^=vv\hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|}

import numpy as np

# 벡터 정의
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# 벡터 덧셈
print("덧셈:", a + b)          # [5 7 9]

# 스칼라 곱
print("스칼라 곱:", 3 * a)     # [3 6 9]

# 내적
dot = np.dot(a, b)
print("내적:", dot)             # 32

# 외적
cross = np.cross(a, b)
print("외적:", cross)           # [-3  6 -3]

# L2 노름
norm_l2 = np.linalg.norm(a)
print("L2 노름:", norm_l2)      # 3.7416...

# L1 노름
norm_l1 = np.linalg.norm(a, ord=1)
print("L1 노름:", norm_l1)      # 6.0

# 단위벡터
unit = a / np.linalg.norm(a)
print("단위벡터:", unit)
print("크기 확인:", np.linalg.norm(unit))  # 1.0

# 두 벡터의 각도
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
angle = np.arccos(cos_theta)
print("각도 (라디안):", angle)
print("각도 (도):", np.degrees(angle))

2. 행렬 (Matrices)

행렬의 정의와 종류

m×nm \times n 행렬은 mm개의 행과 nn개의 열로 이루어진 수의 직사각형 배열입니다.

A=(a11a12a1na21a22a2nam1am2amn)A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}

주요 행렬 종류:

종류정의예시
단위행렬 (Identity)Iij=1I_{ij} = 1 (i=j), 0 (i≠j)I2=(1001)I_2 = \begin{pmatrix}1&0\\0&1\end{pmatrix}
대각행렬 (Diagonal)대각선 외 모두 0diag(d1,d2)\text{diag}(d_1, d_2)
대칭행렬 (Symmetric)A=ATA = A^T
직교행렬 (Orthogonal)ATA=IA^T A = I회전 행렬

행렬 연산

행렬 곱 (Matrix Multiplication):

(AB)ij=k=1naikbkj(AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}

행렬 곱은 교환 법칙이 성립하지 않습니다: ABBAAB \neq BA (일반적으로)

전치행렬 (Transpose):

(AT)ij=Aji(A^T)_{ij} = A_{ji}

역행렬 (Inverse):

정방행렬 AA의 역행렬 A1A^{-1}AA1=A1A=IAA^{-1} = A^{-1}A = I를 만족합니다. 역행렬은 행렬식이 0이 아닐 때만 존재합니다.

import numpy as np

# 행렬 정의
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# 행렬 덧셈
print("행렬 덧셈:\n", A + B)

# 행렬 곱
C = A @ B
print("행렬 곱:\n", C)

# 원소별 곱 (Hadamard product)
print("원소별 곱:\n", A * B)

# 전치행렬
print("전치행렬:\n", A.T)

# 역행렬
A_inv = np.linalg.inv(A)
print("역행렬:\n", A_inv)

# 검증: A @ A_inv = I
print("검증 (A @ A_inv):\n", A @ A_inv)

# 단위행렬
I = np.eye(3)
print("3x3 단위행렬:\n", I)

# 대각행렬
D = np.diag([1, 2, 3])
print("대각행렬:\n", D)

# 대칭행렬 생성
S = A @ A.T
print("대칭행렬 (A @ A^T):\n", S)
print("대칭 확인:", np.allclose(S, S.T))

3. 행렬식 (Determinant)

행렬식의 정의와 계산

행렬식(determinant)은 정방행렬에서 정의되는 스칼라 값으로, 행렬의 다양한 특성을 담고 있습니다.

2×2 행렬식:

det(abcd)=adbc\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc

3×3 행렬식 (사루스 법칙, Sarrus' Rule):

det(A)=a11(a22a33a23a32)a12(a21a33a23a31)+a13(a21a32a22a31)\det(A) = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})

행렬식의 성질

  1. det(I)=1\det(I) = 1
  2. det(AB)=det(A)det(B)\det(AB) = \det(A)\det(B)
  3. det(AT)=det(A)\det(A^T) = \det(A)
  4. det(A1)=1/det(A)\det(A^{-1}) = 1/\det(A)
  5. 행렬식이 0이면 역행렬이 존재하지 않음 (특이행렬, singular matrix)
  6. 기하학적으로: 행 벡터들이 이루는 평행육면체의 부피

크라머 공식 (Cramer's Rule)

연립방정식 Ax=bA\mathbf{x} = \mathbf{b}의 해:

xi=det(Ai)det(A)x_i = \frac{\det(A_i)}{\det(A)}

여기서 AiA_iAAii번째 열을 b\mathbf{b}로 대체한 행렬입니다.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

B = np.array([[2, -1, 0],
              [-1, 2, -1],
              [0, -1, 2]])

# 행렬식 계산
det_A = np.linalg.det(A)
det_B = np.linalg.det(B)
print("det(A):", det_A)   # 거의 0 (특이행렬에 가까움)
print("det(B):", det_B)   # 4.0

# 성질 검증: det(A@B) = det(A)*det(B)
print("det(A)*det(B):", det_A * det_B)
print("det(A@B):", np.linalg.det(A @ B))

# 크라머 공식으로 연립방정식 풀기
A_sys = np.array([[2, 1], [5, 3]], dtype=float)
b_sys = np.array([1, 2], dtype=float)

det_A_sys = np.linalg.det(A_sys)

# x1 계산
A1 = A_sys.copy()
A1[:, 0] = b_sys
x1 = np.linalg.det(A1) / det_A_sys

# x2 계산
A2 = A_sys.copy()
A2[:, 1] = b_sys
x2 = np.linalg.det(A2) / det_A_sys

print(f"x1 = {x1}, x2 = {x2}")
# numpy linalg solve로 검증
x = np.linalg.solve(A_sys, b_sys)
print("numpy solve:", x)

4. 연립방정식과 가우스 소거법

선형 연립방정식

nn개의 미지수에 대한 mm개의 선형방정식:

a11x1+a12x2++a1nxn=b1a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1 a21x1+a22x2++a2nxn=b2a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2 \vdots am1x1+am2x2++amnxn=bma_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n = b_m

행렬 형태: Ax=bA\mathbf{x} = \mathbf{b}

첨가행렬 (Augmented Matrix)

[Ab]=(a11a1nb1am1amnbm)[A|\mathbf{b}] = \left(\begin{array}{ccc|c} a_{11} & \cdots & a_{1n} & b_1 \\ \vdots & \ddots & \vdots & \vdots \\ a_{m1} & \cdots & a_{mn} & b_m \end{array}\right)

가우스 소거법 (Gaussian Elimination)

행 기본 연산(Elementary Row Operations):

  1. 두 행 교환: RiRjR_i \leftrightarrow R_j
  2. 행에 스칼라 곱: RicRiR_i \leftarrow c \cdot R_i
  3. 행의 배수 더하기: RiRi+cRjR_i \leftarrow R_i + c \cdot R_j

이 연산으로 행사다리꼴(Row Echelon Form, REF) 또는 기약행사다리꼴(Reduced REF)을 만듭니다.

LU 분해

A=LUA = LU: 하삼각행렬 LL과 상삼각행렬 UU의 곱으로 분해

import numpy as np
from scipy import linalg

# 연립방정식: 2x + y = 5, x + 3y = 10
A = np.array([[2, 1], [1, 3]], dtype=float)
b = np.array([5, 10], dtype=float)

# numpy linalg.solve
x = np.linalg.solve(A, b)
print("해:", x)  # [1. 3.]

# 가우스-조르단 소거법 구현
def gauss_jordan(A, b):
    n = len(b)
    # 첨가행렬 생성
    aug = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])

    for col in range(n):
        # 피벗 선택 (부분 피벗팅)
        max_row = np.argmax(np.abs(aug[col:, col])) + col
        aug[[col, max_row]] = aug[[max_row, col]]

        # 피벗 행 정규화
        aug[col] = aug[col] / aug[col, col]

        # 다른 행에서 소거
        for row in range(n):
            if row != col:
                aug[row] -= aug[row, col] * aug[col]

    return aug[:, -1]

solution = gauss_jordan(A.copy(), b.copy())
print("가우스-조르단 해:", solution)

# LU 분해
P, L, U = linalg.lu(A)
print("L 행렬:\n", L)
print("U 행렬:\n", U)
print("P 행렬:\n", P)

# LU 분해로 연립방정식 풀기
x_lu = linalg.solve(A, b)
print("LU 분해 해:", x_lu)

5. 벡터 공간 (Vector Spaces)

벡터 공간의 정의

집합 VV가 벡터 공간이 되려면 덧셈과 스칼라 곱에 대해 다음 8가지 공리를 만족해야 합니다:

  1. 닫힘 (Closure): u,vVu+vV\mathbf{u}, \mathbf{v} \in V \Rightarrow \mathbf{u} + \mathbf{v} \in V
  2. 교환법칙: u+v=v+u\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}
  3. 결합법칙: (u+v)+w=u+(v+w)(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})
  4. 영벡터 존재: v+0=v\mathbf{v} + \mathbf{0} = \mathbf{v}
  5. 역원 존재: v+(v)=0\mathbf{v} + (-\mathbf{v}) = \mathbf{0}
  6. 스칼라 닫힘: cR,vVcvVc \in \mathbb{R}, \mathbf{v} \in V \Rightarrow c\mathbf{v} \in V
  7. 분배법칙 1: c(u+v)=cu+cvc(\mathbf{u} + \mathbf{v}) = c\mathbf{u} + c\mathbf{v}
  8. 분배법칙 2: (c+d)v=cv+dv(c + d)\mathbf{v} = c\mathbf{v} + d\mathbf{v}

선형독립 (Linear Independence)

벡터 집합 {v1,v2,,vk}\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\}선형독립이란:

c1v1+c2v2++ckvk=0    c1=c2==ck=0c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0} \implies c_1 = c_2 = \cdots = c_k = 0

즉, 자명한 해(trivial solution)만 존재해야 합니다.

기저 (Basis)와 차원 (Dimension)

벡터 공간 VV의 **기저(basis)**는 VV를 생성(span)하는 선형독립인 벡터들의 집합입니다.

  • 차원(dimension): 기저의 원소 개수
  • Rn\mathbb{R}^n의 표준기저: e1,e2,,en\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n

행공간, 열공간, 영공간

행렬 ARm×nA \in \mathbb{R}^{m \times n}에 대해:

  • 행공간 (Row Space): AA의 행 벡터들로 생성되는 공간, Rn\subseteq \mathbb{R}^n
  • 열공간 (Column Space): AA의 열 벡터들로 생성되는 공간, Rm\subseteq \mathbb{R}^m
  • 영공간 (Null Space): Ax=0A\mathbf{x} = \mathbf{0}의 해 집합, Rn\subseteq \mathbb{R}^n
  • 차원 정리: rank(A)+nullity(A)=n\text{rank}(A) + \text{nullity}(A) = n

랭크 (Rank)

rank(A)=dim(Row Space)=dim(Column Space)\text{rank}(A) = \dim(\text{Row Space}) = \dim(\text{Column Space})

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [2, 4, 6]])  # 3행은 1행의 2배 (선형종속)

# 랭크 계산
rank = np.linalg.matrix_rank(A)
print("랭크:", rank)  # 2

# 영공간 (null space) 계산
from scipy.linalg import null_space
ns = null_space(A)
print("영공간:\n", ns)

# 열공간의 기저 (QR 분해 활용)
Q, R = np.linalg.qr(A)
print("열공간의 기저 (Q의 처음 rank개 열):\n", Q[:, :rank])

# 선형독립 확인 예시
v1 = np.array([1, 0, 0])
v2 = np.array([0, 1, 0])
v3 = np.array([0, 0, 1])
M = np.column_stack([v1, v2, v3])
print("v1, v2, v3의 랭크:", np.linalg.matrix_rank(M))  # 3 = 선형독립

v4 = np.array([1, 2, 3])
v5 = np.array([2, 4, 6])  # v5 = 2*v4
M2 = np.column_stack([v4, v5])
print("v4, v5의 랭크:", np.linalg.matrix_rank(M2))  # 1 = 선형종속

6. 선형 변환 (Linear Transformations)

선형 변환의 정의

함수 T:VWT: V \to W선형 변환이 되려면:

  1. T(u+v)=T(u)+T(v)T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) (가산성)
  2. T(cv)=cT(v)T(c\mathbf{v}) = cT(\mathbf{v}) (동차성)

행렬 표현

모든 선형 변환은 행렬로 표현됩니다: T(x)=AxT(\mathbf{x}) = A\mathbf{x}

핵 (Kernel): ker(T)={vV:T(v)=0}\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}\}

상 (Image): Im(T)={T(v):vV}\text{Im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\}

차원 정리: dim(kerT)+dim(ImT)=dimV\dim(\ker T) + \dim(\text{Im} T) = \dim V

기하학적 변환

2차원에서의 주요 변환 행렬:

회전 행렬 (각도 θ\theta): R(θ)=(cosθsinθsinθcosθ)R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}

x축 반사: Fx=(1001)F_x = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}

균등 확대/축소 (배율 s): S=(s00s)S = \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix}

import numpy as np
import matplotlib.pyplot as plt

# 회전 변환
def rotation_matrix(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# 45도 회전
theta = np.pi / 4
R = rotation_matrix(theta)
print("45도 회전 행렬:\n", R)

# 벡터 변환
v = np.array([1, 0])
v_rotated = R @ v
print("회전된 벡터:", v_rotated)  # [0.707, 0.707]

# 전단 변환 (shear)
shear = np.array([[1, 1],
                  [0, 1]])

# 단위 정사각형의 꼭짓점 변환
corners = np.array([[0, 1, 1, 0, 0],
                    [0, 0, 1, 1, 0]])

transformed = shear @ corners
print("전단 변환 후 꼭짓점:\n", transformed)

# 직교 행렬 검증
print("R^T @ R =\n", R.T @ R)  # 단위행렬
print("det(R) =", np.linalg.det(R))  # 1

7. 내적 공간 (Inner Product Spaces)

내적의 정의

내적(inner product) ,:V×VR\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}은 다음을 만족합니다:

  1. 양정치성: v,v0\langle \mathbf{v}, \mathbf{v} \rangle \geq 0, 등호는 v=0\mathbf{v} = \mathbf{0}일 때만
  2. 대칭성: u,v=v,u\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle
  3. 선형성: au+bv,w=au,w+bv,w\langle a\mathbf{u} + b\mathbf{v}, \mathbf{w} \rangle = a\langle \mathbf{u}, \mathbf{w} \rangle + b\langle \mathbf{v}, \mathbf{w} \rangle

코시-슈바르츠 부등식

u,vuv|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|

등호는 u\mathbf{u}v\mathbf{v}가 평행할 때 성립합니다.

그람-슈미트 과정 (Gram-Schmidt Process)

선형독립인 벡터들로부터 정규직교기저(orthonormal basis)를 구성합니다.

주어진 벡터 v1,v2,,vk\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k에 대해:

u1=v1\mathbf{u}_1 = \mathbf{v}_1 u2=v2v2,u1u1,u1u1\mathbf{u}_2 = \mathbf{v}_2 - \frac{\langle \mathbf{v}_2, \mathbf{u}_1 \rangle}{\langle \mathbf{u}_1, \mathbf{u}_1 \rangle} \mathbf{u}_1

일반적으로:

uk=vkj=1k1vk,ujuj,ujuj\mathbf{u}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} \frac{\langle \mathbf{v}_k, \mathbf{u}_j \rangle}{\langle \mathbf{u}_j, \mathbf{u}_j \rangle} \mathbf{u}_j

마지막으로 각 ui\mathbf{u}_i를 정규화합니다.

import numpy as np

def gram_schmidt(vectors):
    """그람-슈미트 정규직교화"""
    basis = []
    for v in vectors:
        w = v.copy().astype(float)
        for b in basis:
            w -= np.dot(v, b) * b
        norm = np.linalg.norm(w)
        if norm > 1e-10:
            basis.append(w / norm)
    return np.array(basis)

# 선형독립인 벡터들
v1 = np.array([1, 1, 0])
v2 = np.array([1, 0, 1])
v3 = np.array([0, 1, 1])

# 그람-슈미트 적용
basis = gram_schmidt([v1, v2, v3])
print("정규직교기저:\n", basis)

# 직교성 확인
print("내적 확인 (0이어야 함):")
print("e1 dot e2:", np.dot(basis[0], basis[1]))
print("e1 dot e3:", np.dot(basis[0], basis[2]))
print("e2 dot e3:", np.dot(basis[1], basis[2]))

# 단위벡터 확인
print("노름 확인 (1이어야 함):")
for i, b in enumerate(basis):
    print(f"  e{i+1} 노름:", np.linalg.norm(b))

# scipy의 그람-슈미트 (QR 분해 활용)
A = np.column_stack([v1, v2, v3])
Q, R = np.linalg.qr(A)
print("QR 분해 Q 행렬:\n", Q)

8. 고유값과 고유벡터 (Eigenvalues & Eigenvectors)

정의

정방행렬 AA에 대해 다음을 만족하는 스칼라 λ\lambda와 영이 아닌 벡터 v\mathbf{v}:

Av=λvA\mathbf{v} = \lambda\mathbf{v}

  • λ\lambda: 고유값 (eigenvalue)
  • v\mathbf{v}: λ\lambda에 대응하는 고유벡터 (eigenvector)

고유벡터는 AA에 의해 변환될 때 방향은 유지되고 크기만 λ\lambda배 변합니다.

특성 방정식

det(AλI)=0\det(A - \lambda I) = 0

이 다항식(characteristic polynomial)의 근이 고유값입니다.

2×2 예시:

A=(4123)A = \begin{pmatrix} 4 & 1 \\ 2 & 3 \end{pmatrix}

det(AλI)=(4λ)(3λ)2=λ27λ+10=(λ2)(λ5)=0\det(A - \lambda I) = (4-\lambda)(3-\lambda) - 2 = \lambda^2 - 7\lambda + 10 = (\lambda-2)(\lambda-5) = 0

따라서 λ1=2\lambda_1 = 2, λ2=5\lambda_2 = 5

고유값 분해 (Eigendecomposition)

대각화 가능한 행렬 AA는 다음과 같이 분해됩니다:

A=PΛP1A = P \Lambda P^{-1}

여기서 Λ=diag(λ1,λ2,,λn)\Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)이고 PP의 열들은 고유벡터입니다.

대칭 행렬의 스펙트럼 정리

실수 대칭 행렬 A=ATA = A^T는:

  1. 모든 고유값이 실수
  2. 서로 다른 고유값에 대한 고유벡터는 직교
  3. A=QΛQTA = Q \Lambda Q^T (직교 대각화 가능)
import numpy as np
from scipy.linalg import eig

# 일반 행렬의 고유값/고유벡터
A = np.array([[4, 1],
              [2, 3]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("고유값:", eigenvalues)          # [5., 2.]
print("고유벡터 (열):\n", eigenvectors)

# 검증: A @ v = lambda * v
for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]
    print(f"\nlambda={lam:.2f}:")
    print("  A @ v:", A @ v)
    print("  lambda * v:", lam * v)

# 고유값 분해: A = P @ Lambda @ P^{-1}
P = eigenvectors
Lambda = np.diag(eigenvalues)
A_reconstructed = P @ Lambda @ np.linalg.inv(P)
print("\n재구성 행렬:\n", A_reconstructed)
print("원래 행렬과 동일:", np.allclose(A, A_reconstructed))

# 대칭 행렬 (실수 고유값 보장)
B = np.array([[2, 1, 0],
              [1, 3, 1],
              [0, 1, 2]])

eigenvalues_B, eigenvectors_B = np.linalg.eigh(B)  # 대칭 행렬 전용
print("\n대칭 행렬 고유값:", eigenvalues_B)
print("직교성 확인 (Q^T @ Q = I):\n",
      np.round(eigenvectors_B.T @ eigenvectors_B))

# 거듭제곱: A^n = P @ Lambda^n @ P^{-1}
n = 5
Lambda_n = np.diag(eigenvalues ** n)
A_power = P @ Lambda_n @ np.linalg.inv(P)
print(f"\nA^{n}:\n", A_power)
print("직접 계산과 비교:\n", np.linalg.matrix_power(A, n))

9. 특이값 분해 (SVD - Singular Value Decomposition)

SVD의 정의

임의의 m×nm \times n 행렬 AA는 다음과 같이 분해됩니다:

A=UΣVTA = U \Sigma V^T

  • URm×mU \in \mathbb{R}^{m \times m}: 좌특이벡터(left singular vectors), 직교행렬
  • ΣRm×n\Sigma \in \mathbb{R}^{m \times n}: 특이값 대각행렬 (σ1σ20\sigma_1 \geq \sigma_2 \geq \cdots \geq 0)
  • VRn×nV \in \mathbb{R}^{n \times n}: 우특이벡터(right singular vectors), 직교행렬

기하학적 의미

AA는 다음 세 변환의 합성으로 볼 수 있습니다:

  1. VTV^T: 회전/반사
  2. Σ\Sigma: 좌표축 방향 확대/축소
  3. UU: 회전/반사

PCA와의 관계

데이터 행렬 XX의 SVD를 수행하면 주성분 분석(PCA)을 수행한 것과 동일합니다.

X=UΣVTX = U \Sigma V^T에서 VV의 열들이 주성분(principal components)이 됩니다.

이미지 압축

특이값을 내림차순으로 정렬하면 상위 kk개의 특이값으로 근사 행렬을 구성할 수 있습니다:

Ak=i=1kσiuiviTA_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T

import numpy as np
import matplotlib.pyplot as plt

# 기본 SVD
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]])

U, S, Vt = np.linalg.svd(A)
print("U 형태:", U.shape)      # (4, 4)
print("S (특이값):", S)         # 특이값
print("Vt 형태:", Vt.shape)    # (3, 3)

# 재구성 검증
Sigma = np.zeros(A.shape)
Sigma[:min(A.shape), :min(A.shape)] = np.diag(S)
A_reconstructed = U @ Sigma @ Vt
print("재구성 오차:", np.linalg.norm(A - A_reconstructed))

# 이미지 압축 시뮬레이션
def svd_compress(matrix, k):
    """상위 k개 특이값으로 행렬 압축"""
    U, S, Vt = np.linalg.svd(matrix)
    # k개 성분만 사용
    U_k = U[:, :k]
    S_k = S[:k]
    Vt_k = Vt[:k, :]
    return U_k @ np.diag(S_k) @ Vt_k

# 테스트 행렬 (이미지 시뮬레이션)
np.random.seed(42)
img = np.random.randn(100, 100)

# 다양한 압축 비율 테스트
for k in [5, 10, 20, 50]:
    compressed = svd_compress(img, k)
    error = np.linalg.norm(img - compressed, 'fro') / np.linalg.norm(img, 'fro')
    compression_ratio = k * (100 + 100 + 1) / (100 * 100)
    print(f"k={k}: 상대 오차={error:.4f}, 압축비={compression_ratio:.4f}")

# PCA 구현 (SVD 활용)
def pca_svd(X, n_components):
    """SVD를 이용한 PCA"""
    # 데이터 중심화
    X_centered = X - X.mean(axis=0)

    # SVD 수행
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # 주성분으로 투영
    X_pca = X_centered @ Vt[:n_components].T

    # 설명 분산 비율
    explained_variance_ratio = (S[:n_components]**2) / (S**2).sum()

    return X_pca, Vt[:n_components], explained_variance_ratio

# 예시 데이터
np.random.seed(42)
X = np.random.randn(100, 5)
X_pca, components, evr = pca_svd(X, 2)
print("\nPCA 설명 분산 비율:", evr)
print("투영된 데이터 형태:", X_pca.shape)

10. 행렬 분해 (Matrix Factorizations)

주요 분해 방법 비교

분해형태적용 조건주요 용도
LU 분해A = LU정방행렬연립방정식
QR 분해A = QR임의 행렬최소제곱법, 고유값
CholeskyA = LL^T양정치 대칭수치 최적화
SVDA = UΣV^T임의 행렬PCA, 이미지 압축
고유값 분해A = PΛP^-1정방행렬스펙트럼 분석

Cholesky 분해

양정치(positive definite) 대칭 행렬에 대해: A=LLTA = LL^T

import numpy as np
from scipy import linalg

# QR 분해
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 10]], dtype=float)

Q, R = np.linalg.qr(A)
print("Q (직교 행렬):\n", Q)
print("R (상삼각 행렬):\n", R)
print("Q^T @ Q = I 확인:\n", np.round(Q.T @ Q))

# Cholesky 분해
B = np.array([[4, 2, 2],
              [2, 3, 1],
              [2, 1, 3]], dtype=float)

L = np.linalg.cholesky(B)
print("\nCholesky L:\n", L)
print("L @ L^T = A 확인:\n", L @ L.T)

# LU 분해 (scipy)
P, L_lu, U = linalg.lu(A)
print("\nLU 분해:")
print("P:\n", P)
print("L:\n", L_lu)
print("U:\n", U)
print("P @ L @ U = A 확인:\n", np.round(P @ L_lu @ U))

11. AI/ML에서의 선형대수학

신경망에서의 행렬 곱

신경망의 순전파(forward propagation)는 연속된 행렬 곱으로 표현됩니다:

h(l)=f(W(l)h(l1)+b(l))\mathbf{h}^{(l)} = f(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})

  • W(l)W^{(l)}: 가중치 행렬
  • b(l)\mathbf{b}^{(l)}: 편향 벡터
  • ff: 활성화 함수

배치 처리 시 입력을 행렬 XRn×dX \in \mathbb{R}^{n \times d}로 표현하여 병렬 계산이 가능합니다.

그래디언트와 야코비안

손실 함수 LL에 대한 가중치 행렬의 그래디언트:

LW=LhxT\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{h}} \mathbf{x}^T

PCA로 차원 축소

고차원 특성 공간에서 주요 분산 방향을 찾아 차원을 축소합니다.

Transformer의 Attention 메커니즘

Self-Attention은 세 행렬 Q,K,VQ, K, V (Query, Key, Value)를 이용합니다:

Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# 1. 간단한 신경망 순전파
np.random.seed(42)
n_samples, input_dim, hidden_dim, output_dim = 32, 784, 128, 10

X = np.random.randn(n_samples, input_dim)
W1 = np.random.randn(input_dim, hidden_dim) * 0.01
b1 = np.zeros(hidden_dim)
W2 = np.random.randn(hidden_dim, output_dim) * 0.01
b2 = np.zeros(output_dim)

# 순전파
H1 = np.maximum(0, X @ W1 + b1)  # ReLU 활성화
logits = H1 @ W2 + b2
print("logits 형태:", logits.shape)  # (32, 10)

# 2. PCA로 차원 축소
iris = load_iris()
X_iris = iris.data  # (150, 4)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_iris)
print("\nPCA 후 형태:", X_pca.shape)  # (150, 2)
print("설명 분산 비율:", pca.explained_variance_ratio_)
print("누적 설명 분산:", pca.explained_variance_ratio_.sum())

# 3. Scaled Dot-Product Attention
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Q @ K^T / sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)

    # Softmax
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # 가중 합
    output = attention_weights @ V
    return output, attention_weights

# 예시
seq_len, d_model = 5, 8
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)

output, weights = scaled_dot_product_attention(Q, K, V)
print("\nAttention 출력 형태:", output.shape)  # (5, 8)
print("Attention 가중치 합:", weights.sum(axis=-1))  # 모두 1

# 4. Word Embedding 유사도
word_embeddings = {
    "king": np.array([0.8, 0.3, 0.1, 0.9]),
    "queen": np.array([0.7, 0.4, 0.2, 0.8]),
    "man": np.array([0.6, 0.1, 0.7, 0.2]),
    "woman": np.array([0.5, 0.2, 0.8, 0.1])
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("\n단어 유사도:")
print("king-queen:", cosine_similarity(
    word_embeddings["king"], word_embeddings["queen"]))
print("man-woman:", cosine_similarity(
    word_embeddings["man"], word_embeddings["woman"]))

# king - man + woman ≈ queen (analogy)
analogy = word_embeddings["king"] - word_embeddings["man"] + word_embeddings["woman"]
sim_queen = cosine_similarity(analogy, word_embeddings["queen"])
print("king - man + woman vs queen 유사도:", sim_queen)

12. 퀴즈

Q1. 벡터의 내적과 직교성의 관계는?

정답: 두 벡터의 내적이 0이면 두 벡터는 서로 직교(orthogonal)합니다.

설명: 내적 공식 uv=uvcosθ\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\|\|\mathbf{v}\|\cos\theta에서 cos90°=0\cos 90° = 0이므로, 두 벡터가 직각을 이룰 때 내적이 0이 됩니다. 이를 이용해 정규직교기저를 구성하거나 직교 투영을 계산합니다.

Q2. 역행렬이 존재하는 조건은?

정답: 정방행렬의 행렬식(determinant)이 0이 아닐 때 역행렬이 존재합니다.

설명: det(A)0\det(A) \neq 0이면 AA는 가역행렬(invertible matrix)입니다. 행렬식이 0인 행렬은 특이행렬(singular matrix)이라 하며 역행렬이 존재하지 않습니다. 이는 행렬의 랭크(rank)가 행렬의 크기보다 작음을 의미하고, 해당 선형 연립방정식은 유일한 해를 가지지 않습니다.

Q3. 고유값과 고유벡터의 기하학적 의미는?

정답: 고유벡터는 선형 변환에 의해 방향이 바뀌지 않는 벡터이고, 고유값은 그 벡터가 몇 배로 늘어나는지를 나타냅니다.

설명: Av=λvA\mathbf{v} = \lambda\mathbf{v}에서 행렬 AA로 변환해도 벡터 v\mathbf{v}의 방향은 유지되고 크기만 λ\lambda배 변합니다. λ>1\lambda > 1이면 늘어남, 0<λ<10 < \lambda < 1이면 줄어듦, λ<0\lambda < 0이면 방향이 반전됩니다. 고유값 분해는 PCA, 행렬 거듭제곱 등에 활용됩니다.

Q4. SVD와 PCA의 관계는 무엇인가요?

정답: 중심화된 데이터 행렬에 SVD를 적용하면 PCA와 동일한 결과를 얻습니다. SVD의 우특이벡터(V)가 PCA의 주성분(principal components)에 해당합니다.

설명: 데이터 행렬 XX를 중심화한 후 SVD를 수행하면 X=UΣVTX = U\Sigma V^T가 됩니다. 이때 VV의 열벡터들이 주성분이며, 특이값 σi\sigma_i의 제곱에 비례하는 분산을 설명합니다. 계산적으로 공분산 행렬의 고유값 분해보다 수치적으로 안정적이므로 sklearn의 PCA도 내부적으로 SVD를 사용합니다.

Q5. 랭크-널리티 정리(Rank-Nullity Theorem)를 설명하세요.

정답: m×nm \times n 행렬 AA에 대해 rank(A)+nullity(A)=n\text{rank}(A) + \text{nullity}(A) = n이 성립합니다.

설명: rank(A)는 열공간(column space)의 차원이고, nullity(A)는 영공간(null space)의 차원입니다. 열의 수 nn은 항상 이 두 차원의 합과 같습니다. 예를 들어 3×4 행렬의 랭크가 3이라면 영공간의 차원은 1입니다. 이 정리는 연립방정식의 해의 구조를 이해하는 데 핵심적입니다.


정리 및 학습 로드맵

선형대수학의 핵심 개념들은 서로 긴밀하게 연결되어 있습니다:

벡터 → 행렬 → 행렬식 → 연립방정식
  ↓       ↓
벡터 공간 → 선형 변환 → 내적 공간
고유값/고유벡터 → SVDAI/ML 응용

다음 단계로 추천하는 학습 자료:

  1. Gilbert Strang의 Linear Algebra (MIT OCW 18.06): 가장 체계적인 선형대수학 강의
  2. 3Blue1Brown - Essence of Linear Algebra: 시각적 직관 이해에 탁월
  3. NumPy 공식 문서: 선형대수 모듈 (numpy.linalg) 실습
  4. scikit-learn PCA 문서: 실제 ML 파이프라인에서의 활용
  5. Matrix Methods in Data Analysis (MIT 18.065): 고급 응용 과정

선형대수학은 단순히 수식을 외우는 것이 아니라 행렬과 벡터의 기하학적 의미를 이해하는 것이 핵심입니다. 꾸준한 코드 실습과 시각화를 통해 깊은 이해를 쌓아가시길 바랍니다.

Linear Algebra Complete Guide: Zero to Hero from Vectors to SVD

Introduction

Linear Algebra is the cornerstone of modern mathematics and engineering. From data science and machine learning to computer graphics and quantum mechanics, it appears in virtually every technical discipline. This guide takes you from the very basics of vectors and matrices all the way to Singular Value Decomposition (SVD), building intuition alongside rigorous definitions.

Key References:

  • Gilbert Strang, MIT OCW 18.06 Linear Algebra
  • 3Blue1Brown: Essence of Linear Algebra (YouTube)
  • NumPy official documentation (numpy.org)
  • scikit-learn PCA documentation

1. Vectors

Definition and Notation

A vector is a mathematical object possessing both magnitude (size) and direction. An nn-dimensional vector v\mathbf{v} is written as a column:

v=(v1v2vn)Rn\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} \in \mathbb{R}^n

Geometrically, a vector in R2\mathbb{R}^2 or R3\mathbb{R}^3 is an arrow pointing from the origin to a specific point.

Vector Operations

Addition and Scalar Multiplication:

u+v=(u1+v1u2+v2),cv=(cv1cv2)\mathbf{u} + \mathbf{v} = \begin{pmatrix} u_1 + v_1 \\ u_2 + v_2 \end{pmatrix}, \quad c\mathbf{v} = \begin{pmatrix} cv_1 \\ cv_2 \end{pmatrix}

Dot Product (Inner Product):

uv=i=1nuivi=uvcosθ\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta

The dot product is zero precisely when the two vectors are orthogonal (perpendicular).

Cross Product — defined only in 3D:

u×v=ijku1u2u3v1v2v3\mathbf{u} \times \mathbf{v} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix}

The result is a vector perpendicular to both u\mathbf{u} and v\mathbf{v}, with magnitude uvsinθ\|\mathbf{u}\|\|\mathbf{v}\|\sin\theta.

Norms

  • L1 norm (Manhattan): v1=ivi\|\mathbf{v}\|_1 = \sum_{i} |v_i|
  • L2 norm (Euclidean): v2=ivi2\|\mathbf{v}\|_2 = \sqrt{\sum_{i} v_i^2}
  • Lp norm: vp=(ivip)1/p\|\mathbf{v}\|_p = \left(\sum_{i} |v_i|^p\right)^{1/p}
  • Infinity norm: v=maxivi\|\mathbf{v}\|_\infty = \max_i |v_i|

Unit vector (normalization):

v^=vv\hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|}

import numpy as np

# Define vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Addition
print("Sum:", a + b)                        # [5 7 9]

# Scalar multiplication
print("Scalar mult:", 3 * a)               # [3 6 9]

# Dot product
dot = np.dot(a, b)
print("Dot product:", dot)                  # 32

# Cross product (3D only)
cross = np.cross(a, b)
print("Cross product:", cross)              # [-3  6 -3]

# Norms
print("L2 norm:", np.linalg.norm(a))        # 3.7416...
print("L1 norm:", np.linalg.norm(a, ord=1)) # 6.0
print("Inf norm:", np.linalg.norm(a, ord=np.inf))  # 3.0

# Unit vector
unit = a / np.linalg.norm(a)
print("Unit vector:", unit)
print("Magnitude:", np.linalg.norm(unit))   # 1.0

# Angle between vectors
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
angle_rad = np.arccos(np.clip(cos_theta, -1, 1))
print("Angle (degrees):", np.degrees(angle_rad))

# Orthogonality check
v1 = np.array([1, 0])
v2 = np.array([0, 1])
print("v1 · v2 =", np.dot(v1, v2))  # 0, orthogonal!

2. Matrices

Definition and Types

An m×nm \times n matrix is a rectangular array of numbers with mm rows and nn columns:

A=(a11a12a1na21a22a2nam1am2amn)A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}

Important special matrices:

TypeDefinitionProperty
Identity matrixIij=1I_{ij} = 1 if i=ji=j, else 0AI=IA=AAI = IA = A
Diagonal matrixOff-diagonal entries are zero
Symmetric matrixA=ATA = A^TReal eigenvalues
Orthogonal matrixATA=IA^T A = Idet(A)=±1\det(A) = \pm 1
Positive definitexTAx>0\mathbf{x}^T A \mathbf{x} > 0 for all x0\mathbf{x} \neq 0All eigenvalues positive

Matrix Operations

Matrix Multiplication:

(AB)ij=k=1naikbkj(AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}

Matrix multiplication is not commutative in general: ABBAAB \neq BA.

Transpose: (AT)ij=Aji(A^T)_{ij} = A_{ji}

Key property: (AB)T=BTAT(AB)^T = B^T A^T

Inverse: A1A^{-1} satisfies AA1=A1A=IAA^{-1} = A^{-1}A = I. Exists only when det(A)0\det(A) \neq 0.

Key properties: (AB)1=B1A1(AB)^{-1} = B^{-1}A^{-1}, (AT)1=(A1)T(A^T)^{-1} = (A^{-1})^T

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix addition
print("A + B:\n", A + B)

# Matrix multiplication
C = A @ B          # preferred syntax in Python 3
print("A @ B:\n", C)

# Element-wise multiplication (Hadamard product)
print("A * B (Hadamard):\n", A * B)

# Transpose
print("A^T:\n", A.T)

# Inverse
A_inv = np.linalg.inv(A)
print("A^(-1):\n", A_inv)
print("A @ A^(-1):\n", np.round(A @ A_inv))  # Identity

# Special matrices
print("3x3 Identity:\n", np.eye(3))
print("Diagonal:\n", np.diag([1, 2, 3]))

# Properties
print("(AB)^T == B^T A^T:", np.allclose((A@B).T, B.T @ A.T))
print("(AB)^(-1) == B^(-1) A^(-1):",
      np.allclose(np.linalg.inv(A@B), np.linalg.inv(B) @ np.linalg.inv(A)))

# Symmetric matrix check
S = A + A.T
print("S is symmetric:", np.allclose(S, S.T))

# Frobenius norm of a matrix
print("Frobenius norm:", np.linalg.norm(A, 'fro'))

3. Determinants

Definition and Computation

The determinant of a square matrix is a scalar that encodes crucial information about the matrix.

2x2 determinant:

det(abcd)=adbc\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc

3x3 determinant by cofactor expansion:

det(A)=a11C11+a12C12+a13C13\det(A) = a_{11} C_{11} + a_{12} C_{12} + a_{13} C_{13}

where Cij=(1)i+jMijC_{ij} = (-1)^{i+j} M_{ij} is the cofactor and MijM_{ij} is the minor.

Geometric Interpretation

The absolute value of the determinant equals the volume of the parallelepiped formed by the row vectors. A determinant of zero means the rows are linearly dependent — the matrix "flattens" space.

Key Properties

  1. det(I)=1\det(I) = 1
  2. det(AB)=det(A)det(B)\det(AB) = \det(A) \det(B)
  3. det(AT)=det(A)\det(A^T) = \det(A)
  4. det(A1)=1/det(A)\det(A^{-1}) = 1 / \det(A)
  5. Swapping two rows negates the determinant
  6. det(A)=0    A\det(A) = 0 \iff A is singular (non-invertible)

Cramer's Rule

For Ax=bA\mathbf{x} = \mathbf{b} with det(A)0\det(A) \neq 0:

xi=det(Ai)det(A)x_i = \frac{\det(A_i)}{\det(A)}

where AiA_i is AA with column ii replaced by b\mathbf{b}.

import numpy as np

# 2x2 determinant
A2 = np.array([[3, 1], [2, 4]])
print("det([[3,1],[2,4]]):", np.linalg.det(A2))  # 10

# 3x3 determinant
A3 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 10]])
print("det(A3):", np.linalg.det(A3))  # -3

# Properties verification
B = np.array([[2, 0], [0, 3]])
print("det(A2) * det(B) =", np.linalg.det(A2) * np.linalg.det(B))
print("det(A2 @ B) =", np.linalg.det(A2 @ B))

# Singular matrix (det = 0)
singular = np.array([[1, 2], [2, 4]])  # row 2 = 2 * row 1
print("Singular matrix det:", np.linalg.det(singular))  # ~0

# Cramer's Rule
def cramers_rule(A, b):
    det_A = np.linalg.det(A)
    if abs(det_A) < 1e-10:
        raise ValueError("Matrix is singular")
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        Ai = A.copy().astype(float)
        Ai[:, i] = b
        x[i] = np.linalg.det(Ai) / det_A
    return x

A = np.array([[2, 1], [5, 3]], dtype=float)
b = np.array([1, 2])
x_cramer = cramers_rule(A, b)
x_solve = np.linalg.solve(A, b)
print("\nCramer's rule:", x_cramer)
print("np.linalg.solve:", x_solve)
print("Results match:", np.allclose(x_cramer, x_solve))

4. Systems of Linear Equations and Gaussian Elimination

Linear Systems

A system of mm equations in nn unknowns:

a11x1+a12x2++a1nxn=b1a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1 \vdots am1x1+am2x2++amnxn=bma_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n = b_m

In matrix form: Ax=bA\mathbf{x} = \mathbf{b}

A system has either: no solution (inconsistent), exactly one solution (consistent, det(A)0\det(A) \neq 0), or infinitely many solutions (underdetermined).

Augmented Matrix and Row Operations

The augmented matrix [Ab][A|\mathbf{b}] combines AA and b\mathbf{b}. Three elementary row operations preserve solution sets:

  1. Row swap: RiRjR_i \leftrightarrow R_j
  2. Row scaling: RicRiR_i \leftarrow c \cdot R_i, c0c \neq 0
  3. Row addition: RiRi+cRjR_i \leftarrow R_i + c \cdot R_j

Gaussian elimination uses these operations to produce Row Echelon Form (REF). Gauss-Jordan elimination produces Reduced Row Echelon Form (RREF).

LU Decomposition

A=LUA = LU where LL is lower triangular and UU is upper triangular. Useful for solving Ax=bA\mathbf{x} = \mathbf{b} efficiently for multiple right-hand sides.

import numpy as np
from scipy import linalg

def gauss_jordan_solve(A, b):
    """Solve Ax = b using Gauss-Jordan elimination."""
    n = len(b)
    aug = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])

    for col in range(n):
        # Partial pivoting
        max_row = np.argmax(np.abs(aug[col:, col])) + col
        aug[[col, max_row]] = aug[[max_row, col]]

        # Normalize pivot row
        aug[col] = aug[col] / aug[col, col]

        # Eliminate column
        for row in range(n):
            if row != col:
                aug[row] -= aug[row, col] * aug[col]

    return aug[:, -1]

# Example: 3x3 system
A = np.array([[2, 1, -1],
              [-3, -1, 2],
              [-2, 1, 2]], dtype=float)
b = np.array([8, -11, -3], dtype=float)

x = gauss_jordan_solve(A.copy(), b.copy())
print("Solution:", x)  # [2. 3. -1.]
print("Verification A@x:", A @ x)

# LU Decomposition
P, L, U = linalg.lu(A)
print("\nL:\n", np.round(L, 3))
print("U:\n", np.round(U, 3))

# Solve using LU
# Ly = Pb, then Ux = y
Pb = P.T @ b
y = linalg.solve_triangular(L, Pb, lower=True)
x_lu = linalg.solve_triangular(U, y)
print("LU solution:", x_lu)

# Checking solution existence
A_rank = np.linalg.matrix_rank(A)
print("\nRank of A:", A_rank)
A_aug = np.hstack([A, b.reshape(-1, 1)])
aug_rank = np.linalg.matrix_rank(A_aug)
print("Rank of augmented matrix:", aug_rank)
# Consistent iff rank(A) == rank([A|b])

5. Vector Spaces

Definition and Axioms

A vector space VV over R\mathbb{R} is a set closed under addition and scalar multiplication satisfying 8 axioms:

  1. Closure under addition: u+vV\mathbf{u} + \mathbf{v} \in V
  2. Commutativity: u+v=v+u\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}
  3. Associativity: (u+v)+w=u+(v+w)(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})
  4. Zero vector: 0\exists\, \mathbf{0} such that v+0=v\mathbf{v} + \mathbf{0} = \mathbf{v}
  5. Additive inverse: v\exists\, {-}\mathbf{v} such that v+(v)=0\mathbf{v} + (-\mathbf{v}) = \mathbf{0}
  6. Closure under scalar multiplication: cvVc\mathbf{v} \in V
  7. Distributivity 1: c(u+v)=cu+cvc(\mathbf{u} + \mathbf{v}) = c\mathbf{u} + c\mathbf{v}
  8. Distributivity 2: (c+d)v=cv+dv(c + d)\mathbf{v} = c\mathbf{v} + d\mathbf{v}

Subspaces

A subset WVW \subseteq V is a subspace if it contains the zero vector, and is closed under addition and scalar multiplication.

Linear Independence

Vectors v1,,vk\mathbf{v}_1, \ldots, \mathbf{v}_k are linearly independent if:

c1v1+c2v2++ckvk=0    c1=c2==ck=0c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0} \implies c_1 = c_2 = \cdots = c_k = 0

Otherwise, they are linearly dependent (at least one is a linear combination of the others).

Basis and Dimension

A basis for VV is a linearly independent set that spans VV. The dimension of VV equals the number of vectors in any basis.

Standard basis of Rn\mathbb{R}^n: e1=(1,0,,0)\mathbf{e}_1 = (1,0,\ldots,0), ..., en=(0,,0,1)\mathbf{e}_n = (0,\ldots,0,1)

Four Fundamental Subspaces

For ARm×nA \in \mathbb{R}^{m \times n}:

SubspaceLives inDimension
Row space (rowsp)Rn\mathbb{R}^nrr
Column space (colsp)Rm\mathbb{R}^mrr
Null space (ker)Rn\mathbb{R}^nnrn - r
Left null spaceRm\mathbb{R}^mmrm - r

where r=rank(A)r = \text{rank}(A). Rank-Nullity Theorem: r+(nr)=nr + (n-r) = n.

import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, 3, 4],
              [2, 4, 6, 8],   # 2 * row 1
              [1, 0, 1, 2]])

print("Shape:", A.shape)
print("Rank:", np.linalg.matrix_rank(A))  # 2

# Null space basis
ns = null_space(A)
print("Null space dimension:", ns.shape[1])  # 4 - rank = 2
print("Null space basis:\n", ns)

# Verify: A @ ns should be ~0
print("A @ null_space:\n", np.round(A @ ns, 10))

# Column space basis via QR
Q, R = np.linalg.qr(A.T)  # QR of A^T for column space
rank = np.linalg.matrix_rank(A)
col_basis = Q[:, :rank]
print("Column space basis shape:", col_basis.shape)

# Linear independence check
v1 = np.array([1, 2, 3])
v2 = np.array([0, 1, 2])
v3 = np.array([1, 3, 5])  # v1 + v2
M = np.column_stack([v1, v2, v3])
print("\nRank of [v1, v2, v3]:", np.linalg.matrix_rank(M))  # 2, linearly dependent!

v3_indep = np.array([1, 0, 1])
M2 = np.column_stack([v1, v2, v3_indep])
print("Rank of [v1, v2, v3_indep]:", np.linalg.matrix_rank(M2))  # 3, linearly independent

6. Linear Transformations

Definition

A function T:VWT: V \to W is a linear transformation if for all u,vV\mathbf{u}, \mathbf{v} \in V and cRc \in \mathbb{R}:

  1. T(u+v)=T(u)+T(v)T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) (additivity)
  2. T(cv)=cT(v)T(c\mathbf{v}) = cT(\mathbf{v}) (homogeneity)

Equivalently: T(cu+dv)=cT(u)+dT(v)T(c\mathbf{u} + d\mathbf{v}) = cT(\mathbf{u}) + dT(\mathbf{v}) (superposition)

Matrix Representation

Every linear transformation from Rn\mathbb{R}^n to Rm\mathbb{R}^m is represented by some m×nm \times n matrix AA: T(x)=AxT(\mathbf{x}) = A\mathbf{x}.

Kernel: ker(T)={vV:T(v)=0}\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}\}

Image: Im(T)={T(v):vV}\text{Im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\}

Rank-Nullity: dim(kerT)+dim(ImT)=dimV\dim(\ker T) + \dim(\text{Im} T) = \dim V

Geometric Transformations in 2D

Rotation by angle θ\theta: R(θ)=(cosθsinθsinθcosθ)R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}

Reflection about x-axis: Fx=(1001)F_x = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}

Uniform scaling by factor s: S(s)=(s00s)S(s) = \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix}

Horizontal shear: H=(1k01)H = \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix}

import numpy as np

# Rotation
def rotation_2d(theta_deg):
    theta = np.radians(theta_deg)
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

R90 = rotation_2d(90)
v = np.array([1.0, 0.0])
print("Rotate [1,0] by 90 deg:", R90 @ v)  # [0, 1]

# Composition: rotate then reflect
Fx = np.array([[1, 0], [0, -1]])
T_composed = Fx @ R90
print("Rotate then reflect:", T_composed @ v)  # [0, -1]

# Orthogonal matrix properties
print("R90^T @ R90 = I:", np.allclose(R90.T @ R90, np.eye(2)))
print("det(R90) =", np.linalg.det(R90))  # 1.0

# Shear transformation
shear = np.array([[1, 2], [0, 1]])  # horizontal shear by 2
pts = np.array([[0, 1, 1, 0],
                [0, 0, 1, 1]], dtype=float)
transformed = shear @ pts
print("Original corners:\n", pts)
print("Sheared corners:\n", transformed)
print("Area preserved? det(shear) =", np.linalg.det(shear))  # 1

# Projection onto a line
# Project onto the line y = mx (unit direction vector u)
u = np.array([1, 1]) / np.sqrt(2)  # y = x direction
P_proj = np.outer(u, u)  # projection matrix
print("\nProjection matrix:\n", P_proj)
w = np.array([3, 1])
print("Projection of [3,1] onto y=x:", P_proj @ w)  # [2, 2]

7. Inner Product Spaces

Inner Product Definition

An inner product ,:V×VR\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R} satisfies:

  1. Positive definiteness: v,v0\langle \mathbf{v}, \mathbf{v} \rangle \geq 0, with equality iff v=0\mathbf{v} = \mathbf{0}
  2. Symmetry: u,v=v,u\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle
  3. Bilinearity: au+bv,w=au,w+bv,w\langle a\mathbf{u} + b\mathbf{v}, \mathbf{w} \rangle = a\langle \mathbf{u}, \mathbf{w} \rangle + b\langle \mathbf{v}, \mathbf{w} \rangle

The induced norm is v=v,v\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}.

Cauchy-Schwarz Inequality

u,vuv|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|

Equality holds iff u\mathbf{u} and v\mathbf{v} are parallel. This implies the triangle inequality: u+vu+v\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|.

Gram-Schmidt Process

Given linearly independent vectors v1,,vk\mathbf{v}_1, \ldots, \mathbf{v}_k, produces an orthonormal basis:

u1=v1\mathbf{u}_1 = \mathbf{v}_1 uj=vji=1j1vj,uiui,uiui\mathbf{u}_j = \mathbf{v}_j - \sum_{i=1}^{j-1} \frac{\langle \mathbf{v}_j, \mathbf{u}_i \rangle}{\langle \mathbf{u}_i, \mathbf{u}_i \rangle} \mathbf{u}_i

Then normalize: ej=uj/uj\mathbf{e}_j = \mathbf{u}_j / \|\mathbf{u}_j\|.

This is the basis of QR decomposition: A=QRA = QR where QQ has orthonormal columns.

import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt orthonormalization."""
    basis = []
    for v in vectors:
        w = np.array(v, dtype=float)
        for b in basis:
            w = w - np.dot(w, b) * b
        norm = np.linalg.norm(w)
        if norm > 1e-10:
            basis.append(w / norm)
    return np.array(basis)

# Apply Gram-Schmidt
v1 = np.array([1, 1, 0])
v2 = np.array([1, 0, 1])
v3 = np.array([0, 1, 1])

Q = gram_schmidt([v1, v2, v3])
print("Orthonormal basis:")
for i, q in enumerate(Q):
    print(f"  e{i+1} = {q}")

# Verify orthonormality: Q @ Q^T should equal I
print("\nQ @ Q^T:\n", np.round(Q @ Q.T))

# QR decomposition (equivalent to Gram-Schmidt)
A = np.column_stack([v1, v2, v3])
Q_qr, R = np.linalg.qr(A)
print("\nQR decomposition Q:\n", Q_qr)
print("R (upper triangular):\n", np.round(R, 4))

# Projection
def project(v, u):
    """Project v onto u."""
    return (np.dot(v, u) / np.dot(u, u)) * u

# Orthogonal complement
x = np.array([3, 2, 1])
q1 = Q[0]
x_parallel = project(x, q1)
x_perp = x - x_parallel
print("\nOrthogonal decomposition:")
print("x:", x)
print("x_parallel to e1:", x_parallel)
print("x_perpendicular:", x_perp)
print("Perpendicular check:", np.dot(x_perp, q1))  # ~0

8. Eigenvalues and Eigenvectors

Definition

For a square matrix AA, a nonzero vector v\mathbf{v} and scalar λ\lambda satisfying:

Av=λvA\mathbf{v} = \lambda\mathbf{v}

are called an eigenvector and eigenvalue of AA, respectively.

An eigenvector is only scaled (not rotated) by AA. The eigenvalue λ\lambda gives the scale factor.

Characteristic Equation

Rearranging: (AλI)v=0(A - \lambda I)\mathbf{v} = \mathbf{0}. For nontrivial solutions, the matrix (AλI)(A - \lambda I) must be singular:

det(AλI)=0\det(A - \lambda I) = 0

This is the characteristic polynomial of AA. Its roots are the eigenvalues.

Eigendecomposition (Diagonalization)

If AA has nn linearly independent eigenvectors p1,,pn\mathbf{p}_1, \ldots, \mathbf{p}_n:

A=PΛP1A = P \Lambda P^{-1}

where P=[p1pn]P = [\mathbf{p}_1 \cdots \mathbf{p}_n] and Λ=diag(λ1,,λn)\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n).

Application: Ak=PΛkP1A^k = P \Lambda^k P^{-1} — easy to compute powers!

Spectral Theorem for Symmetric Matrices

If A=ATA = A^T (real symmetric):

  1. All eigenvalues are real
  2. Eigenvectors for distinct eigenvalues are orthogonal
  3. AA is orthogonally diagonalizable: A=QΛQTA = Q \Lambda Q^T where QT=Q1Q^T = Q^{-1}
import numpy as np
from scipy.linalg import eigh

# General eigendecomposition
A = np.array([[4, 1],
              [2, 3]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)      # [5., 2.]
print("Eigenvectors (columns):\n", eigenvectors)

# Verify A @ v = lambda * v
for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]
    print(f"\nFor lambda = {lam:.2f}:")
    print("  A @ v  =", A @ v)
    print("  lam * v =", lam * v)
    print("  Match:", np.allclose(A @ v, lam * v))

# Diagonalization: A = P @ Lambda @ P_inv
P = eigenvectors
Lambda = np.diag(eigenvalues)
P_inv = np.linalg.inv(P)
print("\nA = P @ Lambda @ P_inv:", np.allclose(P @ Lambda @ P_inv, A))

# Matrix power via diagonalization: A^10
k = 10
A_k = P @ np.diag(eigenvalues**k) @ P_inv
print("A^10 (diag):", np.round(A_k.real))
print("A^10 (direct):", np.linalg.matrix_power(A, k))

# Symmetric matrix: use eigh (more stable, guaranteed real)
B = np.array([[4, 2, 0],
              [2, 3, 1],
              [0, 1, 5]])

evals, evecs = eigh(B)
print("\nSymmetric matrix eigenvalues:", evals)
print("Orthogonal eigenvectors (Q^T @ Q = I):\n",
      np.round(evecs.T @ evecs))

# Spectral decomposition: A = Q @ Lambda @ Q^T
A_reconstructed = evecs @ np.diag(evals) @ evecs.T
print("Reconstruction error:", np.linalg.norm(B - A_reconstructed))

# Trace = sum of eigenvalues, Det = product of eigenvalues
print("\ntrace(A) =", np.trace(A), "== sum of eigenvalues:", np.sum(eigenvalues).real)
print("det(A) =", np.linalg.det(A), "== product of eigenvalues:", np.prod(eigenvalues).real)

9. Singular Value Decomposition (SVD)

Definition

Every m×nm \times n real matrix AA can be decomposed as:

A=UΣVTA = U \Sigma V^T

  • URm×mU \in \mathbb{R}^{m \times m}: left singular vectors (orthonormal columns, eigenvectors of AATAA^T)
  • ΣRm×n\Sigma \in \mathbb{R}^{m \times n}: diagonal matrix with singular values σ1σ20\sigma_1 \geq \sigma_2 \geq \cdots \geq 0
  • VRn×nV \in \mathbb{R}^{n \times n}: right singular vectors (orthonormal columns, eigenvectors of ATAA^TA)

The singular values satisfy σi=λi(ATA)\sigma_i = \sqrt{\lambda_i(A^T A)}.

Geometric Interpretation

A=UΣVTA = U \Sigma V^T represents three stages:

  1. VTV^T: rotate/reflect input space
  2. Σ\Sigma: scale along coordinate axes (possibly different dimensions)
  3. UU: rotate/reflect output space

Low-Rank Approximation

The best rank-kk approximation (minimizing Frobenius norm) is:

Ak=i=1kσiuiviTA_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T

This is the foundation of image compression and collaborative filtering.

Connection to PCA

For a mean-centered data matrix XX: SVD gives X=UΣVTX = U\Sigma V^T. The right singular vectors VV are the principal components, and the singular values relate to explained variance.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Basic SVD
A = np.array([[3, 2, 2],
              [2, 3, -2]], dtype=float)

U, S, Vt = np.linalg.svd(A)
print("U:\n", np.round(U, 4))
print("Singular values:", np.round(S, 4))
print("Vt:\n", np.round(Vt, 4))

# Reconstruction
m, n = A.shape
Sigma = np.zeros((m, n))
Sigma[:min(m,n), :min(m,n)] = np.diag(S)
A_rec = U @ Sigma @ Vt
print("Reconstruction error:", np.linalg.norm(A - A_rec))

# Low-rank approximation
def low_rank_approx(A, k):
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Demonstrate on random matrix
np.random.seed(42)
M = np.random.randn(50, 50)
for k in [1, 5, 10, 25]:
    M_k = low_rank_approx(M, k)
    rel_err = np.linalg.norm(M - M_k, 'fro') / np.linalg.norm(M, 'fro')
    storage_ratio = k * (50 + 50 + 1) / (50 * 50)
    print(f"k={k:2d}: relative error={rel_err:.4f}, storage={storage_ratio:.3f}")

# SVD-based PCA on real data
digits = load_digits()
X = digits.data  # (1797, 64)
X_centered = X - X.mean(axis=0)

U_d, S_d, Vt_d = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = S_d**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

print("\nDigits dataset SVD-PCA:")
print("Explained variance ratio (top 5):", np.round(explained_ratio[:5], 4))
print("Cumulative (top 10):", round(explained_ratio[:10].sum(), 4))

# Compare with sklearn PCA
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_centered)
print("sklearn PCA explained ratio:", np.round(pca.explained_variance_ratio_, 4))

# Condition number (ratio of largest to smallest singular value)
print("\nCondition number of A:", S[0] / S[-1])
print("Well-conditioned? (< 100):", (S[0] / S[-1]) < 100)

10. Matrix Factorizations

Overview and Comparison

FactorizationFormConditionsApplications
LUA=LUA = LUSquare matrixSolving Ax=bA\mathbf{x}=\mathbf{b}
QRA=QRA = QRAny matrixLeast squares, eigenvalues
CholeskyA=LLTA = LL^TSymmetric positive definiteOptimization, statistics
SVDA=UΣVTA = U\Sigma V^TAny matrixPCA, compression, pseudo-inverse
EigendecompositionA=PΛP1A = P\Lambda P^{-1}DiagonalizableDynamics, spectral analysis

Least Squares via QR

For overdetermined AxbA\mathbf{x} \approx \mathbf{b} (more equations than unknowns), minimize Axb2\|A\mathbf{x} - \mathbf{b}\|^2. Using QR: x^=R1QTb\hat{\mathbf{x}} = R^{-1}Q^T\mathbf{b}.

Pseudo-Inverse via SVD

For any matrix A=UΣVTA = U\Sigma V^T, the Moore-Penrose pseudo-inverse is:

A+=VΣ+UTA^+ = V \Sigma^+ U^T

where Σ+\Sigma^+ inverts nonzero diagonal entries.

import numpy as np
from scipy import linalg

np.random.seed(42)

# QR decomposition and least squares
# Overdetermined system: 5 equations, 2 unknowns
A = np.random.randn(5, 2)
b_true = np.array([1, 2])
b = A @ b_true + 0.1 * np.random.randn(5)

# QR least squares
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)
print("QR least squares:", x_qr)
print("True values:", b_true)

# numpy lstsq
x_lstsq, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
print("lstsq:", x_lstsq)

# Cholesky decomposition
B = np.array([[6, 3, 2],
              [3, 5, 1],
              [2, 1, 4]], dtype=float)

L = np.linalg.cholesky(B)
print("\nCholesky L:\n", np.round(L, 4))
print("L @ L^T = B:", np.allclose(L @ L.T, B))

# Cholesky solve
b_sys = np.array([1, 2, 3], dtype=float)
y = linalg.solve_triangular(L, b_sys, lower=True)
x_chol = linalg.solve_triangular(L.T, y)
print("Cholesky solution:", x_chol)
print("Verify:", np.allclose(B @ x_chol, b_sys))

# Pseudo-inverse via SVD
A_rect = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
A_pinv = np.linalg.pinv(A_rect)
print("\nPseudo-inverse shape:", A_pinv.shape)  # (2, 3)
print("A_pinv @ A = I_2x2:\n", np.round(A_pinv @ A_rect))

# SVD decomposition comparison
U, S, Vt = np.linalg.svd(A_rect, full_matrices=False)
A_pinv_manual = Vt.T @ np.diag(1/S) @ U.T
print("Manual pseudo-inverse matches:", np.allclose(A_pinv, A_pinv_manual))

11. Linear Algebra in AI/ML

Neural Networks and Matrix Multiplication

The forward pass of a neural network is a sequence of matrix multiplications:

h(l)=f ⁣(W(l)h(l1)+b(l))\mathbf{h}^{(l)} = f\!\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)

For a batch of nn samples with dd features: XRn×dX \in \mathbb{R}^{n \times d}, the entire batch is processed as one matrix multiplication.

Backpropagation computes gradients using the chain rule, which involves matrix transpose operations:

LW(l)=(Lh(l))(h(l1))T\frac{\partial L}{\partial W^{(l)}} = \left(\frac{\partial L}{\partial \mathbf{h}^{(l)}}\right) \left(\mathbf{h}^{(l-1)}\right)^T

Attention Mechanism in Transformers

The scaled dot-product attention is a core operation in Transformer models:

Attention(Q,K,V)=softmax ⁣(QKTdk)V\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Each query vector computes dot products with all key vectors, producing attention weights for a weighted sum of value vectors — pure linear algebra!

Dimensionality Reduction

PCA finds directions of maximum variance by eigendecomposing the covariance matrix C=1nXTXC = \frac{1}{n}X^TX. The eigenvectors with largest eigenvalues are the principal components.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# 1. Neural Network Forward Pass
np.random.seed(42)
batch_size, d_in, d_hidden, d_out = 32, 784, 256, 10

X_batch = np.random.randn(batch_size, d_in)
W1 = np.random.randn(d_in, d_hidden) * np.sqrt(2.0 / d_in)  # He init
b1 = np.zeros(d_hidden)
W2 = np.random.randn(d_hidden, d_out) * np.sqrt(2.0 / d_hidden)
b2 = np.zeros(d_out)

def relu(x): return np.maximum(0, x)
def softmax(x):
    exp_x = np.exp(x - x.max(axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)

H1 = relu(X_batch @ W1 + b1)
logits = H1 @ W2 + b2
probs = softmax(logits)
print("Forward pass shapes:", X_batch.shape, "->", H1.shape, "->", probs.shape)
print("Probabilities sum to 1:", np.allclose(probs.sum(axis=1), 1))

# 2. Transformer Attention
def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask * -1e9
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

seq_len, d_model = 8, 64
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)

out, attn_weights = scaled_dot_product_attention(Q, K, V)
print("\nAttention output:", out.shape)
print("Attention weights sum:", attn_weights.sum(axis=-1))  # all 1s

# 3. PCA for Dimensionality Reduction
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_iris)
print("\nIris PCA:")
print("Original shape:", X_iris.shape, "-> Reduced:", X_pca.shape)
print("Explained variance:", np.round(pca.explained_variance_ratio_, 4))
print("Total variance explained:", round(pca.explained_variance_ratio_.sum(), 4))
print("Principal components (loadings):\n", np.round(pca.components_, 4))

# 4. Word Embedding Analogy
embeddings = {
    "king":   np.array([0.9, 0.1, 0.8, 0.2, 0.3]),
    "queen":  np.array([0.8, 0.2, 0.7, 0.3, 0.9]),
    "man":    np.array([0.8, 0.1, 0.2, 0.1, 0.1]),
    "woman":  np.array([0.7, 0.2, 0.1, 0.2, 0.8]),
}

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman ≈ queen
analogy_vec = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print("\nWord analogy (king - man + woman):")
for word, emb in embeddings.items():
    print(f"  sim to {word}: {cosine_sim(analogy_vec, emb):.4f}")

12. Quiz

Q1. What is the geometric meaning of the dot product?

Answer: The dot product uv=uvcosθ\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\|\|\mathbf{v}\|\cos\theta gives the product of the magnitudes multiplied by the cosine of the angle between them. It measures "how much one vector goes in the direction of another."

Key applications: Testing orthogonality (dot product = 0), computing projections, measuring similarity in machine learning (cosine similarity).

Q2. When does a matrix NOT have an inverse?

Answer: A square matrix has no inverse (is singular) when its determinant equals zero. Equivalently, the matrix has linearly dependent rows/columns, its rank is less than nn, or the system Ax=bA\mathbf{x} = \mathbf{b} does not have a unique solution.

Example: Any matrix with two identical rows, or rows that sum to zero, is singular.

Q3. What is the difference between eigendecomposition and SVD?

Answer: Eigendecomposition (A=PΛP1A = P\Lambda P^{-1}) applies only to square matrices and requires linearly independent eigenvectors. SVD (A=UΣVTA = U\Sigma V^T) works for any matrix (including rectangular) and always exists. For symmetric positive definite matrices, both coincide.

When to use SVD: When the matrix is rectangular, nearly singular (SVD is more numerically stable), or when you need the best low-rank approximation.

Q4. What does rank deficiency mean for solving Ax = b?

Answer: If rank(A)<n\text{rank}(A) < n (rank deficient), the system Ax=bA\mathbf{x} = \mathbf{b} either has no solution (when b\mathbf{b} is not in the column space of AA) or infinitely many solutions (when the null space is nontrivial). A unique solution requires full column rank.

Practical implication: In machine learning, underdetermined systems correspond to models with more parameters than training samples, leading to overfitting.

Q5. How does the Gram-Schmidt process connect to QR decomposition?

Answer: QR decomposition (A=QRA = QR) is the matrix formulation of the Gram-Schmidt process. The columns of QQ are the orthonormal vectors produced by Gram-Schmidt applied to the columns of AA. RR is upper triangular and captures the change-of-basis coefficients.

Applications: Solving least squares problems numerically, computing eigenvalues (QR algorithm), and orthogonalizing data features.


Summary and Learning Roadmap

The core concepts of linear algebra are deeply interconnected:

Vectors -> Matrices -> Determinants -> Linear Systems
    |           |
Vector Spaces -> Linear Transforms -> Inner Products
    |
Eigenvalues/Eigenvectors -> SVD -> AI/ML Applications

Recommended next steps:

  1. Gilbert Strang's MIT OCW 18.06: The gold standard for linear algebra pedagogy — thorough, with excellent problem sets
  2. 3Blue1Brown - Essence of Linear Algebra: Build geometric intuition before or alongside the formal treatment
  3. NumPy linalg module: Practice every concept with code — numpy.linalg covers everything in this guide
  4. scikit-learn PCA documentation: See how linear algebra powers real ML pipelines
  5. MIT 18.065 (Matrix Methods in Data Analysis): Advanced applications in data science and deep learning

The secret to mastering linear algebra is developing geometric intuition — always visualize what a matrix operation does to space. Combine that intuition with computational practice in NumPy, and you will have a powerful tool for any technical field.