Mathematical Foundations for AI/ML: Complete Guide
Author: Youngju Kim (@fjvbn20031)
To truly understand machine learning and deep learning, you need to grasp the underlying mathematics. Many developers use frameworks without fully understanding why the math works the way it does. This guide aims to fill that gap.
Rather than listing formulas, we focus on why each concept matters and how it is used in deep learning, accompanied by Python/NumPy code you can run to verify the math yourself.
1. Linear Algebra
Linear algebra is the fundamental language of deep learning. The forward pass of a neural network is a sequence of matrix multiplications, and embeddings are points in a high-dimensional vector space.
1.1 Vectors
Definition and Geometric Meaning
A vector is a quantity with both magnitude and direction. Mathematically it is an ordered list of numbers; geometrically it is an arrow pointing to a location in $n$-dimensional space.
In deep learning, vectors appear everywhere: input data, hidden-layer activations, embedding representations — all are vectors.
Vector Addition and Scalar Multiplication
Addition is performed element-wise:

$$\mathbf{a} + \mathbf{b} = (a_1 + b_1, a_2 + b_2, \ldots, a_n + b_n)$$

Multiplying by a scalar $c$ scales each component:

$$c\mathbf{a} = (ca_1, ca_2, \ldots, ca_n)$$
Dot Product
The dot product measures the "similarity" between two vectors:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

Geometrically, it equals the product of magnitudes and the cosine of the angle between them:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$
A positive dot product indicates vectors pointing in the same general direction; negative means opposite; zero means orthogonal.
Cosine Similarity
To measure directional similarity independent of magnitude, normalize by both norms:

$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

The value ranges from $-1$ to $1$. Widely used to compare word embeddings and document representations in NLP.
Vector Norms
Norms measure the "length" or "size" of a vector:
- $L_1$ norm (Manhattan): $\|\mathbf{x}\|_1 = \sum_i |x_i|$
- $L_2$ norm (Euclidean): $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$
- $L_\infty$ norm (max norm): $\|\mathbf{x}\|_\infty = \max_i |x_i|$
In regularization, $L_1$ encourages sparsity while $L_2$ keeps weights uniformly small.
```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Dot product
dot_product = np.dot(a, b)  # 3*1 + 4*2 = 11
print(f"Dot product: {dot_product}")

# L2 norm
l2_norm_a = np.linalg.norm(a)  # sqrt(9+16) = 5
print(f"L2 norm of a: {l2_norm_a}")

# Cosine similarity
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine_sim:.4f}")

# Various norms
v = np.array([3.0, -4.0, 1.0])
print(f"L1 norm: {np.linalg.norm(v, ord=1)}")         # 8.0
print(f"L2 norm: {np.linalg.norm(v, ord=2):.4f}")     # 5.0990
print(f"Linf norm: {np.linalg.norm(v, ord=np.inf)}")  # 4.0
```
1.2 Matrices
Matrix Notation
A matrix is a rectangular array of numbers with $m$ rows and $n$ columns (an $m \times n$ matrix):

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

In deep learning, the weight matrix $W$ represents the linear transformation between layers.
Matrix Multiplication
If $A$ is $m \times n$ and $B$ is $n \times p$, then $C = AB$ is $m \times p$:

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

The key rule: the number of columns of the left matrix must equal the number of rows of the right matrix. Matrix multiplication is generally not commutative: $AB \neq BA$.
Transpose
Swap rows and columns: $(A^T)_{ij} = A_{ji}$.
Property: $(AB)^T = B^T A^T$
Inverse Matrix
For a square matrix $A$, its inverse $A^{-1}$ satisfies $AA^{-1} = A^{-1}A = I$. An inverse exists only when the determinant is non-zero.
Determinant
For a $2 \times 2$ matrix:

$$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$$

The determinant tells you how much the linear transformation stretches or compresses space. $\det(A) = 0$ means the transformation collapses dimensions, so no inverse exists.
Rank
The rank of a matrix is the maximum number of linearly independent rows (or columns). A lower rank means the transformation produces a lower-dimensional output space. Low-rank approximation is central to model compression techniques like LoRA.
```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
B = np.array([[9, 8, 7],
              [6, 5, 4],
              [3, 2, 1]])

# Matrix multiplication
C = np.dot(A, B)
print("Matrix product:\n", C)

# Transpose
print("Transpose of A:\n", A.T)

# Determinant
A2 = np.array([[3.0, 1.0],
               [2.0, 4.0]])
print(f"Determinant: {np.linalg.det(A2):.2f}")  # 3*4 - 1*2 = 10

# Inverse
A_inv = np.linalg.inv(A2)
print("Inverse:\n", A_inv)
print("A @ A_inv (should be identity):\n", np.round(A2 @ A_inv))

# Rank
print(f"Rank of A: {np.linalg.matrix_rank(A)}")  # 2 (rows are linearly dependent)
```
1.3 Eigenvalues and Eigenvectors
Definition
For a square matrix $A$, a non-zero vector $\mathbf{v}$ and scalar $\lambda$ satisfying:

$$A\mathbf{v} = \lambda\mathbf{v}$$

are called the eigenvector and eigenvalue, respectively.
The meaning is powerful: when $A$ transforms $\mathbf{v}$, the direction stays the same; only the magnitude changes by a factor of $\lambda$.
Eigendecomposition
For a symmetric matrix, we can decompose using orthogonal eigenvectors:

$$A = Q \Lambda Q^T$$

where $Q$ contains eigenvectors as columns and $\Lambda$ is the diagonal matrix of eigenvalues.
Relationship to PCA
The eigenvectors of the covariance matrix are the principal components. The eigenvector corresponding to the largest eigenvalue points in the direction of greatest variance — the most informative feature direction.
SVD (Singular Value Decomposition)
While eigendecomposition only applies to square matrices, SVD applies to any $m \times n$ matrix:

$$A = U \Sigma V^T$$

- $U$: $m \times m$ orthogonal matrix (left singular vectors)
- $\Sigma$: $m \times n$ diagonal matrix (singular values, non-negative)
- $V$: $n \times n$ orthogonal matrix (right singular vectors)
SVD underpins matrix approximation, noise reduction, recommendation systems, LSA, and the LoRA technique for parameter-efficient fine-tuning.
```python
import numpy as np

# Eigenvalues and eigenvectors
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):\n", eigenvectors)

# Verify: A*v = lambda*v
v0 = eigenvectors[:, 0]
lam0 = eigenvalues[0]
print("A*v:", A @ v0)
print("lambda*v:", lam0 * v0)

# SVD
M = np.array([[1, 2, 3],
              [4, 5, 6]])
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"\nU shape: {U.shape}")
print(f"Singular values: {S}")
print(f"Vt shape: {Vt.shape}")

# Rank-1 approximation
rank1_approx = np.outer(S[0] * U[:, 0], Vt[0, :])
print("Original matrix:\n", M)
print("Rank-1 approximation:\n", rank1_approx.round(2))

# PCA example
np.random.seed(42)
data = np.random.randn(100, 2)
data[:, 1] = data[:, 0] * 0.8 + data[:, 1] * 0.2
cov = np.cov(data.T)
eigenvals, eigenvecs = np.linalg.eig(cov)
idx = np.argsort(eigenvals)[::-1]
print("\nFirst principal component:", eigenvecs[:, idx[0]])
print("Explained variance ratio:", eigenvals[idx[0]] / eigenvals.sum())
```
1.4 Linear Algebra in Deep Learning
Layers as Linear Transformations
A fully connected layer is fundamentally a linear transformation:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where $\mathbf{x} \in \mathbb{R}^{n}$, $W \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^{m}$, $\mathbf{y} \in \mathbb{R}^{m}$.
Without non-linear activation functions, stacking multiple layers reduces to a single linear transformation — which is why activations like ReLU and Sigmoid are essential.
Batch Processing
For a batch of $B$ samples processed simultaneously:

$$Y = XW^T + \mathbf{b}$$

where $X \in \mathbb{R}^{B \times n}$ and $Y \in \mathbb{R}^{B \times m}$. GPUs are optimized for exactly this type of large-scale matrix multiplication.
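Both facts from this subsection, the batch equation and the collapse of stacked linear layers, can be checked directly in NumPy (the dimensions below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, m = 32, 8, 4  # batch size, input dim, output dim

X = rng.standard_normal((B, n))  # batch of inputs, one sample per row
W = rng.standard_normal((m, n))  # weight matrix
b = rng.standard_normal(m)       # bias

# Whole-batch forward pass: Y = X W^T + b, bias broadcast over rows
Y = X @ W.T + b
print(Y.shape)  # (32, 4)

# Two stacked linear layers (no activation, no bias) collapse into one:
W2 = rng.standard_normal((3, m))
Y2 = (X @ W.T) @ W2.T  # layer 1 then layer 2
W_combined = W2 @ W    # a single equivalent weight matrix
print(np.allclose(Y2, X @ W_combined.T))  # True
```

This is exactly why non-linear activations are needed: without them, depth adds no expressive power.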
2. Calculus
Calculus governs how deep learning models learn. Gradient descent — the engine of learning — relies entirely on differentiation.
2.1 Differentiation
Limit Definition of Derivative
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

This is the instantaneous rate of change: the slope of the tangent line at point $x$.
Basic Differentiation Rules
- Power rule: $(x^n)' = nx^{n-1}$
- Sum rule: $(f + g)' = f' + g'$
- Product rule: $(fg)' = f'g + fg'$
- Quotient rule: $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$
- Chain rule: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$
Derivatives of common deep learning functions:
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$, with $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
Partial Derivatives
For a multi-variable function, differentiate with respect to one variable while treating all others as constants:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$

For $f(x, y) = x^2 + 2y^2$ (the same function used in the code below):

$$\frac{\partial f}{\partial x} = 2x, \qquad \frac{\partial f}{\partial y} = 4y$$
Gradient
The gradient collects all partial derivatives into a vector:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$
The gradient points in the direction of steepest ascent. To minimize a loss function, move in the opposite direction (gradient descent).
Jacobian Matrix
For a vector function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian collects all first-order partial derivatives:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
Hessian Matrix
The matrix of second-order partial derivatives for a scalar function:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
Eigenvalues of the Hessian indicate curvature: all positive means local minimum, all negative means local maximum, mixed signs indicate a saddle point.
```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Compute gradient numerically via central differences"""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# f(x, y) = x^2 + 2y^2
def f(x):
    return x[0]**2 + 2 * x[1]**2

x0 = np.array([3.0, 4.0])
grad = numerical_gradient(f, x0)
print(f"Numerical gradient: {grad}")  # [6.0, 16.0]
print(f"Analytic gradient: [{2*x0[0]}, {4*x0[1]}]")

# Automatic differentiation with PyTorch
try:
    import torch
    x = torch.tensor([3.0, 4.0], requires_grad=True)
    y = x[0]**2 + 2 * x[1]**2
    y.backward()
    print(f"PyTorch gradient: {x.grad}")
except ImportError:
    print("PyTorch not installed")
```
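The Hessian curvature test can be checked numerically too. The sketch below uses the same $f(x, y) = x^2 + 2y^2$ as above, with a `numerical_hessian` helper written here for illustration (second derivatives via central differences):

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate second-order partials via central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += h; xpp[j] += h
            xpm = x.copy(); xpm[i] += h; xpm[j] -= h
            xmp = x.copy(); xmp[i] -= h; xmp[j] += h
            xmm = x.copy(); xmm[i] -= h; xmm[j] -= h
            H[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h)
    return H

f = lambda x: x[0]**2 + 2 * x[1]**2
H = numerical_hessian(f, np.array([3.0, 4.0]))
print(H.round(3))                      # approx [[2, 0], [0, 4]]
print(np.linalg.eigvalsh(H).round(3))  # both eigenvalues positive -> local minimum
```

Since both eigenvalues are positive everywhere, the function is convex and its only critical point is a global minimum.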
2.2 Chain Rule and Backpropagation
Chain Rule
For composite functions, differentiation follows:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

Generalized to multiple variables:

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial u_j} \frac{\partial u_j}{\partial x_i}$$
Backpropagation
Training a neural network requires computing $\frac{\partial L}{\partial w}$ for every weight $w$. Since a network is a composition of many functions, we repeatedly apply the chain rule, from the output layer back to the input layer.
For a 2-layer network:

$$\mathbf{h} = \sigma(W_1 \mathbf{x} + \mathbf{b}_1), \qquad \mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2$$

Computing $\frac{\partial L}{\partial W_1}$ requires chaining gradients through each operation:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial W_1}$$
```python
import numpy as np

class SimpleNet:
    def __init__(self):
        self.W1 = np.random.randn(4, 3) * 0.1
        self.b1 = np.zeros(4)
        self.W2 = np.random.randn(2, 4) * 0.1
        self.b2 = np.zeros(2)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(float)

    def forward(self, x):
        self.x = x
        self.z1 = self.W1 @ x + self.b1
        self.h = self.relu(self.z1)
        self.z2 = self.W2 @ self.h + self.b2
        return self.z2

    def backward(self, dL_dz2):
        # Layer 2 gradients
        dL_dW2 = np.outer(dL_dz2, self.h)
        dL_db2 = dL_dz2
        dL_dh = self.W2.T @ dL_dz2
        # ReLU gradient
        dL_dz1 = dL_dh * self.relu_grad(self.z1)
        # Layer 1 gradients
        dL_dW1 = np.outer(dL_dz1, self.x)
        dL_db1 = dL_dz1
        return dL_dW1, dL_db1, dL_dW2, dL_db2

net = SimpleNet()
x = np.array([1.0, 2.0, 3.0])
output = net.forward(x)
dL_dout = np.array([1.0, -1.0])
grads = net.backward(dL_dout)
print(f"W1 gradient shape: {grads[0].shape}")
print(f"W2 gradient shape: {grads[2].shape}")
```
2.3 Optimization Theory
Optimality Conditions
- First-order (necessary): at an extremum, $\nabla f(x^*) = 0$
- Second-order (sufficient): the Hessian's eigenvalues determine min/max
Convex Functions
A function is convex if:

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \quad \text{for all } \lambda \in [0, 1]$$
For convex functions, any local minimum is a global minimum. Deep learning loss functions are generally non-convex, so global optimality cannot be guaranteed.
Gradient Descent
$$x_{t+1} = x_t - \eta \nabla f(x_t)$$

where $\eta$ is the learning rate. Too large and the updates diverge; too small and convergence is slow.
Common variants:
- SGD (Stochastic Gradient Descent): uses a random mini-batch each step
- Momentum: accumulates past gradient directions for acceleration
- Adam: adaptive learning rates per parameter
Adam update rule:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad x_{t+1} = x_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
```python
import numpy as np

def gradient_descent(grad_f, init_x, lr=0.1, steps=100):
    x = init_x.copy()
    history = [x.copy()]
    for _ in range(steps):
        grad = grad_f(x)
        x -= lr * grad
        history.append(x.copy())
    return x, np.array(history)

def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2*x[0], 8*x[1]])

init_x = np.array([3.0, 3.0])
final_x, history = gradient_descent(grad_f, init_x, lr=0.1, steps=50)
print(f"Start: {init_x}, Converged to: {final_x.round(6)}")

def adam_optimizer(grad_f, init_x, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    x = init_x.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

final_adam = adam_optimizer(grad_f, init_x)
print(f"Adam converged to: {final_adam.round(6)}")
```
3. Probability and Statistics
Probability and statistics provide the language for handling uncertainty in deep learning. Loss function design, regularization, and model evaluation all rely on probabilistic thinking.
3.1 Probability Basics
Kolmogorov Axioms
- $P(A) \ge 0$ (non-negativity)
- $P(\Omega) = 1$ (total probability equals 1)
- For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$ (additivity)
Conditional Probability
The probability of $A$ given that $B$ has occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Bayes' Theorem
Bayes' theorem formalizes "updating beliefs in the light of evidence":

$$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)}$$

- $P(H)$: Prior — belief before seeing evidence
- $P(E \mid H)$: Likelihood — probability of evidence under hypothesis
- $P(H \mid E)$: Posterior — updated belief after seeing evidence
This underlies Bayesian classifiers, Bayesian neural networks, and probabilistic model evaluation.
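As a quick numeric illustration of Bayes' theorem, consider the classic diagnostic-test calculation (the prevalence and accuracy numbers below are invented for the example):

```python
# Illustrative numbers: 1% prevalence, 95% sensitivity, 10% false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.10

# Total probability of a positive test (law of total probability)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: posterior probability of disease given a positive test
posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {posterior:.4f}")  # ~0.0876
```

Despite a 95%-sensitive test, the posterior is under 9% because the prior is so low, which is exactly the kind of intuition Bayes' theorem makes precise.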
3.2 Probability Distributions
Discrete Distributions
Bernoulli Distribution
Single trial with outcome 0 or 1:

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$

$E[X] = p$, $\text{Var}(X) = p(1 - p)$
Binomial Distribution
Number of successes in $n$ Bernoulli trials:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
Poisson Distribution
Event count in unit time or space:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Continuous Distributions
Uniform Distribution
Equal probability density over $[a, b]$:

$$f(x) = \frac{1}{b - a}, \quad a \le x \le b$$
Normal (Gaussian) Distribution
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

Characterized by mean $\mu$ and standard deviation $\sigma$. By the Central Limit Theorem, many natural phenomena follow this distribution. Used extensively in weight initialization and noise modeling.
Multivariate Normal Distribution
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

where $\boldsymbol{\mu}$ is the mean vector and $\Sigma$ is the covariance matrix. Central to VAE latent space modeling.
```python
import numpy as np
from scipy import stats

# Normal distribution
mu, sigma = 0, 1
print(f"P(-1 < X < 1) = {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f}")  # 68.27%
print(f"P(-2 < X < 2) = {stats.norm.cdf(2) - stats.norm.cdf(-2):.4f}")  # 95.45%
print(f"P(-3 < X < 3) = {stats.norm.cdf(3) - stats.norm.cdf(-3):.4f}")  # 99.73%

# Multivariate normal sampling
mean = np.array([0, 0])
cov = np.array([[1, 0.7],
                [0.7, 1]])
samples = np.random.multivariate_normal(mean, cov, size=1000)
print(f"\nSample shape: {samples.shape}")
print(f"Sample mean: {samples.mean(axis=0).round(3)}")
print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

# Binomial distribution
n, p = 10, 0.5
print(f"\nBinomial B(10, 0.5): P(X=5) = {stats.binom.pmf(5, n, p):.4f}")
```
3.3 Expectation and Variance
Expected Value

$$E[X] = \sum_x x \, P(X = x) \ \text{(discrete)}, \qquad E[X] = \int x \, f(x) \, dx \ \text{(continuous)}$$

Linearity of expectation:

$$E[aX + bY] = a\,E[X] + b\,E[Y]$$
Variance

$$\text{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$$

Standard deviation: $\sigma = \sqrt{\text{Var}(X)}$
Covariance
Measures the linear relationship between two random variables:

$$\text{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
Correlation

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

Range $[-1, 1]$; measures strength of linear relationship independent of scale.
Covariance Matrix
For an $n$-dimensional random vector $\mathbf{X} = (X_1, \ldots, X_n)$:

$$\Sigma_{ij} = \text{Cov}(X_i, X_j)$$
The covariance matrix encodes the variance structure of data and is the key input to PCA.
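These identities are easy to sanity-check with samples. In the sketch below the coefficients 0.6 and 0.8 are arbitrary choices that make the theoretical correlation exactly 0.6 (since $\text{Var}(Y) = 0.36 + 0.64 = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = 0.6 * x + 0.8 * rng.standard_normal(100_000)

# Linearity of expectation: E[2X + 3Y] = 2 E[X] + 3 E[Y]
print(np.mean(2 * x + 3 * y), 2 * x.mean() + 3 * y.mean())  # agree to rounding

# Var(X) = E[X^2] - E[X]^2
print(np.var(x), np.mean(x**2) - np.mean(x)**2)

# Covariance matrix and correlation (rows of `data` are variables)
data = np.stack([x, y])
print(np.cov(data).round(3))       # off-diagonal ~0.6
print(np.corrcoef(data).round(3))  # correlation ~0.6
```

With 100,000 samples the empirical values land within a few thousandths of the theoretical ones.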
3.4 Maximum Likelihood Estimation (MLE)
Likelihood Function
The probability of observing data $D = \{x_1, \ldots, x_n\}$ given parameters $\theta$:

$$L(\theta) = P(D \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

MLE finds the $\theta$ that maximizes this likelihood:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta L(\theta)$$
Log-Likelihood
Taking the log converts the product to a sum, simplifying computation:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
Since log is monotonically increasing, the maximizer of log-likelihood equals the maximizer of likelihood.
MLE for the Normal Distribution
Assuming data follows $\mathcal{N}(\mu, \sigma^2)$:

$$\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

Setting the derivative with respect to $\mu$ to zero: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample mean)
For $\sigma^2$: $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$
Cross-Entropy and MLE
Minimizing cross-entropy loss in classification is equivalent to MLE:

$$L_{\text{CE}} = -\sum_i y_i \log \hat{y}_i$$
This maximizes the log-likelihood of the model under the true data distribution.
```python
import numpy as np
from scipy.optimize import minimize_scalar

# MLE for Bernoulli distribution
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n = len(data)
k = data.sum()
p_mle = k / n
print(f"Analytic MLE: p = {p_mle:.2f}")

def neg_log_likelihood(p):
    if p <= 0 or p >= 1:
        return np.inf
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method='bounded')
print(f"Numerical MLE: p = {result.x:.4f}")

# Cross-entropy loss
def cross_entropy(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
print(f"\nCross-Entropy loss: {cross_entropy(y_true, y_pred):.4f}")
```
3.5 Information Theory
Entropy
Measures the uncertainty of a probability distribution $P$:

$$H(P) = -\sum_x p(x) \log p(x)$$
Maximum for uniform distributions, zero when a single outcome is certain.
KL Divergence
A measure of "distance" between distributions $P$ and $Q$ (asymmetric):

$$D_{\text{KL}}(P \,\|\, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

$D_{\text{KL}}(P \,\|\, Q) \ge 0$, with equality iff $P = Q$. Used in VAE loss functions, PPO policy optimization, and knowledge distillation.
Cross-Entropy
$$H(P, Q) = -\sum_x p(x) \log q(x) = H(P) + D_{\text{KL}}(P \,\|\, Q)$$

Since $H(P)$ is constant, minimizing cross-entropy is equivalent to minimizing KL divergence between the true and model distributions.
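The identity $H(P, Q) = H(P) + D_{\text{KL}}(P \,\|\, Q)$ can be verified numerically in a few lines (natural log throughout; the two distributions are illustrative):

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.35, 0.35, 0.15, 0.15])

H_p = -np.sum(p * np.log(p))   # entropy of P (nats)
H_pq = -np.sum(p * np.log(q))  # cross-entropy H(P, Q)
kl = np.sum(p * np.log(p / q)) # D_KL(P || Q)

print(f"H(P,Q)      = {H_pq:.6f}")
print(f"H(P) + D_KL = {H_p + kl:.6f}")  # identical, term by term
```

The two printed values agree to floating-point precision, confirming that cross-entropy exceeds the entropy of $P$ by exactly the KL divergence.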
Mutual Information
Information shared between two random variables:

$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = D_{\text{KL}}\big(P(X, Y) \,\|\, P(X)P(Y)\big)$$
Theoretical foundation for feature selection and representation learning.
```python
import numpy as np

def entropy(p):
    p = np.array(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q, eps=1e-15):
    p = np.array(p) + eps
    q = np.array(q) + eps
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log(p / q))

# Uniform distribution (maximum entropy)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform distribution entropy: {entropy(uniform):.4f} bits")  # 2.0

# Concentrated distribution (low entropy)
concentrated = [0.97, 0.01, 0.01, 0.01]
print(f"Concentrated distribution entropy: {entropy(concentrated):.4f} bits")

# KL Divergence
true_dist = [0.4, 0.3, 0.2, 0.1]
model_dist = [0.35, 0.35, 0.15, 0.15]
kl = kl_divergence(true_dist, model_dist)
print(f"\nKL Divergence D_KL(P||Q): {kl:.4f}")
```
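Mutual information is defined above but not exercised in the snippet. A small sketch from a joint distribution table, using a `mutual_information` helper written here for illustration (reported in bits, matching the `entropy` function):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_xy p(x,y) * log2( p(x,y) / (p(x) p(y)) )"""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of X (column vector)
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y (row vector)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask]))

# Independent variables: joint is the outer product of marginals -> MI ~ 0
independent = np.outer([0.5, 0.5], [0.7, 0.3])
print(f"MI (independent): {mutual_information(independent):.4f}")  # ~0

# Perfectly dependent: X determines Y -> MI = H(X) = 1 bit
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])
print(f"MI (dependent): {mutual_information(dependent):.4f}")  # 1.0
```

Independence gives zero shared information; perfect dependence gives the full entropy of $X$, which is why MI works as a feature-relevance score.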
4. Numerical Methods
Understanding the gap between mathematical theory and actual computer arithmetic is important for reliable deep learning implementations.
4.1 Floating-Point Representation
FP32 (Single Precision)
- 32 bits: 1 sign, 8 exponent, 23 mantissa
- Precision: ~7 significant decimal digits
- Range: approximately $\pm 3.4 \times 10^{38}$
FP16 (Half Precision)
- 16 bits: 1 sign, 5 exponent, 10 mantissa
- Saves GPU memory, faster computation (Mixed Precision Training)
- Risk of overflow/underflow with large or small values
BF16 (Brain Float 16)
- 16 bits: 1 sign, 8 exponent, 7 mantissa
- Same exponent range as FP32 (lower overflow risk)
- Introduced by Google for TPUs, now widely used for AI training
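NumPy has no native bfloat16 dtype, but the format is easy to sketch at the bit level, since bfloat16 is simply FP32 with the low 16 mantissa bits dropped. The `to_bf16` helper below is written here for illustration (it truncates rather than rounds, a simplification of real hardware conversion):

```python
import struct
import numpy as np

def to_bf16(x):
    """Truncate an FP32 value to bfloat16 precision (keep the top 16 bits)."""
    bits = struct.unpack('>I', struct.pack('>f', np.float32(x)))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

print(to_bf16(3.141592653589793))  # 3.140625 -- only ~3 decimal digits survive
print(to_bf16(70000.0))            # 69632.0  -- coarse, but finite (FP32 exponent range)
print(np.float16(70000.0))         # inf      -- FP16's narrow exponent overflows
```

This shows the BF16 trade-off directly: it gives up mantissa precision to keep FP32's exponent range, which is why it rarely overflows during training.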
```python
import numpy as np

# Floating-point precision issues
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 (float32) = {a + b}")  # may not be exactly 0.3

# FP16 overflow
print(f"FP16 large value: {np.float16(65000)}")
print(f"FP16 overflow: {np.float16(70000)}")  # inf

# Compare dtype ranges
for dtype in [np.float16, np.float32, np.float64]:
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: max={info.max:.3e}, min_positive={info.tiny:.3e}")
```
4.2 Numerical Stability
The Instability of Naive Softmax
The naive formulation

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

overflows for large inputs because $e^{x_i}$ exceeds the floating-point range (inf). The fix: subtract the maximum value before exponentiation:

$$\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

This is mathematically equivalent (the common factor $e^{-\max(x)}$ cancels) but numerically stable.
```python
import numpy as np

def naive_softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

def stable_softmax(x):
    x_shifted = x - np.max(x)
    return np.exp(x_shifted) / np.sum(np.exp(x_shifted))

# Normal input
x_normal = np.array([1.0, 2.0, 3.0])
print("Normal input:")
print(f"  Naive:  {naive_softmax(x_normal)}")
print(f"  Stable: {stable_softmax(x_normal)}")

# Large input
x_large = np.array([1000.0, 2000.0, 3000.0])
print("\nLarge input:")
naive_result = naive_softmax(x_large)
print(f"  Naive:  {naive_result}")  # [nan, nan, nan] due to overflow
print(f"  Stable: {stable_softmax(x_large)}")  # [0, 0, 1]

# Log-Softmax (used with NLLLoss)
def log_softmax(x):
    x_shifted = x - np.max(x)
    return x_shifted - np.log(np.sum(np.exp(x_shifted)))

logits = np.array([2.0, 1.0, 0.1])
print(f"\nLog-Softmax: {log_softmax(logits)}")
```
5. Mathematical Analysis of Linear Regression
Linear regression is the simplest building block of deep learning. A thorough mathematical understanding creates a foundation for understanding more complex models.
5.1 Ordinary Least Squares
For data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, we fit a linear model $\hat{y} = \mathbf{x}^T \mathbf{w}$.
Loss function (MSE):

$$L(\mathbf{w}) = \frac{1}{n} \| X\mathbf{w} - \mathbf{y} \|_2^2$$
Normal Equation
Differentiating with respect to $\mathbf{w}$ and setting to zero:

$$\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}$$

This closed-form solution requires $X^T X$ to be invertible (features must be linearly independent).
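A minimal sketch of the normal equation on synthetic data (using `np.linalg.solve` rather than explicitly forming the inverse, which is the numerically preferred route):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(50)  # small Gaussian noise

# Normal equation: solve (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat.round(2))  # close to [2.0, -1.0, 0.5]

# Agrees with NumPy's least-squares routine
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_hat, w_lstsq))  # True
```

With mild noise and well-conditioned features, the recovered coefficients sit within a few hundredths of the true ones.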
5.2 Ridge and Lasso Regression
Ridge Regression (L2 regularization)

$$L(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|_2^2 + \lambda \| \mathbf{w} \|_2^2$$

Normal equation:

$$\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$$

Adding $\lambda I$ guarantees invertibility and shrinks weights toward zero, preventing overfitting.
Lasso Regression (L1 regularization)

$$L(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|_2^2 + \lambda \| \mathbf{w} \|_1$$

L1 regularization induces sparse solutions — some weights become exactly zero — providing automatic feature selection.
```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.0, 0.0])
y = X @ true_w + np.random.randn(n) * 0.5

# Normal equation (OLS)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS coefficients:", w_ols.round(2))

# Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_.round(2))

# Lasso (sparse solution)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(2))
print("Number of zero coefficients:", (np.abs(lasso.coef_) < 0.01).sum())

# Ridge via normal equation
def ridge_normal_equation(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ridge = ridge_normal_equation(X, y, lam=1.0)
print("\nRidge (normal equation):", w_ridge.round(2))
```
6. Integration: The Mathematics of Deep Learning
6.1 A Mathematical View of Neural Networks
A fully connected neural network with $L$ layers:

$$\mathbf{h}^{(l)} = \sigma\left( W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)} \right), \quad l = 1, \ldots, L$$
- Forward pass: linear algebra (matrix multiplication) + activation functions
- Loss computation: information theory (cross-entropy) or statistics (MSE)
- Backward pass: calculus (chain rule) to compute gradients
- Parameter update: optimization theory (gradient descent, Adam)
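All four ingredients fit in a short script. The sketch below trains a tiny logistic-regression classifier on made-up separable data (an illustrative toy, not from any particular framework): linear forward pass, cross-entropy loss, a chain-rule gradient, and plain gradient-descent updates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # label: which side of the line x1 + x2 = 0

w = np.zeros(2)
b = 0.0
lr = 0.5

for _ in range(200):
    # Forward pass: linear algebra + sigmoid activation
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    # Loss: cross-entropy (information theory / MLE)
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    # Backward pass: chain rule gives dL/dz = p - y for sigmoid + cross-entropy
    dz = (p - y) / len(y)
    dw = X.T @ dz
    db = dz.sum()
    # Parameter update: gradient descent
    w -= lr * dw
    b -= lr * db

# Evaluate with the final parameters
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = ((p > 0.5) == (y > 0.5)).mean()
print(f"final loss: {loss:.4f}, accuracy: {acc:.2%}")
```

The convenient cancellation $\frac{\partial L}{\partial z} = p - y$ is the sigmoid + cross-entropy pairing at work, the same simplification backpropagation exploits at the output layer of a classifier.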
6.2 The Attention Mechanism and Matrix Operations
The Transformer's scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

- $Q, K, V$: Query, Key, Value matrices (linear algebra)
- $\frac{1}{\sqrt{d_k}}$ scaling: numerical stability (numerical analysis)
- Softmax: converts scores to probability distribution (probability theory)
- Dot product $QK^T$: similarity measurement (vector dot product)
```python
import numpy as np

def stable_softmax_2d(x):
    x_max = x.max(axis=-1, keepdims=True)
    x_shifted = x - x_max
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    weights = stable_softmax_2d(scores)
    output = weights @ V
    return output, weights

# 3 tokens, d_k=4
seq_len, d_k = 3, 4
np.random.seed(42)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention output shape: {output.shape}")
print(f"Attention weights (rows sum to 1):\n{weights.round(3)}")
print(f"Row sums: {weights.sum(axis=-1)}")
```
Conclusion: Next Steps
This guide covered the core mathematical foundations of AI/ML:
- Linear Algebra: the language for representing and transforming data and model parameters
- Calculus: the tool explaining how models learn
- Probability & Statistics: handles uncertainty and provides the theoretical basis for loss functions
- Numerical Methods: knowledge for stable computer implementations of mathematical theory
Recommended next steps:
- Gilbert Strang's Linear Algebra — free on MIT OCW
- Pattern Recognition and Machine Learning (Bishop) — probabilistic perspective on ML
- Deep Learning (Goodfellow et al.) — the theoretical bible of deep learning
- Practice: implement a neural network from scratch using only NumPy
Mathematics is a tool for understanding. You do not need to memorize every formula, but grasping why each concept exists and how it operates within deep learning gives you significantly more powerful practical skills.
References
- Gilbert Strang, Introduction to Linear Algebra, 6th Edition
- Ian Goodfellow et al., Deep Learning, MIT Press (2016)
- Christopher Bishop, Pattern Recognition and Machine Learning, Springer (2006)
- 3Blue1Brown — Essence of Linear Algebra, Essence of Calculus (YouTube)
- fast.ai — Practical Deep Learning for Coders