
Probabilistic Machine Learning & Bayesian Methods: From Bayesian Inference to Gaussian Processes and Uncertainty Quantification


Table of Contents

  1. Probability Theory Foundations and Bayes Theorem
  2. Bayesian Inference: Prior and Posterior Distributions
  3. MCMC: Markov Chain Monte Carlo
  4. Variational Inference and VAE
  5. Gaussian Processes
  6. Bayesian Deep Learning and Uncertainty Quantification
  7. Practical Tools: PyMC, Stan, Pyro
  8. Quiz

Probability Theory Foundations and Bayes Theorem

Bayes Theorem

Bayes' theorem is the cornerstone of probabilistic machine learning. Given a hypothesis A and evidence B:

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

In the machine learning context, with parameters \theta and data D:

P(\theta|D) = \frac{P(D|\theta) \cdot P(\theta)}{P(D)}

  • P(\theta): Prior distribution — beliefs about parameters before observing data
  • P(D|\theta): Likelihood — probability of generating the data given parameter values
  • P(\theta|D): Posterior distribution — updated beliefs after observing data
  • P(D): Marginal likelihood (evidence) — normalization constant
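
The mechanics are easy to verify numerically. Below is a toy diagnostic-test calculation, with made-up probabilities: A is "has the condition", B is "test is positive".

```python
p_a = 0.01              # prior P(A): condition is rare
p_b_given_a = 0.95      # likelihood P(B|A): test sensitivity
p_b_given_not_a = 0.05  # false-positive rate P(B|not A)

# Marginal likelihood P(B) by the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # ~0.161: even after a positive test, A is unlikely
```

Despite the accurate test, the low prior dominates — a classic illustration of why the prior term matters.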

Key Probability Distributions

Gaussian Distribution

\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The foundation for modeling continuous data. Widely used in practice due to the Central Limit Theorem.

Bernoulli Distribution

\text{Bern}(x | p) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

Base distribution for binary classification problems. Used as a likelihood function in Bayesian binary classification.

Dirichlet Distribution

\text{Dir}(\mathbf{p} | \boldsymbol{\alpha}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k p_k^{\alpha_k - 1}

The conjugate prior for the multinomial distribution. Used extensively in topic modeling (LDA).

import numpy as np
import scipy.stats as stats

# Key probability distributions
x = np.linspace(-4, 4, 200)

# Gaussian distribution (density evaluated on the grid above)
gaussian = stats.norm(loc=0, scale=1)
pdf = gaussian.pdf(x)
print(f"Gaussian mean: {gaussian.mean():.2f}, variance: {gaussian.var():.2f}")

# Bernoulli distribution
p = 0.7
bern_samples = stats.bernoulli(p).rvs(1000)
print(f"Bernoulli success rate: {bern_samples.mean():.3f} (expected: {p})")

# Dirichlet distribution sampling
alpha = np.array([2.0, 3.0, 5.0])
dirichlet_samples = stats.dirichlet(alpha).rvs(1000)
print(f"Dirichlet sample mean: {dirichlet_samples.mean(axis=0)}")
print(f"Theoretical mean: {alpha / alpha.sum()}")

Bayesian Inference: Prior and Posterior Distributions

MAP vs MLE

Maximum Likelihood Estimation (MLE)

\hat{\theta}_{MLE} = \arg\max_\theta P(D|\theta) = \arg\max_\theta \log P(D|\theta)

Maximum A Posteriori (MAP)

\hat{\theta}_{MAP} = \arg\max_\theta P(\theta|D) = \arg\max_\theta \left[\log P(D|\theta) + \log P(\theta)\right]

MAP is equivalent to MLE plus the log of the prior distribution. Using a Gaussian prior P(\theta) = \mathcal{N}(0, \tau^2):

\log P(\theta) = -\frac{\theta^2}{2\tau^2} + \text{const}

This is mathematically identical to L2 regularization (Ridge Regression). A Laplace prior corresponds to L1 regularization (Lasso).
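
The correspondence can be checked numerically. The sketch below (synthetic data, arbitrary hyperparameters) solves the MAP normal equations for a Gaussian prior directly and compares the result against scikit-learn's Ridge with alpha = sigma^2 / tau^2:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=50)

sigma2, tau2 = 0.3**2, 1.0**2   # noise variance, prior variance
lam = sigma2 / tau2             # equivalent ridge strength

# MAP estimate under a N(0, tau^2) prior: (X^T X + lam I)^-1 X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Ridge regression recovers the same coefficients
w_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_map, w_ridge))  # True
```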

Conjugate Priors

When the posterior distribution has the same functional form as the prior, the prior is said to be conjugate to the likelihood.

| Likelihood | Conjugate Prior | Posterior |
| --- | --- | --- |
| Bernoulli/Binomial | Beta | Beta |
| Gaussian | Gaussian | Gaussian |
| Poisson | Gamma | Gamma |
| Multinomial | Dirichlet | Dirichlet |

Beta-Bernoulli conjugate example:

Prior: P(p) = \text{Beta}(\alpha, \beta); Likelihood for N coin flips: P(D|p) = p^{n_H} (1-p)^{n_T}

Posterior: P(p|D) = \text{Beta}(\alpha + n_H, \beta + n_T)

where n_H is the number of heads and n_T is the number of tails.
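
A minimal sketch of this update with SciPy (a simulated coin with true p = 0.7 and a Beta(2, 2) prior; the numbers are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_p = 0.7
flips = rng.random(100) < true_p   # 100 simulated coin flips
n_heads = int(flips.sum())
n_tails = 100 - n_heads

# Conjugacy: the posterior is Beta with the observed counts simply added in
alpha0, beta0 = 2.0, 2.0
posterior = stats.beta(alpha0 + n_heads, beta0 + n_tails)

print(f"Observed heads: {n_heads}/100")
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: ({posterior.ppf(0.025):.3f}, {posterior.ppf(0.975):.3f})")
```

No sampling or optimization is needed: the closed-form update is the entire inference.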

import pymc as pm
import numpy as np
import arviz as az

# Bayesian linear regression with PyMC
np.random.seed(42)
X = np.random.randn(100)
true_alpha, true_beta, true_sigma = 1.5, 2.3, 0.5
y = true_alpha + true_beta * X + np.random.randn(100) * true_sigma

with pm.Model() as linear_model:
    # Prior distributions
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # Likelihood function
    mu = alpha + beta * X
    likelihood = pm.Normal('y', mu=mu, sigma=sigma, observed=y)

    # Sample posterior with NUTS sampler
    trace = pm.sample(2000, tune=1000, return_inferencedata=True, random_seed=42)

# Summarize results
summary = az.summary(trace, var_names=['alpha', 'beta', 'sigma'])
print(summary)
# Should converge near alpha ~ 1.5, beta ~ 2.3, sigma ~ 0.5

MCMC: Markov Chain Monte Carlo

MCMC is a method for drawing samples from posterior distributions that are difficult to compute analytically.

Metropolis-Hastings Algorithm

  1. From the current state \theta^{(t)}, generate a candidate \theta' from the proposal distribution q(\theta'|\theta^{(t)})
  2. Compute acceptance probability:

\alpha = \min\!\left(1, \frac{P(\theta'|D) \cdot q(\theta^{(t)}|\theta')}{P(\theta^{(t)}|D) \cdot q(\theta'|\theta^{(t)})}\right)

  3. Sample u \sim U(0,1); if u < \alpha, set \theta^{(t+1)} = \theta'; otherwise set \theta^{(t+1)} = \theta^{(t)}

import numpy as np

def metropolis_hastings(log_posterior, initial, n_samples=10000, proposal_std=0.5):
    """Metropolis-Hastings MCMC implementation"""
    samples = np.zeros((n_samples, len(initial)))
    current = np.array(initial, dtype=float)
    accepted = 0

    for i in range(n_samples):
        # Sample candidate from proposal distribution
        proposal = current + np.random.randn(len(current)) * proposal_std

        # Acceptance probability in log space; the symmetric Gaussian
        # proposal makes the q terms cancel
        log_ratio = log_posterior(proposal) - log_posterior(current)
        accept_prob = np.exp(min(0.0, log_ratio))  # avoids overflow for large ratios

        # Accept/reject decision
        if np.random.rand() < accept_prob:
            current = proposal
            accepted += 1

        samples[i] = current

    print(f"Acceptance rate: {accepted / n_samples:.3f}")
    return samples

# Sample from a Gaussian mixture distribution
def log_posterior_mixture(theta):
    log_p1 = -0.5 * ((theta[0] - 2)**2 + (theta[1] - 2)**2)
    log_p2 = -0.5 * ((theta[0] + 2)**2 + (theta[1] + 2)**2)
    # log(0.5*exp(a) + 0.5*exp(b)), computed stably
    return np.log(0.5) + np.logaddexp(log_p1, log_p2)

samples = metropolis_hastings(log_posterior_mixture, [0.0, 0.0], n_samples=50000)

# Analyze after removing burn-in
burn_in = 5000
posterior_samples = samples[burn_in:]
print(f"Posterior mean: {posterior_samples.mean(axis=0)}")
print(f"Posterior std: {posterior_samples.std(axis=0)}")

Gibbs Sampling

Sequentially sample each parameter conditioned on all others:

\theta_i^{(t+1)} \sim P(\theta_i | \theta_{-i}^{(t)}, D)

Particularly efficient when conditional distributions can be sampled easily.
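
A minimal example where both full conditionals are available in closed form: a zero-mean, unit-variance bivariate Gaussian with correlation rho, for which x_1 | x_2 ~ N(rho * x_2, 1 - rho^2) and symmetrically for x_2.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=20000, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian."""
    rng = np.random.default_rng(seed)
    samples = np.zeros((n_samples, 2))
    x1, x2 = 0.0, 0.0
    cond_std = np.sqrt(1.0 - rho**2)
    for i in range(n_samples):
        x1 = rng.normal(rho * x2, cond_std)   # exact draw from P(x1 | x2)
        x2 = rng.normal(rho * x1, cond_std)   # exact draw from P(x2 | x1)
        samples[i] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
print(f"Sample correlation: {np.corrcoef(samples.T)[0, 1]:.3f}")  # ~0.8
```

Every step is an exact draw, so there is no accept/reject; the trade-off is that high correlation between coordinates slows mixing.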

HMC (Hamiltonian Monte Carlo)

Leverages Hamiltonian mechanics from physics to efficiently explore high-dimensional spaces.

Introducing a momentum variable r alongside the position \theta, the Hamiltonian is:

H(\theta, r) = -\log P(\theta|D) + \frac{1}{2} r^T M^{-1} r

Trajectories are simulated with a leapfrog integrator, yielding distant proposals with low autocorrelation. PyMC's NUTS (No-U-Turn Sampler) is an adaptive variant of HMC that tunes the trajectory length automatically.
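
The integrator itself is short. Below is a sketch on a standard Gaussian target — only the trajectory simulation, not a complete HMC sampler with its accept/reject step:

```python
import numpy as np

def leapfrog(theta, r, grad_log_post, step_size, n_steps):
    """Leapfrog integration: half momentum step, alternating full steps, half momentum step.

    The scheme is volume-preserving and time-reversible, which is what
    keeps the HMC acceptance correction exact.
    """
    theta, r = theta.copy(), r.copy()
    r = r + 0.5 * step_size * grad_log_post(theta)
    for _ in range(n_steps - 1):
        theta = theta + step_size * r
        r = r + step_size * grad_log_post(theta)
    theta = theta + step_size * r
    r = r + 0.5 * step_size * grad_log_post(theta)
    return theta, r

# Standard Gaussian target: -log P(theta|D) = 0.5 * theta^2 (+ const)
grad_log_post = lambda th: -th
H = lambda th, r: 0.5 * th @ th + 0.5 * r @ r   # Hamiltonian with M = I

theta0, r0 = np.array([1.0]), np.array([0.5])
theta1, r1 = leapfrog(theta0, r0, grad_log_post, step_size=0.1, n_steps=50)
print(f"Energy drift: {abs(H(theta1, r1) - H(theta0, r0)):.2e}")  # small
```

The near-conservation of H over long trajectories is why HMC can accept distant proposals with high probability.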


Variational Inference and VAE

ELBO (Evidence Lower BOund)

When the posterior P(\theta|D) is intractable, we optimize an approximate distribution q(\theta).

Minimize the KL divergence:

\text{KL}[q(\theta) \| P(\theta|D)] = \mathbb{E}_{q}[\log q(\theta)] - \mathbb{E}_{q}[\log P(\theta|D)]

Rearranging:

\log P(D) = \text{ELBO}(q) + \text{KL}[q(\theta) \| P(\theta|D)]

\text{ELBO}(q) = \mathbb{E}_{q}[\log P(D|\theta)] - \text{KL}[q(\theta) \| P(\theta)]

Since the KL term is nonnegative, the ELBO is a lower bound on \log P(D); maximizing it over q both tightens the bound and minimizes \text{KL}[q(\theta) \| P(\theta|D)].
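
The decomposition can be checked directly on a model small enough to enumerate. Here theta takes just two values, so every expectation is an explicit sum (all numbers below are arbitrary):

```python
import numpy as np

prior = np.array([0.5, 0.5])            # P(theta)
lik = np.array([0.2, 0.6])              # P(D | theta) for a fixed dataset D
evidence = float((prior * lik).sum())   # P(D)
posterior = prior * lik / evidence      # P(theta | D)

q = np.array([0.3, 0.7])                # arbitrary approximate distribution

# ELBO = E_q[log P(D|theta)] - KL[q || prior]
elbo = float((q * (np.log(lik) - np.log(q / prior))).sum())
# KL[q || posterior]
kl = float((q * np.log(q / posterior)).sum())

print(f"log P(D)  = {np.log(evidence):.6f}")
print(f"ELBO + KL = {elbo + kl:.6f}")   # matches log P(D) exactly
```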

Mean-Field Approximation

Assumes parameters are independent:

q(\theta) = \prod_i q_i(\theta_i)

Optimal solution for each qiq_i:

\log q_i^*(\theta_i) = \mathbb{E}_{q_{-i}}[\log P(D, \theta)] + \text{const}
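
For the classic conjugate example (Gaussian data with unknown mean mu and precision tau, a Normal prior on mu and a Gamma prior on tau), both optimal factors have closed forms, giving a short coordinate-ascent (CAVI) loop. The updates below follow the standard textbook derivation; the data and hyperparameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=200)
N, x_bar = len(x), x.mean()

# Priors: mu ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# q(mu) = N(mu_N, 1/lam_N); the mean update does not depend on E[tau]
mu_N = (lam0 * mu0 + N * x_bar) / (lam0 + N)
a_N = a0 + (N + 1) / 2
E_tau = a0 / b0                      # initial guess for E[tau]

for _ in range(50):                  # coordinate ascent until convergence
    lam_N = (lam0 + N) * E_tau
    # E_{q(mu)} of the quadratic term appearing in the tau update
    sq = ((x - mu_N) ** 2).sum() + N / lam_N + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N)
    b_N = b0 + 0.5 * sq
    E_tau = a_N / b_N

print(f"q(mu) mean: {mu_N:.3f}  (true mean 2.0)")
print(f"E[tau]:     {E_tau:.3f}  (true precision 4.0)")
```

Each update holds the other factor fixed at its current expectations, exactly as the optimality condition above prescribes.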

VAE (Variational Autoencoder)

VAE applies variational inference to deep learning for generative modeling.

VAE ELBO:

\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}[q_\phi(z|x) \| p(z)]

  • First term: Reconstruction loss
  • Second term: Regularization term (KL divergence)

Reparameterization Trick:

Direct sampling from z \sim \mathcal{N}(\mu, \sigma^2) prevents backpropagation. Instead:

z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20, hidden_dim=400):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        """Reparameterization Trick: z = mu + eps * std"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        return recon, mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction loss (BCE)
    recon_loss = nn.functional.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL divergence: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_div

Gaussian Processes

Definition and Kernel Functions

A Gaussian Process (GP) is a distribution over functions: any finite collection of function values follows a multivariate Gaussian distribution.

f(x) \sim \mathcal{GP}(m(x), k(x, x'))

  • m(x): Mean function (often set to 0)
  • k(x, x'): Kernel function (covariance function)

Key Kernel Functions:

RBF (Squared Exponential) kernel: k_{SE}(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)

Matern kernel (\nu = 5/2): k_{M52}(x, x') = \sigma_f^2\left(1 + \frac{\sqrt{5}\|x-x'\|}{\ell} + \frac{5\|x-x'\|^2}{3\ell^2}\right)\exp\!\left(-\frac{\sqrt{5}\|x-x'\|}{\ell}\right)

Periodic kernel: k_{per}(x, x') = \sigma_f^2 \exp\!\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right)
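
To make the kernels concrete, the sketch below builds covariance matrices for two of them by hand and draws function samples from the implied GP priors (1-D inputs, arbitrary hyperparameters):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, sigma_f=1.0):
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sqdist / lengthscale**2)

def periodic_kernel(x1, x2, period=1.0, lengthscale=1.0, sigma_f=1.0):
    dist = np.abs(x1[:, None] - x2[None, :])
    return sigma_f**2 * np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
jitter = 1e-6 * np.eye(len(x))   # small diagonal term for numerical stability

# f ~ N(0, K): multiply a standard normal vector by a Cholesky factor of K
for name, K in [("RBF", rbf_kernel(x, x)), ("Periodic", periodic_kernel(x, x))]:
    L = np.linalg.cholesky(K + jitter)
    f = L @ rng.normal(size=len(x))
    print(f"{name} prior sample: min {f.min():.2f}, max {f.max():.2f}")
```

Plotting such prior draws is the quickest way to see what assumptions a kernel encodes before any data is observed.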

GPR (Gaussian Process Regression)

Given observations (\mathbf{X}, \mathbf{y}), the predictive distribution at new inputs \mathbf{X}_* is:

\mathbf{f}_* | \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\bar{\mathbf{f}}_*, \text{cov}(\mathbf{f}_*))

\bar{\mathbf{f}}_* = K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1}\mathbf{y}

\text{cov}(\mathbf{f}_*) = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1}K(\mathbf{X}, \mathbf{X}_*)
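
The same predictive equations can be written out in plain NumPy, using an RBF kernel with fixed, hand-picked hyperparameters rather than learned ones:

```python
import numpy as np

def rbf(x1, x2, ell=0.3, sigma_f=1.0):
    return sigma_f**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.1, size=20)
X_star = np.linspace(0, 1, 100)
sigma_n = 0.1

K = rbf(X, X) + sigma_n**2 * np.eye(len(X))   # K(X, X) + sigma_n^2 I
K_s = rbf(X_star, X)                          # K(X_*, X)
K_ss = rbf(X_star, X_star)                    # K(X_*, X_*)

mean = K_s @ np.linalg.solve(K, y)                 # posterior mean
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)       # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print(f"Posterior mean near x=0.25: {mean[25]:.3f} (true value ~1.0)")
print(f"Max predictive std: {std.max():.3f}")
```

In practice a library implementation is preferable, both for the hyperparameter optimization and for numerically stable Cholesky-based solves.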

import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# Prepare data
torch.manual_seed(42)
train_x = torch.linspace(0, 1, 20)
train_y = torch.sin(train_x * 2 * 3.14159) + torch.randn(20) * 0.1

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Optimize hyperparameters (maximize MLL)
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(200):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()

# Predict
model.eval()
likelihood.eval()
test_x = torch.linspace(0, 1, 100)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()

print(f"Learned lengthscale: {model.covar_module.base_kernel.lengthscale.item():.4f}")
print(f"Learned noise: {likelihood.noise.item():.4f}")

Bayesian Deep Learning and Uncertainty Quantification

Types of Uncertainty

  • Epistemic Uncertainty: Model uncertainty. Decreases as more data is observed.
  • Aleatoric Uncertainty: Inherent noise in data. Cannot be reduced regardless of how much data is collected.

MC Dropout

Keeping Dropout active at test time approximates Bayesian inference: Gal & Ghahramani (2016) showed that training with Dropout is equivalent to approximate variational inference in a deep Gaussian process.

P(y^*|x^*, X, Y) \approx \frac{1}{T} \sum_{t=1}^{T} P(y^*|x^*, \hat{\omega}_t)

import torch
import torch.nn as nn
import numpy as np

class BayesianMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.1):
        super().__init__()
        self.dropout_p = dropout_p
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, n_samples=100):
    """Estimate prediction uncertainty using MC Dropout"""
    model.train()  # Keep dropout active
    predictions = []
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x)
            predictions.append(pred)

    predictions = torch.stack(predictions)  # (n_samples, batch, output)
    mean = predictions.mean(dim=0)
    # Uncertainty = variance of predictions
    uncertainty = predictions.var(dim=0)
    return mean, uncertainty

# Usage example
model = BayesianMLP(input_dim=10, hidden_dim=64, output_dim=1, dropout_p=0.1)
x_test = torch.randn(32, 10)
mean_pred, uncertainty = mc_dropout_predict(model, x_test, n_samples=200)
print(f"Prediction mean shape: {mean_pred.shape}, uncertainty: {uncertainty.shape}")
print(f"Mean uncertainty: {uncertainty.mean().item():.4f}")

Bayesian Neural Network (BNN)

Assign distributions to weights for full Bayesian treatment:

W \sim \mathcal{N}(0, \sigma_w^2 I)

Approximate posterior with variational inference:

q_\phi(W) = \mathcal{N}(\mu_\phi, \text{diag}(\sigma_\phi^2))

import torch
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

class BayesianLinear(PyroModule):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        # Define prior distributions on weights and biases
        self.weight = PyroSample(
            dist.Normal(0., 1.).expand([out_features, in_features]).to_event(2)
        )
        self.bias = PyroSample(
            dist.Normal(0., 10.).expand([out_features]).to_event(1)
        )

    def forward(self, x):
        return x @ self.weight.T + self.bias

class BayesianNN(PyroModule):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.layer1 = BayesianLinear(in_dim, hidden_dim)
        self.layer2 = BayesianLinear(hidden_dim, out_dim)

    def forward(self, x, y=None):
        x = torch.relu(self.layer1(x))
        mu = self.layer2(x).squeeze(-1)
        sigma = pyro.sample("sigma", dist.Uniform(0., 1.))
        with pyro.plate("data", x.shape[0]):
            obs = pyro.sample("obs", dist.Normal(mu, sigma), obs=y)
        return mu

Practical Tools: PyMC, Stan, Pyro

Bayesian Logistic Regression with NumPyro

import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def logistic_regression_model(X, y=None):
    """NumPyro Bayesian logistic regression model"""
    n_features = X.shape[1]

    # Prior distributions
    alpha = numpyro.sample("alpha", dist.Normal(0., 10.))
    beta = numpyro.sample("beta", dist.Normal(jnp.zeros(n_features), jnp.ones(n_features)))

    # Compute logits
    logits = alpha + X @ beta

    # Likelihood
    with numpyro.plate("obs", X.shape[0]):
        numpyro.sample("y", dist.Bernoulli(logits=logits), obs=y)

# Run inference with NUTS sampler
def run_mcmc(X_train, y_train, num_samples=2000, num_warmup=1000):
    kernel = NUTS(logistic_regression_model)
    mcmc = MCMC(kernel, num_warmup=num_warmup, num_samples=num_samples)
    rng_key = jax.random.PRNGKey(42)
    mcmc.run(rng_key, X_train, y_train)
    return mcmc.get_samples()

# Example usage (generate data)
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)
X_jax = jnp.array(X)
y_jax = jnp.array(y)

samples = run_mcmc(X_jax, y_jax)
print(f"Inferred alpha: {samples['alpha'].mean():.3f} +/- {samples['alpha'].std():.3f}")
print(f"Inferred beta: {samples['beta'].mean(axis=0)}")

Hierarchical Bayesian Model with PyMC

import pymc as pm
import numpy as np

# Hierarchical Bayesian model
# Model data from multiple groups simultaneously
n_groups = 5
n_per_group = 20
np.random.seed(42)

group_means = np.random.randn(n_groups) * 2
y_obs = np.concatenate([
    np.random.randn(n_per_group) + group_means[g]
    for g in range(n_groups)
])
group_idx = np.repeat(np.arange(n_groups), n_per_group)

with pm.Model() as hierarchical_model:
    # Hyperprior distributions
    mu_global = pm.Normal('mu_global', mu=0, sigma=10)
    sigma_global = pm.HalfNormal('sigma_global', sigma=5)

    # Group-level means (Partial Pooling)
    mu_group = pm.Normal('mu_group', mu=mu_global, sigma=sigma_global, shape=n_groups)

    # Observation likelihood
    sigma_obs = pm.HalfNormal('sigma_obs', sigma=2)
    y = pm.Normal('y', mu=mu_group[group_idx], sigma=sigma_obs, observed=y_obs)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True, random_seed=42)

import arviz as az
print(az.summary(trace, var_names=['mu_group', 'mu_global', 'sigma_global']))

Comparing Probabilistic Programming Libraries

| Library | Backend | Inference Methods | Best For |
| --- | --- | --- | --- |
| PyMC | PyTensor | NUTS, VI, SMC | Flexible Bayesian models |
| Stan | C++ | HMC/NUTS | Research-grade inference |
| Pyro | PyTorch | SVI, MCMC | Deep probabilistic models |
| NumPyro | JAX | NUTS, SVI | Fast GPU-accelerated inference |
| TensorFlow Probability | TensorFlow | HMC, VI | Production TF pipelines |

Quiz

Q1. What is the difference between MAP and MLE, and why is MAP equivalent to regularization?

Answer: MAP = MLE + log Prior

Explanation: MLE optimizes \arg\max_\theta P(D|\theta), while MAP optimizes \arg\max_\theta P(D|\theta)P(\theta). Taking the log: MAP = log-likelihood + log-prior. Using a Gaussian prior P(\theta) \propto \exp(-\lambda\|\theta\|^2), the log-prior equals -\lambda\|\theta\|^2, which is identical to the L2 regularization (Ridge) penalty. Using a Laplace prior gives -\lambda\|\theta\|_1, identical to L1 regularization (Lasso). In other words, regularization is the frequentist expression of Bayesian prior beliefs about parameters.

Q2. Why is a burn-in period necessary in MCMC?

Answer: To remove dependence on the initial state.

Explanation: The initial state of an MCMC chain is set arbitrarily and may be far from the posterior distribution. A Markov chain requires time to converge (mix) to the stationary distribution, which is the posterior. The burn-in period discards these early samples before convergence. Including burn-in samples would bias inference toward the initial values. In modern samplers like NUTS, the warm-up phase serves this role, also tuning algorithm parameters like step sizes and mass matrices.

Q3. Why is the reparameterization trick necessary in VAEs?

Answer: To create differentiable gradients for backpropagation.

Explanation: Direct sampling from z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)) creates a stochastic node that is not differentiable — we cannot compute \partial z / \partial \phi for backpropagation. The reparameterization trick rewrites z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon with \epsilon \sim \mathcal{N}(0, I), isolating the stochasticity in the input \epsilon. This creates a deterministic path from z back to \mu_\phi and \sigma_\phi, making backpropagation through the sampling operation possible.

Q4. How does the choice of kernel function affect predictions in Gaussian processes?

Answer: The kernel encodes prior beliefs about the function space.

Explanation: The kernel function k(x, x') defines similarity between inputs, reflecting structural assumptions about the function to be learned. The RBF kernel assumes infinitely differentiable smooth functions — a larger lengthscale \ell means distant points have higher correlation. The Matern kernel assumes finite differentiability, representing rougher functions. The periodic kernel assumes periodic patterns. Kernel hyperparameters are learned automatically by maximizing the marginal likelihood. A poor kernel choice leads to underfitting or overfitting. Kernels can also be composed: sum and product combinations encode multiple properties simultaneously.

Q5. What is the mathematical basis for MC Dropout being an approximation of Bayesian inference?

Answer: Dropout is mathematically equivalent to variational inference over a deep Gaussian process.

Explanation: Gal and Ghahramani (2016) proved that applying Dropout to a neural network of arbitrary depth and nonlinearities is mathematically equivalent to variational inference over a deep Gaussian process. Specifically, placing Bernoulli priors on each layer's weights and performing variational inference yields an objective identical to the Dropout training loss. Therefore, performing T forward passes with Dropout active at test time is a Monte Carlo integration over the posterior distribution. The Dropout rate is a hyperparameter that controls model uncertainty and requires tuning. The predictive variance across T passes represents epistemic uncertainty.