Probabilistic Machine Learning & Bayesian Methods: From Bayesian Inference to Gaussian Processes and Uncertainty Quantification

Table of Contents

  1. Probability Theory Foundations and Bayes' Theorem
  2. Bayesian Inference: Prior and Posterior Distributions
  3. MCMC: Markov Chain Monte Carlo
  4. Variational Inference and VAE
  5. Gaussian Processes
  6. Bayesian Deep Learning and Uncertainty Quantification
  7. Practical Tools: PyMC, Stan, Pyro
  8. Quiz

Probability Theory Foundations and Bayes' Theorem

Bayes' Theorem

Bayes' theorem is the cornerstone of probabilistic machine learning. For an event A and evidence B:

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

In the machine learning context, it is written in terms of parameters \theta and data D:

P(\theta|D) = \frac{P(D|\theta) \cdot P(\theta)}{P(D)}

  • P(\theta): prior distribution — beliefs about the parameters before observing data
  • P(D|\theta): likelihood — probability of generating the data given the parameter values
  • P(\theta|D): posterior distribution — updated beliefs after observing the data
  • P(D): marginal likelihood (evidence) — normalization constant

Key Probability Distributions

Gaussian Distribution

\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The foundation for modeling continuous data, used widely in practice thanks to the Central Limit Theorem.

Bernoulli Distribution

\text{Bern}(x | p) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

The basic distribution for binary classification problems; it serves as the likelihood in Bayesian binary classification.

Dirichlet Distribution

\text{Dir}(\mathbf{p} | \boldsymbol{\alpha}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k p_k^{\alpha_k - 1}

The conjugate prior for the multinomial distribution, used at the core of topic modeling (LDA).

import numpy as np
import scipy.stats as stats

# Sanity checks for the key probability distributions

# Gaussian distribution
gaussian = stats.norm(loc=0, scale=1)
print(f"Gaussian mean: {gaussian.mean():.2f}, variance: {gaussian.var():.2f}")

# Bernoulli distribution
p = 0.7
bern_samples = stats.bernoulli(p).rvs(1000)
print(f"Bernoulli success rate: {bern_samples.mean():.3f} (expected: {p})")

# Dirichlet distribution sampling
alpha = np.array([2.0, 3.0, 5.0])
dirichlet_samples = stats.dirichlet(alpha).rvs(1000)
print(f"Dirichlet sample mean: {dirichlet_samples.mean(axis=0)}")
print(f"Theoretical mean: {alpha / alpha.sum()}")

Bayesian Inference: Prior and Posterior Distributions

MAP vs MLE

Maximum Likelihood Estimation (MLE)

\hat{\theta}_{MLE} = \arg\max_\theta P(D|\theta) = \arg\max_\theta \log P(D|\theta)

Maximum A Posteriori (MAP) Estimation

\hat{\theta}_{MAP} = \arg\max_\theta P(\theta|D) = \arg\max_\theta \left[\log P(D|\theta) + \log P(\theta)\right]

MAP is equivalent to MLE plus the log of the prior. With a Gaussian prior P(\theta) = \mathcal{N}(0, \tau^2):

\log P(\theta) = -\frac{\theta^2}{2\tau^2} + \text{const}

This is mathematically identical to L2 regularization (Ridge regression); a Laplace prior corresponds to L1 regularization (Lasso).
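To make the MAP–Ridge correspondence concrete, here is a minimal numerical sketch (all data and constants below are made up for illustration): under a Gaussian likelihood and a Gaussian prior, the MAP estimate found by a generic optimizer coincides with the closed-form ridge solution with \lambda = \sigma^2/\tau^2.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma, tau = 0.3, 1.0                      # noise std and prior std (illustrative)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=50)

# Negative log posterior: Gaussian likelihood + Gaussian prior N(0, tau^2 I)
def neg_log_posterior(w):
    return ((y - X @ w) ** 2).sum() / (2 * sigma**2) + (w ** 2).sum() / (2 * tau**2)

w_map = minimize(neg_log_posterior, np.zeros(3)).x

# Closed-form ridge solution with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(w_map, w_ridge)   # the two estimates coincide
```

Setting the gradient of the negative log posterior to zero gives (X^T X + \frac{\sigma^2}{\tau^2} I)w = X^T y, which is exactly the ridge normal equation.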

Conjugate Priors

A prior is conjugate to a likelihood when the resulting posterior has the same functional form as the prior.

Likelihood          Conjugate Prior     Posterior
Bernoulli/Binomial  Beta                Beta
Gaussian            Gaussian            Gaussian
Poisson             Gamma               Gamma
Multinomial         Dirichlet           Dirichlet

Beta-Bernoulli conjugacy example:

Prior: P(p) = \text{Beta}(\alpha, \beta), likelihood: P(D|p) = p^{n_H}(1-p)^{n_T}

Posterior: P(p|D) = \text{Beta}(\alpha + n_H, \beta + n_T)

where n_H is the number of heads and n_T the number of tails.
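A quick sketch of this conjugate update on simulated coin flips (the Beta(2, 2) prior and the flip count are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
flips = rng.random(100) < 0.7            # 100 simulated flips, true p = 0.7
n_heads = int(flips.sum())
n_tails = 100 - n_heads

# Conjugate update: Beta(a0, b0) prior -> Beta(a0 + n_H, b0 + n_T) posterior
alpha0, beta0 = 2.0, 2.0                 # prior pseudo-counts
posterior = stats.beta(alpha0 + n_heads, beta0 + n_tails)
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

No sampling or optimization is needed; the posterior is available in closed form, which is the practical appeal of conjugacy.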

import pymc as pm
import numpy as np
import arviz as az

# Bayesian linear regression with PyMC
np.random.seed(42)
X = np.random.randn(100)
true_alpha, true_beta, true_sigma = 1.5, 2.3, 0.5
y = true_alpha + true_beta * X + np.random.randn(100) * true_sigma

with pm.Model() as linear_model:
    # Prior distributions
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # Likelihood
    mu = alpha + beta * X
    likelihood = pm.Normal('y', mu=mu, sigma=sigma, observed=y)

    # Sample the posterior with the NUTS sampler
    trace = pm.sample(2000, tune=1000, return_inferencedata=True, random_seed=42)

# Summarize results
summary = az.summary(trace, var_names=['alpha', 'beta', 'sigma'])
print(summary)
# Should converge near alpha ~ 1.5, beta ~ 2.3, sigma ~ 0.5

MCMC: Markov Chain Monte Carlo

MCMC is a family of methods for drawing samples from posterior distributions that are intractable to compute analytically.

Metropolis-Hastings Algorithm

  1. From the current state \theta^{(t)}, generate a candidate \theta' from the proposal distribution q(\theta'|\theta^{(t)})
  2. Compute the acceptance probability:

\alpha = \min\!\left(1, \frac{P(\theta'|D) \cdot q(\theta^{(t)}|\theta')}{P(\theta^{(t)}|D) \cdot q(\theta'|\theta^{(t)})}\right)

  3. Sample u \sim U(0,1); if u < \alpha, set \theta^{(t+1)} = \theta', otherwise \theta^{(t+1)} = \theta^{(t)}
import numpy as np

def metropolis_hastings(log_posterior, initial, n_samples=10000, proposal_std=0.5):
    """Metropolis-Hastings MCMC implementation"""
    samples = np.zeros((n_samples, len(initial)))
    current = np.array(initial, dtype=float)
    accepted = 0

    for i in range(n_samples):
        # Draw a candidate from the (symmetric) Gaussian proposal
        proposal = current + np.random.randn(len(current)) * proposal_std

        # Acceptance probability, computed in log space for stability
        log_ratio = log_posterior(proposal) - log_posterior(current)
        accept_prob = min(1.0, np.exp(log_ratio))

        # Accept/reject decision
        if np.random.rand() < accept_prob:
            current = proposal
            accepted += 1

        samples[i] = current

    print(f"Acceptance rate: {accepted / n_samples:.3f}")
    return samples

# Sample from a Gaussian mixture distribution
def log_posterior_mixture(theta):
    log_p1 = -0.5 * ((theta[0] - 2)**2 + (theta[1] - 2)**2)
    log_p2 = -0.5 * ((theta[0] + 2)**2 + (theta[1] + 2)**2)
    return np.log(0.5 * np.exp(log_p1) + 0.5 * np.exp(log_p2) + 1e-10)

samples = metropolis_hastings(log_posterior_mixture, [0.0, 0.0], n_samples=50000)

# Analyze after discarding burn-in
burn_in = 5000
posterior_samples = samples[burn_in:]
print(f"Posterior mean: {posterior_samples.mean(axis=0)}")
print(f"Posterior std: {posterior_samples.std(axis=0)}")

Gibbs Sampling

Each parameter is sampled in turn, conditioned on all the others:

\theta_i^{(t+1)} \sim P(\theta_i | \theta_{-i}^{(t)}, D)

This is particularly efficient when the conditional distributions are easy to sample from.
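As a minimal illustration, here is a Gibbs sampler for a bivariate standard Gaussian with correlation \rho, where both full conditionals are 1-D Gaussians (the target and \rho = 0.8 are illustrative choices):

```python
import numpy as np

# Target: bivariate standard normal with correlation rho.
# Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2.
rho = 0.8
rng = np.random.default_rng(0)
n_samples = 20000
samples = np.zeros((n_samples, 2))
x1, x2 = 0.0, 0.0
for t in range(n_samples):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # sample x1 given x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # sample x2 given new x1
    samples[t] = x1, x2

print(np.corrcoef(samples.T)[0, 1])   # empirical correlation, close to rho
```

No accept/reject step is needed: each conditional draw is always accepted, which is what makes Gibbs attractive when the conditionals are tractable.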

HMC (Hamiltonian Monte Carlo)

HMC borrows Hamiltonian mechanics from physics to explore high-dimensional spaces efficiently.

Augmenting the position \theta with a momentum r, the Hamiltonian is:

H(\theta, r) = -\log P(\theta|D) + \frac{1}{2} r^T M^{-1} r

Trajectories are simulated with a leapfrog integrator, producing nearly uncorrelated samples efficiently. PyMC's NUTS (No-U-Turn Sampler) is an automatically tuned variant of HMC.
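A bare-bones sketch of HMC with a leapfrog integrator, targeting a 1-D standard normal for simplicity (the step size, trajectory length, and identity mass matrix are illustrative choices, not tuned values):

```python
import numpy as np

def leapfrog(theta, r, grad_log_p, step_size, n_steps):
    """Leapfrog integrator: approximately simulates Hamiltonian dynamics."""
    r = r + 0.5 * step_size * grad_log_p(theta)    # initial half-step on momentum
    for _ in range(n_steps - 1):
        theta = theta + step_size * r              # full-step on position
        r = r + step_size * grad_log_p(theta)      # full-step on momentum
    theta = theta + step_size * r
    r = r + 0.5 * step_size * grad_log_p(theta)    # final half-step on momentum
    return theta, r

# Minimal HMC on a 1-D standard normal target: log p(theta) = -theta^2 / 2
log_p = lambda th: -0.5 * th**2
grad_log_p = lambda th: -th

rng = np.random.default_rng(1)
theta, chain = 0.0, []
for _ in range(5000):
    r0 = rng.normal()                              # resample momentum r ~ N(0, 1)
    theta_new, r_new = leapfrog(theta, r0, grad_log_p, step_size=0.2, n_steps=10)
    # Metropolis correction using H = -log p + r^2 / 2
    log_accept = (log_p(theta_new) - 0.5 * r_new**2) - (log_p(theta) - 0.5 * r0**2)
    if np.log(rng.random()) < log_accept:
        theta = theta_new
    chain.append(theta)
chain = np.array(chain)
print(chain.mean(), chain.std())
```

Because the leapfrog integrator nearly conserves the Hamiltonian, the acceptance rate stays close to 1 even for long trajectories, which is the key advantage over random-walk Metropolis.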


Variational Inference and VAE

ELBO (Evidence Lower BOund)

When the posterior P(\theta|D) is intractable, we optimize an approximate distribution q(\theta).

Minimize the KL divergence:

\text{KL}[q(\theta) \| P(\theta|D)] = \mathbb{E}_{q}[\log q(\theta)] - \mathbb{E}_{q}[\log P(\theta|D)]

Rearranging:

\log P(D) = \text{ELBO}(q) + \text{KL}[q(\theta) \| P(\theta|D)]

\text{ELBO}(q) = \mathbb{E}_{q}[\log P(D|\theta)] - \text{KL}[q(\theta) \| P(\theta)]

Since the KL term is nonnegative, maximizing the ELBO maximizes a lower bound on the log marginal likelihood.
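The identity \log P(D) = \text{ELBO}(q) + \text{KL}[q \| P(\theta|D)] can be checked numerically in the conjugate Beta-Bernoulli case: choosing q equal to the exact posterior drives the KL gap to zero, so the ELBO should match the analytic log marginal likelihood (the prior parameters and counts below are arbitrary illustrative values):

```python
from scipy.special import betaln, digamma

a0, b0 = 2.0, 3.0            # prior Beta(a0, b0)
nH, nT = 7, 3                # observed heads / tails
a, b = a0 + nH, b0 + nT      # exact posterior Beta(a, b) by conjugacy

# Log marginal likelihood: log B(a0 + nH, b0 + nT) - log B(a0, b0)
log_evidence = betaln(a, b) - betaln(a0, b0)

# ELBO = E_q[log P(D|p)] - KL[q || prior], with q set to the exact posterior
E_log_p = digamma(a) - digamma(a + b)        # E_q[log p]
E_log_1mp = digamma(b) - digamma(a + b)      # E_q[log(1 - p)]
expected_loglik = nH * E_log_p + nT * E_log_1mp
kl_q_prior = (betaln(a0, b0) - betaln(a, b)
              + (a - a0) * E_log_p + (b - b0) * E_log_1mp)
elbo = expected_loglik - kl_q_prior

print(log_evidence, elbo)   # identical up to floating-point error
```

With any other q, the computed ELBO would fall strictly below `log_evidence`, which is exactly why maximizing it pushes q toward the true posterior.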

Mean-Field Approximation

Assume the parameters are independent under q:

q(\theta) = \prod_i q_i(\theta_i)

The optimal solution for each factor q_i is:

\log q_i^*(\theta_i) = \mathbb{E}_{q_{-i}}[\log P(D, \theta)] + \text{const}

VAE (Variational Autoencoder)

A VAE is a generative model that applies variational inference to deep learning.

The VAE ELBO:

\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}[q_\phi(z|x) \| p(z)]

  • First term: reconstruction loss
  • Second term: regularization term (KL divergence)

Reparameterization Trick:

Sampling z \sim \mathcal{N}(\mu, \sigma^2) directly blocks backpropagation through the sampling node. Instead:

z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20, hidden_dim=400):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        """Reparameterization Trick: z = mu + eps * std"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        return recon, mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction loss (BCE)
    recon_loss = nn.functional.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL divergence: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_div

Gaussian Processes

Definition and Kernel Functions

A Gaussian process (GP) is a distribution over functions: f is a GP if the function values at any finite set of inputs follow a multivariate Gaussian distribution:

f(x) \sim \mathcal{GP}(m(x), k(x, x'))

  • m(x): mean function (usually set to 0)
  • k(x, x'): kernel (covariance) function

Key kernel functions:

RBF (squared exponential) kernel: k_{SE}(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)

Matérn kernel (\nu = 5/2): k_{M52}(x, x') = \sigma_f^2\left(1 + \frac{\sqrt{5}\|x-x'\|}{\ell} + \frac{5\|x-x'\|^2}{3\ell^2}\right)\exp\!\left(-\frac{\sqrt{5}\|x-x'\|}{\ell}\right)

Periodic kernel: k_{per}(x, x') = \sigma_f^2 \exp\!\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right)
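As a small sketch, the RBF kernel can be implemented directly and used to draw sample functions from the GP prior f \sim \mathcal{N}(0, K) (the lengthscale and input grid are illustrative; a small jitter is added to the diagonal for numerical stability):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, lengthscale=0.5):
    """Squared-exponential kernel: k(x, x') = sigma_f^2 exp(-|x - x'|^2 / (2 l^2))."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq_dist / (2 * lengthscale**2))

# Draw three functions from the GP prior f ~ N(0, K)
x = np.linspace(0, 5, 100)
K = rbf_kernel(x, x)
rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(
    np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3
)
print(f_samples.shape)   # (3, 100): three sampled functions on the grid
```

Swapping in a Matérn or periodic kernel here changes the character of the sampled functions (roughness, periodicity) without touching the rest of the machinery, which is the sense in which the kernel encodes the prior over functions.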

GPR (Gaussian Process Regression)

Given observations (\mathbf{X}, \mathbf{y}), the predictive distribution at new inputs \mathbf{X}_* is:

\mathbf{f}_* | \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\bar{\mathbf{f}}_*, \text{cov}(\mathbf{f}_*))

\bar{\mathbf{f}}_* = K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1}\mathbf{y}

\text{cov}(\mathbf{f}_*) = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1}K(\mathbf{X}, \mathbf{X}_*)

import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# Prepare training data
torch.manual_seed(42)
train_x = torch.linspace(0, 1, 20)
train_y = torch.sin(train_x * 2 * 3.14159) + torch.randn(20) * 0.1

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Optimize hyperparameters by maximizing the marginal log likelihood (MLL)
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(200):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()

# Predict
model.eval()
likelihood.eval()
test_x = torch.linspace(0, 1, 100)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()

print(f"Learned lengthscale: {model.covar_module.base_kernel.lengthscale.item():.4f}")
print(f"Learned noise: {likelihood.noise.item():.4f}")

Bayesian Deep Learning and Uncertainty Quantification

Types of Uncertainty

  • Epistemic uncertainty: uncertainty in the model itself; it shrinks as more data is observed.
  • Aleatoric uncertainty: noise inherent in the data; it does not shrink no matter how much data is collected.
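A toy numpy sketch of the distinction, using a bootstrap ensemble of linear fits as a stand-in for a Bayesian model (all data and constants are made up for illustration): the spread across the ensemble (epistemic) grows far from the training data, while the injected noise level (aleatoric) stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 30)
noise_std = 0.2                          # aleatoric: irreducible observation noise
y_train = 1.0 + 2.0 * x_train + rng.normal(0, noise_std, 30)

# Bootstrap "ensemble" of linear fits: each resample gives slightly
# different parameters, and their spread reflects model (epistemic) uncertainty.
coefs = []
for _ in range(200):
    idx = rng.integers(0, 30, 30)
    coefs.append(np.polyfit(x_train[idx], y_train[idx], 1))
coefs = np.array(coefs)                  # rows: [slope, intercept]

def epistemic_std(x):
    preds = coefs[:, 0] * x + coefs[:, 1]
    return preds.std()

# Inside the training range vs far outside it
print(epistemic_std(0.0), epistemic_std(5.0))   # much larger far from the data
```

More data would tighten the ensemble (shrinking the epistemic term) but could never remove the `noise_std` floor, which is exactly the epistemic/aleatoric split described above.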

MC Dropout

Keeping Dropout active at test time approximates Bayesian inference. Gal & Ghahramani (2016) showed that dropout training is mathematically equivalent to approximate variational inference in a deep Gaussian process.

P(y^*|x^*, X, Y) \approx \frac{1}{T} \sum_{t=1}^{T} P(y^*|x^*, \hat{\omega}_t)

import torch
import torch.nn as nn

class BayesianMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.1):
        super().__init__()
        self.dropout_p = dropout_p
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, n_samples=100):
    """Estimate predictive uncertainty with MC Dropout"""
    model.train()  # keep dropout active at inference time
    predictions = []
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x)
            predictions.append(pred)

    predictions = torch.stack(predictions)  # (n_samples, batch, output)
    mean = predictions.mean(dim=0)
    # Uncertainty = variance across stochastic forward passes
    uncertainty = predictions.var(dim=0)
    return mean, uncertainty

# Usage example
model = BayesianMLP(input_dim=10, hidden_dim=64, output_dim=1, dropout_p=0.1)
x_test = torch.randn(32, 10)
mean_pred, uncertainty = mc_dropout_predict(model, x_test, n_samples=200)
print(f"Prediction mean shape: {mean_pred.shape}, uncertainty shape: {uncertainty.shape}")
print(f"Average uncertainty: {uncertainty.mean().item():.4f}")

Bayesian Neural Networks (BNN)

Place distributions over the weights for a fully Bayesian treatment:

W \sim \mathcal{N}(0, \sigma_w^2 I)

Approximate the posterior with variational inference:

q_\phi(W) = \mathcal{N}(\mu_\phi, \text{diag}(\sigma_\phi^2))

import torch
import torch.nn as nn
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

class BayesianLinear(PyroModule):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        # Priors over the weights and bias
        self.weight = PyroSample(
            dist.Normal(0., 1.).expand([out_features, in_features]).to_event(2)
        )
        self.bias = PyroSample(
            dist.Normal(0., 10.).expand([out_features]).to_event(1)
        )

    def forward(self, x):
        return x @ self.weight.T + self.bias

class BayesianNN(PyroModule):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.layer1 = BayesianLinear(in_dim, hidden_dim)
        self.layer2 = BayesianLinear(hidden_dim, out_dim)

    def forward(self, x, y=None):
        x = torch.relu(self.layer1(x))
        mu = self.layer2(x).squeeze(-1)
        sigma = pyro.sample("sigma", dist.Uniform(0., 1.))
        with pyro.plate("data", x.shape[0]):
            obs = pyro.sample("obs", dist.Normal(mu, sigma), obs=y)
        return mu

Practical Tools: PyMC, Stan, Pyro

Bayesian Logistic Regression with NumPyro

import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def logistic_regression_model(X, y=None):
    """Bayesian logistic regression model in NumPyro"""
    n_features = X.shape[1]

    # Priors
    alpha = numpyro.sample("alpha", dist.Normal(0., 10.))
    beta = numpyro.sample("beta", dist.Normal(jnp.zeros(n_features), jnp.ones(n_features)))

    # Logits
    logits = alpha + X @ beta

    # Likelihood
    with numpyro.plate("obs", X.shape[0]):
        numpyro.sample("y", dist.Bernoulli(logits=logits), obs=y)

# Run inference with the NUTS sampler
def run_mcmc(X_train, y_train, num_samples=2000, num_warmup=1000):
    kernel = NUTS(logistic_regression_model)
    mcmc = MCMC(kernel, num_warmup=num_warmup, num_samples=num_samples)
    rng_key = jax.random.PRNGKey(42)
    mcmc.run(rng_key, X_train, y_train)
    return mcmc.get_samples()

# Usage example (generate data)
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)
X_jax = jnp.array(X)
y_jax = jnp.array(y)

samples = run_mcmc(X_jax, y_jax)
print(f"Inferred alpha: {samples['alpha'].mean():.3f} +/- {samples['alpha'].std():.3f}")
print(f"Inferred beta: {samples['beta'].mean(axis=0)}")

Stan-Style Hierarchical Model (Implemented with PyMC)

import pymc as pm
import numpy as np

# Hierarchical Bayesian model:
# model several groups of data jointly
n_groups = 5
n_per_group = 20
np.random.seed(42)

group_means = np.random.randn(n_groups) * 2
y_obs = np.concatenate([
    np.random.randn(n_per_group) + group_means[g]
    for g in range(n_groups)
])
group_idx = np.repeat(np.arange(n_groups), n_per_group)

with pm.Model() as hierarchical_model:
    # Hyperpriors
    mu_global = pm.Normal('mu_global', mu=0, sigma=10)
    sigma_global = pm.HalfNormal('sigma_global', sigma=5)

    # Group-level means (partial pooling)
    mu_group = pm.Normal('mu_group', mu=mu_global, sigma=sigma_global, shape=n_groups)

    # Observation likelihood
    sigma_obs = pm.HalfNormal('sigma_obs', sigma=2)
    y = pm.Normal('y', mu=mu_group[group_idx], sigma=sigma_obs, observed=y_obs)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True, random_seed=42)

import arviz as az
print(az.summary(trace, var_names=['mu_group', 'mu_global', 'sigma_global']))

Quiz

Q1. How do MAP and MLE differ, and why is MAP equivalent to regularization?

Answer: MAP = MLE + log prior.

Explanation: MLE optimizes \arg\max P(D|\theta), while MAP optimizes \arg\max P(D|\theta)P(\theta). Taking the log, MAP = log-likelihood + log-prior. With a Gaussian prior P(\theta) \propto \exp(-\lambda\|\theta\|^2), the log-prior becomes -\lambda\|\theta\|^2, the same penalty as L2 regularization (Ridge). A Laplace prior yields -\lambda\|\theta\|_1, matching L1 regularization (Lasso). In other words, regularization is a frequentist expression of a Bayesian prior belief about the parameters.

Q2. Why does MCMC need a burn-in period?

Answer: To remove dependence on the initial state.

Explanation: An MCMC chain starts from an arbitrary initial state that may lie far from the bulk of the posterior. The Markov chain needs time to converge (mix) to its stationary distribution, the posterior. The burn-in period discards the samples drawn before convergence; including them would bias the inference toward the initial state. In modern samplers such as NUTS, the warm-up phase plays this role.

Q3. Why does a VAE need the reparameterization trick?

Answer: To make gradients flow for backpropagation.

Explanation: Sampling the latent variable z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)) directly makes the sampling operation a non-differentiable stochastic node, so \partial z / \partial \phi cannot be computed during backpropagation. The reparameterization trick rewrites it as z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon with \epsilon \sim \mathcal{N}(0, I), isolating the randomness in the input \epsilon. This creates a deterministic path from z back to \mu_\phi and \sigma_\phi, making backpropagation possible.

Q4. How does the choice of kernel affect Gaussian process predictions?

Answer: The kernel encodes a prior belief over function space.

Explanation: The kernel k(x, x') defines similarity between inputs and reflects structural assumptions about the function being learned. The RBF kernel assumes infinitely differentiable, smooth functions; a larger lengthscale \ell makes distant points more correlated. Matérn kernels assume only finite differentiability and can represent rougher functions. The periodic kernel assumes periodic patterns. Kernel hyperparameters are learned automatically by maximizing the marginal likelihood, and a poorly chosen kernel leads to underfitting or overfitting.

Q5. What is the mathematical justification for MC Dropout as approximate Bayesian inference?

Answer: Dropout is mathematically equivalent to variational inference in a Gaussian process.

Explanation: Gal & Ghahramani (2016) proved that applying dropout to a neural network of arbitrary depth and nonlinearity is mathematically equivalent to variational inference in a deep Gaussian process. Concretely, placing a Bernoulli-based variational distribution over each layer's weights yields an objective that matches dropout training. Therefore, running T stochastic forward passes with dropout active at test time amounts to a Monte Carlo approximation of the posterior predictive integral. Note that the dropout rate acts as a hyperparameter controlling model uncertainty and needs tuning.

Probabilistic Machine Learning & Bayesian Methods: From Bayesian Inference to Gaussian Processes and Uncertainty Quantification

Table of Contents

  1. Probability Theory Foundations and Bayes Theorem
  2. Bayesian Inference: Prior and Posterior Distributions
  3. MCMC: Markov Chain Monte Carlo
  4. Variational Inference and VAE
  5. Gaussian Processes
  6. Bayesian Deep Learning and Uncertainty Quantification
  7. Practical Tools: PyMC, Stan, Pyro
  8. Quiz

Probability Theory Foundations and Bayes Theorem

Bayes Theorem

Bayes theorem is the cornerstone of probabilistic machine learning. Given events AA and evidence BB:

P(AB)=P(BA)P(A)P(B)P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

In the machine learning context, expressed with parameters θ\theta and data DD:

P(θD)=P(Dθ)P(θ)P(D)P(\theta|D) = \frac{P(D|\theta) \cdot P(\theta)}{P(D)}

  • P(θ)P(\theta): Prior distribution — beliefs about parameters before observing data
  • P(Dθ)P(D|\theta): Likelihood — probability of generating the data given parameter values
  • P(θD)P(\theta|D): Posterior distribution — updated beliefs after observing data
  • P(D)P(D): Marginal likelihood (Evidence) — normalization constant

Key Probability Distributions

Gaussian Distribution

N(xμ,σ2)=12πσ2exp ⁣((xμ)22σ2)\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The foundation for modeling continuous data. Widely used in practice due to the Central Limit Theorem.

Bernoulli Distribution

Bern(xp)=px(1p)1x,x{0,1}\text{Bern}(x | p) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

Base distribution for binary classification problems. Used as a likelihood function in Bayesian binary classification.

Dirichlet Distribution

Dir(pα)=Γ(kαk)kΓ(αk)kpkαk1\text{Dir}(\mathbf{p} | \boldsymbol{\alpha}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k p_k^{\alpha_k - 1}

The conjugate prior for the multinomial distribution. Used extensively in topic modeling (LDA).

import numpy as np
import scipy.stats as stats

# Key probability distributions
x = np.linspace(-4, 4, 200)

# Gaussian distribution
gaussian = stats.norm(loc=0, scale=1)
print(f"Gaussian mean: {gaussian.mean():.2f}, variance: {gaussian.var():.2f}")

# Bernoulli distribution
p = 0.7
bern_samples = stats.bernoulli(p).rvs(1000)
print(f"Bernoulli success rate: {bern_samples.mean():.3f} (expected: {p})")

# Dirichlet distribution sampling
alpha = np.array([2.0, 3.0, 5.0])
dirichlet_samples = stats.dirichlet(alpha).rvs(1000)
print(f"Dirichlet sample mean: {dirichlet_samples.mean(axis=0)}")
print(f"Theoretical mean: {alpha / alpha.sum()}")

Bayesian Inference: Prior and Posterior Distributions

MAP vs MLE

Maximum Likelihood Estimation (MLE)

θ^MLE=argmaxθP(Dθ)=argmaxθlogP(Dθ)\hat{\theta}_{MLE} = \arg\max_\theta P(D|\theta) = \arg\max_\theta \log P(D|\theta)

Maximum A Posteriori (MAP)

θ^MAP=argmaxθP(θD)=argmaxθ[logP(Dθ)+logP(θ)]\hat{\theta}_{MAP} = \arg\max_\theta P(\theta|D) = \arg\max_\theta \left[\log P(D|\theta) + \log P(\theta)\right]

MAP is equivalent to MLE plus the log of the prior distribution. Using a Gaussian prior P(θ)=N(0,τ2)P(\theta) = \mathcal{N}(0, \tau^2):

logP(θ)=θ22τ2+const\log P(\theta) = -\frac{\theta^2}{2\tau^2} + \text{const}

This is mathematically identical to L2 regularization (Ridge Regression). A Laplace prior corresponds to L1 regularization (Lasso).

Conjugate Priors

When the posterior distribution has the same functional form as the prior distribution, they are said to be conjugate.

LikelihoodConjugate PriorPosterior
Bernoulli/BinomialBetaBeta
GaussianGaussianGaussian
PoissonGammaGamma
MultinomialDirichletDirichlet

Beta-Bernoulli conjugate example:

Prior: P(p)=Beta(α,β)P(p) = \text{Beta}(\alpha, \beta), Likelihood: P(Dp)=Bern(p)NP(D|p) = \text{Bern}(p)^N

Posterior: P(pD)=Beta(α+nH,β+nT)P(p|D) = \text{Beta}(\alpha + n_H, \beta + n_T)

where nHn_H is the number of heads and nTn_T is the number of tails.

import pymc as pm
import numpy as np
import arviz as az

# Bayesian linear regression with PyMC
np.random.seed(42)
X = np.random.randn(100)
true_alpha, true_beta, true_sigma = 1.5, 2.3, 0.5
y = true_alpha + true_beta * X + np.random.randn(100) * true_sigma

with pm.Model() as linear_model:
    # Prior distributions
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # Likelihood function
    mu = alpha + beta * X
    likelihood = pm.Normal('y', mu=mu, sigma=sigma, observed=y)

    # Sample posterior with NUTS sampler
    trace = pm.sample(2000, tune=1000, return_inferencedata=True, random_seed=42)

# Summarize results
summary = az.summary(trace, var_names=['alpha', 'beta', 'sigma'])
print(summary)
# Should converge near alpha ~ 1.5, beta ~ 2.3, sigma ~ 0.5

MCMC: Markov Chain Monte Carlo

MCMC is a method for drawing samples from posterior distributions that are difficult to compute analytically.

Metropolis-Hastings Algorithm

  1. From current state θ(t)\theta^{(t)}, generate candidate θ\theta' from proposal distribution q(θθ(t))q(\theta'|\theta^{(t)})
  2. Compute acceptance probability:

α=min ⁣(1,P(θD)q(θ(t)θ)P(θ(t)D)q(θθ(t)))\alpha = \min\!\left(1, \frac{P(\theta'|D) \cdot q(\theta^{(t)}|\theta')}{P(\theta^{(t)}|D) \cdot q(\theta'|\theta^{(t)})}\right)

  1. Sample uU(0,1)u \sim U(0,1); if u<αu < \alpha set θ(t+1)=θ\theta^{(t+1)} = \theta', otherwise θ(t+1)=θ(t)\theta^{(t+1)} = \theta^{(t)}
import numpy as np

def metropolis_hastings(log_posterior, initial, n_samples=10000, proposal_std=0.5):
    """Metropolis-Hastings MCMC implementation"""
    samples = np.zeros((n_samples, len(initial)))
    current = np.array(initial, dtype=float)
    accepted = 0

    for i in range(n_samples):
        # Sample candidate from proposal distribution
        proposal = current + np.random.randn(len(current)) * proposal_std

        # Compute acceptance probability (in log space)
        log_ratio = log_posterior(proposal) - log_posterior(current)
        accept_prob = min(1.0, np.exp(log_ratio))

        # Accept/reject decision
        if np.random.rand() < accept_prob:
            current = proposal
            accepted += 1

        samples[i] = current

    print(f"Acceptance rate: {accepted / n_samples:.3f}")
    return samples

# Sample from a Gaussian mixture distribution
def log_posterior_mixture(theta):
    log_p1 = -0.5 * ((theta[0] - 2)**2 + (theta[1] - 2)**2)
    log_p2 = -0.5 * ((theta[0] + 2)**2 + (theta[1] + 2)**2)
    return np.log(0.5 * np.exp(log_p1) + 0.5 * np.exp(log_p2) + 1e-10)

samples = metropolis_hastings(log_posterior_mixture, [0.0, 0.0], n_samples=50000)

# Analyze after removing burn-in
burn_in = 5000
posterior_samples = samples[burn_in:]
print(f"Posterior mean: {posterior_samples.mean(axis=0)}")
print(f"Posterior std: {posterior_samples.std(axis=0)}")

Gibbs Sampling

Sequentially sample each parameter conditioned on all others:

θi(t+1)P(θiθi(t),D)\theta_i^{(t+1)} \sim P(\theta_i | \theta_{-i}^{(t)}, D)

Particularly efficient when conditional distributions can be sampled easily.

HMC (Hamiltonian Monte Carlo)

Leverages Hamiltonian mechanics from physics to efficiently explore high-dimensional spaces.

Introducing momentum rr alongside position θ\theta, the Hamiltonian is:

H(θ,r)=logP(θD)+12rTM1rH(\theta, r) = -\log P(\theta|D) + \frac{1}{2} r^T M^{-1} r

Trajectories are simulated using a leapfrog integrator to generate uncorrelated samples efficiently. PyMC's NUTS (No-U-Turn Sampler) is an automated version of HMC.


Variational Inference and VAE

ELBO (Evidence Lower BOund)

When the posterior P(θD)P(\theta|D) is intractable, we optimize an approximate distribution q(θ)q(\theta).

Minimize the KL divergence:

KL[q(θ)P(θD)]=Eq[logq(θ)]Eq[logP(θD)]\text{KL}[q(\theta) \| P(\theta|D)] = \mathbb{E}_{q}[\log q(\theta)] - \mathbb{E}_{q}[\log P(\theta|D)]

Rearranging:

logP(D)=ELBO(q)+KL[q(θ)P(θD)]\log P(D) = \text{ELBO}(q) + \text{KL}[q(\theta) \| P(\theta|D)]

ELBO(q)=Eq[logP(Dθ)]KL[q(θ)P(θ)]\text{ELBO}(q) = \mathbb{E}_{q}[\log P(D|\theta)] - \text{KL}[q(\theta) \| P(\theta)]

Maximizing the ELBO is equivalent to maximizing a lower bound on the marginal likelihood.

Mean-Field Approximation

Assumes parameters are independent:

q(θ)=iqi(θi)q(\theta) = \prod_i q_i(\theta_i)

Optimal solution for each qiq_i:

logqi(θi)=Eqi[logP(D,θ)]+const\log q_i^*(\theta_i) = \mathbb{E}_{q_{-i}}[\log P(D, \theta)] + \text{const}

VAE (Variational Autoencoder)

VAE applies variational inference to deep learning for generative modeling.

VAE ELBO:

L=Eqϕ(zx)[logpθ(xz)]KL[qϕ(zx)p(z)]\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}[q_\phi(z|x) \| p(z)]

  • First term: Reconstruction loss
  • Second term: Regularization term (KL divergence)

Reparameterization Trick:

Direct sampling from zN(μ,σ2)z \sim \mathcal{N}(\mu, \sigma^2) prevents backpropagation. Instead:

z=μ+σϵ,ϵN(0,I)z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20, hidden_dim=400):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        """Reparameterization Trick: z = mu + eps * std"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        return recon, mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction loss (BCE)
    recon_loss = nn.functional.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL divergence: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_div

Gaussian Processes

Definition and Kernel Functions

A Gaussian Process (GP) is a distribution over functions. A collection of function values indexed by any finite set follows a multivariate Gaussian distribution:

f(x)GP(m(x),k(x,x))f(x) \sim \mathcal{GP}(m(x), k(x, x'))

  • m(x)m(x): Mean function (often set to 0)
  • k(x,x)k(x, x'): Kernel function (covariance function)

Key Kernel Functions:

RBF (Squared Exponential) kernel: kSE(x,x)=σf2exp ⁣(xx222)k_{SE}(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)

Matern kernel (nu=5/2): kM52(x,x)=σf2(1+5xx+5xx232)exp ⁣(5xx)k_{M52}(x, x') = \sigma_f^2\left(1 + \frac{\sqrt{5}\|x-x'\|}{\ell} + \frac{5\|x-x'\|^2}{3\ell^2}\right)\exp\!\left(-\frac{\sqrt{5}\|x-x'\|}{\ell}\right)

Periodic kernel: kper(x,x)=σf2exp ⁣(2sin2(πxx/p)2)k_{per}(x, x') = \sigma_f^2 \exp\!\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right)

GPR (Gaussian Process Regression)

Given observations (X,y)(\mathbf{X}, \mathbf{y}), predictions for new inputs X\mathbf{X}_*:

fX,y,XN(fˉ,cov(f))\mathbf{f}_* | \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\bar{\mathbf{f}}_*, \text{cov}(\mathbf{f}_*))

fˉ=K(X,X)[K(X,X)+σn2I]1y\bar{\mathbf{f}}_* = K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1}\mathbf{y}

cov(f)=K(X,X)K(X,X)[K(X,X)+σn2I]1K(X,X)\text{cov}(\mathbf{f}_*) = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1}K(\mathbf{X}, \mathbf{X}_*)

import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# Prepare data
torch.manual_seed(42)
train_x = torch.linspace(0, 1, 20)
train_y = torch.sin(train_x * 2 * 3.14159) + torch.randn(20) * 0.1

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Optimize hyperparameters (maximize MLL)
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(200):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()

# Predict
model.eval()
likelihood.eval()
test_x = torch.linspace(0, 1, 100)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()

print(f"Learned lengthscale: {model.covar_module.base_kernel.lengthscale.item():.4f}")
print(f"Learned noise: {likelihood.noise.item():.4f}")

Bayesian Deep Learning and Uncertainty Quantification

Types of Uncertainty

  • Epistemic Uncertainty: Model uncertainty. Decreases as more data is observed.
  • Aleatoric Uncertainty: Inherent noise in data. Cannot be reduced regardless of how much data is collected.
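The distinction is easiest to see in a one-parameter Bayesian linear regression, where the posterior variance of the weight (epistemic) shrinks roughly as 1/n while the noise variance (aleatoric) stays fixed. A minimal sketch with illustrative numbers:

```python
import numpy as np

def weight_posterior_var(x, sigma_n, tau=1.0):
    """Posterior variance of w in y = w*x + noise, with prior w ~ N(0, tau^2).
    Posterior precision = 1/tau^2 + (sum of x_i^2) / sigma_n^2."""
    return 1.0 / (1.0 / tau**2 + (x @ x) / sigma_n**2)

rng = np.random.default_rng(0)
sigma_n = 0.5  # aleatoric noise std: a property of the data itself, never shrinks
for n in [10, 100, 1000]:
    x = rng.standard_normal(n)
    print(f"n={n:4d}: epistemic var = {weight_posterior_var(x, sigma_n):.5f}, "
          f"aleatoric var = {sigma_n**2:.3f}")
```

The epistemic column drops by roughly an order of magnitude per tenfold data increase; the aleatoric column is constant by construction.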

MC Dropout

Activating Dropout at test time approximates Bayesian inference: Gal & Ghahramani (2016) showed that training a network with Dropout is mathematically equivalent to approximate variational inference in a deep Gaussian process.

P(y^* \mid x^*, X, Y) \approx \frac{1}{T} \sum_{t=1}^{T} P(y^* \mid x^*, \hat{\omega}_t)

import torch
import torch.nn as nn
import numpy as np

class BayesianMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.1):
        super().__init__()
        self.dropout_p = dropout_p
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, n_samples=100):
    """Estimate prediction uncertainty using MC Dropout"""
    model.train()  # Keep dropout active
    predictions = []
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x)
            predictions.append(pred)

    predictions = torch.stack(predictions)  # (n_samples, batch, output)
    mean = predictions.mean(dim=0)
    # Uncertainty = variance of predictions
    uncertainty = predictions.var(dim=0)
    return mean, uncertainty

# Usage example
model = BayesianMLP(input_dim=10, hidden_dim=64, output_dim=1, dropout_p=0.1)
x_test = torch.randn(32, 10)
mean_pred, uncertainty = mc_dropout_predict(model, x_test, n_samples=200)
print(f"Prediction mean shape: {mean_pred.shape}, uncertainty: {uncertainty.shape}")
print(f"Mean uncertainty: {uncertainty.mean().item():.4f}")

Bayesian Neural Network (BNN)

Assign distributions to weights for full Bayesian treatment:

W \sim \mathcal{N}(0, \sigma_w^2 I)

Approximate posterior with variational inference:

q_\phi(W) = \mathcal{N}(\mu_\phi, \mathrm{diag}(\sigma_\phi^2))

import torch
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

class BayesianLinear(PyroModule):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        # Define prior distributions on weights and biases
        self.weight = PyroSample(
            dist.Normal(0., 1.).expand([out_features, in_features]).to_event(2)
        )
        self.bias = PyroSample(
            dist.Normal(0., 10.).expand([out_features]).to_event(1)
        )

    def forward(self, x):
        return x @ self.weight.T + self.bias

class BayesianNN(PyroModule):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.layer1 = BayesianLinear(in_dim, hidden_dim)
        self.layer2 = BayesianLinear(hidden_dim, out_dim)

    def forward(self, x, y=None):
        x = torch.relu(self.layer1(x))
        mu = self.layer2(x).squeeze(-1)
        sigma = pyro.sample("sigma", dist.Uniform(0., 1.))
        with pyro.plate("data", x.shape[0]):
            obs = pyro.sample("obs", dist.Normal(mu, sigma), obs=y)
        return mu

Practical Tools: PyMC, Stan, Pyro

Bayesian Logistic Regression with NumPyro

import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def logistic_regression_model(X, y=None):
    """NumPyro Bayesian logistic regression model"""
    n_features = X.shape[1]

    # Prior distributions
    alpha = numpyro.sample("alpha", dist.Normal(0., 10.))
    beta = numpyro.sample("beta", dist.Normal(jnp.zeros(n_features), jnp.ones(n_features)))

    # Compute logits
    logits = alpha + X @ beta

    # Likelihood
    with numpyro.plate("obs", X.shape[0]):
        numpyro.sample("y", dist.Bernoulli(logits=logits), obs=y)

# Run inference with NUTS sampler
def run_mcmc(X_train, y_train, num_samples=2000, num_warmup=1000):
    kernel = NUTS(logistic_regression_model)
    mcmc = MCMC(kernel, num_warmup=num_warmup, num_samples=num_samples)
    rng_key = jax.random.PRNGKey(42)
    mcmc.run(rng_key, X_train, y_train)
    return mcmc.get_samples()

# Example usage (generate data)
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)
X_jax = jnp.array(X)
y_jax = jnp.array(y)

samples = run_mcmc(X_jax, y_jax)
print(f"Inferred alpha: {samples['alpha'].mean():.3f} +/- {samples['alpha'].std():.3f}")
print(f"Inferred beta: {samples['beta'].mean(axis=0)}")

Hierarchical Bayesian Model with PyMC

import pymc as pm
import numpy as np

# Hierarchical Bayesian model
# Model data from multiple groups simultaneously
n_groups = 5
n_per_group = 20
np.random.seed(42)

group_means = np.random.randn(n_groups) * 2
y_obs = np.concatenate([
    np.random.randn(n_per_group) + group_means[g]
    for g in range(n_groups)
])
group_idx = np.repeat(np.arange(n_groups), n_per_group)

with pm.Model() as hierarchical_model:
    # Hyperprior distributions
    mu_global = pm.Normal('mu_global', mu=0, sigma=10)
    sigma_global = pm.HalfNormal('sigma_global', sigma=5)

    # Group-level means (Partial Pooling)
    mu_group = pm.Normal('mu_group', mu=mu_global, sigma=sigma_global, shape=n_groups)

    # Observation likelihood
    sigma_obs = pm.HalfNormal('sigma_obs', sigma=2)
    y = pm.Normal('y', mu=mu_group[group_idx], sigma=sigma_obs, observed=y_obs)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True, random_seed=42)

import arviz as az
print(az.summary(trace, var_names=['mu_group', 'mu_global', 'sigma_global']))

Comparing Probabilistic Programming Libraries

| Library | Backend | Inference Methods | Best For |
| --- | --- | --- | --- |
| PyMC | PyTensor (NumPy/JAX) | NUTS, VI, SMC | Flexible Bayesian models |
| Stan | C++ | HMC/NUTS | Research-grade inference |
| Pyro | PyTorch | SVI, MCMC | Deep probabilistic models |
| NumPyro | JAX | NUTS, SVI | Fast GPU-accelerated inference |
| TensorFlow Probability | TensorFlow | HMC, VI | Production TF pipelines |

Quiz

Q1. What is the difference between MAP and MLE, and why is MAP equivalent to regularization?

Answer: MAP = MLE + log Prior

Explanation: MLE optimizes \arg\max_\theta P(D|\theta), while MAP optimizes \arg\max_\theta P(D|\theta)P(\theta). Taking the log: MAP = log-likelihood + log-prior. Under a Gaussian prior P(\theta) \propto \exp(-\lambda\|\theta\|^2), the log-prior equals -\lambda\|\theta\|^2, which is identical to the L2 regularization (Ridge) penalty. A Laplace prior gives -\lambda\|\theta\|_1, identical to L1 regularization (Lasso). In other words, regularization is the frequentist expression of Bayesian prior beliefs about parameters.
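The equivalence can be checked numerically: maximizing the Gaussian log-posterior and solving the ridge normal equations with λ = σ²/τ² give the same weights. A sketch on synthetic data; the values of σ², τ², and the data shapes are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(100)

sigma2, tau2 = 0.1**2, 1.0  # noise variance, Gaussian prior variance

def neg_log_posterior(w):
    log_lik = -0.5 * np.sum((y - X @ w)**2) / sigma2  # Gaussian likelihood
    log_prior = -0.5 * np.sum(w**2) / tau2            # Gaussian prior = L2 penalty
    return -(log_lik + log_prior)

# MAP estimate by direct numerical optimization
w_map = minimize(neg_log_posterior, np.zeros(3)).x

# Ridge closed form with lambda = sigma^2 / tau^2
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print("MAP:  ", np.round(w_map, 4))
print("Ridge:", np.round(w_ridge, 4))
print("match:", np.allclose(w_map, w_ridge, atol=1e-3))
```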

Q2. Why is a burn-in period necessary in MCMC?

Answer: To remove dependence on the initial state.

Explanation: The initial state of an MCMC chain is set arbitrarily and may be far from the posterior distribution. A Markov chain requires time to converge (mix) to the stationary distribution, which is the posterior. The burn-in period discards these early samples before convergence. Including burn-in samples would bias inference toward the initial values. In modern samplers like NUTS, the warm-up phase serves this role, also tuning algorithm parameters like step sizes and mass matrices.
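The effect is easy to reproduce with a random-walk Metropolis sampler targeting N(0, 1), deliberately initialized far from the target. A self-contained sketch; the initial value, step size, and burn-in length are illustrative:

```python
import numpy as np

def metropolis(n_steps, x0, step=0.5, seed=0):
    """Random-walk Metropolis chain targeting a standard normal."""
    rng = np.random.default_rng(seed)
    log_p = lambda v: -0.5 * v**2  # unnormalized log density of N(0, 1)
    x, chain = x0, np.empty(n_steps)
    for i in range(n_steps):
        proposal = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_p(proposal) - log_p(x):
            x = proposal  # accept; otherwise keep the current state
        chain[i] = x
    return chain

chain = metropolis(20_000, x0=50.0)  # start far from the posterior mass
print(f"mean, burn-in kept:          {chain.mean():+.3f}")
print(f"mean, first 2000 discarded:  {chain[2000:].mean():+.3f}")
```

Keeping the transient samples drags the estimated mean toward the arbitrary starting point; discarding them recovers an estimate near the true mean of 0.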

Q3. Why is the reparameterization trick necessary in VAEs?

Answer: To create differentiable gradients for backpropagation.

Explanation: Direct sampling from z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)) creates a stochastic node that is not differentiable: we cannot compute \partial z / \partial \phi for backpropagation. The reparameterization trick rewrites z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon with \epsilon \sim \mathcal{N}(0, I), separating the stochasticity into the input \epsilon. This creates a deterministic path from z back to \mu_\phi and \sigma_\phi, making backpropagation through the sampling operation possible.
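A few lines of PyTorch show the trick in action: gradients of a loss on the samples flow back to μ and σ. A toy sketch; the objective and values are illustrative, and the analytic gradients in the comments are only approximate due to Monte Carlo noise:

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)  # sigma = exp(log_sigma) = 1

# Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I) outside the graph
eps = torch.randn(10_000)
z = mu + log_sigma.exp() * eps

loss = (z ** 2).mean()  # toy objective: E[z^2] = mu^2 + sigma^2
loss.backward()

print(f"dL/dmu        = {mu.grad.item():.3f}  (analytic: 2*mu = 1.0)")
print(f"dL/dlog_sigma = {log_sigma.grad.item():.3f}  (analytic: 2*sigma^2 = 2.0)")
```

Sampling `eps` first and transforming it deterministically is exactly what makes `loss.backward()` reach `mu` and `log_sigma`; sampling `z` directly would break the graph.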

Q4. How does the choice of kernel function affect predictions in Gaussian processes?

Answer: The kernel encodes prior beliefs about the function space.

Explanation: The kernel function k(x, x') defines similarity between inputs, reflecting structural assumptions about the function to be learned. The RBF kernel assumes infinitely differentiable smooth functions: a larger lengthscale \ell means distant points have higher correlation. The Matérn kernel assumes finite differentiability, representing rougher functions. The periodic kernel assumes periodic patterns. Kernel hyperparameters are automatically learned by maximizing the Marginal Likelihood. A poor kernel choice leads to underfitting or overfitting. Kernels can also be composed: sum and product combinations allow encoding multiple properties simultaneously.
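Composition is mechanical: the sum and the elementwise product of two valid kernel matrices are again valid (positive semi-definite) kernels, the product case following from the Schur product theorem. A minimal NumPy sketch; the lengthscale and period settings are illustrative:

```python
import numpy as np

def rbf(x, ell):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d**2 / ell**2)

def periodic(x, ell, p):
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / p)**2 / ell**2)

x = np.linspace(0.0, 4.0, 80)
# Product: "locally periodic" -- a periodic pattern whose correlation decays with distance
K_locally_periodic = rbf(x, ell=2.0) * periodic(x, ell=1.0, p=1.0)
# Sum: a smooth long-term trend plus a seasonal component
K_trend_plus_seasonal = rbf(x, ell=3.0) + periodic(x, ell=1.0, p=1.0)

for name, K in [("product", K_locally_periodic), ("sum", K_trend_plus_seasonal)]:
    print(f"{name}: min eigenvalue = {np.linalg.eigvalsh(K).min():.2e}")
```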

Q5. What is the mathematical basis for MC Dropout being an approximation of Bayesian inference?

Answer: Dropout is mathematically equivalent to variational inference over a deep Gaussian process.

Explanation: Gal and Ghahramani (2016) proved that applying Dropout to a neural network of arbitrary depth and nonlinearities is mathematically equivalent to variational inference over a deep Gaussian process. Specifically, placing Bernoulli priors on each layer's weights and performing variational inference yields an objective identical to the Dropout training loss. Therefore, performing T forward passes with Dropout active at test time is a Monte Carlo integration over the posterior distribution. The Dropout rate is a hyperparameter that controls model uncertainty and requires tuning. The predictive variance across T passes represents epistemic uncertainty.