Deep Learning Training Methods Complete Guide: From Optimization to Distributed Training
- Author: Youngju Kim (@fjvbn20031)
Introduction
Over the past decade, deep learning has achieved revolutionary results in virtually every AI domain — computer vision, natural language processing, speech recognition, and reinforcement learning. However, simply designing a neural network architecture is not enough to build a high-performing model. How you train it is the decisive factor.
This guide systematically covers every technique for effectively training deep learning models. Starting from the fundamentals of gradient descent, we progress through advanced optimizers, learning rate scheduling, regularization, transfer learning, mixed precision training, and large-scale distributed training — all with practical code examples.
1. Gradient Descent Fundamentals
1.1 Understanding the Loss Function
In deep learning, the loss function quantifies the discrepancy between model predictions and ground-truth labels. The goal of training is to find model parameters (weights) that minimize this loss value.
The loss function L depends on model parameters theta and data (x, y). Expressed mathematically:
L(theta) = (1/N) * sum_{i=1}^{N} l(f(x_i; theta), y_i)
Here f is the model function, l is the per-sample loss, and N is the dataset size.
1.2 Intuitive Understanding of Gradient Descent
A helpful analogy for gradient descent is a hiker descending a mountain with eyes closed. At each step, the hiker moves in the direction of steepest descent (opposite to the gradient). Repeating this process eventually leads to the valley floor (minimum).
Mathematically, the update rule is:
theta_{t+1} = theta_t - lr * grad_L(theta_t)
Here lr is the learning rate and grad_L is the gradient of the loss function.
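Before moving to real models, the update rule can be exercised on a one-dimensional toy problem; the loss L(theta) = (theta - 3)^2 and its gradient 2*(theta - 3) below are chosen purely for illustration:

```python
# Minimize L(theta) = (theta - 3)^2 with the plain update rule above
def grad_L(theta):
    return 2 * (theta - 3)  # dL/dtheta

theta = 0.0  # arbitrary starting point
lr = 0.1
for _ in range(100):
    theta = theta - lr * grad_L(theta)

print(f"theta after 100 steps: {theta:.6f}")  # approaches the minimum at theta = 3
```

Each step multiplies the distance to the minimum by (1 - 2*lr), so any learning rate below 1.0 converges here; real loss surfaces are far less forgiving.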
1.3 Batch GD vs Mini-batch GD vs SGD
Batch Gradient Descent
- Computes gradients over the entire dataset
- Stable but memory-intensive and slow
- Impractical for large datasets
Stochastic Gradient Descent (SGD)
- Computes gradients from a single sample
- Fast but noisy and unstable
- Suitable for online learning
Mini-batch Gradient Descent
- Typically uses 32–512 samples per gradient computation
- Combines advantages of both Batch GD and SGD
- The most widely used approach in practice
import torch
import torch.nn as nn
import numpy as np

# Gradient descent implementation with simple linear regression
class LinearRegression(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, x):
        return self.linear(x)

# Mini-batch gradient descent
def train_minibatch(model, X, y, batch_size=32, lr=0.01, epochs=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    losses = []
    N = len(X)
    for epoch in range(epochs):
        perm = torch.randperm(N)
        X_shuffled = X[perm]
        y_shuffled = y[perm]
        epoch_loss = 0
        for i in range(0, N, batch_size):
            x_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            optimizer.zero_grad()
            pred = model(x_batch)
            loss = criterion(pred, y_batch)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        losses.append(epoch_loss / (N // batch_size))
        if epoch % 10 == 0:
            print(f"Epoch {epoch}: Loss = {losses[-1]:.4f}")
    return losses

torch.manual_seed(42)
X = torch.randn(1000, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(1000, 1)
model = LinearRegression(10)
losses = train_minibatch(model, X, y)
1.4 The Critical Role of Learning Rate
The learning rate is one of the most important hyperparameters in deep learning.
- Too large: Loss diverges or oscillates around the minimum
- Too small: Training is extremely slow and may get stuck in local minima
- Just right: Fast convergence to a good minimum
Common starting values are 0.1, 0.01, and 0.001, though the optimal value depends on network architecture and data.
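All three regimes can be observed on a toy quadratic L(theta) = theta^2 (gradient 2*theta); the specific rates below are arbitrary illustrative choices:

```python
# Run gradient descent on L(theta) = theta^2 starting from theta = 1.0
def run_gd(lr, steps=50):
    theta = 1.0
    for _ in range(steps):
        theta = theta - lr * 2 * theta  # per-step update factor: (1 - 2*lr)
    return theta

print(f"lr=1.1   -> theta = {run_gd(1.1):.3e}  (diverges: |1 - 2*lr| > 1)")
print(f"lr=0.001 -> theta = {run_gd(0.001):.3e}  (barely moves)")
print(f"lr=0.1   -> theta = {run_gd(0.1):.3e}  (converges toward 0)")
```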
1.5 Mathematical Derivation (Partial Derivatives, Chain Rule)
Backpropagation in neural networks uses the chain rule to compute gradients for each layer.
For a 3-layer network:
forward: x -> z1=W1*x -> a1=relu(z1) -> z2=W2*a1 -> output
loss: L = MSE(output, y)
backward (chain rule):
dL/dW2 = dL/d_output * d_output/dz2 * dz2/dW2
dL/dW1 = dL/d_output * ... * da1/dz1 * dz1/dW1
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

class SimpleNet:
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(2/input_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, output_dim) * np.sqrt(2/hidden_dim)
        self.b2 = np.zeros(output_dim)

    def forward(self, x):
        self.x = x
        self.z1 = x @ self.W1 + self.b1
        self.a1 = sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        return self.z2

    def backward(self, y, lr=0.01):
        N = len(y)
        dL_dz2 = 2 * (self.z2 - y.reshape(-1, 1)) / N
        dL_dW2 = self.a1.T @ dL_dz2
        dL_db2 = dL_dz2.sum(axis=0)
        dL_da1 = dL_dz2 @ self.W2.T
        dL_dz1 = dL_da1 * sigmoid_deriv(self.z1)
        dL_dW1 = self.x.T @ dL_dz1
        dL_db1 = dL_dz1.sum(axis=0)
        self.W2 -= lr * dL_dW2
        self.b2 -= lr * dL_db2
        self.W1 -= lr * dL_dW1
        self.b1 -= lr * dL_db1

# Test
net = SimpleNet(10, 32, 1)
X_np = np.random.randn(100, 10)
y_np = np.random.randn(100)
for i in range(100):
    pred = net.forward(X_np)
    loss = np.mean((pred.flatten() - y_np) ** 2)
    net.backward(y_np)
    if i % 20 == 0:
        print(f"Step {i}: MSE = {loss:.4f}")
2. Advanced Optimizers
2.1 Momentum SGD
Plain SGD follows gradients directly, causing zigzag movement in narrow valley-shaped loss landscapes. Momentum introduces the physics concept of inertia, allowing the optimizer to remember previous movement directions.
v_t = beta * v_{t-1} + (1 - beta) * grad_t
theta_{t+1} = theta_t - lr * v_t
The momentum coefficient (beta) is typically set to 0.9. Note that PyTorch's SGD implements the undamped variant v_t = beta * v_{t-1} + grad_t (dampening=0); the two forms differ only in the effective scale of the learning rate, roughly a factor of 1/(1 - beta).
import torch
import torch.optim as optim

# Momentum SGD
optimizer_momentum = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=False
)

# Nesterov Accelerated Gradient (NAG) - look-ahead gradient
optimizer_nag = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True
)
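As a sanity check on what `momentum=0.9` actually computes, the recurrence can be replayed by hand against a single-parameter `torch.optim.SGD` (PyTorch uses the undamped form v_t = beta * v_{t-1} + g_t):

```python
import torch

p = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.SGD([p], lr=0.1, momentum=0.9)

g = torch.tensor([0.5])        # fixed "gradient" for the check
v = torch.zeros(1)             # velocity buffer
manual = p.detach().clone()

for _ in range(3):
    opt.zero_grad()
    p.grad = g.clone()
    opt.step()
    v = 0.9 * v + g            # same recurrence by hand
    manual = manual - 0.1 * v

print(torch.allclose(p.detach(), manual))  # True
```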
2.2 Adagrad (Adaptive Learning Rate)
Adagrad adapts the learning rate for each parameter individually: parameters that accumulate large squared gradients receive a smaller effective learning rate, while rarely updated parameters keep a comparatively large one.
G_t = G_{t-1} + grad_t^2
theta_{t+1} = theta_t - (lr / sqrt(G_t + epsilon)) * grad_t
Effective for sparse data, but G_t accumulates indefinitely, causing the learning rate to shrink toward zero.
optimizer_adagrad = optim.Adagrad(
    model.parameters(),
    lr=0.01,
    eps=1e-8,
    weight_decay=0
)
2.3 RMSprop
RMSprop resolves Adagrad's learning rate decay problem by using an exponential moving average of squared gradients.
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * grad_t^2
theta_{t+1} = theta_t - (lr / sqrt(E[g^2]_t + epsilon)) * grad_t
optimizer_rmsprop = optim.RMSprop(
    model.parameters(),
    lr=0.001,
    alpha=0.99,
    eps=1e-8,
    momentum=0,
    centered=False
)
2.4 Adam (Adaptive Moment Estimation)
Adam combines Momentum and RMSprop, tracking both first-order moments (mean) and second-order moments (variance). It is currently the most widely used optimizer.
The algorithm:
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t # 1st moment (before bias correction)
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 # 2nd moment (before bias correction)
m_hat = m_t / (1 - beta1^t) # bias correction
v_hat = v_t / (1 - beta2^t) # bias correction
theta_{t+1} = theta_t - lr * m_hat / (sqrt(v_hat) + epsilon)
Default hyperparameters: lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
optimizer_adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0
)
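For a single step (t = 1) the bias correction gives m_hat = g and v_hat = g^2, so the update reduces to lr * g / (|g| + epsilon); this can be checked against `torch.optim.Adam` directly:

```python
import torch

p = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.Adam([p], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

g = 0.3
p.grad = torch.tensor([g])
opt.step()

# Manual first step: theta - lr * m_hat / (sqrt(v_hat) + eps), with m_hat = g, v_hat = g^2
expected = 1.0 - 1e-3 * g / (abs(g) + 1e-8)
print(p.item(), expected)  # both approximately 0.9990
```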
2.5 AdamW (Decoupled Weight Decay)
In standard Adam, L2 regularization is coupled with gradients and thus affected by the adaptive learning rate. AdamW applies weight decay directly to parameter updates, decoupled from the gradient-based update.
theta_{t+1} = theta_t - lr * (m_hat / (sqrt(v_hat) + epsilon) + lambda * theta_t)
AdamW has become the standard for training Transformer models (BERT, GPT, etc.).
optimizer_adamw = optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01
)
2.6 LARS and LAMB (Large-Batch Training)
When using very large batch sizes (thousands), standard Adam degrades in performance. LARS (Layer-wise Adaptive Rate Scaling) and LAMB adjust learning rates per layer.
LARS: lr_l = lr * ||w_l|| / (||g_l|| + lambda * ||w_l||)
LAMB: applies a per-layer trust ratio to the Adam update
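Neither optimizer ships with core PyTorch, but the per-layer trust ratio in the LARS formula above is simple to compute; this is an illustrative helper sketch, not a full optimizer implementation:

```python
import torch

def lars_local_lr(weight, grad, base_lr=0.1, weight_decay=1e-4):
    """Per-layer LR from the LARS formula: lr * ||w|| / (||g|| + lambda * ||w||)."""
    w_norm = weight.norm()
    g_norm = grad.norm()
    if w_norm > 0 and g_norm > 0:
        return (base_lr * w_norm / (g_norm + weight_decay * w_norm)).item()
    return base_lr  # fall back to the global LR for degenerate layers

w = torch.full((4,), 1.0)   # ||w|| = 2
g = torch.full((4,), 0.5)   # ||g|| = 1
print(f"per-layer lr: {lars_local_lr(w, g):.4f}")  # ~0.2: large weights, small grads -> bigger step
```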
2.7 Lion Optimizer (2023)
Google Brain's Lion (EvoLved Sign Momentum) updates parameters using only the sign of a momentum-interpolated gradient, so it stores just one momentum buffer (versus Adam's two), resulting in lower memory usage while delivering competitive performance.
class Lion(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, weight_decay=weight_decay)
        super().__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                lr = group['lr']
                beta1, beta2 = group['betas']
                wd = group['weight_decay']
                state = self.state[p]
                if len(state) == 0:
                    state['exp_avg'] = torch.zeros_like(p)
                exp_avg = state['exp_avg']
                # Lion update: take only the sign of the interpolated momentum
                update = exp_avg * beta1 + grad * (1 - beta1)
                p.data.mul_(1 - lr * wd)
                p.data.add_(update.sign_(), alpha=-lr)
                # Momentum update
                exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
        return loss
2.8 Optimizer Comparison Experiment
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x)

def train_and_compare(optimizers_dict, X, y, epochs=200):
    results = {}
    for name, opt_fn in optimizers_dict.items():
        model = MLP()
        optimizer = opt_fn(model.parameters())
        criterion = nn.MSELoss()
        losses = []
        for epoch in range(epochs):
            optimizer.zero_grad()
            pred = model(X)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        results[name] = losses
        print(f"{name}: Final Loss = {losses[-1]:.4f}")
    return results

X = torch.randn(500, 2)
y = (X[:, 0] * 2 + X[:, 1] * 3 + torch.randn(500) * 0.1).unsqueeze(1)

optimizers = {
    'SGD': lambda p: torch.optim.SGD(p, lr=0.01),
    'SGD+Momentum': lambda p: torch.optim.SGD(p, lr=0.01, momentum=0.9),
    'Adam': lambda p: torch.optim.Adam(p, lr=0.001),
    'AdamW': lambda p: torch.optim.AdamW(p, lr=0.001, weight_decay=0.01),
    'RMSprop': lambda p: torch.optim.RMSprop(p, lr=0.001),
}
results = train_and_compare(optimizers, X, y)
3. Learning Rate Scheduling
A fixed learning rate is rarely optimal. Learning rate scheduling dynamically adjusts the rate during training to achieve faster convergence and better final performance.
3.1 Step and Exponential Decay
import torch
import torch.optim as optim

model = MLP()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Step Decay: reduce by gamma every step_size epochs
step_scheduler = optim.lr_scheduler.StepLR(
    optimizer,
    step_size=30,
    gamma=0.1
)

# MultiStep Decay: reduce at specified milestones
multistep_scheduler = optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 80],
    gamma=0.1
)

# Exponential Decay: reduce exponentially every epoch
exp_scheduler = optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95
)
3.2 Cosine Annealing
Cosine Annealing smoothly decreases the learning rate following a cosine curve. Cosine Annealing with Warm Restarts periodically resets the learning rate for exploration.
# Cosine Annealing
cosine_scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=100,
    eta_min=1e-6
)

# Cosine Annealing with Warm Restarts (SGDR)
cosine_restart = optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,
    T_mult=2,
    eta_min=1e-6
)
3.3 Warmup + Cosine Schedule
The standard schedule for training Transformer models. The learning rate increases linearly during warmup, then decreases following a cosine curve.
import math
from torch.optim.lr_scheduler import LambdaLR

def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5):
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        progress = float(current_step - num_warmup_steps) / float(
            max(1, num_training_steps - num_warmup_steps)
        )
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))
    return LambdaLR(optimizer, lr_lambda)

optimizer = optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=10000
)
3.4 OneCycleLR
OneCycleLR aggressively ramps the learning rate up and then down for fast convergence. Introduced by Leslie Smith and popularized by FastAI.
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    steps_per_epoch=len(train_loader),
    epochs=10,
    pct_start=0.3,
    anneal_strategy='cos',
    div_factor=25.0,
    final_div_factor=1e4
)

for epoch in range(10):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch[0]), batch[1])
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR steps per batch
3.5 Learning Rate Finder
Automatically identifies an appropriate learning rate range before training.
from torch_lr_finder import LRFinder

model = MLP()
optimizer = optim.SGD(model.parameters(), lr=1e-7, weight_decay=1e-2)
criterion = nn.MSELoss()

lr_finder = LRFinder(model, optimizer, criterion, device="cuda")
lr_finder.range_test(train_loader, end_lr=100, num_iter=100)
lr_finder.plot()
lr_finder.reset()
# Select the LR at the steepest loss decline
# Typically use 1/10 to 1/3 of the value at the minimum
4. Loss Functions
4.1 Regression Loss Functions
import torch
import torch.nn as nn
import torch.nn.functional as F

# MSE - sensitive to outliers
mse_loss = nn.MSELoss()

# MAE - robust to outliers
mae_loss = nn.L1Loss()

# Huber Loss - compromise between MSE and MAE
# |y - y_hat| < delta: 0.5 * (y - y_hat)^2
# |y - y_hat| >= delta: delta * (|y - y_hat| - 0.5 * delta)
huber_loss = nn.HuberLoss(delta=1.0)

def huber_loss_manual(pred, target, delta=1.0):
    residual = torch.abs(pred - target)
    condition = residual < delta
    squared_loss = 0.5 * residual ** 2
    linear_loss = delta * residual - 0.5 * delta ** 2
    return torch.where(condition, squared_loss, linear_loss).mean()
4.2 Classification Loss Functions
# Cross-Entropy Loss (multi-class)
ce_loss = nn.CrossEntropyLoss()

# Binary Cross-Entropy (binary classification)
bce_loss = nn.BCEWithLogitsLoss()

# Label Smoothing Cross-Entropy (reduces overconfidence)
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)

# Focal Loss (addresses class imbalance)
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=None, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = reduction

    def forward(self, inputs, targets):
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss
        if self.alpha is not None:
            alpha_t = self.alpha[targets]
            focal_loss = alpha_t * focal_loss
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss
4.3 Segmentation Loss Functions
# Binary Cross-Entropy on logits
def bce_loss_fn(pred, target):
    return F.binary_cross_entropy_with_logits(pred, target)

# Dice Loss (robust to class imbalance)
def dice_loss(pred, target, smooth=1.0):
    pred = torch.sigmoid(pred)
    pred_flat = pred.view(-1)
    target_flat = target.view(-1)
    intersection = (pred_flat * target_flat).sum()
    dice = (2. * intersection + smooth) / (pred_flat.sum() + target_flat.sum() + smooth)
    return 1 - dice

# BCE + Dice combination (common in segmentation)
def bce_dice_loss(pred, target, bce_weight=0.5):
    bce = bce_loss_fn(pred, target)
    dice = dice_loss(pred, target)
    return bce_weight * bce + (1 - bce_weight) * dice
4.4 Metric Learning Loss Functions
# Contrastive Loss (pull similar pairs together, push dissimilar apart)
class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, output1, output2, label):
        # label=1: same class, label=0: different class
        euclidean_dist = F.pairwise_distance(output1, output2)
        loss = (label * euclidean_dist.pow(2) +
                (1 - label) * F.relu(self.margin - euclidean_dist).pow(2))
        return loss.mean()

# Triplet Loss (anchor, positive, negative)
class TripletLoss(nn.Module):
    def __init__(self, margin=0.3):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative):
        pos_dist = F.pairwise_distance(anchor, positive)
        neg_dist = F.pairwise_distance(anchor, negative)
        loss = F.relu(pos_dist - neg_dist + self.margin)
        return loss.mean()
5. Regularization Techniques
Techniques to prevent overfitting and improve generalization.
5.1 L1/L2 Regularization
# L2 Regularization (Weight Decay)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# L1 Regularization (manual implementation)
def l1_regularization(model, lambda_l1):
    l1_penalty = 0
    for param in model.parameters():
        l1_penalty += torch.abs(param).sum()
    return lambda_l1 * l1_penalty

# Elastic Net (L1 + L2)
def elastic_net_loss(model, criterion, outputs, targets, lambda_l1=1e-5, lambda_l2=1e-4):
    base_loss = criterion(outputs, targets)
    l1_penalty = sum(torch.abs(p).sum() for p in model.parameters())
    l2_penalty = sum((p ** 2).sum() for p in model.parameters())
    return base_loss + lambda_l1 * l1_penalty + lambda_l2 * l2_penalty
5.2 Dropout
Dropout randomly deactivates neurons during training to prevent co-adaptation. Inverted Dropout divides by the keep probability during training, so no scaling is needed at inference.
class ModelWithDropout(nn.Module):
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x)

# Training mode: dropout is active
model.train()
# Inference mode: dropout is disabled
model.eval()
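The inverted-dropout scaling described above (divide by the keep probability during training, so inference needs no rescaling) can be written out by hand; `nn.Dropout` performs the same computation internally:

```python
import torch

def inverted_dropout(x, p=0.5, training=True):
    if not training or p == 0:
        return x  # inference: identity, no rescaling needed
    keep_prob = 1 - p
    mask = (torch.rand_like(x) < keep_prob).float()
    # Dividing by keep_prob keeps the expected activation unchanged
    return x * mask / keep_prob

x = torch.ones(10000)
out = inverted_dropout(x, p=0.5)
print(out.mean())  # close to 1.0: expectation preserved despite zeroed units
```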
5.3 Data Augmentation
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Mixup Augmentation
def mixup_data(x, y, alpha=1.0):
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1
    batch_size = x.size()[0]
    index = torch.randperm(batch_size)
    mixed_x = lam * x + (1 - lam) * x[index]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# CutMix Augmentation
def cutmix_data(x, y, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    batch_size, C, H, W = x.size()
    index = torch.randperm(batch_size)
    cut_ratio = np.sqrt(1. - lam)
    cut_w = int(W * cut_ratio)
    cut_h = int(H * cut_ratio)
    cx = np.random.randint(W)
    cy = np.random.randint(H)
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)
    bby2 = np.clip(cy + cut_h // 2, 0, H)
    mixed_x = x.clone()
    mixed_x[:, :, bby1:bby2, bbx1:bbx2] = x[index, :, bby1:bby2, bbx1:bbx2]
    # Adjust lambda to match the actual cut area
    lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (W * H))
    return mixed_x, y, y[index], lam
5.4 Early Stopping
class EarlyStopping:
    def __init__(self, patience=10, min_delta=0.001, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.counter = 0
        self.best_loss = None
        self.best_weights = None
        self.early_stop = False

    def __call__(self, val_loss, model):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_weights = {k: v.clone() for k, v in model.state_dict().items()}
        elif val_loss > self.best_loss - self.min_delta:
            self.counter += 1
            print(f"EarlyStopping counter: {self.counter}/{self.patience}")
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_loss = val_loss
            self.best_weights = {k: v.clone() for k, v in model.state_dict().items()}
            self.counter = 0

    def restore(self, model):
        if self.restore_best_weights and self.best_weights:
            model.load_state_dict(self.best_weights)
            print("Restored best model weights")
6. Normalization Layers
6.1 Batch Normalization
Proposed by Sergey Ioffe and Christian Szegedy in 2015, Batch Normalization normalizes features within each mini-batch to address the internal covariate shift problem.
The process:
1. Mini-batch mean: mu_B = (1/m) * sum(x_i)
2. Mini-batch variance: sigma_B^2 = (1/m) * sum((x_i - mu_B)^2)
3. Normalize: x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2 + epsilon)
4. Scale and shift: y_i = gamma * x_hat_i + beta
gamma (scale) and beta (shift) are learnable parameters.
import torch
import torch.nn as nn

class BatchNormNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x)

# Manual BatchNorm implementation
class BatchNorm(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.eps = eps
        self.momentum = momentum
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # detach so the running statistics stay outside the autograd graph
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.detach()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.detach()
        else:
            mean = self.running_mean
            var = self.running_var
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta
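The four steps above can be verified numerically against `nn.BatchNorm1d` in training mode (a freshly constructed layer has gamma = 1 and beta = 0, so the output is just the normalized x_hat):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8)

bn = nn.BatchNorm1d(8)
bn.train()
out = bn(x)

# Steps 1-3 by hand; step 4 is the identity at initialization (gamma=1, beta=0)
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
x_hat = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, x_hat, atol=1e-6))  # True
```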
6.2 Layer Normalization (Transformer Standard)
Layer Normalization normalizes across the feature dimension rather than the batch dimension. It is independent of batch size, making it suitable for RNNs and Transformers.
class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        if isinstance(normalized_shape, int):
            normalized_shape = (normalized_shape,)
        self.normalized_shape = normalized_shape
        self.gamma = nn.Parameter(torch.ones(normalized_shape))
        self.beta = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

# Transformer Block with Pre-LayerNorm (modern GPT-style)
class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, nhead)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.GELU(),
            nn.Linear(dim_feedforward, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attention(h, h, h)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x
6.3 Instance, Group, and RMS Normalization
# Instance Normalization (per-sample, per-channel)
# Effective for style transfer
instance_norm = nn.InstanceNorm2d(64)

# Group Normalization (normalize within channel groups)
# Alternative to BN when batch size is small
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)

# RMS Normalization (used in LLaMA, T5)
# Removes mean centering from LayerNorm for speed
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        return self.weight * self._norm(x.float()).type_as(x)

# Summary of when to use each normalization:
# BatchNorm: CNN, batch-dependent, best with batch size >= 16
# LayerNorm: Transformer/RNN, batch-independent
# InstanceNorm: style transfer, per-sample per-channel
# GroupNorm: small batches, detection/segmentation
# RMSNorm: LLMs, lightweight LayerNorm alternative
7. Weight Initialization
7.1 Xavier/He Initialization
Weight initialization sets the starting point for training. Poor initialization can trigger vanishing or exploding gradients.
import torch
import torch.nn as nn

class WeightInitDemo(nn.Module):
    def __init__(self, init_method='xavier_uniform'):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(256, 256) for _ in range(5)
        ])
        self.apply_init(init_method)

    def apply_init(self, method):
        for layer in self.layers:
            if method == 'zeros':
                nn.init.zeros_(layer.weight)  # Bad: symmetry problem
            elif method == 'random_small':
                nn.init.normal_(layer.weight, std=0.01)
            elif method == 'xavier_uniform':
                nn.init.xavier_uniform_(layer.weight)  # For sigmoid/tanh
            elif method == 'xavier_normal':
                nn.init.xavier_normal_(layer.weight)
            elif method == 'kaiming_uniform':
                nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
            elif method == 'kaiming_normal':
                nn.init.kaiming_normal_(layer.weight, mode='fan_out', nonlinearity='relu')  # For ReLU
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

# Initialization comparison
x = torch.randn(100, 256)
for method in ['zeros', 'random_small', 'xavier_uniform', 'kaiming_normal']:
    model = WeightInitDemo(method)
    with torch.no_grad():
        out = model(x)
    print(f"{method}: output mean={out.mean():.4f}, std={out.std():.4f}")
Xavier/Glorot Initialization is designed for sigmoid/tanh activations:
- Uniform: weights drawn from Uniform(-limit, +limit) where limit = sqrt(6 / (fan_in + fan_out))
He/Kaiming Initialization is designed for ReLU activations:
- Normal: weights drawn from Normal(0, sqrt(2 / fan_in))
8. Gradient Problem Solutions
8.1 Vanishing and Exploding Gradients
Vanishing Gradient: Gradients shrink toward zero as they propagate back through layers, preventing early layers from learning. Common with sigmoid and tanh activations in deep networks.
Exploding Gradient: Gradients grow exponentially, causing NaN or Inf values. Common in RNNs with long sequences.
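The vanishing effect is easy to reproduce: in a deep stack of sigmoid layers, the gradient norm at the first layer ends up orders of magnitude smaller than at the last (the depth and widths here are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# 20 Linear+Sigmoid blocks: each sigmoid multiplies backpropagated gradients by at most 0.25
layers = []
for _ in range(20):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
net = nn.Sequential(*layers)

x = torch.randn(8, 64)
net(x).sum().backward()

first = net[0].weight.grad.norm().item()
last = net[-2].weight.grad.norm().item()
print(f"first layer grad norm: {first:.2e}")
print(f"last layer grad norm:  {last:.2e}")  # many orders of magnitude larger
```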
import torch.nn.utils as utils

# Method 1: Gradient norm clipping
max_norm = 1.0
total_norm = utils.clip_grad_norm_(model.parameters(), max_norm)
print(f"Gradient norm: {total_norm:.4f}")

# Method 2: Gradient value clipping
utils.clip_grad_value_(model.parameters(), clip_value=0.5)

# Usage in training loop
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for batch in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(batch[0]), batch[1])
    loss.backward()
    # Clip after backward, before optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
8.2 Residual Connections (Skip Connections)
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # Skip Connection
        out = self.relu(out)
        return out
8.3 Gradient Checkpointing
For very deep models, trade compute for memory: discard intermediate activations and recompute them during the backward pass.
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

class DeepModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.Sequential(nn.Linear(512, 512), nn.ReLU())
            for _ in range(20)
        ])

    def forward(self, x):
        # Standard: stores all activations, O(N) memory
        # return self.layers(x)
        # Gradient Checkpointing: roughly O(sqrt(N)) memory when segments ~ sqrt(N)
        return checkpoint_sequential(self.layers, segments=5, input=x)
9. Transfer Learning and Fine-tuning
9.1 Feature Extraction vs Fine-tuning
import torchvision.models as models

# Feature Extraction: freeze pretrained weights
def feature_extraction(num_classes):
    # Note: pretrained=True is deprecated in newer torchvision;
    # use weights=models.ResNet50_Weights.DEFAULT instead
    model = models.resnet50(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    # Replace only the classifier head
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# Fine-tuning: selectively unfreeze layers
def fine_tuning(num_classes, unfreeze_layers=None):
    model = models.resnet50(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    if unfreeze_layers:
        for name, param in model.named_parameters():
            for layer in unfreeze_layers:
                if layer in name:
                    param.requires_grad = True
    return model
9.2 Progressive Unfreezing and Discriminative Learning Rates
def discriminative_lr_optimizer(model, base_lr=1e-4, lr_multiplier=10):
    # Assign lower LR to early layers, higher LR to later layers
    param_groups = [
        {'params': model.layer1.parameters(), 'lr': base_lr / (lr_multiplier**3)},
        {'params': model.layer2.parameters(), 'lr': base_lr / (lr_multiplier**2)},
        {'params': model.layer3.parameters(), 'lr': base_lr / lr_multiplier},
        {'params': model.layer4.parameters(), 'lr': base_lr},
        {'params': model.fc.parameters(), 'lr': base_lr * lr_multiplier},
    ]
    return torch.optim.Adam(param_groups)
9.3 LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique for large language models. It freezes the original weight matrices and learns a low-rank decomposition.
For an original weight matrix W with shape d by k, LoRA learns W' = W + BA, where B has shape d by r and A has shape r by k. The rank r is chosen to be much smaller than both d and k.
class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        # Frozen original weights
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features),
            requires_grad=False
        )
        # LoRA matrix A (random init)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        # LoRA matrix B (zero init -> identical to original at start)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        base_output = nn.functional.linear(x, self.weight, self.bias)
        lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base_output + lora_output
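Two properties of this parametrization are worth checking directly: with B initialized to zero, W + BA equals W, so fine-tuning starts exactly at the pretrained model, and the trainable parameter count drops from d*k to r*(d + k):

```python
import torch

d, k, r = 512, 512, 8
W = torch.randn(d, k)           # frozen pretrained weight
A = torch.randn(r, k) * 0.01    # trainable, small random init
B = torch.zeros(d, r)           # trainable, zero init

x = torch.randn(4, k)
base = x @ W.T
adapted = x @ (W + B @ A).T

print(torch.allclose(base, adapted))  # True: identical to the base model at step 0
print(d * k, r * (d + k))             # 262144 full vs 8192 LoRA trainable parameters
```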
# Using HuggingFace PEFT library
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)
# peft_model = get_peft_model(base_model, lora_config)
# peft_model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.062
10. Hyperparameter Tuning
10.1 Bayesian Optimization with Optuna
import optuna
import torch
import torch.nn as nn
def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int('n_layers', 1, 5)
    n_units = trial.suggest_categorical('n_units', [64, 128, 256, 512])
    dropout_rate = trial.suggest_float('dropout', 0.0, 0.5)
    optimizer_name = trial.suggest_categorical('optimizer', ['Adam', 'AdamW', 'SGD'])
    layers = []
    in_dim = 784
    for _ in range(n_layers):
        layers.extend([
            nn.Linear(in_dim, n_units),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        ])
        in_dim = n_units
    layers.append(nn.Linear(in_dim, 10))
    model = nn.Sequential(*layers)
    if optimizer_name == 'Adam':
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    elif optimizer_name == 'AdamW':
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    else:
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    val_accuracy = 0.95  # replace with actual training
    return val_accuracy
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.MedianPruner()
)
study.optimize(objective, n_trials=100, timeout=3600)
print(f"Best trial: {study.best_trial.value:.4f}")
print(f"Best params: {study.best_trial.params}")
11. Mixed Precision Training
11.1 FP32 vs FP16 vs BF16
| Format | Exponent bits | Mantissa bits | Range | Primary use |
|---|---|---|---|---|
| FP32 | 8 | 23 | +-3.4e38 | Default training |
| FP16 | 5 | 10 | +-65504 | Inference / training (overflow risk) |
| BF16 | 8 | 7 | +-3.4e38 | LLM training (A100, TPU) |
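The range and precision trade-off in the table is easy to demonstrate directly: FP16 overflows just past 65504, while BF16 keeps FP32's exponent range at the cost of a much coarser mantissa. A minimal check:

```python
import torch

x = torch.tensor(70000.0)
print(x.to(torch.float16))    # inf -> overflow, exceeds the FP16 max of 65504
print(x.to(torch.bfloat16))   # in range, but coarse: only 7 mantissa bits
print(torch.tensor(1.0001).to(torch.bfloat16))  # rounds to 1.0
```

This is why FP16 training needs loss scaling (to avoid overflow/underflow), while BF16 usually runs without it.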
11.2 PyTorch AMP (Automatic Mixed Precision)
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
num_epochs = 10

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        inputs, labels = inputs.cuda(), labels.cuda()
        optimizer.zero_grad()
        # FP16 computation within autocast context
        with autocast(dtype=torch.float16):
            outputs = model(inputs)
            loss = criterion(outputs, labels)
        # Scale loss, then backprop
        scaler.scale(loss).backward()
        # Unscale for gradient clipping
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        # Step (skips if NaN/Inf gradients detected)
        scaler.step(optimizer)
        scaler.update()

# BF16 (more stable, requires Ampere or newer GPU)
with autocast(dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, labels)
Key benefits of AMP:
- Memory reduction: roughly 2x for activations (stored in FP16; master weights remain FP32)
- Speed improvement: 1.5x–3x on Tensor Core GPUs
- Near-identical accuracy to FP32 in most tasks
12. Distributed Training
12.1 Data Parallelism with DDP
Data parallelism distributes the data across multiple GPUs: each GPU holds a full model replica, computes forward and backward passes on its shard of the batch, and gradients are then averaged across replicas with an all-reduce so every replica applies the same update.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.cuda.amp import autocast, GradScaler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    import os
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size, model_class, dataset, num_epochs=10):
    setup(rank, world_size)
    device = torch.device(f'cuda:{rank}')
    model = model_class().to(device)
    model = DDP(model, device_ids=[rank])
    sampler = DistributedSampler(
        dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True
    )
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=32,
        sampler=sampler,
        num_workers=4,
        pin_memory=True
    )
    # Linear LR scaling with the number of replicas
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 * world_size)
    criterion = nn.CrossEntropyLoss()
    scaler = GradScaler()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Different shuffle per epoch
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        if rank == 0:
            print(f"Epoch {epoch}: Loss = {loss.item():.4f}")
    cleanup()

import torch.multiprocessing as mp

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train_ddp, args=(world_size, MyModel, dataset), nprocs=world_size, join=True)
12.2 FSDP (Fully Sharded Data Parallel)
FSDP shards model parameters, gradients, and optimizer states across all GPUs. Essential for training models with billions of parameters that exceed single-GPU memory.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
import functools
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16
)
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock}
)
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=bf16_policy,
    auto_wrap_policy=auto_wrap_policy,
    device_id=rank
)
FSDP sharding strategies:
- FULL_SHARD: shard params, gradients, and optimizer states (maximum memory savings)
- SHARD_GRAD_OP: shard gradients and optimizer states only
- NO_SHARD: equivalent to DDP
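The memory benefit of FULL_SHARD can be estimated with a back-of-the-envelope calculation in the style of ZeRO-3: with BF16 weights and gradients plus FP32 Adam state, each parameter costs roughly 2 + 2 + 12 = 16 bytes, divided across all GPUs. A rough sketch, assuming a 7B-parameter model on 8 GPUs (all numbers illustrative, activations excluded):

```python
# Rough per-GPU memory for FULL_SHARD: every state tensor is split across GPUs
params = 7e9                  # assumption: a 7B-parameter model
bytes_per_param = 2 + 2 + 12  # bf16 weight + bf16 grad + fp32 master copy and two Adam moments
n_gpus = 8
per_gpu_gb = params * bytes_per_param / n_gpus / 1024**3
print(f"~{per_gpu_gb:.1f} GB per GPU (excluding activations)")
```

Without sharding, the same 16 bytes per parameter would have to fit on every GPU (about 104 GB for 7B parameters), which is why FULL_SHARD is essential at this scale.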
12.3 Gradient Accumulation
Simulate large batch sizes on limited GPU memory by accumulating gradients across multiple micro-batches.
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()
micro_batch_size = 8
accumulation_steps = 8  # Effective batch size: 64

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(train_loader):
    inputs, labels = inputs.cuda(), labels.cuda()
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    loss = loss / accumulation_steps  # Normalize loss
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
13. Large Language Model Training Techniques
13.1 Instruction Tuning
Instruction tuning trains models to follow natural language instructions. It was central to the success of FLAN, InstructGPT, and LLaMA-2.
# Instruction tuning data format
instruction_data = [
    {
        "instruction": "Analyze the sentiment of the following text.",
        "input": "The weather today is absolutely gorgeous, I feel wonderful!",
        "output": "Positive sentiment. The text expresses satisfaction with the weather and a feeling of happiness."
    },
    {
        "instruction": "Summarize the following article.",
        "input": "...(long text)...",
        "output": "...(summary)..."
    }
]
# Alpaca-style prompt template
def format_instruction(sample):
    if sample.get('input'):
        return f"""### Instruction:
{sample['instruction']}
### Input:
{sample['input']}
### Response:
{sample['output']}"""
    else:
        return f"""### Instruction:
{sample['instruction']}
### Response:
{sample['output']}"""
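When fine-tuning on these formatted prompts, it is common practice to mask the prompt tokens out of the loss so the model is only trained to produce the response. A hedged sketch using the CrossEntropyLoss `ignore_index=-100` convention (`mask_prompt_labels` and the token ids below are illustrative, not part of the Alpaca code):

```python
import torch

def mask_prompt_labels(input_ids, prompt_len):
    """Clone input_ids into labels and mask the prompt with -100 so that
    CrossEntropyLoss (ignore_index=-100) only scores the response tokens.
    prompt_len is assumed known from tokenizing the prompt alone."""
    labels = input_ids.clone()
    labels[:prompt_len] = -100
    return labels

input_ids = torch.tensor([101, 2023, 2003, 1037, 3231, 102])  # placeholder token ids
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # tensor([-100, -100, -100, -100, 3231, 102])
```

Without this masking, the model also spends capacity learning to reproduce the instruction text, which is usually not the goal.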
13.2 RLHF (Reinforcement Learning from Human Feedback)
RLHF involves three stages:
- Stage 1 — SFT (Supervised Fine-tuning): fine-tune on high-quality human demonstrations
- Stage 2 — Reward Model: train a reward model to predict human preferences
- Stage 3 — PPO: optimize the policy with RL using the reward model
# Stage 2: Reward Model (Bradley-Terry preference model)
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden).squeeze(-1)
        return reward

# Preference loss (Bradley-Terry model)
def preference_loss(reward_chosen, reward_rejected):
    # p(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()
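In Stage 3, PPO typically does not optimize the raw reward-model score: a per-token KL penalty against the frozen SFT policy is subtracted to keep the fine-tuned model from drifting too far (as in InstructGPT). A hedged sketch of that combined reward; `ppo_reward`, its inputs, and the beta value are illustrative, not from a specific library:

```python
import torch

def ppo_reward(rm_score, policy_logps, ref_logps, beta=0.02):
    """Reward used by PPO in Stage 3: reward-model score minus a KL penalty
    toward the frozen SFT reference policy.

    rm_score:     scalar reward-model score for the full response
    policy_logps: per-token log-probs under the current policy, shape (T,)
    ref_logps:    per-token log-probs under the SFT reference, shape (T,)
    """
    kl_estimate = (policy_logps - ref_logps).sum()  # sample-based KL estimate
    return rm_score - beta * kl_estimate

r = ppo_reward(torch.tensor(1.0), torch.zeros(5), torch.zeros(5))
print(r)  # tensor(1.) -- identical policies incur no penalty
```

The larger beta is, the more conservatively the policy stays near the SFT model; too small a beta invites reward hacking.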
13.3 DPO (Direct Preference Optimization)
DPO simplifies RLHF by eliminating the need for PPO, directly optimizing the policy on preference data using a closed-form reparameterization.
import torch
import torch.nn.functional as F
def dpo_loss(
    policy_chosen_logps,
    policy_rejected_logps,
    reference_chosen_logps,
    reference_rejected_logps,
    beta=0.1
):
    # Log ratios between policy and reference model
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # DPO loss: -log(sigmoid(chosen_rewards - rejected_rewards))
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    chosen_reward = chosen_rewards.detach().mean()
    rejected_reward = rejected_rewards.detach().mean()
    reward_accuracy = (chosen_rewards > rejected_rewards).float().mean()
    return loss, chosen_reward, rejected_reward, reward_accuracy
DPO advantages over RLHF:
- No need to train a separate reward model
- No PPO hyperparameter tuning
- More stable training
- Comparable or better alignment results
14. Complete Training Pipeline
14.1 Production-Grade Trainer
import torch
import torch.nn as nn
from contextlib import nullcontext
from torch.cuda.amp import autocast, GradScaler

class Trainer:
    def __init__(
        self,
        model,
        train_loader,
        val_loader,
        optimizer,
        scheduler,
        criterion,
        device='cuda',
        use_amp=True,
        grad_clip=1.0,
        accumulation_steps=1
    ):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.criterion = criterion
        self.device = device
        self.use_amp = use_amp
        self.grad_clip = grad_clip
        self.accumulation_steps = accumulation_steps
        self.scaler = GradScaler() if use_amp else None

    def train_epoch(self):
        self.model.train()
        total_loss = 0
        self.optimizer.zero_grad()
        for step, (inputs, labels) in enumerate(self.train_loader):
            inputs, labels = inputs.to(self.device), labels.to(self.device)
            if self.use_amp:
                with autocast():
                    outputs = self.model(inputs)
                    loss = self.criterion(outputs, labels) / self.accumulation_steps
                self.scaler.scale(loss).backward()
            else:
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels) / self.accumulation_steps
                loss.backward()
            if (step + 1) % self.accumulation_steps == 0:
                if self.use_amp:
                    self.scaler.unscale_(self.optimizer)
                if self.grad_clip:
                    nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_clip)
                if self.use_amp:
                    self.scaler.step(self.optimizer)
                    self.scaler.update()
                else:
                    self.optimizer.step()
                if self.scheduler:
                    self.scheduler.step()
                self.optimizer.zero_grad()
            total_loss += loss.item() * self.accumulation_steps
        return total_loss / len(self.train_loader)

    @torch.no_grad()
    def evaluate(self):
        self.model.eval()
        total_loss = 0
        correct = 0
        total = 0
        for inputs, labels in self.val_loader:
            inputs, labels = inputs.to(self.device), labels.to(self.device)
            # Gradients are already disabled by the decorator; only autocast is conditional
            with autocast() if self.use_amp else nullcontext():
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
            total_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
        return total_loss / len(self.val_loader), 100. * correct / total

    def fit(self, epochs, save_path=None):
        best_val_acc = 0
        early_stopping = EarlyStopping(patience=10)
        for epoch in range(epochs):
            train_loss = self.train_epoch()
            val_loss, val_acc = self.evaluate()
            print(f"Epoch {epoch+1}/{epochs}: "
                  f"Train Loss: {train_loss:.4f}, "
                  f"Val Loss: {val_loss:.4f}, "
                  f"Val Acc: {val_acc:.2f}%")
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                if save_path:
                    torch.save(self.model.state_dict(), save_path)
            early_stopping(val_loss, self.model)
            if early_stopping.early_stop:
                print("Early stopping triggered!")
                break
        return best_val_acc
Conclusion and Best Practices
A summary of core principles for effective deep learning training:
Optimizer selection
- General tasks: AdamW (lr=1e-3 to 1e-4, weight_decay=0.01)
- Transformers: AdamW + Warmup + Cosine Schedule
- Large-batch training: LAMB or LARS
- Memory-constrained: Lion
Regularization strategy
- Dropout is typically set between 0.1 and 0.5
- Small datasets: stronger regularization (larger weight decay, higher dropout)
- Large datasets: weak or no regularization
Learning rate scheduling
- CNNs: OneCycleLR or Step Decay
- Transformers: Warmup + Cosine or Inverse Square Root
Mixed precision
- Always use AMP (1.5x–3x speedup, 2x memory savings)
- A100/H100 and newer: prefer BF16
- Older GPUs: use FP16 + Loss Scaling
Distributed training
- Multi-GPU single server: DDP + NCCL
- Billion-parameter models: FSDP
- Use gradient accumulation to reach the target effective batch size when per-GPU memory is limited