- 1. Why Multi-GPU Training Is Necessary
- 2. NCCL: The Core of GPU Communication
- 3. NVLink vs PCIe: GPU Interconnect Bandwidth Comparison
- 4. Distributed Training Paradigms: Data / Model / Pipeline Parallelism
- 5. PyTorch DDP (DistributedDataParallel) Official Documentation Analysis
- 6. PyTorch FSDP (Fully Sharded Data Parallel) Official Documentation Analysis
- 7. DeepSpeed ZeRO Stage 1/2/3 Comparison
- 8. HuggingFace Accelerate: Unified Distributed Training Interface
- 9. nvidia-smi Monitoring and GPU Utilization Optimization
- 10. Practical Troubleshooting: OOM, NCCL Timeout, Communication Bottlenecks
- 11. Summary: Which Strategy to Choose
- References
1. Why Multi-GPU Training Is Necessary
The number of parameters in recent large language models (LLMs) has been growing exponentially: from GPT-3 at 175B and PaLM at 540B to Llama 3 at 405B parameters. Training models of this size on a single GPU has become physically impossible.
Model Size and Memory Requirements
Calculating a model's memory usage makes the severity of the problem clear. With FP32 (32-bit floating point), a single parameter occupies 4 bytes. Therefore:
- 7B model: 7 x 10^9 x 4 bytes = approximately 28 GB (parameters only)
- 13B model: 13 x 10^9 x 4 bytes = approximately 52 GB
- 70B model: 70 x 10^9 x 4 bytes = approximately 280 GB
Adding optimizer states (2x the parameters for Adam), gradients (same size as parameters), and activation memory, the actual memory required during training is approximately 4-8x the parameter size. Training a 7B model in FP32 requires a minimum of 112-224 GB of GPU memory.
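The arithmetic above is easy to script. A minimal sketch that counts only weights, gradients, and Adam's two FP32 moments (activations, which are workload-dependent, are excluded):

```python
def training_memory_gb(n_params_billion):
    """Rough FP32 + Adam training footprint (illustrative; excludes activations)."""
    n = n_params_billion * 1e9
    weights = 4 * n   # FP32 parameters: 4 bytes each
    grads = 4 * n     # FP32 gradients: same size as parameters
    optim = 8 * n     # Adam first + second moments: 2 x 4 bytes
    return (weights + grads + optim) / 1e9  # decimal GB, matching the figures above

print(f"7B:  ~{training_memory_gb(7):.0f} GB before activations")   # ~112 GB
print(f"70B: ~{training_memory_gb(70):.0f} GB before activations")  # ~1120 GB
```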
The widely deployed NVIDIA A100 offers at most 80 GB of VRAM, and the H100 likewise tops out at 80 GB. Even full fine-tuning of a 7B model is therefore challenging on a single GPU. This is why multi-GPU distributed training has become a necessity rather than an option.
Training Time Reduction
Beyond memory concerns, reducing training time is another key motivation. By distributing training that would take weeks to months on a single GPU across multiple GPUs, near-linear speedup can be expected. Using 8 GPUs can theoretically reduce training time to 1/8, and in practice, achieving over 90% scaling efficiency is possible by minimizing communication overhead.
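This speedup can be estimated with a simple model (a rough sketch; real efficiency depends on the model, interconnect, and batch size):

```python
def training_days(single_gpu_days, n_gpus, scaling_efficiency):
    """Ideal wall-clock time under a fixed scaling efficiency (a rough model)."""
    return single_gpu_days / (n_gpus * scaling_efficiency)

# A job that would take 80 days on one GPU, on 8 GPUs at 90% efficiency:
print(f"{training_days(80, 8, 0.90):.1f} days")  # ~11.1 days
```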
2. NCCL: The Core of GPU Communication
The key to multi-GPU training is efficient communication between GPUs. NVIDIA provides NCCL (NVIDIA Collective Communications Library), a dedicated communication library for this purpose.
What is NCCL
NCCL is a library that optimizes collective communication primitives for multi-GPU and multi-node environments, specifically designed for NVIDIA GPUs and networking. As of NCCL 2.29.1 (current latest version), it supports the following communication operations:
- AllReduce: Sums (or averages) data from all GPUs and distributes the result to all GPUs. Used critically for gradient synchronization in DDP.
- AllGather: Collects data from each GPU and distributes the complete data to all GPUs. Used in FSDP to reconstruct parameters before the forward pass.
- ReduceScatter: Reduces data and distributes portions to each GPU. Used in FSDP's backward pass to distribute gradients.
- Broadcast: Sends data from one GPU to all GPUs. Used to replicate rank 0's weights to other GPUs during model initialization.
- Send/Recv: Point-to-point communication. Used for data transfer between stages in Pipeline Parallelism.
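The semantics of these collectives can be sketched in plain Python (illustrative only — real NCCL operates on GPU buffers with optimized ring/tree algorithms):

```python
def all_reduce(per_gpu):
    """Every rank ends up with the element-wise sum of all ranks' data."""
    total = [sum(vals) for vals in zip(*per_gpu)]
    return [list(total) for _ in per_gpu]

def all_gather(per_gpu):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in per_gpu for x in shard]
    return [list(full) for _ in per_gpu]

def reduce_scatter(per_gpu):
    """Element-wise sum, then each rank keeps only its own slice."""
    total = [sum(vals) for vals in zip(*per_gpu)]
    chunk = len(total) // len(per_gpu)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(per_gpu))]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # "gradients" on 2 GPUs
print(all_reduce(grads))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(grads))  # [[11, 22], [33, 44]]
```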
Key NCCL Environment Variables
Important environment variables for debugging NCCL issues or tuning performance in practice:
```bash
# Enable NCCL debug logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Specify communication interface (important in multi-node environments)
export NCCL_SOCKET_IFNAME=eth0

# Disable InfiniBand (if needed)
export NCCL_IB_DISABLE=1

# Set P2P (Peer-to-Peer) communication level
export NCCL_P2P_LEVEL=NVL  # Use NVLink
```
NCCL is used as the default backend in PyTorch's torch.distributed package. Calling init_process_group(backend="nccl") creates an NCCL-based communication group.
3. NVLink vs PCIe: GPU Interconnect Bandwidth Comparison
Multi-GPU training performance is heavily influenced by GPU interconnect bandwidth. GPU interconnect methods are broadly divided into PCIe and NVLink.
PCIe (Peripheral Component Interconnect Express)
PCIe is a general-purpose interface connecting various devices such as GPUs, SSDs, and NICs. Bandwidth by major generation:
| Generation | Unidirectional Bandwidth (x16) | Bidirectional Bandwidth (x16) |
|---|---|---|
| PCIe 4.0 | 32 GB/s | 64 GB/s |
| PCIe 5.0 | 64 GB/s | 128 GB/s |
| PCIe 6.0 | 128 GB/s | 256 GB/s |
In PCIe communication, data between GPUs must pass through the CPU/chipset, adding additional hops and latency.
NVLink
NVLink is NVIDIA's high-speed GPU-dedicated interconnect that provides direct GPU-to-GPU communication. Since data is exchanged directly between GPUs without going through the CPU, latency is greatly reduced and bandwidth is substantially increased.
| Generation | GPU | Bandwidth (Bidirectional) |
|---|---|---|
| NVLink 3rd Gen | A100 | 600 GB/s |
| NVLink 4th Gen | H100 | 900 GB/s |
| NVLink 5th Gen | B200 | 1,800 GB/s |
NVLink 4th gen (H100) provides approximately 7x the bandwidth of PCIe 5.0. This creates a decisive performance difference in communication-intensive tasks such as gradient synchronization and parameter AllGather for large models.
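A back-of-envelope estimate makes this concrete. Assuming an idealized ring AllReduce, where each GPU transfers 2(N−1)/N of the buffer, and ignoring latency and compute overlap:

```python
def ring_allreduce_seconds(buffer_bytes, n_gpus, bandwidth_gb_s):
    """Idealized ring AllReduce time: each GPU sends and receives
    2*(N-1)/N of the buffer; latency and overlap are ignored."""
    traffic = 2 * (n_gpus - 1) / n_gpus * buffer_bytes
    return traffic / (bandwidth_gb_s * 1e9)

grad_bytes = 7e9 * 2  # 7B parameters, FP16 gradients

# Approximate per-direction bandwidths: NVLink 4th gen ~450 GB/s, PCIe 5.0 x16 ~64 GB/s
print(f"NVLink:   {ring_allreduce_seconds(grad_bytes, 8, 450) * 1e3:.0f} ms per sync")
print(f"PCIe 5.0: {ring_allreduce_seconds(grad_bytes, 8, 64) * 1e3:.0f} ms per sync")
```

Repeated every optimizer step, a few hundred milliseconds of extra synchronization per step adds up quickly, which is why interconnect bandwidth dominates scaling behavior.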
NVSwitch
In DGX systems, NVSwitch connects all GPUs within a node at full-bisection bandwidth. For example, in a DGX H100 system, 8 H100 GPUs are connected via NVSwitch, guaranteeing 900 GB/s bandwidth between any GPU pair.
Practical Implications
Strategies may differ depending on the availability of NVLink during multi-GPU training:
- With NVLink: Communication overhead is small, so both DDP and FSDP operate efficiently.
- PCIe only: It's important to reduce communication frequency through gradient compression, gradient accumulation, etc.
4. Distributed Training Paradigms: Data / Model / Pipeline Parallelism
Multi-GPU training strategies are broadly classified into three paradigms.
Data Parallelism
The most basic and widely used approach. The same model is replicated on all GPUs, and training data is divided among GPUs so each processes different mini-batches. After forward/backward passes, gradients computed on each GPU are synchronized via AllReduce, and identical optimizer steps are performed.
```
GPU 0: Model Copy + Data Shard 0 → Gradient 0 ─┐
GPU 1: Model Copy + Data Shard 1 → Gradient 1 ─┤── AllReduce ──→ Averaged Gradient
GPU 2: Model Copy + Data Shard 2 → Gradient 2 ─┤
GPU 3: Model Copy + Data Shard 3 → Gradient 3 ─┘
```
Advantages: Simple implementation and near-linear scaling. Disadvantages: The entire model is replicated on every GPU, so the model must fit in a single GPU's memory.
Model Parallelism
The model's weights themselves are split across multiple GPUs. The most common form is Tensor Parallelism, in which the operations within a single layer (e.g., a large matrix multiplication) are divided among multiple GPUs. This approach is used prominently in Megatron-LM.
```
GPU 0: Upper half of layer's weight matrix ─┐
                                            ├── Combine → Layer Output
GPU 1: Lower half of layer's weight matrix ─┘
```
Advantages: Can handle layers larger than a single GPU's memory. Disadvantages: Very frequent inter-GPU communication, requiring high-speed interconnects at the NVLink level.
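One way the split can work, sketched with plain Python lists: each "GPU" holds half of the weight matrix's output columns and computes its slice of the output independently, after which the slices are concatenated. This is a toy illustration of the idea, not Megatron-LM's actual implementation:

```python
def matmul(A, B):
    """Plain-Python matrix multiply (rows of A times columns of B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# A full 4 -> 4 weight matrix, split column-wise across two "GPUs"
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
W0 = [row[:2] for row in W]  # "GPU 0": first half of output columns
W1 = [row[2:] for row in W]  # "GPU 1": second half of output columns

X = [[1, 0, 2, 1]]  # one input row

# Each "GPU" computes its slice independently; the slices are concatenated
Y0, Y1 = matmul(X, W0), matmul(X, W1)
combined = [Y0[0] + Y1[0]]
assert combined == matmul(X, W)  # identical to the unsplit computation
print(combined)  # [[32, 36, 40, 44]]
```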
Pipeline Parallelism
Model layers are sequentially distributed across multiple GPUs. Each GPU handles only some layers, and micro-batches flow through like a pipeline to minimize GPU idle time (bubbles).
```
GPU 0: Layers 1-8   │ Micro-batch 1 → │ Micro-batch 2 → │ ...
GPU 1: Layers 9-16  │                 │ Micro-batch 1 → │ Micro-batch 2 → │ ...
GPU 2: Layers 17-24 │                 │                 │ Micro-batch 1 → │ ...
GPU 3: Layers 25-32 │                 │                 │                 │ Micro-batch 1 → │
```
Advantages: Memory efficient as each GPU stores only a portion of the model. Disadvantages: GPU idle time due to pipeline bubbles.
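The bubble cost can be quantified. For a GPipe-style schedule with p stages and m micro-batches, the idle fraction is (p−1)/(m+p−1), so increasing the number of micro-batches shrinks the bubble:

```python
def bubble_fraction(n_stages, n_microbatches):
    """GPipe-style pipeline bubble fraction: (p - 1) / (m + p - 1)."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

# With 4 pipeline stages:
for m in (4, 16, 64):
    print(f"{m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```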
Combining in Practice
Modern large-scale training uses 3D Parallelism, combining all three approaches. For example, Megatron-DeepSpeed applies Tensor Parallelism within nodes (utilizing NVLink), Pipeline Parallelism across nodes, and Data Parallelism overall.
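A hypothetical layout for a 64-GPU cluster (8 GPUs per node) shows how the three parallelism degrees multiply to the world size (the numbers are illustrative, not a recommendation):

```python
# Hypothetical 3D-parallel decomposition of a 64-GPU cluster
world_size = 64
tensor_parallel = 8    # within a node, over NVLink
pipeline_parallel = 4  # across nodes

# The remaining factor is the data-parallel degree
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
assert tensor_parallel * pipeline_parallel * data_parallel == world_size

print(f"TP={tensor_parallel}, PP={pipeline_parallel}, DP={data_parallel}")  # DP=2
```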
5. PyTorch DDP (DistributedDataParallel) Official Documentation Analysis
PyTorch DDP is the official PyTorch module for implementing Data Parallelism. Unlike torch.nn.DataParallel, it operates on a multi-process basis, eliminating the Python GIL bottleneck, and supports multi-node environments.
DDP's Internal Operation
According to PyTorch official documentation, DDP operates as follows:
- Initialization: The model state from rank 0 is `broadcast` to all processes so they start with identical initial states.
- Forward Pass: Each process independently performs the forward pass on its data shard.
- Backward Pass: During the backward pass, gradients are synchronized via `AllReduce` in bucket units. Bucket size can be adjusted with the `bucket_cap_mb` parameter (default: 25 MB).
- Optimizer Step: All processes perform optimizer steps with the same synchronized gradients, ensuring model parameters remain identical.
The key is that gradient AllReduce overlaps with backward computation. Once all gradients in a bucket are ready, asynchronous AllReduce begins immediately, running concurrently with the remaining backward computation. This is the core mechanism enabling DDP's high efficiency.
DDP Code Example
```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Create model and wrap with DDP
    model = nn.Linear(1024, 512).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Dataset and DataLoader (DistributedSampler is required)
    dataset = MyDataset()  # your Dataset implementation
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = optim.Adam(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Call every epoch to ensure data shuffling
        for batch in dataloader:
            inputs, targets = batch
            inputs = inputs.to(rank)
            targets = targets.to(rank)

            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()  # Gradient AllReduce is performed automatically here
            optimizer.step()

    cleanup()

# Execution: torchrun --nproc_per_node=4 train_script.py
```
Gradient Accumulation and no_sync()
When using gradient accumulation, the `no_sync()` context manager prevents unnecessary AllReduce on intermediate steps:

```python
from contextlib import nullcontext

accumulation_steps = 4

for i, batch in enumerate(dataloader):
    inputs, targets = batch[0].to(rank), batch[1].to(rank)

    # Only synchronize gradients on the last accumulation step
    context = ddp_model.no_sync if (i + 1) % accumulation_steps != 0 else nullcontext
    with context():
        outputs = ddp_model(inputs)
        loss = loss_fn(outputs, targets) / accumulation_steps
        loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Execution via torchrun
DDP training scripts are launched with torchrun (the successor to the deprecated torch.distributed.launch):
```bash
# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node (2 nodes, 8 GPUs each)
# On node 0:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=29500 train.py

# On node 1:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr="192.168.1.1" --master_port=29500 train.py
```
6. PyTorch FSDP (Fully Sharded Data Parallel) Official Documentation Analysis
DDP replicates the entire model on every GPU, limiting it to models that fit in a single GPU's memory. FSDP overcomes this limitation by distributing (sharding) model parameters, gradients, and optimizer states across GPUs.
Core Principles of FSDP
FSDP was inspired by Microsoft's ZeRO (Zero Redundancy Optimizer) paper. The core idea is:
- Sharded State: Normally, each GPU holds only a portion (shard) of the parameters.
- Forward Pass: When a layer needs to be computed, all shards are gathered via `AllGather` to reconstruct the full parameters, computation is performed, and then the parameters are resharded.
- Backward Pass: Similarly, parameters are reconstructed via `AllGather` to compute gradients, then gradients are distributed via `ReduceScatter`.
- Optimizer Step: Each GPU performs the optimizer step only on the shard it is responsible for.
FSDP1 vs FSDP2
According to PyTorch official documentation, FSDP1 is deprecated and FSDP2 is recommended. Key differences:
| Aspect | FSDP1 | FSDP2 |
|---|---|---|
| Sharding Method | Flat-parameter sharding | Per-parameter DTensor-based dim-0 sharding |
| API | FullyShardedDataParallel wrapper | fully_shard() functional API |
| Memory Management | recordStream based | No recordStream, deterministic GPU memory |
| Prefetching | Limited | Both implicit/explicit prefetching supported |
FSDP2 uses torch.chunk(dim=0) to split each parameter along dim-0 by the number of data parallel workers.
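The chunking semantics can be mimicked in plain Python (illustrative only; FSDP2 actually produces DTensors on GPU, and uneven tails are handled with padding):

```python
import math

def shard_dim0(rows, world_size):
    """Mimics torch.chunk(param, world_size, dim=0): split the rows into
    world_size near-equal chunks; the last ranks may receive fewer rows."""
    chunk = math.ceil(len(rows) / world_size)
    return [rows[r * chunk:(r + 1) * chunk] for r in range(world_size)]

# A (10 x hidden) parameter sharded across 4 ranks
param_rows = [f"row{i}" for i in range(10)]
shards = shard_dim0(param_rows, 4)
print([len(s) for s in shards])  # [3, 3, 3, 1]

# AllGather concatenates the shards back into the full parameter
assert [r for s in shards for r in s] == param_rows
```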
FSDP2 Code Example
```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

def train_fsdp(rank, world_size):
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Create Transformer model
    model = Transformer(model_args).to(rank)

    # Mixed Precision configuration
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,  # Store parameters in bfloat16
        reduce_dtype=torch.float32,  # Perform gradient reduce in float32
    )

    # Apply FSDP to each Transformer layer (layer-level sharding)
    for layer in model.layers:
        fully_shard(layer, mp_policy=mp_policy)

    # Apply FSDP to the top-level model as well
    fully_shard(model, mp_policy=mp_policy)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(num_epochs):
        for batch in dataloader:
            inputs = batch["input_ids"].to(rank)
            labels = batch["labels"].to(rank)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()
```
FSDP Sharding Strategies
FSDP offers various sharding strategies:
- FULL_SHARD: Shards parameters, gradients, and optimizer states. Greatest memory savings but also greatest communication overhead. (Equivalent to ZeRO Stage 3)
- SHARD_GRAD_OP: Shards only gradients and optimizer states, retaining parameters after forward. Moderate memory savings with less communication. (Equivalent to ZeRO Stage 2)
- NO_SHARD: No sharding, operates the same as DDP. (Equivalent to ZeRO Stage 0)
7. DeepSpeed ZeRO Stage 1/2/3 Comparison
DeepSpeed is a distributed training library developed by Microsoft that enables memory-efficient training through ZeRO (Zero Redundancy Optimizer). According to DeepSpeed official documentation, ZeRO consists of three stages.
ZeRO Stage Comparison
ZeRO Stage 1: Optimizer State Partitioning
Only optimizer states are partitioned across GPUs. Since the Adam optimizer stores first moment (m) and second moment (v) in addition to parameters, partitioning these alone can save significant memory.
- Memory savings: With FP16 training and 8 GPUs, per-device memory consumption can be reduced by approximately 4x.
- Communication overhead: Same as DDP (AllReduce only)
- CPU Offloading: Supported
```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  }
}
```
ZeRO Stage 2: Gradient Partitioning
In addition to optimizer states, gradients are also partitioned. Each GPU only holds gradients corresponding to its assigned optimizer states.
- Memory savings: Additional savings from distributed gradient memory compared to Stage 1
- Communication overhead: Slightly more efficient using ReduceScatter instead of AllReduce
- CPU Offloading: Supported
```json
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  }
}
```
ZeRO Stage 3: Parameter Partitioning
In addition to optimizer states and gradients, model parameters are also partitioned. This enables training models larger than a single GPU's memory.
- Memory savings: Most dramatic. All model states are distributed.
- Communication overhead: Additional AllGather required during forward/backward, increasing communication volume
- CPU/NVMe Offloading: Supported (ZeRO-Infinity)
```json
{
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 1e7,
    "sub_group_size": 1e9,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```
ZeRO Stage Comparison Summary
| Aspect | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| Optimizer State Partition | O | O | O |
| Gradient Partition | X | O | O |
| Parameter Partition | X | X | O |
| CPU Offloading | O | O | O |
| NVMe Offloading | X | X | O |
| Communication Overhead | Low | Low | High |
| Memory Savings | Moderate | High | Very High |
| Training Speed | Fastest | Fast | Relatively Slower |
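These trade-offs follow from the ZeRO paper's memory model: roughly 16 bytes per parameter for mixed-precision Adam (2 for FP16 params, 2 for FP16 grads, 12 for the FP32 master weights plus the two moments). A sketch of the per-GPU model-state footprint under each stage (activations and buffers excluded):

```python
def zero_memory_gb(n_params_billion, n_gpus, stage):
    """Per-GPU model-state memory under ZeRO with mixed-precision Adam,
    following the ZeRO paper's 16-bytes-per-parameter model (illustrative)."""
    n = n_params_billion            # billions of params -> GB at 1 byte/param
    p, g, o = 2 * n, 2 * n, 12 * n  # FP16 params, FP16 grads, FP32 optimizer states
    if stage == 0:                  # DDP: nothing partitioned
        return p + g + o
    if stage == 1:                  # optimizer states partitioned
        return p + g + o / n_gpus
    if stage == 2:                  # + gradients partitioned
        return p + (g + o) / n_gpus
    if stage == 3:                  # + parameters partitioned
        return (p + g + o) / n_gpus
    raise ValueError(f"unknown stage: {stage}")

# 7B parameters across 8 GPUs:
for s in range(4):
    print(f"Stage {s}: {zero_memory_gb(7, 8, s):.1f} GB per GPU")
```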
Full DeepSpeed Configuration Example
```json
{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 50000
    }
  }
}
```
Practical Stage Selection Guide
- Model fits in a single GPU: ZeRO Stage 1 or DDP is fastest.
- Model fits in a single GPU but you want larger batch sizes: ZeRO Stage 2 saves gradient memory.
- Model doesn't fit in a single GPU: Use ZeRO Stage 3 or FSDP.
- Extremely limited GPU memory: Use ZeRO Stage 3 + CPU/NVMe Offloading.
8. HuggingFace Accelerate: Unified Distributed Training Interface
Accelerate is a library developed by HuggingFace that provides a unified interface for various distributed training strategies including DDP, FSDP, and DeepSpeed. Its key advantage is enabling distributed training with minimal modifications to existing PyTorch training code.
Core Concepts of Accelerate
Accelerate provides a thin wrapper on top of PyTorch, allowing you to apply distributed training while keeping your existing training loop virtually unchanged without learning a new framework. The entire API is centered around a single Accelerator class.
Basic Usage
```python
from accelerate import Accelerator

accelerator = Accelerator()

# This is the only change from existing code
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(batch["input_ids"])
    loss = loss_fn(outputs, batch["labels"])
    accelerator.backward(loss)  # Instead of loss.backward()
    optimizer.step()
    scheduler.step()
```
Setting Up Distributed Strategy with accelerate config
Running accelerate config interactively sets up the distributed training environment:
```bash
$ accelerate config
# Answering the prompts generates a configuration file:
# - Distributed training type (multi-GPU, multi-node, TPU, etc.)
# - Number of GPUs
# - Whether to use mixed precision
# - DeepSpeed / FSDP usage and settings
```
Example generated configuration file (default_config.yaml):
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 4
mixed_precision: bf16
use_cpu: false
```
Using with DeepSpeed
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
Using with FSDP
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_state_dict_type: SHARDED_STATE_DICT
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
Launching Training
```bash
accelerate launch train.py
```
Accelerate automatically applies the appropriate distributed strategy based on the configuration file. Since you don't need to deal with torchrun or deepspeed launchers directly, switching between distributed strategies during experiments is very convenient.
9. nvidia-smi Monitoring and GPU Utilization Optimization
Real-time GPU monitoring is essential for maximizing distributed training efficiency. nvidia-smi is a CLI utility included with the NVIDIA driver for querying GPU status in real time.
Basic Monitoring Commands
```bash
# Basic GPU status check
nvidia-smi

# Auto-refresh monitoring at 1-second intervals
watch -n 1 nvidia-smi

# Monitor specific GPUs only
nvidia-smi --id=0,1

# Per-process GPU usage monitoring
nvidia-smi pmon -i 0 -s um -d 1

# Continuous device monitoring (1-second interval)
nvidia-smi dmon -d 1
```
Key Monitoring Metrics
```bash
# Output key metrics in CSV format (easy for script parsing)
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw --format=csv,noheader,nounits

# Example output:
# 0, NVIDIA A100-SXM4-80GB, 45, 98, 72, 65536, 81920, 285
# 1, NVIDIA A100-SXM4-80GB, 43, 95, 68, 61440, 81920, 278
```
Key metric interpretation:
| Metric | Description | Ideal Value |
|---|---|---|
| GPU Utilization | GPU core utilization rate | 90% or higher |
| Memory Utilization | Memory bandwidth utilization | 60-80% |
| Memory Used | VRAM in use | 80-95% of total |
| Temperature | GPU temperature | Below 80°C |
| Power Draw | Power consumption | 80-100% of TDP |
Causes and Solutions for Low GPU Utilization
When GPU Utilization is low (less than 80%):
- DataLoader bottleneck: Increase `num_workers` to match the CPU core count and set `pin_memory=True`
- Batch size too small: GPU parallel compute units are not fully utilized; increase the batch size
- CPU preprocessing bottleneck: Perform data preprocessing on the GPU (e.g., NVIDIA DALI) or pre-process and cache the data
```python
dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Adjust to match CPU core count
    pin_memory=True,          # Improve GPU transfer speed
    prefetch_factor=2,        # Number of batches to preload
    persistent_workers=True,  # Keep workers across epochs
)
```
When Memory Utilization is low:
- Gradually increase batch size to maximize GPU memory utilization.
- Using mixed precision (FP16/BF16) allows approximately 2x the batch size with the same memory.
PyTorch Built-in Memory Monitoring
```python
import torch

# Check current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# Save a memory snapshot (for detailed analysis)
torch.cuda.memory._record_memory_history()
# ... execute training code ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```
10. Practical Troubleshooting: OOM, NCCL Timeout, Communication Bottlenecks
Here we compile the most frequently occurring issues in multi-GPU training and their solutions.
10.1 OOM (Out of Memory) Error
Symptoms: CUDA out of memory. Tried to allocate X MiB error
Step-by-step resolution:
```python
import os
import torch

# Step 1: Reduce batch size
batch_size = 16  # → Try decreasing to 8, 4, 2 incrementally

# Step 2: Use gradient accumulation to maintain the effective batch size
gradient_accumulation_steps = 4  # batch_size * 4 = effective batch size

# Step 3: Use mixed precision
# BF16 does not need loss scaling:
with torch.autocast("cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
loss.backward()

# FP16 requires a GradScaler to avoid gradient underflow:
scaler = torch.cuda.amp.GradScaler()
with torch.autocast("cuda", dtype=torch.float16):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Step 4: Enable gradient checkpointing
from torch.utils.checkpoint import checkpoint
# Or for HuggingFace models:
model.gradient_checkpointing_enable()

# Step 5: Optimize memory allocation settings
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```
Escalation path for memory shortages:
```
Reduce batch size → Mixed Precision → Gradient Checkpointing
→ ZeRO Stage 2 → ZeRO Stage 3 → CPU Offloading → NVMe Offloading
```
10.2 NCCL Timeout Error
Symptoms: Watchdog caught collective operation timeout or NCCL timeout error. Training hangs or freezes at certain points.
Common causes and solutions:
```bash
# Cause 1: Timeout value too short → increase the timeout
export NCCL_BLOCKING_WAIT=1
```

The process-group timeout can also be raised in Python (the default is 30 minutes):

```python
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

```bash
# Cause 2: GPU P2P communication issues → disable P2P
export NCCL_P2P_DISABLE=1

# Cause 3: Network interface selection error (multi-node)
# → specify the correct network interface
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0

# Cause 4: Insufficient shared memory in Docker environments
# → increase the SHM size when running the container:
# docker run --shm-size=16g ...

# Enable NCCL logs for debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```
When OOM occurs on only one GPU, causing other GPUs to wait:
This situation manifests as an NCCL timeout, but the actual cause is OOM. When one GPU crashes from OOM, the remaining GPUs timeout waiting for AllReduce. The OOM must be resolved first.
10.3 Communication Bottleneck Diagnosis and Resolution
Symptoms: GPU utilization is high, but training speed doesn't scale proportionally with the number of GPUs.
Diagnostic approach:
```python
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler('./log/profiler'),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(dataloader):
        if step >= 7:
            break
        train_step(model, batch)
        prof.step()
```
Solutions:
```python
# 1. Use gradient compression (DDP)
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

ddp_model.register_comm_hook(
    state=powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1),
    hook=powerSGD.powerSGD_hook,
)

# 2. Adjust bucket size (DDP)
ddp_model = DDP(model, device_ids=[rank], bucket_cap_mb=50)  # Default: 25 MB

# 3. Enable overlap_comm (DeepSpeed) — in ds_config.json:
# "zero_optimization": { "overlap_comm": true }
```
10.4 Common Mistakes and Solutions
| Problem | Cause | Solution |
|---|---|---|
| All GPUs use identical data | DistributedSampler not used | Apply DistributedSampler |
| Same data order every epoch | sampler.set_epoch(epoch) not called | Call at the start of every epoch |
| Conflicts when saving model | All ranks attempt to save | Save only on rank 0 with if rank == 0: |
| Accessing original model after DDP wrap | Mixing model and ddp_model | Access via ddp_model.module |
| find_unused_parameters warning | Some parameters unused in forward | Set DDP(..., find_unused_parameters=True) |
11. Summary: Which Strategy to Choose
The final decision tree for selecting a distributed training strategy:
```
Does the model fit in a single GPU?
├── YES → DDP (fastest and simplest)
│   ├── Want larger batch sizes → ZeRO Stage 1/2 or FSDP SHARD_GRAD_OP
│   └── Sufficient → DDP alone is enough
└── NO → Does the model fit in the total GPU memory of all GPUs?
    ├── YES → FSDP (FULL_SHARD) or ZeRO Stage 3
    │   ├── Prefer PyTorch native → FSDP
    │   └── Need more options/customization → DeepSpeed
    └── NO → ZeRO Stage 3 + CPU/NVMe Offloading
             or Pipeline Parallelism + Tensor Parallelism combination
```
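The tree can be encoded as a small helper. This is a hypothetical sketch — the function name and logic are ours, not an API from any of the libraries above, and `model_gb` should be the full training footprint (parameters, gradients, optimizer states), not the parameter size alone:

```python
def choose_strategy(model_gb, gpu_gb, n_gpus):
    """Hypothetical encoding of the decision tree above.
    model_gb: full training footprint of one replica, in GB."""
    if model_gb <= gpu_gb:
        return "DDP (optionally ZeRO Stage 1/2 or FSDP SHARD_GRAD_OP for larger batches)"
    if model_gb <= gpu_gb * n_gpus:
        return "FSDP FULL_SHARD or DeepSpeed ZeRO Stage 3"
    return "ZeRO Stage 3 + CPU/NVMe offloading, or pipeline + tensor parallelism"

print(choose_strategy(40, 80, 8))    # fits on one GPU → DDP
print(choose_strategy(400, 80, 8))   # fits across 8 GPUs → full sharding
print(choose_strategy(2000, 80, 8))  # exceeds cluster memory → offloading
```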
References
PyTorch Official Documentation
- DistributedDataParallel API Documentation - PyTorch 2.10
- Getting Started with Distributed Data Parallel - PyTorch Tutorials
- Distributed Data Parallel Internal Design - PyTorch 2.10
- What is Distributed Data Parallel (DDP) - PyTorch Tutorials
- Getting Started with FSDP2 - PyTorch Tutorials 2.11
- torch.distributed.fsdp.fully_shard - PyTorch 2.10
- FullyShardedDataParallel - PyTorch 2.10
- Introducing PyTorch FSDP API - PyTorch Blog
DeepSpeed Official Documentation
- Zero Redundancy Optimizer Tutorial - DeepSpeed
- ZeRO API Documentation - DeepSpeed 0.18.7
- DeepSpeed Configuration JSON - DeepSpeed
- Training Overview and Features - DeepSpeed
NVIDIA Official Documentation
- NVIDIA Collective Communications Library (NCCL)
- NCCL Documentation 2.29.1
- Overview of NCCL
- NVLink & NVSwitch - NVIDIA
- nvidia-smi Manual