- 1. Why Multi-GPU Training Is Necessary
- 2. NCCL: The Core of GPU Communication
- 3. NVLink vs PCIe: GPU Interconnect Bandwidth Comparison
- 4. Distributed Training Paradigms: Data / Model / Pipeline Parallelism
- 5. PyTorch DDP (DistributedDataParallel) Official Documentation Analysis
- 6. PyTorch FSDP (Fully Sharded Data Parallel) Official Documentation Analysis
- 7. DeepSpeed ZeRO Stage 1/2/3 Comparison
- 8. HuggingFace Accelerate: Unified Distributed Training Interface
- 9. nvidia-smi Monitoring and GPU Utilization Optimization
- 10. Practical Troubleshooting: OOM, NCCL Timeout, Communication Bottlenecks
- 11. Summary: Which Strategy to Choose
- References
1. Why Multi-GPU Training Is Necessary
The number of parameters in recent large language models (LLMs) has been growing exponentially: from GPT-3 at 175B and PaLM at 540B to Llama 3 at 405B parameters. Training models of this size on a single GPU has become physically impossible.
Model Size and Memory Requirements
Calculating a model's memory usage makes the severity of the problem clear. With FP32 (32-bit floating point), a single parameter occupies 4 bytes. Therefore:
- 7B model: 7 x 10^9 x 4 bytes = approximately 28 GB (parameters only)
- 13B model: 13 x 10^9 x 4 bytes = approximately 52 GB
- 70B model: 70 x 10^9 x 4 bytes = approximately 280 GB
Adding optimizer states (2x the parameters for Adam), gradients (same size as parameters), and activation memory, the actual memory required during training is approximately 4-8x the parameter size. Training a 7B model in FP32 requires a minimum of 112-224 GB of GPU memory.
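The arithmetic above is easy to script. A minimal sketch that counts only weights, gradients, and Adam's two FP32 moments (activations, which are workload-dependent, are excluded):

```python
def training_memory_gb(n_params_billion):
    """Rough FP32 + Adam training footprint (illustrative; excludes activations)."""
    n = n_params_billion * 1e9
    weights = 4 * n   # FP32 parameters: 4 bytes each
    grads = 4 * n     # FP32 gradients: same size as parameters
    optim = 8 * n     # Adam first + second moments: 2 x 4 bytes
    return (weights + grads + optim) / 1e9  # decimal GB, matching the figures above

print(f"7B:  ~{training_memory_gb(7):.0f} GB before activations")   # ~112 GB
print(f"70B: ~{training_memory_gb(70):.0f} GB before activations")  # ~1120 GB
```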
The widely deployed NVIDIA A100 offers at most 80 GB of VRAM, and the H100 likewise tops out at 80 GB. Even full fine-tuning of a 7B model is therefore challenging on a single GPU. This is why multi-GPU distributed training has become a necessity rather than an option.
Training Time Reduction
Beyond memory concerns, reducing training time is another key motivation. By distributing training that would take weeks to months on a single GPU across multiple GPUs, near-linear speedup can be expected. Using 8 GPUs can theoretically reduce training time to 1/8, and in practice, achieving over 90% scaling efficiency is possible by minimizing communication overhead.
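This speedup can be estimated with a simple model (a rough sketch; real efficiency depends on the model, interconnect, and batch size):

```python
def training_days(single_gpu_days, n_gpus, scaling_efficiency):
    """Ideal wall-clock time under a fixed scaling efficiency (a rough model)."""
    return single_gpu_days / (n_gpus * scaling_efficiency)

# A job that would take 80 days on one GPU, on 8 GPUs at 90% efficiency:
print(f"{training_days(80, 8, 0.90):.1f} days")  # ~11.1 days
```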
2. NCCL: The Core of GPU Communication
The key to multi-GPU training is efficient communication between GPUs. NVIDIA provides NCCL (NVIDIA Collective Communications Library), a dedicated communication library for this purpose.
What is NCCL
NCCL is a library that optimizes collective communication primitives for multi-GPU and multi-node environments, specifically designed for NVIDIA GPUs and networking. As of NCCL 2.29.1 (current latest version), it supports the following communication operations:
- AllReduce: Sums (or averages) data from all GPUs and distributes the result to all GPUs. Used critically for gradient synchronization in DDP.
- AllGather: Collects data from each GPU and distributes the complete data to all GPUs. Used in FSDP to reconstruct parameters before the forward pass.
- ReduceScatter: Reduces data and distributes portions to each GPU. Used in FSDP's backward pass to distribute gradients.
- Broadcast: Sends data from one GPU to all GPUs. Used to replicate rank 0's weights to other GPUs during model initialization.
- Send/Recv: Point-to-point communication. Used for data transfer between stages in Pipeline Parallelism.
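The semantics of these collectives can be sketched in plain Python (illustrative only — real NCCL operates on GPU buffers with optimized ring/tree algorithms):

```python
def all_reduce(per_gpu):
    """Every rank ends up with the element-wise sum of all ranks' data."""
    total = [sum(vals) for vals in zip(*per_gpu)]
    return [list(total) for _ in per_gpu]

def all_gather(per_gpu):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in per_gpu for x in shard]
    return [list(full) for _ in per_gpu]

def reduce_scatter(per_gpu):
    """Element-wise sum, then each rank keeps only its own slice."""
    total = [sum(vals) for vals in zip(*per_gpu)]
    chunk = len(total) // len(per_gpu)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(per_gpu))]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # "gradients" on 2 GPUs
print(all_reduce(grads))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(grads))  # [[11, 22], [33, 44]]
```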
Key NCCL Environment Variables
Important environment variables for debugging NCCL issues or tuning performance in practice:
```bash
# Enable NCCL debug logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Specify communication interface (important in multi-node environments)
export NCCL_SOCKET_IFNAME=eth0

# Disable InfiniBand (if needed)
export NCCL_IB_DISABLE=1

# Set P2P (Peer-to-Peer) communication level
export NCCL_P2P_LEVEL=NVL  # Use NVLink
```
NCCL is used as the default backend in PyTorch's torch.distributed package. Calling init_process_group(backend="nccl") creates an NCCL-based communication group.
3. NVLink vs PCIe: GPU Interconnect Bandwidth Comparison
Multi-GPU training performance is heavily influenced by GPU interconnect bandwidth. GPU interconnect methods are broadly divided into PCIe and NVLink.
PCIe (Peripheral Component Interconnect Express)
PCIe is a general-purpose interface connecting various devices such as GPUs, SSDs, and NICs. Bandwidth by major generation:
| Generation | Unidirectional Bandwidth (x16) | Bidirectional Bandwidth (x16) |
|---|---|---|
| PCIe 4.0 | 32 GB/s | 64 GB/s |
| PCIe 5.0 | 64 GB/s | 128 GB/s |
| PCIe 6.0 | 128 GB/s | 256 GB/s |
In PCIe communication, data between GPUs must pass through the CPU/chipset, adding additional hops and latency.
NVLink
NVLink is NVIDIA's high-speed GPU-dedicated interconnect that provides direct GPU-to-GPU communication. Since data is exchanged directly between GPUs without going through the CPU, latency is greatly reduced and bandwidth is substantially increased.
| Generation | GPU | Bandwidth (Bidirectional) |
|---|---|---|
| NVLink 3rd Gen | A100 | 600 GB/s |
| NVLink 4th Gen | H100 | 900 GB/s |
| NVLink 5th Gen | B200 | 1,800 GB/s |
NVLink 4th gen (H100) provides approximately 7x the bandwidth of PCIe 5.0. This creates a decisive performance difference in communication-intensive tasks such as gradient synchronization and parameter AllGather for large models.
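A back-of-envelope estimate makes this concrete. Assuming an idealized ring AllReduce, where each GPU transfers 2(N−1)/N of the buffer, and ignoring latency and compute overlap:

```python
def ring_allreduce_seconds(buffer_bytes, n_gpus, bandwidth_gb_s):
    """Idealized ring AllReduce time: each GPU sends and receives
    2*(N-1)/N of the buffer; latency and overlap are ignored."""
    traffic = 2 * (n_gpus - 1) / n_gpus * buffer_bytes
    return traffic / (bandwidth_gb_s * 1e9)

grad_bytes = 7e9 * 2  # 7B parameters, FP16 gradients

# Approximate per-direction bandwidths: NVLink 4th gen ~450 GB/s, PCIe 5.0 x16 ~64 GB/s
print(f"NVLink:   {ring_allreduce_seconds(grad_bytes, 8, 450) * 1e3:.0f} ms per sync")
print(f"PCIe 5.0: {ring_allreduce_seconds(grad_bytes, 8, 64) * 1e3:.0f} ms per sync")
```

Repeated every optimizer step, a few hundred milliseconds of extra synchronization per step adds up quickly, which is why interconnect bandwidth dominates scaling behavior.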
NVSwitch
In DGX systems, NVSwitch connects all GPUs within a node at full-bisection bandwidth. For example, in a DGX H100 system, 8 H100 GPUs are connected via NVSwitch, guaranteeing 900 GB/s bandwidth between any GPU pair.
Practical Implications
Strategies may differ depending on the availability of NVLink during multi-GPU training:
- With NVLink: Communication overhead is small, so both DDP and FSDP operate efficiently.
- PCIe only: It's important to reduce communication frequency through gradient compression, gradient accumulation, etc.
4. Distributed Training Paradigms: Data / Model / Pipeline Parallelism
Multi-GPU training strategies are broadly classified into three paradigms.
Data Parallelism
The most basic and widely used approach. The same model is replicated on all GPUs, and training data is divided among GPUs so each processes different mini-batches. After forward/backward passes, gradients computed on each GPU are synchronized via AllReduce, and identical optimizer steps are performed.
```
GPU 0: Model Copy + Data Shard 0 → Gradient 0 ─┐
GPU 1: Model Copy + Data Shard 1 → Gradient 1 ─┤── AllReduce ──→ Averaged Gradient
GPU 2: Model Copy + Data Shard 2 → Gradient 2 ─┤
GPU 3: Model Copy + Data Shard 3 → Gradient 3 ─┘
```
Advantages: Simple implementation and near-linear scaling. Disadvantages: The entire model is replicated on every GPU, so the model must fit in a single GPU's memory.
Model Parallelism
The model's weights themselves are split across multiple GPUs. The most common form is Tensor Parallelism, in which the operations within a single layer (e.g., a large matrix multiplication) are divided among multiple GPUs. This approach is used prominently in Megatron-LM.
```
GPU 0: Upper half of layer's weight matrix ─┐
                                            ├── Combine → Layer Output
GPU 1: Lower half of layer's weight matrix ─┘
```
Advantages: Can handle layers larger than a single GPU's memory. Disadvantages: Very frequent inter-GPU communication, requiring high-speed interconnects at the NVLink level.
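One way the split can work, sketched with plain Python lists: each "GPU" holds half of the weight matrix's output columns and computes its slice of the output independently, after which the slices are concatenated. This is a toy illustration of the idea, not Megatron-LM's actual implementation:

```python
def matmul(A, B):
    """Plain-Python matrix multiply (rows of A times columns of B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# A full 4 -> 4 weight matrix, split column-wise across two "GPUs"
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
W0 = [row[:2] for row in W]  # "GPU 0": first half of output columns
W1 = [row[2:] for row in W]  # "GPU 1": second half of output columns

X = [[1, 0, 2, 1]]  # one input row

# Each "GPU" computes its slice independently; the slices are concatenated
Y0, Y1 = matmul(X, W0), matmul(X, W1)
combined = [Y0[0] + Y1[0]]
assert combined == matmul(X, W)  # identical to the unsplit computation
print(combined)  # [[32, 36, 40, 44]]
```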
Pipeline Parallelism
Model layers are sequentially distributed across multiple GPUs. Each GPU handles only some layers, and micro-batches flow through like a pipeline to minimize GPU idle time (bubbles).
```
GPU 0: Layers 1-8   │ Micro-batch 1 → │ Micro-batch 2 → │ ...
GPU 1: Layers 9-16  │                 │ Micro-batch 1 → │ Micro-batch 2 → │ ...
GPU 2: Layers 17-24 │                 │                 │ Micro-batch 1 → │ ...
GPU 3: Layers 25-32 │                 │                 │                 │ Micro-batch 1 → │
```
Advantages: Memory efficient as each GPU stores only a portion of the model. Disadvantages: GPU idle time due to pipeline bubbles.
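The bubble cost can be quantified. For a GPipe-style schedule with p stages and m micro-batches, the idle fraction is (p−1)/(m+p−1), so increasing the number of micro-batches shrinks the bubble:

```python
def bubble_fraction(n_stages, n_microbatches):
    """GPipe-style pipeline bubble fraction: (p - 1) / (m + p - 1)."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

# With 4 pipeline stages:
for m in (4, 16, 64):
    print(f"{m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```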
Combining in Practice
Modern large-scale training uses 3D Parallelism, combining all three approaches. For example, Megatron-DeepSpeed applies Tensor Parallelism within nodes (utilizing NVLink), Pipeline Parallelism across nodes, and Data Parallelism overall.
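A hypothetical layout for a 64-GPU cluster (8 GPUs per node) shows how the three parallelism degrees multiply to the world size (the numbers are illustrative, not a recommendation):

```python
# Hypothetical 3D-parallel decomposition of a 64-GPU cluster
world_size = 64
tensor_parallel = 8    # within a node, over NVLink
pipeline_parallel = 4  # across nodes

# The remaining factor is the data-parallel degree
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
assert tensor_parallel * pipeline_parallel * data_parallel == world_size

print(f"TP={tensor_parallel}, PP={pipeline_parallel}, DP={data_parallel}")  # DP=2
```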
5. PyTorch DDP (DistributedDataParallel) Official Documentation Analysis
PyTorch DDP is the official PyTorch module for implementing Data Parallelism. Unlike torch.nn.DataParallel, it operates on a multi-process basis, eliminating the Python GIL bottleneck, and supports multi-node environments.
DDP's Internal Operation
According to PyTorch official documentation, DDP operates as follows:
- Initialization: The model state from rank 0 is `broadcast` to all processes so they start with identical initial states.
- Forward Pass: Each process independently performs the forward pass on its data shard.
- Backward Pass: During the backward pass, gradients are synchronized via `AllReduce` in bucket units. Bucket size can be adjusted with the `bucket_cap_mb` parameter (default: 25 MB).
- Optimizer Step: All processes perform optimizer steps with the same synchronized gradients, ensuring model parameters remain identical.
The key is that gradient AllReduce overlaps with backward computation. Once all gradients in a bucket are ready, asynchronous AllReduce begins immediately, running concurrently with the remaining backward computation. This is the core mechanism enabling DDP's high efficiency.
DDP Code Example
```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Create model and wrap with DDP
    model = nn.Linear(1024, 512).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Dataset and DataLoader (DistributedSampler is required)
    dataset = MyDataset()  # your Dataset implementation
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = optim.Adam(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Call every epoch to ensure data shuffling
        for batch in dataloader:
            inputs, targets = batch
            inputs = inputs.to(rank)
            targets = targets.to(rank)

            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()  # Gradient AllReduce is performed automatically here
            optimizer.step()

    cleanup()

# Execution: torchrun --nproc_per_node=4 train_script.py
```
Gradient Accumulation and no_sync()
When using gradient accumulation, the `no_sync()` context manager prevents unnecessary AllReduce on intermediate steps:

```python
from contextlib import nullcontext

accumulation_steps = 4

for i, batch in enumerate(dataloader):
    inputs, targets = batch[0].to(rank), batch[1].to(rank)

    # Only synchronize gradients on the last accumulation step
    context = ddp_model.no_sync if (i + 1) % accumulation_steps != 0 else nullcontext
    with context():
        outputs = ddp_model(inputs)
        loss = loss_fn(outputs, targets) / accumulation_steps
        loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Execution via torchrun
DDP training scripts are launched with torchrun (the successor to the deprecated torch.distributed.launch):
```bash
# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node (2 nodes, 8 GPUs each)
# On node 0:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=29500 train.py

# On node 1:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr="192.168.1.1" --master_port=29500 train.py
```
6. PyTorch FSDP (Fully Sharded Data Parallel) Official Documentation Analysis
DDP replicates the entire model on every GPU, limiting it to models that fit in a single GPU's memory. FSDP overcomes this limitation by distributing (sharding) model parameters, gradients, and optimizer states across GPUs.
Core Principles of FSDP
FSDP was inspired by Microsoft's ZeRO (Zero Redundancy Optimizer) paper. The core idea is:
- Sharded State: Normally, each GPU holds only a portion (shard) of the parameters.
- Forward Pass: When a layer needs to be computed, all shards are gathered via `AllGather` to reconstruct the full parameters, computation is performed, and then the parameters are resharded.
- Backward Pass: Similarly, parameters are reconstructed via `AllGather` to compute gradients, then gradients are distributed via `ReduceScatter`.
- Optimizer Step: Each GPU performs the optimizer step only on the shard it is responsible for.
FSDP1 vs FSDP2
According to PyTorch official documentation, FSDP1 is deprecated and FSDP2 is recommended. Key differences:
| Aspect | FSDP1 | FSDP2 |
|---|---|---|
| Sharding Method | Flat-parameter sharding | Per-parameter DTensor-based dim-0 sharding |
| API | FullyShardedDataParallel wrapper | fully_shard() functional API |
| Memory Management | recordStream based | No recordStream, deterministic GPU memory |
| Prefetching | Limited | Both implicit/explicit prefetching supported |
FSDP2 uses torch.chunk(dim=0) to split each parameter along dim-0 by the number of data parallel workers.
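The chunking semantics can be mimicked in plain Python (illustrative only; FSDP2 actually produces DTensors on GPU, and uneven tails are handled with padding):

```python
import math

def shard_dim0(rows, world_size):
    """Mimics torch.chunk(param, world_size, dim=0): split the rows into
    world_size near-equal chunks; the last ranks may receive fewer rows."""
    chunk = math.ceil(len(rows) / world_size)
    return [rows[r * chunk:(r + 1) * chunk] for r in range(world_size)]

# A (10 x hidden) parameter sharded across 4 ranks
param_rows = [f"row{i}" for i in range(10)]
shards = shard_dim0(param_rows, 4)
print([len(s) for s in shards])  # [3, 3, 3, 1]

# AllGather concatenates the shards back into the full parameter
assert [r for s in shards for r in s] == param_rows
```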
FSDP2 Code Example
```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

def train_fsdp(rank, world_size):
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Create Transformer model
    model = Transformer(model_args).to(rank)

    # Mixed Precision configuration
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,  # Store parameters in bfloat16
        reduce_dtype=torch.float32,  # Perform gradient reduce in float32
    )

    # Apply FSDP to each Transformer layer (layer-level sharding)
    for layer in model.layers:
        fully_shard(layer, mp_policy=mp_policy)

    # Apply FSDP to the top-level model as well
    fully_shard(model, mp_policy=mp_policy)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(num_epochs):
        for batch in dataloader:
            inputs = batch["input_ids"].to(rank)
            labels = batch["labels"].to(rank)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()
```
FSDP Sharding Strategies
FSDP offers various sharding strategies:
- FULL_SHARD: Shards parameters, gradients, and optimizer states. Greatest memory savings but also greatest communication overhead. (Equivalent to ZeRO Stage 3)
- SHARD_GRAD_OP: Shards only gradients and optimizer states, retaining parameters after forward. Moderate memory savings with less communication. (Equivalent to ZeRO Stage 2)
- NO_SHARD: No sharding, operates the same as DDP. (Equivalent to ZeRO Stage 0)
7. DeepSpeed ZeRO Stage 1/2/3 Comparison
DeepSpeed is a distributed training library developed by Microsoft that enables memory-efficient training through ZeRO (Zero Redundancy Optimizer). According to DeepSpeed official documentation, ZeRO consists of three stages.
ZeRO Stage Comparison
ZeRO Stage 1: Optimizer State Partitioning
Only optimizer states are partitioned across GPUs. Since the Adam optimizer stores first moment (m) and second moment (v) in addition to parameters, partitioning these alone can save significant memory.
- Memory savings: With FP16 training and 8 GPUs, per-device memory consumption can be reduced by approximately 4x.
- Communication overhead: Same as DDP (AllReduce only)
- CPU Offloading: Supported
```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  }
}
```
ZeRO Stage 2: Gradient Partitioning
In addition to optimizer states, gradients are also partitioned. Each GPU only holds gradients corresponding to its assigned optimizer states.
- Memory savings: Additional savings from distributed gradient memory compared to Stage 1
- Communication overhead: Slightly more efficient using ReduceScatter instead of AllReduce
- CPU Offloading: Supported
```json
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  }
}
```
ZeRO Stage 3: Parameter Partitioning
In addition to optimizer states and gradients, model parameters are also partitioned. This enables training models larger than a single GPU's memory.
- Memory savings: Most dramatic. All model states are distributed.
- Communication overhead: Additional AllGather required during forward/backward, increasing communication volume
- CPU/NVMe Offloading: Supported (ZeRO-Infinity)
```json
{
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 1e7,
    "sub_group_size": 1e9,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```
ZeRO Stage Comparison Summary
| Aspect | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| Optimizer State Partition | O | O | O |
| Gradient Partition | X | O | O |
| Parameter Partition | X | X | O |
| CPU Offloading | O | O | O |
| NVMe Offloading | X | X | O |
| Communication Overhead | Low | Low | High |
| Memory Savings | Moderate | High | Very High |
| Training Speed | Fastest | Fast | Relatively Slower |
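These trade-offs follow from the ZeRO paper's memory model: roughly 16 bytes per parameter for mixed-precision Adam (2 for FP16 params, 2 for FP16 grads, 12 for the FP32 master weights plus the two moments). A sketch of the per-GPU model-state footprint under each stage (activations and buffers excluded):

```python
def zero_memory_gb(n_params_billion, n_gpus, stage):
    """Per-GPU model-state memory under ZeRO with mixed-precision Adam,
    following the ZeRO paper's 16-bytes-per-parameter model (illustrative)."""
    n = n_params_billion            # billions of params -> GB at 1 byte/param
    p, g, o = 2 * n, 2 * n, 12 * n  # FP16 params, FP16 grads, FP32 optimizer states
    if stage == 0:                  # DDP: nothing partitioned
        return p + g + o
    if stage == 1:                  # optimizer states partitioned
        return p + g + o / n_gpus
    if stage == 2:                  # + gradients partitioned
        return p + (g + o) / n_gpus
    if stage == 3:                  # + parameters partitioned
        return (p + g + o) / n_gpus
    raise ValueError(f"unknown stage: {stage}")

# 7B parameters across 8 GPUs:
for s in range(4):
    print(f"Stage {s}: {zero_memory_gb(7, 8, s):.1f} GB per GPU")
```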
Full DeepSpeed Configuration Example
```json
{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 50000
    }
  }
}
```
Practical Stage Selection Guide
- Model fits in a single GPU: ZeRO Stage 1 or DDP is fastest.
- Model fits in a single GPU but you want larger batch sizes: ZeRO Stage 2 saves gradient memory.
- Model doesn't fit in a single GPU: Use ZeRO Stage 3 or FSDP.
- Extremely limited GPU memory: Use ZeRO Stage 3 + CPU/NVMe Offloading.
8. HuggingFace Accelerate: Unified Distributed Training Interface
Accelerate is a library developed by HuggingFace that provides a unified interface for various distributed training strategies including DDP, FSDP, and DeepSpeed. Its key advantage is enabling distributed training with minimal modifications to existing PyTorch training code.
Core Concepts of Accelerate
Accelerate provides a thin wrapper on top of PyTorch, allowing you to apply distributed training while keeping your existing training loop virtually unchanged without learning a new framework. The entire API is centered around a single Accelerator class.
Basic Usage
```python
from accelerate import Accelerator

accelerator = Accelerator()

# This is the only change from existing code
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(batch["input_ids"])
    loss = loss_fn(outputs, batch["labels"])
    accelerator.backward(loss)  # Instead of loss.backward()
    optimizer.step()
    scheduler.step()
```
Setting Up Distributed Strategy with accelerate config
Running accelerate config interactively sets up the distributed training environment:
```bash
$ accelerate config
# Answering the prompts generates a configuration file:
# - Distributed training type (multi-GPU, multi-node, TPU, etc.)
# - Number of GPUs
# - Whether to use mixed precision
# - DeepSpeed / FSDP usage and settings
```
Example generated configuration file (default_config.yaml):
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 4
mixed_precision: bf16
use_cpu: false
```
Using with DeepSpeed
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
Using with FSDP
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_state_dict_type: SHARDED_STATE_DICT
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
Launching Training
```bash
accelerate launch train.py
```
Accelerate automatically applies the appropriate distributed strategy based on the configuration file. Since you don't need to deal with torchrun or deepspeed launchers directly, switching between distributed strategies during experiments is very convenient.
9. nvidia-smi Monitoring and GPU Utilization Optimization
Real-time GPU monitoring is essential for maximizing distributed training efficiency. nvidia-smi is a CLI utility included with the NVIDIA driver for querying GPU status in real time.
Basic Monitoring Commands
```bash
# Basic GPU status check
nvidia-smi

# Auto-refresh monitoring at 1-second intervals
watch -n 1 nvidia-smi

# Monitor specific GPUs only
nvidia-smi --id=0,1

# Per-process GPU usage monitoring
nvidia-smi pmon -i 0 -s um -d 1

# Continuous device monitoring (1-second interval)
nvidia-smi dmon -d 1
```
Key Monitoring Metrics
```bash
# Output key metrics in CSV format (easy for script parsing)
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw --format=csv,noheader,nounits

# Example output:
# 0, NVIDIA A100-SXM4-80GB, 45, 98, 72, 65536, 81920, 285
# 1, NVIDIA A100-SXM4-80GB, 43, 95, 68, 61440, 81920, 278
```
Key metric interpretation:
| Metric | Description | Ideal Value |
|---|---|---|
| GPU Utilization | GPU core utilization rate | 90% or higher |
| Memory Utilization | Memory bandwidth utilization | 60-80% |
| Memory Used | VRAM in use | 80-95% of total |
| Temperature | GPU temperature | Below 80°C |
| Power Draw | Power consumption | 80-100% of TDP |
Causes and Solutions for Low GPU Utilization
When GPU Utilization is low (less than 80%):
- DataLoader bottleneck: Increase `num_workers` to match the CPU core count and set `pin_memory=True`
- Batch size too small: GPU parallel compute units are not fully utilized; increase the batch size
- CPU preprocessing bottleneck: Perform data preprocessing on the GPU (e.g., NVIDIA DALI) or pre-process and cache the data
```python
dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Adjust to match CPU core count
    pin_memory=True,          # Improve GPU transfer speed
    prefetch_factor=2,        # Number of batches to preload
    persistent_workers=True,  # Keep workers across epochs
)
```
When Memory Utilization is low:
- Gradually increase batch size to maximize GPU memory utilization.
- Using mixed precision (FP16/BF16) allows approximately 2x the batch size with the same memory.
PyTorch Built-in Memory Monitoring
```python
import torch

# Check current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# Save a memory snapshot (for detailed analysis)
torch.cuda.memory._record_memory_history()
# ... execute training code ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```
10. Practical Troubleshooting: OOM, NCCL Timeout, Communication Bottlenecks
Here we compile the most frequently occurring issues in multi-GPU training and their solutions.
10.1 OOM (Out of Memory) Error
Symptoms: CUDA out of memory. Tried to allocate X MiB error
Step-by-step resolution:
```python
import os
import torch

# Step 1: Reduce batch size
batch_size = 16  # → Try decreasing to 8, 4, 2 incrementally

# Step 2: Use gradient accumulation to maintain the effective batch size
gradient_accumulation_steps = 4  # batch_size * 4 = effective batch size

# Step 3: Use mixed precision
# BF16 does not need loss scaling:
with torch.autocast("cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
loss.backward()

# FP16 requires a GradScaler to avoid gradient underflow:
scaler = torch.cuda.amp.GradScaler()
with torch.autocast("cuda", dtype=torch.float16):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Step 4: Enable gradient checkpointing
from torch.utils.checkpoint import checkpoint
# Or for HuggingFace models:
model.gradient_checkpointing_enable()

# Step 5: Optimize memory allocation settings
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```
Escalation path for memory shortages:
```
Reduce batch size → Mixed Precision → Gradient Checkpointing
→ ZeRO Stage 2 → ZeRO Stage 3 → CPU Offloading → NVMe Offloading
```
10.2 NCCL Timeout Error
Symptoms: Watchdog caught collective operation timeout or NCCL timeout error. Training hangs or freezes at certain points.
Common causes and solutions:
```bash
# Cause 1: Timeout value too short → increase the timeout
export NCCL_BLOCKING_WAIT=1
```

The process-group timeout can also be raised in Python (the default is 30 minutes):

```python
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

```bash
# Cause 2: GPU P2P communication issues → disable P2P
export NCCL_P2P_DISABLE=1

# Cause 3: Network interface selection error (multi-node)
# → specify the correct network interface
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0

# Cause 4: Insufficient shared memory in Docker environments
# → increase the SHM size when running the container:
# docker run --shm-size=16g ...

# Enable NCCL logs for debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```
When OOM occurs on only one GPU, causing other GPUs to wait:
This situation manifests as an NCCL timeout, but the actual cause is OOM. When one GPU crashes from OOM, the remaining GPUs timeout waiting for AllReduce. The OOM must be resolved first.
10.3 Communication Bottleneck Diagnosis and Resolution
Symptoms: GPU utilization is high, but training speed doesn't scale proportionally with the number of GPUs.
Diagnostic approach:
```python
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler('./log/profiler'),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(dataloader):
        if step >= 7:
            break
        train_step(model, batch)
        prof.step()
```
Solutions:
```python
# 1. Use gradient compression (DDP)
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

ddp_model.register_comm_hook(
    state=powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1),
    hook=powerSGD.powerSGD_hook,
)

# 2. Adjust bucket size (DDP)
ddp_model = DDP(model, device_ids=[rank], bucket_cap_mb=50)  # Default: 25 MB

# 3. Enable overlap_comm (DeepSpeed) — in ds_config.json:
# "zero_optimization": { "overlap_comm": true }
```
10.4 Common Mistakes and Solutions
| Problem | Cause | Solution |
|---|---|---|
| All GPUs use identical data | DistributedSampler not used | Apply DistributedSampler |
| Same data order every epoch | sampler.set_epoch(epoch) not called | Call at the start of every epoch |
| Conflicts when saving model | All ranks attempt to save | Save only on rank 0 with if rank == 0: |
| Accessing original model after DDP wrap | Mixing model and ddp_model | Access via ddp_model.module |
| find_unused_parameters warning | Some parameters unused in forward | Set DDP(..., find_unused_parameters=True) |
11. Summary: Which Strategy to Choose
The final decision tree for selecting a distributed training strategy:
```
Does the model fit in a single GPU?
├── YES → DDP (fastest and simplest)
│   ├── Want larger batch sizes → ZeRO Stage 1/2 or FSDP SHARD_GRAD_OP
│   └── Sufficient → DDP alone is enough
└── NO → Does the model fit in the total GPU memory of all GPUs?
    ├── YES → FSDP (FULL_SHARD) or ZeRO Stage 3
    │   ├── Prefer PyTorch native → FSDP
    │   └── Need more options/customization → DeepSpeed
    └── NO → ZeRO Stage 3 + CPU/NVMe Offloading
             or Pipeline Parallelism + Tensor Parallelism combination
```
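The tree can be encoded as a small helper. This is a hypothetical sketch — the function name and logic are ours, not an API from any of the libraries above, and `model_gb` should be the full training footprint (parameters, gradients, optimizer states), not the parameter size alone:

```python
def choose_strategy(model_gb, gpu_gb, n_gpus):
    """Hypothetical encoding of the decision tree above.
    model_gb: full training footprint of one replica, in GB."""
    if model_gb <= gpu_gb:
        return "DDP (optionally ZeRO Stage 1/2 or FSDP SHARD_GRAD_OP for larger batches)"
    if model_gb <= gpu_gb * n_gpus:
        return "FSDP FULL_SHARD or DeepSpeed ZeRO Stage 3"
    return "ZeRO Stage 3 + CPU/NVMe offloading, or pipeline + tensor parallelism"

print(choose_strategy(40, 80, 8))    # fits on one GPU → DDP
print(choose_strategy(400, 80, 8))   # fits across 8 GPUs → full sharding
print(choose_strategy(2000, 80, 8))  # exceeds cluster memory → offloading
```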
References
PyTorch Official Documentation
- DistributedDataParallel API Documentation - PyTorch 2.10
- Getting Started with Distributed Data Parallel - PyTorch Tutorials
- Distributed Data Parallel Internal Design - PyTorch 2.10
- What is Distributed Data Parallel (DDP) - PyTorch Tutorials
- Getting Started with FSDP2 - PyTorch Tutorials 2.11
- torch.distributed.fsdp.fully_shard - PyTorch 2.10
- FullyShardedDataParallel - PyTorch 2.10
- Introducing PyTorch FSDP API - PyTorch Blog
DeepSpeed Official Documentation
- Zero Redundancy Optimizer Tutorial - DeepSpeed
- ZeRO API Documentation - DeepSpeed 0.18.7
- DeepSpeed Configuration JSON - DeepSpeed
- Training Overview and Features - DeepSpeed
NVIDIA Official Documentation
- NVIDIA Collective Communications Library (NCCL)
- NCCL Documentation 2.29.1
- Overview of NCCL
- NVLink & NVSwitch - NVIDIA
- nvidia-smi Manual