Split View: Multi-GPU 분산 학습 완전 가이드: DDP, FSDP, DeepSpeed

Multi-GPU 분산 학습 완전 가이드: DDP, FSDP, DeepSpeed

1. 왜 Multi-GPU 학습이 필요한가
- 모델 크기와 메모리 요구량
- 학습 시간 단축
2. NCCL: GPU 간 통신의 핵심
- NCCL이란
- NCCL의 핵심 환경 변수
3. NVLink vs PCIe: GPU 간 통신 대역폭 비교
4. 분산 학습 패러다임: Data / Model / Pipeline Parallelism
5. PyTorch DDP (DistributedDataParallel) 공식 문서 분석
6. PyTorch FSDP (Fully Sharded Data Parallel) 공식 문서 분석
7. DeepSpeed ZeRO Stage 1/2/3 비교
8. HuggingFace Accelerate: 통합 분산 학습 인터페이스
9. nvidia-smi 모니터링 및 GPU 활용률 최적화
10. 실전 트러블슈팅: OOM, NCCL Timeout, 통신 병목
11. 정리: 어떤 전략을 선택할 것인가
References

1. 왜 Multi-GPU 학습이 필요한가

최근 대규모 언어 모델(LLM)의 파라미터 수는 기하급수적으로 증가하고 있다. GPT-3는 175B, PaLM은 540B, 그리고 Llama 3 405B에 이르기까지, 단일 GPU로 학습하는 것은 물리적으로 불가능한 시대가 되었다.

모델 크기와 메모리 요구량

모델의 메모리 사용량을 계산해 보면 문제의 심각성이 명확해진다. FP32(32-bit floating point) 기준으로 파라미터 1개는 4 bytes를 차지한다. 따라서:

7B 모델: 7 x 10^9 x 4 bytes = 약 28 GB (파라미터만)
13B 모델: 13 x 10^9 x 4 bytes = 약 52 GB
70B 모델: 70 x 10^9 x 4 bytes = 약 280 GB

여기에 optimizer state(Adam 기준 파라미터의 2배), gradient(파라미터와 동일 크기), activation memory까지 더하면 실제 학습 시 필요한 메모리는 파라미터 크기의 약 4~8배에 달한다. 7B 모델을 FP32로 학습하려면 최소 112~224 GB의 GPU 메모리가 필요한 셈이다.

현재 가장 많이 사용되는 NVIDIA A100의 VRAM은 80 GB, H100은 80 GB이다. 단일 GPU로는 7B 모델의 full fine-tuning조차 버겁다. 이것이 Multi-GPU 분산 학습이 선택이 아닌 필수가 된 이유다.

학습 시간 단축

메모리 문제 외에도 학습 시간 단축은 또 다른 핵심 동기다. 단일 GPU로 수주~수개월이 걸리는 학습을 여러 GPU에 분산하면 거의 선형에 가까운 속도 향상을 기대할 수 있다. 8개의 GPU를 사용하면 이론적으로 학습 시간을 1/8로 줄일 수 있으며, 실제로도 통신 오버헤드를 최소화하면 90% 이상의 scaling efficiency를 달성할 수 있다.

2. NCCL: GPU 간 통신의 핵심

Multi-GPU 학습의 핵심은 GPU 간 효율적인 통신이다. NVIDIA는 이를 위해 **NCCL(NVIDIA Collective Communications Library)**이라는 전용 통신 라이브러리를 제공한다.

NCCL이란

NCCL은 Multi-GPU 및 Multi-Node 환경에서 collective communication primitives를 최적화한 라이브러리로, NVIDIA GPU와 네트워킹에 특화되어 있다. NCCL 2.29.1(현재 최신 버전) 기준으로 다음과 같은 통신 연산을 지원한다:

AllReduce: 모든 GPU의 데이터를 합산(또는 평균)하여 모든 GPU에 배포. DDP에서 gradient 동기화에 핵심적으로 사용된다.
AllGather: 각 GPU의 데이터를 모아서 모든 GPU에 전체 데이터를 배포. FSDP에서 forward pass 전 파라미터를 복원할 때 사용된다.
ReduceScatter: 데이터를 reduce한 뒤 각 GPU에 나누어 배포. FSDP의 backward pass에서 gradient를 분산할 때 사용된다.
Broadcast: 한 GPU의 데이터를 모든 GPU에 전송. 모델 초기화 시 rank 0의 가중치를 다른 GPU에 복제할 때 사용된다.
Send/Recv: Point-to-point 통신. Pipeline Parallelism에서 스테이지 간 데이터 전달에 사용된다.

NCCL의 핵심 환경 변수

실전에서 NCCL 관련 이슈를 디버깅하거나 성능을 튜닝할 때 알아야 하는 주요 환경 변수는 다음과 같다:

# NCCL 디버깅 로그 활성화
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# 통신 인터페이스 지정 (Multi-Node 환경에서 중요)
export NCCL_SOCKET_IFNAME=eth0

# InfiniBand 사용 비활성화 (필요시)
export NCCL_IB_DISABLE=1

# P2P(Peer-to-Peer) 통신 레벨 설정
export NCCL_P2P_LEVEL=NVL  # NVLink 사용

NCCL은 PyTorch의 torch.distributed 패키지에서 기본 backend로 사용된다. init_process_group(backend="nccl")을 호출하면 NCCL 기반의 통신 그룹이 생성된다.

3. NVLink vs PCIe: GPU 간 통신 대역폭 비교

Multi-GPU 학습의 성능은 GPU 간 통신 대역폭에 크게 좌우된다. GPU 간 연결 방식은 크게 PCIe와 NVLink 두 가지로 나뉜다.

PCIe (Peripheral Component Interconnect Express)

PCIe는 범용 인터페이스로 GPU, SSD, NIC 등 다양한 장치를 연결한다. 현재 주요 세대별 대역폭은 다음과 같다:

세대	단방향 대역폭 (x16)	양방향 대역폭 (x16)
PCIe 4.0	32 GB/s	64 GB/s
PCIe 5.0	64 GB/s	128 GB/s
PCIe 6.0	128 GB/s	256 GB/s

PCIe 통신에서는 GPU 간 데이터가 CPU/chipset을 경유해야 하므로, 추가적인 hop과 latency가 발생한다.

NVLink

NVLink는 NVIDIA의 고속 GPU 전용 인터커넥트로, GPU 간 직접 통신을 제공한다. CPU를 거치지 않고 GPU끼리 직접 데이터를 주고받으므로, latency가 크게 줄어들고 bandwidth가 대폭 향상된다.

세대	GPU	대역폭 (양방향)
NVLink 3세대	A100	600 GB/s
NVLink 4세대	H100	900 GB/s
NVLink 5세대	B200	1,800 GB/s

NVLink 4세대(H100)는 PCIe 5.0 대비 약 7배의 대역폭을 제공한다. 이는 대규모 모델의 gradient 동기화, 파라미터 AllGather 등 통신 집약적인 작업에서 결정적인 성능 차이를 만든다.

NVSwitch

DGX 시스템에서는 NVSwitch를 통해 노드 내 모든 GPU가 full-bisection bandwidth로 상호 연결된다. 예를 들어 DGX H100 시스템에서는 8개의 H100 GPU가 NVSwitch로 연결되어, 어떤 GPU 쌍 간에도 900 GB/s의 대역폭을 보장한다.

실전 시사점

Multi-GPU 학습 시 NVLink 유무에 따라 전략이 달라질 수 있다:

NVLink가 있는 경우: 통신 오버헤드가 작으므로 DDP, FSDP 모두 효율적으로 동작한다.
PCIe만 있는 경우: gradient compression, gradient accumulation 등으로 통신 빈도를 줄이는 것이 중요하다.

4. 분산 학습 패러다임: Data / Model / Pipeline Parallelism

Multi-GPU 학습 전략은 크게 세 가지 패러다임으로 분류된다.

Data Parallelism (데이터 병렬화)

가장 기본적이고 널리 사용되는 방식이다. 동일한 모델을 모든 GPU에 복제하고, 학습 데이터를 GPU 수만큼 나누어 각 GPU가 서로 다른 mini-batch를 처리한다. Forward/backward pass 이후 각 GPU에서 계산된 gradient를 AllReduce로 동기화하고, 동일한 optimizer step을 수행한다.

GPU 0: Model Copy + Data Shard 0 → Gradient 0 ─┐
GPU 1: Model Copy + Data Shard 1 → Gradient 1 ─┤── AllReduce ──→ Averaged Gradient
GPU 2: Model Copy + Data Shard 2 → Gradient 2 ─┤
GPU 3: Model Copy + Data Shard 3 → Gradient 3 ─┘

장점: 구현이 간단하고, 선형에 가까운 scaling이 가능하다. 단점: 모든 GPU에 전체 모델이 복제되므로, 단일 GPU 메모리에 모델이 들어가야 한다.

Model Parallelism (모델 병렬화)

모델의 레이어를 여러 GPU에 나누어 배치한다. Tensor Parallelism이라고도 하며, 하나의 레이어 내부의 연산(예: 행렬 곱셈)을 여러 GPU가 분담한다. Megatron-LM에서 주로 사용하는 방식이다.

GPU 0: Layer의 Weight 행렬 상반부  ─┐
                                     ├── 결합 → Layer Output
GPU 1: Layer의 Weight 행렬 하반부  ─┘

장점: 단일 GPU 메모리보다 큰 레이어도 처리 가능하다. 단점: GPU 간 통신이 매우 빈번하게 발생하므로, NVLink 수준의 고속 인터커넥트가 필수적이다.

Pipeline Parallelism (파이프라인 병렬화)

모델의 레이어를 순서대로 여러 GPU에 분배한다. 각 GPU가 일부 레이어만 담당하고, micro-batch를 파이프라인처럼 흘려보내어 GPU idle time(bubble)을 최소화한다.

GPU 0: Layers 1-8   │ Micro-batch 1 → │ Micro-batch 2 → │ ...
GPU 1: Layers 9-16  │                  │ Micro-batch 1 → │ Micro-batch 2 → │ ...
GPU 2: Layers 17-24 │                  │                  │ Micro-batch 1 → │ ...
GPU 3: Layers 25-32 │                  │                  │                  │ Micro-batch 1 → │

장점: 각 GPU가 전체 모델의 일부만 저장하므로 메모리 효율적이다. 단점: Pipeline bubble로 인한 GPU 유휴 시간이 발생한다.

실전에서의 조합

현대의 대규모 학습에서는 이 세 가지를 조합하는 3D Parallelism을 사용한다. 예를 들어, Megatron-DeepSpeed에서는 노드 내부에서 Tensor Parallelism(NVLink 활용), 노드 간에 Pipeline Parallelism, 그리고 전체적으로 Data Parallelism을 적용한다.

5. PyTorch DDP (DistributedDataParallel) 공식 문서 분석

PyTorch DDP는 Data Parallelism을 구현하는 PyTorch의 공식 모듈이다. torch.nn.DataParallel과 달리 multi-process 기반으로 동작하여 Python GIL 병목이 없으며, multi-node 환경도 지원한다.

DDP의 내부 동작 원리

PyTorch 공식 문서에 따르면, DDP의 동작 과정은 다음과 같다:

초기화 단계: Rank 0 프로세스의 모델 state를 broadcast하여 모든 프로세스가 동일한 초기 상태에서 시작한다.
Forward Pass: 각 프로세스가 자신의 data shard에 대해 독립적으로 forward pass를 수행한다.
Backward Pass: Backward pass 중에 gradient를 bucket 단위로 AllReduce하여 동기화한다. Bucket 크기는 bucket_cap_mb 파라미터로 조정할 수 있다(기본값: 25 MB).
Optimizer Step: 모든 프로세스가 동기화된 동일한 gradient로 optimizer step을 수행하므로, 모델 파라미터가 항상 동일하게 유지된다.

핵심은 gradient AllReduce가 backward computation과 overlap된다는 점이다. 하나의 bucket에 속한 모든 gradient가 준비되면 즉시 비동기 AllReduce를 시작하여, 나머지 backward 연산과 동시에 통신이 진행된다. 이것이 DDP의 높은 효율성을 가능하게 하는 핵심 메커니즘이다.

DDP 코드 예시

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # 모델 생성 및 DDP 래핑
    model = nn.Linear(1024, 512).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Dataset 및 DataLoader (DistributedSampler 사용 필수)
    dataset = MyDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = optim.Adam(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # 매 epoch마다 호출하여 데이터 셔플 보장
        for batch in dataloader:
            inputs, targets = batch
            inputs = inputs.to(rank)
            targets = targets.to(rank)

            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()  # 여기서 자동으로 gradient AllReduce 수행
            optimizer.step()

    cleanup()

# 실행: torchrun --nproc_per_node=4 train_script.py

Gradient Accumulation과 no_sync()

Gradient accumulation을 사용할 때는 no_sync() context manager로 불필요한 AllReduce를 방지할 수 있다:

accumulation_steps = 4

for i, batch in enumerate(dataloader):
    inputs, targets = batch[0].to(rank), batch[1].to(rank)

    # 마지막 accumulation step에서만 gradient 동기화
    context = ddp_model.no_sync if (i + 1) % accumulation_steps != 0 else nullcontext
    with context():
        outputs = ddp_model(inputs)
        loss = loss_fn(outputs, targets) / accumulation_steps
        loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

torchrun을 통한 실행

DDP 학습 스크립트는 torchrun(또는 torch.distributed.launch)으로 실행한다:

# 단일 노드, 4 GPU
torchrun --nproc_per_node=4 train.py

# 멀티 노드 (2 노드, 각 8 GPU)
# 노드 0에서:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=29500 train.py
# 노드 1에서:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr="192.168.1.1" --master_port=29500 train.py

6. PyTorch FSDP (Fully Sharded Data Parallel) 공식 문서 분석

DDP는 모든 GPU에 전체 모델을 복제하므로, 모델이 단일 GPU 메모리에 들어가야 한다는 한계가 있다. FSDP는 모델 파라미터, gradient, optimizer state를 GPU들에 분산(shard)하여 이 한계를 극복한다.

FSDP의 핵심 원리

FSDP는 Microsoft의 ZeRO(Zero Redundancy Optimizer) 논문에서 영감을 받아 구현되었다. 핵심 아이디어는 다음과 같다:

Shard 상태: 평소에는 각 GPU가 파라미터의 일부(shard)만 보유한다.
Forward Pass: 연산이 필요한 레이어에 도달하면 AllGather로 모든 shard를 모아 전체 파라미터를 복원하고, 연산을 수행한 뒤 다시 reshard한다.
Backward Pass: 마찬가지로 AllGather로 파라미터를 복원하여 gradient를 계산한 뒤, ReduceScatter로 gradient를 분산한다.
Optimizer Step: 각 GPU가 자신이 담당하는 shard에 대해서만 optimizer step을 수행한다.

FSDP1 vs FSDP2

PyTorch 공식 문서에 따르면, FSDP1은 deprecated 되었으며 FSDP2의 사용이 권장된다. 주요 차이점은 다음과 같다:

구분	FSDP1	FSDP2
Sharding 방식	Flat-parameter sharding	Per-parameter DTensor 기반 dim-0 sharding
API	`FullyShardedDataParallel` wrapper	`fully_shard()` 함수형 API
메모리 관리	`recordStream` 기반	`recordStream` 미사용, 결정적 GPU 메모리
Prefetching	제한적	Implicit/Explicit prefetching 모두 지원

FSDP2는 torch.chunk(dim=0)을 사용하여 각 파라미터를 dim-0 기준으로 data parallel worker 수만큼 분할한다.

FSDP2 코드 예시

import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

def train_fsdp(rank, world_size):
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Transformer 모델 생성
    model = Transformer(model_args).to(rank)

    # Mixed Precision 설정
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,    # 파라미터를 bfloat16으로 저장
        reduce_dtype=torch.float32,     # gradient reduce는 float32로 수행
    )

    # 각 Transformer 레이어에 FSDP 적용 (레이어 단위 sharding)
    for layer in model.layers:
        fully_shard(layer, mp_policy=mp_policy)

    # 최상위 모델에도 FSDP 적용
    fully_shard(model, mp_policy=mp_policy)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(num_epochs):
        for batch in dataloader:
            inputs = batch["input_ids"].to(rank)
            labels = batch["labels"].to(rank)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

FSDP의 Sharding Strategy

FSDP는 다양한 sharding 전략을 제공한다:

FULL_SHARD: 파라미터, gradient, optimizer state 모두를 shard한다. 메모리 절감이 가장 크지만, 통신 오버헤드도 가장 크다. (ZeRO Stage 3에 해당)
SHARD_GRAD_OP: Gradient와 optimizer state만 shard하고, forward 이후 파라미터를 유지한다. 메모리 절감이 중간 수준이며, 통신이 적다. (ZeRO Stage 2에 해당)
NO_SHARD: Sharding 없이 DDP와 동일하게 동작한다. (ZeRO Stage 0에 해당)

7. DeepSpeed ZeRO Stage 1/2/3 비교

DeepSpeed는 Microsoft에서 개발한 분산 학습 라이브러리로, **ZeRO(Zero Redundancy Optimizer)**를 통해 메모리 효율적인 학습을 가능하게 한다. DeepSpeed 공식 문서에 따르면, ZeRO는 세 가지 Stage로 구성된다.

ZeRO Stage별 비교

ZeRO Stage 1: Optimizer State Partitioning

Optimizer state만 GPU 간에 분할한다. Adam optimizer는 파라미터 외에 first moment(m)와 second moment(v)를 추가로 저장하는데, 이것만 분할해도 상당한 메모리를 절약할 수 있다.

메모리 절감: FP16 학습 기준, 8개 GPU 사용 시 per-device 메모리 소비를 약 4배 감소시킬 수 있다.
통신 오버헤드: DDP와 동일 (AllReduce만 사용)
CPU Offloading: 지원

{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  }
}

ZeRO Stage 2: Gradient Partitioning

Optimizer state에 더해 gradient도 분할한다. 각 GPU는 자신이 담당하는 optimizer state에 해당하는 gradient만 보유한다.

메모리 절감: Stage 1 대비 gradient 메모리도 분산되어 추가 절약
통신 오버헤드: AllReduce 대신 ReduceScatter를 사용하여 약간 더 효율적
CPU Offloading: 지원

{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  }
}

ZeRO Stage 3: Parameter Partitioning

Optimizer state, gradient에 더해 모델 파라미터까지 분할한다. 이를 통해 단일 GPU 메모리보다 큰 모델도 학습할 수 있다.

메모리 절감: 가장 극적. 모든 model state가 분산된다.
통신 오버헤드: Forward/backward 중 AllGather가 추가로 필요하여 통신량 증가
CPU/NVMe Offloading: 지원 (ZeRO-Infinity)

{
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 1e7,
    "sub_group_size": 1e9,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}

ZeRO Stage 비교 요약

구분	Stage 1	Stage 2	Stage 3
Optimizer State 분할	O	O	O
Gradient 분할	X	O	O
Parameter 분할	X	X	O
CPU Offloading	O	O	O
NVMe Offloading	X	X	O
통신 오버헤드	낮음	낮음	높음
메모리 절감	중간	높음	매우 높음
학습 속도	가장 빠름	빠름	상대적으로 느림

DeepSpeed 전체 설정 예시

{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 50000
    }
  }
}

실전 Stage 선택 가이드

모델이 단일 GPU에 들어가는 경우: ZeRO Stage 1 또는 DDP가 가장 빠르다.
모델이 단일 GPU에 들어가지만 batch size를 키우고 싶은 경우: ZeRO Stage 2로 gradient 메모리를 절약한다.
모델이 단일 GPU에 들어가지 않는 경우: ZeRO Stage 3 또는 FSDP를 사용한다.
GPU 메모리가 극도로 부족한 경우: ZeRO Stage 3 + CPU/NVMe Offloading을 사용한다.

8. HuggingFace Accelerate: 통합 분산 학습 인터페이스

Accelerate는 HuggingFace에서 개발한 라이브러리로, DDP, FSDP, DeepSpeed 등 다양한 분산 학습 전략을 통합된 인터페이스로 제공한다. 기존 PyTorch 학습 코드를 최소한으로 수정하면서 분산 학습을 적용할 수 있다는 것이 핵심 장점이다.

Accelerate의 핵심 개념

Accelerate는 PyTorch 위에 얇은 래퍼(thin wrapper)를 제공하며, 새로운 프레임워크를 배울 필요 없이 기존 학습 루프를 거의 그대로 유지하면서 분산 학습을 적용할 수 있다. 전체 API가 Accelerator 하나의 클래스에 집중되어 있다.

기본 사용법

from accelerate import Accelerator

accelerator = Accelerator()

# 기존 코드에서 변경하는 부분은 이것뿐
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(batch["input_ids"])
    loss = loss_fn(outputs, batch["labels"])
    accelerator.backward(loss)  # loss.backward() 대신
    optimizer.step()
    scheduler.step()

accelerate config로 분산 전략 설정

accelerate config 명령어를 실행하면 대화형으로 분산 학습 환경을 설정할 수 있다:

$ accelerate config

# 질문에 답하면 자동으로 설정 파일이 생성된다
# - 분산 학습 유형 (multi-GPU, multi-node, TPU 등)
# - GPU 개수
# - Mixed precision 사용 여부
# - DeepSpeed / FSDP 사용 여부 및 설정

생성된 설정 파일(default_config.yaml) 예시:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 4
mixed_precision: bf16
use_cpu: false

DeepSpeed와 함께 사용하기

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 8

FSDP와 함께 사용하기

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_state_dict_type: SHARDED_STATE_DICT
mixed_precision: bf16
num_machines: 1
num_processes: 8

학습 실행

accelerate launch train.py

Accelerate는 설정 파일에 따라 자동으로 적절한 분산 전략을 적용하여 학습을 실행한다. torchrun이나 deepspeed launcher를 직접 다루지 않아도 되므로, 실험 시 분산 전략 전환이 매우 간편하다.

9. nvidia-smi 모니터링 및 GPU 활용률 최적화

분산 학습의 효율을 극대화하려면 실시간 GPU 모니터링이 필수적이다. nvidia-smi는 NVIDIA 드라이버에 포함된 CLI 유틸리티로, GPU 상태를 실시간으로 조회할 수 있다.

기본 모니터링 명령어

# 기본 GPU 상태 확인
nvidia-smi

# 1초 간격 자동 갱신 모니터링
watch -n 1 nvidia-smi

# 특정 GPU만 모니터링
nvidia-smi --id=0,1

# 프로세스별 GPU 사용량 모니터링
nvidia-smi pmon -i 0 -s um -d 1

# 연속 디바이스 모니터링 (1초 간격)
nvidia-smi dmon -d 1

핵심 모니터링 지표

# CSV 형태로 주요 지표 출력 (스크립트에서 파싱하기 용이)
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw --format=csv,noheader,nounits

# 출력 예시:
# 0, NVIDIA A100-SXM4-80GB, 45, 98, 72, 65536, 81920, 285
# 1, NVIDIA A100-SXM4-80GB, 43, 95, 68, 61440, 81920, 278

주요 지표 해석:

지표	설명	이상적인 값
GPU Utilization	GPU 코어 활용률	90% 이상
Memory Utilization	메모리 대역폭 활용률	60~80%
Memory Used	사용 중인 VRAM	전체의 80~95%
Temperature	GPU 온도	80도 이하
Power Draw	전력 소비량	TDP의 80~100%

GPU 활용률이 낮은 경우의 원인과 해결

GPU Utilization이 낮은 경우 (< 80%):

DataLoader 병목: num_workers를 CPU 코어 수에 맞게 늘리고, pin_memory=True 설정
Batch size가 너무 작음: GPU의 병렬 연산 유닛을 충분히 활용하지 못함
CPU 전처리 병목: 데이터 전처리를 GPU에서 수행하거나 (DALI 등) 사전 처리하여 캐싱

dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,           # CPU 코어 수에 맞게 조정
    pin_memory=True,          # GPU 전송 속도 향상
    prefetch_factor=2,        # 미리 로드할 batch 수
    persistent_workers=True,  # worker를 epoch 간 유지
)

Memory Utilization이 낮은 경우:

Batch size를 점진적으로 키워서 GPU 메모리를 최대한 활용한다.
Mixed precision(FP16/BF16)을 사용하면 동일 메모리에서 약 2배의 batch size를 사용할 수 있다.

PyTorch 내장 메모리 모니터링

import torch

# 현재 GPU 메모리 사용량 확인
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# 메모리 snapshot 저장 (상세 분석용)
torch.cuda.memory._record_memory_history()
# ... 학습 코드 실행 ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

10. 실전 트러블슈팅: OOM, NCCL Timeout, 통신 병목

Multi-GPU 학습에서 가장 빈번하게 발생하는 문제들과 해결 방법을 정리한다.

10.1 OOM (Out of Memory) 에러

증상: CUDA out of memory. Tried to allocate X MiB 에러 발생

단계적 해결 방법:

# 1단계: Batch size 줄이기
batch_size = 16  # → 8, 4, 2로 단계적으로 줄여본다

# 2단계: Gradient accumulation으로 effective batch size 유지
gradient_accumulation_steps = 4  # batch_size * 4 = effective batch size

# 3단계: Mixed precision 사용
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

with autocast(dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# 4단계: Gradient checkpointing 활성화
from torch.utils.checkpoint import checkpoint
# 또는 HuggingFace 모델:
model.gradient_checkpointing_enable()

# 5단계: 메모리 할당 설정 최적화
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

메모리 부족 시 escalation 경로:

Batch size 줄이기 → Mixed Precision → Gradient Checkpointing
→ ZeRO Stage 2 → ZeRO Stage 3 → CPU Offloading → NVMe Offloading

10.2 NCCL Timeout 에러

증상: Watchdog caught collective operation timeout 또는 NCCL timeout 에러 발생. 학습이 hang되거나 특정 시점에서 멈춘다.

주요 원인 및 해결:

# 원인 1: 타임아웃 값이 너무 짧음
# 해결: 타임아웃 증가
export NCCL_BLOCKING_WAIT=1

# Python에서 타임아웃 설정
import datetime
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30)  # 기본값은 30분
)

# 원인 2: GPU 간 P2P 통신 문제
# 해결: P2P 비활성화
export NCCL_P2P_DISABLE=1

# 원인 3: 네트워크 인터페이스 선택 오류 (Multi-Node)
# 해결: 올바른 네트워크 인터페이스 지정
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0

# 원인 4: Docker 환경에서 Shared Memory 부족
# 해결: Docker 실행 시 SHM 크기 증가
# docker run --shm-size=16g ...

# 디버깅을 위한 NCCL 로그 활성화
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=DETAIL

하나의 GPU에서만 OOM이 발생하여 다른 GPU가 대기하는 경우:

이 상황은 NCCL timeout으로 나타나는데, 실제 원인은 OOM이다. 한 GPU가 OOM으로 크래시하면 나머지 GPU들이 AllReduce를 기다리다가 timeout이 발생한다. 이때는 OOM을 먼저 해결해야 한다.

10.3 통신 병목 진단 및 해결

증상: GPU utilization은 높지만, 학습 속도가 GPU 수에 비례하여 증가하지 않는다.

진단 방법:

import torch.autograd.profiler as profiler

with profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    schedule=profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=profiler.tensorboard_trace_handler('./log/profiler'),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(dataloader):
        if step >= 7:
            break
        train_step(model, batch)
        prof.step()

해결 방법:

# 1. Gradient compression 사용 (DDP)
from torch.distributed.algorithms.ddp_comm_hooks import (
    default_hooks as default,
    powerSGD_hook as powerSGD,
)
ddp_model.register_comm_hook(
    state=powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1),
    hook=powerSGD.powerSGD_hook,
)

# 2. Bucket size 조정 (DDP)
ddp_model = DDP(model, device_ids=[rank], bucket_cap_mb=50)  # 기본값 25MB

# 3. overlap_comm 활성화 (DeepSpeed)
# ds_config.json에서:
# "zero_optimization": { "overlap_comm": true }

10.4 흔한 실수와 해결

문제	원인	해결
모든 GPU에서 동일한 데이터 사용	`DistributedSampler` 미사용	`DistributedSampler` 적용
Epoch마다 같은 순서로 데이터 공급	`sampler.set_epoch(epoch)` 미호출	매 epoch 시작 시 호출
모델 저장 시 충돌	모든 rank에서 저장 시도	`if rank == 0:`으로 rank 0에서만 저장
`DDP` 래핑 후 원본 모델 접근	`model`과 `ddp_model` 혼용	`ddp_model.module`로 접근
`find_unused_parameters` 경고	일부 파라미터가 forward에서 미사용	`DDP(..., find_unused_parameters=True)` 설정

11. 정리: 어떤 전략을 선택할 것인가

최종적으로 분산 학습 전략 선택은 다음 의사결정 트리를 따른다:

모델이 단일 GPU에 들어가는가?
├── YES → DDP (가장 빠르고 간단)
│         ├── Batch size를 키우고 싶다면 → ZeRO Stage 1/2 or FSDP SHARD_GRAD_OP
│         └── 충분하다면 → DDP만으로 충분
└── NO → 모델이 전체 GPU 메모리 합에 들어가는가?
          ├── YES → FSDP (FULL_SHARD) 또는 ZeRO Stage 3
          │         ├── PyTorch 네이티브 선호 → FSDP
          │         └── 더 많은 옵션/커스터마이징 → DeepSpeed
          └── NO → ZeRO Stage 3 + CPU/NVMe Offloading
                    또는 Pipeline Parallelism + Tensor Parallelism 조합

References

PyTorch 공식 문서

DeepSpeed 공식 문서

NVIDIA 공식 문서

HuggingFace 공식 문서

Complete Guide to Multi-GPU Distributed Training: DDP, FSDP, DeepSpeed

1. Why Multi-GPU Training Is Necessary
- Model Size and Memory Requirements
- Training Time Reduction
2. NCCL: The Core of GPU Communication
- What is NCCL
- Key NCCL Environment Variables
3. NVLink vs PCIe: GPU Interconnect Bandwidth Comparison
4. Distributed Training Paradigms: Data / Model / Pipeline Parallelism
5. PyTorch DDP (DistributedDataParallel) Official Documentation Analysis
6. PyTorch FSDP (Fully Sharded Data Parallel) Official Documentation Analysis
7. DeepSpeed ZeRO Stage 1/2/3 Comparison
8. HuggingFace Accelerate: Unified Distributed Training Interface
9. nvidia-smi Monitoring and GPU Utilization Optimization
10. Practical Troubleshooting: OOM, NCCL Timeout, Communication Bottlenecks
11. Summary: Which Strategy to Choose
References
Quiz

1. Why Multi-GPU Training Is Necessary

The number of parameters in recent large language models (LLMs) has been growing exponentially. From GPT-3 with 175B, PaLM with 540B, to Llama 3 with 405B parameters, training on a single GPU has become physically impossible.

Model Size and Memory Requirements

Calculating a model's memory usage makes the severity of the problem clear. With FP32 (32-bit floating point), a single parameter occupies 4 bytes. Therefore:

7B model: 7 x 10^9 x 4 bytes = approximately 28 GB (parameters only)
13B model: 13 x 10^9 x 4 bytes = approximately 52 GB
70B model: 70 x 10^9 x 4 bytes = approximately 280 GB

Adding optimizer states (2x the parameters for Adam), gradients (same size as parameters), and activation memory, the actual memory required during training is approximately 4-8x the parameter size. Training a 7B model in FP32 requires a minimum of 112-224 GB of GPU memory.

The most widely used NVIDIA A100 has 80 GB of VRAM, and the H100 also has 80 GB. Even full fine-tuning of a 7B model is challenging on a single GPU. This is why multi-GPU distributed training has become a necessity rather than an option.

Training Time Reduction

Beyond memory concerns, reducing training time is another key motivation. By distributing training that would take weeks to months on a single GPU across multiple GPUs, near-linear speedup can be expected. Using 8 GPUs can theoretically reduce training time to 1/8, and in practice, achieving over 90% scaling efficiency is possible by minimizing communication overhead.

2. NCCL: The Core of GPU Communication

The key to multi-GPU training is efficient communication between GPUs. NVIDIA provides NCCL (NVIDIA Collective Communications Library), a dedicated communication library for this purpose.

What is NCCL

NCCL is a library that optimizes collective communication primitives for multi-GPU and multi-node environments, specifically designed for NVIDIA GPUs and networking. As of NCCL 2.29.1 (current latest version), it supports the following communication operations:

AllReduce: Sums (or averages) data from all GPUs and distributes the result to all GPUs. Used critically for gradient synchronization in DDP.
AllGather: Collects data from each GPU and distributes the complete data to all GPUs. Used in FSDP to reconstruct parameters before the forward pass.
ReduceScatter: Reduces data and distributes portions to each GPU. Used in FSDP's backward pass to distribute gradients.
Broadcast: Sends data from one GPU to all GPUs. Used to replicate rank 0's weights to other GPUs during model initialization.
Send/Recv: Point-to-point communication. Used for data transfer between stages in Pipeline Parallelism.

Key NCCL Environment Variables

Important environment variables for debugging NCCL issues or tuning performance in practice:

# Enable NCCL debug logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Specify communication interface (important in multi-node environments)
export NCCL_SOCKET_IFNAME=eth0

# Disable InfiniBand (if needed)
export NCCL_IB_DISABLE=1

# Set P2P (Peer-to-Peer) communication level
export NCCL_P2P_LEVEL=NVL  # Use NVLink

NCCL is used as the default backend in PyTorch's torch.distributed package. Calling init_process_group(backend="nccl") creates an NCCL-based communication group.

3. NVLink vs PCIe: GPU Interconnect Bandwidth Comparison

Multi-GPU training performance is heavily influenced by GPU interconnect bandwidth. GPU interconnect methods are broadly divided into PCIe and NVLink.

PCIe (Peripheral Component Interconnect Express)

PCIe is a general-purpose interface connecting various devices such as GPUs, SSDs, and NICs. Bandwidth by major generation:

Generation	Unidirectional Bandwidth (x16)	Bidirectional Bandwidth (x16)
PCIe 4.0	32 GB/s	64 GB/s
PCIe 5.0	64 GB/s	128 GB/s
PCIe 6.0	128 GB/s	256 GB/s

In PCIe communication, data between GPUs must pass through the CPU/chipset, adding additional hops and latency.

NVLink

NVLink is NVIDIA's high-speed GPU-dedicated interconnect that provides direct GPU-to-GPU communication. Since data is exchanged directly between GPUs without going through the CPU, latency is greatly reduced and bandwidth is substantially increased.

Generation	GPU	Bandwidth (Bidirectional)
NVLink 3rd Gen	A100	600 GB/s
NVLink 4th Gen	H100	900 GB/s
NVLink 5th Gen	B200	1,800 GB/s

NVLink 4th gen (H100) provides approximately 7x the bandwidth of PCIe 5.0. This creates a decisive performance difference in communication-intensive tasks such as gradient synchronization and parameter AllGather for large models.

NVSwitch

In DGX systems, NVSwitch connects all GPUs within a node at full-bisection bandwidth. For example, in a DGX H100 system, 8 H100 GPUs are connected via NVSwitch, guaranteeing 900 GB/s bandwidth between any GPU pair.

Practical Implications

Strategies may differ depending on the availability of NVLink during multi-GPU training:

With NVLink: Communication overhead is small, so both DDP and FSDP operate efficiently.
PCIe only: It's important to reduce communication frequency through gradient compression, gradient accumulation, etc.

4. Distributed Training Paradigms: Data / Model / Pipeline Parallelism

Multi-GPU training strategies are broadly classified into three paradigms.

Data Parallelism

The most basic and widely used approach. The same model is replicated on all GPUs, and training data is divided among GPUs so each processes different mini-batches. After forward/backward passes, gradients computed on each GPU are synchronized via AllReduce, and identical optimizer steps are performed.

GPU 0: Model Copy + Data Shard 0 → Gradient 0 ─┐
GPU 1: Model Copy + Data Shard 1 → Gradient 1 ─┤── AllReduce ──→ Averaged Gradient
GPU 2: Model Copy + Data Shard 2 → Gradient 2 ─┤
GPU 3: Model Copy + Data Shard 3 → Gradient 3 ─┘

Advantages: Simple implementation and near-linear scaling. Disadvantages: The entire model is replicated on every GPU, so the model must fit in a single GPU's memory.

Model Parallelism

Model layers are distributed across multiple GPUs. Also known as Tensor Parallelism, where operations within a single layer (e.g., matrix multiplication) are shared among multiple GPUs. This approach is primarily used in Megatron-LM.

GPU 0: Upper half of layer's weight matrix  ─┐
                                               ├── Combine → Layer Output
GPU 1: Lower half of layer's weight matrix  ─┘

Advantages: Can handle layers larger than a single GPU's memory. Disadvantages: Very frequent inter-GPU communication, requiring high-speed interconnects at the NVLink level.

Pipeline Parallelism

Model layers are sequentially distributed across multiple GPUs. Each GPU handles only some layers, and micro-batches flow through like a pipeline to minimize GPU idle time (bubbles).

GPU 0: Layers 1-8   │ Micro-batch 1 → │ Micro-batch 2 → │ ...
GPU 1: Layers 9-16  │                  │ Micro-batch 1 → │ Micro-batch 2 → │ ...
GPU 2: Layers 17-24 │                  │                  │ Micro-batch 1 → │ ...
GPU 3: Layers 25-32 │                  │                  │                  │ Micro-batch 1 → │

Advantages: Memory efficient as each GPU stores only a portion of the model. Disadvantages: GPU idle time due to pipeline bubbles.

Combining in Practice

Modern large-scale training uses 3D Parallelism, combining all three approaches. For example, Megatron-DeepSpeed applies Tensor Parallelism within nodes (utilizing NVLink), Pipeline Parallelism across nodes, and Data Parallelism overall.

5. PyTorch DDP (DistributedDataParallel) Official Documentation Analysis

PyTorch DDP is the official PyTorch module for implementing Data Parallelism. Unlike torch.nn.DataParallel, it operates on a multi-process basis, eliminating the Python GIL bottleneck, and supports multi-node environments.

DDP's Internal Operation

According to PyTorch official documentation, DDP operates as follows:

Initialization: The model state from rank 0 is broadcast to all processes so they start with identical initial states.
Forward Pass: Each process independently performs the forward pass on its data shard.
Backward Pass: During the backward pass, gradients are synchronized via AllReduce in bucket units. Bucket size can be adjusted with the bucket_cap_mb parameter (default: 25 MB).
Optimizer Step: All processes perform optimizer steps with the same synchronized gradients, ensuring model parameters remain identical.

The key is that gradient AllReduce overlaps with backward computation. Once all gradients in a bucket are ready, asynchronous AllReduce begins immediately, running concurrently with the remaining backward computation. This is the core mechanism enabling DDP's high efficiency.

DDP Code Example

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Create model and wrap with DDP
    model = nn.Linear(1024, 512).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Dataset and DataLoader (DistributedSampler is required)
    dataset = MyDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = optim.Adam(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Call every epoch to ensure data shuffling
        for batch in dataloader:
            inputs, targets = batch
            inputs = inputs.to(rank)
            targets = targets.to(rank)

            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()  # Gradient AllReduce is performed automatically here
            optimizer.step()

    cleanup()

# Execution: torchrun --nproc_per_node=4 train_script.py

Gradient Accumulation and no_sync()

When using gradient accumulation, no_sync() context manager can prevent unnecessary AllReduce:

accumulation_steps = 4

for i, batch in enumerate(dataloader):
    inputs, targets = batch[0].to(rank), batch[1].to(rank)

    # Only synchronize gradients on the last accumulation step
    context = ddp_model.no_sync if (i + 1) % accumulation_steps != 0 else nullcontext
    with context():
        outputs = ddp_model(inputs)
        loss = loss_fn(outputs, targets) / accumulation_steps
        loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Execution via torchrun

DDP training scripts are launched with torchrun (or torch.distributed.launch):

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node (2 nodes, 8 GPUs each)
# On node 0:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=29500 train.py
# On node 1:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr="192.168.1.1" --master_port=29500 train.py

6. PyTorch FSDP (Fully Sharded Data Parallel) Official Documentation Analysis

DDP replicates the entire model on every GPU, limiting it to models that fit in a single GPU's memory. FSDP overcomes this limitation by distributing (sharding) model parameters, gradients, and optimizer states across GPUs.

Core Principles of FSDP

FSDP was inspired by Microsoft's ZeRO (Zero Redundancy Optimizer) paper. The core idea is:

Sharded State: Normally, each GPU holds only a portion (shard) of the parameters.
Forward Pass: When a layer needs to be computed, all shards are gathered via AllGather to reconstruct the full parameters, computation is performed, and then reshard occurs.
Backward Pass: Similarly, parameters are reconstructed via AllGather to compute gradients, then gradients are distributed via ReduceScatter.
Optimizer Step: Each GPU performs the optimizer step only on the shard it's responsible for.

FSDP1 vs FSDP2

According to PyTorch official documentation, FSDP1 is deprecated and FSDP2 is recommended. Key differences:

Aspect	FSDP1	FSDP2
Sharding Method	Flat-parameter sharding	Per-parameter DTensor-based dim-0 sharding
API	`FullyShardedDataParallel` wrapper	`fully_shard()` functional API
Memory Management	`recordStream` based	No `recordStream`, deterministic GPU memory
Prefetching	Limited	Both implicit/explicit prefetching supported

FSDP2 uses torch.chunk(dim=0) to split each parameter along dim-0 by the number of data parallel workers.

FSDP2 Code Example

import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

def train_fsdp(rank, world_size):
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Create Transformer model
    model = Transformer(model_args).to(rank)

    # Mixed Precision configuration
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,    # Store parameters in bfloat16
        reduce_dtype=torch.float32,     # Perform gradient reduce in float32
    )

    # Apply FSDP to each Transformer layer (layer-level sharding)
    for layer in model.layers:
        fully_shard(layer, mp_policy=mp_policy)

    # Apply FSDP to the top-level model as well
    fully_shard(model, mp_policy=mp_policy)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(num_epochs):
        for batch in dataloader:
            inputs = batch["input_ids"].to(rank)
            labels = batch["labels"].to(rank)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

FSDP Sharding Strategies

FSDP offers various sharding strategies:

FULL_SHARD: Shards parameters, gradients, and optimizer states. Greatest memory savings but also greatest communication overhead. (Equivalent to ZeRO Stage 3)
SHARD_GRAD_OP: Shards only gradients and optimizer states, retaining parameters after forward. Moderate memory savings with less communication. (Equivalent to ZeRO Stage 2)
NO_SHARD: No sharding, operates the same as DDP. (Equivalent to ZeRO Stage 0)

7. DeepSpeed ZeRO Stage 1/2/3 Comparison

DeepSpeed is a distributed training library developed by Microsoft that enables memory-efficient training through ZeRO (Zero Redundancy Optimizer). According to DeepSpeed official documentation, ZeRO consists of three stages.

ZeRO Stage Comparison

ZeRO Stage 1: Optimizer State Partitioning

Only optimizer states are partitioned across GPUs. Since the Adam optimizer stores first moment (m) and second moment (v) in addition to parameters, partitioning these alone can save significant memory.

Memory savings: With FP16 training and 8 GPUs, per-device memory consumption can be reduced by approximately 4x.
Communication overhead: Same as DDP (AllReduce only)
CPU Offloading: Supported

{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  }
}

ZeRO Stage 2: Gradient Partitioning

In addition to optimizer states, gradients are also partitioned. Each GPU only holds gradients corresponding to its assigned optimizer states.

Memory savings: Additional savings from distributed gradient memory compared to Stage 1
Communication overhead: Slightly more efficient using ReduceScatter instead of AllReduce
CPU Offloading: Supported

{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  }
}

ZeRO Stage 3: Parameter Partitioning

In addition to optimizer states and gradients, model parameters are also partitioned. This enables training models larger than a single GPU's memory.

Memory savings: Most dramatic. All model states are distributed.
Communication overhead: Additional AllGather required during forward/backward, increasing communication volume
CPU/NVMe Offloading: Supported (ZeRO-Infinity)

{
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 1e7,
    "sub_group_size": 1e9,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}

ZeRO Stage Comparison Summary

Aspect	Stage 1	Stage 2	Stage 3
Optimizer State Partition	O	O	O
Gradient Partition	X	O	O
Parameter Partition	X	X	O
CPU Offloading	O	O	O
NVMe Offloading	X	X	O
Communication Overhead	Low	Low	High
Memory Savings	Moderate	High	Very High
Training Speed	Fastest	Fast	Relatively Slower

Full DeepSpeed Configuration Example

{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 50000
    }
  }
}

Practical Stage Selection Guide

Model fits in a single GPU: ZeRO Stage 1 or DDP is fastest.
Model fits in a single GPU but you want larger batch sizes: ZeRO Stage 2 saves gradient memory.
Model doesn't fit in a single GPU: Use ZeRO Stage 3 or FSDP.
Extremely limited GPU memory: Use ZeRO Stage 3 + CPU/NVMe Offloading.

8. HuggingFace Accelerate: Unified Distributed Training Interface

Accelerate is a library developed by HuggingFace that provides a unified interface for various distributed training strategies including DDP, FSDP, and DeepSpeed. Its key advantage is enabling distributed training with minimal modifications to existing PyTorch training code.

Core Concepts of Accelerate

Accelerate provides a thin wrapper on top of PyTorch, allowing you to apply distributed training while keeping your existing training loop virtually unchanged without learning a new framework. The entire API is centered around a single Accelerator class.

Basic Usage

from accelerate import Accelerator

accelerator = Accelerator()

# This is the only change from existing code
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(batch["input_ids"])
    loss = loss_fn(outputs, batch["labels"])
    accelerator.backward(loss)  # Instead of loss.backward()
    optimizer.step()
    scheduler.step()

Setting Up Distributed Strategy with accelerate config

Running accelerate config interactively sets up the distributed training environment:

$ accelerate config

# Answering questions automatically generates a configuration file
# - Distributed training type (multi-GPU, multi-node, TPU, etc.)
# - Number of GPUs
# - Whether to use mixed precision
# - DeepSpeed / FSDP usage and settings

Example generated configuration file (default_config.yaml):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 4
mixed_precision: bf16
use_cpu: false

Using with DeepSpeed

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 8

Using with FSDP

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_state_dict_type: SHARDED_STATE_DICT
mixed_precision: bf16
num_machines: 1
num_processes: 8

Launching Training

accelerate launch train.py

Accelerate automatically applies the appropriate distributed strategy based on the configuration file. Since you don't need to deal with torchrun or deepspeed launchers directly, switching between distributed strategies during experiments is very convenient.

9. nvidia-smi Monitoring and GPU Utilization Optimization

Real-time GPU monitoring is essential for maximizing distributed training efficiency. nvidia-smi is a CLI utility included with the NVIDIA driver for querying GPU status in real time.

Basic Monitoring Commands

# Basic GPU status check
nvidia-smi

# Auto-refresh monitoring at 1-second intervals
watch -n 1 nvidia-smi

# Monitor specific GPUs only
nvidia-smi --id=0,1

# Per-process GPU usage monitoring
nvidia-smi pmon -i 0 -s um -d 1

# Continuous device monitoring (1-second interval)
nvidia-smi dmon -d 1

Key Monitoring Metrics

# Output key metrics in CSV format (easy for script parsing)
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw --format=csv,noheader,nounits

# Example output:
# 0, NVIDIA A100-SXM4-80GB, 45, 98, 72, 65536, 81920, 285
# 1, NVIDIA A100-SXM4-80GB, 43, 95, 68, 61440, 81920, 278

Key metric interpretation:

Metric	Description	Ideal Value
GPU Utilization	GPU core utilization rate	90% or higher
Memory Utilization	Memory bandwidth utilization	60-80%
Memory Used	VRAM in use	80-95% of total
Temperature	GPU temperature	Below 80C
Power Draw	Power consumption	80-100% of TDP

Causes and Solutions for Low GPU Utilization

When GPU Utilization is low (less than 80%):

DataLoader bottleneck: Increase num_workers to match CPU core count and set pin_memory=True
Batch size too small: GPU parallel compute units are not fully utilized
CPU preprocessing bottleneck: Perform data preprocessing on GPU (DALI, etc.) or pre-process and cache

dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,           # Adjust to match CPU core count
    pin_memory=True,          # Improve GPU transfer speed
    prefetch_factor=2,        # Number of batches to preload
    persistent_workers=True,  # Keep workers across epochs
)

When Memory Utilization is low:

Gradually increase batch size to maximize GPU memory utilization.
Using mixed precision (FP16/BF16) allows approximately 2x the batch size with the same memory.

PyTorch Built-in Memory Monitoring

import torch

# Check current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# Save memory snapshot (for detailed analysis)
torch.cuda.memory._record_memory_history()
# ... execute training code ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

10. Practical Troubleshooting: OOM, NCCL Timeout, Communication Bottlenecks

Here we compile the most frequently occurring issues in multi-GPU training and their solutions.

10.1 OOM (Out of Memory) Error

Symptoms: CUDA out of memory. Tried to allocate X MiB error

Step-by-step resolution:

# Step 1: Reduce batch size
batch_size = 16  # → Try decreasing to 8, 4, 2 incrementally

# Step 2: Use gradient accumulation to maintain effective batch size
gradient_accumulation_steps = 4  # batch_size * 4 = effective batch size

# Step 3: Use mixed precision
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

with autocast(dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Step 4: Enable gradient checkpointing
from torch.utils.checkpoint import checkpoint
# Or for HuggingFace models:
model.gradient_checkpointing_enable()

# Step 5: Optimize memory allocation settings
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

Escalation path for memory shortages:

Reduce batch size → Mixed Precision → Gradient Checkpointing
→ ZeRO Stage 2 → ZeRO Stage 3 → CPU Offloading → NVMe Offloading

10.2 NCCL Timeout Error

Symptoms: Watchdog caught collective operation timeout or NCCL timeout error. Training hangs or freezes at certain points.

Common causes and solutions:

# Cause 1: Timeout value too short
# Solution: Increase timeout
export NCCL_BLOCKING_WAIT=1

# Set timeout in Python
import datetime
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30)  # Default is 30 minutes
)

# Cause 2: GPU P2P communication issues
# Solution: Disable P2P
export NCCL_P2P_DISABLE=1

# Cause 3: Network interface selection error (multi-node)
# Solution: Specify the correct network interface
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0

# Cause 4: Insufficient shared memory in Docker environment
# Solution: Increase SHM size when running Docker
# docker run --shm-size=16g ...

# Enable NCCL logs for debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=DETAIL

When OOM occurs on only one GPU, causing other GPUs to wait:

This situation manifests as an NCCL timeout, but the actual cause is OOM. When one GPU crashes from OOM, the remaining GPUs timeout waiting for AllReduce. The OOM must be resolved first.

10.3 Communication Bottleneck Diagnosis and Resolution

Symptoms: GPU utilization is high, but training speed doesn't scale proportionally with the number of GPUs.

Diagnostic approach:

import torch.autograd.profiler as profiler

with profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    schedule=profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=profiler.tensorboard_trace_handler('./log/profiler'),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(dataloader):
        if step >= 7:
            break
        train_step(model, batch)
        prof.step()

Solutions:

# 1. Use gradient compression (DDP)
from torch.distributed.algorithms.ddp_comm_hooks import (
    default_hooks as default,
    powerSGD_hook as powerSGD,
)
ddp_model.register_comm_hook(
    state=powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1),
    hook=powerSGD.powerSGD_hook,
)

# 2. Adjust bucket size (DDP)
ddp_model = DDP(model, device_ids=[rank], bucket_cap_mb=50)  # Default 25MB

# 3. Enable overlap_comm (DeepSpeed)
# In ds_config.json:
# "zero_optimization": { "overlap_comm": true }

10.4 Common Mistakes and Solutions

Problem	Cause	Solution
All GPUs use identical data	`DistributedSampler` not used	Apply `DistributedSampler`
Same data order every epoch	`sampler.set_epoch(epoch)` not called	Call at the start of every epoch
Conflicts when saving model	All ranks attempt to save	Save only on rank 0 with `if rank == 0:`
Accessing original model after DDP wrap	Mixing `model` and `ddp_model`	Access via `ddp_model.module`
`find_unused_parameters` warning	Some parameters unused in forward	Set `DDP(..., find_unused_parameters=True)`

11. Summary: Which Strategy to Choose

The final decision tree for selecting a distributed training strategy:

Does the model fit in a single GPU?
├── YES → DDP (fastest and simplest)
│         ├── Want larger batch sizes → ZeRO Stage 1/2 or FSDP SHARD_GRAD_OP
│         └── Sufficient → DDP alone is enough
└── NO → Does the model fit in the total GPU memory of all GPUs?
          ├── YES → FSDP (FULL_SHARD) or ZeRO Stage 3
          │         ├── Prefer PyTorch native → FSDP
          │         └── Need more options/customization → DeepSpeed
          └── NO → ZeRO Stage 3 + CPU/NVMe Offloading
                    or Pipeline Parallelism + Tensor Parallelism combination

References

PyTorch Official Documentation

DeepSpeed Official Documentation

NVIDIA Official Documentation

HuggingFace Official Documentation

Quiz

Q1: What is the main topic covered in "Complete Guide to Multi-GPU Distributed Training: DDP, FSDP, DeepSpeed"?

Systematically analyze the core components of multi-GPU distributed training including DDP, FSDP, and DeepSpeed ZeRO based on PyTorch official documentation, with practical setup instructions.

Q2: Why Multi-GPU Training Is Necessary?

Q3: Explain the core concept of NCCL: The Core of GPU Communication.

The key to multi-GPU training is efficient communication between GPUs. NVIDIA provides NCCL (NVIDIA Collective Communications Library), a dedicated communication library for this purpose.

Q4: What are the key differences in NVLink vs PCIe: GPU Interconnect Bandwidth Comparison?

Multi-GPU training performance is heavily influenced by GPU interconnect bandwidth. GPU interconnect methods are broadly divided into PCIe and NVLink.

Q5: How does Distributed Training Paradigms: Data / Model / Pipeline Parallelism work?

Multi-GPU training strategies are broadly classified into three paradigms. Data Parallelism The most basic and widely used approach. The same model is replicated on all GPUs, and training data is divided among GPUs so each processes different mini-batches.