DeepSpeed Complete Guide: ZeRO Optimization and Large-Scale Model Training
Introduction
Training large language models (LLMs) like GPT-4, LLaMA, or Falcon is simply not feasible on a single GPU. Storing GPT-3's 175 billion parameters in fp16 requires approximately 350GB of GPU memory, and including the Adam optimizer states brings that to about 1.4TB. DeepSpeed is the library Microsoft developed to solve exactly this problem.
DeepSpeed centers on the ZeRO (Zero Redundancy Optimizer) family of optimizations, enabling training of models with hundreds of billions to trillions of parameters on ordinary GPU clusters. This guide covers everything from DeepSpeed's core concepts to production configurations.
1. Introduction to DeepSpeed
1.1 Background and Motivation
As deep learning models grow larger, three bottlenecks emerge:
- Memory shortage: model parameters, gradients, and optimizer states exceed GPU VRAM
- Compute bottleneck: the FLOPS limit of a single GPU
- Communication bottleneck: gradient synchronization overhead in multi-GPU setups
Traditional data parallelism keeps a full copy of the model on every GPU, so it does not address the memory problem. Model parallelism is complex to implement and often inefficient.
DeepSpeed combines the strengths of both approaches and achieves dramatic memory efficiency through its novel ZeRO optimization.
1.2 Integration with PyTorch
DeepSpeed requires only minimal changes to existing PyTorch code. The core change is a single deepspeed.initialize() call, which handles engine creation, distributed initialization, and optimizer wrapping all at once.
import deepspeed
# Existing PyTorch code
model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Switch to DeepSpeed (minimal changes)
model_engine, optimizer, _, _ = deepspeed.initialize(
args=args,
model=model,
model_parameters=model.parameters(),
config="ds_config.json"
)
1.3 Installation
pip install deepspeed
# Install with CUDA kernel builds
DS_BUILD_OPS=1 pip install deepspeed
# Verify the installation
ds_report
2. ZeRO Optimization, Stage by Stage
ZeRO (Zero Redundancy Optimizer) is the core technique for eliminating memory redundancy in data-parallel training. In plain data parallelism, all N GPUs maintain identical model copies, which is an extreme waste of memory. ZeRO removes this redundancy in stages.
Memory usage during training
Given Ψ model parameters, per-GPU memory in mixed-precision (fp16) training breaks down as follows:
| Component | Size | Notes |
|---|---|---|
| Parameters (fp16) | 2Ψ bytes | Forward/backward pass |
| Gradients (fp16) | 2Ψ bytes | Backward pass results |
| Master parameters (fp32) | 4Ψ bytes | For the optimizer |
| Adam momentum (fp32) | 4Ψ bytes | 1st moment |
| Adam variance (fp32) | 4Ψ bytes | 2nd moment |
| Total | 16Ψ bytes | |
A 7B-parameter model therefore needs about 112GB, and a 70B model about 1.1TB.
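The table's 16Ψ rule is easy to sanity-check in a few lines of Python. This is a rough estimate in decimal GB that ignores activations, buffers, and fragmentation; `training_memory_gb` is just an illustrative helper, not a DeepSpeed API:

```python
def training_memory_gb(num_params: float) -> dict:
    """Per-GPU memory for plain mixed-precision Adam training,
    following the 2+2+4+4+4 = 16 bytes-per-parameter accounting above."""
    GB = 1e9
    components = {
        "params_fp16": 2 * num_params,
        "grads_fp16": 2 * num_params,
        "master_params_fp32": 4 * num_params,
        "adam_momentum_fp32": 4 * num_params,
        "adam_variance_fp32": 4 * num_params,
    }
    report = {name: size / GB for name, size in components.items()}
    report["total"] = sum(components.values()) / GB
    return report

print(f"7B model:  {training_memory_gb(7e9)['total']:.0f} GB")   # 112 GB
print(f"70B model: {training_memory_gb(70e9)['total']:.0f} GB")  # 1120 GB
```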
2.1 ZeRO-1: Optimizer State Partitioning
ZeRO-1 partitions the optimizer states — the largest memory component — across the N GPUs.
- For Adam: fp32 parameters + momentum + variance = 12Ψ bytes
- Partitioned across N GPUs: optimizer state per GPU = 12Ψ/N bytes
Memory savings:
Baseline: 16Ψ bytes per GPU
ZeRO-1: 4Ψ + 12Ψ/N bytes per GPU (at N=64, ≈ 4.2Ψ bytes, roughly a 4x saving)
During the update, each GPU updates only its own parameter shard, then an All-gather restores the full parameter set.
{
"zero_optimization": {
"stage": 1
}
}
2.2 ZeRO-2: Gradient Partitioning
ZeRO-2 partitions the gradients in addition to the optimizer states.
During backpropagation, each GPU accumulates only the gradients corresponding to its parameter shard; a Reduce-scatter operation delivers to each GPU just the gradients for its assigned segment.
Baseline: 16Ψ bytes per GPU
ZeRO-2: 2Ψ + (2Ψ + 12Ψ)/N bytes per GPU (at N=64, ≈ 2.2Ψ bytes, roughly an 8x saving)
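The reduce-scatter pattern can be illustrated with a pure-Python toy (the real implementation uses NCCL collectives; this only shows which rank ends up owning which summed gradients):

```python
def reduce_scatter(local_grads):
    """local_grads[rank] holds that rank's full local gradient.
    After reduce-scatter, rank r owns only the summed shard r."""
    n_ranks = len(local_grads)
    length = len(local_grads[0])
    shard = length // n_ranks  # assume length divisible by n_ranks
    total = [sum(g[i] for g in local_grads) for i in range(length)]
    return [total[r * shard:(r + 1) * shard] for r in range(n_ranks)]

# 2 ranks, each holding a full 4-element gradient
grads = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(reduce_scatter(grads))  # [[11, 22], [33, 44]]
```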
{
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"overlap_comm": true,
"contiguous_gradients": true
}
}
2.3 ZeRO-3: Parameter Partitioning
ZeRO-3 partitions the model parameters themselves. It is the most powerful stage, but also the most complex.
How it works:
- Each GPU holds only 1/N of the parameters at rest
- Forward pass: an All-gather collects the full parameters just before each layer runs
- After computation: those parameters are freed immediately
- Backward pass: the same on-demand parameter gathering applies
ZeRO-3: (2Ψ + 2Ψ + 12Ψ)/N bytes per GPU
At N=64: 16Ψ/64 ≈ 0.25Ψ bytes (roughly a 64x saving!)
Tradeoff: the per-layer All-gather increases communication overhead, but since it overlaps with pipelined execution, the practical slowdown is small.
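The three stage formulas collapse into one small helper (an illustrative sketch; `psi` is the parameter count, stage 0 stands for plain data parallelism):

```python
def zero_per_gpu_bytes(psi, n_gpus, stage):
    """Per-GPU training memory under ZeRO, per the formulas above."""
    params, grads, optim = 2 * psi, 2 * psi, 12 * psi
    if stage >= 3:
        params /= n_gpus   # ZeRO-3: partition fp16 parameters
    if stage >= 2:
        grads /= n_gpus    # ZeRO-2: partition fp16 gradients
    if stage >= 1:
        optim /= n_gpus    # ZeRO-1: partition optimizer states
    return params + grads + optim

for stage in range(4):
    print(f"stage {stage}: {zero_per_gpu_bytes(1, 64, stage):.2f}Ψ bytes per GPU")
# stage 0: 16.00, stage 1: 4.19, stage 2: 2.22, stage 3: 0.25
```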
{
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
ZeRO-3 Python initialization:
import deepspeed
from transformers import AutoModelForCausalLM
# Under ZeRO-3, how the model is created also matters
with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Parameters are partitioned from the start, preventing memory peaks
3. ZeRO-Offload and ZeRO-Infinity
3.1 ZeRO-Offload
ZeRO-Offload moves the optimizer states and gradients into CPU memory, greatly reducing GPU memory pressure.
- GPU: fp16 parameters + fp16 gradients (forward/backward)
- CPU: fp32 master parameters + Adam states (optimizer update)
This makes it possible to train multi-billion-parameter models even on a single GPU.
{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
}
}
Parameter offload (ZeRO-3 + Offload):
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
}
}
}
3.2 ZeRO-Infinity (NVMe Offload)
ZeRO-Infinity goes one step further and uses NVMe SSDs, supporting virtually unlimited model sizes.
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8
},
"aio": {
"block_size": 1048576,
"queue_depth": 8,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
}
}
Bandwidth considerations:
- GPU-CPU bandwidth: ~32 GB/s over PCIe 4.0 x16
- CPU-NVMe bandwidth: ~7 GB/s for a typical NVMe SSD
- NVMe is slower than CPU memory, so training slows down, but model sizes that were previously impossible become feasible
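Those bandwidth figures translate directly into per-step offload costs. A back-of-the-envelope sketch (ignoring overlap and real-world efficiency, so actual times will differ):

```python
def transfer_seconds(num_params, bytes_per_param, bandwidth_gb_per_s):
    """One-way transfer time for an offloaded state blob."""
    return num_params * bytes_per_param / 1e9 / bandwidth_gb_per_s

ADAM_BYTES = 12  # fp32 master params + momentum + variance
for target, bw in [("CPU (PCIe 4.0 x16, ~32 GB/s)", 32), ("NVMe (~7 GB/s)", 7)]:
    t = transfer_seconds(13e9, ADAM_BYTES, bw)
    print(f"13B optimizer state -> {target}: {t:.1f} s one way")
```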
4. Complete Guide to DeepSpeed Configuration Files
4.1 Basic Structure of ds_config.json
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"overlap_comm": true,
"contiguous_gradients": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": 1e-8,
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"activation_checkpointing": {
"partition_activations": false,
"cpu_checkpointing": false,
"contiguous_memory_optimization": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
},
"wall_clock_breakdown": false,
"steps_per_print": 100
}
4.2 Mixed Precision Settings in Detail
fp16 configuration:
{
"fp16": {
"enabled": true,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"consecutive_hysteresis": false,
"min_loss_scale": 1
}
}
- loss_scale: 0: enables dynamic loss scaling
- initial_scale_power: 16: initial loss scale = 2^16 = 65536
- loss_scale_window: number of successful steps before the scale is increased
- hysteresis: number of overflows tolerated before the scale is decreased
bf16 configuration (Ampere or newer GPUs):
{
"bf16": {
"enabled": true
}
}
bf16 has a wider dynamic range than fp16, so loss scaling is unnecessary. It is the recommended choice on A100/H100.
4.3 Gradient Accumulation and Batch Sizes
Batch size relationship:
train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size
{
"train_batch_size": 2048,
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 64
}
On 8 GPUs, this configuration gives a global batch size of 4 × 64 × 8 = 2048.
Using "auto" lets the Transformers Trainer fill in these values automatically.
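A tiny validator makes the identity concrete (`check_batch_config` is a hypothetical helper, not a DeepSpeed API — DeepSpeed performs an equivalent consistency check at startup):

```python
def check_batch_config(cfg, world_size):
    """Verify train_batch_size = micro_batch * grad_accum * world_size."""
    expected = (cfg["train_micro_batch_size_per_gpu"]
                * cfg["gradient_accumulation_steps"]
                * world_size)
    if cfg["train_batch_size"] != expected:
        raise ValueError(f"inconsistent batch config: "
                         f"{cfg['train_batch_size']} != {expected}")
    return expected

cfg = {
    "train_batch_size": 2048,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 64,
}
print(check_batch_config(cfg, world_size=8))  # 2048
```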
4.4 Learning Rate Scheduler Options
{
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"last_batch_iteration": -1,
"total_num_steps": 100000,
"warmup_min_lr": 0,
"warmup_max_lr": 3e-4,
"warmup_num_steps": 2000,
"warmup_type": "linear"
}
}
}
Supported scheduler types:
- LRRangeTest
- OneCycle
- WarmupLR
- WarmupDecayLR
- WarmupCosineLR (custom)
5. DeepSpeed + PyTorch Integration
5.1 Basic Training Loop
import torch
import deepspeed
from torch.utils.data import DataLoader

def main():
    # Initialize the distributed environment
    deepspeed.init_distributed()

    # Create the model
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Initialize the DeepSpeed engine
    model_engine, optimizer, train_loader, lr_scheduler = deepspeed.initialize(
        model=model,
        training_data=train_dataset,
        config="ds_config.json"
    )

    # Training loop
    for epoch in range(num_epochs):
        for batch in train_loader:
            input_ids = batch["input_ids"].to(model_engine.device)
            labels = batch["labels"].to(model_engine.device)
            # Forward pass
            outputs = model_engine(input_ids=input_ids, labels=labels)
            loss = outputs.loss
            # Backward pass (DeepSpeed handles loss scaling/accumulation)
            model_engine.backward(loss)
            # Parameter update (includes gradient clipping)
            model_engine.step()
            print(f"Step: {model_engine.global_steps}, Loss: {loss.item():.4f}")
5.2 Checkpoint Saving and Loading
# Save a checkpoint
def save_checkpoint(model_engine, save_dir, tag=None):
    # Save in DeepSpeed format (includes distributed state)
    model_engine.save_checkpoint(save_dir, tag=tag)

# Example: save every 1000 steps
if model_engine.global_steps % 1000 == 0:
    save_checkpoint(model_engine, "./checkpoints", tag=f"step_{model_engine.global_steps}")

# Load a checkpoint
_, client_sd = model_engine.load_checkpoint("./checkpoints", tag="step_1000")

# Under ZeRO-3, consolidate the weights into a plain PyTorch state dict
if args.zero_stage == 3:
    state_dict = model_engine._zero3_consolidated_16bit_state_dict()
    if model_engine.local_rank == 0:
        torch.save(state_dict, "model_weights.pt")
5.3 Transformers Trainer Integration
The simplest way to combine HuggingFace Transformers with DeepSpeed:
from transformers import TrainingArguments, Trainer, AutoModelForCausalLM
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
fp16=True,
deepspeed="ds_config.json",  # path to the DeepSpeed config file
logging_steps=10,
save_steps=1000,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
trainer.train()
Command-line execution:
deepspeed --num_gpus=8 train.py \
--deepspeed ds_config.json \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset_name openwebtext \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8 \
--num_train_epochs 3 \
--fp16
5.4 Accelerate Integration
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
deepspeed_plugin = DeepSpeedPlugin(
zero_stage=2,
gradient_accumulation_steps=8,
gradient_clipping=1.0,
)
accelerator = Accelerator(
mixed_precision="fp16",
deepspeed_plugin=deepspeed_plugin,
)
model, optimizer, train_dataloader = accelerator.prepare(
model, optimizer, train_dataloader
)
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
6. DeepSpeed Pipeline Parallelism
6.1 The Pipeline Parallelism Concept
Pipeline parallelism places the model's layers across multiple GPUs in sequence: GPU 0 handles the early layers, GPU 1 the middle ones, and GPU N the final layers.
1F1B schedule (One Forward, One Backward): a schedule that minimizes pipeline bubbles. Each GPU runs the backward pass for a microbatch immediately after that microbatch's forward pass.
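Why microbatch count matters can be quantified with the standard bubble-fraction formula for GPipe/1F1B-style schedules, bubble ≈ (p − 1)/(m + p − 1) for p stages and m microbatches. This is a simplified model assuming uniform stage times:

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of the pipeline under a 1F1B-style schedule."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(4, 4):.2f}")   # 0.43 -> 4 microbatches: heavily bubbled
print(f"{bubble_fraction(4, 32):.2f}")  # 0.09 -> 32 microbatches: mostly busy
```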
6.2 Using PipelineModule
from deepspeed.pipe import PipelineModule, LayerSpec

class GPT2PipelineModel(PipelineModule):
    def __init__(self, num_layers, hidden_size, num_heads, vocab_size):
        layers = [
            LayerSpec(EmbeddingLayer, vocab_size, hidden_size),
        ]
        for i in range(num_layers):
            layers.append(LayerSpec(TransformerBlock, hidden_size, num_heads))
        layers.append(LayerSpec(LMHead, hidden_size, vocab_size))
        super().__init__(
            layers=layers,
            loss_fn=cross_entropy_loss,
            num_stages=4,  # number of pipeline stages
            topology=PipeDataParallelTopology(num_pp=4, num_dp=2)
        )

# Add pipeline settings to the ds_config
pipeline_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "pipeline": {
        "seed_layers": True,
        "activation_checkpoint_interval": 1
    }
}
6.3 Combining Pipeline Parallelism with ZeRO
model_engine, _, _, _ = deepspeed.initialize(
    model=pipeline_model,
    config={
        "train_micro_batch_size_per_gpu": 2,
        "gradient_accumulation_steps": 8,
        "zero_optimization": {"stage": 1},  # combine with ZeRO-1
        "fp16": {"enabled": True}
    }
)

# Pipeline training
for step in range(num_steps):
    loss = model_engine.train_batch()
    if model_engine.global_steps % 100 == 0:
        print(f"Step {model_engine.global_steps}, Loss: {loss.item():.4f}")
7. Tensor Parallelism
7.1 Megatron-DeepSpeed Integration
Tensor parallelism splits individual layers across multiple GPUs, partitioning matrix multiplications along their column or row dimension.
# Megatron-DeepSpeed configuration
megatron_config = {
"tensor_model_parallel_size": 4,
"pipeline_model_parallel_size": 2,
"data_parallel_size": 4, # world_size / (tp_size × pp_size)
}
Tensor-parallel settings in ds_config.json:
{
"tensor_parallel": {
"tp_size": 4,
"mpu": null,
"tp_grain_size": 64
}
}
Launch command:
deepspeed --num_nodes=4 --num_gpus=8 \
pretrain_gpt.py \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 4 \
--num-layers 96 \
--hidden-size 12288 \
--num-attention-heads 96 \
--seq-length 2048 \
--global-batch-size 1024 \
--train-iters 500000 \
--deepspeed \
--deepspeed_config ds_config.json
8. Activation Checkpointing (Gradient Checkpointing)
8.1 Concept and Tradeoffs
Activation checkpointing skips storing intermediate activations during the forward pass and recomputes them during the backward pass instead.
- Memory saving: activation memory drops from O(number of layers) to O(sqrt(number of layers))
- Speed cost: roughly 33% extra computation
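A simplified cost model shows where the square root comes from: with s evenly spaced checkpoints, only the s checkpoints plus one recomputed segment are live at once, which is minimized near s = sqrt(L). This is illustrative arithmetic, not DeepSpeed's exact policy:

```python
import math

def activations_kept(num_layers, checkpointing=False):
    """Peak number of stored layer activations (simplified model)."""
    if not checkpointing:
        return num_layers
    segments = round(math.sqrt(num_layers))  # checkpoint every ~sqrt(L) layers
    return segments + math.ceil(num_layers / segments)

print(activations_kept(48))        # 48 activations without checkpointing
print(activations_kept(48, True))  # 14 with sqrt-spaced checkpoints
```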
{
"activation_checkpointing": {
"partition_activations": true,
"cpu_checkpointing": true,
"contiguous_memory_optimization": true,
"number_checkpoints": 4,
"synchronize_checkpoint_boundary": false,
"profile": false
}
}
# Activation checkpointing can also be enabled at the module level
from deepspeed.runtime.activation_checkpointing import checkpointing

class TransformerLayer(nn.Module):
    def forward(self, x):
        return checkpointing.checkpoint(self._forward, x)

    def _forward(self, x):
        # The actual computation
        return self.attention(self.norm1(x)) + x
9. Mixture of Experts (MoE) with DeepSpeed
9.1 MoE Architecture Overview
Mixture of Experts routes each input token through only a few "expert" subnetworks, growing the parameter count while keeping the per-token compute cost roughly constant.
from deepspeed.moe.layer import MoE

class MoETransformerBlock(nn.Module):
    def __init__(self, hidden_size, num_experts, top_k=2):
        super().__init__()
        self.attention = MultiHeadAttention(hidden_size)
        self.moe_layer = MoE(
            hidden_size=hidden_size,
            expert=FeedForward(hidden_size, hidden_size * 4),
            num_experts=num_experts,
            ep_size=1,       # expert-parallel group size
            k=top_k,         # number of experts activated per token
            capacity_factor=1.25,
            eval_capacity_factor=2.0,
            min_capacity=4,
            use_residual=False
        )

    def forward(self, x, attention_mask=None):
        x = x + self.attention(x, attention_mask)
        x, _, _ = self.moe_layer(x)
        return x
9.2 Expert Parallelism Configuration
{
"zero_optimization": {
"stage": 2
},
"moe_expert_parallel_size": 4,
"moe": {
"enabled": true,
"ep_size": 4,
"moe_param_group": true
}
}
# Separate MoE parameter groups (important!)
def create_moe_param_groups(model):
    parameters = {
        "params": [],
        "name": "parameters"
    }
    moe_parameters = {
        "params": [],
        "moe": True,
        "name": "moe_parameters"
    }
    for module_name, module in model.named_modules():
        if isinstance(module, MoE):
            moe_parameters["params"].extend(
                [p for n, p in module.named_parameters() if "expert" in n]
            )
        else:
            parameters["params"].extend(module.parameters(recurse=False))
    return [parameters, moe_parameters]
10. DeepSpeed Inference
10.1 Inference Optimization
DeepSpeed Inference dramatically improves the inference speed of trained models.
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Convert to a DeepSpeed inference engine
ds_engine = deepspeed.init_inference(
    model=model,
    mp_size=1,                        # tensor-parallel degree
    dtype=torch.float16,              # inference dtype
    checkpoint=None,
    replace_with_kernel_inject=True,  # inject optimized kernels
    max_tokens=2048,
    replace_method="auto"
)
model = ds_engine.module
# Run inference
inputs = tokenizer("DeepSpeed is ", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7
    )
print(tokenizer.decode(outputs[0]))
10.2 Tensor-Parallel Inference
# Multi-GPU inference (tensor parallelism)
ds_engine = deepspeed.init_inference(
    model=model,
    mp_size=4,  # shard tensors across 4 GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    injection_policy={
        LlamaDecoderLayer: ("self_attn.o_proj", "mlp.down_proj")
    }
)
10.3 ZeRO-Inference
Running large models on one or a few GPUs at inference time:
# ZeRO-3-based inference (parameters streamed from CPU/NVMe)
ds_engine = deepspeed.init_inference(
model=model,
config={
"tensor_parallel": {"tp_size": 1},
"dtype": "fp16",
"enable_cuda_graph": False,
"zero": {
"stage": 3,
"offload_param": {
"device": "cpu"
}
}
}
)
11. DeepSpeed Chat (RLHF)
11.1 RLHF Pipeline Overview
DeepSpeed-Chat provides a complete RLHF (Reinforcement Learning from Human Feedback) pipeline.
Three training stages:
- SFT (Supervised Fine-Tuning): supervised fine-tuning on demonstration data
- Reward Model Training: training a reward model on preference data
- RLHF/PPO: reinforcement learning with the PPO algorithm
11.2 The SFT Stage
# Step 1: SFT
deepspeed main.py \
--data_path Dahoas/rm-static \
--data_split 2,4,4 \
--model_name_or_path facebook/opt-1.3b \
--per_device_train_batch_size 8 \
--max_seq_len 512 \
--learning_rate 9.65e-6 \
--weight_decay 0.1 \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--zero_stage 2 \
--deepspeed \
--output_dir /output/sft
11.3 Reward Model Training
from deepspeed_chat.reward_model import RewardModel
# Reward model = SFT model + a linear head
reward_model = RewardModel(
base_model=sft_model,
tokenizer=tokenizer,
num_padding_at_beginning=0
)
# Train on preference pairs (chosen, rejected)
def compute_reward_loss(reward_model, chosen_ids, rejected_ids):
    chosen_reward = reward_model(chosen_ids)
    rejected_reward = reward_model(rejected_ids)
    # The chosen response should receive the higher reward
    loss = -torch.mean(torch.log(torch.sigmoid(chosen_reward - rejected_reward)))
    return loss
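Numerically, this pairwise loss simply penalizes a negative margin between the chosen and rejected scores. A scalar sketch of the same formula:

```python
import math

def pairwise_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(chosen - rejected)) for a single preference pair."""
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(f"{pairwise_loss(2.0, 0.0):.4f}")  # 0.1269 -> chosen clearly preferred
print(f"{pairwise_loss(0.0, 0.0):.4f}")  # 0.6931 -> ln 2, indifferent
print(f"{pairwise_loss(0.0, 2.0):.4f}")  # 2.1269 -> ranking is wrong
```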
11.4 PPO Training
from deepspeed_chat.ppo_trainer import DeepSpeedPPOTrainer
trainer = DeepSpeedPPOTrainer(
rlhf_engine=rlhf_engine,
args=args
)
for epoch in range(args.num_train_epochs):
    for step, batch in enumerate(prompt_dataloader):
        out = trainer.generate_experience(batch["input_ids"])
        # Update the actor and critic models
        actor_loss, critic_loss = trainer.train_rlhf(out)
12. Hands-On Example: Training an LLM
12.1 A Complete Llama-2 Fine-Tuning Example
import torch
import deepspeed
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForSeq2Seq
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
def main():
    # Load the model and tokenizer
    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # Load the model (memory-efficient placement is left to DeepSpeed)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map=None  # managed directly by DeepSpeed
    )

    # LoRA configuration (optional: parameter-efficient fine-tuning)
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none"
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Load the dataset
    dataset = load_dataset("tatsu-lab/alpaca", split="train")

    def preprocess(example):
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        tokens = tokenizer(prompt, truncation=True, max_length=512, padding="max_length")
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    tokenized_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)

    # TrainingArguments with DeepSpeed
    training_args = TrainingArguments(
        output_dir="./llama2-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        warmup_steps=100,
        logging_steps=10,
        save_steps=500,
        bf16=True,
        deepspeed="ds_zero2_config.json",
        remove_unused_columns=False,
        report_to="wandb",
        run_name="llama2-alpaca-deepspeed"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True)
    )

    trainer.train()
    trainer.save_model("./llama2-finetuned-final")

if __name__ == "__main__":
    main()
ZeRO-2 configuration file (ds_zero2_config.json):
{
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": 1e-8,
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Launch:
deepspeed --num_gpus=8 train_llama.py \
--deepspeed ds_zero2_config.json
12.2 Performance Benchmarks
Measured comparison on a system with 8× A100 80GB GPUs:
| Configuration | Max model size | Throughput (tokens/sec) |
|---|---|---|
| Plain DDP | ~7B | 12,000 |
| ZeRO-1 | ~14B | 11,500 |
| ZeRO-2 | ~35B | 10,800 |
| ZeRO-3 | ~175B | 9,200 |
| ZeRO-3 + CPU Offload | ~350B | 4,100 |
| ZeRO-Infinity | ~1T+ | 1,800 |
13. Debugging and Troubleshooting
13.1 Common Problems and Fixes
OOM (Out of Memory) errors:
# Profile memory usage
ds_config = {
    "memory_breakdown": True,
    "wall_clock_breakdown": True,
}

# Step-by-step remedies:
# 1. Raise the ZeRO stage (1 → 2 → 3)
# 2. Enable CPU offload
# 3. Enable activation checkpointing
# 4. Reduce the batch size
# 5. Reduce the sequence length
Gradient overflow (fp16):
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1e-8
}
}
ZeRO-3 parameter gathering errors:
# When you need direct access to parameters under ZeRO-3
from deepspeed import zero

with zero.GatheredParameters(model.parameters()):
    # The full parameters are available only inside this block
    weight_norm = model.weight.norm()
13.2 Performance Tuning Tips
# 1. Overlap communication with computation
zero_config = {
    "overlap_comm": True,
    "contiguous_gradients": True,
}

# 2. Tune bucket sizes (to fit GPU memory and interconnect bandwidth)
zero_config["allgather_bucket_size"] = 5e8  # 500MB
zero_config["reduce_bucket_size"] = 5e8

# 3. ZeRO-3 prefetching
zero_config["stage3_prefetch_bucket_size"] = 5e7
zero_config["stage3_param_persistence_threshold"] = 1e6

# 4. Compiler optimizations (PyTorch 2.0+)
model = torch.compile(model)
Closing Thoughts
DeepSpeed is a cornerstone of modern LLM training. Its ZeRO optimizations make it possible to train, at reasonable cost, models far beyond what a single GPU could ever hold.
Key takeaways:
- ZeRO-1/2: partition optimizer states and gradients to make data parallelism memory-efficient
- ZeRO-3: partition the parameters as well, for memory savings that scale with the number of GPUs
- ZeRO-Offload: offload to CPU/NVMe to train large models on small clusters
- Pipeline + tensor parallelism: split the model horizontally and vertically to maximize communication efficiency
- DeepSpeed Inference: kernel optimizations for roughly 2-5x faster inference
In practice, ZeRO-2 + CPU offload is a good starting point. Move to ZeRO-3 when you need a larger model, and use DeepSpeed Inference's kernel injection for serving.
참고 자료
DeepSpeed Complete Guide: ZeRO Optimization and Large-Scale Model Training
Introduction
Training large language models (LLMs) like GPT-4, LLaMA, or Falcon on a single GPU is simply not feasible. Storing GPT-3's 175 billion parameters in fp16 requires approximately 350GB of GPU memory, and including optimizer states (Adam) brings that to 1.4TB. Microsoft developed DeepSpeed to solve exactly this problem.
DeepSpeed centers around the ZeRO (Zero Redundancy Optimizer) optimization technology, enabling training of models with hundreds of billions to trillions of parameters on ordinary GPU clusters. This guide covers everything from DeepSpeed's core concepts to production configurations.
1. Introduction to DeepSpeed
1.1 Background and Motivation
As deep learning models grow larger, three bottlenecks emerge:
- Memory shortage: Model parameters, gradients, and optimizer states exceed GPU VRAM
- Compute bottleneck: Single GPU FLOPS limits
- Communication bottleneck: Gradient synchronization overhead in multi-GPU setups
Traditional data parallelism (DP) requires each GPU to hold a full copy of the model, failing to address the memory problem. Model parallelism is complex to implement and often inefficient.
DeepSpeed combines the strengths of both approaches and achieves revolutionary memory efficiency through its novel ZeRO optimization.
1.2 Integration with PyTorch
DeepSpeed requires minimal changes to existing PyTorch code. The core change is a single deepspeed.initialize() call that handles engine creation, distributed initialization, and optimizer wrapping simultaneously.
import deepspeed
# Existing PyTorch code
model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Switching to DeepSpeed (minimal changes)
model_engine, optimizer, _, _ = deepspeed.initialize(
args=args,
model=model,
model_parameters=model.parameters(),
config="ds_config.json"
)
1.3 Installation
pip install deepspeed
# Install with CUDA kernel builds
DS_BUILD_OPS=1 pip install deepspeed
# Verify installation
ds_report
2. ZeRO Optimization: Stage-by-Stage Analysis
ZeRO (Zero Redundancy Optimizer) is the core technology for eliminating memory redundancy in data-parallel training. In standard data parallelism, N GPUs all maintain identical model copies — an extreme waste of memory. ZeRO eliminates this redundancy incrementally.
Memory Usage Analysis During Training
Given Ψ model parameters, per-GPU memory in mixed-precision (fp16) training is:
| Component | Size | Description |
|---|---|---|
| Parameters (fp16) | 2Ψ bytes | Forward/backward pass |
| Gradients (fp16) | 2Ψ bytes | Backward pass results |
| Master parameters (fp32) | 4Ψ bytes | For optimizer |
| Adam momentum (fp32) | 4Ψ bytes | 1st moment |
| Adam variance (fp32) | 4Ψ bytes | 2nd moment |
| Total | 16Ψ bytes |
A 7B parameter model requires ~112GB; a 70B model needs ~1.1TB.
2.1 ZeRO Stage 1: Optimizer State Partitioning
ZeRO-1 partitions optimizer states — the largest memory component — across N GPUs.
- Adam states: fp32 parameters + momentum + variance = 12Ψ bytes
- With N GPUs: optimizer state per GPU = 12Ψ/N bytes
Memory savings:
Baseline: 16Ψ bytes per GPU
ZeRO-1: 4Ψ + 12Ψ/N bytes per GPU
(at N=64: ≈ 4.2Ψ bytes, ~4x savings)
During updates, each GPU updates only its assigned parameter shard, then an All-gather operation restores the full parameters.
{
"zero_optimization": {
"stage": 1
}
}
2.2 ZeRO Stage 2: Gradient Partitioning
ZeRO-2 also partitions gradients in addition to optimizer states.
During backpropagation, each GPU accumulates only the gradients corresponding to its parameter shard. A Reduce-scatter operation collects gradients for each GPU's assigned segment.
Baseline: 16Ψ bytes per GPU
ZeRO-2: 2Ψ + (2Ψ + 12Ψ)/N bytes per GPU
(at N=64: ≈ 2.2Ψ bytes, ~8x savings)
{
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"overlap_comm": true,
"contiguous_gradients": true
}
}
2.3 ZeRO Stage 3: Parameter Partitioning
ZeRO-3 also partitions the model parameters themselves. This is the most powerful stage but most complex to implement.
How it works:
- Each GPU permanently holds only 1/N of total parameters
- Forward pass: All-gather collects full parameters just before each layer executes
- After computation: parameters are immediately freed
- Backward pass: same on-demand parameter collection
ZeRO-3: (2Ψ + 2Ψ + 12Ψ)/N bytes per GPU
At N=64: 16Ψ/64 ≈ 0.25Ψ bytes (~64x savings!)
Tradeoff: All-gather communication occurs at every layer, increasing communication overhead. However, overlapping with pipeline execution makes the practical delay small.
{
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
ZeRO-3 Python initialization:
import deepspeed
from transformers import AutoModelForCausalLM
# Model creation method matters for ZeRO-3
with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Parameters are partitioned from the start, preventing memory peaks
3. ZeRO-Offload and ZeRO-Infinity
3.1 ZeRO-Offload
ZeRO-Offload moves optimizer states and gradients to CPU memory, greatly reducing GPU memory pressure.
- GPU: fp16 parameters + fp16 gradients (forward/backward)
- CPU: fp32 master parameters + Adam states (optimizer update)
This enables training multi-billion parameter models even on a single GPU.
{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
}
}
Parameter offload (ZeRO-3 + Offload):
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
}
}
}
3.2 ZeRO-Infinity (NVMe Offload)
ZeRO-Infinity leverages NVMe SSDs to support virtually unlimited model sizes.
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8
},
"aio": {
"block_size": 1048576,
"queue_depth": 8,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
}
}
Bandwidth considerations:
- GPU-CPU bandwidth: ~32 GB/s over PCIe 4.0 x16
- CPU-NVMe bandwidth: ~7 GB/s for NVMe SSDs
- NVMe is slower than CPU memory, adding training overhead, but enables previously impossible model sizes
4. Complete DeepSpeed Configuration Guide
4.1 ds_config.json Basic Structure
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"overlap_comm": true,
"contiguous_gradients": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": 1e-8,
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"activation_checkpointing": {
"partition_activations": false,
"cpu_checkpointing": false,
"contiguous_memory_optimization": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
},
"wall_clock_breakdown": false,
"steps_per_print": 100
}
4.2 Mixed Precision Settings in Detail
fp16 configuration:
{
"fp16": {
"enabled": true,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"consecutive_hysteresis": false,
"min_loss_scale": 1
}
}
loss_scale: 0: Enables dynamic loss scalinginitial_scale_power: 16: Initial loss scale = 2^16 = 65536loss_scale_window: Successful steps before scale increasehysteresis: Overflow count before scale decrease
bf16 configuration (Ampere+ GPUs):
{
"bf16": {
"enabled": true
}
}
bf16 has wider dynamic range than fp16, eliminating the need for loss scaling. Recommended for A100/H100.
4.3 Gradient Accumulation and Batch Sizes
Batch size relationship:
train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size
{
"train_batch_size": 2048,
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 64
}
This configuration on 8 GPUs: 4 × 64 × 8 = 2048 global batch size.
Using "auto" lets the Transformers Trainer automatically fill in values.
4.4 Learning Rate Scheduler Options
{
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"last_batch_iteration": -1,
"total_num_steps": 100000,
"warmup_min_lr": 0,
"warmup_max_lr": 3e-4,
"warmup_num_steps": 2000,
"warmup_type": "linear"
}
}
}
Supported scheduler types:
LRRangeTestOneCycleWarmupLRWarmupDecayLRWarmupCosineLR(custom)
5. DeepSpeed + PyTorch Integration
5.1 Basic Training Loop
import torch
import deepspeed
from torch.utils.data import DataLoader
def main():
# Initialize distributed environment
deepspeed.init_distributed()
# Create model
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Initialize DeepSpeed engine
model_engine, optimizer, train_loader, lr_scheduler = deepspeed.initialize(
model=model,
training_data=train_dataset,
config="ds_config.json"
)
# Training loop
for epoch in range(num_epochs):
for batch in train_loader:
input_ids = batch["input_ids"].to(model_engine.device)
labels = batch["labels"].to(model_engine.device)
# Forward pass
outputs = model_engine(input_ids=input_ids, labels=labels)
loss = outputs.loss
# Backward pass (DeepSpeed handles scaling/accumulation)
model_engine.backward(loss)
# Parameter update (includes gradient clipping)
model_engine.step()
print(f"Step: {model_engine.global_steps}, Loss: {loss.item():.4f}")
5.2 Checkpoint Saving and Loading
# Save checkpoint
def save_checkpoint(model_engine, save_dir, tag=None):
# Save in DeepSpeed format (includes distributed state)
model_engine.save_checkpoint(save_dir, tag=tag)
# Example: save every 1000 steps
if model_engine.global_steps % 1000 == 0:
save_checkpoint(model_engine, "./checkpoints", tag=f"step_{model_engine.global_steps}")
# Load checkpoint
_, client_sd = model_engine.load_checkpoint("./checkpoints", tag="step_1000")
# Extract weights in standard PyTorch format under ZeRO-3
if args.zero_stage == 3:
    state_dict = model_engine._zero3_consolidated_16bit_state_dict()
    if model_engine.global_rank == 0:  # save once globally, not once per node
        torch.save(state_dict, "model_weights.pt")
5.3 Transformers Trainer Integration
The simplest way to integrate DeepSpeed with HuggingFace Transformers:
from transformers import TrainingArguments, Trainer, AutoModelForCausalLM
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
fp16=True,
deepspeed="ds_config.json", # Path to DeepSpeed config
logging_steps=10,
save_steps=1000,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
trainer.train()
Command-line execution:
deepspeed --num_gpus=8 train.py \
--deepspeed ds_config.json \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset_name openwebtext \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8 \
--num_train_epochs 3 \
--fp16
5.4 Accelerate Integration
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
deepspeed_plugin = DeepSpeedPlugin(
zero_stage=2,
gradient_accumulation_steps=8,
gradient_clipping=1.0,
)
accelerator = Accelerator(
mixed_precision="fp16",
deepspeed_plugin=deepspeed_plugin,
)
model, optimizer, train_dataloader = accelerator.prepare(
model, optimizer, train_dataloader
)
for batch in train_dataloader:
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
6. DeepSpeed Pipeline Parallelism
6.1 Pipeline Parallelism Concept
Pipeline parallelism places model layers sequentially across multiple GPUs. GPU 0 handles early layers, GPU 1 middle layers, and GPU N the final layers.
1F1B Schedule (One Forward, One Backward): Minimizes pipeline bubbles. Each GPU performs one microbatch forward pass immediately followed by a backward pass.
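The benefit of more microbatches can be seen with a back-of-envelope bubble estimate: for p stages and m microbatches, a non-interleaved schedule idles for roughly (p-1)/(m+p-1) of each step. A quick sketch:

```python
def bubble_fraction(num_stages, num_microbatches):
    """Fraction of pipeline time spent idle (the 'bubble').

    Back-of-envelope estimate for a non-interleaved schedule:
    (p - 1) / (m + p - 1) for p stages and m microbatches.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More microbatches per step shrink the bubble:
print(round(bubble_fraction(4, 8), 3))   # 0.273
print(round(bubble_fraction(4, 64), 3))  # 0.045
```

This is why pipeline configurations pair a small micro-batch size with a large gradient_accumulation_steps.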
6.2 PipelineModule Usage
from deepspeed.pipe import PipelineModule, LayerSpec
from deepspeed.runtime.pipe.topology import PipeDataParallelTopology
class GPT2PipelineModel(PipelineModule):
def __init__(self, num_layers, hidden_size, num_heads, vocab_size):
layers = [
LayerSpec(EmbeddingLayer, vocab_size, hidden_size),
]
for i in range(num_layers):
layers.append(LayerSpec(TransformerBlock, hidden_size, num_heads))
layers.append(LayerSpec(LMHead, hidden_size, vocab_size))
super().__init__(
layers=layers,
loss_fn=cross_entropy_loss,
            # Pass either num_stages or an explicit topology, not both;
            # this topology gives 4 pipeline stages x 2 data-parallel replicas
            topology=PipeDataParallelTopology(num_pp=4, num_dp=2)
)
# Add pipeline settings to ds_config
pipeline_config = {
"train_batch_size": 64,
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 8,
"pipeline": {
"seed_layers": True,
"activation_checkpoint_interval": 1
}
}
6.3 Pipeline + ZeRO Combination
model_engine, _, _, _ = deepspeed.initialize(
model=pipeline_model,
config={
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 8,
"zero_optimization": {"stage": 1}, # Combine with ZeRO-1
"fp16": {"enabled": True}
}
)
# Pipeline training
for step in range(num_steps):
loss = model_engine.train_batch()
if model_engine.global_steps % 100 == 0:
print(f"Step {model_engine.global_steps}, Loss: {loss.item():.4f}")
7. Tensor Parallelism
7.1 Megatron-DeepSpeed Integration
Tensor parallelism distributes individual layers across multiple GPUs by partitioning matrix multiplications along column or row dimensions.
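The core identity can be verified in a few lines of NumPy: splitting the first weight matrix by columns and the second by rows reproduces the unsharded result once the partial outputs are summed (the sum is what the all-reduce performs on real hardware). This is a toy sketch of Megatron-style MLP partitioning, not DeepSpeed code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))      # activations
W1 = rng.standard_normal((8, 16))    # first linear layer
W2 = rng.standard_normal((16, 8))    # second linear layer

# Reference: single-device computation
ref = (X @ W1) @ W2

# "2-GPU" tensor parallelism:
# W1 split by columns -> each shard computes part of the hidden dim
W1_a, W1_b = np.split(W1, 2, axis=1)
# W2 split by rows -> partial outputs are summed (the all-reduce)
W2_a, W2_b = np.split(W2, 2, axis=0)

partial_a = (X @ W1_a) @ W2_a   # computed on GPU 0
partial_b = (X @ W1_b) @ W2_b   # computed on GPU 1
out = partial_a + partial_b     # all-reduce across GPUs

assert np.allclose(ref, out)
print("column/row-parallel matmul matches the single-device result")
```

Note that no communication is needed between the two matmuls — only one all-reduce at the end, which is what makes this partitioning efficient.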
# Megatron-DeepSpeed configuration
megatron_config = {
"tensor_model_parallel_size": 4,
"pipeline_model_parallel_size": 2,
"data_parallel_size": 4, # world_size / (tp_size x pp_size)
}
Tensor parallel settings in ds_config.json:
{
"tensor_parallel": {
"tp_size": 4,
"mpu": null,
"tp_grain_size": 64
}
}
Launch command:
deepspeed --num_nodes=4 --num_gpus=8 \
pretrain_gpt.py \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 4 \
--num-layers 96 \
--hidden-size 12288 \
--num-attention-heads 96 \
--seq-length 2048 \
--global-batch-size 1024 \
--train-iters 500000 \
--deepspeed \
--deepspeed_config ds_config.json
8. Activation Checkpointing (Gradient Checkpointing)
8.1 Concept and Trade-offs
Activation checkpointing discards intermediate activations during the forward pass and recomputes them during backpropagation.
- Memory savings: Reduces activation memory from O(layers) to O(sqrt(layers))
- Speed cost: ~33% additional compute overhead
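The sqrt(layers) figure comes from a simple counting argument: with k evenly spaced checkpoints over n layers, the backward pass holds k stored activations plus one re-materialized segment of about n/k layers, and k + n/k is minimized near k = sqrt(n). A rough model (our simplification; real activation sizes vary per layer):

```python
import math

def peak_activation_units(num_layers, num_checkpoints):
    """Peak live activations (counted in 'layer activations') when the
    forward pass keeps only num_checkpoints activations and recomputes
    one segment of num_layers / num_checkpoints layers at a time."""
    n, k = num_layers, num_checkpoints
    return k + math.ceil(n / k)

n = 64
no_ckpt = n                           # store everything: O(n)
best_k = round(math.sqrt(n))          # minimum near k = sqrt(n)
print(no_ckpt, peak_activation_units(n, best_k))  # prints: 64 16
```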
{
"activation_checkpointing": {
"partition_activations": true,
"cpu_checkpointing": true,
"contiguous_memory_optimization": true,
"number_checkpoints": 4,
"synchronize_checkpoint_boundary": false,
"profile": false
}
}
# Also activate at PyTorch level
from deepspeed.runtime.activation_checkpointing import checkpointing
class TransformerLayer(nn.Module):
def forward(self, x):
return checkpointing.checkpoint(self._forward, x)
def _forward(self, x):
# Actual computation
return self.attention(self.norm1(x)) + x
9. Mixture of Experts (MoE) with DeepSpeed
9.1 MoE Architecture Overview
Mixture of Experts routes each input token to only a subset of "expert" sub-networks, increasing parameter count while maintaining constant compute.
from deepspeed.moe.layer import MoE
class MoETransformerBlock(nn.Module):
def __init__(self, hidden_size, num_experts, top_k=2):
super().__init__()
self.attention = MultiHeadAttention(hidden_size)
self.moe_layer = MoE(
hidden_size=hidden_size,
expert=FeedForward(hidden_size, hidden_size * 4),
num_experts=num_experts,
ep_size=1, # Expert parallelism size
k=top_k, # Number of experts to activate
capacity_factor=1.25,
eval_capacity_factor=2.0,
min_capacity=4,
use_residual=False
)
def forward(self, x, attention_mask=None):
x = x + self.attention(x, attention_mask)
x, _, _ = self.moe_layer(x)
return x
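The routing performed inside the MoE layer (the k=top_k argument above) amounts to picking the k highest-scoring experts per token and normalizing their gate weights. A minimal NumPy sketch of that gating step — omitting the capacity limits and load-balancing loss the real layer adds:

```python
import numpy as np

def top_k_route(logits, k=2):
    """Pick the top-k experts per token; softmax over just those."""
    topk = np.argsort(logits, axis=-1)[:, -k:]           # expert indices
    picked = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(picked - picked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # normalize gates
    return topk, weights

rng = np.random.default_rng(0)
gate_logits = rng.standard_normal((4, 8))    # 4 tokens, 8 experts
experts, weights = top_k_route(gate_logits, k=2)
print(experts.shape, weights.shape)          # (4, 2) (4, 2)
```

Each token's output is then the gate-weighted sum of its k chosen experts, so compute per token stays constant no matter how many experts exist.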
9.2 Expert Parallelism Configuration
{
"zero_optimization": {
"stage": 2
},
"moe_expert_parallel_size": 4,
"moe": {
"enabled": true,
"ep_size": 4,
"moe_param_group": true
}
}
# Separate MoE parameter groups (important!)
def create_moe_param_groups(model):
parameters = {
"params": [],
"name": "parameters"
}
moe_parameters = {
"params": [],
"moe": True,
"name": "moe_parameters"
}
for module_name, module in model.named_modules():
if isinstance(module, MoE):
moe_parameters["params"].extend(
[p for n, p in module.named_parameters() if "expert" in n]
)
else:
parameters["params"].extend(module.parameters(recurse=False))
return [parameters, moe_parameters]
10. DeepSpeed Inference
10.1 Inference Optimization
DeepSpeed Inference dramatically improves the inference speed of trained models.
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Convert to DeepSpeed inference engine
ds_engine = deepspeed.init_inference(
model=model,
mp_size=1, # Tensor parallel size
dtype=torch.float16, # Inference dtype
checkpoint=None,
replace_with_kernel_inject=True, # Inject optimized kernels
max_tokens=2048,
replace_method="auto"
)
model = ds_engine.module
# Run inference
inputs = tokenizer("DeepSpeed is ", return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.7
)
print(tokenizer.decode(outputs[0]))
10.2 Tensor Parallel Inference
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Multi-GPU inference (tensor parallelism)
ds_engine = deepspeed.init_inference(
model=model,
mp_size=4, # Distribute across 4 GPUs
dtype=torch.float16,
replace_with_kernel_inject=True,
injection_policy={
LlamaDecoderLayer: ("self_attn.o_proj", "mlp.down_proj")
}
)
10.3 ZeRO-Inference
Large model inference on a single or few GPUs:
# ZeRO-3-based inference (stream parameters from CPU/NVMe)
ds_engine = deepspeed.init_inference(
model=model,
config={
"tensor_parallel": {"tp_size": 1},
"dtype": "fp16",
"enable_cuda_graph": False,
"zero": {
"stage": 3,
"offload_param": {
"device": "cpu"
}
}
}
)
11. DeepSpeed Chat (RLHF)
11.1 RLHF Pipeline Overview
DeepSpeed-Chat provides a complete RLHF (Reinforcement Learning from Human Feedback) pipeline.
Three training stages:
- SFT (Supervised Fine-Tuning)
- Reward Model Training
- RLHF fine-tuning with PPO (Proximal Policy Optimization)
11.2 SFT Stage
# Step 1: SFT
deepspeed main.py \
--data_path Dahoas/rm-static \
--data_split 2,4,4 \
--model_name_or_path facebook/opt-1.3b \
--per_device_train_batch_size 8 \
--max_seq_len 512 \
--learning_rate 9.65e-6 \
--weight_decay 0.1 \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--zero_stage 2 \
--deepspeed \
--output_dir /output/sft
11.3 Reward Model Training
from deepspeed_chat.reward_model import RewardModel
# Reward model = SFT model + linear head
reward_model = RewardModel(
base_model=sft_model,
tokenizer=tokenizer,
num_padding_at_beginning=0
)
# Train on preference pairs (chosen, rejected)
def compute_reward_loss(reward_model, chosen_ids, rejected_ids):
chosen_reward = reward_model(chosen_ids)
rejected_reward = reward_model(rejected_ids)
# Preferred responses should have higher reward
loss = -torch.mean(torch.log(torch.sigmoid(chosen_reward - rejected_reward)))
return loss
11.4 PPO Training
from deepspeed_chat.ppo_trainer import DeepSpeedPPOTrainer
trainer = DeepSpeedPPOTrainer(
rlhf_engine=rlhf_engine,
args=args
)
for epoch in range(args.num_train_epochs):
for step, batch in enumerate(prompt_dataloader):
out = trainer.generate_experience(batch["input_ids"])
# Update Actor/Critic models
actor_loss, critic_loss = trainer.train_rlhf(out)
12. Practical Example: LLM Training
12.1 Complete Llama-2 Fine-tuning Example
import torch
import deepspeed
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForSeq2Seq
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
def main():
# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Load model (DeepSpeed manages device placement)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map=None # DeepSpeed handles this
)
# LoRA configuration (optional: parameter-efficient fine-tuning)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.1,
bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Load dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
def preprocess(example):
prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
tokens = tokenizer(prompt, truncation=True, max_length=512, padding="max_length")
tokens["labels"] = tokens["input_ids"].copy()
return tokens
tokenized_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
# TrainingArguments with DeepSpeed
training_args = TrainingArguments(
output_dir="./llama2-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
learning_rate=2e-4,
warmup_steps=100,
logging_steps=10,
save_steps=500,
bf16=True,
deepspeed="ds_zero2_config.json",
remove_unused_columns=False,
report_to="wandb",
run_name="llama2-alpaca-deepspeed"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True)
)
trainer.train()
trainer.save_model("./llama2-finetuned-final")
if __name__ == "__main__":
main()
ZeRO-2 configuration file (ds_zero2_config.json):
{
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": 1e-8,
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Launch:
deepspeed --num_gpus=8 train_llama.py \
--deepspeed ds_zero2_config.json
12.2 Performance Benchmarks
Measured results on 8x A100 80GB:
| Configuration | Max Model Size | Throughput (tokens/sec) |
|---|---|---|
| Baseline DDP | ~7B | 12,000 |
| ZeRO-1 | ~14B | 11,500 |
| ZeRO-2 | ~35B | 10,800 |
| ZeRO-3 | ~175B | 9,200 |
| ZeRO-3 + CPU Offload | ~350B | 4,100 |
| ZeRO-Infinity | ~1T+ | 1,800 |
13. Debugging and Troubleshooting
13.1 Common Issues and Solutions
OOM (Out of Memory) errors:
# Profile memory usage
ds_config = {
"memory_breakdown": True,
"wall_clock_breakdown": True,
}
# Step-by-step resolution:
# 1. Increase ZeRO stage (1 -> 2 -> 3)
# 2. Enable CPU Offload
# 3. Enable activation checkpointing
# 4. Reduce batch size
# 5. Reduce sequence length
Gradient overflow (fp16):
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1e-8
}
}
ZeRO-3 parameter gather error:
# Accessing parameters directly under ZeRO-3
from deepspeed import zero
with zero.GatheredParameters(model.parameters()):
# Full parameters available only within this block
weight_norm = model.weight.norm()
13.2 Performance Tuning Tips
# 1. Overlap communication with computation
zero_config = {
"overlap_comm": True,
"contiguous_gradients": True,
}
# 2. Optimize bucket sizes (tune for GPU memory and bandwidth)
zero_config["allgather_bucket_size"] = 5e8 # 500MB
zero_config["reduce_bucket_size"] = 5e8
# 3. ZeRO-3 prefetching
zero_config["stage3_prefetch_bucket_size"] = 5e7
zero_config["stage3_param_persistence_threshold"] = 1e6
# 4. Compile optimization (PyTorch 2.0+)
model = torch.compile(model)
Conclusion
DeepSpeed is an essential tool for modern LLM training. Its ZeRO optimization enables training of large-scale models that would be impossible on a single GPU, at a reasonable cost.
Key takeaways:
- ZeRO-1/2: Optimizer/gradient partitioning improves data parallelism efficiency
- ZeRO-3: Also partitions parameters, providing memory savings proportional to GPU count
- ZeRO-Offload: Offloads to CPU/NVMe, enabling large model training on small clusters
- Pipeline + Tensor Parallelism: Horizontal/vertical model distribution maximizes communication efficiency
- DeepSpeed Inference: Kernel optimizations provide 2-5x inference speedups
For real projects, ZeRO-2 + CPU Offload is a good starting point. Move to ZeRO-3 for larger models, and use DeepSpeed Inference's kernel injection for serving.