💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

Training large language models (LLMs) like GPT-4, LLaMA, or Falcon on a single GPU is simply not feasible. Storing GPT-3's 175 billion parameters in fp16 requires approximately 350GB of GPU memory, and including optimizer states (Adam) brings that to 1.4TB. Microsoft developed **DeepSpeed** to solve exactly this problem.

DeepSpeed centers around the ZeRO (Zero Redundancy Optimizer) optimization technology, enabling training of models with hundreds of billions to trillions of parameters on ordinary GPU clusters. This guide covers everything from DeepSpeed's core concepts to production configurations.

1. Introduction to DeepSpeed

1.1 Background and Motivation

As deep learning models grow larger, three bottlenecks emerge:

- **Memory shortage**: Model parameters, gradients, and optimizer states exceed GPU VRAM

- **Compute bottleneck**: Single GPU FLOPS limits

- **Communication bottleneck**: Gradient synchronization overhead in multi-GPU setups

Traditional data parallelism (DP) requires each GPU to hold a full copy of the model, failing to address the memory problem. Model parallelism is complex to implement and often inefficient.

DeepSpeed combines the strengths of both approaches and achieves revolutionary memory efficiency through its novel ZeRO optimization.

1.2 Integration with PyTorch

DeepSpeed requires minimal changes to existing PyTorch code. The core change is a single `deepspeed.initialize()` call that handles engine creation, distributed initialization, and optimizer wrapping simultaneously.

Existing PyTorch code

model = MyModel()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

Switching to DeepSpeed (minimal changes)

model_engine, optimizer, _, _ = deepspeed.initialize(

args=args,

model=model,

model_parameters=model.parameters(),

config="ds_config.json"

)

1.3 Installation

pip install deepspeed

Install with CUDA kernel builds

DS_BUILD_OPS=1 pip install deepspeed

Verify installation

ds_report

2. ZeRO Optimization: Stage-by-Stage Analysis

ZeRO (Zero Redundancy Optimizer) is the core technology for eliminating memory redundancy in data-parallel training. In standard data parallelism, N GPUs all maintain identical model copies — an extreme waste of memory. ZeRO eliminates this redundancy incrementally.

Memory Usage Analysis During Training

Given Ψ model parameters, per-GPU memory in mixed-precision (fp16) training is:

| Component | Size | Description |

| ------------------------ | ------------- | --------------------- |

| Parameters (fp16) | 2Ψ bytes | Forward/backward pass |

| Gradients (fp16) | 2Ψ bytes | Backward pass results |

| Master parameters (fp32) | 4Ψ bytes | For optimizer |

| Adam momentum (fp32) | 4Ψ bytes | 1st moment |

| Adam variance (fp32) | 4Ψ bytes | 2nd moment |

| **Total** | **16Ψ bytes** | |

A 7B parameter model requires ~112GB; a 70B model needs ~1.1TB.

2.1 ZeRO Stage 1: Optimizer State Partitioning

ZeRO-1 partitions **optimizer states** — the largest memory component — across N GPUs.

- Adam states: fp32 parameters + momentum + variance = **12Ψ bytes**

- With N GPUs: optimizer state per GPU = 12Ψ/N bytes

**Memory savings:**

Baseline: 16Ψ bytes per GPU

ZeRO-1: 4Ψ + 12Ψ/N bytes per GPU

(at N=64: ≈ 4.2Ψ bytes, ~4x savings)

During updates, each GPU updates only its assigned parameter shard, then an All-gather operation restores the full parameters.

{

"zero_optimization": {

"stage": 1

}

2.2 ZeRO Stage 2: Gradient Partitioning

ZeRO-2 also partitions **gradients** in addition to optimizer states.

During backpropagation, each GPU accumulates only the gradients corresponding to its parameter shard. A Reduce-scatter operation collects gradients for each GPU's assigned segment.

Baseline: 16Ψ bytes per GPU

ZeRO-2: 2Ψ + (2Ψ + 12Ψ)/N bytes per GPU

(at N=64: ≈ 2.2Ψ bytes, ~8x savings)

{

"zero_optimization": {

"stage": 2,

"allgather_partitions": true,

"allgather_bucket_size": 2e8,

"reduce_scatter": true,

"reduce_bucket_size": 2e8,

"overlap_comm": true,

"contiguous_gradients": true

}

2.3 ZeRO Stage 3: Parameter Partitioning

ZeRO-3 also partitions the model **parameters themselves**. This is the most powerful stage but most complex to implement.

**How it works:**

1. Each GPU permanently holds only 1/N of total parameters

2. Forward pass: All-gather collects full parameters just before each layer executes

3. After computation: parameters are immediately freed

4. Backward pass: same on-demand parameter collection

ZeRO-3: (2Ψ + 2Ψ + 12Ψ)/N bytes per GPU

At N=64: 16Ψ/64 ≈ 0.25Ψ bytes (~64x savings!)

**Tradeoff:** All-gather communication occurs at every layer, increasing communication overhead. However, overlapping with pipeline execution makes the practical delay small.

{

"zero_optimization": {

"stage": 3,

"overlap_comm": true,

"contiguous_gradients": true,

"sub_group_size": 1e9,

"reduce_bucket_size": "auto",

"stage3_prefetch_bucket_size": "auto",

"stage3_param_persistence_threshold": "auto",

"stage3_max_live_parameters": 1e9,

"stage3_max_reuse_distance": 1e9,

"stage3_gather_16bit_weights_on_model_save": true

}

**ZeRO-3 Python initialization:**

from transformers import AutoModelForCausalLM

Model creation method matters for ZeRO-3

with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

Parameters are partitioned from the start, preventing memory peaks

3. ZeRO-Offload and ZeRO-Infinity

3.1 ZeRO-Offload

ZeRO-Offload moves optimizer states and gradients to **CPU memory**, greatly reducing GPU memory pressure.

- GPU: fp16 parameters + fp16 gradients (forward/backward)

- CPU: fp32 master parameters + Adam states (optimizer update)

This enables training multi-billion parameter models even on a single GPU.

{

"zero_optimization": {

"stage": 2,

"offload_optimizer": {

"device": "cpu",

"pin_memory": true

"allgather_partitions": true,

"allgather_bucket_size": 2e8,

"overlap_comm": true,

"reduce_scatter": true,

"reduce_bucket_size": 2e8,

"contiguous_gradients": true

}

**Parameter offload (ZeRO-3 + Offload):**

{

"zero_optimization": {

"stage": 3,

"offload_optimizer": {

"device": "cpu",

"pin_memory": true

"offload_param": {

"device": "cpu",

"pin_memory": true

}

3.2 ZeRO-Infinity (NVMe Offload)

ZeRO-Infinity leverages NVMe SSDs to support virtually unlimited model sizes.

{

"zero_optimization": {

"stage": 3,

"offload_optimizer": {

"device": "nvme",

"nvme_path": "/local_nvme",

"pin_memory": true,

"buffer_count": 4,

"fast_init": false

"offload_param": {

"device": "nvme",

"nvme_path": "/local_nvme",

"pin_memory": true,

"buffer_count": 5,

"buffer_size": 1e8

"aio": {

"block_size": 1048576,

"queue_depth": 8,

"thread_count": 1,

"single_submit": false,

"overlap_events": true

}

**Bandwidth considerations:**

- GPU-CPU bandwidth: ~32 GB/s over PCIe 4.0 x16

- CPU-NVMe bandwidth: ~7 GB/s for NVMe SSDs

- NVMe is slower than CPU memory, adding training overhead, but enables previously impossible model sizes

4. Complete DeepSpeed Configuration Guide

4.1 ds_config.json Basic Structure

{

"train_batch_size": "auto",

"train_micro_batch_size_per_gpu": "auto",

"gradient_accumulation_steps": "auto",

"gradient_clipping": 1.0,

"fp16": {

"enabled": "auto",

"loss_scale": 0,

"loss_scale_window": 1000,

"initial_scale_power": 16,

"hysteresis": 2,

"min_loss_scale": 1

"bf16": {

"enabled": "auto"

"zero_optimization": {

"stage": 2,

"allgather_partitions": true,

"allgather_bucket_size": 2e8,

"reduce_scatter": true,

"reduce_bucket_size": 2e8,

"overlap_comm": true,

"contiguous_gradients": true

"optimizer": {

"type": "AdamW",

"params": {

"lr": "auto",

"betas": "auto",

"eps": 1e-8,

"weight_decay": "auto"

}

"scheduler": {

"type": "WarmupLR",

"params": {

"warmup_min_lr": "auto",

"warmup_max_lr": "auto",

"warmup_num_steps": "auto"

}

"activation_checkpointing": {

"partition_activations": false,

"cpu_checkpointing": false,

"contiguous_memory_optimization": false,

"number_checkpoints": null,

"synchronize_checkpoint_boundary": false,

"profile": false

"wall_clock_breakdown": false,

"steps_per_print": 100

}

4.2 Mixed Precision Settings in Detail

**fp16 configuration:**

{

"fp16": {

"enabled": true,

"auto_cast": false,

"loss_scale": 0,

"initial_scale_power": 16,

"loss_scale_window": 1000,

"hysteresis": 2,

"consecutive_hysteresis": false,

"min_loss_scale": 1

}

- `loss_scale: 0`: Enables dynamic loss scaling

- `initial_scale_power: 16`: Initial loss scale = 2^16 = 65536

- `loss_scale_window`: Successful steps before scale increase

- `hysteresis`: Overflow count before scale decrease

**bf16 configuration (Ampere+ GPUs):**

{

"bf16": {

"enabled": true

}

bf16 has wider dynamic range than fp16, eliminating the need for loss scaling. Recommended for A100/H100.

4.3 Gradient Accumulation and Batch Sizes

Batch size relationship:

train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size

{

"train_batch_size": 2048,

"train_micro_batch_size_per_gpu": 4,

"gradient_accumulation_steps": 64

}

This configuration on 8 GPUs: 4 × 64 × 8 = 2048 global batch size.

Using "auto" lets the Transformers Trainer automatically fill in values.

4.4 Learning Rate Scheduler Options

{

"scheduler": {

"type": "WarmupDecayLR",

"params": {

"last_batch_iteration": -1,

"total_num_steps": 100000,

"warmup_min_lr": 0,

"warmup_max_lr": 3e-4,

"warmup_num_steps": 2000,

"warmup_type": "linear"

}

Supported scheduler types:

- `LRRangeTest`

- `OneCycle`

- `WarmupLR`

- `WarmupDecayLR`

- `WarmupCosineLR` (custom)

5. DeepSpeed + PyTorch Integration

5.1 Basic Training Loop

from torch.utils.data import DataLoader

def main():

Initialize distributed environment

deepspeed.init_distributed()

Create model

model = GPT2LMHeadModel.from_pretrained("gpt2")

Initialize DeepSpeed engine

model_engine, optimizer, train_loader, lr_scheduler = deepspeed.initialize(

model=model,

training_data=train_dataset,

config="ds_config.json"

)

Training loop

for epoch in range(num_epochs):

for batch in train_loader:

input_ids = batch["input_ids"].to(model_engine.device)

labels = batch["labels"].to(model_engine.device)

Forward pass

outputs = model_engine(input_ids=input_ids, labels=labels)

loss = outputs.loss

Backward pass (DeepSpeed handles scaling/accumulation)

model_engine.backward(loss)

Parameter update (includes gradient clipping)

model_engine.step()

print(f"Step: {model_engine.global_steps}, Loss: {loss.item():.4f}")

5.2 Checkpoint Saving and Loading

Save checkpoint

def save_checkpoint(model_engine, save_dir, tag=None):

Save in DeepSpeed format (includes distributed state)

model_engine.save_checkpoint(save_dir, tag=tag)

Example: save every 1000 steps

if model_engine.global_steps % 1000 == 0:

save_checkpoint(model_engine, "./checkpoints", tag=f"step_{model_engine.global_steps}")

Load checkpoint

_, client_sd = model_engine.load_checkpoint("./checkpoints", tag="step_1000")

Extract weights in standard PyTorch format under ZeRO-3

if args.zero_stage == 3:

state_dict = model_engine._zero3_consolidated_16bit_state_dict()

if model_engine.local_rank == 0:

torch.save(state_dict, "model_weights.pt")

5.3 Transformers Trainer Integration

The simplest way to integrate DeepSpeed with HuggingFace Transformers:

from transformers import TrainingArguments, Trainer, AutoModelForCausalLM

training_args = TrainingArguments(

output_dir="./results",

num_train_epochs=3,

per_device_train_batch_size=4,

gradient_accumulation_steps=8,

fp16=True,

deepspeed="ds_config.json", # Path to DeepSpeed config

logging_steps=10,

save_steps=1000,

)

trainer = Trainer(

model=model,

args=training_args,

train_dataset=train_dataset,

eval_dataset=eval_dataset,

)

trainer.train()

Command-line execution:

deepspeed --num_gpus=8 train.py \

--deepspeed ds_config.json \

--model_name_or_path meta-llama/Llama-2-7b-hf \

--dataset_name openwebtext \

--per_device_train_batch_size 4 \

--gradient_accumulation_steps 8 \

--num_train_epochs 3 \

--fp16

5.4 Accelerate Integration

from accelerate import Accelerator

from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(

zero_stage=2,

gradient_accumulation_steps=8,

gradient_clipping=1.0,

)

accelerator = Accelerator(

mixed_precision="fp16",

deepspeed_plugin=deepspeed_plugin,

)

model, optimizer, train_dataloader = accelerator.prepare(

model, optimizer, train_dataloader

)

for batch in train_dataloader:

outputs = model(**batch)

loss = outputs.loss

accelerator.backward(loss)

optimizer.step()

optimizer.zero_grad()

6. DeepSpeed Pipeline Parallelism

6.1 Pipeline Parallelism Concept

Pipeline parallelism places model layers sequentially across multiple GPUs. GPU 0 handles early layers, GPU 1 middle layers, and GPU N the final layers.

**1F1B Schedule (One Forward, One Backward):**

Minimizes pipeline bubbles. Each GPU performs one microbatch forward pass immediately followed by a backward pass.

6.2 PipelineModule Usage

from deepspeed.pipe import PipelineModule, LayerSpec

class GPT2PipelineModel(PipelineModule):

def __init__(self, num_layers, hidden_size, num_heads, vocab_size):

layers = [

LayerSpec(EmbeddingLayer, vocab_size, hidden_size),

]

for i in range(num_layers):

layers.append(LayerSpec(TransformerBlock, hidden_size, num_heads))

layers.append(LayerSpec(LMHead, hidden_size, vocab_size))

super().__init__(

layers=layers,

loss_fn=cross_entropy_loss,

num_stages=4, # Pipeline stages (= number of GPUs)

topology=PipeDataParallelTopology(num_pp=4, num_dp=2)

)

Add pipeline settings to ds_config

pipeline_config = {

"train_batch_size": 64,

"train_micro_batch_size_per_gpu": 2,

"gradient_accumulation_steps": 8,

"pipeline": {

"seed_layers": True,

"activation_checkpoint_interval": 1

}

6.3 Pipeline + ZeRO Combination

model_engine, _, _, _ = deepspeed.initialize(

model=pipeline_model,

config={

"train_micro_batch_size_per_gpu": 2,

"gradient_accumulation_steps": 8,

"zero_optimization": {"stage": 1}, # Combine with ZeRO-1

"fp16": {"enabled": True}

}

)

Pipeline training

for step in range(num_steps):

loss = model_engine.train_batch()

if model_engine.global_steps % 100 == 0:

print(f"Step {model_engine.global_steps}, Loss: {loss.item():.4f}")

7. Tensor Parallelism

7.1 Megatron-DeepSpeed Integration

Tensor parallelism distributes individual layers across multiple GPUs by partitioning matrix multiplications along column or row dimensions.

Megatron-DeepSpeed configuration

megatron_config = {

"tensor_model_parallel_size": 4,

"pipeline_model_parallel_size": 2,

"data_parallel_size": 4, # world_size / (tp_size x pp_size)

}

**Tensor parallel settings in ds_config.json:**

{

"tensor_parallel": {

"tp_size": 4,

"mpu": null,

"tp_grain_size": 64

}

Launch command:

deepspeed --num_nodes=4 --num_gpus=8 \

pretrain_gpt.py \

--tensor-model-parallel-size 4 \

--pipeline-model-parallel-size 4 \

--num-layers 96 \

--hidden-size 12288 \

--num-attention-heads 96 \

--seq-length 2048 \

--global-batch-size 1024 \

--train-iters 500000 \

--deepspeed \

--deepspeed_config ds_config.json

8. Activation Checkpointing (Gradient Checkpointing)

8.1 Concept and Trade-offs

Activation checkpointing discards intermediate activations during the forward pass and recomputes them during backpropagation.

- **Memory savings**: Reduces activation memory from O(layers) to O(sqrt(layers))

- **Speed cost**: ~33% additional compute overhead

{

"activation_checkpointing": {

"partition_activations": true,

"cpu_checkpointing": true,

"contiguous_memory_optimization": true,

"number_checkpoints": 4,

"synchronize_checkpoint_boundary": false,

"profile": false

}

Also activate at PyTorch level

from deepspeed.runtime.activation_checkpointing import checkpointing

class TransformerLayer(nn.Module):

def forward(self, x):

return checkpointing.checkpoint(self._forward, x)

def _forward(self, x):

Actual computation

return self.attention(self.norm1(x)) + x

9. Mixture of Experts (MoE) with DeepSpeed

9.1 MoE Architecture Overview

Mixture of Experts routes each input token to only a subset of "expert" sub-networks, increasing parameter count while maintaining constant compute.

from deepspeed.moe.layer import MoE

class MoETransformerBlock(nn.Module):

def __init__(self, hidden_size, num_experts, top_k=2):

super().__init__()

self.attention = MultiHeadAttention(hidden_size)

self.moe_layer = MoE(

hidden_size=hidden_size,

expert=FeedForward(hidden_size, hidden_size * 4),

num_experts=num_experts,

ep_size=1, # Expert parallelism size

k=top_k, # Number of experts to activate

capacity_factor=1.25,

eval_capacity_factor=2.0,

min_capacity=4,

use_residual=False

)

def forward(self, x, attention_mask=None):

x = x + self.attention(x, attention_mask)

x, _, _ = self.moe_layer(x)

return x

9.2 Expert Parallelism Configuration

{

"zero_optimization": {

"stage": 2

"moe_expert_parallel_size": 4,

"moe": {

"enabled": true,

"ep_size": 4,

"moe_param_group": true

}

Separate MoE parameter groups (important!)

def create_moe_param_groups(model):

parameters = {

"params": [],

"name": "parameters"

}

moe_parameters = {

"params": [],

"moe": True,

"name": "moe_parameters"

}

for module_name, module in model.named_modules():

if isinstance(module, MoE):

moe_parameters["params"].extend(

[p for n, p in module.named_parameters() if "expert" in n]

)

else:

parameters["params"].extend(module.parameters(recurse=False))

return [parameters, moe_parameters]

10. DeepSpeed Inference

10.1 Inference Optimization

DeepSpeed Inference dramatically improves the inference speed of trained models.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

Convert to DeepSpeed inference engine

ds_engine = deepspeed.init_inference(

model=model,

mp_size=1, # Tensor parallel size

dtype=torch.float16, # Inference dtype

checkpoint=None,

replace_with_kernel_inject=True, # Inject optimized kernels

max_tokens=2048,

replace_method="auto"

)

model = ds_engine.module

Run inference

inputs = tokenizer("DeepSpeed is ", return_tensors="pt").to("cuda")

with torch.no_grad():

outputs = model.generate(

**inputs,

max_new_tokens=200,

do_sample=True,

temperature=0.7

)

print(tokenizer.decode(outputs[0]))

10.2 Tensor Parallel Inference

Multi-GPU inference (tensor parallelism)

ds_engine = deepspeed.init_inference(

model=model,

mp_size=4, # Distribute across 4 GPUs

dtype=torch.float16,

replace_with_kernel_inject=True,

injection_policy={

LlamaDecoderLayer: ("self_attn.o_proj", "mlp.down_proj")

}

)

10.3 ZeRO-Inference

Large model inference on a single or few GPUs:

ZeRO-3-based inference (stream parameters from CPU/NVMe)

ds_engine = deepspeed.init_inference(

model=model,

config={

"tensor_parallel": {"tp_size": 1},

"dtype": "fp16",

"enable_cuda_graph": False,

"zero": {

"stage": 3,

"offload_param": {

"device": "cpu"

}

)

11. DeepSpeed Chat (RLHF)

11.1 RLHF Pipeline Overview

DeepSpeed-Chat provides a complete RLHF (Reinforcement Learning from Human Feedback) pipeline.

**Three training stages:**

1. SFT (Supervised Fine-Tuning)

2. Reward Model Training

3. RLHF/PPO with PPO algorithm

11.2 SFT Stage

Step 1: SFT

deepspeed main.py \

--data_path Dahoas/rm-static \

--data_split 2,4,4 \

--model_name_or_path facebook/opt-1.3b \

--per_device_train_batch_size 8 \

--max_seq_len 512 \

--learning_rate 9.65e-6 \

--weight_decay 0.1 \

--num_train_epochs 1 \

--gradient_accumulation_steps 1 \

--lr_scheduler_type cosine \

--num_warmup_steps 0 \

--seed 1234 \

--zero_stage 2 \

--deepspeed \

--output_dir /output/sft

11.3 Reward Model Training

from deepspeed_chat.reward_model import RewardModel

Reward model = SFT model + linear head

reward_model = RewardModel(

base_model=sft_model,

tokenizer=tokenizer,

num_padding_at_beginning=0

)

Train on preference pairs (chosen, rejected)

def compute_reward_loss(reward_model, chosen_ids, rejected_ids):

chosen_reward = reward_model(chosen_ids)

rejected_reward = reward_model(rejected_ids)

Preferred responses should have higher reward

loss = -torch.mean(torch.log(torch.sigmoid(chosen_reward - rejected_reward)))

return loss

11.4 PPO Training

from deepspeed_chat.ppo_trainer import DeepSpeedPPOTrainer

trainer = DeepSpeedPPOTrainer(

rlhf_engine=rlhf_engine,

args=args

)

for epoch in range(args.num_train_epochs):

for step, batch in enumerate(prompt_dataloader):

out = trainer.generate_experience(batch["input_ids"])

Update Actor/Critic models

actor_loss, critic_loss = trainer.train_rlhf(out)

12. Practical Example: LLM Training

12.1 Complete Llama-2 Fine-tuning Example

from transformers import (

AutoModelForCausalLM,

AutoTokenizer,

TrainingArguments,

Trainer,

DataCollatorForSeq2Seq

)

from datasets import load_dataset

from peft import LoraConfig, get_peft_model, TaskType

def main():

Load model and tokenizer

model_name = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

Load model (DeepSpeed manages device placement)

model = AutoModelForCausalLM.from_pretrained(

model_name,

torch_dtype=torch.float16,

device_map=None # DeepSpeed handles this

)

LoRA configuration (optional: parameter-efficient fine-tuning)

lora_config = LoraConfig(

task_type=TaskType.CAUSAL_LM,

r=16,

lora_alpha=32,

target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],

lora_dropout=0.1,

bias="none"

)

model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

Load dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def preprocess(example):

prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

tokens = tokenizer(prompt, truncation=True, max_length=512, padding="max_length")

tokens["labels"] = tokens["input_ids"].copy()

return tokens

tokenized_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)

TrainingArguments with DeepSpeed

training_args = TrainingArguments(

output_dir="./llama2-finetuned",

num_train_epochs=3,

per_device_train_batch_size=4,

gradient_accumulation_steps=8,

learning_rate=2e-4,

warmup_steps=100,

logging_steps=10,

save_steps=500,

bf16=True,

deepspeed="ds_zero2_config.json",

remove_unused_columns=False,

report_to="wandb",

run_name="llama2-alpaca-deepspeed"

)

trainer = Trainer(

model=model,

args=training_args,

train_dataset=tokenized_dataset,

data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True)

)

trainer.train()

trainer.save_model("./llama2-finetuned-final")

if __name__ == "__main__":

main()

**ZeRO-2 configuration file (ds_zero2_config.json):**

{

"bf16": {

"enabled": "auto"

"optimizer": {

"type": "AdamW",

"params": {

"lr": "auto",

"betas": "auto",

"eps": 1e-8,

"weight_decay": "auto"

}

"scheduler": {

"type": "WarmupDecayLR",

"params": {

"warmup_min_lr": "auto",

"warmup_max_lr": "auto",

"warmup_num_steps": "auto",

"total_num_steps": "auto"

}

"zero_optimization": {

"stage": 2,

"offload_optimizer": {

"device": "cpu",

"pin_memory": true

"allgather_partitions": true,

"allgather_bucket_size": 2e8,

"overlap_comm": true,

"reduce_scatter": true,

"reduce_bucket_size": 2e8,

"contiguous_gradients": true

"gradient_accumulation_steps": "auto",

"gradient_clipping": "auto",

"steps_per_print": 100,

"train_batch_size": "auto",

"train_micro_batch_size_per_gpu": "auto",

"wall_clock_breakdown": false

}

Launch:

deepspeed --num_gpus=8 train_llama.py \

--deepspeed ds_zero2_config.json

12.2 Performance Benchmarks

Measured results on 8x A100 80GB:

| Configuration | Max Model Size | Throughput (tokens/sec) |

| -------------------- | -------------- | ----------------------- |

| Baseline DDP | ~7B | 12,000 |

| ZeRO-1 | ~14B | 11,500 |

| ZeRO-2 | ~35B | 10,800 |

| ZeRO-3 | ~175B | 9,200 |

| ZeRO-3 + CPU Offload | ~350B | 4,100 |

| ZeRO-Infinity | ~1T+ | 1,800 |

13. Debugging and Troubleshooting

13.1 Common Issues and Solutions

**OOM (Out of Memory) errors:**

Profile memory usage

ds_config = {

"memory_breakdown": True,

"wall_clock_breakdown": True,

}

Step-by-step resolution:

1. Increase ZeRO stage (1 -> 2 -> 3)

2. Enable CPU Offload

3. Enable activation checkpointing

4. Reduce batch size

5. Reduce sequence length

**Gradient overflow (fp16):**

{

"fp16": {

"enabled": true,

"loss_scale": 0,

"loss_scale_window": 500,

"initial_scale_power": 12,

"hysteresis": 2,

"min_loss_scale": 1e-8

}

**ZeRO-3 parameter gather error:**

Accessing parameters directly under ZeRO-3

from deepspeed import zero

with zero.GatheredParameters(model.parameters()):

Full parameters available only within this block

weight_norm = model.weight.norm()

13.2 Performance Tuning Tips

1. Overlap communication with computation

zero_config = {

"overlap_comm": True,

"contiguous_gradients": True,

}

2. Optimize bucket sizes (tune for GPU memory and bandwidth)

zero_config["allgather_bucket_size"] = 5e8 # 500MB

zero_config["reduce_bucket_size"] = 5e8

3. ZeRO-3 prefetching

zero_config["stage3_prefetch_bucket_size"] = 5e7

zero_config["stage3_param_persistence_threshold"] = 1e6

4. Compile optimization (PyTorch 2.0+)

model = torch.compile(model)

Conclusion

DeepSpeed is an essential tool for modern LLM training. Its ZeRO optimization enables training of large-scale models that would be impossible on a single GPU, at a reasonable cost.

Key takeaways:

- **ZeRO-1/2**: Optimizer/gradient partitioning improves data parallelism efficiency

- **ZeRO-3**: Also partitions parameters, providing memory savings proportional to GPU count

- **ZeRO-Offload**: Offloads to CPU/NVMe, enabling large model training on small clusters

- **Pipeline + Tensor Parallelism**: Horizontal/vertical model distribution maximizes communication efficiency

- **DeepSpeed Inference**: Kernel optimizations provide 2-5x inference speedups

For real projects, ZeRO-2 + CPU Offload is a good starting point. Move to ZeRO-3 for larger models, and use DeepSpeed Inference's kernel injection for serving.

References

- [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed)

- [DeepSpeed Documentation](https://www.deepspeed.ai/docs/config-json/)

- [ZeRO Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)

- [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat)

- [ZeRO-Infinity](https://arxiv.org/abs/2104.07857)