
DeepSpeed Complete Guide: ZeRO Optimization and Large-Scale Model Training


Introduction

Training large language models (LLMs) like GPT-4, LLaMA, or Falcon on a single GPU is simply not feasible. Storing GPT-3's 175 billion parameters in fp16 alone requires approximately 350GB of GPU memory, and adding gradients and Adam optimizer states pushes the total to roughly 2.8TB. Microsoft developed DeepSpeed to solve exactly this problem.

DeepSpeed centers around the ZeRO (Zero Redundancy Optimizer) optimization technology, enabling training of models with hundreds of billions to trillions of parameters on ordinary GPU clusters. This guide covers everything from DeepSpeed's core concepts to production configurations.


1. Introduction to DeepSpeed

1.1 Background and Motivation

As deep learning models grow larger, three bottlenecks emerge:

  • Memory shortage: Model parameters, gradients, and optimizer states exceed GPU VRAM
  • Compute bottleneck: Single GPU FLOPS limits
  • Communication bottleneck: Gradient synchronization overhead in multi-GPU setups

Traditional data parallelism (DP) requires each GPU to hold a full copy of the model, failing to address the memory problem. Model parallelism is complex to implement and often inefficient.

DeepSpeed combines the strengths of both approaches and achieves revolutionary memory efficiency through its novel ZeRO optimization.

1.2 Integration with PyTorch

DeepSpeed requires minimal changes to existing PyTorch code. The core change is a single deepspeed.initialize() call that handles engine creation, distributed initialization, and optimizer wrapping simultaneously.

import deepspeed

# Existing PyTorch code
model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Switching to DeepSpeed (minimal changes)
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

1.3 Installation

pip install deepspeed

# Install with CUDA kernel builds
DS_BUILD_OPS=1 pip install deepspeed

# Verify installation
ds_report

2. ZeRO Optimization: Stage-by-Stage Analysis

ZeRO (Zero Redundancy Optimizer) is the core technology for eliminating memory redundancy in data-parallel training. In standard data parallelism, N GPUs all maintain identical model copies — an extreme waste of memory. ZeRO eliminates this redundancy incrementally.

Memory Usage Analysis During Training

Given Ψ model parameters, per-GPU memory in mixed-precision (fp16) training is:

Component                  Size        Description
Parameters (fp16)          2Ψ bytes    Forward/backward pass
Gradients (fp16)           2Ψ bytes    Backward pass results
Master parameters (fp32)   4Ψ bytes    For optimizer
Adam momentum (fp32)       4Ψ bytes    1st moment
Adam variance (fp32)       4Ψ bytes    2nd moment
Total                      16Ψ bytes

A 7B parameter model requires ~112GB; a 70B model needs ~1.1TB.
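As a sanity check, the breakdown above can be computed directly. A small illustrative helper (decimal GB; activations and buffers are not counted):

```python
def mixed_precision_memory_gb(num_params: float) -> dict:
    """Per-GPU memory (decimal GB) for mixed-precision Adam training
    without ZeRO, following the 16-bytes-per-parameter breakdown above."""
    parts = {
        "params_fp16": 2 * num_params,
        "grads_fp16": 2 * num_params,
        "master_fp32": 4 * num_params,
        "adam_momentum_fp32": 4 * num_params,
        "adam_variance_fp32": 4 * num_params,
    }
    parts["total"] = sum(parts.values())
    return {name: nbytes / 1e9 for name, nbytes in parts.items()}

print(f"7B:  {mixed_precision_memory_gb(7e9)['total']:.0f} GB")   # 112 GB
print(f"70B: {mixed_precision_memory_gb(70e9)['total']:.0f} GB")  # 1120 GB
```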

2.1 ZeRO Stage 1: Optimizer State Partitioning

ZeRO-1 partitions optimizer states — the largest memory component — across N GPUs.

  • Adam states: fp32 parameters + momentum + variance = 12Ψ bytes
  • With N GPUs: optimizer state per GPU = 12Ψ/N bytes

Memory savings:

Baseline: 16Ψ bytes per GPU
ZeRO-1: 4Ψ + 12Ψ/N bytes per GPU
(at N = 64: ≈4.2Ψ bytes, ~4x savings)

During updates, each GPU updates only its assigned parameter shard, then an All-gather operation restores the full parameters.

{
  "zero_optimization": {
    "stage": 1
  }
}

2.2 ZeRO Stage 2: Gradient Partitioning

ZeRO-2 also partitions gradients in addition to optimizer states.

During backpropagation, each GPU accumulates only the gradients corresponding to its parameter shard. A Reduce-scatter operation collects gradients for each GPU's assigned segment.

Baseline: 16Ψ bytes per GPU
ZeRO-2: 2Ψ + (2Ψ + 12Ψ)/N bytes per GPU
(at N = 64: ≈2.2Ψ bytes, ~8x savings)

{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}

2.3 ZeRO Stage 3: Parameter Partitioning

ZeRO-3 also partitions the model parameters themselves. This is the most powerful stage but most complex to implement.

How it works:

  1. Each GPU permanently holds only 1/N of total parameters
  2. Forward pass: All-gather collects full parameters just before each layer executes
  3. After computation: parameters are immediately freed
  4. Backward pass: same on-demand parameter collection

ZeRO-3: (2Ψ + 2Ψ + 12Ψ)/N bytes per GPU
At N = 64: 16Ψ/64 = 0.25Ψ bytes (~64x savings!)

Tradeoff: All-gather communication occurs at every layer, increasing communication overhead. However, overlapping with pipeline execution makes the practical delay small.
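Putting the three stage formulas together, a small sketch (bytes per parameter, so Ψ = 1) reproduces the savings quoted above:

```python
def per_gpu_bytes_per_param(stage: int, n_gpus: int) -> float:
    """Approximate per-GPU memory in bytes per parameter for each ZeRO stage
    (fp16 training with Adam, following the formulas above)."""
    if stage == 0:                      # plain data parallelism
        return 2 + 2 + 12
    if stage == 1:                      # optimizer states partitioned
        return 2 + 2 + 12 / n_gpus
    if stage == 2:                      # + gradients partitioned
        return 2 + (2 + 12) / n_gpus
    if stage == 3:                      # + parameters partitioned
        return (2 + 2 + 12) / n_gpus
    raise ValueError(f"unknown stage {stage}")

for stage in range(4):
    print(f"ZeRO-{stage} at N=64: {per_gpu_bytes_per_param(stage, 64):.2f} bytes/param")
```

At N = 64 this prints 16.00, 4.19, 2.22 and 0.25 bytes per parameter, matching the ~4x, ~8x and ~64x figures.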

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

ZeRO-3 Python initialization:

import deepspeed
from transformers import AutoModelForCausalLM

# Model creation method matters for ZeRO-3
with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Parameters are partitioned from the start, preventing memory peaks

3. ZeRO-Offload and ZeRO-Infinity

3.1 ZeRO-Offload

ZeRO-Offload moves optimizer states and gradients to CPU memory, greatly reducing GPU memory pressure.

  • GPU: fp16 parameters + fp16 gradients (forward/backward)
  • CPU: fp32 master parameters + Adam states (optimizer update)

This enables training multi-billion parameter models even on a single GPU.
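A rough sizing sketch of that split (illustrative helper; activations, buffers, and framework overhead are ignored):

```python
# With ZeRO-Offload, the GPU keeps the fp16 parameters and gradients
# (4 bytes/param) while the fp32 master weights and Adam states
# (12 bytes/param) live in CPU RAM.
def offload_split_gb(num_params: float) -> tuple[float, float]:
    gpu_gb = 4 * num_params / 1e9
    cpu_gb = 12 * num_params / 1e9
    return gpu_gb, cpu_gb

gpu_gb, cpu_gb = offload_split_gb(13e9)
print(f"13B model: {gpu_gb:.0f} GB on GPU, {cpu_gb:.0f} GB in CPU RAM")
```

Under these assumptions a 13B model needs ~52 GB of GPU memory, within reach of a single 80GB A100, with 156 GB pushed to host RAM.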

{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  }
}

Parameter offload (ZeRO-3 + Offload):

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}

3.2 ZeRO-Infinity (NVMe Offload)

ZeRO-Infinity leverages NVMe SSDs to support virtually unlimited model sizes.

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8
    },
    "aio": {
      "block_size": 1048576,
      "queue_depth": 8,
      "thread_count": 1,
      "single_submit": false,
      "overlap_events": true
    }
  }
}

Bandwidth considerations:

  • GPU-CPU bandwidth: ~32 GB/s over PCIe 4.0 x16
  • CPU-NVMe bandwidth: ~7 GB/s for NVMe SSDs
  • NVMe is slower than CPU memory, adding training overhead, but enables previously impossible model sizes
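Those bandwidth numbers translate directly into transfer-time floors for the 12 bytes/parameter of offloaded optimizer state. A back-of-the-envelope sketch (upper bounds; DeepSpeed overlaps transfers with compute where possible):

```python
def transfer_seconds(num_params: float, bandwidth_gb_s: float) -> float:
    """Time to move the fp32 optimizer state (12 bytes/param) one way
    over a link with the given bandwidth (decimal GB/s)."""
    return 12 * num_params / (bandwidth_gb_s * 1e9)

print(f"7B over PCIe 4.0 x16 (~32 GB/s): {transfer_seconds(7e9, 32):.1f} s")
print(f"7B over NVMe (~7 GB/s):          {transfer_seconds(7e9, 7):.1f} s")
```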

4. Complete DeepSpeed Configuration Guide

4.1 ds_config.json Basic Structure

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,

  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "bf16": {
    "enabled": "auto"
  },

  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  },

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": 1e-8,
      "weight_decay": "auto"
    }
  },

  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },

  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  },

  "wall_clock_breakdown": false,
  "steps_per_print": 100
}

4.2 Mixed Precision Settings in Detail

fp16 configuration:

{
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "consecutive_hysteresis": false,
    "min_loss_scale": 1
  }
}

  • loss_scale: 0: Enables dynamic loss scaling
  • initial_scale_power: 16: Initial loss scale = 2^16 = 65536
  • loss_scale_window: Successful steps before scale increase
  • hysteresis: Overflow count before scale decrease
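The policy those knobs control can be sketched in plain Python (an illustrative class, not DeepSpeed's internal scaler): halve the scale after `hysteresis` overflows, double it after `loss_scale_window` overflow-free steps.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler mirroring the config fields above."""

    def __init__(self, initial_scale_power=16, loss_scale_window=1000, hysteresis=2):
        self.scale = 2.0 ** initial_scale_power
        self.window = loss_scale_window
        self.hysteresis = hysteresis
        self.good_steps = 0
        self.overflows = 0

    def update(self, found_overflow: bool):
        if found_overflow:
            self.good_steps = 0
            self.overflows += 1
            if self.overflows >= self.hysteresis:
                self.scale /= 2          # back off after repeated overflows
                self.overflows = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.window:
                self.scale *= 2          # grow after a stable window
                self.good_steps = 0
```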

bf16 configuration (Ampere+ GPUs):

{
  "bf16": {
    "enabled": true
  }
}

bf16 has wider dynamic range than fp16, eliminating the need for loss scaling. Recommended for A100/H100.
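The range difference is easy to verify numerically. numpy has no native bfloat16, but fp32 shares bf16's 8-bit exponent, so its maximum (~3.4e38) is a good proxy for bf16's range:

```python
import numpy as np

# fp16 tops out near 6.5e4, so large values overflow without loss scaling.
print(np.finfo(np.float16).max)   # 65504.0
print(np.float16(1e5))            # inf: overflows the fp16 range
print(np.finfo(np.float32).max)   # ~3.4e38 (same exponent range as bf16)
```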

4.3 Gradient Accumulation and Batch Sizes

Batch size relationship:

train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size

{
  "train_batch_size": 2048,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 64
}

This configuration on 8 GPUs: 4 × 64 × 8 = 2048 global batch size.

Using "auto" lets the Transformers Trainer automatically fill in values.
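A minimal validator for the relationship (an illustrative helper; DeepSpeed performs a similar consistency check at initialization):

```python
def check_batch_config(train_batch_size: int, micro_batch: int,
                       accum_steps: int, world_size: int) -> bool:
    """Verify train_batch_size = micro_batch x accum_steps x world_size."""
    expected = micro_batch * accum_steps * world_size
    if train_batch_size != expected:
        raise ValueError(
            f"train_batch_size={train_batch_size}, but "
            f"{micro_batch} x {accum_steps} x {world_size} = {expected}"
        )
    return True

print(check_batch_config(2048, 4, 64, 8))  # True
```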

4.4 Learning Rate Scheduler Options

{
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "last_batch_iteration": -1,
      "total_num_steps": 100000,
      "warmup_min_lr": 0,
      "warmup_max_lr": 3e-4,
      "warmup_num_steps": 2000,
      "warmup_type": "linear"
    }
  }
}

Supported scheduler types:

  • LRRangeTest
  • OneCycle
  • WarmupLR
  • WarmupDecayLR
  • WarmupCosineLR (custom)

5. DeepSpeed + PyTorch Integration

5.1 Basic Training Loop

import torch
import deepspeed
from transformers import GPT2LMHeadModel

def main():
    # Initialize the distributed environment
    deepspeed.init_distributed()

    # Create model
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Initialize the DeepSpeed engine; passing training_data lets DeepSpeed
    # build a distributed DataLoader for you
    model_engine, optimizer, train_loader, lr_scheduler = deepspeed.initialize(
        model=model,
        training_data=train_dataset,
        config="ds_config.json"
    )

    # Training loop
    for epoch in range(num_epochs):
        for batch in train_loader:
            input_ids = batch["input_ids"].to(model_engine.device)
            labels = batch["labels"].to(model_engine.device)

            # Forward pass
            outputs = model_engine(input_ids=input_ids, labels=labels)
            loss = outputs.loss

            # Backward pass (DeepSpeed handles loss scaling/accumulation)
            model_engine.backward(loss)

            # Parameter update (includes gradient clipping)
            model_engine.step()

            if model_engine.global_steps % 100 == 0:
                print(f"Step: {model_engine.global_steps}, Loss: {loss.item():.4f}")

5.2 Checkpoint Saving and Loading

# Save checkpoint
def save_checkpoint(model_engine, save_dir, tag=None):
    # Save in DeepSpeed format (includes distributed state)
    model_engine.save_checkpoint(save_dir, tag=tag)

# Example: save every 1000 steps
if model_engine.global_steps % 1000 == 0:
    save_checkpoint(model_engine, "./checkpoints", tag=f"step_{model_engine.global_steps}")

# Load checkpoint
_, client_sd = model_engine.load_checkpoint("./checkpoints", tag="step_1000")

# Extract weights in standard PyTorch format under ZeRO-3
if args.zero_stage == 3:
    # Gathers the full fp16 weights onto rank 0 (all ranks must call this)
    state_dict = model_engine._zero3_consolidated_16bit_state_dict()
    if model_engine.global_rank == 0:
        torch.save(state_dict, "model_weights.pt")

5.3 Transformers Trainer Integration

The simplest way to integrate DeepSpeed with HuggingFace Transformers:

from transformers import TrainingArguments, Trainer, AutoModelForCausalLM

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    fp16=True,
    deepspeed="ds_config.json",  # Path to DeepSpeed config
    logging_steps=10,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

Command-line execution:

deepspeed --num_gpus=8 train.py \
    --deepspeed ds_config.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset_name openwebtext \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 3 \
    --fp16

5.4 Accelerate Integration

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=8,
    gradient_clipping=1.0,
)

accelerator = Accelerator(
    mixed_precision="fp16",
    deepspeed_plugin=deepspeed_plugin,
)

model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

6. DeepSpeed Pipeline Parallelism

6.1 Pipeline Parallelism Concept

Pipeline parallelism places model layers sequentially across multiple GPUs. GPU 0 handles early layers, GPU 1 middle layers, and GPU N the final layers.

1F1B Schedule (One Forward, One Backward): Minimizes pipeline bubbles. Each GPU performs one microbatch forward pass immediately followed by a backward pass.
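The payoff from running many microbatches through the pipeline can be quantified with the standard idle-time ("bubble") estimate for a p-stage pipeline and m microbatches; non-interleaved 1F1B has the same bubble fraction as GPipe but far lower peak activation memory:

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of total step time each stage sits idle:
    (p - 1) / (m + p - 1) for p stages and m microbatches."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(f"{bubble_fraction(4, 8):.2f}")    # 0.27: 3/11 idle with few microbatches
print(f"{bubble_fraction(4, 64):.2f}")   # 0.04: more microbatches shrink the bubble
```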

6.2 PipelineModule Usage

from deepspeed.pipe import PipelineModule, LayerSpec
from deepspeed.runtime.pipe.topology import PipeDataParallelTopology

class GPT2PipelineModel(PipelineModule):
    def __init__(self, num_layers, hidden_size, num_heads, vocab_size):
        layers = [
            LayerSpec(EmbeddingLayer, vocab_size, hidden_size),
        ]
        for i in range(num_layers):
            layers.append(LayerSpec(TransformerBlock, hidden_size, num_heads))
        layers.append(LayerSpec(LMHead, hidden_size, vocab_size))

        super().__init__(
            layers=layers,
            loss_fn=cross_entropy_loss,
            num_stages=4,       # Pipeline stages (= number of GPUs)
            topology=PipeDataParallelTopology(num_pp=4, num_dp=2)
        )

# Add pipeline settings to ds_config
pipeline_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "pipeline": {
        "seed_layers": True,
        "activation_checkpoint_interval": 1
    }
}

6.3 Pipeline + ZeRO Combination

model_engine, _, _, _ = deepspeed.initialize(
    model=pipeline_model,
    config={
        "train_micro_batch_size_per_gpu": 2,
        "gradient_accumulation_steps": 8,
        "zero_optimization": {"stage": 1},  # Combine with ZeRO-1
        "fp16": {"enabled": True}
    }
)

# Pipeline training
for step in range(num_steps):
    loss = model_engine.train_batch()
    if model_engine.global_steps % 100 == 0:
        print(f"Step {model_engine.global_steps}, Loss: {loss.item():.4f}")

7. Tensor Parallelism

7.1 Megatron-DeepSpeed Integration

Tensor parallelism distributes individual layers across multiple GPUs by partitioning matrix multiplications along column or row dimensions.
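A toy numpy sketch of column partitioning (4 simulated ranks; the concatenate stands in for the all-gather that restores the full activation):

```python
import numpy as np

# Column-parallel linear layer: each "rank" owns a column slice of W and
# computes its partial output locally; gathering the slices reproduces
# the unpartitioned matmul exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))            # activations: batch of 2
W = rng.standard_normal((8, 16))           # full weight matrix

shards = np.split(W, 4, axis=1)            # 4-way column partition
partials = [x @ w for w in shards]         # each rank's local matmul
gathered = np.concatenate(partials, axis=1)

assert np.allclose(gathered, x @ W)        # matches the unpartitioned layer
```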

# Megatron-DeepSpeed configuration
megatron_config = {
  "tensor_model_parallel_size": 4,
  "pipeline_model_parallel_size": 2,
  "data_parallel_size": 4,  # world_size / (tp_size x pp_size)
}

Tensor parallel settings in ds_config.json:

{
  "tensor_parallel": {
    "tp_size": 4,
    "mpu": null,
    "tp_grain_size": 64
  }
}

Launch command:

deepspeed --num_nodes=4 --num_gpus=8 \
    pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 4 \
    --num-layers 96 \
    --hidden-size 12288 \
    --num-attention-heads 96 \
    --seq-length 2048 \
    --global-batch-size 1024 \
    --train-iters 500000 \
    --deepspeed \
    --deepspeed_config ds_config.json

8. Activation Checkpointing (Gradient Checkpointing)

8.1 Concept and Trade-offs

Activation checkpointing discards intermediate activations during the forward pass and recomputes them during backpropagation.

  • Memory savings: Reduces activation memory from O(layers) to O(sqrt(layers))
  • Speed cost: ~33% additional compute overhead
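The recompute-from-checkpoint idea reduces to a few lines of plain Python (no autograd; illustrative only): keep only every k-th activation, and rebuild any other one from the nearest stored checkpoint.

```python
def forward_with_checkpoints(layers, x, k):
    """Run the forward pass, saving only every k-th activation."""
    saved = {0: x}
    for i, layer in enumerate(layers, start=1):
        x = layer(x)
        if i % k == 0:
            saved[i] = x
    return x, saved

def recompute(layers, saved, i, k):
    """Rebuild activation i from the nearest checkpoint at or before it."""
    start = (i // k) * k
    x = saved[start]
    for layer in layers[start:i]:
        x = layer(x)
    return x

layers = [lambda v, a=a: v + a for a in range(1, 9)]   # 8 dummy "layers"
out, saved = forward_with_checkpoints(layers, 0, k=4)
# Only activations 0, 4, 8 are stored; activation 6 is recomputed from 4.
assert recompute(layers, saved, 6, k=4) == sum(range(1, 7))   # 21
```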

{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  }
}

# Also activate at PyTorch level
from deepspeed.runtime.activation_checkpointing import checkpointing

class TransformerLayer(nn.Module):
    def forward(self, x):
        return checkpointing.checkpoint(self._forward, x)

    def _forward(self, x):
        # Actual computation
        return self.attention(self.norm1(x)) + x

9. Mixture of Experts (MoE) with DeepSpeed

9.1 MoE Architecture Overview

Mixture of Experts routes each input token to only a subset of "expert" sub-networks, increasing parameter count while maintaining constant compute.
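The routing step can be sketched with a toy numpy gate (illustrative shapes and names): a linear gate scores each token against every expert, and only the top-k experts run for that token.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden, num_experts, k = 4, 8, 8, 2

tokens = rng.standard_normal((num_tokens, hidden))
gate_w = rng.standard_normal((hidden, num_experts))   # learned gating weights

logits = tokens @ gate_w                              # score every expert
top_k = np.argsort(logits, axis=1)[:, -k:]            # k expert ids per token

# Each token now dispatches to only k of the 8 experts, so compute stays
# roughly constant no matter how many experts (parameters) exist.
print(top_k.shape)   # (4, 2)
```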

from deepspeed.moe.layer import MoE

class MoETransformerBlock(nn.Module):
    def __init__(self, hidden_size, num_experts, top_k=2):
        super().__init__()
        self.attention = MultiHeadAttention(hidden_size)
        self.moe_layer = MoE(
            hidden_size=hidden_size,
            expert=FeedForward(hidden_size, hidden_size * 4),
            num_experts=num_experts,
            ep_size=1,           # Expert parallelism size
            k=top_k,             # Number of experts to activate
            capacity_factor=1.25,
            eval_capacity_factor=2.0,
            min_capacity=4,
            use_residual=False
        )

    def forward(self, x, attention_mask=None):
        x = x + self.attention(x, attention_mask)
        x, _, _ = self.moe_layer(x)
        return x

9.2 Expert Parallelism Configuration

{
  "zero_optimization": {
    "stage": 2
  },
  "moe_expert_parallel_size": 4,
  "moe": {
    "enabled": true,
    "ep_size": 4,
    "moe_param_group": true
  }
}

# Separate MoE parameter groups (important!)
def create_moe_param_groups(model):
    parameters = {
        "params": [],
        "name": "parameters"
    }
    moe_parameters = {
        "params": [],
        "moe": True,
        "name": "moe_parameters"
    }
    for module_name, module in model.named_modules():
        if isinstance(module, MoE):
            moe_parameters["params"].extend(
                [p for n, p in module.named_parameters() if "expert" in n]
            )
        else:
            parameters["params"].extend(module.parameters(recurse=False))
    return [parameters, moe_parameters]

10. DeepSpeed Inference

10.1 Inference Optimization

DeepSpeed Inference dramatically improves the inference speed of trained models.

import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Convert to DeepSpeed inference engine
ds_engine = deepspeed.init_inference(
    model=model,
    mp_size=1,               # Tensor parallel size
    dtype=torch.float16,     # Inference dtype
    checkpoint=None,
    replace_with_kernel_inject=True,   # Inject optimized kernels
    max_tokens=2048,
    replace_method="auto"
)

model = ds_engine.module

# Run inference
inputs = tokenizer("DeepSpeed is ", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7
    )
print(tokenizer.decode(outputs[0]))

10.2 Tensor Parallel Inference

# Multi-GPU inference (tensor parallelism)
ds_engine = deepspeed.init_inference(
    model=model,
    mp_size=4,               # Distribute across 4 GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    injection_policy={
        LlamaDecoderLayer: ("self_attn.o_proj", "mlp.down_proj")
    }
)

10.3 ZeRO-Inference

Large model inference on a single or few GPUs:

# ZeRO-3-based inference (stream parameters from CPU/NVMe)
ds_engine = deepspeed.init_inference(
    model=model,
    config={
        "tensor_parallel": {"tp_size": 1},
        "dtype": "fp16",
        "enable_cuda_graph": False,
        "zero": {
            "stage": 3,
            "offload_param": {
                "device": "cpu"
            }
        }
    }
)

11. DeepSpeed Chat (RLHF)

11.1 RLHF Pipeline Overview

DeepSpeed-Chat provides a complete RLHF (Reinforcement Learning from Human Feedback) pipeline.

Three training stages:

  1. SFT (Supervised Fine-Tuning)
  2. Reward Model Training
  3. RLHF fine-tuning with the PPO algorithm

11.2 SFT Stage

# Step 1: SFT
deepspeed main.py \
    --data_path Dahoas/rm-static \
    --data_split 2,4,4 \
    --model_name_or_path facebook/opt-1.3b \
    --per_device_train_batch_size 8 \
    --max_seq_len 512 \
    --learning_rate 9.65e-6 \
    --weight_decay 0.1 \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 0 \
    --seed 1234 \
    --zero_stage 2 \
    --deepspeed \
    --output_dir /output/sft

11.3 Reward Model Training

from deepspeed_chat.reward_model import RewardModel

# Reward model = SFT model + linear head
reward_model = RewardModel(
    base_model=sft_model,
    tokenizer=tokenizer,
    num_padding_at_beginning=0
)

# Train on preference pairs (chosen, rejected)
def compute_reward_loss(reward_model, chosen_ids, rejected_ids):
    chosen_reward = reward_model(chosen_ids)
    rejected_reward = reward_model(rejected_ids)
    # Preferred responses should have higher reward
    loss = -torch.mean(torch.log(torch.sigmoid(chosen_reward - rejected_reward)))
    return loss

11.4 PPO Training

from deepspeed_chat.ppo_trainer import DeepSpeedPPOTrainer

trainer = DeepSpeedPPOTrainer(
    rlhf_engine=rlhf_engine,
    args=args
)

for epoch in range(args.num_train_epochs):
    for step, batch in enumerate(prompt_dataloader):
        out = trainer.generate_experience(batch["input_ids"])
        # Update Actor/Critic models
        actor_loss, critic_loss = trainer.train_rlhf(out)

12. Practical Example: LLM Training

12.1 Complete Llama-2 Fine-tuning Example

import torch
import deepspeed
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType

def main():
    # Load model and tokenizer
    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # Load model (DeepSpeed manages device placement)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map=None  # DeepSpeed handles this
    )

    # LoRA configuration (optional: parameter-efficient fine-tuning)
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none"
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Load dataset
    dataset = load_dataset("tatsu-lab/alpaca", split="train")

    def preprocess(example):
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        tokens = tokenizer(prompt, truncation=True, max_length=512, padding="max_length")
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    tokenized_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)

    # TrainingArguments with DeepSpeed
    training_args = TrainingArguments(
        output_dir="./llama2-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        warmup_steps=100,
        logging_steps=10,
        save_steps=500,
        bf16=True,
        deepspeed="ds_zero2_config.json",
        remove_unused_columns=False,
        report_to="wandb",
        run_name="llama2-alpaca-deepspeed"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True)
    )

    trainer.train()
    trainer.save_model("./llama2-finetuned-final")

if __name__ == "__main__":
    main()

ZeRO-2 configuration file (ds_zero2_config.json):

{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": 1e-8,
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "total_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 100,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Launch:

deepspeed --num_gpus=8 train_llama.py \
    --deepspeed ds_zero2_config.json

12.2 Performance Benchmarks

Measured results on 8x A100 80GB:

Configuration          Max Model Size   Throughput (tokens/sec)
Baseline DDP           ~7B              12,000
ZeRO-1                 ~14B             11,500
ZeRO-2                 ~35B             10,800
ZeRO-3                 ~175B            9,200
ZeRO-3 + CPU Offload   ~350B            4,100
ZeRO-Infinity          ~1T+             1,800

13. Debugging and Troubleshooting

13.1 Common Issues and Solutions

OOM (Out of Memory) errors:

# Profile memory usage
ds_config = {
    "memory_breakdown": True,
    "wall_clock_breakdown": True,
}

# Step-by-step resolution:
# 1. Increase ZeRO stage (1 -> 2 -> 3)
# 2. Enable CPU Offload
# 3. Enable activation checkpointing
# 4. Reduce batch size
# 5. Reduce sequence length

Gradient overflow (fp16):

{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1e-8
  }
}

ZeRO-3 parameter gather error:

# Accessing parameters directly under ZeRO-3
from deepspeed import zero

with zero.GatheredParameters(model.parameters()):
    # Full parameters available only within this block
    weight_norm = model.weight.norm()

13.2 Performance Tuning Tips

# 1. Overlap communication with computation
zero_config = {
    "overlap_comm": True,
    "contiguous_gradients": True,
}

# 2. Optimize bucket sizes (tune for GPU memory and bandwidth)
zero_config["allgather_bucket_size"] = 5e8   # 500MB
zero_config["reduce_bucket_size"] = 5e8

# 3. ZeRO-3 prefetching
zero_config["stage3_prefetch_bucket_size"] = 5e7
zero_config["stage3_param_persistence_threshold"] = 1e6

# 4. Compile optimization (PyTorch 2.0+)
model = torch.compile(model)

Conclusion

DeepSpeed is an essential tool for modern LLM training. Its ZeRO optimization enables training of large-scale models that would be impossible on a single GPU, at a reasonable cost.

Key takeaways:

  • ZeRO-1/2: Optimizer/gradient partitioning improves data parallelism efficiency
  • ZeRO-3: Also partitions parameters, providing memory savings proportional to GPU count
  • ZeRO-Offload: Offloads to CPU/NVMe, enabling large model training on small clusters
  • Pipeline + Tensor Parallelism: Horizontal/vertical model distribution maximizes communication efficiency
  • DeepSpeed Inference: Kernel optimizations provide 2-5x inference speedups

For real projects, ZeRO-2 + CPU Offload is a good starting point. Move to ZeRO-3 for larger models, and use DeepSpeed Inference's kernel injection for serving.
