- Introduction
- MoE Basic Structure and Mathematical Principles
- Routing Strategies: Top-k, Expert Choice, Hash Routing
- Load Balancing and Auxiliary Loss
- Evolution from Switch Transformer to DeepSeek-V3
- Training Stability and Troubleshooting
- Inference Optimization: Expert Parallelism, Offloading
- Operations Checklist
- Failure Cases and Recovery
- References

Introduction
The number of parameters in large language models (LLMs) is growing exponentially, but Dense models that activate all parameters for every token have hit a wall of computational cost. Training a GPT-4 level Dense model requires tens of thousands of GPUs running for months, and inference costs also increase proportionally to the number of parameters. To address this fundamental inefficiency, the conditional computation paradigm has emerged, with Sparse Mixture of Experts (MoE) architecture at its center.
The core idea of MoE is simple: keep up to hundreds of expert networks, but for each input token activate only a few of them, dramatically reducing computation. Since the total number of parameters determines model capacity and the number of active parameters determines actual computational cost, high quality and low cost can be achieved simultaneously. Mixtral 8x7B activates only 13B of its 47B total parameters per token, and DeepSeek-V3 activates only 37B out of 671B parameters.
This article provides an in-depth treatment of the mathematical foundations of MoE, routing strategies, load balancing, architectural evolution from Switch Transformer to DeepSeek-V3 and Qwen3-235B-A22B, and practical optimization strategies for training and inference. Each topic is explained with PyTorch code, including failure cases and recovery procedures encountered in production environments.
MoE Basic Structure and Mathematical Principles
Mathematical Definition of Sparse Activation
An MoE layer consists of N expert networks E_1, E_2, ..., E_N and a gating network G. The output y of an MoE layer for input token x is defined as follows:
y = sum_{i=1}^{N} G(x)_i * E_i(x)
Here, G(x) is the gating function, which outputs an N-dimensional vector for input x. In Dense MoE, all G(x)_i have non-zero values, but in Sparse MoE, only Top-K experts are selected and the gating values of the rest are set to 0.
G(x)_i = softmax(W_g * x + noise)_i (if i in TopK)
G(x)_i = 0 (otherwise)
The noise term promotes exploration during training to increase the diversity of expert utilization. Thanks to this sparsity, while total parameters grow proportionally to N, actual computation (FLOPs) is proportional to K, making it N/K times more efficient than Dense models.
Expert Network Structure
Each expert typically replaces the Feed-Forward Network (FFN) in a Transformer. The standard design is for Self-Attention to be shared across all tokens, with only the FFN portion separated by expert. This is because Self-Attention plays a global role in capturing inter-token relationships, while FFN plays a local role in transforming individual token representations.
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
@dataclass
class MoEConfig:
d_model: int = 1024
d_ff: int = 4096
num_experts: int = 8
top_k: int = 2
dropout: float = 0.1
aux_loss_weight: float = 0.01
class SwiGLUExpert(nn.Module):
"""Expert FFN using SwiGLU activation.
The standard structure adopted by modern models like LLaMA, Mistral, etc."""
def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
super().__init__()
self.w_gate = nn.Linear(d_model, d_ff, bias=False)
self.w_up = nn.Linear(d_model, d_ff, bias=False)
self.w_down = nn.Linear(d_ff, d_model, bias=False)
self.dropout = nn.Dropout(dropout)
def forward(self, x: torch.Tensor) -> torch.Tensor:
gate = F.silu(self.w_gate(x))
up = self.w_up(x)
return self.w_down(self.dropout(gate * up))
class TopKGating(nn.Module):
"""Top-K Gating Network.
Noisy Top-K Gating (Shazeer et al., 2017) implementation."""
def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
super().__init__()
self.num_experts = num_experts
self.top_k = top_k
self.gate = nn.Linear(d_model, num_experts, bias=False)
self.noise_linear = nn.Linear(d_model, num_experts, bias=False)
def forward(self, x: torch.Tensor):
# x: (batch * seq_len, d_model)
logits = self.gate(x)
if self.training:
noise = F.softplus(self.noise_linear(x))
logits = logits + noise * torch.randn_like(logits)
probs = F.softmax(logits, dim=-1)
top_k_probs, top_k_indices = torch.topk(probs, self.top_k, dim=-1)
# Renormalization: ensure selected expert weights sum to 1
top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
return top_k_probs, top_k_indices, probs
In the code above, SwiGLUExpert is an expert using the SwiGLU activation function adopted as the standard by modern models like LLaMA, Mistral, and Qwen. It has been empirically confirmed to have higher training efficiency compared to conventional ReLU or GELU. TopKGating implements the Noisy Top-K Gating proposed by Shazeer et al. (2017), which adds learnable noise to gating logits during training to promote expert exploration.
Dense vs Sparse MoE Quantitative Comparison
| Item | Dense 70B | Sparse MoE 8x7B (Top-2) | Sparse MoE 256x3B (Top-2) |
|---|---|---|---|
| Total Parameters | 70B | 47B | 768B |
| Active Parameters | 70B | 13B | 6B |
| FLOPs/token | 140 GFLOPs | 26 GFLOPs | 12 GFLOPs |
| GPU Memory (FP16) | 140 GB | 94 GB | 1.5 TB |
| Training Cost Ratio | 1.0x | 0.35x (FLOPs basis) | 0.17x (FLOPs basis) |
| Inference Speed | Baseline | 2-3x faster | Expert loading bottleneck |
A noteworthy point is the memory requirement of MoE models. Although active parameters are few, all parameters must be kept in memory, so when the number of experts is very large, memory can actually be larger than Dense models. This is the fundamental reason why Expert Parallelism and Offloading strategies are necessary.
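Two rules of thumb reproduce most of the table above: FLOPs/token is roughly 2 × active parameters (one multiply-add per weight), and FP16 memory is 2 bytes × total parameters. A minimal sketch (function name is illustrative):

```python
def moe_cost_estimate(total_params_b: float, active_params_b: float) -> dict:
    """Rough per-token cost estimate for an MoE (or Dense) model.

    Assumes FLOPs/token ~= 2 * active parameters and FP16 storage
    at 2 bytes per parameter; parameter counts are in billions.
    """
    return {
        "flops_per_token_g": 2 * active_params_b,  # GFLOPs per token
        "fp16_memory_gb": 2 * total_params_b,      # GB to hold all weights
        "sparsity_ratio": total_params_b / active_params_b,
    }

# Dense 70B: every parameter is active
dense = moe_cost_estimate(70, 70)     # 140 GFLOPs/token, 140 GB
# Mixtral-style 8x7B with Top-2: 47B total, 13B active
mixtral = moe_cost_estimate(47, 13)   # 26 GFLOPs/token, 94 GB
```

The estimate ignores attention FLOPs and activation memory, but it is enough to see why a 256-expert model can be cheap to run per token yet enormous to store.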
Routing Strategies: Top-k, Expert Choice, Hash Routing
Routing is both the core and the most challenging design problem of MoE. The strategy for deciding which experts to activate significantly affects model quality, training stability, and inference efficiency.
Top-K Routing
The most traditional routing method: the gating network computes a score for each expert and selects the top K. Shazeer et al. (2017) proposed Top-2 routing, and Switch Transformer (Fedus et al., 2022) simplified it to Top-1, cutting communication costs in half.
Advantages of Top-1: Each token uses exactly one expert, minimizing All-to-All communication in distributed environments. Implementation is also straightforward.
Disadvantages of Top-1: Reliance on a single expert limits expressiveness, and the discrete gating decision increases the risk of expert collapse during early training.
Top-2 Compromise: Mixtral 8x7B and many modern models adopt Top-2. Since the outputs of two experts are combined through weighted sum, expressiveness is richer, and if one expert becomes unstable, the other compensates.
Expert Choice Routing
Expert Choice routing, proposed by Zhou et al. (2022), reverses the perspective. Instead of tokens choosing experts, experts choose which tokens to process. Since each expert selects the K tokens most suitable for itself, load balancing is structurally guaranteed.
class ExpertChoiceGating(nn.Module):
"""Expert Choice Routing implementation.
Each expert directly selects which tokens to process,
structurally guaranteeing load balancing."""
def __init__(
self,
d_model: int,
num_experts: int,
capacity_factor: float = 1.0,
):
super().__init__()
self.num_experts = num_experts
self.capacity_factor = capacity_factor
self.gate = nn.Linear(d_model, num_experts, bias=False)
def forward(self, x: torch.Tensor):
# x: (num_tokens, d_model)
num_tokens = x.shape[0]
expert_capacity = int(
num_tokens * self.capacity_factor / self.num_experts
)
# Gating scores: (num_tokens, num_experts)
gate_logits = self.gate(x)
# Scores from expert perspective: (num_experts, num_tokens)
gate_scores = F.softmax(gate_logits.T, dim=-1)
# Each expert selects top capacity tokens
top_k_scores, top_k_indices = torch.topk(
gate_scores, expert_capacity, dim=-1
) # (num_experts, capacity)
# Create dispatch mask
dispatch_mask = torch.zeros(
self.num_experts, num_tokens,
device=x.device, dtype=x.dtype,
)
dispatch_mask.scatter_(1, top_k_indices, top_k_scores)
return dispatch_mask, top_k_indices, top_k_scores
The key advantage of Expert Choice is that it achieves perfect load balancing without auxiliary loss. Since each expert is forced to process the same number of tokens, the expert collapse problem is fundamentally eliminated. However, there is an asymmetry where a single token may be selected by multiple experts or by no expert at all.
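The asymmetry is easy to observe directly from the dispatch mask: row sums are fixed by the capacity, while column sums vary per token. A standalone sketch of the same dispatch logic, with random logits standing in for the gating network:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, num_experts, capacity = 8, 4, 2  # capacity_factor = 1.0

# Stand-in for gate(x): (num_tokens, num_experts) logits
gate_logits = torch.randn(num_tokens, num_experts)
gate_scores = F.softmax(gate_logits.T, dim=-1)  # expert-perspective scores

# Each expert picks its top-`capacity` tokens, as in ExpertChoiceGating
top_scores, top_idx = torch.topk(gate_scores, capacity, dim=-1)
dispatch = torch.zeros(num_experts, num_tokens)
dispatch.scatter_(1, top_idx, top_scores)

# Every expert processes exactly `capacity` tokens (rows), but a given
# token may be picked by several experts or by none (columns)
picks_per_token = (dispatch > 0).sum(dim=0)
```

Summing `picks_per_token` always gives num_experts × capacity, regardless of how unevenly the picks land on individual tokens.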
Hash Routing
Hash Routing, proposed by Roller et al. (2021), completely eliminates learnable gating and assigns tokens to experts using hash functions. Since the gating network's parameters and computation are eliminated, inference overhead is minimized. However, because the assignment rule is fixed, it cannot reflect input semantics, and in practice, quality is lower compared to learnable routing, so it has not been adopted as mainstream.
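One variant studied by Roller et al. (2021) hashes the input token id. A minimal sketch (the md5-based hash here is illustrative, not the paper's exact function):

```python
import hashlib

def hash_route(token_id: int, num_experts: int) -> int:
    """Fixed token-to-expert assignment via a hash of the token id.

    No learnable parameters: the same vocabulary token is always
    routed to the same expert, regardless of context.
    """
    digest = hashlib.md5(str(token_id).encode()).digest()
    return int.from_bytes(digest[:4], "little") % num_experts

# Deterministic: the routing decision never changes across runs
route = hash_route(42, num_experts=8)
assert route == hash_route(42, num_experts=8)
```

Because the assignment is frozen, the load balance is as good as the hash spreads the token distribution, but a polysemous token can never be split across experts by context.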
Routing Strategy Comparison
| Strategy | Load Balancing | Expressiveness | Comm. Cost | Implementation Complexity | Representative Model |
|---|---|---|---|---|---|
| Top-1 | Aux Loss Required | Low | Minimum | Low | Switch Transformer |
| Top-2 | Aux Loss Required | Medium | Medium | Medium | Mixtral 8x7B |
| Top-K (K=6,8) | Aux Loss Required | High | High | Medium | DeepSeek-V3 (Top-8/256) |
| Expert Choice | Structurally Guaranteed | High | Medium | High | Research models |
| Hash Routing | Perfect | Low | Minimum | Minimum | Research models |
Load Balancing and Auxiliary Loss
The most serious problem in MoE training is expert collapse. This is a phenomenon where the gating network concentrates tokens on only a few experts, while the remaining experts lose training opportunities and effectively become dead parameters. Various load balancing techniques have been developed to prevent this.
Auxiliary Loss
This is the standard approach proposed by Switch Transformer. A loss term that encourages uniform token distribution across experts is added to the main language modeling loss.
def compute_load_balancing_loss(
gate_probs: torch.Tensor,
top_k_indices: torch.Tensor,
num_experts: int,
top_k: int,
) -> torch.Tensor:
"""Switch Transformer-style load balancing auxiliary loss computation.
Args:
gate_probs: Gating probabilities (num_tokens, num_experts)
top_k_indices: Selected expert indices (num_tokens, top_k)
num_experts: Number of experts
top_k: Number of selected experts
Returns:
Auxiliary loss scalar value
"""
num_tokens = gate_probs.shape[0]
# f_i: fraction of tokens assigned to expert i
expert_mask = F.one_hot(top_k_indices, num_experts).float()
# (num_tokens, top_k, num_experts) -> (num_tokens, num_experts)
expert_mask = expert_mask.sum(dim=1)
tokens_per_expert = expert_mask.sum(dim=0) # (num_experts,)
f = tokens_per_expert / (num_tokens * top_k)
# P_i: average gating probability for expert i
P = gate_probs.mean(dim=0) # (num_experts,)
# Auxiliary loss: N * sum(f_i * P_i)
# Achieves minimum value at uniform distribution
aux_loss = num_experts * (f * P).sum()
return aux_loss
class MoELayerWithAuxLoss(nn.Module):
"""Complete MoE layer implementation with auxiliary loss."""
def __init__(self, config: MoEConfig):
super().__init__()
self.config = config
self.experts = nn.ModuleList([
SwiGLUExpert(config.d_model, config.d_ff, config.dropout)
for _ in range(config.num_experts)
])
self.gating = TopKGating(
config.d_model, config.num_experts, config.top_k
)
self.aux_loss_weight = config.aux_loss_weight
def forward(self, x: torch.Tensor):
batch_size, seq_len, d_model = x.shape
x_flat = x.view(-1, d_model)
top_k_probs, top_k_indices, gate_probs = self.gating(x_flat)
# Compute auxiliary loss
aux_loss = self.aux_loss_weight * compute_load_balancing_loss(
gate_probs, top_k_indices,
self.config.num_experts, self.config.top_k,
)
# Compute expert outputs
output = torch.zeros_like(x_flat)
for k in range(self.config.top_k):
expert_indices = top_k_indices[:, k] # (num_tokens,)
expert_weights = top_k_probs[:, k] # (num_tokens,)
for i in range(self.config.num_experts):
mask = (expert_indices == i)
if mask.any():
expert_input = x_flat[mask]
expert_output = self.experts[i](expert_input)
output[mask] += expert_weights[mask].unsqueeze(-1) * expert_output
output = output.view(batch_size, seq_len, d_model)
return output, aux_loss
The auxiliary loss weight alpha is a very sensitive hyperparameter. Switch Transformer recommended alpha=0.01, but adjustment is needed depending on model scale and number of experts. If alpha is too large, language modeling quality degrades; if too small, the load balancing effect is negligible. ST-MoE (Zoph et al., 2022) proposed adding router z-loss to constrain the magnitude of gating logits themselves.
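In a full model, the per-layer auxiliary losses are simply summed into the main objective. A sketch, assuming each MoE layer returns its already-alpha-weighted aux loss (as MoELayerWithAuxLoss above does) and using a z-loss coefficient on the order of 1e-3 in the spirit of ST-MoE:

```python
from typing import Optional

import torch

def combine_moe_losses(
    lm_loss: torch.Tensor,
    aux_losses: list[torch.Tensor],
    z_losses: Optional[list[torch.Tensor]] = None,
    z_weight: float = 1e-3,
) -> torch.Tensor:
    """Total training loss for a model with multiple MoE layers.

    aux_losses come pre-scaled by alpha from each layer; z_weight is
    an illustrative value in the range ST-MoE reports working well.
    """
    total = lm_loss + torch.stack(aux_losses).sum()
    if z_losses is not None:
        total = total + z_weight * torch.stack(z_losses).sum()
    return total
```

Keeping the aux and z terms as separate scalars, rather than folding them into the LM loss early, also makes them easy to log individually, which matters for the monitoring discussed below.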
DeepSeek's Auxiliary-Loss-Free Strategy
DeepSeek-V3 (2024) proposed an innovative method to achieve load balancing without auxiliary loss. It adds a non-trainable bias term to each expert and uses a dynamic adjustment mechanism that lowers the bias of experts where tokens are excessively concentrated and raises the bias of underutilized experts during training. This approach completely eliminates the problem of auxiliary loss interfering with the main training objective, showing benefits in both training stability and final model quality.
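The mechanism can be sketched in two parts: selection uses biased scores, while output weights use the raw scores, and the bias is nudged after each step. Names and the gamma value are illustrative; DeepSeek-V3's exact formulation differs in detail:

```python
import torch

def bias_adjusted_topk(scores: torch.Tensor, bias: torch.Tensor, k: int):
    """Top-K selection on (scores + bias); output weights use raw scores.

    The bias steers which experts get selected but never enters the
    gating weights, so it adds no gradient signal to the main loss.
    `scores` are assumed positive (e.g. sigmoid affinities).
    """
    _, idx = torch.topk(scores + bias, k, dim=-1)
    weights = torch.gather(scores, -1, idx)
    return weights / weights.sum(dim=-1, keepdim=True), idx

@torch.no_grad()
def update_bias(bias: torch.Tensor, expert_counts: torch.Tensor,
                gamma: float = 1e-3) -> torch.Tensor:
    """After each step: lower the bias of overloaded experts, raise the
    bias of underloaded ones. gamma is the update speed (illustrative)."""
    load_error = expert_counts.float().mean() - expert_counts.float()
    return bias + gamma * torch.sign(load_error)
```

Because the bias is non-trainable and only affects the argmax, the balancing pressure never competes with the language modeling gradient, which is the point of the design.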
Evolution from Switch Transformer to DeepSeek-V3
Switch Transformer (Fedus et al., 2022)
Switch Transformer is the key paper that simplified MoE routing to Top-1. It reduced the communication cost of existing Top-2 routing by half while ensuring training stability through appropriate capacity factor and auxiliary loss. It trained a 1.6T parameter model 4x faster than T5-XXL while achieving equivalent quality.
Key design decisions:
- Top-1 Routing: Minimized communication cost
- Capacity Factor: Dynamically adjusted expert buffer size to prevent token dropping
- Selective Precision: Mixed gating in FP32 and expert computation in BF16 for simultaneous stability and efficiency
GShard (Lepikhin et al., 2021)
Google's GShard established the distributed training pipeline for 600B parameter MoE models. It used Top-2 routing with Group-level balancing and presented a framework for efficient training across thousands of TPUs using the SPMD (Single Program Multiple Data) programming model.
Mixtral 8x7B (Jiang et al., 2024)
Mistral AI's Mixtral is a milestone that proved the practicality of open-source MoE models. Selecting Top-2 from 8 experts, it activates 13B of its 47B total parameters per token. It showed benchmark performance equal to or better than LLaMA-2 70B, while its inference FLOPs are about one-third those of a 70B Dense model.
DeepSeek-V3 (DeepSeek, 2024)
DeepSeek-V3 achieved innovation across multiple design areas of MoE architecture.
- Fine-grained Expert Segmentation: Uses 256 small-scale experts with Top-8 selection. Increasing the number of experts while reducing their size deepens specialization, improving model quality.
- Shared Expert: One shared expert processes all tokens to handle common knowledge, while routed experts handle specialized knowledge.
- Auxiliary-Loss-Free Load Balancing: The bias-based dynamic balancing described above eliminates side effects of auxiliary loss.
- Multi-Token Prediction (MTP): Added a training objective that predicts multiple tokens at once to improve data efficiency.
- FP8 Training: Trained the 671B parameter model on 2048 H800 GPUs with FP8 precision, dramatically reducing costs.
Qwen3-235B-A22B (Alibaba, 2025)
Qwen3-235B-A22B is an MoE model that activates only 22B out of a total 235B parameters, selecting Top-8 from 128 experts. By converting the existing Qwen2.5 series Dense architecture to MoE, it achieved GPT-4o level performance at approximately one-tenth the inference cost.
MoE Model Comparison
| Model | Total Params | Active Params | Num Experts | Top-K | Shared Expert | Routing Strategy |
|---|---|---|---|---|---|---|
| Switch Transformer | 1.6T | ~100B | 2048 | 1 | None | Learned Top-1 |
| GShard | 600B | ~20B | 2048 | 2 | None | Learned Top-2 |
| Mixtral 8x7B | 47B | 13B | 8 | 2 | None | Learned Top-2 |
| DeepSeek-V3 | 671B | 37B | 256+1 | 8 | 1 | Bias-adjusted |
| Qwen3-235B-A22B | 235B | 22B | 128 | 8 | Yes | Learned Top-K |
| DBRX | 132B | 36B | 16 | 4 | None | Learned Top-4 |
Training Stability and Troubleshooting
Diagnosing Expert Collapse
Expert collapse is the most common and critical problem in MoE training. When the gating network concentrates tokens on specific experts, the gradients of remaining experts converge to 0, stopping their learning, and this imbalance forms a self-reinforcing loop that worsens over time.
import logging
from collections import defaultdict
logger = logging.getLogger(__name__)
class ExpertUtilizationMonitor:
"""Expert utilization monitoring and collapse detection tool.
Tracks each expert's utilization during training
and detects early signs of collapse.
"""
def __init__(
self,
num_experts: int,
collapse_threshold: float = 0.01,
window_size: int = 100,
):
self.num_experts = num_experts
self.collapse_threshold = collapse_threshold
self.window_size = window_size
self.history: list[dict[int, float]] = []
def record(self, expert_counts: dict[int, int], total_tokens: int):
"""Record expert utilization per batch."""
utilization = {
i: expert_counts.get(i, 0) / max(total_tokens, 1)
for i in range(self.num_experts)
}
self.history.append(utilization)
if len(self.history) > self.window_size:
self.history = self.history[-self.window_size:]
def detect_collapse(self) -> list[int]:
"""Detect expert collapse. Returns experts with utilization below threshold."""
if len(self.history) < self.window_size // 2:
return []
collapsed = []
for expert_id in range(self.num_experts):
recent_util = [
h[expert_id] for h in self.history[-self.window_size:]
]
avg_util = sum(recent_util) / len(recent_util)
if avg_util < self.collapse_threshold:
collapsed.append(expert_id)
if collapsed:
logger.warning(
f"Expert collapse detected! "
f"Experts {collapsed} have utilization below "
f"{self.collapse_threshold:.2%}. "
f"Consider increasing aux_loss_weight or "
f"reinitializing collapsed experts."
)
return collapsed
def get_load_imbalance_ratio(self) -> float:
"""Calculate load imbalance ratio.
1.0 means perfect balance; higher values mean more imbalance."""
if not self.history:
return 0.0
latest = self.history[-1]
utils = list(latest.values())
max_util = max(utils) if utils else 0
min_util = min(utils) if utils else 0
avg_util = sum(utils) / len(utils) if utils else 0
if avg_util == 0:
return float("inf")
return max_util / avg_util
Training Instability Causes and Countermeasures
| Symptom | Cause | Countermeasure |
|---|---|---|
| Loss spike | Gating logit explosion | Add Router z-loss, keep gating in FP32 |
| Expert collapse | Insufficient aux loss, high LR | Increase aux loss weight, extend LR warm-up |
| Expert overlap | Initialization similarity | Orthogonal expert initialization, diversity regularization |
| Token dropping | Insufficient capacity factor | Increase capacity factor to 1.25-1.5 |
| Gating oscillation | Excessive learning rate | Separate gating LR to 0.1x of main LR |
Hyperparameter Guide for Stable Training
The key principles for training stability are as follows. First, gating network computations must be performed in FP32. In BF16 or FP16, numerical instability of softmax causes routing oscillation. Second, extend learning rate warmup 2-3x longer than for Dense models. Applying a high learning rate before gating stabilizes causes expert collapse. Third, set batch size as large as possible. With small batches, gating token distribution is sensitive to noise and becomes unstable.
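The separate gating learning rate from the countermeasure table can be implemented with optimizer parameter groups. A sketch; selecting parameters whose name contains "gating" matches the MoELayerWithAuxLoss naming above, and base_lr is illustrative:

```python
import torch

def build_param_groups(model: torch.nn.Module,
                       base_lr: float = 3e-4,
                       gating_lr_scale: float = 0.1):
    """Optimizer param groups with a reduced learning rate for gating.

    Adapt the 'gating' name filter to your own module naming; the
    0.1x scale follows the countermeasure table above.
    """
    gate_params, other_params = [], []
    for name, param in model.named_parameters():
        (gate_params if "gating" in name else other_params).append(param)
    return [
        {"params": other_params, "lr": base_lr},
        {"params": gate_params, "lr": base_lr * gating_lr_scale},
    ]
```

Usage is a one-liner: `optimizer = torch.optim.AdamW(build_param_groups(model))`. The same grouping can later be reused for separate gradient clipping.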
Inference Optimization: Expert Parallelism, Offloading
MoE model inference poses fundamentally different challenges from Dense models. Although active parameters are few, all parameters must be kept accessible, so memory management and expert placement strategies are critical.
Expert Parallelism
Expert Parallelism (EP) is a strategy for distributing experts across multiple GPUs. When distributing N experts across P GPUs, each GPU stores only N/P experts. When a token is routed to a specific expert, All-to-All communication sends the token to the corresponding GPU, and the computation result is returned to the original GPU.
import torch
import torch.distributed as dist
from typing import Optional
class ExpertParallelRouter:
"""Token dispatch/gather implementation for Expert Parallelism.
Each GPU handles a subset of experts,
routing tokens via All-to-All communication.
"""
def __init__(
self,
num_experts: int,
ep_group: Optional[dist.ProcessGroup] = None,
):
self.num_experts = num_experts
self.ep_group = ep_group
self.ep_size = dist.get_world_size(ep_group) if ep_group else 1
self.ep_rank = dist.get_rank(ep_group) if ep_group else 0
self.experts_per_rank = num_experts // self.ep_size
def dispatch(
self,
tokens: torch.Tensor,
expert_indices: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
"""Dispatch tokens to responsible GPUs.
Args:
tokens: (num_tokens, d_model) input tokens
expert_indices: (num_tokens,) target expert index for each token
Returns:
dispatched_tokens: tokens this GPU should process
recv_counts: number of tokens received from each GPU
"""
# Calculate number of tokens to send to each GPU
send_counts = torch.zeros(
self.ep_size, dtype=torch.long, device=tokens.device
)
for rank in range(self.ep_size):
start_expert = rank * self.experts_per_rank
end_expert = start_expert + self.experts_per_rank
mask = (expert_indices >= start_expert) & (
expert_indices < end_expert
)
send_counts[rank] = mask.sum()
# Exchange receive counts via All-to-All
recv_counts = torch.zeros_like(send_counts)
dist.all_to_all_single(
recv_counts, send_counts, group=self.ep_group
)
# Sort tokens and All-to-All transfer
sorted_indices = torch.argsort(expert_indices)
sorted_tokens = tokens[sorted_indices]
send_splits = send_counts.tolist()
recv_splits = recv_counts.tolist()
dispatched_tokens = torch.zeros(
int(recv_counts.sum()), tokens.shape[1],
dtype=tokens.dtype, device=tokens.device,
)
dist.all_to_all_single(
dispatched_tokens, sorted_tokens,
output_split_sizes=recv_splits,
input_split_sizes=send_splits,
group=self.ep_group,
)
return dispatched_tokens, recv_counts
Expert Offloading
When GPU memory is insufficient, this strategy stores inactive experts in CPU memory or on NVMe SSD and loads them to the GPU only when needed. It is used extensively in DeepSpeed-MoE and in inference optimization for Mixtral-class models.
The key to offloading is prefetching. By asynchronously loading experts that will be activated in the next layer onto the GPU while the current layer's expert computation is in progress, expert swap latency can be hidden. With PCIe 4.0 x16, approximately 32 GB/s bandwidth is available, allowing transfer of a single expert (several hundred MB) within a few milliseconds.
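The prefetch idea can be sketched with a dedicated CUDA copy stream and pinned host memory (class and method names are illustrative, and a CUDA device is required):

```python
import torch

class ExpertPrefetcher:
    """Asynchronously stage CPU-resident expert weights onto the GPU.

    Sketch: expert weights live in pinned CPU memory; `prefetch` for
    layer L+1 is issued while layer L computes, and `wait` is called
    just before the prefetched weights are used.
    """
    def __init__(self):
        self.copy_stream = torch.cuda.Stream()

    def prefetch(self, cpu_weights: dict) -> dict:
        gpu_weights = {}
        with torch.cuda.stream(self.copy_stream):
            for name, w in cpu_weights.items():
                # non_blocking=True overlaps the PCIe transfer with
                # compute, but only if the source tensor is pinned
                gpu_weights[name] = w.to("cuda", non_blocking=True)
        return gpu_weights

    def wait(self):
        # Block the compute stream until the copies have landed
        torch.cuda.current_stream().wait_stream(self.copy_stream)
```

The `wait` call is the synchronization point: issuing it right before the expert forward pass, rather than right after `prefetch`, is what lets the transfer hide behind the previous layer's computation.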
Inference Optimization Strategy Comparison
| Strategy | GPU Memory | Inference Latency | Throughput | Suitable Scenario |
|---|---|---|---|---|
| Full Model on GPU | Maximum | Minimum | Maximum | High-end multi-GPU server |
| Expert Parallelism | Distributed | Communication overhead | High | Multi-GPU cluster |
| CPU Offloading | Minimum | Loading latency | Medium | Limited GPU environment |
| NVMe Offloading | Minimum | High loading latency | Low | Single GPU environment |
| Speculative Expert Prefetch | Medium | Medium | High | Batch inference server |
Operations Checklist
The following is a list of items that must be checked when deploying MoE models to production.
Training Phase
- Verify gating precision: Confirm that the gating network's forward/backward operations are performed in FP32. BF16 gating may appear normal initially but can cause instability after tens of thousands of steps.
- Build load balancing metrics dashboard: Monitor per-expert token allocation, max/min utilization ratios, and auxiliary loss values in real-time.
- Checkpoint strategy: In expert parallel environments, checkpoints may be saved separately per GPU. Prepare a script to consolidate the full model in advance.
- Capacity factor tuning: If the token drop rate exceeds 1%, increase the capacity factor. Dropped tokens are only passed through residual connections, degrading quality.
- Set expert collapse alerts: Trigger alerts when a specific expert's utilization drops below 10% of the average, and reinitialize the affected expert if necessary.
Inference/Deployment Phase
- Memory profiling: Verify whether all parameters can fit in GPU memory; if not, choose EP or Offloading strategy.
- Batch size optimization: In MoE inference, batch size directly affects expert utilization efficiency. With small batches, only some experts are activated, leading to poor GPU utilization.
- KV Cache management: MoE models also require KV Cache management since Attention layers are identical to Dense models. Combining with PagedAttention (vLLM) is efficient.
- Routing consistency testing: Verify that the same experts are selected for the same input. Especially when mixing Tensor Parallelism and Expert Parallelism, numerical errors can cause different routing decisions.
- Fallback strategy: Implement fallback logic to substitute the next-ranked expert when a specific expert fails to load.
- A/B testing pipeline: Verify quality equivalence of MoE models versus Dense models in the serving environment.
Failure Cases and Recovery
Case 1: Quality Degradation from Expert Collapse
Symptom: After 30,000 training steps, benchmark scores suddenly decline. The loss itself decreases normally, but generation quality deteriorates.
Root Cause Analysis: Monitoring revealed that 2 out of 8 experts were processing over 60% of all tokens, while 3 experts had utilization under 2%. The auxiliary loss weight (alpha=0.001) was too low for effective balancing.
Recovery Procedure:
- Roll back to checkpoint just before expert collapse (20,000 steps)
- Increase auxiliary loss weight 10x from 0.001 to 0.01
- Reinitialize collapsed expert parameters with parameters from active experts
- Set gating network learning rate to 0.1x of main learning rate
- After retraining, monitor expert utilization until it stabilizes in a balanced range (8-17% per expert, around the 12.5% uniform average for 8 experts)
Case 2: All-to-All Communication Bottleneck
Symptom: When training with Expert Parallelism across 64 GPUs, GPU utilization plummets to 40%. The profiler shows All-to-All communication accounting for 45% of total training time.
Root Cause Analysis: Network topology analysis revealed that expert placement did not consider network structure, causing excessive inter-node communication. The bandwidth gap between intra-node GPU communication (NVLink, 900 GB/s) and inter-node communication (InfiniBand, 400 Gb/s, i.e. 50 GB/s) was roughly 18x.
Recovery Procedure:
- Switch to Hierarchical All-to-All: separate intra-node and inter-node communication into two stages
- Rearrange expert placement to be topology-aware: place frequently co-activated experts on the same node
- Communication-computation overlap: pipeline expert computation with token dispatch for the next batch
Case 3: Expert Loading Latency During Inference
Symptom: When serving Mixtral 8x7B with CPU Offloading on a single GPU (24GB), time to first token (TTFT) exceeds 5 seconds.
Root Cause Analysis: Loading 2 experts from CPU to GPU at each layer takes 100-200ms, and processing 32 layers sequentially results in cumulative latency of 3.2-6.4 seconds.
Recovery Procedure:
- Implement expert prefetching: pre-compute gating scores for the next layer during current layer processing and asynchronously load required experts
- Hot expert caching: keep the top 2-3 most frequently activated experts resident on GPU
- Expert weight quantization: reduce expert size by 75% with INT4 quantization to shorten transfer time
- When PCIe bandwidth is the bottleneck, optimize CPU-GPU transfer using pinned memory
Case 4: Loss Spike During Training
Symptom: During large-scale MoE model (over 100B) training, loss spikes repeatedly every few thousand steps. Recovery occurs after each spike, but training time is wasted.
Root Cause Analysis: The softmax input logits of the gating network intermittently take very large values, causing numerical instability. Especially during BF16 training, the range of gating logits is narrower than FP32, making overflow more likely.
Recovery Procedure:
- Add Router z-loss to directly constrain the magnitude of gating logits.
def router_z_loss(gate_logits: torch.Tensor) -> torch.Tensor:
"""ST-MoE style Router z-loss.
Constrains the magnitude of gating logits to improve numerical stability.
Args:
gate_logits: (num_tokens, num_experts) gating logits
Returns:
z_loss scalar
"""
log_z = torch.logsumexp(gate_logits, dim=-1) # (num_tokens,)
z_loss = (log_z ** 2).mean()
return z_loss
- Force gating computations to FP32 to ensure numerical stability.
- Apply gradient clipping separately to the gating network (max_norm=1.0).
- Extend learning rate warmup period to 5-10% of total training.
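The separate gradient clipping step above can be done with two clip_grad_norm_ calls over disjoint parameter sets. The "gating" name filter matches the layer classes earlier in this article, and main_max_norm is an illustrative value:

```python
import torch

def clip_moe_gradients(model: torch.nn.Module,
                       gate_max_norm: float = 1.0,
                       main_max_norm: float = 5.0) -> None:
    """Clip gating-network gradients with a tighter bound than the rest.

    Call after loss.backward() and before optimizer.step()."""
    gate_params = [p for n, p in model.named_parameters() if "gating" in n]
    main_params = [p for n, p in model.named_parameters() if "gating" not in n]
    torch.nn.utils.clip_grad_norm_(gate_params, gate_max_norm)
    torch.nn.utils.clip_grad_norm_(main_params, main_max_norm)
```

Clipping the two sets separately keeps an exploding gating gradient from eating the whole clipping budget and silently shrinking the expert gradients.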
References
Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR, 23(120), 1-39. https://arxiv.org/abs/2101.03961
DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437
Cai, W. et al. (2024). A Survey on Mixture of Experts. https://arxiv.org/abs/2407.10671
FriendliAI. (2024). MoE Models Comparison: Architectures and Performance. https://friendli.ai/blog/moe-models-comparison
Zilliz. (2024). What is Mixture of Experts? A Complete Guide. https://zilliz.com/learn/what-is-mixture-of-experts
Wikipedia. Mixture of Experts. https://en.wikipedia.org/wiki/Mixture_of_experts
Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017. https://arxiv.org/abs/1701.06538
Jiang, A. Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088