Deep Dive into Mixture of Experts (MoE) Architecture: From Switch Transformer to Mixtral and DeepSeek
- Introduction
- History and Evolution of MoE Architecture
- Mathematical Foundations of Sparse MoE
- Switch Transformer Analysis
- Mixtral 8x7B Architecture Details
- DeepSeek-V2/V3 Innovations: DeepSeekMoE
- Routing Strategy Comparison
- Training Stability Techniques
- Inference Optimization
- Major MoE Model Comparison
- Complete Implementation: Custom MoE Transformer Block
- Conclusion and Future Directions
- References

Introduction
In the era of large language models (LLMs), endlessly scaling model parameters hits fundamental walls in both training cost and inference cost. Dense Transformers activate all parameters for every input token, meaning that as parameter count grows, FLOPs grow proportionally. The Mixture of Experts (MoE) architecture solves this through conditional computation. The core idea is to maintain large model capacity while keeping actual computation constant by activating only a subset of expert networks based on the input.
Since Shazeer et al. proposed the Sparsely-Gated MoE in their 2017 paper "Outrageously Large Neural Networks," MoE has evolved rapidly -- from Google's Switch Transformer in 2021, to Mistral's Mixtral 8x7B in 2023, and DeepSeek-V2/V3 in 2024. By 2025, Meta adopted MoE with Llama 4, and DeepSeek-R1 built on the V3 architecture to maximize reasoning capabilities, gaining worldwide attention.
This article provides a paper-level deep dive from the mathematical foundations of MoE architecture through the design philosophies of major models, comparative analysis of routing strategies, training stability techniques, and inference optimization.
History and Evolution of MoE Architecture
The concept of MoE was first proposed by Jacobs et al. in their 1991 paper "Adaptive Mixtures of Local Experts." Initially, it was a simple approach of computing a weighted sum of multiple expert network outputs through a gating network.
The key milestones in modern MoE evolution are as follows:
- 2017: Shazeer et al. proposed LSTM-based Sparsely-Gated MoE with 4,096 experts and 100B+ parameters
- 2021: Google's Switch Transformer simplified routing to Top-1, achieving 1.6 trillion parameters
- 2022: Google's ST-MoE (Stable and Transferable MoE) systematized training stability techniques
- 2022: Expert Choice Routing paper proposed reverse routing where experts select tokens
- 2023: Mixtral 8x7B opened the era of open-source MoE with Top-2 routing and SwiGLU experts
- 2024: DeepSeek-V2 introduced fine-grained expert segmentation and shared experts (DeepSeekMoE), along with Multi-head Latent Attention
- 2024: DeepSeek-V3 added an auxiliary-loss-free load balancing strategy and achieved state-of-the-art results with 671B parameters (37B active)
- 2025: Llama 4 Scout (16 experts, 109B/17B active) marked Meta's first MoE adoption
Mathematical Foundations of Sparse MoE
Basic Formulation
The output of an MoE layer is defined as:

$$y = \sum_{i=1}^{N} G(x)_i \, E_i(x)$$

where $x$ is the hidden representation of the input token, $N$ is the number of experts, $E_i$ is the $i$-th expert network, and $G(x)_i$ is the weight assigned by the gating function to the $i$-th expert.
Sparse Gating Function
The Noisy Top-K gating function proposed by Shazeer et al. (2017) works as follows:

$$G(x) = \text{Softmax}(\text{TopK}(H(x), k))$$

$$H(x)_i = (x W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}\big((x W_{\text{noise}})_i\big)$$

$$\text{TopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is among the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$$

where $W_g$ is the gating weight matrix and $W_{\text{noise}}$ is the noise weight matrix. The TopK operation retains only the top $k$ values and sets the rest to $-\infty$, making them zero after the Softmax.
PyTorch Implementation: Basic Sparse Gating
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGating(nn.Module):
    """Noisy Top-K Gating mechanism for MoE."""

    def __init__(self, input_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(input_dim, num_experts, bias=False)
        self.noise = nn.Linear(input_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x shape: (batch_size, seq_len, input_dim)
        logits = self.gate(x)  # (batch, seq, num_experts)
        # Add noise during training to encourage exploration across experts
        if self.training:
            noise = torch.randn_like(logits) * F.softplus(self.noise(x))
            logits = logits + noise
        # Top-K selection
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)  # (batch, seq, top_k)
        # Sparse softmax: normalize only over the selected experts
        top_k_gates = F.softmax(top_k_logits, dim=-1)
        return top_k_gates, top_k_indices
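To make the sparse-softmax step concrete, here is a tiny standalone check with made-up logits for a single token and four experts:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 1.0, -1.0]])        # one token, four experts
top_k_logits, top_k_indices = logits.topk(2, dim=-1)  # experts 0 and 2 win
gates = F.softmax(top_k_logits, dim=-1)               # normalized over the pair only
print(top_k_indices.tolist())  # [[0, 2]]
```

Note the softmax is taken after the Top-K cut, so the selected gates always sum to 1 regardless of how much probability mass the discarded experts held.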
Switch Transformer Analysis
Core Innovation: Top-1 Routing
The core innovation of the Switch Transformer, published by Fedus et al. in 2021, is Top-1 routing. Previous research assumed that at least two experts needed to be activated for stable training, but Switch Transformer proved that selecting exactly one expert per token is sufficient.
The routed output is simply the product of the gating probability and the expert output:

$$y = p_i(x) \, E_i(x), \qquad i = \arg\max_j \, p_j(x), \qquad p(x) = \text{Softmax}(x W_r)$$
Architecture Characteristics
Switch Transformer replaces the FFN (Feed-Forward Network) layers of the T5 architecture with MoE layers. Each MoE layer can accommodate up to 2,048 experts, enabling models at the 1.6 trillion parameter scale. Top-1 routing cuts communication costs in half and simplifies the routing computation itself.
Performance
With 64 experts, Switch Transformer achieves 7x faster pretraining speed compared to T5-Base at equivalent compute. This is because model capacity increases while per-token computation remains constant.
PyTorch Implementation: Switch Transformer MoE Layer
class SwitchMoELayer(nn.Module):
    """Switch Transformer style MoE layer with Top-1 routing."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int,
                 capacity_factor: float = 1.25):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.ReLU(),
                nn.Linear(ffn_dim, hidden_dim)
            ) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        batch_size, seq_len, hidden_dim = x.shape
        x_flat = x.view(-1, hidden_dim)  # (B*S, D)
        num_tokens = x_flat.shape[0]
        # Router: Top-1 selection
        router_logits = self.router(x_flat)  # (B*S, E)
        router_probs = F.softmax(router_logits, dim=-1)
        expert_indices = router_probs.argmax(dim=-1)  # (B*S,)
        expert_gates = router_probs.gather(1, expert_indices.unsqueeze(-1)).squeeze(-1)
        # Capacity: max tokens per expert; overflow tokens are dropped
        # (their output stays zero)
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)
        # Dispatch tokens to experts
        output = torch.zeros_like(x_flat)
        for i in range(self.num_experts):
            # Use integer indices rather than a boolean mask: writing to
            # output[mask][:capacity] would modify a temporary copy and be lost
            token_idx = (expert_indices == i).nonzero(as_tuple=True)[0][:capacity]
            if token_idx.numel() == 0:
                continue
            expert_out = self.experts[i](x_flat[token_idx])
            gates = expert_gates[token_idx].unsqueeze(-1)
            output[token_idx] = expert_out * gates
        return output.view(batch_size, seq_len, hidden_dim)
Mixtral 8x7B Architecture Details
Design Philosophy
Mixtral 8x7B, released by Mistral AI in December 2023, is based on the Mistral 7B architecture with each Transformer layer's FFN replaced by an MoE layer consisting of 8 experts. It uses Top-2 routing to activate 2 experts per token.
Key Numbers
- Total parameters: 46.7B (8 experts x ~5.6B FFN + shared attention parameters)
- Active parameters: ~13B (2 expert FFNs per token + shared parameters)
- Expert function: SwiGLU FFN
- Attention: Grouped Query Attention (GQA)
- Context length: 32K tokens
- No sliding-window attention: unlike Mistral 7B, the full 32K context is attended densely
Top-2 Routing Formula
The MoE layer output in Mixtral is computed as:

$$y = \sum_{i \in \mathcal{T}} \tilde{g}_i \, E_i(x), \qquad \tilde{g}_i = \frac{\text{Softmax}(x W_g)_i}{\sum_{j \in \mathcal{T}} \text{Softmax}(x W_g)_j}, \qquad \mathcal{T} = \text{Top2}\big(\text{Softmax}(x W_g)\big)$$

The gating function computes a Softmax probability distribution over the 8 experts for input $x$ and selects the top 2. The gating weights of the selected 2 experts are renormalized to sum to 1.
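The renormalization step is small but easy to get wrong; with a made-up gate distribution:

```python
import torch

# Illustrative softmax output over 8 experts (not real Mixtral gate values)
gate_probs = torch.tensor([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01])
top2_probs, top2_idx = gate_probs.topk(2)
gates = top2_probs / top2_probs.sum()  # [0.40, 0.25] -> [0.40/0.65, 0.25/0.65]
```

Dividing by the selected pair's own sum (0.65 here) rescales the two weights to sum to 1, keeping the combined expert output on the same scale as a dense FFN output.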
SwiGLU Expert Network
Each expert is an FFN using the SwiGLU activation function:

$$\text{SwiGLU}(x) = W_2 \big( \text{SiLU}(W_1 x) \odot V x \big), \qquad \text{SiLU}(z) = z \cdot \sigma(z)$$
PyTorch Implementation: Mixtral MoE Block
class MixtralMoEBlock(nn.Module):
    """Mixtral-style MoE block with Top-2 SwiGLU experts."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 8):
        super().__init__()
        self.num_experts = num_experts
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, hidden_dim)
        gate_logits = self.gate(x)  # (batch, seq, num_experts)
        gate_probs = F.softmax(gate_logits, dim=-1)
        # Top-2 selection
        top2_probs, top2_indices = gate_probs.topk(2, dim=-1)
        # Renormalize gates to sum to 1
        top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)
        # Compute expert outputs and combine
        output = torch.zeros_like(x)
        for k in range(2):
            expert_idx = top2_indices[..., k]            # (batch, seq)
            gate_val = top2_probs[..., k].unsqueeze(-1)  # (batch, seq, 1)
            for i in range(self.num_experts):
                mask = (expert_idx == i)
                if mask.any():
                    expert_output = self.experts[i](x[mask])
                    output[mask] += gate_val[mask] * expert_output
        return output
class SwiGLUExpert(nn.Module):
    """SwiGLU Feed-Forward Network used as expert."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.v = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor):
        return self.w2(F.silu(self.w1(x)) * self.v(x))
DeepSeek-V2/V3 Innovations: DeepSeekMoE
Fine-Grained Expert Segmentation
DeepSeek-V2 (2024) took a fundamentally different approach from existing MoE models. The core idea is Fine-Grained Expert Segmentation -- splitting experts into smaller and more numerous units.
The original $N$ experts are increased to $mN$, while each expert's hidden dimension is reduced to $1/m$ of the original. Simultaneously, the number of activated experts increases proportionally from $K$ to $mK$, maintaining the same per-token computation while enabling finer-grained expert combinations.
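The accounting behind this claim can be sanity-checked in a few lines. The sizes below are illustrative, not DeepSeek's actual dimensions:

```python
from math import comb

d_model, d_ffn = 1024, 4096   # illustrative model/FFN widths
N, K = 16, 2                  # baseline: 16 experts, Top-2 routing
m = 4                         # segmentation factor

# Active FFN parameters per token (one up- and one down-projection per expert)
ffn_params = lambda hidden: 2 * d_model * hidden
active_base = K * ffn_params(d_ffn)             # baseline
active_fine = (m * K) * ffn_params(d_ffn // m)  # mK experts of 1/m the width
assert active_base == active_fine               # per-token compute is unchanged

# ...but the number of reachable expert combinations explodes
combos_base = comb(N, K)          # C(16, 2) = 120
combos_fine = comb(m * N, m * K)  # C(64, 8) = 4,426,165,368
```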
DeepSeek-V3 Architecture
DeepSeek-V3 (December 2024) features the following key configuration:
- Total parameters: 671B
- Active parameters: 37B (per token)
- Routed experts: 256 (per layer)
- Shared experts: 1 (per layer, always active)
- Active routed experts: 8 (per token)
- Attention: Multi-head Latent Attention (MLA)
Auxiliary-Loss-Free Load Balancing
One of the most innovative contributions of DeepSeek-V3 is its auxiliary-loss-free load balancing strategy. Traditional MoE models use auxiliary losses for load balancing, but calibrating the loss coefficient $\alpha$ is difficult, and excessive values degrade model performance.
Instead, DeepSeek-V3 adds a bias term $b_i$ to each expert, used only for routing decisions:

$$g_i = \begin{cases} s_i & \text{if } s_i + b_i \in \text{TopK}\big(\{s_j + b_j\}_{j=1}^{N},\, K\big) \\ 0 & \text{otherwise} \end{cases}$$

where $s_i$ is the token's affinity score for expert $i$. The bias term only affects expert selection and is not included in the actual gating weight computation. Overloaded experts have their $b_i$ decreased while underloaded experts have their $b_i$ increased, achieving load balance without contaminating the loss function.
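This can be sketched as follows. The function names and the simple fixed step `gamma` are illustrative; the core idea matches the description above: bias the selection, not the gates:

```python
import torch

def update_routing_bias(bias, tokens_per_expert, gamma=0.001):
    """Nudge each expert's routing bias toward balanced load (sketch)."""
    load = tokens_per_expert.float()
    overloaded = load > load.mean()
    # Overloaded experts become less attractive to the router, underloaded more so
    return bias - gamma * overloaded.float() + gamma * (~overloaded).float()

def route_with_bias(affinity, bias, top_k):
    """Bias affects *selection* only; gate values come from the raw affinity."""
    _, idx = (affinity + bias).topk(top_k, dim=-1)  # biased Top-K selection
    gates = affinity.gather(-1, idx)                # unbiased gating weights
    return gates, idx
```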
Device-Limited Routing
To limit communication costs, DeepSeek-V3 restricts each token to be sent to at most $M$ nodes ($M = 4$ in DeepSeek-V3). It selects the top $M$ nodes based on the affinity scores of the experts distributed across each node, then performs Top-K routing only among the experts within those selected nodes.
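A single-process sketch of this two-stage selection (the per-node score here is simply the node's single best affinity, a simplification of the report's formulation, and experts are assumed to be laid out node-major):

```python
import torch

def node_limited_topk(affinity, num_nodes, m_nodes, top_k):
    """Top-K routing restricted to each token's best `m_nodes` nodes (sketch)."""
    num_tokens, num_experts = affinity.shape
    per_node = num_experts // num_nodes
    # Stage 1: score each node by its best expert for this token
    node_scores = affinity.view(num_tokens, num_nodes, per_node).max(dim=-1).values
    _, node_idx = node_scores.topk(m_nodes, dim=-1)  # (tokens, m_nodes)
    # Stage 2: keep only experts on the selected nodes, then ordinary Top-K
    keep = torch.zeros(num_tokens, num_nodes, dtype=torch.bool)
    keep.scatter_(1, node_idx, True)
    keep = keep.repeat_interleave(per_node, dim=1)   # node mask -> expert mask
    masked = affinity.masked_fill(~keep, float("-inf"))
    return masked.topk(top_k, dim=-1)                # (gates, expert ids)
```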
Routing Strategy Comparison
Top-1 Routing (Switch Transformer)
Activates exactly 1 expert per token. Minimizes communication cost and simplifies implementation, but reliance on a single expert may limit representational power.
Top-2 Routing (Mixtral, GShard)
Activates 2 experts per token with weighted combination. Richer representation than Top-1 but doubles communication cost.
Expert Choice Routing (Zhou et al., 2022)
Reverses the traditional approach: experts select tokens. Since each expert selects a fixed number of tokens, load balance is automatically guaranteed. However, a single token may be selected by zero or multiple experts, so token coverage is uneven: some tokens are dropped entirely while others are processed redundantly.
Soft MoE (Puigcerver et al., 2023)
Instead of discrete routing, passes weighted combinations of all tokens to each expert. Fully differentiable with no token dropping, but not truly sparse since every expert processes information from all tokens.
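Soft MoE's dispatch/combine can be written in a few lines. This sketch uses one slot per expert and plain linear experts for brevity (the paper uses multiple slots per expert and full FFNs):

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Soft MoE sketch: every expert processes a convex combination of all tokens."""
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(dim, num_experts) * dim ** -0.5)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x):                  # x: (batch, seq, dim)
        logits = x @ self.phi              # (batch, seq, num_experts)
        dispatch = logits.softmax(dim=1)   # normalize over tokens -> input slots
        combine = logits.softmax(dim=-1)   # normalize over experts -> output mix
        slots = torch.einsum("bsd,bse->bed", x, dispatch)  # one slot per expert
        outs = torch.stack(
            [expert(slots[:, i]) for i, expert in enumerate(self.experts)], dim=1
        )                                  # (batch, num_experts, dim)
        return torch.einsum("bse,bed->bsd", combine, outs)
```

Every token influences every expert's slot and receives a mixture of every expert's output, which is why the layer is fully differentiable but no longer sparse.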
PyTorch Implementation: Expert Choice Routing
class ExpertChoiceRouter(nn.Module):
    """Expert Choice Routing: experts select tokens."""

    def __init__(self, hidden_dim: int, num_experts: int, capacity_factor: float = 1.0):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        num_tokens = x.shape[0]
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)
        # Compute token-to-expert affinity scores
        scores = self.router(x)            # (num_tokens, num_experts)
        scores = F.softmax(scores, dim=0)  # softmax over tokens (not experts)
        # Each expert selects its top-`capacity` tokens
        expert_scores = scores.t()         # (num_experts, num_tokens)
        top_scores, top_indices = expert_scores.topk(capacity, dim=-1)
        # top_scores, top_indices: (num_experts, capacity)
        return top_scores, top_indices
Training Stability Techniques
Load Balancing Loss
The load balancing loss proposed in Switch Transformer is defined as:

$$\mathcal{L}_{\text{balance}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where $N$ is the number of experts, $f_i$ is the fraction of tokens routed to expert $i$, and $P_i$ is the mean router probability assigned to expert $i$. The coefficient $\alpha$ is typically set between 0.01 and 0.1.

Under an ideal uniform distribution, $f_i = P_i = \frac{1}{N}$, so the loss equals $\alpha \cdot N \cdot N \cdot \frac{1}{N^2} = \alpha$. The loss increases as imbalance grows.
Router Z-Loss
The Router Z-Loss proposed in ST-MoE (2022) constrains the magnitude of router logits to improve training stability:

$$\mathcal{L}_z = \frac{1}{B} \sum_{j=1}^{B} \left( \log \sum_{i=1}^{N} e^{x_{j,i}} \right)^2$$

where $x_{j,i}$ denotes the router logit of expert $i$ for token $j$, and $B$ is the number of tokens. This loss prevents logits from growing excessively large, mitigating instability and convergence issues in routing decisions.
PyTorch Implementation: Load Balancing + Z-Loss
def compute_moe_auxiliary_losses(
    router_logits: torch.Tensor,
    expert_indices: torch.Tensor,
    num_experts: int,
    alpha_balance: float = 0.01,
    alpha_z: float = 0.001
):
    """Compute load balancing loss and router z-loss.

    Args:
        router_logits: Raw router logits (batch*seq, num_experts)
        expert_indices: Selected expert indices (batch*seq, top_k)
        num_experts: Total number of experts
        alpha_balance: Weight for load balancing loss
        alpha_z: Weight for router z-loss
    """
    router_probs = F.softmax(router_logits, dim=-1)
    # --- Load Balancing Loss ---
    # f_i: fraction of tokens routed to expert i
    expert_mask = F.one_hot(expert_indices, num_experts).float()
    if expert_mask.dim() == 3:
        expert_mask = expert_mask.sum(dim=1)  # sum over top_k
    expert_mask = (expert_mask > 0).float()
    f = expert_mask.mean(dim=0)  # (num_experts,)
    # P_i: mean router probability for expert i
    P = router_probs.mean(dim=0)  # (num_experts,)
    balance_loss = alpha_balance * num_experts * (f * P).sum()
    # --- Router Z-Loss ---
    log_z = torch.logsumexp(router_logits, dim=-1)  # (num_tokens,)
    z_loss = alpha_z * (log_z ** 2).mean()
    return balance_loss + z_loss
Inference Optimization
Expert Offloading
Due to the large total parameter count of MoE models, it may be difficult to fit all experts in GPU memory. Expert Offloading stores inactive experts in CPU RAM or on disk and loads them to GPU only when needed.
Key techniques include:
- LRU Cache: Caches recently used experts on GPU
- Predictive Prefetch: Asynchronously preloads experts needed for the next layer
- Speculative Decoding + Offloading: Combines with speculative decoding to hide offloading latency
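The LRU idea is straightforward to sketch independently of any framework. `load_fn` and `unload_fn` are placeholders for whatever actually moves expert weights between devices:

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keep at most `capacity` experts on the fast device; evict least-recently-used."""
    def __init__(self, capacity, load_fn, unload_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # e.g. copies an expert's weights to the GPU
        self.unload_fn = unload_fn  # e.g. releases them back to CPU RAM
        self.cache = OrderedDict()  # expert_id -> loaded expert

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        if len(self.cache) >= self.capacity:
            _, evicted = self.cache.popitem(last=False)  # drop the LRU entry
            self.unload_fn(evicted)
        expert = self.load_fn(expert_id)
        self.cache[expert_id] = expert
        return expert
```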
Quantization
Quantization for MoE models is similar to dense models, but requires additional consideration since weight distributions may differ across experts.
- GPTQ/AWQ: Independent quantization configurations per expert are possible
- Mixed Precision: Higher precision for frequently used experts, lower for rarely used ones
- MiLo (2025): Adds Low-Rank compensators to extremely quantized MoE models to recover accuracy
Expert Parallelism
In MoE inference, Expert Parallelism places each expert on a separate GPU for parallel processing. All-to-All communication sends tokens to the GPU hosting their assigned expert, processes them, and gathers results back.
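The dispatch step can be illustrated in a single process: tokens are sorted so each expert's tokens form one contiguous chunk, which is the layout an all-to-all exchange (e.g. `torch.distributed.all_to_all_single`) sends and receives; the communication itself is omitted here:

```python
import torch

def dispatch_for_all_to_all(x, expert_indices, num_experts):
    """Group tokens by assigned expert into contiguous chunks (sketch)."""
    order = expert_indices.argsort()  # permutation grouping tokens by expert
    permuted = x[order]
    # Per-expert counts; ranks would exchange these first so each side
    # knows how many tokens to receive for its local experts
    counts = torch.bincount(expert_indices, minlength=num_experts)
    return permuted, counts, order

def combine_after_all_to_all(expert_out, order):
    """Invert the dispatch permutation to restore the original token order."""
    out = torch.empty_like(expert_out)
    out[order] = expert_out
    return out
```

With an identity "expert", dispatch followed by combine returns the input unchanged, which makes a handy unit test for real implementations.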
Major MoE Model Comparison
| Model | Year | Total Params | Active Params | Experts | Routing | Expert Type | Key Features |
|---|---|---|---|---|---|---|---|
| Sparsely-Gated MoE | 2017 | 137B | - | 4096 | Top-K | MLP | First large-scale Sparse MoE |
| Switch Transformer | 2021 | 1.6T | - | 2048 | Top-1 | FFN | Simplified routing, T5-based |
| GLaM | 2022 | 1.2T | 97B | 64 | Top-2 | FFN | 1/3 energy vs GPT-3 |
| ST-MoE | 2022 | 269B | - | 32 | Top-2 | FFN | Z-Loss, stability focus |
| Expert Choice | 2022 | - | - | - | Expert Choice | FFN | Experts select tokens |
| Mixtral 8x7B | 2023 | 46.7B | 13B | 8 | Top-2 | SwiGLU | Open-source, GQA |
| DeepSeek-V2 | 2024 | 236B | 21B | 160+2 | Top-6 | Fine-Grained | Shared experts, MLA |
| DeepSeek-V3 | 2024 | 671B | 37B | 256+1 | Top-8 | Fine-Grained | Auxiliary-loss-free, MLA + shared expert |
| Llama 4 Scout | 2025 | 109B | 17B | 16 | Top-1 | - | Meta's first MoE |
Complete Implementation: Custom MoE Transformer Block
Below is a complete implementation of a Transformer block combining an attention layer with an MoE FFN.
class MoETransformerBlock(nn.Module):
    """Complete Transformer block with MoE FFN layer."""

    def __init__(
        self,
        hidden_dim: int = 768,
        num_heads: int = 12,
        ffn_dim: int = 3072,
        num_experts: int = 8,
        top_k: int = 2,
        dropout: float = 0.1
    ):
        super().__init__()
        # Multi-Head Attention
        self.attn_norm = nn.LayerNorm(hidden_dim)
        self.attention = nn.MultiheadAttention(
            hidden_dim, num_heads, dropout=dropout, batch_first=True
        )
        # MoE FFN
        self.ffn_norm = nn.LayerNorm(hidden_dim)
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim)
            for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None):
        # Pre-norm Attention
        residual = x
        x_norm = self.attn_norm(x)
        attn_out, _ = self.attention(x_norm, x_norm, x_norm, attn_mask=mask)
        x = residual + self.dropout(attn_out)
        # Pre-norm MoE FFN
        residual = x
        x_norm = self.ffn_norm(x)
        moe_out, aux_loss = self._moe_forward(x_norm)
        x = residual + self.dropout(moe_out)
        return x, aux_loss

    def _moe_forward(self, x: torch.Tensor):
        B, S, D = x.shape
        x_flat = x.view(-1, D)
        # Router
        logits = self.router(x_flat)
        probs = F.softmax(logits, dim=-1)
        top_k_probs, top_k_idx = probs.topk(self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        # Dispatch and combine (no capacity limit: every token is processed)
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            for i in range(self.num_experts):
                token_mask = (top_k_idx[:, k] == i)
                if token_mask.any():
                    expert_out = self.experts[i](x_flat[token_mask])
                    output[token_mask] += top_k_probs[token_mask, k].unsqueeze(-1) * expert_out
        # Auxiliary loss
        aux_loss = compute_moe_auxiliary_losses(
            logits, top_k_idx, self.num_experts
        )
        return output.view(B, S, D), aux_loss
Conclusion and Future Directions
MoE architecture has established itself as the most practical approach for simultaneously achieving "model capacity scaling" and "computational efficiency." Starting from Switch Transformer's Top-1 simplification, Mixtral 8x7B brought MoE to the open-source ecosystem, and DeepSeek-V3 set new standards with Fine-Grained Experts and Auxiliary-Loss-Free strategies.
Key future research directions include:
- Dynamic expert activation: Adaptive routing that adjusts the number of active experts based on input difficulty
- Training-inference consistency: Techniques ensuring routing patterns from training are maintained during inference
- Expert specialization analysis: Interpretability research on what knowledge or functions each expert specializes in
- MoE for edge devices: Lightweight MoE designs for mobile and edge environments
References
- Shazeer, N., et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
- Fedus, W., et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.
- Jiang, A.Q., et al. "Mixtral of Experts." arXiv:2401.04088, 2024.
- DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
- DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
- Zoph, B., et al. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906, 2022.
- Zhou, Y., et al. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022.
- Puigcerver, J., et al. "From Sparse to Soft Mixtures of Experts." ICLR 2024.
- Du, N., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022.
- Jacobs, R.A., et al. "Adaptive Mixtures of Local Experts." Neural Computation, 1991.