- 1. What is MoE?
- 2. Core Components of MoE Architecture
- 3. Major MoE Model Analysis
- 4. Routing Strategies
- 5. Load Balancing
- 6. Inference Optimization
- 7. Quiz

1. What is MoE?
Mixture of Experts (MoE) is an architecture that improves computational efficiency by activating only a subset of the model's total parameters. Unlike Dense models that use all parameters for every input, MoE selects and activates only the optimal experts based on the input.
Dense vs Sparse Models
- Dense Model: All parameters are activated for every input (e.g., LLaMA, GPT-3)
- Sparse MoE: Only a fraction of parameters are activated (e.g., Mixtral, DeepSeek-V3)
The key advantage is that the model has a large number of parameters but low computational cost. Mixtral 8x7B has 46.7B total parameters, but only about 12.9B are activated during inference.
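This arithmetic can be checked back-of-the-envelope from Mixtral's published dimensions (d_model=4096, d_ff=14336, 32 layers, and three weight matrices per SwiGLU expert); the non-expert remainder (attention, embeddings) is inferred from the 46.7B total:

```python
# Rough check of the Mixtral 8x7B numbers above, using its published
# dimensions: d_model=4096, d_ff=14336, 32 layers, 3 matrices per SwiGLU expert.
d_model, d_ff, n_layers, n_experts, top_k = 4096, 14336, 32, 8, 2

params_per_expert_layer = 3 * d_model * d_ff           # SwiGLU: w1, w2, w3
expert_params = n_experts * n_layers * params_per_expert_layer
non_expert_params = 46.7e9 - expert_params             # attention, embeddings, ...

# Only top_k of the n_experts FFNs run per token; everything else always runs.
active = top_k * n_layers * params_per_expert_layer + non_expert_params
print(f"active params: {active / 1e9:.1f}B")           # ~12.9B
```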
2. Core Components of MoE Architecture
Expert Network
Each Expert is an independent FFN (Feed-Forward Network):
```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # SwiGLU gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU activation: silu(W1 x) gated by W3 x, projected back by W2
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))
```
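A quick shape sanity check of the SwiGLU computation, written functionally with hypothetical small sizes (d_model=16, d_ff=64):

```python
import torch
import torch.nn.functional as F

# Functional form of the SwiGLU expert, hypothetical small sizes.
d_model, d_ff = 16, 64
x = torch.randn(2, 5, d_model)          # (batch, seq_len, d_model)
w1 = torch.randn(d_ff, d_model)
w2 = torch.randn(d_model, d_ff)
w3 = torch.randn(d_ff, d_model)

hidden = F.silu(x @ w1.T) * (x @ w3.T)  # gated activation: (2, 5, 64)
out = hidden @ w2.T                     # project back: (2, 5, 16)
print(out.shape)  # torch.Size([2, 5, 16])
```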
Router (Gating Network)
The Router determines which Expert each token is sent to:
```python
class TopKRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x shape: (batch, seq_len, d_model)
        logits = self.gate(x)  # (batch, seq_len, num_experts)
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        top_k_weights = torch.softmax(top_k_logits, dim=-1)
        return top_k_weights, top_k_indices
```
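The same routing math on toy inputs, without the class, shows what the router emits: for each token, the indices of its chosen experts and gate weights normalized over just those experts (4 tokens and 8 experts here are illustrative sizes):

```python
import torch

# Top-2 routing over toy logits: 4 tokens, 8 experts.
torch.manual_seed(0)
logits = torch.randn(4, 8)                 # (tokens, experts)
top_vals, top_idx = logits.topk(2, dim=-1)
weights = torch.softmax(top_vals, dim=-1)  # normalize over the chosen 2 only
print(weights.sum(dim=-1))                 # each row sums to 1
```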
Full MoE Layer Implementation
```python
class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_experts)
        ])
        self.router = TopKRouter(d_model, num_experts, top_k)
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, d_model = x.shape
        weights, indices = self.router(x)

        # Flatten (batch, seq_len) into one token axis for expert processing
        flat_x = x.view(-1, d_model)
        flat_weights = weights.view(-1, weights.shape[-1])
        flat_indices = indices.view(-1, indices.shape[-1])

        output = torch.zeros_like(flat_x)
        for i, expert in enumerate(self.experts):
            # Find tokens routed to this expert
            mask = (flat_indices == i).any(dim=-1)
            if mask.any():
                expert_input = flat_x[mask]
                expert_output = expert(expert_input)
                # Weight each token's output by its router probability
                idx = (flat_indices[mask] == i).float()
                w = (flat_weights[mask] * idx).sum(dim=-1, keepdim=True)
                output[mask] += w * expert_output
        return output.view(batch_size, seq_len, d_model)
```
3. Major MoE Model Analysis
Mixtral 8x7B (Mistral AI)
- 8 Experts, Top-2 routing
- 46.7B total parameters, 12.9B active
- Attention layers are shared; only FFN is split into Experts
DeepSeek-V3 MoE
DeepSeek-V3 employs a more sophisticated MoE design:
```python
class DeepSeekMoE(nn.Module):
    """DeepSeek-V3 style: Shared Expert + Routed Expert."""
    def __init__(self, d_model, d_ff, num_shared=1,
                 num_routed=256, top_k=8):
        super().__init__()
        # Shared Experts that all tokens pass through
        self.shared_experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_shared)
        ])
        # Routed Experts selected per token (narrower FFN than the shared ones)
        self.routed_experts = nn.ModuleList([
            Expert(d_model, d_ff // 4) for _ in range(num_routed)
        ])
        self.router = TopKRouter(d_model, num_routed, top_k)

    def forward(self, x):
        # Shared Expert output
        shared_out = sum(e(x) for e in self.shared_experts)
        # Routed Expert output; _route_tokens follows the same
        # dispatch loop as MoELayer.forward above
        weights, indices = self.router(x)
        routed_out = self._route_tokens(x, weights, indices)
        return shared_out + routed_out
```
- 1 Shared Expert + 256 Routed Experts (Top-8 selection)
- 671B total parameters, 37B active
- Introduced Auxiliary-loss-free load balancing
Model Comparison
| Model | Total Params | Active Params | Experts | Top-K |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 |
| Mixtral 8x22B | 141B | 39B | 8 | 2 |
| DeepSeek-V3 | 671B | 37B | 256+1 | 8+1 |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | 60+4 | 4+4 |
4. Routing Strategies
Token Choice vs Expert Choice
```python
# Token Choice: Each token selects its Experts
def token_choice_routing(logits, top_k=2):
    top_k_vals, top_k_idx = logits.topk(top_k, dim=-1)
    weights = torch.softmax(top_k_vals, dim=-1)
    return weights, top_k_idx

# Expert Choice: Each Expert selects its tokens
def expert_choice_routing(logits, capacity_factor=1.25):
    num_tokens = logits.shape[0]
    num_experts = logits.shape[1]
    capacity = int(num_tokens * capacity_factor / num_experts)
    expert_scores = logits.T  # (num_experts, num_tokens)
    top_k_vals, top_k_idx = expert_scores.topk(capacity, dim=-1)
    return top_k_vals, top_k_idx
```
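A toy run (hypothetical sizes: 16 tokens, 4 experts) makes the Expert Choice trade-off concrete: every expert receives exactly `capacity` tokens, but because popular tokens can be picked by several experts, some tokens may be picked by none:

```python
import torch

# Expert Choice on toy logits: 16 tokens, 4 experts, capacity_factor=1.25,
# so each expert takes int(16 * 1.25 / 4) = 5 tokens.
torch.manual_seed(0)
num_tokens, num_experts, capacity_factor = 16, 4, 1.25
logits = torch.randn(num_tokens, num_experts)

capacity = int(num_tokens * capacity_factor / num_experts)
_, chosen = logits.T.topk(capacity, dim=-1)  # each expert picks `capacity` tokens
covered = chosen.flatten().unique()          # tokens picked by at least one expert
print(f"capacity={capacity}, tokens left unrouted: {num_tokens - covered.numel()}")
```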
5. Load Balancing
Load imbalance across Experts is a core challenge in MoE:
```python
def load_balancing_loss(router_logits, top_k_indices, num_experts):
    """Auxiliary load balancing loss (Switch Transformer style)."""
    # Fraction of tokens routed to each Expert
    mask = torch.zeros_like(router_logits)
    mask.scatter_(-1, top_k_indices, 1.0)
    tokens_per_expert = mask.float().mean(dim=0)  # (num_experts,)
    # Average routing probability per Expert
    router_probs = torch.softmax(router_logits, dim=-1)
    router_prob_per_expert = router_probs.mean(dim=0)
    # Dot product of the two distributions = measure of imbalance
    loss = num_experts * (tokens_per_expert * router_prob_per_expert).sum()
    return loss
```
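A small sanity check, with the loss computation repeated inline so it runs standalone: on a hand-built top-1 routing over 4 experts, a perfectly balanced assignment hits the minimum of 1.0, while collapsing every token onto expert 0 drives the loss toward num_experts:

```python
import torch

# Balanced vs. collapsed top-1 routing, 8 tokens over 4 experts.
def aux_loss(router_logits, top_k_indices, num_experts):
    mask = torch.zeros_like(router_logits).scatter_(-1, top_k_indices, 1.0)
    tokens_per_expert = mask.mean(dim=0)
    router_prob_per_expert = torch.softmax(router_logits, dim=-1).mean(dim=0)
    return num_experts * (tokens_per_expert * router_prob_per_expert).sum()

balanced = aux_loss(torch.zeros(8, 4),
                    torch.tensor([[0], [1], [2], [3], [0], [1], [2], [3]]), 4)

logits = torch.zeros(8, 4)
logits[:, 0] = 10.0  # router strongly prefers expert 0
collapsed = aux_loss(logits, torch.zeros(8, 1, dtype=torch.long), 4)

print(balanced.item(), collapsed.item())  # 1.0 vs. nearly 4.0
```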
DeepSeek-V3's Auxiliary-loss-free approach dynamically adjusts per-Expert bias terms, lowering the bias for overloaded Experts and raising it for underutilized ones, achieving balanced load without an additional loss term.
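A minimal sketch of this idea (not DeepSeek-V3's exact algorithm; `biased_top_k`, `update_bias`, and `gamma` are illustrative names): the per-Expert bias shifts only expert *selection*, while the gating weights still come from the unbiased scores, so the bias steers load without distorting the output mixture:

```python
import torch

def biased_top_k(scores, expert_bias, k):
    # Selection uses biased scores; gate weights use the original scores.
    _, idx = (scores + expert_bias).topk(k, dim=-1)
    weights = torch.softmax(scores.gather(-1, idx), dim=-1)
    return weights, idx

def update_bias(expert_bias, idx, num_experts, gamma=1e-3):
    load = torch.bincount(idx.flatten(), minlength=num_experts).float()
    target = idx.numel() / num_experts
    # Lower the bias of overloaded experts, raise it for underloaded ones.
    return expert_bias - gamma * torch.sign(load - target)

# One step: the router heavily favors expert 0, so its bias is pushed down.
scores = torch.zeros(100, 4)
scores[:, 0] = 5.0
bias = torch.zeros(4)
_, idx = biased_top_k(scores, bias, k=1)
bias = update_bias(bias, idx, num_experts=4)
print(bias)  # expert 0's bias decreased, the others' increased
```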
6. Inference Optimization
```python
# Expert Parallelism: distribute Experts across multiple GPUs
# e.g., GPU 0: Experts 0-3, GPU 1: Experts 4-7
class ExpertParallel(nn.Module):
    def __init__(self, d_model, d_ff, experts_per_gpu, rank, world_size):
        super().__init__()
        self.local_experts = nn.ModuleList([
            Expert(d_model, d_ff)
            for _ in range(experts_per_gpu)
        ])
        self.rank = rank
        self.world_size = world_size

    def forward(self, x, indices):
        # Redistribute tokens via All-to-All communication
        # (all_to_all stands in for a torch.distributed-based dispatch wrapper)
        dispatched = all_to_all(x, indices, self.world_size)
        # Process with local Experts
        output = self._process_local(dispatched)
        # Return results to the originating GPUs
        return all_to_all(output, indices, self.world_size)
```
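The dispatch step that All-to-All implements can be sketched in a single process (hypothetical sizes; top-1 assignment for simplicity): sorting tokens by their destination expert produces one contiguous chunk per expert, which is exactly what each rank would receive:

```python
import torch

# Single-process sketch of All-to-All token dispatch.
torch.manual_seed(0)
x = torch.randn(6, 4)                             # 6 tokens, d_model=4
expert_ids = torch.tensor([1, 0, 1, 2, 0, 2])     # top-1 expert per token
order = torch.argsort(expert_ids)                 # destination-sorted token order
dispatched = x[order]                             # contiguous per-expert chunks
counts = torch.bincount(expert_ids, minlength=3)  # chunk size per expert
print(counts.tolist())  # [2, 2, 2]
```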
7. Quiz
Q1: Approximately how many parameters does a single token use during inference in Mixtral 8x7B?
Approximately 12.9B parameters. Since only the Top-2 out of 8 Experts are activated, only the FFN parameters of 2 Experts plus the shared Attention parameters are used. This is roughly 28% of the total 46.7B.
Q2: What is the core idea behind DeepSeek-V3's Auxiliary-loss-free load balancing?
Traditional MoE adds an auxiliary loss for load balancing, which can degrade model performance. DeepSeek-V3 introduces dynamic bias terms for each Expert, lowering the bias for Experts receiving too many tokens and raising it for underutilized ones, naturally achieving balance. This enables stable load balancing without an additional loss.
Q3: What are the differences and trade-offs between Token Choice and Expert Choice routing?
- Token Choice: Each token selects its Top-K Experts. Simple to implement but can lead to load imbalance where tokens concentrate on certain Experts.
- Expert Choice: Each Expert selects the tokens it processes. Guarantees perfect load balancing but some tokens may not be selected by any Expert.
In practice, Token Choice combined with a load balancing loss is the most commonly used approach.