Deep Dive into Mixture of Experts (MoE) Architecture: From Switch Transformer to Mixtral and DeepSeek
- Introduction
- History and Evolution of MoE Architecture
- Mathematical Foundations of Sparse MoE
- Switch Transformer Analysis
- Mixtral 8x7B Architecture Details
- DeepSeek-V2/V3 Innovations: DeepSeekMoE
- Routing Strategy Comparison
- Training Stability Techniques
- Inference Optimization
- Major MoE Model Comparison
- Complete Implementation: Custom MoE Transformer Block
- Conclusion and Future Directions
- References

Introduction
In the era of large language models (LLMs), endlessly scaling model parameters hits fundamental walls in both training cost and inference cost. Dense Transformers activate all parameters for every input token, so as the parameter count grows, FLOPs grow proportionally. The Mixture of Experts (MoE) architecture addresses this through **conditional computation**: activate only a subset of expert networks depending on the input, so that model capacity stays large while actual per-token computation stays bounded.
Since Shazeer et al. proposed the Sparsely-Gated MoE in their 2017 paper "Outrageously Large Neural Networks," MoE has evolved rapidly — from Google's Switch Transformer in 2021, to Mistral's Mixtral 8x7B in 2023, and DeepSeek-V2/V3 in 2024. In 2025, Meta adopted MoE with Llama 4, and DeepSeek-R1 built on the V3 architecture to maximize reasoning capability, drawing worldwide attention.
This article is a paper-level deep dive, from the mathematical foundations of MoE through the design philosophies of the major models, a comparative analysis of routing strategies, training stability techniques, and inference optimization.
History and Evolution of MoE Architecture
The concept of MoE was first proposed by Jacobs et al. in their 1991 paper "Adaptive Mixtures of Local Experts." Initially it was a simple scheme in which a gating network computed a weighted sum of several expert networks' outputs.
The key milestones in modern MoE can be summarized as follows:
- 2017: Shazeer et al. proposed LSTM-based Sparsely-Gated MoE with 4,096 experts and 100B+ parameters
- 2021: Google's Switch Transformer simplified routing to Top-1, reaching 1.6 trillion parameters
- 2022: Google's ST-MoE (Stable and Transferable MoE) systematized training stability techniques
- 2022: The Expert Choice Routing paper proposed reverse routing, in which experts select tokens
- 2023: Mixtral 8x7B opened the era of open-source MoE with Top-2 routing and SwiGLU experts
- 2024: DeepSeek-V2 introduced Fine-Grained Experts and the Auxiliary-Loss-Free strategy
- 2024: DeepSeek-V3 achieved state-of-the-art performance with 671B parameters (37B active)
- 2025: Llama 4 Scout (16 experts, 109B total / 17B active) marked Meta's first MoE adoption
Mathematical Foundations of Sparse MoE
Basic Formulation
The output of an MoE layer is defined as:

$$y = \sum_{i=1}^{N} G(x)_i \, E_i(x)$$

where $x$ is the hidden representation of the input token, $N$ is the number of experts, $E_i$ is the $i$-th expert network, and $G(x)_i$ is the weight the gating function assigns to the $i$-th expert.
Sparse Gating Function
The Noisy Top-K gating function proposed by Shazeer et al. (2017) is:

$$G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)$$
$$H(x)_i = (x W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\big((x W_{\text{noise}})_i\big)$$

where $W_g$ is the gating weight matrix and $W_{\text{noise}}$ is the noise weight matrix. The KeepTopK operation retains only the top $k$ values and sets the rest to $-\infty$, so they become zero after the Softmax.
PyTorch Implementation: Basic Sparse Gating
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGating(nn.Module):
    """Noisy Top-K Gating mechanism for MoE."""

    def __init__(self, input_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(input_dim, num_experts, bias=False)
        self.noise = nn.Linear(input_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x shape: (batch_size, seq_len, input_dim)
        logits = self.gate(x)  # (batch, seq, num_experts)
        # Add noise during training to encourage exploration
        if self.training:
            noise = torch.randn_like(logits) * F.softplus(self.noise(x))
            logits = logits + noise
        # Top-K selection: (batch, seq, top_k)
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        # Sparse softmax: normalize only over the selected experts
        top_k_gates = F.softmax(top_k_logits, dim=-1)
        return top_k_gates, top_k_indices
```
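A quick standalone check of the gating math (plain tensor ops rather than the module above, so it runs on its own): the selected gates form a valid probability distribution over exactly `top_k` experts per token.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, top_k = 8, 2
logits = torch.randn(4, 16, num_experts)  # router logits: (batch, seq, experts)

# Keep only the top-k logits per token, then renormalize with softmax
top_k_logits, top_k_indices = logits.topk(top_k, dim=-1)
gates = F.softmax(top_k_logits, dim=-1)   # (batch, seq, top_k)
```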
Switch Transformer Analysis
Core Innovation: Top-1 Routing
The core innovation of the Switch Transformer, published by Fedus et al. in 2021, is Top-1 routing. Previous work assumed that at least two experts had to be activated for stable training, but the Switch Transformer showed that selecting exactly one expert per token is sufficient.
The routed output is simply the product of the gating probability and the selected expert's output:

$$y = p_i(x) \, E_i(x), \qquad i = \arg\max_j \, p_j(x)$$

where $p(x) = \mathrm{Softmax}(W_r x)$ are the router probabilities.
Architecture Characteristics
The Switch Transformer replaces the FFN (Feed-Forward Network) layers of the T5 architecture with MoE layers. Each MoE layer can hold up to 2,048 experts, enabling models at the 1.6 trillion parameter scale. Top-1 routing cuts communication costs in half and simplifies the routing computation itself.
Performance
With 64 experts, the Switch Transformer achieved 7x faster pretraining than T5-Base at equivalent compute. This is possible because model capacity grows while per-token computation stays constant.
PyTorch Implementation: Switch Transformer MoE Layer
```python
class SwitchMoELayer(nn.Module):
    """Switch Transformer style MoE layer with Top-1 routing."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int,
                 capacity_factor: float = 1.25):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.ReLU(),
                nn.Linear(ffn_dim, hidden_dim),
            ) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        batch_size, seq_len, hidden_dim = x.shape
        x_flat = x.view(-1, hidden_dim)  # (B*S, D)
        num_tokens = x_flat.shape[0]
        # Router: Top-1 selection
        router_logits = self.router(x_flat)  # (B*S, E)
        router_probs = F.softmax(router_logits, dim=-1)
        expert_indices = router_probs.argmax(dim=-1)  # (B*S,)
        expert_gates = router_probs.gather(1, expert_indices.unsqueeze(-1)).squeeze(-1)
        # Capacity: max tokens per expert
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)
        # Dispatch tokens to experts
        output = torch.zeros_like(x_flat)
        for i in range(self.num_experts):
            # Positions of tokens routed to expert i, truncated to capacity
            # (tokens beyond capacity are dropped and pass through as zeros)
            token_idx = (expert_indices == i).nonzero(as_tuple=True)[0][:capacity]
            if token_idx.numel() == 0:
                continue
            expert_out = self.experts[i](x_flat[token_idx])
            gates = expert_gates[token_idx].unsqueeze(-1)
            # Integer-index assignment writes back into `output`; chained
            # boolean indexing would only modify a temporary copy.
            output[token_idx] = expert_out * gates
        return output.view(batch_size, seq_len, hidden_dim)
```
Mixtral 8x7B Architecture Details
Design Philosophy
Mixtral 8x7B, released by Mistral AI in December 2023, is based on the Mistral 7B architecture, with each Transformer layer's FFN replaced by an MoE layer of 8 experts. It uses Top-2 routing, activating 2 experts per token.
Key Numbers
- Total parameters: 46.7B (8 experts x ~5.6B FFN each + shared attention parameters)
- Active parameters: ~13B (2 expert FFNs per token + shared parameters)
- Expert function: SwiGLU FFN
- Attention: Grouped Query Attention (GQA)
- Context length: 32K tokens
- Sliding Window Attention applied
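These headline numbers can be sanity-checked with rough arithmetic. The sketch below uses the published Mixtral dimensions (hidden size 4096, FFN size 14336, 32 layers) and counts only expert FFN weights, so the totals land near, not exactly on, the official figures; the remainder is attention, embedding, and norm parameters.

```python
hidden, ffn, layers, experts, top_k = 4096, 14336, 32, 8, 2

# SwiGLU expert: w1 (hidden->ffn), v (hidden->ffn), w2 (ffn->hidden)
per_expert = 3 * hidden * ffn
expert_total = per_expert * experts * layers   # all expert weights
expert_active = per_expert * top_k * layers    # expert weights touched per token

print(f"per expert:     {per_expert / 1e9:.3f}B")
print(f"all experts:    {expert_total / 1e9:.1f}B")   # ~45.1B (+ shared params -> ~46.7B total)
print(f"active experts: {expert_active / 1e9:.1f}B")  # ~11.3B (+ shared params -> ~13B active)
```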
Top-2 Routing Formula
The MoE layer output in Mixtral is computed as:

$$y = \sum_{i \in \mathcal{T}} \frac{p_i(x)}{\sum_{j \in \mathcal{T}} p_j(x)} \, E_i(x), \qquad \mathcal{T} = \mathrm{Top2}\big(p(x)\big)$$

The gating function computes a Softmax probability distribution $p(x) = \mathrm{Softmax}(W_g x)$ over the experts and selects the top 2. The gating weights of the two selected experts are renormalized so they sum to 1.
SwiGLU Expert Network
Each expert is an FFN using the SwiGLU activation:

$$E(x) = W_2 \big( \mathrm{SiLU}(W_1 x) \odot V x \big)$$
PyTorch Implementation: Mixtral MoE Block
```python
class MixtralMoEBlock(nn.Module):
    """Mixtral-style MoE block with Top-2 SwiGLU experts."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 8):
        super().__init__()
        self.num_experts = num_experts
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, hidden_dim)
        gate_logits = self.gate(x)  # (batch, seq, num_experts)
        gate_probs = F.softmax(gate_logits, dim=-1)
        # Top-2 selection
        top2_probs, top2_indices = gate_probs.topk(2, dim=-1)
        # Renormalize gates to sum to 1
        top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)
        # Compute expert outputs and combine
        output = torch.zeros_like(x)
        for k in range(2):
            expert_idx = top2_indices[:, :, k]            # (batch, seq)
            gate_val = top2_probs[:, :, k].unsqueeze(-1)  # (batch, seq, 1)
            for i in range(self.num_experts):
                mask = (expert_idx == i)
                if mask.any():
                    expert_output = self.experts[i](x[mask])
                    # gate_val[mask]: (n, 1), broadcasts over hidden_dim
                    output[mask] += gate_val[mask] * expert_output
        return output

class SwiGLUExpert(nn.Module):
    """SwiGLU Feed-Forward Network used as expert."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.v = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor):
        # SwiGLU: w2(SiLU(w1 x) * v x)
        return self.w2(F.silu(self.w1(x)) * self.v(x))
```
DeepSeek-V2/V3 Innovations: DeepSeekMoE
Fine-Grained Expert Segmentation
DeepSeek-V2 (2024) takes a fundamentally different approach from earlier MoE designs. The core idea is Fine-Grained Expert Segmentation: splitting the experts into smaller, more numerous units.
The original $N$ experts are split into $mN$ experts, with each expert's hidden dimension reduced to $1/m$ of the original. The number of activated experts is scaled up proportionally from $K$ to $mK$, so total per-token computation stays the same while far more fine-grained expert combinations become possible.
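The payoff is combinatorial. A small sketch with a segmentation factor `m` (illustrative numbers, not DeepSeek's actual configuration; `ffn_dim` here is a hypothetical coarse-expert width): segmentation leaves the activated parameter count unchanged but multiplies the number of routable expert combinations.

```python
from math import comb

N, K, m = 16, 2, 4      # base experts, active experts, segmentation factor
ffn_dim = 11008         # hypothetical FFN width of one coarse expert

# Per-token activated FFN width is invariant under segmentation
coarse_active = K * ffn_dim
fine_active = (m * K) * (ffn_dim // m)
print(coarse_active == fine_active)  # True

# ...but the number of routable expert combinations explodes
print(comb(N, K))        # C(16, 2)  = 120
print(comb(m * N, m * K))  # C(64, 8) = 4426165368
```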
DeepSeek-V3 Architecture
DeepSeek-V3 (December 2024) has the following key configuration:
- Total parameters: 671B
- Active parameters: 37B (per token)
- Routed experts: 256 (per layer)
- Shared experts: 1 (per layer, always active)
- Active routed experts: 8 (per token)
- Attention: Multi-head Latent Attention (MLA)
Auxiliary-Loss-Free Load Balancing
One of DeepSeek-V3's most innovative contributions is its auxiliary-loss-free load balancing strategy. Conventional MoE models use an auxiliary loss for load balancing, but calibrating its coefficient is difficult, and too large a value degrades model quality.
Instead, DeepSeek-V3 adds a per-expert bias term $b_i$ that is used only for the routing decision:

$$g_i = \begin{cases} s_i, & s_i + b_i \in \mathrm{TopK}\big(\{s_j + b_j\}_{j=1}^{N}, K\big) \\ 0, & \text{otherwise} \end{cases}$$

The bias term $b_i$ affects only which experts are selected; the actual gating weight is still computed from the raw affinity score $s_i$. After each step, $b_i$ is decreased for overloaded experts and increased for underloaded ones, achieving load balance without contaminating the loss function.
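The mechanism can be sketched in a few lines. This is an illustrative simulation, not DeepSeek's implementation: experts are selected on biased scores but gated on raw scores, and an assumed fixed update rate `gamma` nudges each bias against the observed load.

```python
import torch

torch.manual_seed(0)
num_experts, top_k, gamma = 8, 2, 0.01
bias = torch.zeros(num_experts)

# A skewed router: two experts get systematically higher affinity
skew = torch.tensor([2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

def route(scores, bias, top_k):
    # Select on biased scores; a real layer would still gate on the raw scores
    _, idx = (scores + bias).topk(top_k, dim=-1)
    return idx

for step in range(2000):
    scores = torch.randn(256, num_experts) + skew
    idx = route(scores, bias, top_k)
    load = torch.bincount(idx.flatten(), minlength=num_experts).float()
    # Push biases down on overloaded experts, up on underloaded ones
    bias -= gamma * torch.sign(load - load.mean())

scores = torch.randn(4096, num_experts) + skew
load = torch.bincount(route(scores, bias, top_k).flatten(), minlength=num_experts).float()
print(load / load.sum())  # close to uniform (~1/8 per expert) despite the skew
```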
Device-Limited Routing
To bound communication costs, DeepSeek-V3 restricts each token to being sent to at most $M$ nodes. It first selects the top $M$ nodes based on the affinity scores of the experts hosted on each node, then performs Top-K routing only among the experts on those selected nodes.
Routing Strategy Comparison
Top-1 Routing (Switch Transformer)
Activates exactly 1 expert per token. Communication cost is minimal and the implementation is simple, but reliance on a single expert may limit representational power.
Top-2 Routing (Mixtral, GShard)
Activates 2 experts per token and combines them by weighted sum. Richer representation than Top-1, but communication cost doubles.
Expert Choice Routing (Zhou et al., 2022)
Reverses the conventional approach: experts select tokens. Since each expert selects a fixed number of tokens, load balance is automatically guaranteed. However, a single token may be selected by zero or multiple experts, introducing non-determinism in per-token compute.
Soft MoE (Puigcerver et al., 2023)
Instead of discrete routing, passes weighted combinations of all tokens to each expert. Fully differentiable with no token dropping, but not truly sparse, since every expert processes information from all tokens.
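The contrast with sparse routing is easiest to see in code. A minimal Soft MoE-style dispatch/combine sketch (identity "experts" and a random slot projection, illustrative only): every expert slot input is a convex combination of all tokens, and every output token mixes all slot outputs, so nothing is dropped but nothing is sparse either.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, dim, num_slots = 10, 16, 4
x = torch.randn(num_tokens, dim)
phi = torch.randn(dim, num_slots)      # learnable slot projection in a real model

logits = x @ phi                       # (tokens, slots)
dispatch = F.softmax(logits, dim=0)    # per slot: weights over all tokens
combine = F.softmax(logits, dim=1)     # per token: weights over all slots

slot_inputs = dispatch.t() @ x         # (slots, dim): weighted mixes of all tokens
slot_outputs = slot_inputs             # identity experts for the sketch
y = combine @ slot_outputs             # (tokens, dim): every token sees every slot
```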
PyTorch Implementation: Expert Choice Routing
```python
class ExpertChoiceRouter(nn.Module):
    """Expert Choice Routing: experts select tokens."""

    def __init__(self, hidden_dim: int, num_experts: int, capacity_factor: float = 1.0):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        num_tokens = x.shape[0]
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)
        # Affinity scores, normalized over tokens (not experts)
        scores = self.router(x)           # (num_tokens, num_experts)
        scores = F.softmax(scores, dim=0)
        # Each expert selects its top-`capacity` tokens
        expert_scores = scores.t()        # (num_experts, num_tokens)
        top_scores, top_indices = expert_scores.topk(capacity, dim=-1)
        # top_scores, top_indices: (num_experts, capacity)
        return top_scores, top_indices
```
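The router above only picks tokens; a full layer still has to gather each expert's tokens, run the expert, and scatter-add the weighted outputs back to token positions. A minimal sketch of that combine step, using identity experts and `index_add_` for the scatter (illustrative, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, dim, num_experts, capacity = 12, 8, 3, 4
x = torch.randn(num_tokens, dim)

scores = F.softmax(torch.randn(num_tokens, num_experts), dim=0)
top_scores, top_indices = scores.t().topk(capacity, dim=-1)  # (experts, capacity)

output = torch.zeros_like(x)
for e in range(num_experts):
    tokens = x[top_indices[e]]        # gather this expert's selected tokens
    expert_out = tokens               # identity expert for the sketch
    weighted = top_scores[e].unsqueeze(-1) * expert_out
    output.index_add_(0, top_indices[e], weighted)  # scatter-add back by token index
# Tokens selected by no expert stay zero; tokens picked by several experts
# accumulate all of their weighted outputs.
```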
Training Stability Techniques
Load Balancing Loss
The load balancing loss proposed in the Switch Transformer is:

$$\mathcal{L}_{\text{balance}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where $N$ is the number of experts, $f_i$ is the fraction of tokens routed to expert $i$, and $P_i$ is the mean router probability assigned to expert $i$. The coefficient $\alpha$ is typically set between 0.01 and 0.1.
Under ideal uniform distribution, $f_i = P_i = 1/N$, so the loss equals $\alpha$; it grows as the imbalance grows.
Router Z-Loss
The Router Z-Loss proposed in ST-MoE (2022) constrains the magnitude of the router logits to improve training stability:

$$\mathcal{L}_z = \frac{1}{B} \sum_{b=1}^{B} \left( \log \sum_{j=1}^{N} e^{x_j^{(b)}} \right)^2$$

where $x^{(b)}$ are the router logits for token $b$. This loss keeps the logits from growing excessively large, mitigating instability and convergence problems in routing decisions.
PyTorch Implementation: Load Balancing + Z-Loss
```python
def compute_moe_auxiliary_losses(
    router_logits: torch.Tensor,
    expert_indices: torch.Tensor,
    num_experts: int,
    alpha_balance: float = 0.01,
    alpha_z: float = 0.001,
):
    """Compute load balancing loss and router z-loss.

    Args:
        router_logits: Raw router logits (batch*seq, num_experts)
        expert_indices: Selected expert indices (batch*seq,) or (batch*seq, top_k)
        num_experts: Total number of experts
        alpha_balance: Weight for load balancing loss
        alpha_z: Weight for router z-loss
    """
    router_probs = F.softmax(router_logits, dim=-1)
    # --- Load Balancing Loss ---
    # f_i: fraction of tokens routed to expert i
    expert_mask = F.one_hot(expert_indices, num_experts).float()
    if expert_mask.dim() == 3:
        expert_mask = expert_mask.sum(dim=1)   # merge the top_k selections
        expert_mask = (expert_mask > 0).float()
    f = expert_mask.mean(dim=0)                # (num_experts,)
    # P_i: mean router probability for expert i
    P = router_probs.mean(dim=0)               # (num_experts,)
    balance_loss = alpha_balance * num_experts * (f * P).sum()
    # --- Router Z-Loss ---
    log_z = torch.logsumexp(router_logits, dim=-1)  # (batch*seq,)
    z_loss = alpha_z * (log_z ** 2).mean()
    return balance_loss + z_loss
```
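As a quick standalone sanity check of the balance loss formula (recomputed inline so the snippet runs on its own): perfectly uniform top-1 routing yields exactly α, while routing everything to one expert yields α·N.

```python
import torch

alpha, N = 0.01, 4

# Uniform routing: each expert gets 1/N of the tokens and 1/N mean probability
f = torch.full((N,), 1.0 / N)
P = torch.full((N,), 1.0 / N)
uniform_loss = alpha * N * (f * P).sum()       # ~ alpha

# Degenerate routing: everything goes to expert 0 with probability 1
f_bad = torch.tensor([1.0, 0.0, 0.0, 0.0])
P_bad = torch.tensor([1.0, 0.0, 0.0, 0.0])
degenerate_loss = alpha * N * (f_bad * P_bad).sum()  # ~ alpha * N
```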
Inference Optimization
Expert Offloading
Because the total parameter count of MoE models is large, it can be difficult to fit all experts in GPU memory. Expert Offloading stores currently inactive experts in CPU RAM or on disk and loads them onto the GPU only when needed.
Key techniques include:
- LRU Cache: Keeps recently used experts cached on the GPU
- Predictive Prefetch: Asynchronously preloads the experts likely needed by the next layer
- Speculative Decoding + Offloading: Combines with speculative decoding to hide offloading latency
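The LRU approach is straightforward to sketch with an `OrderedDict`. This toy version (a hypothetical `load_expert` callback, with strings standing in for weights) shows only the eviction policy, not the actual GPU transfers:

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keep at most `max_experts` experts 'on GPU'; evict the least recently used."""

    def __init__(self, max_experts: int, load_expert):
        self.max_experts = max_experts
        self.load_expert = load_expert   # callback: expert_id -> weights
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as most recently used
            return self.cache[expert_id]
        weights = self.load_expert(expert_id)   # cache miss: load from CPU/disk
        self.cache[expert_id] = weights
        if len(self.cache) > self.max_experts:
            self.cache.popitem(last=False)      # evict least recently used
        return weights

cache = ExpertLRUCache(max_experts=2, load_expert=lambda i: f"weights[{i}]")
cache.get(0); cache.get(1); cache.get(0); cache.get(2)  # expert 1 gets evicted
```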
Quantization
Quantizing MoE models is similar to quantizing dense models, but requires extra care because weight distributions can differ across experts.
- GPTQ/AWQ: Independent quantization configurations can be applied per expert
- Mixed Precision: Higher precision for frequently used experts, lower for rarely used ones
- MiLo (2025): Adds Low-Rank compensators to extremely quantized MoE models to recover accuracy
Expert Parallelism
In MoE inference, Expert Parallelism places each expert on a separate GPU for parallel processing. All-to-All communication sends each token to the GPU hosting its assigned expert, then gathers the results back after processing.
Major MoE Model Comparison
| Model | Year | Total Params | Active Params | Experts | Routing | Expert Type | Key Features |
|---|---|---|---|---|---|---|---|
| Sparsely-Gated MoE | 2017 | 137B | - | 4096 | Top-K | MLP | First large-scale Sparse MoE |
| Switch Transformer | 2021 | 1.6T | - | 2048 | Top-1 | FFN | Simplified routing, T5-based |
| GLaM | 2022 | 1.2T | 97B | 64 | Top-2 | FFN | 1/3 energy vs GPT-3 |
| ST-MoE | 2022 | 269B | - | 32 | Top-2 | FFN | Z-Loss, stability focus |
| Expert Choice | 2022 | - | - | - | Expert Choice | FFN | Experts select tokens |
| Mixtral 8x7B | 2023 | 46.7B | 13B | 8 | Top-2 | SwiGLU | Open-source, GQA |
| DeepSeek-V2 | 2024 | 236B | 21B | 160 | Top-6 | Fine-Grained | Auxiliary-Loss-Free |
| DeepSeek-V3 | 2024 | 671B | 37B | 256+1 | Top-8 | Fine-Grained | MLA + shared expert |
| Llama 4 Scout | 2025 | 109B | 17B | 16 | Top-1 | - | Meta's first MoE |
Complete Implementation: Custom MoE Transformer Block
Below is a complete implementation of a Transformer block combining an attention layer with an MoE FFN.
```python
class MoETransformerBlock(nn.Module):
    """Complete Transformer block with MoE FFN layer."""

    def __init__(
        self,
        hidden_dim: int = 768,
        num_heads: int = 12,
        ffn_dim: int = 3072,
        num_experts: int = 8,
        top_k: int = 2,
        capacity_factor: float = 1.25,
        dropout: float = 0.1,
    ):
        super().__init__()
        # Multi-Head Attention
        self.attn_norm = nn.LayerNorm(hidden_dim)
        self.attention = nn.MultiheadAttention(
            hidden_dim, num_heads, dropout=dropout, batch_first=True
        )
        # MoE FFN
        self.ffn_norm = nn.LayerNorm(hidden_dim)
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None):
        # Pre-norm Attention
        residual = x
        x_norm = self.attn_norm(x)
        attn_out, _ = self.attention(x_norm, x_norm, x_norm, attn_mask=mask)
        x = residual + self.dropout(attn_out)
        # Pre-norm MoE FFN
        residual = x
        x_norm = self.ffn_norm(x)
        moe_out, aux_loss = self._moe_forward(x_norm)
        x = residual + self.dropout(moe_out)
        return x, aux_loss

    def _moe_forward(self, x: torch.Tensor):
        B, S, D = x.shape
        x_flat = x.view(-1, D)
        # Router: Top-K selection with renormalized gates
        logits = self.router(x_flat)
        probs = F.softmax(logits, dim=-1)
        top_k_probs, top_k_idx = probs.topk(self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        # Dispatch and combine
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            for i in range(self.num_experts):
                mask = (top_k_idx[:, k] == i)
                if mask.any():
                    expert_out = self.experts[i](x_flat[mask])
                    output[mask] += top_k_probs[mask, k].unsqueeze(-1) * expert_out
        # Auxiliary losses (load balancing + z-loss)
        aux_loss = compute_moe_auxiliary_losses(logits, top_k_idx, self.num_experts)
        return output.view(B, S, D), aux_loss
```
Conclusion and Future Directions
The MoE architecture has established itself as the most practical approach to achieving the two goals of "scaling model capacity" and "computational efficiency" at once. Starting from the Switch Transformer's Top-1 simplification, Mixtral 8x7B brought MoE to the open-source ecosystem, and DeepSeek-V3 set a new standard with Fine-Grained Experts and its Auxiliary-Loss-Free strategy.
Promising directions for future research include:
- Dynamic expert activation: Adaptive routing that adjusts the number of active experts to input difficulty
- Train-inference consistency: Techniques that keep routing patterns stable between training and inference
- Expert specialization analysis: Interpretability research into which knowledge and functions each expert specializes in
- MoE for edge devices: Lightweight MoE designs for mobile and edge environments
References
- Shazeer, N., et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
- Fedus, W., et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.
- Jiang, A.Q., et al. "Mixtral of Experts." arXiv:2401.04088, 2024.
- DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
- DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
- Zoph, B., et al. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906, 2022.
- Zhou, Y., et al. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022.
- Puigcerver, J., et al. "From Sparse to Soft Mixtures of Experts." ICLR 2024.
- Du, N., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022.
- Jacobs, R.A., et al. "Adaptive Mixtures of Local Experts." Neural Computation, 1991.
Deep Dive into Mixture of Experts (MoE) Architecture: From Switch Transformer to Mixtral and DeepSeek
- Introduction
- History and Evolution of MoE Architecture
- Mathematical Foundations of Sparse MoE
- Switch Transformer Analysis
- Mixtral 8x7B Architecture Details
- DeepSeek-V2/V3 Innovations: DeepSeekMoE
- Routing Strategy Comparison
- Training Stability Techniques
- Inference Optimization
- Major MoE Model Comparison
- Complete Implementation: Custom MoE Transformer Block
- Conclusion and Future Directions
- References

Introduction
In the era of large language models (LLMs), endlessly scaling model parameters hits fundamental walls in both training cost and inference cost. Dense Transformers activate all parameters for every input token, meaning that as parameter count grows, FLOPs grow proportionally. The Mixture of Experts (MoE) architecture solves this through conditional computation. The core idea is to maintain large model capacity while keeping actual computation constant by activating only a subset of expert networks based on the input.
Since Shazeer et al. proposed the Sparsely-Gated MoE in their 2017 paper "Outrageously Large Neural Networks," MoE has evolved rapidly -- from Google's Switch Transformer in 2021, to Mistral's Mixtral 8x7B in 2023, and DeepSeek-V2/V3 in 2024. By 2025, Meta adopted MoE with Llama 4, and DeepSeek-R1 built on the V3 architecture to maximize reasoning capabilities, gaining worldwide attention.
This article provides a paper-level deep dive from the mathematical foundations of MoE architecture through the design philosophies of major models, comparative analysis of routing strategies, training stability techniques, and inference optimization.
History and Evolution of MoE Architecture
The concept of MoE was first proposed by Jacobs et al. in their 1991 paper "Adaptive Mixtures of Local Experts." Initially, it was a simple approach of computing a weighted sum of multiple expert network outputs through a gating network.
The key milestones in modern MoE evolution are as follows:
- 2017: Shazeer et al. proposed LSTM-based Sparsely-Gated MoE with 4,096 experts and 100B+ parameters
- 2021: Google's Switch Transformer simplified routing to Top-1, achieving 1.6 trillion parameters
- 2022: Google's ST-MoE (Stable and Transferable MoE) systematized training stability techniques
- 2022: Expert Choice Routing paper proposed reverse routing where experts select tokens
- 2023: Mixtral 8x7B opened the era of open-source MoE with Top-2 routing and SwiGLU experts
- 2024: DeepSeek-V2 introduced Fine-Grained Experts and Auxiliary-Loss-Free strategy
- 2024: DeepSeek-V3 achieved state-of-the-art with 671B parameters (37B active)
- 2025: Llama 4 Scout (16 experts, 109B/17B active) marked Meta's first MoE adoption
Mathematical Foundations of Sparse MoE
Basic Formulation
The output of an MoE layer is defined as:
where is the hidden representation of the input token, is the number of experts, is the -th expert network, and is the weight assigned by the gating function to the -th expert.
Sparse Gating Function
The Noisy Top-K gating function proposed by Shazeer (2017) works as follows:
where is the gating weight matrix and is the noise weight matrix. The TopK operation retains only the top values and sets the rest to , making them zero after Softmax.
PyTorch Implementation: Basic Sparse Gating
import torch
import torch.nn as nn
import torch.nn.functional as F
class TopKGating(nn.Module):
"""Noisy Top-K Gating mechanism for MoE."""
def __init__(self, input_dim: int, num_experts: int, top_k: int = 2):
super().__init__()
self.num_experts = num_experts
self.top_k = top_k
self.gate = nn.Linear(input_dim, num_experts, bias=False)
self.noise = nn.Linear(input_dim, num_experts, bias=False)
def forward(self, x: torch.Tensor):
# x shape: (batch_size, seq_len, input_dim)
logits = self.gate(x) # (batch, seq, num_experts)
# Training noise for exploration
if self.training:
noise = torch.randn_like(logits) * F.softplus(self.noise(x))
logits = logits + noise
# Top-K selection
top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
# (batch, seq, top_k)
# Sparse softmax: only over selected experts
top_k_gates = F.softmax(top_k_logits, dim=-1)
return top_k_gates, top_k_indices
Switch Transformer Analysis
Core Innovation: Top-1 Routing
The core innovation of the Switch Transformer, published by Fedus et al. in 2021, is Top-1 routing. Previous research assumed that at least two experts needed to be activated for stable training, but Switch Transformer proved that selecting exactly one expert per token is sufficient.
The routed output is simply the product of the gating probability and the expert output:
Architecture Characteristics
Switch Transformer replaces the FFN (Feed-Forward Network) layers of the T5 architecture with MoE layers. Each MoE layer can accommodate up to 2,048 experts, enabling models at the 1.6 trillion parameter scale. Top-1 routing cuts communication costs in half and simplifies the routing computation itself.
Performance
With 64 experts, Switch Transformer achieves 7x faster pretraining speed compared to T5-Base at equivalent compute. This is because model capacity increases while per-token computation remains constant.
PyTorch Implementation: Switch Transformer MoE Layer
class SwitchMoELayer(nn.Module):
"""Switch Transformer style MoE layer with Top-1 routing."""
def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int,
capacity_factor: float = 1.25):
super().__init__()
self.num_experts = num_experts
self.capacity_factor = capacity_factor
self.router = nn.Linear(hidden_dim, num_experts, bias=False)
self.experts = nn.ModuleList([
nn.Sequential(
nn.Linear(hidden_dim, ffn_dim),
nn.ReLU(),
nn.Linear(ffn_dim, hidden_dim)
) for _ in range(num_experts)
])
def forward(self, x: torch.Tensor):
batch_size, seq_len, hidden_dim = x.shape
x_flat = x.view(-1, hidden_dim) # (B*S, D)
num_tokens = x_flat.shape[0]
# Router: Top-1 selection
router_logits = self.router(x_flat) # (B*S, E)
router_probs = F.softmax(router_logits, dim=-1)
expert_indices = router_probs.argmax(dim=-1) # (B*S,)
expert_gates = router_probs.gather(1, expert_indices.unsqueeze(-1)).squeeze(-1)
# Capacity: max tokens per expert
capacity = int(self.capacity_factor * num_tokens / self.num_experts)
# Dispatch tokens to experts
output = torch.zeros_like(x_flat)
for i in range(self.num_experts):
mask = (expert_indices == i)
if mask.sum() == 0:
continue
selected = x_flat[mask][:capacity] # enforce capacity
expert_out = self.experts[i](selected)
gates = expert_gates[mask][:capacity].unsqueeze(-1)
output[mask][:capacity] = expert_out * gates
return output.view(batch_size, seq_len, hidden_dim)
Mixtral 8x7B Architecture Details
Design Philosophy
Mixtral 8x7B, released by Mistral AI in December 2023, is based on the Mistral 7B architecture with each Transformer layer's FFN replaced by an MoE layer consisting of 8 experts. It uses Top-2 routing to activate 2 experts per token.
Key Numbers
- Total parameters: 46.7B (8 experts x ~5.6B FFN + shared attention parameters)
- Active parameters: ~13B (2 expert FFNs per token + shared parameters)
- Expert function: SwiGLU FFN
- Attention: Grouped Query Attention (GQA)
- Context length: 32K tokens
- Sliding Window Attention applied
Top-2 Routing Formula
The MoE layer output in Mixtral is computed as:
The gating function computes a Softmax probability distribution over experts for input and selects the top 2. The gating weights of the selected 2 experts are renormalized to sum to 1.
SwiGLU Expert Network
Each expert is an FFN using the SwiGLU activation function:
PyTorch Implementation: Mixtral MoE Block
class MixtralMoEBlock(nn.Module):
"""Mixtral-style MoE block with Top-2 SwiGLU experts."""
def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 8):
super().__init__()
self.num_experts = num_experts
self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
self.experts = nn.ModuleList([
SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
])
def forward(self, x: torch.Tensor):
# x: (batch, seq_len, hidden_dim)
gate_logits = self.gate(x) # (batch, seq, num_experts)
gate_probs = F.softmax(gate_logits, dim=-1)
# Top-2 selection
top2_probs, top2_indices = gate_probs.topk(2, dim=-1)
# Renormalize gates to sum to 1
top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)
# Compute expert outputs and combine
batch, seq, dim = x.shape
output = torch.zeros_like(x)
for k in range(2):
expert_idx = top2_indices[:, :, k] # (batch, seq)
gate_val = top2_probs[:, :, k].unsqueeze(-1) # (batch, seq, 1)
for i in range(self.num_experts):
mask = (expert_idx == i)
if mask.any():
expert_input = x[mask]
expert_output = self.experts[i](expert_input)
output[mask] += gate_val[mask].squeeze(-1).unsqueeze(-1) * expert_output
return output
class SwiGLUExpert(nn.Module):
"""SwiGLU Feed-Forward Network used as expert."""
def __init__(self, hidden_dim: int, ffn_dim: int):
super().__init__()
self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)
self.v = nn.Linear(hidden_dim, ffn_dim, bias=False)
self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)
def forward(self, x: torch.Tensor):
return self.w2(F.silu(self.w1(x)) * self.v(x))
DeepSeek-V2/V3 Innovations: DeepSeekMoE
Fine-Grained Expert Segmentation
DeepSeek-V2 (2024) took a fundamentally different approach from existing MoE models. The core idea is Fine-Grained Expert Segmentation -- splitting experts into smaller and more numerous units.
The original experts are increased to , while each expert's hidden dimension is reduced by . Simultaneously, the number of activated experts increases proportionally from to , maintaining the same per-token computation while enabling finer-grained expert combinations.
DeepSeek-V3 Architecture
DeepSeek-V3 (December 2024) features the following key configuration:
- Total parameters: 671B
- Active parameters: 37B (per token)
- Routed experts: 256 (per layer)
- Shared experts: 1 (per layer, always active)
- Active routed experts: 8 (per token)
- Attention: Multi-head Latent Attention (MLA)
Auxiliary-Loss-Free Load Balancing
One of the most innovative contributions of DeepSeek-V3 is its auxiliary-loss-free load balancing strategy. Traditional MoE models use auxiliary losses for load balancing, but calibrating the coefficient is difficult, and excessive values degrade model performance.
Instead, DeepSeek-V3 adds a bias term to each expert, used only for routing decisions:
The bias term only affects routing decisions and is not included in the actual gating weight computation. Overloaded experts have their decreased while underloaded experts have their increased, achieving load balance without contaminating the loss function.
Device-Limited Routing
To limit communication costs, DeepSeek-V3 restricts each token to be sent to at most nodes. It selects the top nodes based on affinity scores of experts distributed across each node, then performs Top-K routing only among experts within those selected nodes.
Routing Strategy Comparison
Top-1 Routing (Switch Transformer)
Activates exactly 1 expert per token. Minimizes communication cost and simplifies implementation, but reliance on a single expert may limit representational power.
Top-2 Routing (Mixtral, GShard)
Activates 2 experts per token with weighted combination. Richer representation than Top-1 but doubles communication cost.
Expert Choice Routing (Zhou et al., 2022)
Reverses the traditional approach: experts select tokens. Since each expert selects a fixed number of tokens, load balance is automatically guaranteed. However, a single token may be selected by zero or multiple experts, introducing non-determinism.
Soft MoE (Puigcerver et al., 2023)
Instead of discrete routing, passes weighted combinations of all tokens to each expert. Fully differentiable with no token dropping, but not truly sparse since every expert processes information from all tokens.
PyTorch Implementation: Expert Choice Routing
class ExpertChoiceRouter(nn.Module):
"""Expert Choice Routing: experts select tokens."""
def __init__(self, hidden_dim: int, num_experts: int, capacity_factor: float = 1.0):
super().__init__()
self.num_experts = num_experts
self.capacity_factor = capacity_factor
self.router = nn.Linear(hidden_dim, num_experts, bias=False)
def forward(self, x: torch.Tensor):
# x: (num_tokens, hidden_dim)
num_tokens = x.shape[0]
capacity = int(self.capacity_factor * num_tokens / self.num_experts)
# Compute affinity scores
scores = self.router(x) # (num_tokens, num_experts)
scores = F.softmax(scores, dim=0) # softmax over tokens (not experts)
# Each expert selects top-capacity tokens
# Transpose: (num_experts, num_tokens)
expert_scores = scores.t()
# Top-capacity selection per expert
top_scores, top_indices = expert_scores.topk(capacity, dim=-1)
# top_scores: (num_experts, capacity)
# top_indices: (num_experts, capacity)
return top_scores, top_indices
Training Stability Techniques
Load Balancing Loss
The load balancing loss proposed in Switch Transformer is defined as:
where is the number of experts, is the fraction of tokens routed to expert , and is the mean router probability assigned to expert . The coefficient is typically set between 0.01 and 0.1.
Under ideal uniform distribution, , so the loss equals . The loss increases as imbalance grows.
Router Z-Loss
The Router Z-Loss proposed in ST-MoE (2022) constrains the magnitude of router logits to improve training stability:
where denotes the router logits. This loss prevents logits from growing excessively large, mitigating instability and convergence issues in routing decisions.
PyTorch Implementation: Load Balancing + Z-Loss
def compute_moe_auxiliary_losses(
router_logits: torch.Tensor,
expert_indices: torch.Tensor,
num_experts: int,
alpha_balance: float = 0.01,
alpha_z: float = 0.001
):
"""Compute load balancing loss and router z-loss.
Args:
router_logits: Raw router logits (batch*seq, num_experts)
expert_indices: Selected expert indices (batch*seq, top_k)
num_experts: Total number of experts
alpha_balance: Weight for load balancing loss
alpha_z: Weight for router z-loss
"""
num_tokens = router_logits.shape[0]
router_probs = F.softmax(router_logits, dim=-1)
# --- Load Balancing Loss ---
# f_i: fraction of tokens routed to expert i
expert_mask = F.one_hot(expert_indices, num_experts).float()
if expert_mask.dim() == 3:
expert_mask = expert_mask.sum(dim=1) # sum over top_k
expert_mask = (expert_mask > 0).float()
f = expert_mask.mean(dim=0) # (num_experts,)
# P_i: mean router probability for expert i
P = router_probs.mean(dim=0) # (num_experts,)
balance_loss = alpha_balance * num_experts * (f * P).sum()
# --- Router Z-Loss ---
log_z = torch.logsumexp(router_logits, dim=-1) # (num_tokens,)
z_loss = alpha_z * (log_z ** 2).mean()
return balance_loss + z_loss
Inference Optimization
Expert Offloading
Due to the large total parameter count of MoE models, it may be difficult to fit all experts in GPU memory. Expert Offloading stores inactive experts in CPU RAM or on disk and loads them to GPU only when needed.
Key techniques include:
- LRU Cache: Caches recently used experts on GPU
- Predictive Prefetch: Asynchronously preloads experts needed for the next layer
- Speculative Decoding + Offloading: Combines with speculative decoding to hide offloading latency
Quantization
Quantization for MoE models is similar to dense models, but requires additional consideration since weight distributions may differ across experts.
- GPTQ/AWQ: Independent quantization configurations per expert are possible
- Mixed Precision: Higher precision for frequently used experts, lower for rarely used ones
- MiLo (2025): Adds Low-Rank compensators to extremely quantized MoE models to recover accuracy
Expert Parallelism
In MoE inference, Expert Parallelism places each expert on a separate GPU for parallel processing. All-to-All communication sends tokens to the GPU hosting their assigned expert, processes them, and gathers results back.
Major MoE Model Comparison
| Model | Year | Total Params | Active Params | Experts | Routing | Expert Type | Key Features |
|---|---|---|---|---|---|---|---|
| Sparsely-Gated MoE | 2017 | 137B | - | 4096 | Top-K | MLP | First large-scale Sparse MoE |
| Switch Transformer | 2021 | 1.6T | - | 2048 | Top-1 | FFN | Simplified routing, T5-based |
| GLaM | 2022 | 1.2T | 97B | 64 | Top-2 | FFN | 1/3 energy vs GPT-3 |
| ST-MoE | 2022 | 269B | - | 32 | Top-2 | FFN | Z-Loss, stability focus |
| Expert Choice | 2022 | - | - | - | Expert Choice | FFN | Experts select tokens |
| Mixtral 8x7B | 2023 | 46.7B | 13B | 8 | Top-2 | SwiGLU | Open-source, GQA |
| DeepSeek-V2 | 2024 | 236B | 21B | 160+2 | Top-6 | Fine-Grained | MLA, shared experts |
| DeepSeek-V3 | 2024 | 671B | 37B | 256+1 | Top-8 | Fine-Grained | MLA, auxiliary-loss-free balancing |
| Llama 4 Scout | 2025 | 109B | 17B | 16 | Top-1 | - | Meta's first MoE |
Complete Implementation: Custom MoE Transformer Block
Below is a complete implementation of a Transformer block that combines a multi-head attention layer with an MoE FFN. It reuses the SwiGLUExpert module and the compute_moe_auxiliary_losses function defined earlier.
class MoETransformerBlock(nn.Module):
    """Complete Transformer block with MoE FFN layer."""

    def __init__(
        self,
        hidden_dim: int = 768,
        num_heads: int = 12,
        ffn_dim: int = 3072,
        num_experts: int = 8,
        top_k: int = 2,
        capacity_factor: float = 1.25,
        dropout: float = 0.1
    ):
        super().__init__()
        # Multi-Head Attention
        self.attn_norm = nn.LayerNorm(hidden_dim)
        self.attention = nn.MultiheadAttention(
            hidden_dim, num_heads, dropout=dropout, batch_first=True
        )
        # MoE FFN
        self.ffn_norm = nn.LayerNorm(hidden_dim)
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim)
            for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts
        # Note: capacity_factor is kept for API compatibility; the simple
        # dispatch loop below does not enforce a capacity limit.
        self.capacity_factor = capacity_factor
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None):
        # Pre-norm Attention
        residual = x
        x_norm = self.attn_norm(x)
        attn_out, _ = self.attention(x_norm, x_norm, x_norm, attn_mask=mask)
        x = residual + self.dropout(attn_out)
        # Pre-norm MoE FFN
        residual = x
        x_norm = self.ffn_norm(x)
        moe_out, aux_loss = self._moe_forward(x_norm)
        x = residual + self.dropout(moe_out)
        return x, aux_loss

    def _moe_forward(self, x: torch.Tensor):
        B, S, D = x.shape
        x_flat = x.view(-1, D)
        # Router: pick top-k experts per token, renormalize their weights
        logits = self.router(x_flat)
        probs = F.softmax(logits, dim=-1)
        top_k_probs, top_k_idx = probs.topk(self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        # Dispatch and combine
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            for i in range(self.num_experts):
                mask = (top_k_idx[:, k] == i)
                if mask.any():
                    expert_out = self.experts[i](x_flat[mask])
                    output[mask] += top_k_probs[mask, k].unsqueeze(-1) * expert_out
        # Auxiliary loss (load balancing + router z-loss)
        aux_loss = compute_moe_auxiliary_losses(
            logits, top_k_idx, self.num_experts
        )
        return output.view(B, S, D), aux_loss
Conclusion and Future Directions
MoE architecture has established itself as the most practical approach for simultaneously achieving "model capacity scaling" and "computational efficiency." Starting from Switch Transformer's Top-1 simplification, Mixtral 8x7B brought MoE to the open-source ecosystem, and DeepSeek-V3 set new standards with Fine-Grained Experts and Auxiliary-Loss-Free strategies.
Key future research directions include:
- Dynamic expert activation: Adaptive routing that adjusts the number of active experts based on input difficulty
- Training-inference consistency: Techniques ensuring routing patterns from training are maintained during inference
- Expert specialization analysis: Interpretability research on what knowledge or functions each expert specializes in
- MoE for edge devices: Lightweight MoE designs for mobile and edge environments
References
- Shazeer, N., et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
- Fedus, W., et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.
- Jiang, A.Q., et al. "Mixtral of Experts." arXiv:2401.04088, 2024.
- DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
- DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
- Zoph, B., et al. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906, 2022.
- Zhou, Y., et al. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022.
- Puigcerver, J., et al. "From Sparse to Soft Mixtures of Experts." ICLR 2024.
- Du, N., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022.
- Jacobs, R.A., et al. "Adaptive Mixtures of Local Experts." Neural Computation, 1991.