Mixture of Experts (MoE) Architecture: A Complete Analysis
- 1. What is MoE?
- 2. Core Components of MoE Architecture
- 3. Major MoE Model Analysis
- 4. Routing Strategies
- 5. Load Balancing
- 6. Inference Optimization
- 7. Quiz

1. What is MoE?
Mixture of Experts (MoE) is an architecture that improves computational efficiency by activating only a subset of the model's total parameters. Unlike Dense models that use all parameters for every input, MoE selects and activates only the optimal experts based on the input.
Dense vs Sparse Models
- Dense Model: All parameters are activated for every input (e.g., LLaMA, GPT-3)
- Sparse MoE: Only a fraction of parameters are activated (e.g., Mixtral, DeepSeek-V3)
The key advantage is that the model has a large number of parameters but low computational cost. Mixtral 8x7B has 46.7B total parameters, but only about 12.9B are activated during inference.
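The arithmetic behind these figures can be checked against Mixtral's published dimensions (d_model = 4096, d_ff = 14336, 32 layers); the ~1.6B figure used below for the non-expert parameters (attention, embeddings, norms) is an approximation:

```python
# Rough parameter accounting for Mixtral 8x7B. Each SwiGLU Expert FFN has
# three d_model x d_ff weight matrices (w1, w2, w3).
d_model, d_ff, n_layers = 4096, 14336, 32
num_experts, top_k = 8, 2

ffn_params_per_expert = 3 * d_model * d_ff        # one layer's Expert FFN
expert_params = ffn_params_per_expert * n_layers  # one Expert across all layers
other_params = 1.6e9                              # attention etc. (approximate)

total = num_experts * expert_params + other_params  # ~46.7B
active = top_k * expert_params + other_params       # ~12.9B
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.1f}B")
```

Only the FFN parameters scale with the number of Experts; the attention stack is shared, which is why the active fraction is well above 2/8.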
2. Core Components of MoE Architecture
Expert Network
Each Expert is an independent FFN (Feed-Forward Network):
```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # SwiGLU gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU activation
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))
```
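As a quick sanity check on the activation: SiLU, the gating nonlinearity inside SwiGLU, is simply x·σ(x):

```python
import torch
import torch.nn.functional as F

# SiLU(x) = x * sigmoid(x); verify against the closed form
x = torch.tensor([-1.0, 0.0, 1.0])
assert torch.allclose(F.silu(x), x * torch.sigmoid(x))
print(F.silu(x))  # ≈ tensor([-0.2689, 0.0000, 0.7311])
```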
Router (Gating Network)
The Router determines which Expert each token is sent to:
```python
class TopKRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x shape: (batch, seq_len, d_model)
        logits = self.gate(x)  # (batch, seq_len, num_experts)
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        top_k_weights = torch.softmax(top_k_logits, dim=-1)
        return top_k_weights, top_k_indices
```
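To make the routing concrete, here is what the Top-2 selection does for one token's Expert logits (the logit values are made up for illustration):

```python
import torch

# Hypothetical logits for one token over 4 Experts
logits = torch.tensor([2.0, 0.5, 1.0, -1.0])
vals, idx = logits.topk(2)             # Experts 0 and 2 win
weights = torch.softmax(vals, dim=-1)  # renormalize over the winners only
print(idx.tolist(), [round(w, 3) for w in weights.tolist()])
# [0, 2] [0.731, 0.269]
```

Note that the softmax runs over the two selected logits, not all four, so the combination weights of the chosen Experts always sum to 1.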
Full MoE Layer Implementation
```python
class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_experts)
        ])
        self.router = TopKRouter(d_model, num_experts, top_k)
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, d_model = x.shape
        weights, indices = self.router(x)
        # Flatten (batch, seq_len) into one token axis for expert processing
        flat_x = x.view(-1, d_model)
        flat_weights = weights.view(-1, weights.shape[-1])
        flat_indices = indices.view(-1, indices.shape[-1])
        output = torch.zeros_like(flat_x)
        for i, expert in enumerate(self.experts):
            # Find tokens routed to this expert
            mask = (flat_indices == i).any(dim=-1)
            if mask.any():
                expert_input = flat_x[mask]
                expert_output = expert(expert_input)
                # Weight each token's output by its router probability
                idx = (flat_indices[mask] == i).float()
                w = (flat_weights[mask] * idx).sum(dim=-1, keepdim=True)
                output[mask] += w * expert_output
        return output.view(batch_size, seq_len, d_model)
```
3. Major MoE Model Analysis
Mixtral 8x7B (Mistral AI)
- 8 Experts, Top-2 routing
- 46.7B total parameters, 12.9B active
- Attention layers are shared; only FFN is split into Experts
DeepSeek-V3 MoE
DeepSeek-V3 employs a more sophisticated MoE design:
```python
class DeepSeekMoE(nn.Module):
    """DeepSeek-V3 style: shared Experts + routed Experts."""
    def __init__(self, d_model, d_ff, num_shared=1,
                 num_routed=256, top_k=8):
        super().__init__()
        # Shared Experts that every token passes through
        self.shared_experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_shared)
        ])
        # Fine-grained routed Experts, selected per token
        self.routed_experts = nn.ModuleList([
            Expert(d_model, d_ff // 4) for _ in range(num_routed)
        ])
        self.router = TopKRouter(d_model, num_routed, top_k)

    def _route_tokens(self, x, weights, indices):
        # Same dispatch loop as MoELayer.forward, over the routed Experts
        batch_size, seq_len, d_model = x.shape
        flat_x = x.view(-1, d_model)
        flat_w = weights.view(-1, weights.shape[-1])
        flat_i = indices.view(-1, indices.shape[-1])
        out = torch.zeros_like(flat_x)
        for i, expert in enumerate(self.routed_experts):
            mask = (flat_i == i).any(dim=-1)
            if mask.any():
                w = (flat_w[mask] * (flat_i[mask] == i).float()).sum(
                    dim=-1, keepdim=True)
                out[mask] += w * expert(flat_x[mask])
        return out.view(batch_size, seq_len, d_model)

    def forward(self, x):
        # Shared Expert output (always computed)
        shared_out = sum(e(x) for e in self.shared_experts)
        # Routed Expert output (sparse)
        weights, indices = self.router(x)
        routed_out = self._route_tokens(x, weights, indices)
        return shared_out + routed_out
```
- 1 Shared Expert + 256 Routed Experts (Top-8 selection)
- 671B total parameters, 37B active
- Introduced Auxiliary-loss-free load balancing
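The payoff of this fine-grained design is easy to quantify. With the dimensions from the sketch above (which are illustrative, not DeepSeek-V3's actual sizes), Top-8 routing over 256 quarter-width Experts gives the router a parameter pool far larger than what any single token touches:

```python
# Hypothetical dimensions; the point is the ratio, not the absolute sizes.
d_model, d_ff = 1024, 4096
routed_each = 3 * d_model * (d_ff // 4)  # one fine-grained SwiGLU Expert
num_routed, top_k = 256, 8
pool = num_routed * routed_each          # parameters the router can draw from
active = top_k * routed_each             # parameters actually used per token
print(pool // active)  # 32: a 32x pool for a fixed per-token compute budget
```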
Model Comparison
| Model | Total Params | Active Params | Experts | Top-K |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 |
| Mixtral 8x22B | 141B | 39B | 8 | 2 |
| DeepSeek-V3 | 671B | 37B | 256+1 | 8+1 |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | 60+4 | 4+4 |
4. Routing Strategies
Token Choice vs Expert Choice
```python
# Token Choice: each token selects its Top-K Experts
def token_choice_routing(logits, top_k=2):
    top_k_vals, top_k_idx = logits.topk(top_k, dim=-1)
    weights = torch.softmax(top_k_vals, dim=-1)
    return weights, top_k_idx

# Expert Choice: each Expert selects a fixed-capacity batch of tokens,
# so load is balanced by construction, but a token may be picked by
# zero Experts (dropped) or by several
def expert_choice_routing(logits, capacity_factor=1.25):
    num_tokens, num_experts = logits.shape
    capacity = int(num_tokens * capacity_factor / num_experts)
    expert_scores = logits.T  # (num_experts, num_tokens)
    top_k_vals, top_k_idx = expert_scores.topk(capacity, dim=-1)
    return top_k_vals, top_k_idx
```
5. Load Balancing
Load imbalance across Experts is a core challenge in MoE:
```python
def load_balancing_loss(router_logits, top_k_indices, num_experts):
    """Auxiliary load balancing loss (Switch Transformer style)."""
    # Fraction of tokens dispatched to each Expert
    mask = torch.zeros_like(router_logits)
    mask.scatter_(-1, top_k_indices, 1.0)
    tokens_per_expert = mask.float().mean(dim=0)  # (num_experts,)
    # Average routing probability per Expert
    router_probs = torch.softmax(router_logits, dim=-1)
    router_prob_per_expert = router_probs.mean(dim=0)
    # Dot product of the two distributions = measure of imbalance
    loss = num_experts * (tokens_per_expert * router_prob_per_expert).sum()
    return loss
```
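A tiny Top-1 comparison with made-up logits shows how this loss separates collapsed routing from balanced routing (the computation below inlines the same formula):

```python
import torch

# Top-1 toy case: all tokens on Expert 0 vs. an even round-robin spread
num_experts, num_tokens = 4, 8
collapsed = torch.zeros(num_tokens, num_experts)
collapsed[:, 0] = 10.0
balanced = 10.0 * torch.eye(num_experts).repeat(num_tokens // num_experts, 1)

losses = []
for logits in (collapsed, balanced):
    idx = logits.topk(1, dim=-1).indices
    mask = torch.zeros_like(logits).scatter_(-1, idx, 1.0)
    f = mask.mean(dim=0)                           # token fraction per Expert
    p = torch.softmax(logits, dim=-1).mean(dim=0)  # mean router prob per Expert
    losses.append(round((num_experts * (f * p).sum()).item(), 2))
print(losses)  # [4.0, 1.0] — 1.0 is the minimum for Top-1 routing
```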
DeepSeek-V3's Auxiliary-loss-free approach dynamically adjusts per-Expert bias terms, lowering the bias for overloaded Experts and raising it for underutilized ones, achieving balanced load without an additional loss term.
6. Inference Optimization
# Expert Parallelism: Distribute Experts across multiple GPUs
# GPU 0: Expert 0-3, GPU 1: Expert 4-7
```python
# Expert Parallelism: distribute Experts across GPUs,
# e.g. GPU 0 holds Experts 0-3, GPU 1 holds Experts 4-7.
# Sketch only: `all_to_all` stands in for a dispatch/combine step built on
# torch.distributed.all_to_all, and `_process_local` for the local expert
# loop; neither is implemented here.
class ExpertParallel(nn.Module):
    def __init__(self, d_model, d_ff, experts_per_gpu, rank, world_size):
        super().__init__()
        self.local_experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(experts_per_gpu)
        ])
        self.rank = rank
        self.world_size = world_size

    def forward(self, x, indices):
        # 1. Redistribute tokens to the GPUs holding their Experts (All-to-All)
        dispatched = all_to_all(x, indices, self.world_size)
        # 2. Run the local Experts on the tokens they received
        output = self._process_local(dispatched)
        # 3. Send the results back to the originating GPUs (All-to-All)
        return all_to_all(output, indices, self.world_size)
```
7. Quiz
Q1: Approximately how many parameters does a single token use during inference in Mixtral 8x7B?
Approximately 12.9B parameters. Since only the Top-2 out of 8 Experts are activated, only the FFN parameters of 2 Experts plus the shared Attention parameters are used. This is roughly 28% of the total 46.7B.
Q2: What is the core idea behind DeepSeek-V3's Auxiliary-loss-free load balancing?
Traditional MoE adds an auxiliary loss for load balancing, which can degrade model performance. DeepSeek-V3 introduces dynamic bias terms for each Expert, lowering the bias for Experts receiving too many tokens and raising it for underutilized ones, naturally achieving balance. This enables stable load balancing without an additional loss.
Q3: What are the differences and trade-offs between Token Choice and Expert Choice routing?
- Token Choice: Each token selects its Top-K Experts. Simple to implement but can lead to load imbalance where tokens concentrate on certain Experts.
- Expert Choice: Each Expert selects the tokens it processes. Guarantees perfect load balancing but some tokens may not be selected by any Expert.
In practice, Token Choice combined with a load balancing loss is the most commonly used approach.