Mixture of Experts (MoE) Architecture: A Complete Analysis

[Figure: Mixture of Experts architecture]

1. What is MoE?

Mixture of Experts (MoE) is an architecture that improves computational efficiency by activating only a subset of the model's total parameters. Unlike Dense models that use all parameters for every input, MoE selects and activates only the optimal experts based on the input.

Dense vs Sparse Models

  • Dense Model: All parameters are activated for every input (e.g., LLaMA, GPT-3)
  • Sparse MoE: Only a fraction of parameters are activated (e.g., Mixtral, DeepSeek-V3)

The key advantage is that the model has a large number of parameters but low computational cost. Mixtral 8x7B has 46.7B total parameters, but only about 12.9B are activated during inference.
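
This arithmetic can be sanity-checked with a back-of-the-envelope count. The sketch below assumes Mixtral's published configuration (d_model=4096, d_ff=14336, 32 layers, grouped-query attention with 8 KV heads of dimension 128, 32k vocabulary) and ignores small terms such as norms and router gates:

d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
d_kv = 8 * 128                                      # 8 KV heads x head_dim 128 (GQA)
expert = 3 * d_model * d_ff                         # w1, w2, w3 of one SwiGLU FFN
attn = 2 * d_model * d_model + 2 * d_model * d_kv   # Wq, Wo full-rank; Wk, Wv reduced
embed = 2 * vocab * d_model                         # input embeddings + LM head

total = n_layers * (8 * expert + attn) + embed      # all 8 Experts counted
active = n_layers * (2 * expert + attn) + embed     # only Top-2 Experts per token
print(f"total = {total / 1e9:.1f}B, active = {active / 1e9:.1f}B")
# total = 46.7B, active = 12.9B -- matching the published figures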

2. Core Components of MoE Architecture

Expert Network

Each Expert is an independent FFN (Feed-Forward Network):

import torch
import torch.nn as nn

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # SwiGLU gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU activation
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))
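
A quick shape check with hypothetical dimensions confirms that the Expert maps (batch, seq_len, d_model) back to the same shape:

expert = Expert(d_model=512, d_ff=2048)      # hypothetical sizes
x = torch.randn(2, 16, 512)                  # (batch, seq_len, d_model)
print(expert(x).shape)                       # torch.Size([2, 16, 512])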

Router (Gating Network)

The Router determines which Expert each token is sent to. Note that the softmax is applied after the Top-K selection, so the routing weights are renormalized over only the selected Experts (the scheme used in Mixtral):

class TopKRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x shape: (batch, seq_len, d_model)
        logits = self.gate(x)  # (batch, seq_len, num_experts)
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        top_k_weights = torch.softmax(top_k_logits, dim=-1)
        return top_k_weights, top_k_indices

Full MoE Layer Implementation

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_experts)
        ])
        self.router = TopKRouter(d_model, num_experts, top_k)
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, d_model = x.shape
        weights, indices = self.router(x)

        # Reshape for expert processing
        flat_x = x.view(-1, d_model)
        flat_weights = weights.view(-1, weights.shape[-1])
        flat_indices = indices.view(-1, indices.shape[-1])

        output = torch.zeros_like(flat_x)
        for i, expert in enumerate(self.experts):
            # Find tokens routed to this expert
            mask = (flat_indices == i).any(dim=-1)
            if mask.any():
                expert_input = flat_x[mask]
                expert_output = expert(expert_input)
                # Weight by router probability
                idx = (flat_indices[mask] == i).float()
                w = (flat_weights[mask] * idx).sum(dim=-1, keepdim=True)
                output[mask] += w * expert_output

        return output.view(batch_size, seq_len, d_model)
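
A quick smoke test with hypothetical small dimensions:

torch.manual_seed(0)
layer = MoELayer(d_model=64, d_ff=256, num_experts=8, top_k=2)
x = torch.randn(4, 10, 64)                   # (batch, seq_len, d_model)
print(layer(x).shape)                        # torch.Size([4, 10, 64])

The Python loop over Experts keeps the example readable; production implementations instead group tokens per Expert and use gather/scatter or batched GEMM kernels.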

3. Major MoE Model Analysis

Mixtral 8x7B (Mistral AI)

  • 8 Experts, Top-2 routing
  • 46.7B total parameters, 12.9B active
  • Attention layers are shared; only FFN is split into Experts

DeepSeek-V3 MoE

DeepSeek-V3 employs a more sophisticated MoE design:

class DeepSeekMoE(nn.Module):
    """DeepSeek-V3 style: Shared Expert + Routed Expert"""
    def __init__(self, d_model, d_ff, num_shared=1,
                 num_routed=256, top_k=8):
        super().__init__()
        # Shared Expert that all tokens pass through
        self.shared_experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_shared)
        ])
        # Routed Expert selected per token
        self.routed_experts = nn.ModuleList([
            Expert(d_model, d_ff // 4) for _ in range(num_routed)
        ])
        self.router = TopKRouter(d_model, num_routed, top_k)

    def forward(self, x):
        # Shared Expert output (every token passes through these)
        shared_out = sum(e(x) for e in self.shared_experts)
        # Routed Expert output
        weights, indices = self.router(x)
        routed_out = self._route_tokens(x, weights, indices)
        return shared_out + routed_out

    def _route_tokens(self, x, weights, indices):
        # Dispatch each token to its Top-K routed Experts and combine
        # the outputs, weighted by the router probabilities
        batch_size, seq_len, d_model = x.shape
        flat_x = x.view(-1, d_model)
        flat_w = weights.view(-1, weights.shape[-1])
        flat_idx = indices.view(-1, indices.shape[-1])
        out = torch.zeros_like(flat_x)
        for i, expert in enumerate(self.routed_experts):
            mask = (flat_idx == i).any(dim=-1)
            if mask.any():
                w = (flat_w[mask] * (flat_idx[mask] == i).float()).sum(dim=-1, keepdim=True)
                out[mask] += w * expert(flat_x[mask])
        return out.view(batch_size, seq_len, d_model)

  • 1 Shared Expert + 256 Routed Experts (Top-8 selection)
  • 671B total parameters, 37B active
  • Introduced Auxiliary-loss-free load balancing
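
A scaled-down instantiation of the DeepSeekMoE class above (hypothetical toy sizes; the production model pairs d_model=7168 with 256 routed Experts):

moe = DeepSeekMoE(d_model=256, d_ff=1024, num_shared=1, num_routed=16, top_k=2)
x = torch.randn(2, 8, 256)
print(moe(x).shape)                          # torch.Size([2, 8, 256])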

Model Comparison

Model          | Total Params | Active Params | Experts | Top-K
---------------|--------------|---------------|---------|------
Mixtral 8x7B   | 46.7B        | 12.9B         | 8       | 2
Mixtral 8x22B  | 141B         | 39B           | 8       | 2
DeepSeek-V3    | 671B         | 37B           | 256+1   | 8+1
Qwen2.5-MoE    | 14.3B        | 2.7B          | 60+4    | 4+4

(N+M notation: routed Experts + shared Experts.)

4. Routing Strategies

Token Choice vs Expert Choice

# Token Choice: Each token selects its Experts
def token_choice_routing(logits, top_k=2):
    top_k_vals, top_k_idx = logits.topk(top_k, dim=-1)
    weights = torch.softmax(top_k_vals, dim=-1)
    return weights, top_k_idx

# Expert Choice: Each Expert selects its tokens
def expert_choice_routing(logits, capacity_factor=1.25):
    num_tokens = logits.shape[0]
    num_experts = logits.shape[1]
    capacity = int(num_tokens * capacity_factor / num_experts)

    expert_scores = logits.T  # (num_experts, num_tokens)
    top_k_vals, top_k_idx = expert_scores.topk(capacity, dim=-1)
    return top_k_vals, top_k_idx
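
A quick comparison on random logits (hypothetical shapes) makes the trade-off concrete: Token Choice can load Experts unevenly, while Expert Choice fills every Expert to exactly its capacity:

torch.manual_seed(0)
logits = torch.randn(16, 4)                  # 16 tokens, 4 experts
_, tc_idx = token_choice_routing(logits, top_k=2)
print(torch.bincount(tc_idx.reshape(-1), minlength=4))  # per-Expert load, often uneven

_, ec_idx = expert_choice_routing(logits)    # capacity = int(16 * 1.25 / 4) = 5
print(ec_idx.shape)                          # torch.Size([4, 5]): equal load by construction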

5. Load Balancing

Load imbalance across Experts is a core challenge in MoE:

def load_balancing_loss(router_logits, top_k_indices, num_experts):
    """Auxiliary load balancing loss (Switch Transformer style)"""
    # Token ratio per Expert
    mask = torch.zeros_like(router_logits)
    mask.scatter_(-1, top_k_indices, 1.0)
    tokens_per_expert = mask.float().mean(dim=0)  # (num_experts,)

    # Average routing probability per Expert
    router_probs = torch.softmax(router_logits, dim=-1)
    router_prob_per_expert = router_probs.mean(dim=0)

    # Dot product of the two distributions = measure of imbalance
    loss = num_experts * (tokens_per_expert * router_prob_per_expert).sum()
    return loss
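
For intuition: with perfectly uniform routing, tokens_per_expert is top_k/num_experts and router_prob_per_expert is 1/num_experts for every Expert, so the loss evaluates to exactly top_k; more concentrated routing pushes it higher. A quick check with hypothetical shapes:

torch.manual_seed(0)
router_logits = torch.randn(4096, 8)         # (num_tokens, num_experts)
_, top2_idx = router_logits.topk(2, dim=-1)
print(load_balancing_loss(router_logits, top2_idx, num_experts=8))  # ≈ 2.0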

DeepSeek-V3's Auxiliary-loss-free approach dynamically adjusts per-Expert bias terms, lowering the bias for overloaded Experts and raising it for underutilized ones, achieving balanced load without an additional loss term.
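
A minimal sketch of this mechanism, assuming a simple sign-based update rule (the real model uses sigmoid gating and a tuned update speed): the bias steers only which Experts are selected, while the combine weights still come from the raw router scores.

class BiasBalancedRouter(nn.Module):
    """Sketch of auxiliary-loss-free balancing via a per-Expert bias."""
    def __init__(self, d_model, num_experts, top_k=8, update_speed=1e-3):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k
        self.update_speed = update_speed
        self.register_buffer("expert_bias", torch.zeros(num_experts))

    def forward(self, x):
        logits = self.gate(x)                 # (..., num_experts)
        # The bias influences *selection* only...
        _, indices = (logits + self.expert_bias).topk(self.top_k, dim=-1)
        # ...while the combine weights come from the raw logits
        weights = torch.softmax(logits.gather(-1, indices), dim=-1)
        return weights, indices

    @torch.no_grad()
    def update_bias(self, indices):
        # After each batch: lower the bias of overloaded Experts,
        # raise it for underloaded ones
        load = torch.bincount(indices.reshape(-1),
                              minlength=self.expert_bias.numel()).float()
        self.expert_bias += self.update_speed * torch.sign(load.mean() - load)

Because expert_bias receives no gradient, the balancing pressure never competes with the language-modeling objective.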

6. Inference Optimization

# Expert Parallelism: distribute Experts across multiple GPUs
# (e.g., GPU 0 holds Experts 0-3, GPU 1 holds Experts 4-7)
class ExpertParallel(nn.Module):
    def __init__(self, d_model: int, d_ff: int,
                 experts_per_gpu: int, rank: int, world_size: int):
        super().__init__()
        self.local_experts = nn.ModuleList([
            Expert(d_model, d_ff)
            for _ in range(experts_per_gpu)
        ])
        self.rank = rank
        self.world_size = world_size

    def forward(self, x, indices):
        # Redistribute tokens to the GPUs that host their Experts.
        # all_to_all here stands in for torch.distributed.all_to_all_single;
        # _process_local runs each received token through its local Expert.
        dispatched = all_to_all(x, indices, self.world_size)
        # Process tokens with the Experts resident on this GPU
        output = self._process_local(dispatched)
        # Send results back to the GPUs that own the original tokens
        return all_to_all(output, indices, self.world_size)

7. Quiz

Q1: Approximately how many parameters does a single token use during inference in Mixtral 8x7B?

Approximately 12.9B parameters. Since only the Top-2 out of 8 Experts are activated, only the FFN parameters of 2 Experts plus the shared Attention parameters are used. This is roughly 28% of the total 46.7B.

Q2: What is the core idea behind DeepSeek-V3's Auxiliary-loss-free load balancing?

Traditional MoE adds an auxiliary loss for load balancing, which can degrade model performance. DeepSeek-V3 introduces dynamic bias terms for each Expert, lowering the bias for Experts receiving too many tokens and raising it for underutilized ones, naturally achieving balance. This enables stable load balancing without an additional loss.

Q3: What are the differences and trade-offs between Token Choice and Expert Choice routing?
  • Token Choice: Each token selects its Top-K Experts. Simple to implement but can lead to load imbalance where tokens concentrate on certain Experts.
  • Expert Choice: Each Expert selects the tokens it processes. Guarantees perfect load balancing but some tokens may not be selected by any Expert.

In practice, Token Choice combined with a load balancing loss is the most commonly used approach.
