Mixture of Experts (MoE) Architecture: A Complete Analysis

1. What Is MoE?
- Dense vs Sparse Models
2. Core Components of the MoE Architecture
3. Analysis of Major MoE Models
4. Routing Strategies
- Token Choice vs Expert Choice
5. Load Balancing
6. Inference Optimization
7. Quiz

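1. What Is MoE?

Mixture of Experts (MoE) is an architecture that improves compute efficiency by activating only a subset of the model's parameters for each input. Unlike a dense model, which runs the entire network for every input, an MoE model selects only the best-suited experts for each input and processes it with those experts alone.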

Dense vs Sparse Models

Dense model: every parameter is active for every token (e.g., LLaMA, GPT-3)
Sparse MoE: only a subset of the parameters is active per token (e.g., Mixtral, DeepSeek-V3)

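The key advantage is that the parameter count can be large while the compute per token stays small. Mixtral 8x7B has 46.7B parameters in total, but only about 12.9B of them are active for each token at inference time. All 46.7B parameters must still be held in memory, so the saving is in compute, not in memory footprint.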
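
As a rough sanity check of these figures, the sketch below counts parameters from the hyperparameters of the publicly released Mixtral 8x7B configuration (32 layers, hidden size 4096, FFN size 14336, 32 query heads and 8 key/value heads of dimension 128, vocabulary 32000). These values are not given in this article, so treat them as assumptions and verify them before relying on the result.

```python
# Assumed Mixtral 8x7B hyperparameters (not stated in this article; verify before use)
d_model, d_ff = 4096, 14336
n_layers, n_experts, top_k = 32, 8, 2
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 32000

expert = 3 * d_model * d_ff                      # w1, w2, w3 of one SwiGLU FFN
attn = (2 * d_model * n_heads * head_dim         # query and output projections
        + 2 * d_model * n_kv_heads * head_dim)   # key and value projections (grouped-query attention)
router = d_model * n_experts                     # gating matrix
norms = 2 * d_model                              # two norm weight vectors per layer
embed = 2 * vocab * d_model + d_model            # input embedding, output head, final norm


def count(experts_used: int) -> int:
    """Parameter count when `experts_used` experts per layer are counted."""
    per_layer = experts_used * expert + attn + router + norms
    return n_layers * per_layer + embed


total = count(n_experts)   # every expert is stored
active = count(top_k)      # only top_k experts run for a given token
print(f"total  = {total / 1e9:.1f}B")
print(f"active = {active / 1e9:.1f}B ({active / total:.0%} of total)")
```

Under these assumptions the counts round to 46.7B total and 12.9B active, which matches the figures quoted above. Almost all of the difference comes from the six experts per layer that a token skips.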

2. Core Components of the MoE Architecture

Expert Network

Each expert is an independent feed-forward network (FFN):

import torch
import torch.nn as nn

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # SwiGLU gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU activation
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

Router (Gating Network)

The router decides which experts each token is sent to:

class TopKRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x shape: (batch, seq_len, d_model)
        logits = self.gate(x)  # (batch, seq_len, num_experts)
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        top_k_weights = torch.softmax(top_k_logits, dim=-1)
        return top_k_weights, top_k_indices
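
To make the gating arithmetic concrete, here is a framework-free sketch of the same steps for a single token: pick the k largest logits, then apply softmax over only those k values. The function name `top_k_gate` and the example logits are made up for illustration.

```python
import math


def top_k_gate(logits, k=2):
    """Return (weights, indices) for one token's router logits."""
    # Indices of the k largest logits, largest first
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1
    m = max(logits[i] for i in indices)
    exps = [math.exp(logits[i] - m) for i in indices]
    z = sum(exps)
    return [e / z for e in exps], indices


weights, indices = top_k_gate([0.1, 2.0, -1.0, 1.0], k=2)
print(indices)
print([round(w, 3) for w in weights])
```

Here experts 1 and 3 are selected with weights of roughly 0.731 and 0.269. Applying softmax after the top-k step is what makes the weights sum to 1; taking softmax over all experts first and then selecting would leave the selected weights summing to less than 1 unless they are renormalized.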

Full MoE Layer Implementation

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_experts)
        ])
        self.router = TopKRouter(d_model, num_experts, top_k)
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, d_model = x.shape
        weights, indices = self.router(x)

        # Reshape for expert processing
        flat_x = x.reshape(-1, d_model)  # reshape also handles non-contiguous inputs
        flat_weights = weights.view(-1, weights.shape[-1])
        flat_indices = indices.view(-1, indices.shape[-1])

        output = torch.zeros_like(flat_x)
        for i, expert in enumerate(self.experts):
            # Find tokens routed to this expert
            mask = (flat_indices == i).any(dim=-1)
            if mask.any():
                expert_input = flat_x[mask]
                expert_output = expert(expert_input)
                # Weight by router probability
                idx = (flat_indices[mask] == i).float()
                w = (flat_weights[mask] * idx).sum(dim=-1, keepdim=True)
                output[mask] += w * expert_output

        return output.view(batch_size, seq_len, d_model)

3. Analysis of Major MoE Models

Mixtral 8x7B (Mistral AI)

8 experts, top-2 routing
46.7B total parameters, 12.9B active
Attention layers are shared; only the FFN is split into experts

DeepSeek-V3 MoE

DeepSeek-V3 adopts a more elaborate MoE design:

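class DeepSeekMoE(nn.Module):
    """DeepSeek-V3 style: shared experts plus routed experts (simplified)."""
    def __init__(self, d_model, d_ff, num_shared=1,
                 num_routed=256, top_k=8):
        super().__init__()
        # Shared experts process every token
        self.shared_experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_shared)
        ])
        # Routed experts are selected per token
        self.routed_experts = nn.ModuleList([
            Expert(d_model, d_ff // 4) for _ in range(num_routed)
        ])
        self.router = TopKRouter(d_model, num_routed, top_k)

    def _route_tokens(self, x, weights, indices):
        # Same dispatch loop as MoELayer.forward, applied to the routed experts
        flat_x = x.reshape(-1, x.shape[-1])
        flat_w = weights.reshape(-1, weights.shape[-1])
        flat_idx = indices.reshape(-1, indices.shape[-1])
        out = torch.zeros_like(flat_x)
        for i, expert in enumerate(self.routed_experts):
            hit = flat_idx == i            # (tokens, top_k)
            mask = hit.any(dim=-1)         # tokens routed to expert i
            if mask.any():
                w = (flat_w[mask] * hit[mask]).sum(dim=-1, keepdim=True)
                out[mask] += w * expert(flat_x[mask])
        return out.reshape(x.shape)

    def forward(self, x):
        # Shared-expert output
        shared_out = sum(e(x) for e in self.shared_experts)
        # Routed-expert output
        weights, indices = self.router(x)
        routed_out = self._route_tokens(x, weights, indices)
        return shared_out + routed_out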

1 shared expert + 256 routed experts (top-8 selection)
671B total parameters, 37B active
Introduces auxiliary-loss-free load balancing

Model Comparison

Model	Total parameters	Active parameters	Experts (routed+shared)	Top-K (routed+shared)
Mixtral 8x7B	46.7B	12.9B	8	2
Mixtral 8x22B	141B	39B	8	2
DeepSeek-V3	671B	37B	256+1	8+1
Qwen1.5-MoE-A2.7B	14.3B	2.7B	60+4	4+4
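
The sparsity of each model is easier to compare as a ratio of active to total parameters. The short sketch below only redoes that division on the figures from the table; the dictionary keys are labels chosen for this example.

```python
# (total, active) parameter counts in billions, copied from the table above
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "Mixtral 8x22B": (141.0, 39.0),
    "DeepSeek-V3": (671.0, 37.0),
    "Qwen MoE (14.3B)": (14.3, 2.7),
}

# Fraction of the parameters that a single token actually uses
ratios = {name: active / total for name, (total, active) in models.items()}

for name, ratio in ratios.items():
    print(f"{name:18s} {ratio:6.1%}")
```

Both Mixtral models activate a little over a quarter of their parameters per token, while DeepSeek-V3 activates only about 5.5%, which reflects its much larger pool of small routed experts.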

4. Routing Strategies

Token Choice vs Expert Choice

# Token Choice: each token selects its experts
def token_choice_routing(logits, top_k=2):
    top_k_vals, top_k_idx = logits.topk(top_k, dim=-1)
    weights = torch.softmax(top_k_vals, dim=-1)
    return weights, top_k_idx

# Expert Choice: each expert selects its tokens
def expert_choice_routing(logits, capacity_factor=1.25):
    num_tokens = logits.shape[0]
    num_experts = logits.shape[1]
    capacity = int(num_tokens * capacity_factor / num_experts)

    expert_scores = logits.T  # (num_experts, num_tokens)
    top_k_vals, top_k_idx = expert_scores.topk(capacity, dim=-1)
    return top_k_vals, top_k_idx
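
The practical consequence of Expert Choice is easiest to see on a toy example. The sketch below is a plain-Python version of the same idea with made-up scores and a deliberately small capacity of one token per expert; `expert_choice` is a hypothetical helper, not part of any library.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the router score of token t for expert e."""
    num_tokens, num_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(num_experts):
        # Rank all tokens by their score for this expert, best first
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]   # each expert takes exactly `capacity` tokens
    return assignment


scores = [
    [0.9, 0.1],   # token 0
    [0.8, 0.2],   # token 1
    [0.2, 0.7],   # token 2
    [0.3, 0.3],   # token 3
]
assignment = expert_choice(scores, capacity=1)
chosen = {t for tokens in assignment.values() for t in tokens}
dropped = [t for t in range(len(scores)) if t not in chosen]
print(assignment)
print(dropped)
```

Every expert ends up with the same number of tokens, so the load is balanced by construction, but tokens 1 and 3 are not picked by any expert. With Token Choice the opposite holds: every token is served, but nothing stops most tokens from choosing the same expert.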

5. Load Balancing

Load imbalance across experts is a central problem in MoE:

def load_balancing_loss(router_logits, top_k_indices, num_experts):
    """Auxiliary load balancing loss (Switch Transformer style).

    router_logits: (num_tokens, num_experts); top_k_indices: (num_tokens, top_k).
    """
    # Fraction of tokens dispatched to each expert
    mask = torch.zeros_like(router_logits)
    mask.scatter_(-1, top_k_indices, 1.0)
    tokens_per_expert = mask.mean(dim=0)  # (num_experts,)

    # Mean routing probability per expert
    router_probs = torch.softmax(router_logits, dim=-1)
    router_prob_per_expert = router_probs.mean(dim=0)

    # Dot product of the two vectors serves as the imbalance measure
    loss = num_experts * (tokens_per_expert * router_prob_per_expert).sum()
    return loss
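
To get a feel for the scale of this loss, the sketch below evaluates the same formula, N * sum_i f_i * P_i, on two tiny hand-made batches. It is a plain-Python, top-1 re-implementation written for this illustration, not the function above.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]


def aux_loss(logits, num_experts):
    """Top-1 version of N * sum_i f_i * P_i on nested lists."""
    probs = [softmax(row) for row in logits]
    n = len(logits)
    f = [0.0] * num_experts
    for p in probs:
        f[p.index(max(p))] += 1.0 / n   # f_i: fraction of tokens routed to expert i
    # P_i: mean router probability assigned to expert i
    P = [sum(p[i] for p in probs) / n for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))


balanced = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]   # tokens alternate between experts
collapsed = [[4.0, 0.0]] * 4                                  # every token prefers expert 0
loss_balanced = aux_loss(balanced, 2)
loss_collapsed = aux_loss(collapsed, 2)
print(round(loss_balanced, 3), round(loss_collapsed, 3))
```

With two experts the balanced batch gives a loss of 1.0 and the collapsed batch gives about 1.96, close to the maximum of 2. In training, this term is typically scaled by a small coefficient and added to the main loss.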

DeepSeek-V3's auxiliary-loss-free approach dynamically adjusts a per-expert bias term, which keeps the load balanced without a separate loss.
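
The bias mechanism can be sketched in a few lines. This is a simplified illustration, not DeepSeek's implementation: the helper names, the step size `gamma`, and the example loads are all assumptions. The bias is added to the routing score only when choosing experts, and after each step it is nudged down for overloaded experts and up for underloaded ones.

```python
def select_top_k(scores, bias, k):
    """Pick experts by biased score; the bias influences selection only."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return order[:k]


def update_bias(bias, load, gamma=0.001):
    """Move each expert's bias toward balance (gamma is a hypothetical step size)."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)   # overloaded: make it less attractive
        elif n < mean_load:
            new_bias.append(b + gamma)   # underloaded: make it more attractive
        else:
            new_bias.append(b)
    return new_bias


scores = [0.50, 0.49, 0.10]
bias = [0.0, 0.0, 0.0]
before = select_top_k(scores, bias, k=1)   # expert 0 wins on raw score
bias = update_bias(bias, load=[10, 2, 0], gamma=0.02)
after = select_top_k(scores, bias, k=1)    # the bias now tips the choice to expert 1
print(before, bias, after)
```

Because the bias only changes which experts are selected, no gradient from a balancing term competes with the language-modeling objective, which is the motivation for dropping the auxiliary loss.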

6. Inference Optimization

# Expert Parallelism: shard the experts across multiple GPUs
# GPU 0: Experts 0-3, GPU 1: Experts 4-7
# Sketch only: all_to_all and _process_local are placeholders that are not defined here.
class ExpertParallel(nn.Module):
    def __init__(self, d_model, d_ff, experts_per_gpu, rank, world_size):
        super().__init__()
        self.local_experts = nn.ModuleList([
            Expert(d_model, d_ff)
            for _ in range(experts_per_gpu)
        ])
        self.rank = rank
        self.world_size = world_size

    def forward(self, x, indices):
        # Redistribute tokens with all-to-all communication
        dispatched = all_to_all(x, indices, self.world_size)
        # Run the experts hosted on this rank
        output = self._process_local(dispatched)
        # Send the results back to the originating ranks
        return all_to_all(output, indices, self.world_size)
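
Before the all-to-all exchange, each rank has to know how many token copies it will send to every other rank. The sketch below computes those counts in plain Python, assuming experts are placed contiguously (rank = expert id // experts per GPU, as in the GPU 0 / GPU 1 comment above). The function name and the example routing are hypothetical.

```python
from collections import Counter


def send_counts(expert_indices, experts_per_gpu, world_size):
    """Count how many (token, expert) pairs are destined for each rank."""
    counts = Counter()
    for token_experts in expert_indices:
        for e in token_experts:
            counts[e // experts_per_gpu] += 1   # contiguous placement of experts on ranks
    return [counts[r] for r in range(world_size)]


# Top-2 expert ids for four tokens, with 8 experts split over 2 GPUs
indices = [[0, 5], [1, 2], [0, 3], [3, 6]]
counts = send_counts(indices, experts_per_gpu=4, world_size=2)
print(counts)
```

In this example GPU 0 receives six token copies and GPU 1 only two, so GPU 1 sits idle while GPU 0 finishes. This is why load balancing matters even more under expert parallelism: the slowest rank sets the step time.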

7. Quiz

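Q1: In Mixtral 8x7B, roughly how many parameters does a single token use at inference time?

About 12.9B parameters. Only the top 2 of the 8 experts are activated, so a token uses the FFN parameters of 2 experts plus the shared attention parameters. That is about 28% of the full 46.7B.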

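Q2: What is the core idea behind the auxiliary-loss-free load balancing introduced by DeepSeek-V3?

Conventional MoE models add a separate auxiliary loss for load balancing, which can degrade model quality. DeepSeek-V3 instead gives each expert a dynamic bias term, lowering the bias of experts that receive many tokens and raising the bias of experts that receive few, so that the load evens out. This enables stable load balancing without a separate loss.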

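Q3: How do Token Choice and Expert Choice routing differ, and what are the trade-offs of each?

Token Choice: each token selects its top-K experts. It is simple to implement, but tokens can concentrate on a few experts and cause imbalance.
Expert Choice: each expert selects the tokens it will process. The load is balanced by construction because every expert takes a fixed number of tokens, but some tokens may not be selected by any expert.

In practice, the combination of Token Choice and a load balancing loss is the most widely used.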