Deep Dive into Mixture of Experts (MoE) Architecture: From Switch Transformer to Mixtral and DeepSeek
- Introduction
- History and Evolution of MoE Architecture
- Mathematical Foundations of Sparse MoE
- Switch Transformer Analysis
- Mixtral 8x7B Architecture Details
- DeepSeek-V2/V3 Innovations: DeepSeekMoE
- Routing Strategy Comparison
- Training Stability Techniques
- Inference Optimization
- Major MoE Model Comparison
- Complete Implementation: Custom MoE Transformer Block
- Conclusion and Future Directions
- References

Introduction
In the era of large language models (LLMs), endlessly scaling model parameters hits fundamental walls in both training cost and inference cost. Dense Transformers activate all parameters for every input token, so as the parameter count grows, FLOPs grow proportionally. The Mixture of Experts (MoE) architecture addresses this through **conditional computation**: activate only a subset of expert networks depending on the input, so that model capacity stays large while actual per-token computation stays bounded.
Since Shazeer et al. proposed the Sparsely-Gated MoE in their 2017 paper "Outrageously Large Neural Networks," MoE has evolved rapidly — from Google's Switch Transformer in 2021, to Mistral's Mixtral 8x7B in 2023, and DeepSeek-V2/V3 in 2024. In 2025, Meta adopted MoE with Llama 4, and DeepSeek-R1 built on the V3 architecture to maximize reasoning capability, drawing worldwide attention.
This article is a paper-level deep dive, from the mathematical foundations of MoE through the design philosophies of the major models, a comparative analysis of routing strategies, training stability techniques, and inference optimization.
History and Evolution of MoE Architecture
The concept of MoE was first proposed by Jacobs et al. in their 1991 paper "Adaptive Mixtures of Local Experts." Initially it was a simple scheme in which a gating network computed a weighted sum of several expert networks' outputs.
The key milestones in modern MoE can be summarized as follows:
- 2017: Shazeer et al. proposed LSTM-based Sparsely-Gated MoE with 4,096 experts and 100B+ parameters
- 2021: Google's Switch Transformer simplified routing to Top-1, reaching 1.6 trillion parameters
- 2022: Google's ST-MoE (Stable and Transferable MoE) systematized training stability techniques
- 2022: The Expert Choice Routing paper proposed reverse routing, in which experts select tokens
- 2023: Mixtral 8x7B opened the era of open-source MoE with Top-2 routing and SwiGLU experts
- 2024: DeepSeek-V2 introduced Fine-Grained Experts and the Auxiliary-Loss-Free strategy
- 2024: DeepSeek-V3 achieved state-of-the-art performance with 671B parameters (37B active)
- 2025: Llama 4 Scout (16 experts, 109B total / 17B active) marked Meta's first MoE adoption
Mathematical Foundations of Sparse MoE
Basic Formulation
The output of an MoE layer is defined as:

$$y = \sum_{i=1}^{N} G(x)_i \, E_i(x)$$

where $x$ is the hidden representation of the input token, $N$ is the number of experts, $E_i$ is the $i$-th expert network, and $G(x)_i$ is the weight the gating function assigns to the $i$-th expert.
Sparse Gating Function
The Noisy Top-K gating function proposed by Shazeer et al. (2017) is:

$$G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)$$
$$H(x)_i = (x W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\big((x W_{\text{noise}})_i\big)$$

where $W_g$ is the gating weight matrix and $W_{\text{noise}}$ is the noise weight matrix. The KeepTopK operation retains only the top $k$ values and sets the rest to $-\infty$, so they become zero after the Softmax.
PyTorch Implementation: Basic Sparse Gating
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGating(nn.Module):
    """Noisy Top-K Gating mechanism for MoE."""

    def __init__(self, input_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(input_dim, num_experts, bias=False)
        self.noise = nn.Linear(input_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x shape: (batch_size, seq_len, input_dim)
        logits = self.gate(x)  # (batch, seq, num_experts)
        # Add noise during training to encourage exploration
        if self.training:
            noise = torch.randn_like(logits) * F.softplus(self.noise(x))
            logits = logits + noise
        # Top-K selection: (batch, seq, top_k)
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        # Sparse softmax: normalize only over the selected experts
        top_k_gates = F.softmax(top_k_logits, dim=-1)
        return top_k_gates, top_k_indices
```
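A quick standalone check of the gating math (plain tensor ops rather than the module above, so it runs on its own): the selected gates form a valid probability distribution over exactly `top_k` experts per token.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, top_k = 8, 2
logits = torch.randn(4, 16, num_experts)  # router logits: (batch, seq, experts)

# Keep only the top-k logits per token, then renormalize with softmax
top_k_logits, top_k_indices = logits.topk(top_k, dim=-1)
gates = F.softmax(top_k_logits, dim=-1)   # (batch, seq, top_k)
```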
Switch Transformer Analysis
Core Innovation: Top-1 Routing
The core innovation of the Switch Transformer, published by Fedus et al. in 2021, is Top-1 routing. Previous work assumed that at least two experts had to be activated for stable training, but the Switch Transformer showed that selecting exactly one expert per token is sufficient.
The routed output is simply the product of the gating probability and the selected expert's output:

$$y = p_i(x) \, E_i(x), \qquad i = \arg\max_j \, p_j(x)$$

where $p(x) = \mathrm{Softmax}(W_r x)$ are the router probabilities.
Architecture Characteristics
The Switch Transformer replaces the FFN (Feed-Forward Network) layers of the T5 architecture with MoE layers. Each MoE layer can hold up to 2,048 experts, enabling models at the 1.6 trillion parameter scale. Top-1 routing cuts communication costs in half and simplifies the routing computation itself.
Performance
With 64 experts, the Switch Transformer achieved 7x faster pretraining than T5-Base at equivalent compute. This is possible because model capacity grows while per-token computation stays constant.
PyTorch Implementation: Switch Transformer MoE Layer
```python
class SwitchMoELayer(nn.Module):
    """Switch Transformer style MoE layer with Top-1 routing."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int,
                 capacity_factor: float = 1.25):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.ReLU(),
                nn.Linear(ffn_dim, hidden_dim),
            ) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        batch_size, seq_len, hidden_dim = x.shape
        x_flat = x.view(-1, hidden_dim)  # (B*S, D)
        num_tokens = x_flat.shape[0]
        # Router: Top-1 selection
        router_logits = self.router(x_flat)  # (B*S, E)
        router_probs = F.softmax(router_logits, dim=-1)
        expert_indices = router_probs.argmax(dim=-1)  # (B*S,)
        expert_gates = router_probs.gather(1, expert_indices.unsqueeze(-1)).squeeze(-1)
        # Capacity: max tokens per expert
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)
        # Dispatch tokens to experts
        output = torch.zeros_like(x_flat)
        for i in range(self.num_experts):
            # Positions of tokens routed to expert i, truncated to capacity
            # (tokens beyond capacity are dropped and pass through as zeros)
            token_idx = (expert_indices == i).nonzero(as_tuple=True)[0][:capacity]
            if token_idx.numel() == 0:
                continue
            expert_out = self.experts[i](x_flat[token_idx])
            gates = expert_gates[token_idx].unsqueeze(-1)
            # Integer-index assignment writes back into `output`; chained
            # boolean indexing would only modify a temporary copy.
            output[token_idx] = expert_out * gates
        return output.view(batch_size, seq_len, hidden_dim)
```
Mixtral 8x7B Architecture Details
Design Philosophy
Mixtral 8x7B, released by Mistral AI in December 2023, is based on the Mistral 7B architecture, with each Transformer layer's FFN replaced by an MoE layer of 8 experts. It uses Top-2 routing, activating 2 experts per token.
Key Numbers
- Total parameters: 46.7B (8 experts x ~5.6B FFN each + shared attention parameters)
- Active parameters: ~13B (2 expert FFNs per token + shared parameters)
- Expert function: SwiGLU FFN
- Attention: Grouped Query Attention (GQA)
- Context length: 32K tokens
- Sliding Window Attention applied
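These headline numbers can be sanity-checked with rough arithmetic. The sketch below uses the published Mixtral dimensions (hidden size 4096, FFN size 14336, 32 layers) and counts only expert FFN weights, so the totals land near, not exactly on, the official figures; the remainder is attention, embedding, and norm parameters.

```python
hidden, ffn, layers, experts, top_k = 4096, 14336, 32, 8, 2

# SwiGLU expert: w1 (hidden->ffn), v (hidden->ffn), w2 (ffn->hidden)
per_expert = 3 * hidden * ffn
expert_total = per_expert * experts * layers   # all expert weights
expert_active = per_expert * top_k * layers    # expert weights touched per token

print(f"per expert:     {per_expert / 1e9:.3f}B")
print(f"all experts:    {expert_total / 1e9:.1f}B")   # ~45.1B (+ shared params -> ~46.7B total)
print(f"active experts: {expert_active / 1e9:.1f}B")  # ~11.3B (+ shared params -> ~13B active)
```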
Top-2 Routing Formula
The MoE layer output in Mixtral is computed as:

$$y = \sum_{i \in \mathcal{T}} \frac{p_i(x)}{\sum_{j \in \mathcal{T}} p_j(x)} \, E_i(x), \qquad \mathcal{T} = \mathrm{Top2}\big(p(x)\big)$$

The gating function computes a Softmax probability distribution $p(x) = \mathrm{Softmax}(W_g x)$ over the experts and selects the top 2. The gating weights of the two selected experts are renormalized so they sum to 1.
SwiGLU Expert Network
Each expert is an FFN using the SwiGLU activation:

$$E(x) = W_2 \big( \mathrm{SiLU}(W_1 x) \odot V x \big)$$
PyTorch Implementation: Mixtral MoE Block
```python
class MixtralMoEBlock(nn.Module):
    """Mixtral-style MoE block with Top-2 SwiGLU experts."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 8):
        super().__init__()
        self.num_experts = num_experts
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, hidden_dim)
        gate_logits = self.gate(x)  # (batch, seq, num_experts)
        gate_probs = F.softmax(gate_logits, dim=-1)
        # Top-2 selection
        top2_probs, top2_indices = gate_probs.topk(2, dim=-1)
        # Renormalize gates to sum to 1
        top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)
        # Compute expert outputs and combine
        output = torch.zeros_like(x)
        for k in range(2):
            expert_idx = top2_indices[:, :, k]            # (batch, seq)
            gate_val = top2_probs[:, :, k].unsqueeze(-1)  # (batch, seq, 1)
            for i in range(self.num_experts):
                mask = (expert_idx == i)
                if mask.any():
                    expert_output = self.experts[i](x[mask])
                    # gate_val[mask]: (n, 1), broadcasts over hidden_dim
                    output[mask] += gate_val[mask] * expert_output
        return output

class SwiGLUExpert(nn.Module):
    """SwiGLU Feed-Forward Network used as expert."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.v = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor):
        # SwiGLU: w2(SiLU(w1 x) * v x)
        return self.w2(F.silu(self.w1(x)) * self.v(x))
```
DeepSeek-V2/V3 Innovations: DeepSeekMoE
Fine-Grained Expert Segmentation
DeepSeek-V2 (2024) takes a fundamentally different approach from earlier MoE designs. The core idea is Fine-Grained Expert Segmentation: splitting the experts into smaller, more numerous units.
The original $N$ experts are split into $mN$ experts, with each expert's hidden dimension reduced to $1/m$ of the original. The number of activated experts is scaled up proportionally from $K$ to $mK$, so total per-token computation stays the same while far more fine-grained expert combinations become possible.
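The payoff is combinatorial. A small sketch with a segmentation factor `m` (illustrative numbers, not DeepSeek's actual configuration; `ffn_dim` here is a hypothetical coarse-expert width): segmentation leaves the activated parameter count unchanged but multiplies the number of routable expert combinations.

```python
from math import comb

N, K, m = 16, 2, 4      # base experts, active experts, segmentation factor
ffn_dim = 11008         # hypothetical FFN width of one coarse expert

# Per-token activated FFN width is invariant under segmentation
coarse_active = K * ffn_dim
fine_active = (m * K) * (ffn_dim // m)
print(coarse_active == fine_active)  # True

# ...but the number of routable expert combinations explodes
print(comb(N, K))        # C(16, 2)  = 120
print(comb(m * N, m * K))  # C(64, 8) = 4426165368
```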
DeepSeek-V3 Architecture
DeepSeek-V3 (December 2024) has the following key configuration:
- Total parameters: 671B
- Active parameters: 37B (per token)
- Routed experts: 256 (per layer)
- Shared experts: 1 (per layer, always active)
- Active routed experts: 8 (per token)
- Attention: Multi-head Latent Attention (MLA)
Auxiliary-Loss-Free Load Balancing
One of DeepSeek-V3's most innovative contributions is its auxiliary-loss-free load balancing strategy. Conventional MoE models use an auxiliary loss for load balancing, but calibrating its coefficient is difficult, and too large a value degrades model quality.
Instead, DeepSeek-V3 adds a per-expert bias term $b_i$ that is used only for the routing decision:

$$g_i = \begin{cases} s_i, & s_i + b_i \in \mathrm{TopK}\big(\{s_j + b_j\}_{j=1}^{N}, K\big) \\ 0, & \text{otherwise} \end{cases}$$

The bias term $b_i$ affects only which experts are selected; the actual gating weight is still computed from the raw affinity score $s_i$. After each step, $b_i$ is decreased for overloaded experts and increased for underloaded ones, achieving load balance without contaminating the loss function.
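The mechanism can be sketched in a few lines. This is an illustrative simulation, not DeepSeek's implementation: experts are selected on biased scores but gated on raw scores, and an assumed fixed update rate `gamma` nudges each bias against the observed load.

```python
import torch

torch.manual_seed(0)
num_experts, top_k, gamma = 8, 2, 0.01
bias = torch.zeros(num_experts)

# A skewed router: two experts get systematically higher affinity
skew = torch.tensor([2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

def route(scores, bias, top_k):
    # Select on biased scores; a real layer would still gate on the raw scores
    _, idx = (scores + bias).topk(top_k, dim=-1)
    return idx

for step in range(2000):
    scores = torch.randn(256, num_experts) + skew
    idx = route(scores, bias, top_k)
    load = torch.bincount(idx.flatten(), minlength=num_experts).float()
    # Push biases down on overloaded experts, up on underloaded ones
    bias -= gamma * torch.sign(load - load.mean())

scores = torch.randn(4096, num_experts) + skew
load = torch.bincount(route(scores, bias, top_k).flatten(), minlength=num_experts).float()
print(load / load.sum())  # close to uniform (~1/8 per expert) despite the skew
```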
Device-Limited Routing
To bound communication costs, DeepSeek-V3 restricts each token to being sent to at most $M$ nodes. It first selects the top $M$ nodes based on the affinity scores of the experts hosted on each node, then performs Top-K routing only among the experts on those selected nodes.
Routing Strategy Comparison
Top-1 Routing (Switch Transformer)
Activates exactly 1 expert per token. Communication cost is minimal and the implementation is simple, but reliance on a single expert may limit representational power.
Top-2 Routing (Mixtral, GShard)
Activates 2 experts per token and combines them by weighted sum. Richer representation than Top-1, but communication cost doubles.
Expert Choice Routing (Zhou et al., 2022)
Reverses the conventional approach: experts select tokens. Since each expert selects a fixed number of tokens, load balance is automatically guaranteed. However, a single token may be selected by zero or multiple experts, introducing non-determinism in per-token compute.
Soft MoE (Puigcerver et al., 2023)
Instead of discrete routing, passes weighted combinations of all tokens to each expert. Fully differentiable with no token dropping, but not truly sparse, since every expert processes information from all tokens.
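The contrast with sparse routing is easiest to see in code. A minimal Soft MoE-style dispatch/combine sketch (identity "experts" and a random slot projection, illustrative only): every expert slot input is a convex combination of all tokens, and every output token mixes all slot outputs, so nothing is dropped but nothing is sparse either.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, dim, num_slots = 10, 16, 4
x = torch.randn(num_tokens, dim)
phi = torch.randn(dim, num_slots)      # learnable slot projection in a real model

logits = x @ phi                       # (tokens, slots)
dispatch = F.softmax(logits, dim=0)    # per slot: weights over all tokens
combine = F.softmax(logits, dim=1)     # per token: weights over all slots

slot_inputs = dispatch.t() @ x         # (slots, dim): weighted mixes of all tokens
slot_outputs = slot_inputs             # identity experts for the sketch
y = combine @ slot_outputs             # (tokens, dim): every token sees every slot
```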
PyTorch Implementation: Expert Choice Routing
```python
class ExpertChoiceRouter(nn.Module):
    """Expert Choice Routing: experts select tokens."""

    def __init__(self, hidden_dim: int, num_experts: int, capacity_factor: float = 1.0):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        num_tokens = x.shape[0]
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)
        # Affinity scores, normalized over tokens (not experts)
        scores = self.router(x)           # (num_tokens, num_experts)
        scores = F.softmax(scores, dim=0)
        # Each expert selects its top-`capacity` tokens
        expert_scores = scores.t()        # (num_experts, num_tokens)
        top_scores, top_indices = expert_scores.topk(capacity, dim=-1)
        # top_scores, top_indices: (num_experts, capacity)
        return top_scores, top_indices
```
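The router above only picks tokens; a full layer still has to gather each expert's tokens, run the expert, and scatter-add the weighted outputs back to token positions. A minimal sketch of that combine step, using identity experts and `index_add_` for the scatter (illustrative, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, dim, num_experts, capacity = 12, 8, 3, 4
x = torch.randn(num_tokens, dim)

scores = F.softmax(torch.randn(num_tokens, num_experts), dim=0)
top_scores, top_indices = scores.t().topk(capacity, dim=-1)  # (experts, capacity)

output = torch.zeros_like(x)
for e in range(num_experts):
    tokens = x[top_indices[e]]        # gather this expert's selected tokens
    expert_out = tokens               # identity expert for the sketch
    weighted = top_scores[e].unsqueeze(-1) * expert_out
    output.index_add_(0, top_indices[e], weighted)  # scatter-add back by token index
# Tokens selected by no expert stay zero; tokens picked by several experts
# accumulate all of their weighted outputs.
```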
Training Stability Techniques
Load Balancing Loss
The load balancing loss proposed in the Switch Transformer is:

$$\mathcal{L}_{\text{balance}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where $N$ is the number of experts, $f_i$ is the fraction of tokens routed to expert $i$, and $P_i$ is the mean router probability assigned to expert $i$. The coefficient $\alpha$ is typically set between 0.01 and 0.1.
Under ideal uniform distribution, $f_i = P_i = 1/N$, so the loss equals $\alpha$; it grows as the imbalance grows.
Router Z-Loss
The Router Z-Loss proposed in ST-MoE (2022) constrains the magnitude of the router logits to improve training stability:

$$\mathcal{L}_z = \frac{1}{B} \sum_{b=1}^{B} \left( \log \sum_{j=1}^{N} e^{x_j^{(b)}} \right)^2$$

where $x^{(b)}$ are the router logits for token $b$. This loss keeps the logits from growing excessively large, mitigating instability and convergence problems in routing decisions.
PyTorch Implementation: Load Balancing + Z-Loss
```python
def compute_moe_auxiliary_losses(
    router_logits: torch.Tensor,
    expert_indices: torch.Tensor,
    num_experts: int,
    alpha_balance: float = 0.01,
    alpha_z: float = 0.001,
):
    """Compute load balancing loss and router z-loss.

    Args:
        router_logits: Raw router logits (batch*seq, num_experts)
        expert_indices: Selected expert indices (batch*seq,) or (batch*seq, top_k)
        num_experts: Total number of experts
        alpha_balance: Weight for load balancing loss
        alpha_z: Weight for router z-loss
    """
    router_probs = F.softmax(router_logits, dim=-1)
    # --- Load Balancing Loss ---
    # f_i: fraction of tokens routed to expert i
    expert_mask = F.one_hot(expert_indices, num_experts).float()
    if expert_mask.dim() == 3:
        expert_mask = expert_mask.sum(dim=1)   # merge the top_k selections
        expert_mask = (expert_mask > 0).float()
    f = expert_mask.mean(dim=0)                # (num_experts,)
    # P_i: mean router probability for expert i
    P = router_probs.mean(dim=0)               # (num_experts,)
    balance_loss = alpha_balance * num_experts * (f * P).sum()
    # --- Router Z-Loss ---
    log_z = torch.logsumexp(router_logits, dim=-1)  # (batch*seq,)
    z_loss = alpha_z * (log_z ** 2).mean()
    return balance_loss + z_loss
```
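As a quick standalone sanity check of the balance loss formula (recomputed inline so the snippet runs on its own): perfectly uniform top-1 routing yields exactly α, while routing everything to one expert yields α·N.

```python
import torch

alpha, N = 0.01, 4

# Uniform routing: each expert gets 1/N of the tokens and 1/N mean probability
f = torch.full((N,), 1.0 / N)
P = torch.full((N,), 1.0 / N)
uniform_loss = alpha * N * (f * P).sum()       # ~ alpha

# Degenerate routing: everything goes to expert 0 with probability 1
f_bad = torch.tensor([1.0, 0.0, 0.0, 0.0])
P_bad = torch.tensor([1.0, 0.0, 0.0, 0.0])
degenerate_loss = alpha * N * (f_bad * P_bad).sum()  # ~ alpha * N
```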
Inference Optimization
Expert Offloading
Because the total parameter count of MoE models is large, it can be difficult to fit all experts in GPU memory. Expert Offloading stores currently inactive experts in CPU RAM or on disk and loads them onto the GPU only when needed.
Key techniques include:
- LRU Cache: Keeps recently used experts cached on the GPU
- Predictive Prefetch: Asynchronously preloads the experts likely needed by the next layer
- Speculative Decoding + Offloading: Combines with speculative decoding to hide offloading latency
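The LRU approach is straightforward to sketch with an `OrderedDict`. This toy version (a hypothetical `load_expert` callback, with strings standing in for weights) shows only the eviction policy, not the actual GPU transfers:

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keep at most `max_experts` experts 'on GPU'; evict the least recently used."""

    def __init__(self, max_experts: int, load_expert):
        self.max_experts = max_experts
        self.load_expert = load_expert   # callback: expert_id -> weights
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as most recently used
            return self.cache[expert_id]
        weights = self.load_expert(expert_id)   # cache miss: load from CPU/disk
        self.cache[expert_id] = weights
        if len(self.cache) > self.max_experts:
            self.cache.popitem(last=False)      # evict least recently used
        return weights

cache = ExpertLRUCache(max_experts=2, load_expert=lambda i: f"weights[{i}]")
cache.get(0); cache.get(1); cache.get(0); cache.get(2)  # expert 1 gets evicted
```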
Quantization
Quantizing MoE models is similar to quantizing dense models, but requires extra care because weight distributions can differ across experts.
- GPTQ/AWQ: Independent quantization configurations can be applied per expert
- Mixed Precision: Higher precision for frequently used experts, lower for rarely used ones
- MiLo (2025): Adds Low-Rank compensators to extremely quantized MoE models to recover accuracy
Expert Parallelism
In MoE inference, Expert Parallelism places each expert on a separate GPU for parallel processing. All-to-All communication sends each token to the GPU hosting its assigned expert, then gathers the results back after processing.
Major MoE Model Comparison
| Model | Year | Total Params | Active Params | Experts | Routing | Expert Type | Key Features |
|---|---|---|---|---|---|---|---|
| Sparsely-Gated MoE | 2017 | 137B | - | 4096 | Top-K | MLP | First large-scale Sparse MoE |
| Switch Transformer | 2021 | 1.6T | - | 2048 | Top-1 | FFN | Simplified routing, T5-based |
| GLaM | 2022 | 1.2T | 97B | 64 | Top-2 | FFN | 1/3 energy vs GPT-3 |
| ST-MoE | 2022 | 269B | - | 32 | Top-2 | FFN | Z-Loss, stability focus |
| Expert Choice | 2022 | - | - | - | Expert Choice | FFN | Experts select tokens |
| Mixtral 8x7B | 2023 | 46.7B | 13B | 8 | Top-2 | SwiGLU | Open-source, GQA |
| DeepSeek-V2 | 2024 | 236B | 21B | 160 | Top-6 | Fine-Grained | Auxiliary-Loss-Free |
| DeepSeek-V3 | 2024 | 671B | 37B | 256+1 | Top-8 | Fine-Grained | MLA + shared expert |
| Llama 4 Scout | 2025 | 109B | 17B | 16 | Top-1 | - | Meta's first MoE |
Complete Implementation: Custom MoE Transformer Block
Below is a complete implementation of a Transformer block combining an attention layer with an MoE FFN.
```python
class MoETransformerBlock(nn.Module):
    """Complete Transformer block with MoE FFN layer."""

    def __init__(
        self,
        hidden_dim: int = 768,
        num_heads: int = 12,
        ffn_dim: int = 3072,
        num_experts: int = 8,
        top_k: int = 2,
        capacity_factor: float = 1.25,
        dropout: float = 0.1,
    ):
        super().__init__()
        # Multi-Head Attention
        self.attn_norm = nn.LayerNorm(hidden_dim)
        self.attention = nn.MultiheadAttention(
            hidden_dim, num_heads, dropout=dropout, batch_first=True
        )
        # MoE FFN
        self.ffn_norm = nn.LayerNorm(hidden_dim)
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None):
        # Pre-norm Attention
        residual = x
        x_norm = self.attn_norm(x)
        attn_out, _ = self.attention(x_norm, x_norm, x_norm, attn_mask=mask)
        x = residual + self.dropout(attn_out)
        # Pre-norm MoE FFN
        residual = x
        x_norm = self.ffn_norm(x)
        moe_out, aux_loss = self._moe_forward(x_norm)
        x = residual + self.dropout(moe_out)
        return x, aux_loss

    def _moe_forward(self, x: torch.Tensor):
        B, S, D = x.shape
        x_flat = x.view(-1, D)
        # Router: Top-K selection with renormalized gates
        logits = self.router(x_flat)
        probs = F.softmax(logits, dim=-1)
        top_k_probs, top_k_idx = probs.topk(self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        # Dispatch and combine
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            for i in range(self.num_experts):
                mask = (top_k_idx[:, k] == i)
                if mask.any():
                    expert_out = self.experts[i](x_flat[mask])
                    output[mask] += top_k_probs[mask, k].unsqueeze(-1) * expert_out
        # Auxiliary losses (load balancing + z-loss)
        aux_loss = compute_moe_auxiliary_losses(logits, top_k_idx, self.num_experts)
        return output.view(B, S, D), aux_loss
```
Conclusion and Future Directions
The MoE architecture has established itself as the most practical approach to achieving the two goals of "scaling model capacity" and "computational efficiency" at once. Starting from the Switch Transformer's Top-1 simplification, Mixtral 8x7B brought MoE to the open-source ecosystem, and DeepSeek-V3 set a new standard with Fine-Grained Experts and its Auxiliary-Loss-Free strategy.
Promising directions for future research include:
- Dynamic expert activation: Adaptive routing that adjusts the number of active experts to input difficulty
- Train-inference consistency: Techniques that keep routing patterns stable between training and inference
- Expert specialization analysis: Interpretability research into which knowledge and functions each expert specializes in
- MoE for edge devices: Lightweight MoE designs for mobile and edge environments
References
- Shazeer, N., et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
- Fedus, W., et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.
- Jiang, A.Q., et al. "Mixtral of Experts." arXiv:2401.04088, 2024.
- DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
- DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
- Zoph, B., et al. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906, 2022.
- Zhou, Y., et al. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022.
- Puigcerver, J., et al. "From Sparse to Soft Mixtures of Experts." ICLR 2024.
- Du, N., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022.
- Jacobs, R.A., et al. "Adaptive Mixtures of Local Experts." Neural Computation, 1991.
Deep Dive into Mixture of Experts (MoE) Architecture: From Switch Transformer to Mixtral and DeepSeek
- Introduction
- History and Evolution of MoE Architecture
- Mathematical Foundations of Sparse MoE
- Switch Transformer Analysis
- Mixtral 8x7B Architecture Details
- DeepSeek-V2/V3 Innovations: DeepSeekMoE
- Routing Strategy Comparison
- Training Stability Techniques
- Inference Optimization
- Major MoE Model Comparison
- Complete Implementation: Custom MoE Transformer Block
- Conclusion and Future Directions
- References

Introduction
In the era of large language models (LLMs), endlessly scaling model parameters hits fundamental walls in both training cost and inference cost. Dense Transformers activate all parameters for every input token, meaning that as parameter count grows, FLOPs grow proportionally. The Mixture of Experts (MoE) architecture solves this through conditional computation. The core idea is to maintain large model capacity while keeping actual computation constant by activating only a subset of expert networks based on the input.
Since Shazeer et al. proposed the Sparsely-Gated MoE in their 2017 paper "Outrageously Large Neural Networks," MoE has evolved rapidly -- from Google's Switch Transformer in 2021, to Mistral's Mixtral 8x7B in 2023, and DeepSeek-V2/V3 in 2024. By 2025, Meta adopted MoE with Llama 4, and DeepSeek-R1 built on the V3 architecture to maximize reasoning capabilities, gaining worldwide attention.
This article provides a paper-level deep dive from the mathematical foundations of MoE architecture through the design philosophies of major models, comparative analysis of routing strategies, training stability techniques, and inference optimization.
History and Evolution of MoE Architecture
The concept of MoE was first proposed by Jacobs et al. in their 1991 paper "Adaptive Mixtures of Local Experts." Initially, it was a simple approach of computing a weighted sum of multiple expert network outputs through a gating network.
The key milestones in modern MoE evolution are as follows:
- 2017: Shazeer et al. proposed LSTM-based Sparsely-Gated MoE with 4,096 experts and 100B+ parameters
- 2021: Google's Switch Transformer simplified routing to Top-1, achieving 1.6 trillion parameters
- 2022: Google's ST-MoE (Stable and Transferable MoE) systematized training stability techniques
- 2022: Expert Choice Routing paper proposed reverse routing where experts select tokens
- 2023: Mixtral 8x7B opened the era of open-source MoE with Top-2 routing and SwiGLU experts
- 2024: DeepSeek-V2 introduced Fine-Grained Experts and Auxiliary-Loss-Free strategy
- 2024: DeepSeek-V3 achieved state-of-the-art with 671B parameters (37B active)
- 2025: Llama 4 Scout (16 experts, 109B/17B active) marked Meta's first MoE adoption
Mathematical Foundations of Sparse MoE
Basic Formulation
The output of an MoE layer is defined as:
where is the hidden representation of the input token, is the number of experts, is the -th expert network, and is the weight assigned by the gating function to the -th expert.
Sparse Gating Function
The Noisy Top-K gating function proposed by Shazeer (2017) works as follows:
where is the gating weight matrix and is the noise weight matrix. The TopK operation retains only the top values and sets the rest to , making them zero after Softmax.
PyTorch Implementation: Basic Sparse Gating
import torch
import torch.nn as nn
import torch.nn.functional as F
class TopKGating(nn.Module):
"""Noisy Top-K Gating mechanism for MoE."""
def __init__(self, input_dim: int, num_experts: int, top_k: int = 2):
super().__init__()
self.num_experts = num_experts
self.top_k = top_k
self.gate = nn.Linear(input_dim, num_experts, bias=False)
self.noise = nn.Linear(input_dim, num_experts, bias=False)
def forward(self, x: torch.Tensor):
# x shape: (batch_size, seq_len, input_dim)
logits = self.gate(x) # (batch, seq, num_experts)
# Training noise for exploration
if self.training:
noise = torch.randn_like(logits) * F.softplus(self.noise(x))
logits = logits + noise
# Top-K selection
top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
# (batch, seq, top_k)
# Sparse softmax: only over selected experts
top_k_gates = F.softmax(top_k_logits, dim=-1)
return top_k_gates, top_k_indices
Switch Transformer Analysis
Core Innovation: Top-1 Routing
The core innovation of the Switch Transformer, published by Fedus et al. in 2021, is Top-1 routing. Previous research assumed that at least two experts needed to be activated for stable training, but Switch Transformer proved that selecting exactly one expert per token is sufficient.
The routed output is simply the product of the gating probability and the expert output:
Architecture Characteristics
Switch Transformer replaces the FFN (Feed-Forward Network) layers of the T5 architecture with MoE layers. Each MoE layer can accommodate up to 2,048 experts, enabling models at the 1.6 trillion parameter scale. Top-1 routing cuts communication costs in half and simplifies the routing computation itself.
Performance
With 64 experts, Switch Transformer achieves 7x faster pretraining speed compared to T5-Base at equivalent compute. This is because model capacity increases while per-token computation remains constant.
PyTorch Implementation: Switch Transformer MoE Layer
class SwitchMoELayer(nn.Module):
"""Switch Transformer style MoE layer with Top-1 routing."""
def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int,
capacity_factor: float = 1.25):
super().__init__()
self.num_experts = num_experts
self.capacity_factor = capacity_factor
self.router = nn.Linear(hidden_dim, num_experts, bias=False)
self.experts = nn.ModuleList([
nn.Sequential(
nn.Linear(hidden_dim, ffn_dim),
nn.ReLU(),
nn.Linear(ffn_dim, hidden_dim)
) for _ in range(num_experts)
])
def forward(self, x: torch.Tensor):
batch_size, seq_len, hidden_dim = x.shape
x_flat = x.view(-1, hidden_dim) # (B*S, D)
num_tokens = x_flat.shape[0]
# Router: Top-1 selection
router_logits = self.router(x_flat) # (B*S, E)
router_probs = F.softmax(router_logits, dim=-1)
expert_indices = router_probs.argmax(dim=-1) # (B*S,)
expert_gates = router_probs.gather(1, expert_indices.unsqueeze(-1)).squeeze(-1)
# Capacity: max tokens per expert
capacity = int(self.capacity_factor * num_tokens / self.num_experts)
# Dispatch tokens to experts
output = torch.zeros_like(x_flat)
for i in range(self.num_experts):
mask = (expert_indices == i)
if mask.sum() == 0:
continue
selected = x_flat[mask][:capacity] # enforce capacity
expert_out = self.experts[i](selected)
gates = expert_gates[mask][:capacity].unsqueeze(-1)
output[mask][:capacity] = expert_out * gates
return output.view(batch_size, seq_len, hidden_dim)
Mixtral 8x7B Architecture Details
Design Philosophy
Mixtral 8x7B, released by Mistral AI in December 2023, is based on the Mistral 7B architecture with each Transformer layer's FFN replaced by an MoE layer consisting of 8 experts. It uses Top-2 routing to activate 2 experts per token.
Key Numbers
- Total parameters: 46.7B (8 experts x ~5.6B FFN + shared attention parameters)
- Active parameters: ~13B (2 expert FFNs per token + shared parameters)
- Expert function: SwiGLU FFN
- Attention: Grouped Query Attention (GQA)
- Context length: 32K tokens
- Sliding Window Attention applied
Top-2 Routing Formula
The MoE layer output in Mixtral is computed as:
The gating function computes a Softmax probability distribution over experts for input and selects the top 2. The gating weights of the selected 2 experts are renormalized to sum to 1.
SwiGLU Expert Network
Each expert is an FFN using the SwiGLU activation function:
PyTorch Implementation: Mixtral MoE Block
class MixtralMoEBlock(nn.Module):
"""Mixtral-style MoE block with Top-2 SwiGLU experts."""
def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 8):
super().__init__()
self.num_experts = num_experts
self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
self.experts = nn.ModuleList([
SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
])
def forward(self, x: torch.Tensor):
# x: (batch, seq_len, hidden_dim)
gate_logits = self.gate(x) # (batch, seq, num_experts)
gate_probs = F.softmax(gate_logits, dim=-1)
# Top-2 selection
top2_probs, top2_indices = gate_probs.topk(2, dim=-1)
# Renormalize gates to sum to 1
top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)
# Compute expert outputs and combine
batch, seq, dim = x.shape
output = torch.zeros_like(x)
for k in range(2):
expert_idx = top2_indices[:, :, k] # (batch, seq)
gate_val = top2_probs[:, :, k].unsqueeze(-1) # (batch, seq, 1)
for i in range(self.num_experts):
mask = (expert_idx == i)
if mask.any():
expert_input = x[mask]
expert_output = self.experts[i](expert_input)
output[mask] += gate_val[mask].squeeze(-1).unsqueeze(-1) * expert_output
return output
class SwiGLUExpert(nn.Module):
"""SwiGLU Feed-Forward Network used as expert."""
def __init__(self, hidden_dim: int, ffn_dim: int):
super().__init__()
self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)
self.v = nn.Linear(hidden_dim, ffn_dim, bias=False)
self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)
def forward(self, x: torch.Tensor):
return self.w2(F.silu(self.w1(x)) * self.v(x))
DeepSeek-V2/V3 Innovations: DeepSeekMoE
Fine-Grained Expert Segmentation
DeepSeek-V2 (2024) took a fundamentally different approach from existing MoE models. The core idea is Fine-Grained Expert Segmentation -- splitting experts into smaller and more numerous units.
The original experts are increased to , while each expert's hidden dimension is reduced by . Simultaneously, the number of activated experts increases proportionally from to , maintaining the same per-token computation while enabling finer-grained expert combinations.
DeepSeek-V3 Architecture
DeepSeek-V3 (December 2024) features the following key configuration:
- Total parameters: 671B
- Active parameters: 37B (per token)
- Routed experts: 256 (per layer)
- Shared experts: 1 (per layer, always active)
- Active routed experts: 8 (per token)
- Attention: Multi-head Latent Attention (MLA)
Auxiliary-Loss-Free Load Balancing
One of the most innovative contributions of DeepSeek-V3 is its auxiliary-loss-free load balancing strategy. Traditional MoE models use auxiliary losses for load balancing, but calibrating the coefficient is difficult, and excessive values degrade model performance.
Instead, DeepSeek-V3 adds a bias term to each expert, used only for routing decisions:
The bias term only affects routing decisions and is not included in the actual gating weight computation. Overloaded experts have their decreased while underloaded experts have their increased, achieving load balance without contaminating the loss function.
Device-Limited Routing
To limit communication costs, DeepSeek-V3 restricts each token to be sent to at most nodes. It selects the top nodes based on affinity scores of experts distributed across each node, then performs Top-K routing only among experts within those selected nodes.
Routing Strategy Comparison
Top-1 Routing (Switch Transformer)
Activates exactly 1 expert per token. Minimizes communication cost and simplifies implementation, but reliance on a single expert may limit representational power.
Top-2 Routing (Mixtral, GShard)
Activates 2 experts per token with weighted combination. Richer representation than Top-1 but doubles communication cost.
Expert Choice Routing (Zhou et al., 2022)
Reverses the traditional approach: experts select tokens. Since each expert selects a fixed number of tokens, load balance is automatically guaranteed. However, a single token may be selected by zero or multiple experts, introducing non-determinism.
Soft MoE (Puigcerver et al., 2023)
Instead of discrete routing, passes weighted combinations of all tokens to each expert. Fully differentiable with no token dropping, but not truly sparse since every expert processes information from all tokens.
PyTorch Implementation: Expert Choice Routing
class ExpertChoiceRouter(nn.Module):
"""Expert Choice Routing: experts select tokens."""
def __init__(self, hidden_dim: int, num_experts: int, capacity_factor: float = 1.0):
super().__init__()
self.num_experts = num_experts
self.capacity_factor = capacity_factor
self.router = nn.Linear(hidden_dim, num_experts, bias=False)
def forward(self, x: torch.Tensor):
# x: (num_tokens, hidden_dim)
num_tokens = x.shape[0]
capacity = int(self.capacity_factor * num_tokens / self.num_experts)
# Compute affinity scores
scores = self.router(x) # (num_tokens, num_experts)
scores = F.softmax(scores, dim=0) # softmax over tokens (not experts)
# Each expert selects top-capacity tokens
# Transpose: (num_experts, num_tokens)
expert_scores = scores.t()
# Top-capacity selection per expert
top_scores, top_indices = expert_scores.topk(capacity, dim=-1)
# top_scores: (num_experts, capacity)
# top_indices: (num_experts, capacity)
return top_scores, top_indices
Training Stability Techniques
Load Balancing Loss
The load balancing loss proposed in Switch Transformer is defined as:
where is the number of experts, is the fraction of tokens routed to expert , and is the mean router probability assigned to expert . The coefficient is typically set between 0.01 and 0.1.
Under ideal uniform distribution, , so the loss equals . The loss increases as imbalance grows.
Router Z-Loss
The Router Z-Loss proposed in ST-MoE (2022) constrains the magnitude of router logits to improve training stability:
where denotes the router logits. This loss prevents logits from growing excessively large, mitigating instability and convergence issues in routing decisions.
PyTorch Implementation: Load Balancing + Z-Loss
def compute_moe_auxiliary_losses(
router_logits: torch.Tensor,
expert_indices: torch.Tensor,
num_experts: int,
alpha_balance: float = 0.01,
alpha_z: float = 0.001
):
"""Compute load balancing loss and router z-loss.
Args:
router_logits: Raw router logits (batch*seq, num_experts)
expert_indices: Selected expert indices (batch*seq, top_k)
num_experts: Total number of experts
alpha_balance: Weight for load balancing loss
alpha_z: Weight for router z-loss
"""
num_tokens = router_logits.shape[0]
router_probs = F.softmax(router_logits, dim=-1)
# --- Load Balancing Loss ---
# f_i: fraction of tokens routed to expert i
expert_mask = F.one_hot(expert_indices, num_experts).float()
if expert_mask.dim() == 3:
expert_mask = expert_mask.sum(dim=1) # sum over top_k
expert_mask = (expert_mask > 0).float()
f = expert_mask.mean(dim=0) # (num_experts,)
# P_i: mean router probability for expert i
P = router_probs.mean(dim=0) # (num_experts,)
balance_loss = alpha_balance * num_experts * (f * P).sum()
# --- Router Z-Loss ---
log_z = torch.logsumexp(router_logits, dim=-1) # (num_tokens,)
z_loss = alpha_z * (log_z ** 2).mean()
return balance_loss + z_loss
Inference Optimization
Expert Offloading
Due to the large total parameter count of MoE models, it may be difficult to fit all experts in GPU memory. Expert Offloading stores inactive experts in CPU RAM or on disk and loads them to GPU only when needed.
Key techniques include:
- LRU Cache: Caches recently used experts on GPU
- Predictive Prefetch: Asynchronously preloads experts needed for the next layer
- Speculative Decoding + Offloading: Combines with speculative decoding to hide offloading latency
Quantization
Quantization for MoE models is similar to dense models, but requires additional consideration since weight distributions may differ across experts.
- GPTQ/AWQ: Independent quantization configurations per expert are possible
- Mixed Precision: Higher precision for frequently used experts, lower for rarely used ones
- MiLo (2025): Adds Low-Rank compensators to extremely quantized MoE models to recover accuracy
Expert Parallelism
In MoE inference, Expert Parallelism places each expert on a separate GPU for parallel processing. All-to-All communication sends tokens to the GPU hosting their assigned expert, processes them, and gathers results back.
Major MoE Model Comparison
| Model | Year | Total Params | Active Params | Experts | Routing | Expert Type | Key Features |
|---|---|---|---|---|---|---|---|
| Sparsely-Gated MoE | 2017 | 137B | - | 4096 | Top-K | MLP | First large-scale Sparse MoE |
| Switch Transformer | 2021 | 1.6T | - | 2048 | Top-1 | FFN | Simplified routing, T5-based |
| GLaM | 2022 | 1.2T | 97B | 64 | Top-2 | FFN | 1/3 energy vs GPT-3 |
| ST-MoE | 2022 | 269B | - | 32 | Top-2 | FFN | Z-Loss, stability focus |
| Expert Choice | 2022 | - | - | - | Expert Choice | FFN | Experts select tokens |
| Mixtral 8x7B | 2023 | 46.7B | 13B | 8 | Top-2 | SwiGLU | Open-source, GQA |
| DeepSeek-V2 | 2024 | 236B | 21B | 160+2 | Top-6 | Fine-Grained | MLA, shared experts |
| DeepSeek-V3 | 2024 | 671B | 37B | 256+1 | Top-8 | Fine-Grained | MLA, auxiliary-loss-free balancing |
| Llama 4 Scout | 2025 | 109B | 17B | 16 | Top-1 | - | Meta's first MoE |
Complete Implementation: Custom MoE Transformer Block
Below is a complete implementation of a Transformer block that combines a multi-head attention layer with an MoE FFN. It reuses the SwiGLUExpert module and the compute_moe_auxiliary_losses function defined earlier.
class MoETransformerBlock(nn.Module):
    """Complete Transformer block with MoE FFN layer."""

    def __init__(
        self,
        hidden_dim: int = 768,
        num_heads: int = 12,
        ffn_dim: int = 3072,
        num_experts: int = 8,
        top_k: int = 2,
        capacity_factor: float = 1.25,
        dropout: float = 0.1
    ):
        super().__init__()
        # Multi-Head Attention
        self.attn_norm = nn.LayerNorm(hidden_dim)
        self.attention = nn.MultiheadAttention(
            hidden_dim, num_heads, dropout=dropout, batch_first=True
        )
        # MoE FFN
        self.ffn_norm = nn.LayerNorm(hidden_dim)
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim)
            for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts
        # Note: capacity_factor is kept for API compatibility; the simple
        # dispatch loop below does not enforce a capacity limit.
        self.capacity_factor = capacity_factor
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None):
        # Pre-norm Attention
        residual = x
        x_norm = self.attn_norm(x)
        attn_out, _ = self.attention(x_norm, x_norm, x_norm, attn_mask=mask)
        x = residual + self.dropout(attn_out)
        # Pre-norm MoE FFN
        residual = x
        x_norm = self.ffn_norm(x)
        moe_out, aux_loss = self._moe_forward(x_norm)
        x = residual + self.dropout(moe_out)
        return x, aux_loss

    def _moe_forward(self, x: torch.Tensor):
        B, S, D = x.shape
        x_flat = x.view(-1, D)
        # Router: pick top-k experts per token, renormalize their weights
        logits = self.router(x_flat)
        probs = F.softmax(logits, dim=-1)
        top_k_probs, top_k_idx = probs.topk(self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        # Dispatch and combine
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            for i in range(self.num_experts):
                mask = (top_k_idx[:, k] == i)
                if mask.any():
                    expert_out = self.experts[i](x_flat[mask])
                    output[mask] += top_k_probs[mask, k].unsqueeze(-1) * expert_out
        # Auxiliary loss (load balancing + router z-loss)
        aux_loss = compute_moe_auxiliary_losses(
            logits, top_k_idx, self.num_experts
        )
        return output.view(B, S, D), aux_loss
Conclusion and Future Directions
MoE architecture has established itself as the most practical approach for simultaneously achieving "model capacity scaling" and "computational efficiency." Starting from Switch Transformer's Top-1 simplification, Mixtral 8x7B brought MoE to the open-source ecosystem, and DeepSeek-V3 set new standards with Fine-Grained Experts and Auxiliary-Loss-Free strategies.
Key future research directions include:
- Dynamic expert activation: Adaptive routing that adjusts the number of active experts based on input difficulty
- Training-inference consistency: Techniques ensuring routing patterns from training are maintained during inference
- Expert specialization analysis: Interpretability research on what knowledge or functions each expert specializes in
- MoE for edge devices: Lightweight MoE designs for mobile and edge environments
References
- Shazeer, N., et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
- Fedus, W., et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.
- Jiang, A.Q., et al. "Mixtral of Experts." arXiv:2401.04088, 2024.
- DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
- DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
- Zoph, B., et al. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906, 2022.
- Zhou, Y., et al. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022.
- Puigcerver, J., et al. "From Sparse to Soft Mixtures of Experts." ICLR 2024.
- Du, N., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022.
- Jacobs, R.A., et al. "Adaptive Mixtures of Local Experts." Neural Computation, 1991.