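Introduction

In the era of large language models (LLMs), growing the parameter count without bound runs into limits in both training cost and inference cost. A dense Transformer activates every parameter for every input token, so the compute per token (FLOPs) grows in proportion to the parameter count. The Mixture of Experts (MoE) architecture addresses this with **conditional computation**: only a subset of experts is activated for each input, so model capacity stays large while the compute actually spent per token is held roughly constant.

Since Shazeer et al. proposed the Sparsely-Gated MoE in the 2017 paper "Outrageously Large Neural Networks", MoE architectures have developed rapidly: Google's Switch Transformer in 2021, Mistral's Mixtral 8x7B in 2023, and DeepSeek-V2/V3 in 2024. In 2025, Meta's Llama 4 adopted MoE, and DeepSeek-R1 drew worldwide attention by pushing reasoning ability on top of the V3 architecture.

This article analyzes MoE at the level of the original papers: the mathematical foundations, the design philosophy of the major models, a comparison of routing strategies, training-stability techniques, and inference optimization.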

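History and evolution of the MoE architecture

The MoE concept was first proposed by Jacobs et al. in the 1991 paper "Adaptive Mixtures of Local Experts". Early versions used a simple gating network to form a weighted sum of the outputs of several expert networks.

The turning points of modern MoE can be summarized as follows.

- **2017**: Shazeer et al. propose the LSTM-based Sparsely-Gated MoE, reaching models in the 100-billion-parameter class (up to 137B) with 4096 experts
- **2021**: Google's Switch Transformer simplifies routing to Top-1 and reaches a 1.6-trillion-parameter model
- **2022**: Google's ST-MoE (Stable and Transferable MoE) systematizes training-stability techniques
- **2022**: The Expert Choice Routing paper proposes reversed routing, in which experts select tokens
- **2023**: Mixtral 8x7B opens the open-source MoE era with Top-2 routing and SwiGLU experts
- **2024**: DeepSeek-V2 adopts fine-grained experts together with shared experts (the DeepSeekMoE design)
- **2024**: DeepSeek-V3 reaches state-of-the-art performance with 671B parameters (37B active) and introduces an auxiliary-loss-free load-balancing strategy
- **2025**: Meta adopts MoE with Llama 4 Scout (16 experts, 109B total / 17B active)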

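Mathematical foundations of sparse MoE

Basic formulation

The output of an MoE layer is defined as follows.

$$y = \sum_{i=1}^{N} g(x)_i \cdot E_i(x)$$

Here $x$ is the hidden representation of an input token, $N$ is the number of experts, $E_i$ is the $i$-th expert network, and $g(x)_i$ is the weight that the gating function assigns to the $i$-th expert. In a sparse MoE most $g(x)_i$ are exactly zero, so the corresponding $E_i(x)$ never need to be evaluated.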

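Sparse gating function

The Noisy Top-K gating function proposed by Shazeer et al. (2017) is as follows.

$$g(x) = \text{Softmax}(\text{TopK}(H(x), k))$$

$$H(x)_i = (x \cdot W_g)_i + \epsilon \cdot \text{Softplus}((x \cdot W_{noise})_i), \quad \epsilon \sim \mathcal{N}(0, 1)$$

Here $W_g$ is the gating weight matrix, $W_{noise}$ is the noise weight matrix, and $\epsilon$ is standard normal noise sampled independently for each expert. The TopK operation keeps only the top $k$ values and sets the rest to $-\infty$, so that they become 0 after the softmax.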
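The masking step can be checked by hand with a small sketch. It uses plain Python rather than PyTorch so that the arithmetic stays visible; the function name `topk_softmax` and the example logits are made up for illustration, and the noise term is left out.

```python
import math


def topk_softmax(logits, k):
    """Keep the k largest logits, set the rest to -inf, then apply softmax."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    masked = [v if i in keep else float("-inf") for i, v in enumerate(logits)]
    peak = max(masked)
    exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits H(x) for 4 experts, with k = 2.
gates = topk_softmax([2.0, 0.5, 1.0, -1.0], k=2)
print(gates)
```

Experts 1 and 3 receive a gate of exactly zero, and the two surviving gates sum to 1, which is what lets the layer skip the unselected experts entirely.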

PyTorch implementation: basic sparse gating

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class TopKGating(nn.Module):
        """Noisy Top-K gating mechanism for MoE."""

        def __init__(self, input_dim: int, num_experts: int, top_k: int = 2):
            super().__init__()
            self.num_experts = num_experts
            self.top_k = top_k
            self.gate = nn.Linear(input_dim, num_experts, bias=False)
            self.noise = nn.Linear(input_dim, num_experts, bias=False)

        def forward(self, x: torch.Tensor):
            # x: (batch_size, seq_len, input_dim)
            logits = self.gate(x)  # (batch, seq, num_experts)

            # Add learned, input-dependent noise during training for exploration.
            if self.training:
                noise = torch.randn_like(logits) * F.softplus(self.noise(x))
                logits = logits + noise

            # Top-K selection: both results have shape (batch, seq, top_k).
            top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)

            # Sparse softmax: normalize over the selected experts only.
            top_k_gates = F.softmax(top_k_logits, dim=-1)
            return top_k_gates, top_k_indices

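Switch Transformer analysis

Core innovation: Top-1 routing

The core innovation of the Switch Transformer, published by Fedus et al. in 2021, is **Top-1 routing**. Earlier work assumed that at least two experts had to be active per token for the router to train stably, but the Switch Transformer showed that selecting exactly one expert per token is sufficient.

$$g(x) = \text{Softmax}(x \cdot W_r), \quad i^* = \arg\max_i g(x)_i$$

The routed output is simply the product of the gate probability and the output of the selected expert.

$$y = g(x)_{i^*} \cdot E_{i^*}(x)$$

Because the gate probability multiplies the expert output, the router still receives gradients even though the $\arg\max$ itself is not differentiable.

Architecture characteristics

The Switch Transformer replaces the FFN (Feed-Forward Network) layers of the T5 architecture with MoE layers. The largest configuration places 2048 experts in each MoE layer, which yields a model of 1.6 trillion parameters. Compared with Top-2 routing, Top-1 routing roughly halves the expert communication volume and also simplifies the routing computation.

Performance

A Switch Transformer with 64 experts achieved up to about 7x faster pre-training than T5-Base at the same compute per token. The speedup comes from the larger model capacity, while the compute spent on each token stays the same.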

PyTorch implementation: Switch Transformer MoE layer

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SwitchMoELayer(nn.Module):
        """Switch Transformer style MoE layer with Top-1 routing."""

        def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int,
                     capacity_factor: float = 1.25):
            super().__init__()
            self.num_experts = num_experts
            self.capacity_factor = capacity_factor
            self.router = nn.Linear(hidden_dim, num_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(hidden_dim, ffn_dim),
                    nn.ReLU(),
                    nn.Linear(ffn_dim, hidden_dim),
                ) for _ in range(num_experts)
            ])

        def forward(self, x: torch.Tensor):
            batch_size, seq_len, hidden_dim = x.shape
            x_flat = x.reshape(-1, hidden_dim)  # (B*S, D)
            num_tokens = x_flat.shape[0]

            # Router: Top-1 selection.
            router_logits = self.router(x_flat)  # (B*S, E)
            router_probs = F.softmax(router_logits, dim=-1)
            expert_gates, expert_indices = router_probs.max(dim=-1)  # (B*S,) each

            # Capacity: maximum number of tokens each expert may process.
            capacity = max(1, int(self.capacity_factor * num_tokens / self.num_experts))

            # Dispatch tokens to experts. Tokens beyond capacity are dropped: their
            # output stays zero, and a residual connection outside this layer is
            # assumed to carry them forward.
            output = torch.zeros_like(x_flat)
            for i in range(self.num_experts):
                token_idx = (expert_indices == i).nonzero(as_tuple=True)[0][:capacity]
                if token_idx.numel() == 0:
                    continue
                expert_out = self.experts[i](x_flat[token_idx])
                # Write through integer positions. Chained boolean indexing such as
                # output[mask][:capacity] = ... assigns to a temporary copy instead.
                output[token_idx] = expert_out * expert_gates[token_idx].unsqueeze(-1)
            return output.view(batch_size, seq_len, hidden_dim)
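The capacity rule in the layer above is easier to follow with concrete numbers. The following plain-Python sketch applies the same formula, `int(capacity_factor * num_tokens / num_experts)`, to a made-up routing of 8 tokens over 4 experts; the helper names and the token assignment are illustrative assumptions.

```python
def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Maximum number of tokens one expert may process."""
    return int(capacity_factor * num_tokens / num_experts)


def count_dropped(assignments, num_experts, capacity):
    """Fill experts in token order and count tokens that overflow the capacity."""
    load = [0] * num_experts
    dropped = 0
    for expert in assignments:
        if load[expert] < capacity:
            load[expert] += 1
        else:
            dropped += 1
    return load, dropped


# Hypothetical Top-1 assignment: expert 0 is heavily over-subscribed.
assignments = [0, 0, 0, 0, 1, 2, 3, 0]
capacity = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.25)
load, dropped = count_dropped(assignments, num_experts=4, capacity=capacity)
print(capacity, load, dropped)
```

With a capacity of 2, expert 0 keeps only two of its five tokens and the other three are dropped. A larger capacity factor reduces drops at the cost of more padding and memory on each expert.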

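Mixtral 8x7B architecture in detail

Design philosophy

Released by Mistral AI in December 2023, Mixtral 8x7B builds on the Mistral 7B architecture and replaces the FFN of every Transformer layer with an MoE layer made of 8 experts. It uses **Top-2 routing**, activating 2 experts per token.

Key figures

- Total parameters: 46.7B (8 experts x about 5.6B of FFN weights each + shared attention parameters)
- Active parameters: about 13B (2 expert FFNs per token + shared parameters)
- Expert function: SwiGLU FFN
- Attention: Grouped Query Attention (GQA)
- Context length: 32K tokens
- Attention span: the Mixtral paper describes the 32K context as fully dense, unlike the sliding-window attention of Mistral 7B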
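The total and active parameter counts above can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope count, not an official breakdown: the dimensions (model width 4096, FFN width 14336, 32 layers, 32 query heads, 8 key/value heads, head size 128, vocabulary 32000) are assumptions taken from the published Mixtral configuration, and the function name is made up.

```python
def mixtral_param_counts(dim=4096, ffn_dim=14336, n_layers=32, n_heads=32,
                         n_kv_heads=8, head_dim=128, vocab=32000,
                         n_experts=8, top_k=2):
    """Rough parameter count for a Mixtral-style model (all dims are assumptions)."""
    expert = 3 * dim * ffn_dim                     # SwiGLU: W1, V and W2
    attn = 2 * dim * n_heads * head_dim            # query and output projections
    attn += 2 * dim * n_kv_heads * head_dim        # key and value projections (GQA)
    router = dim * n_experts                       # one gating matrix per layer
    norms = 2 * dim                                # two normalization layers per block
    shared = n_layers * (attn + router + norms)
    shared += 2 * vocab * dim + dim                # embeddings, output head, final norm
    total = shared + n_layers * n_experts * expert
    active = shared + n_layers * top_k * expert    # only top_k experts run per token
    return total, active


total, active = mixtral_param_counts()
print(f"total  = {total / 1e9:.1f}B")
print(f"active = {active / 1e9:.1f}B")
```

Under these assumptions each expert holds about 5.6B weights across all layers, and the shared (non-expert) part is only about 1.6B, which is why the model is far smaller than 8 x 7B.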

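Top-2 routing formula

The output of Mixtral's MoE layer is computed as follows.

$$y = \sum_{i \in \mathcal{T}(x)} g(x)_i \cdot \text{SwiGLU}_i(x)$$

The router first computes a softmax distribution $p(x) = \text{Softmax}(x \cdot W_g)$ over all experts and selects the two experts with the highest probability, $\mathcal{T}(x) = \text{Top2}(p(x))$. The gate weights of the two selected experts are then renormalized so that they sum to 1: $g(x)_i = p(x)_i / \sum_{j \in \mathcal{T}(x)} p(x)_j$. This is equivalent to applying the softmax to the two largest logits only.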
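The renormalization step can be verified numerically. This plain-Python sketch, with made-up router logits for 8 experts, checks that "softmax over all experts, then renormalize the top 2" gives the same gates as "softmax over the top-2 logits only".

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical router logits for one token and 8 experts.
logits = [1.2, -0.3, 2.5, 0.1, -1.0, 0.7, 1.9, -0.6]

probs = softmax(logits)
top2 = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:2]

# Path A: softmax over all experts, then renormalize the two selected gates.
selected_mass = sum(probs[i] for i in top2)
renormalized = [probs[i] / selected_mass for i in top2]

# Path B: softmax over the two selected logits only.
direct = softmax([logits[i] for i in top2])

print(top2, renormalized, direct)
```

Both paths agree because the softmax denominator over all experts cancels in the renormalization, so an implementation is free to pick whichever is cheaper.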

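SwiGLU expert network

Each expert is an FFN that uses the SwiGLU activation function.

$$\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xV) W_2$$

Here $W_1$ and $V$ project the input to the FFN dimension, $\odot$ is the element-wise product, and $W_2$ projects back to the hidden dimension.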

PyTorch implementation: Mixtral MoE block

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SwiGLUExpert(nn.Module):
        """SwiGLU feed-forward network used as an expert."""

        def __init__(self, hidden_dim: int, ffn_dim: int):
            super().__init__()
            self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)
            self.v = nn.Linear(hidden_dim, ffn_dim, bias=False)
            self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)

        def forward(self, x: torch.Tensor):
            # F.silu is the Swish activation with beta = 1.
            return self.w2(F.silu(self.w1(x)) * self.v(x))


    class MixtralMoEBlock(nn.Module):
        """Mixtral-style MoE block with Top-2 SwiGLU experts."""

        def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 8):
            super().__init__()
            self.num_experts = num_experts
            self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
            self.experts = nn.ModuleList([
                SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
            ])

        def forward(self, x: torch.Tensor):
            # x: (batch, seq_len, hidden_dim)
            gate_logits = self.gate(x)  # (batch, seq, num_experts)
            gate_probs = F.softmax(gate_logits, dim=-1)

            # Top-2 selection, then renormalize the two gates to sum to 1.
            top2_probs, top2_indices = gate_probs.topk(2, dim=-1)
            top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)

            # Run each expert on the tokens routed to it and combine the outputs.
            output = torch.zeros_like(x)
            for k in range(2):
                expert_idx = top2_indices[..., k]  # (batch, seq)
                gate_val = top2_probs[..., k]      # (batch, seq)
                for i in range(self.num_experts):
                    mask = expert_idx == i
                    if mask.any():
                        expert_output = self.experts[i](x[mask])  # (n, hidden_dim)
                        output[mask] += gate_val[mask].unsqueeze(-1) * expert_output
            return output

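DeepSeek-V2/V3 innovations: DeepSeekMoE

Fine-grained expert segmentation

DeepSeek-V2 (2024) took an approach that differs markedly from earlier MoE models. The core idea is **Fine-Grained Expert Segmentation**: splitting the experts into smaller and more numerous units.

The original $N$ experts are increased to $mN$, while the hidden dimension of each expert is reduced to $1/m$. The number of experts activated at the same time is increased proportionally from $K$ to $mK$, so the total compute per token stays the same while much finer-grained combinations of experts become possible.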
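The effect of segmentation on the number of possible expert combinations can be counted directly. The sketch below uses `math.comb` with an illustrative setting ($N = 16$, $K = 2$, $m = 4$, and a hypothetical FFN width of 4096); the helper name `segment` is made up.

```python
from math import comb


def segment(num_experts, top_k, ffn_dim, m):
    """Split each expert into m smaller ones and activate m times as many."""
    return num_experts * m, top_k * m, ffn_dim // m


# Illustrative baseline: 16 experts, Top-2, FFN width 4096 per expert.
base = (16, 2, 4096)
fine = segment(*base, m=4)

# The activated FFN width per token (top_k * ffn_dim) is unchanged by the split.
base_active_width = base[1] * base[2]
fine_active_width = fine[1] * fine[2]

# Number of distinct expert subsets a token can be routed to.
base_combos = comb(base[0], base[1])
fine_combos = comb(fine[0], fine[1])

print(fine, base_active_width, fine_active_width)
print(base_combos, fine_combos)
```

The activated width stays at 8192 in both settings, while the number of selectable expert subsets grows from 120 to more than 4 billion, which is the sense in which routing becomes finer-grained at equal cost.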

DeepSeek-V3 architecture

DeepSeek-V3 (December 2024) has the following main configuration.

- Total parameters: 671B
- Active parameters: 37B (per token)
- Routed experts: 256 (per layer)
- Shared experts: 1 (per layer, always active)
- Active routed experts: 8 (per token)
- Attention: Multi-head Latent Attention (MLA)

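Auxiliary-loss-free load balancing

One of the most innovative contributions of DeepSeek-V3 is its **load-balancing strategy without an auxiliary loss**. Conventional MoE models balance load with an auxiliary loss, but its coefficient is hard to tune, and a value that is too large degrades model quality.

Instead, DeepSeek-V3 adds a bias term $b_i$ to each expert and uses it only for the routing decision.

$$\mathcal{T}(x) = \text{TopK}_i\left(s(x)_i + b_i\right)$$

$$g(x)_i = \frac{s(x)_i}{\sum_{j \in \mathcal{T}(x)} s(x)_j}, \quad i \in \mathcal{T}(x)$$

Here $s(x)_i$ is the affinity score between the token and expert $i$, and $\mathcal{T}(x)$ is the set of selected experts. The bias $b_i$ affects only which experts are selected; it does not enter the gate weights. Decreasing $b_i$ for overloaded experts and increasing it for underloaded ones balances the load without contaminating the loss function.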
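The separation between "bias for selection" and "unbiased score for weighting" can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not DeepSeek's implementation: the function names, the scores, the load counts, and the step size `gamma` are all hypothetical.

```python
def route(scores, bias, k):
    """Select the top-k experts by biased score, weight them by unbiased score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(scores[i] for i in chosen)
    return {i: scores[i] / norm for i in chosen}


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


scores = [0.9, 0.8, 0.7, 0.1]                       # hypothetical affinities s(x)_i
unbiased = route(scores, [0.0, 0.0, 0.0, 0.0], k=2)
biased = route(scores, [0.0, -0.2, 0.0, 0.0], k=2)  # expert 1 has been penalized
new_bias = update_bias([0.0] * 4, load=[10, 10, 0, 0], gamma=0.001)
print(unbiased, biased, new_bias)
```

With the penalty on expert 1, the token switches from experts {0, 1} to {0, 2}, yet its gate weights are still computed from the raw scores 0.9 and 0.7. The bias steers traffic without changing what the model is trained to output.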

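Node-limited routing

To bound communication cost, DeepSeek-V3 restricts each token to being sent to at most $M$ nodes (DeepSeek-V2 applied the same idea per device, as device-limited routing). The top $M$ nodes are selected based on the affinity scores of the experts hosted on each node, and Top-K routing is then performed only among the experts on those nodes.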
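The two-stage selection can be sketched as follows. The node score used here, the sum of the highest $K/M$ affinities on each node, is one reading of the DeepSeek-V3 report and should be treated as an assumption, as should the function name, the expert layout, and the scores.

```python
def node_limited_topk(scores, experts_per_node, max_nodes, k):
    """Pick up to max_nodes nodes first, then the top-k experts inside them."""
    num_nodes = len(scores) // experts_per_node
    per_node = k // max_nodes

    def node_score(node):
        start = node * experts_per_node
        group = scores[start:start + experts_per_node]
        # Assumed node score: sum of the best (k / max_nodes) affinities on the node.
        return sum(sorted(group, reverse=True)[:per_node])

    nodes = sorted(range(num_nodes), key=node_score, reverse=True)[:max_nodes]
    allowed = [e for n in nodes
               for e in range(n * experts_per_node, (n + 1) * experts_per_node)]
    experts = sorted(allowed, key=lambda e: scores[e], reverse=True)[:k]
    return sorted(nodes), sorted(experts)


# Hypothetical affinities for 8 experts laid out on 4 nodes (2 experts per node).
scores = [0.9, 0.1, 0.8, 0.7, 0.85, 0.05, 0.2, 0.3]
nodes, experts = node_limited_topk(scores, experts_per_node=2, max_nodes=2, k=4)
unrestricted = sorted(sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:4])
print(nodes, experts, unrestricted)
```

Plain Top-4 would pick experts 0, 2, 3 and 4 and therefore touch three nodes. The node-limited variant stays on nodes 0 and 1 and accepts expert 1 instead of expert 4, trading a little affinity for less cross-node traffic.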

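Comparison of routing strategies

Top-1 routing (Switch Transformer)

Activates exactly one expert per token. Communication cost is the lowest and the implementation is the simplest, but relying on a single expert may limit expressiveness.

Top-2 routing (Mixtral, GShard)

Activates 2 experts per token and combines them by weighted sum. It allows richer representations than Top-1, but communication cost roughly doubles.

Expert Choice routing (Zhou et al., 2022)

In contrast to the schemes above, **the experts select the tokens**. Each expert picks a fixed number of tokens, so load balance is guaranteed by construction. The number of experts per token, however, is no longer fixed: a token may be selected by no expert at all or by several.

Soft MoE (Puigcerver et al., 2023)

Instead of discrete routing, each expert receives weighted combinations of all tokens. The method is fully differentiable and drops no tokens, but because every expert processes information from every token, it is not sparse in the strict sense.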

PyTorch implementation: Expert Choice routing

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class ExpertChoiceRouter(nn.Module):
        """Expert Choice routing: experts select tokens."""

        def __init__(self, hidden_dim: int, num_experts: int, capacity_factor: float = 1.0):
            super().__init__()
            self.num_experts = num_experts
            self.capacity_factor = capacity_factor
            self.router = nn.Linear(hidden_dim, num_experts, bias=False)

        def forward(self, x: torch.Tensor):
            # x: (num_tokens, hidden_dim)
            num_tokens = x.shape[0]

            # Tokens per expert, clamped so that topk below is always valid.
            capacity = int(self.capacity_factor * num_tokens / self.num_experts)
            capacity = max(1, min(capacity, num_tokens))

            # Token-to-expert affinity: softmax over the expert dimension,
            # following Zhou et al. (2022).
            scores = F.softmax(self.router(x), dim=-1)  # (num_tokens, num_experts)

            # Each expert selects its top-`capacity` tokens.
            expert_scores = scores.t()  # (num_experts, num_tokens)
            top_scores, top_indices = expert_scores.topk(capacity, dim=-1)
            # top_scores, top_indices: (num_experts, capacity)
            return top_scores, top_indices
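The trade-off of Expert Choice routing, perfectly even expert load but an uneven number of experts per token, shows up even in a tiny example. The plain-Python sketch below uses a made-up affinity table for 4 tokens and 2 experts; the function name is illustrative.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert independently picks its `capacity` highest-scoring tokens.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    picks = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks.append(ranked[:capacity])
    return picks


# Hypothetical affinities: rows are tokens, columns are experts.
scores = [
    [0.9, 0.8],  # token 0 is attractive to both experts
    [0.7, 0.1],  # token 1
    [0.2, 0.6],  # token 2
    [0.1, 0.2],  # token 3 is attractive to neither
]
picks = expert_choice(scores, capacity=2)
experts_per_token = [sum(t in chosen for chosen in picks) for t in range(len(scores))]
print(picks, experts_per_token)
```

Both experts process exactly two tokens, but token 0 is handled by two experts while token 3 is handled by none and would pass through the layer via the residual path only.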

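Training-stability techniques

Load balancing loss

The load-balancing loss proposed for the Switch Transformer is as follows.

$$\mathcal{L}_{balance} = \alpha \cdot N \sum_{i=1}^{N} f_i \cdot P_i$$

Here $N$ is the number of experts, $f_i$ is the fraction of tokens routed to expert $i$, and $P_i$ is the mean probability that the router assigns to expert $i$. The coefficient $\alpha$ is kept small so that the term does not dominate the language-modeling loss; the Switch Transformer uses $\alpha = 0.01$.

Under an ideal uniform distribution, $f_i = P_i = 1/N$ and the loss equals $\alpha \cdot N \cdot N \cdot (1/N^2) = \alpha$; the loss grows as the imbalance grows.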
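The claim that the loss equals $\alpha$ under uniform routing, and rises under imbalance, can be checked with a small plain-Python sketch. The function name and the two toy batches are made up for illustration.

```python
def load_balance_loss(assignments, router_probs, num_experts, alpha):
    """alpha * N * sum_i f_i * P_i for Top-1 assignments."""
    num_tokens = len(assignments)
    # f_i: fraction of tokens routed to expert i.
    f = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


alpha = 0.01

# Perfectly uniform routing: each expert gets one token and probability 1/4.
uniform = load_balance_loss(
    assignments=[0, 1, 2, 3],
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    num_experts=4,
    alpha=alpha,
)

# Collapsed routing: every token goes to expert 0 with probability 0.7.
skewed = load_balance_loss(
    assignments=[0, 0, 0, 0],
    router_probs=[[0.7, 0.1, 0.1, 0.1]] * 4,
    num_experts=4,
    alpha=alpha,
)
print(uniform, skewed)
```

The uniform batch gives a loss of $\alpha = 0.01$, while the collapsed batch gives $0.028$. Only $P_i$ is differentiable, so in training the gradient pushes router probability away from the experts that currently receive the most tokens.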

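Router z-loss

The router z-loss proposed in ST-MoE (2022) improves training stability by limiting the magnitude of the router logits.

$$\mathcal{L}_{z} = \frac{1}{B} \sum_{j=1}^{B} \left( \log \sum_{i=1}^{N} e^{z_i(x_j)} \right)^2$$

Here $B$ is the number of tokens in the batch and $z_i(x_j)$ is the router logit of token $x_j$ for expert $i$. The loss keeps the logits from growing excessively, which mitigates unstable routing decisions and convergence problems.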
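A short plain-Python sketch makes the behavior of the z-loss concrete. It compares two toy batches whose softmax distributions are identical but whose logit magnitudes differ; the function name and the logits are illustrative.

```python
import math


def router_z_loss(logits_batch):
    """Mean over tokens of (log-sum-exp of the router logits) squared."""
    total = 0.0
    for logits in logits_batch:
        peak = max(logits)
        # Numerically stable log-sum-exp.
        lse = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += lse ** 2
    return total / len(logits_batch)


# Both batches yield the same routing probabilities (0.5, 0.5) for every token.
small = router_z_loss([[0.0, 0.0], [0.0, 0.0]])
large = router_z_loss([[10.0, 10.0], [10.0, 10.0]])
print(small, large)
```

Shifting all logits by +10 leaves the softmax output unchanged but raises the penalty from about 0.48 to about 114, so the loss specifically discourages large logit magnitudes rather than any particular routing decision.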

PyTorch implementation: load balancing + z-loss

    import torch
    import torch.nn.functional as F


    def compute_moe_auxiliary_losses(
        router_logits: torch.Tensor,
        expert_indices: torch.Tensor,
        num_experts: int,
        alpha_balance: float = 0.01,
        alpha_z: float = 0.001,
    ) -> torch.Tensor:
        """Compute load balancing loss and router z-loss.

        Args:
            router_logits: Raw router logits (batch*seq, num_experts)
            expert_indices: Selected expert indices (batch*seq,) or (batch*seq, top_k)
            num_experts: Total number of experts
            alpha_balance: Weight for load balancing loss
            alpha_z: Weight for router z-loss
        """
        router_probs = F.softmax(router_logits, dim=-1)

        # --- Load balancing loss ---
        # f_i: fraction of tokens routed to expert i.
        # With top_k > 1 the f_i sum to top_k rather than 1.
        expert_mask = F.one_hot(expert_indices, num_experts).float()
        if expert_mask.dim() == 3:
            expert_mask = expert_mask.sum(dim=1)  # sum over top_k
            expert_mask = (expert_mask > 0).float()
        f = expert_mask.mean(dim=0)  # (num_experts,)

        # P_i: mean router probability for expert i.
        P = router_probs.mean(dim=0)  # (num_experts,)
        balance_loss = alpha_balance * num_experts * (f * P).sum()

        # --- Router z-loss ---
        log_z = torch.logsumexp(router_logits, dim=-1)  # (num_tokens,)
        z_loss = alpha_z * (log_z ** 2).mean()

        return balance_loss + z_loss

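Inference optimization

Expert offloading

Because the total parameter count of an MoE model is large, it can be difficult to fit all experts in GPU memory. Expert offloading keeps experts that are not currently active in CPU RAM or on disk and loads them onto the GPU only when needed.

The main techniques are as follows.

- **LRU cache**: keeps recently used experts cached on the GPU
- **Predictive prefetch**: asynchronously preloads the experts expected to be used in the next layer
- **Speculative decoding + offloading**: combines offloading with speculative decoding to hide the offloading latency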
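The LRU cache idea can be sketched with `collections.OrderedDict`. This is a minimal illustration, not a real offloading runtime: `ExpertLRUCache` and `load_fn` are hypothetical names, and the loader simply returns a string where a real system would copy expert weights from CPU RAM or disk to the GPU.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Keep at most `capacity` experts resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical loader: expert_id -> weights
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self.cache[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)      # slow path: fetch from host memory
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the least recently used expert
        return weights


cache = ExpertLRUCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1]:  # hypothetical sequence of routed experts
    cache.get(expert_id)
print(cache.hits, cache.misses, list(cache.cache))
```

In this trace only the second request for expert 0 is a hit. The hit rate depends on how skewed and how temporally local the routing is, which is why prefetching based on predicted routing is often combined with the cache.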

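Quantization

Quantizing an MoE model is similar to quantizing a dense model, but it needs extra care because the weight distribution can differ from expert to expert.

- **GPTQ/AWQ**: quantization settings can be applied independently per expert
- **Mixed precision**: higher precision for frequently used experts, lower precision for rarely used ones
- **MiLo (2025)**: adds low-rank compensators to an extremely quantized MoE to recover accuracy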

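Expert parallelism

In MoE inference, expert parallelism places the experts on separate GPUs and processes them in parallel. An all-to-all communication step sends each token to the GPU that hosts its expert, and a second one gathers the results after processing.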
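The dispatch and combine steps can be simulated on a single machine. In the plain-Python sketch below, lists stand in for per-device buffers and ordinary function calls stand in for the two all-to-all exchanges; the function names, the expert layout, and the toy "experts" (simple scalings) are assumptions for illustration.

```python
def all_to_all_dispatch(tokens, expert_ids, experts_per_device, num_devices):
    """Group tokens by the device that hosts their expert (first all-to-all)."""
    buckets = [[] for _ in range(num_devices)]
    for position, (token, expert) in enumerate(zip(tokens, expert_ids)):
        device = expert // experts_per_device
        buckets[device].append((position, expert, token))
    return buckets


def process_and_combine(buckets, expert_fns, num_tokens):
    """Run local experts per device, then restore token order (second all-to-all)."""
    output = [None] * num_tokens
    for bucket in buckets:  # each bucket would run on its own GPU
        for position, expert, token in bucket:
            output[position] = expert_fns[expert](token)
    return output


# Four toy experts that scale their input; experts 0-1 live on device 0, 2-3 on device 1.
expert_fns = [lambda x, s=s: x * s for s in (10, 20, 30, 40)]
tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [3, 0, 2, 0]  # hypothetical Top-1 routing decisions

buckets = all_to_all_dispatch(tokens, expert_ids, experts_per_device=2, num_devices=2)
output = process_and_combine(buckets, expert_fns, num_tokens=len(tokens))
print([len(b) for b in buckets], output)
```

Carrying the original position along with each token is what allows the results to be placed back in sequence order. In a real deployment the bucket sizes determine the communication volume, which is why load balance matters for inference as well as training.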

Comparison of major MoE models

| Model | Year | Total params | Active params | Experts | Routing | Expert type | Notes |
| :----------------- | :----: | :----------: | :------------: | :------------: | :-----------: | :----------------: | :--------------------------- |
| Sparsely-Gated MoE | 2017 | 137B | - | 4096 | Top-K | MLP | First large-scale sparse MoE |
| Switch Transformer | 2021 | 1.6T | - | 2048 | Top-1 | FFN | Simplified routing, T5-based |
| GLaM | 2022 | 1.2T | 97B | 64 | Top-2 | FFN | About 1/3 of GPT-3's training energy |
| ST-MoE | 2022 | 269B | - | 32 | Top-2 | FFN | Router z-loss, stability-focused |
| Expert Choice | 2022 | - | - | - | Expert Choice | FFN | Experts select tokens |
| Mixtral 8x7B | 2023 | 46.7B | 13B | 8 | Top-2 | SwiGLU | Open weights, GQA |
| DeepSeek-V2 | 2024 | 236B | 21B | 160+2 | Top-6 | Fine-Grained | MLA + shared experts |
| DeepSeek-V3 | 2024 | 671B | 37B | 256+1 | Top-8 | Fine-Grained | MLA, shared expert, auxiliary-loss-free balancing |
| Llama 4 Scout | 2025 | 109B | 17B | 16 | Top-1 | - | First MoE generation of the Llama family |

Putting it together: a custom MoE Transformer block

The following is a complete Transformer block that combines an attention layer with an MoE FFN.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # SwiGLUExpert and compute_moe_auxiliary_losses are the definitions
    # given in the earlier sections of this article.


    class MoETransformerBlock(nn.Module):
        """Complete Transformer block with MoE FFN layer."""

        def __init__(
            self,
            hidden_dim: int = 768,
            num_heads: int = 12,
            ffn_dim: int = 3072,
            num_experts: int = 8,
            top_k: int = 2,
            capacity_factor: float = 1.25,  # not enforced by this simplified dispatch
            dropout: float = 0.1,
        ):
            super().__init__()
            # Multi-head attention
            self.attn_norm = nn.LayerNorm(hidden_dim)
            self.attention = nn.MultiheadAttention(
                hidden_dim, num_heads, dropout=dropout, batch_first=True
            )
            # MoE FFN
            self.ffn_norm = nn.LayerNorm(hidden_dim)
            self.router = nn.Linear(hidden_dim, num_experts, bias=False)
            self.experts = nn.ModuleList([
                SwiGLUExpert(hidden_dim, ffn_dim)
                for _ in range(num_experts)
            ])
            self.top_k = top_k
            self.num_experts = num_experts
            self.dropout = nn.Dropout(dropout)

        def forward(self, x: torch.Tensor, mask=None):
            # Pre-norm attention with residual connection.
            residual = x
            x_norm = self.attn_norm(x)
            attn_out, _ = self.attention(x_norm, x_norm, x_norm, attn_mask=mask)
            x = residual + self.dropout(attn_out)

            # Pre-norm MoE FFN with residual connection.
            residual = x
            x_norm = self.ffn_norm(x)
            moe_out, aux_loss = self._moe_forward(x_norm)
            x = residual + self.dropout(moe_out)
            return x, aux_loss

        def _moe_forward(self, x: torch.Tensor):
            B, S, D = x.shape
            x_flat = x.reshape(-1, D)

            # Router: Top-K selection with renormalized gates.
            logits = self.router(x_flat)
            probs = F.softmax(logits, dim=-1)
            top_k_probs, top_k_idx = probs.topk(self.top_k, dim=-1)
            top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

            # Dispatch tokens to their experts and combine the weighted outputs.
            output = torch.zeros_like(x_flat)
            for k in range(self.top_k):
                for i in range(self.num_experts):
                    mask = top_k_idx[:, k] == i
                    if mask.any():
                        expert_out = self.experts[i](x_flat[mask])
                        output[mask] += top_k_probs[mask, k].unsqueeze(-1) * expert_out

            # Auxiliary loss (load balancing + router z-loss).
            aux_loss = compute_moe_auxiliary_losses(
                logits, top_k_idx, self.num_experts
            )
            return output.view(B, S, D), aux_loss

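Conclusion and outlook

The MoE architecture has established itself as one of the most practical ways to pursue two goals at once: larger model capacity and computational efficiency. Starting from the Top-1 simplification of the Switch Transformer, Mixtral 8x7B brought MoE to the open-source ecosystem, and DeepSeek-V3 set a new bar with fine-grained experts and its auxiliary-loss-free strategy.

Future research directions include the following.

1. **Dynamic expert activation**: adaptive routing that adjusts the number of active experts to the difficulty of the input
2. **Training-inference consistency**: techniques that keep the routing patterns learned in training intact at inference time
3. **Expert specialization analysis**: interpretability research into which knowledge or functions each expert specializes in
4. **MoE for edge devices**: lightweight MoE designs for mobile and edge environments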

References

1. Shazeer, N., et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.

2. Fedus, W., et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.

3. Jiang, A.Q., et al. "Mixtral of Experts." arXiv:2401.04088, 2024.

4. DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.

5. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.

6. Zoph, B., et al. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906, 2022.

7. Zhou, Y., et al. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022.

8. Puigcerver, J., et al. "From Sparse to Soft Mixtures of Experts." ICLR 2024.

9. Du, N., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022.

10. Jacobs, R.A., et al. "Adaptive Mixtures of Local Experts." Neural Computation, 1991.