Mixture of Experts (MoE) Architecture Paper Review and Production Scaling 2026


Overview

As the parameter counts of large language models scale to trillions, the limitations of Dense models that activate all parameters for every token have become clear. Training and inference costs increase proportionally with parameter count. Mixture of Experts (MoE) has established itself as the most promising solution to this problem. MoE selectively activates only a subset of experts from the total parameters, dramatically reducing actual computation compared to Dense models while maintaining large model capacity.

GShard first presented a distributed training pipeline for a 600B parameter model, and Switch Transformer then opened the possibility of trillion-parameter scaling with single-expert routing. ST-MoE improved both training stability and transfer learning quality simultaneously, while Mixtral 8x7B proved the practicality of open-source MoE models. DeepSeek-MoE and DeepSeek-V3 opened new directions with fine-grained expert segmentation and auxiliary-loss-free load balancing. In this article, we analyze the key contributions of each paper and organize strategies for operating MoE in production, covering routing mechanisms, training stability, and inference optimization.

MoE Core Concepts

The Principle of Sparse Activation

The core idea of MoE is conditional computation. Since only a subset of experts is activated for each input token, the model's total parameter count and the parameters actually used for computation are decoupled. For example, Mixtral 8x7B has 47B total parameters, but each token uses only 13B parameters. Computation is reduced to approximately 1/3.6 compared to a Dense 47B model, while quality remains similar or even better.
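The arithmetic behind this claim is easy to verify with the figures quoted in this article:

```python
# Active vs. total parameters for Mixtral 8x7B (numbers from this article)
total_params = 46.7e9    # all experts must be stored in memory
active_params = 12.9e9   # parameters actually touched per token (Top-2 of 8 experts)

compute_ratio = total_params / active_params
print(f"Per-token compute vs. an equally sized Dense model: ~1/{compute_ratio:.1f}")
# -> ~1/3.6
```

Note that the saving applies to compute only; memory still has to hold all 46.7B parameters.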

Dense vs Sparse MoE Comparison

| Item | Dense Model | Sparse MoE |
| --- | --- | --- |
| Active parameters | Total parameters = active parameters | Only 10-30% of total activated |
| FLOPs efficiency | Proportional to parameter count | Proportional to active parameter count |
| Memory requirement | Equal to parameter size | Full parameters must be stored (more memory) |
| Training stability | Relatively stable | Routing instability, expert collapse risk |
| Scalability | Linear cost increase | Low-cost scaling by adding experts |
| Inference latency | Predictable | Routing overhead, expert loading delays |
| Representative models | LLaMA, GPT-4 (estimated) | Mixtral, DeepSeek-V3, Switch Transformer |

Basic MoE Layer Implementation

The basic structure of an MoE layer implemented in PyTorch is as follows. The gating network computes weights for each expert given an input token, selects the Top-K experts, and sums their outputs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """Single expert FFN module"""
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.dropout(F.gelu(self.w1(x))))

class MoELayer(nn.Module):
    """Mixture of Experts layer"""
    def __init__(
        self,
        d_model: int,
        d_ff: int,
        num_experts: int,
        top_k: int = 2,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.experts = nn.ModuleList([
            Expert(d_model, d_ff, dropout) for _ in range(num_experts)
        ])
        # Gating network: input -> expert scores
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch_size, seq_len, d_model = x.shape
        x_flat = x.view(-1, d_model)  # (B*S, d_model)

        # Compute gating scores
        gate_logits = self.gate(x_flat)  # (B*S, num_experts)
        gate_probs = F.softmax(gate_logits, dim=-1)

        # Select Top-K experts
        top_k_probs, top_k_indices = torch.topk(
            gate_probs, self.top_k, dim=-1
        )  # (B*S, top_k)
        # Renormalize selected expert weights
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Sum expert outputs
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            expert_idx = top_k_indices[:, k]  # (B*S,)
            weight = top_k_probs[:, k].unsqueeze(-1)  # (B*S, 1)
            for i in range(self.num_experts):
                mask = (expert_idx == i)
                if mask.any():
                    expert_input = x_flat[mask]
                    expert_output = self.experts[i](expert_input)
                    output[mask] += weight[mask] * expert_output

        return output.view(batch_size, seq_len, d_model)

This implementation is for educational purposes. In actual production, optimized libraries like MegaBlocks or Tutel are used for efficient per-expert batch processing and All-to-All communication.

Key Paper Analysis

Switch Transformer (Fedus et al., 2022)

The key contribution of Switch Transformer was radically simplifying MoE routing. While previous MoE used Top-2 or more expert combinations, Switch Transformer introduced Switch Routing, which assigns only a single expert to each token. This decision is counterintuitive, but it was supported by experimental results showing that single-expert routing halves communication costs and reduces the overhead of the routing computation itself.

Based on the T5-Base architecture, it achieved up to 7x pre-training speed improvement at the same computational cost and scaled to trillion-parameter models. It also improved memory efficiency by training in bfloat16 while keeping the router computation in float32 for numerical stability, and mitigated training instability by applying expert dropout together with an auxiliary load balancing loss.

The expert Capacity Factor introduced by Switch Transformer determines the maximum number of tokens each expert can process. Tokens exceeding capacity are dropped, which causes information loss during training but is essential for balancing computational load.
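A minimal sketch of the capacity calculation (our own variable names; the paper expresses capacity as capacity factor times tokens per expert):

```python
import math

def expert_capacity(num_tokens: int, num_experts: int,
                    capacity_factor: float = 1.0) -> int:
    """Maximum number of tokens each expert may process in one batch."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

# With 4096 tokens routed across 64 experts:
print(expert_capacity(4096, 64))        # factor 1.0 -> 64 tokens per expert
print(expert_capacity(4096, 64, 1.25))  # 25% slack  -> 80 tokens per expert
# A token routed to an expert that is already full is dropped and passes
# through the residual connection unchanged.
```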

GShard (Lepikhin et al., 2021)

GShard is a paper focused on system design for practically training MoE at massive distributed scale. It trained a 600B parameter multilingual translation model on 2,048 TPU v3 accelerators in just 4 days. There are three key technical contributions.

First, Expert Parallelism. It established the All-to-All communication pattern for distributing experts across multiple devices and routing tokens to the device where the target expert resides. Second, Random Routing. After deterministically selecting the Top-1 expert, the second expert is probabilistically selected with probability proportional to the gate weight. This approach promotes exploration during training. Third, it provided a lightweight API through XLA compiler extensions that can express various parallelism patterns with minimal code changes.

GShard's per-group capacity calculation formula C = 2N / (G * E) dynamically determines each expert's processing capacity considering group size G, number of experts E, and token count N.
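Plugging illustrative numbers into the formula gives a feel for the scale (values chosen for readability, not taken from the paper):

```python
def gshard_capacity(num_tokens: int, num_groups: int, num_experts: int) -> float:
    """Per-group expert capacity, C = 2N / (G * E); the factor 2 reflects Top-2 routing."""
    return 2 * num_tokens / (num_groups * num_experts)

print(gshard_capacity(num_tokens=8192, num_groups=4, num_experts=64))
# -> 64.0 tokens per expert per group
```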

ST-MoE (Zoph et al., 2022)

As the name ST-MoE implies, the core question of this paper is "Can MoE models be trained stably (Stable) and transferred effectively (Transferable)?" It trained a 269B parameter Sparse model (ST-MoE-32B) at the computational cost of a 32B Dense model, and achieved state-of-the-art performance for Sparse models for the first time across diverse benchmarks including SuperGLUE, ARC, and XSum.

For training stability, it introduced Router Z-Loss. This loss function controls the magnitude of router logits to prevent divergence during training. When used together with the existing auxiliary loss, training stability is significantly improved. It also systematically organized practical design recommendations: Top-2 routing, capacity factor 1.25 during training, and a maximum of 1 expert per core.

For fine-tuning, it presented experimental results showing that increasing the expert dropout rate, reducing batch size, and setting the learning rate conservatively are effective for improving transfer learning quality.

Mixtral 8x7B (Mistral AI, 2024)

Mixtral 8x7B is the model that decisively proved the practicality of open-source MoE models. Using the same Transformer architecture as Mistral 7B, it replaces each layer's FFN with 8 experts and selects Top-2 experts per token. Since only 13B of the total 47B parameters are activated, inference speed is similar to a 13B Dense model while quality rivals LLaMA 2 70B.

| Item | Mixtral 8x7B | LLaMA 2 70B | Mistral 7B |
| --- | --- | --- | --- |
| Total parameters | 46.7B | 70B | 7.3B |
| Active parameters | 12.9B | 70B | 7.3B |
| Inference FLOPs | ~26B | ~140B | ~15B |
| MMLU | 70.6 | 69.8 | 60.1 |
| ARC-Challenge | 66.4 | 64.6 | 55.5 |
| Number of experts | 8 | - | - |
| Active experts | 2 | - | - |

Analysis of Mixtral's expert specialization revealed that experts tend to specialize based on syntactic patterns rather than domains. Rather than specific experts responding only to code or math, expert selection changes based on sentence structure or token position.

DeepSeek-MoE and DeepSeek-V3

DeepSeek-MoE presented two key strategies aimed at "Ultimate Expert Specialization."

First, Fine-Grained Expert Segmentation. It subdivides the existing N experts into mN experts, reducing each expert's FFN intermediate dimension to 1/m to maintain the same total parameter count. Instead, mK experts are activated, enabling more flexible expert combinations. DeepSeek-MoE 16B achieved performance similar to LLaMA 2 7B with only 40% of the computation.
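The parameter bookkeeping can be checked in a few lines (a sketch with hypothetical sizes; `m` is the segmentation factor described above):

```python
import math

def fine_grained(num_experts: int, top_k: int, d_model: int, d_ff: int, m: int):
    """Split N experts into m*N smaller ones; total parameter count stays constant."""
    cfg = {
        "num_experts": num_experts * m,
        "top_k": top_k * m,
        "d_ff": d_ff // m,  # each expert's FFN intermediate dim shrinks by 1/m
    }
    # Two weight matrices per FFN expert: d_model x d_ff and d_ff x d_model
    params_before = num_experts * 2 * d_model * d_ff
    params_after = cfg["num_experts"] * 2 * d_model * cfg["d_ff"]
    assert params_before == params_after
    return cfg

cfg = fine_grained(num_experts=16, top_k=2, d_model=4096, d_ff=14336, m=4)
# The number of possible expert combinations grows sharply:
print(math.comb(16, 2), math.comb(64, 8))  # 120 vs ~4.4e9
```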

Second, Shared Expert isolation. Experts commonly used by all tokens are separated, preventing redundant knowledge storage among routed experts and increasing per-expert specialization.
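Wiring a shared expert into a Top-K layer amounts to one extra always-on branch. The sketch below assumes a generic FFN expert and simplified per-token routing, not DeepSeek's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Routed Top-K mixture plus one always-on shared expert (sketch)."""
    def __init__(self, d_model: int, d_ff: int, num_routed: int, top_k: int):
        super().__init__()
        def ffn():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )
        self.shared = ffn()  # processes every token, no gating
        self.routed = nn.ModuleList([ffn() for _ in range(num_routed)])
        self.gate = nn.Linear(d_model, num_routed, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)
        w, idx = torch.topk(probs, self.top_k, dim=-1)
        w = w / w.sum(dim=-1, keepdim=True)
        routed_out = torch.zeros_like(x)
        for k in range(self.top_k):  # naive loop, as in the earlier MoELayer
            for i, expert in enumerate(self.routed):
                mask = idx[:, k] == i
                if mask.any():
                    routed_out[mask] += w[mask, k:k+1] * expert(x[mask])
        # The shared expert absorbs common knowledge, letting routed experts specialize
        return self.shared(x) + routed_out
```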

DeepSeek-V3 scaled this to 671B parameters while introducing innovative Auxiliary-Loss-Free load balancing. It uses 256 routed experts and 1 shared expert, activating 8 experts per token. While traditional auxiliary loss approaches had a trade-off between load balancing and model performance, DeepSeek-V3 adds a bias term to each expert and dynamically adjusts the bias by monitoring expert load during training. This approach maintains balanced load throughout training without any token dropping.

MoE Model Comparison Summary

| Model | Total Params | Active Params | Experts | Active Experts | Routing | Load Balancing |
| --- | --- | --- | --- | --- | --- | --- |
| Switch Transformer (Switch-C) | 1.6T | ~1/E of total | 2048 | 1 | Top-1 | Auxiliary loss |
| GShard 600B | 600B | ~2/E of total | 2048 | 2 | Top-1 + Random | Capacity limit + aux loss |
| ST-MoE-32B | 269B | 32B equiv. | 64 | 2 | Top-2 | Router Z-Loss + aux loss |
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 | Top-2 | Auxiliary loss |
| DeepSeek-V3 | 671B | 37B | 256 + 1 shared | 8 + 1 shared | Top-8 | Auxiliary-Loss-Free |

Routing Mechanism Deep Dive

Top-K Routing Implementation

The routing mechanism is the heart of MoE. It receives the hidden state of an input token and decides which experts to activate, and the quality of this decision determines overall model performance. Let us examine implementations of various routing strategies.

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple

class TopKRouter(nn.Module):
    """Top-K routing with noise injection"""
    def __init__(
        self,
        d_model: int,
        num_experts: int,
        top_k: int = 2,
        noise_std: float = 1.0,
    ):
        super().__init__()
        self.top_k = top_k
        self.num_experts = num_experts
        self.noise_std = noise_std
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(
        self, x: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Args:
            x: (batch * seq_len, d_model)
        Returns:
            top_k_indices: (batch * seq_len, top_k)
            top_k_weights: (batch * seq_len, top_k)
            gate_logits: (batch * seq_len, num_experts)
        """
        gate_logits = self.gate(x)

        # Inject noise during training to promote exploration
        if self.training:
            noise = torch.randn_like(gate_logits) * self.noise_std
            gate_logits = gate_logits + noise

        gate_probs = F.softmax(gate_logits, dim=-1)
        top_k_probs, top_k_indices = torch.topk(
            gate_probs, self.top_k, dim=-1
        )
        # Renormalization
        top_k_weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        return top_k_indices, top_k_weights, gate_logits

Routing Strategy Comparison

| Strategy | Paper | Active Experts | Pros | Cons |
| --- | --- | --- | --- | --- |
| Top-1 | Switch Transformer | 1 | Minimum comm cost, simple | High expert collapse risk |
| Top-2 | ST-MoE, Mixtral | 2 | Stable, high performance | 2x communication vs Top-1 |
| Top-1 + Random | GShard | 2 | Promotes exploration | Non-deterministic inference |
| Top-K (K >= 4) | DeepSeek-V3 | 8 | Fine-grained expert usage | Only effective with many experts |
| Expert Choice | EC Routing | Variable | Guarantees perfect balance | Uneven expert count per token |

Expert Choice Routing flips the perspective from traditional routing. Instead of tokens choosing experts, each expert chooses which tokens to process. This approach guarantees perfect load balancing, but has the issue that certain tokens may not be selected by any expert.
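A minimal sketch of the inverted selection (a simplification of Expert Choice routing; capacity budgeting and batching details omitted):

```python
import torch
import torch.nn.functional as F

def expert_choice(gate_logits: torch.Tensor, capacity: int):
    """Each expert picks its top-`capacity` tokens instead of tokens picking experts.

    gate_logits: (num_tokens, num_experts)
    Returns token indices and mixing weights, both shaped (num_experts, capacity).
    """
    probs = F.softmax(gate_logits, dim=-1)  # each token's affinity for each expert
    # Transpose so each row is one expert's view over all tokens, then Top-C
    weights, token_idx = torch.topk(probs.t(), capacity, dim=-1)
    return token_idx, weights

logits = torch.randn(16, 4)  # 16 tokens, 4 experts
token_idx, weights = expert_choice(logits, capacity=8)
print(token_idx.shape)  # every expert processes exactly 8 tokens -> perfect balance
# A token that appears in no expert's list is skipped entirely,
# which is the drawback noted above.
```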

Training Stability and Load Balancing

Load Balancing Loss Implementation

The most common problem in MoE training is Expert Collapse. When tokens concentrate on specific experts, only those experts are trained, and the remaining experts go unused, effectively becoming Dead Experts. Auxiliary load balancing loss is used to prevent this.

def load_balancing_loss(
    gate_logits: torch.Tensor,
    top_k_indices: torch.Tensor,
    num_experts: int,
    top_k: int = 2,
) -> torch.Tensor:
    """
    Switch Transformer style load balancing loss calculation

    Minimizes the dot product of the token assignment ratio per expert (f_i)
    and the mean gate probability (p_i).
    Achieves minimum value at uniform distribution.

    Args:
        gate_logits: (num_tokens, num_experts)
        top_k_indices: (num_tokens, top_k)
        num_experts: number of experts
        top_k: number of active experts
    Returns:
        Scalar load balancing loss
    """
    num_tokens = gate_logits.shape[0]
    gate_probs = F.softmax(gate_logits, dim=-1)

    # f_i: token assignment ratio per expert
    # Count assignments via one-hot encoding
    expert_mask = F.one_hot(
        top_k_indices, num_classes=num_experts
    ).float()  # (num_tokens, top_k, num_experts)
    expert_mask = expert_mask.sum(dim=1)  # (num_tokens, num_experts)
    tokens_per_expert = expert_mask.sum(dim=0)  # (num_experts,)
    f = tokens_per_expert / (num_tokens * top_k)

    # p_i: mean gate probability per expert
    p = gate_probs.mean(dim=0)  # (num_experts,)

    # Balancing loss: num_experts * sum(f_i * p_i)
    balance_loss = num_experts * (f * p).sum()

    return balance_loss

Router Z-Loss

Router Z-Loss, introduced by ST-MoE, controls the magnitude of router logits to prevent training divergence. When logit values become excessively large, the softmax gradient vanishes and training becomes unstable -- Z-Loss suppresses this.

def router_z_loss(gate_logits: torch.Tensor) -> torch.Tensor:
    """
    ST-MoE Router Z-Loss implementation

    Minimizes the squared mean of log-sum-exp of router logits.
    Prevents logits from growing excessively to ensure training stability.

    Args:
        gate_logits: (num_tokens, num_experts)
    Returns:
        Scalar z-loss value
    """
    # Square of log(sum(exp(x)))
    log_z = torch.logsumexp(gate_logits, dim=-1)  # (num_tokens,)
    z_loss = (log_z ** 2).mean()
    return z_loss


# Integration in the training loop
def compute_moe_loss(
    task_loss: torch.Tensor,
    gate_logits: torch.Tensor,
    top_k_indices: torch.Tensor,
    num_experts: int,
    alpha_balance: float = 0.01,
    alpha_z: float = 0.001,
) -> torch.Tensor:
    """
    Total MoE loss = task loss + balancing loss + Z-Loss
    """
    bal_loss = load_balancing_loss(
        gate_logits, top_k_indices, num_experts
    )
    z_loss = router_z_loss(gate_logits)

    total_loss = task_loss + alpha_balance * bal_loss + alpha_z * z_loss
    return total_loss

The settings of alpha_balance and alpha_z are critical. If the auxiliary loss coefficient is too large, model performance degrades; if too small, load balancing has no effect and expert collapse occurs. ST-MoE recommended alpha_balance = 0.01 and alpha_z = 0.001.

Auxiliary-Loss-Free Approach

The auxiliary-loss-free load balancing introduced by DeepSeek-V3 fundamentally resolves this trade-off. It adds a bias term to each expert's routing score and monitors per-expert load in real time during training, adjusting the bias with a simple update rule rather than through gradients.

class AuxLossFreeRouter(nn.Module):
    """DeepSeek-V3 style Auxiliary-Loss-Free router"""
    def __init__(
        self,
        d_model: int,
        num_experts: int,
        top_k: int = 8,
        bias_update_speed: float = 0.001,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gamma = bias_update_speed
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Bias is adjusted rule-based, not via gradient updates
        self.register_buffer(
            "expert_bias", torch.zeros(num_experts)
        )
        self.register_buffer(
            "target_load", torch.ones(num_experts) / num_experts
        )

    def forward(self, x: torch.Tensor):
        gate_logits = self.gate(x)
        # Add bias to influence routing decisions
        biased_logits = gate_logits + self.expert_bias.unsqueeze(0)
        gate_probs = F.softmax(gate_logits, dim=-1)

        # Top-K selection uses biased logits
        _, top_k_indices = torch.topk(
            biased_logits, self.top_k, dim=-1
        )
        # Weights are extracted from original gate probabilities
        top_k_probs = torch.gather(gate_probs, 1, top_k_indices)
        top_k_weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Update bias during training (no gradient needed)
        if self.training:
            with torch.no_grad():
                expert_load = F.one_hot(
                    top_k_indices.view(-1), self.num_experts
                ).float().sum(0)
                expert_load = expert_load / expert_load.sum()
                # Decrease bias for overloaded experts, increase for underloaded
                self.expert_bias += self.gamma * (
                    self.target_load - expert_load
                )

        return top_k_indices, top_k_weights, gate_logits

The key to this approach is that while the bias influences routing decisions (Top-K selection), the weight computation for expert outputs uses the original gate probabilities. Since the bias does not interfere with the model's gradient flow, load balancing is achieved without performance degradation.

Inference Optimization

Expert Offloading and Capacity Budgeting

The biggest challenge in MoE model inference is memory. When it is not feasible to keep all experts resident in GPU memory, Expert Offloading becomes essential. The strategy involves offloading inactive experts to CPU memory or disk and loading them to GPU only when needed.

Research emerging since 2025 has significantly improved performance by combining speculative decoding with expert offloading. SpecMoEOff hides offloading latency by expanding expert workload through speculative decoding, achieving up to 2.5x decode throughput improvement. MoE-Spec achieves 10-30% throughput improvement without training by fixing each layer's expert capacity and dropping infrequently used experts.

from dataclasses import dataclass
from typing import Dict, Optional
import torch

@dataclass
class ExpertCacheConfig:
    """Expert cache configuration"""
    gpu_cache_size: int = 4   # Number of experts to keep on GPU
    prefetch_count: int = 2    # Number of experts to prefetch
    device: str = "cuda:0"

class ExpertOffloadManager:
    """
    Expert offloading manager.
    Manages GPU cache based on LRU and
    prefetches next-layer experts.
    """
    def __init__(
        self,
        experts: Dict[int, torch.nn.Module],
        config: ExpertCacheConfig,
    ):
        self.experts = experts
        self.config = config
        self.gpu_cache: Dict[int, torch.nn.Module] = {}
        self.access_order: list = []

    def get_expert(self, expert_id: int) -> torch.nn.Module:
        """Load expert to GPU (LRU cache)"""
        if expert_id in self.gpu_cache:
            self.access_order.remove(expert_id)
            self.access_order.append(expert_id)
            return self.gpu_cache[expert_id]

        # Evict LRU expert if GPU cache is full
        if len(self.gpu_cache) >= self.config.gpu_cache_size:
            evict_id = self.access_order.pop(0)
            self.gpu_cache[evict_id].cpu()
            del self.gpu_cache[evict_id]

        # Move expert to GPU
        expert = self.experts[expert_id].to(self.config.device)
        self.gpu_cache[expert_id] = expert
        self.access_order.append(expert_id)
        return expert

    def prefetch(self, predicted_expert_ids: list):
        """Asynchronously prefetch experts expected in the next layer"""
        for eid in predicted_expert_ids[:self.config.prefetch_count]:
            if eid not in self.gpu_cache:
                self.get_expert(eid)

Quantization and Expert Pruning

Another axis for improving MoE model inference efficiency is quantization and expert pruning. Instead of quantizing all experts at the same bit level, an adaptive quantization strategy that maintains frequently activated experts at high precision while aggressively quantizing rarely used experts is effective. Expert Pruning completely removes experts with low activation frequency after training to reduce memory usage. It has been reported that pruning 2 experts from Mixtral 8x7B results in minimal performance degradation.
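Selecting pruning candidates reduces to ranking experts by routing frequency. A sketch, assuming activation counts collected by profiling on representative data (the counts below are made up for illustration):

```python
import torch

def select_experts_to_keep(activation_counts: torch.Tensor, num_keep: int):
    """Keep the `num_keep` most frequently routed experts; prune the rest."""
    _, keep_ids = torch.topk(activation_counts, num_keep)
    return sorted(keep_ids.tolist())

# Hypothetical per-expert routing counts for an 8-expert layer
counts = torch.tensor([9120., 8870., 310., 9405., 150., 8990., 9230., 8815.])
print(select_experts_to_keep(counts, num_keep=6))  # drops experts 2 and 4
# After pruning, the gate's output dimension must shrink accordingly and
# the surviving expert indices must be remapped.
```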

Troubleshooting

This section covers problems that commonly occur in MoE model training and inference, along with their solutions.

Expert Collapse

Symptom: Only a small number of experts are activated during training, and the gate probabilities of the remaining experts converge to 0. Monitoring per-expert token assignment ratios in TensorBoard shows that over 90% of tokens are concentrated on specific experts.

The cause is usually that the auxiliary loss coefficient is too small, or the initial router weight initialization is imbalanced. Solutions include starting alpha_balance at 0.01 and gradually adjusting, initializing router weights uniformly with small values, and applying a warmup that performs random routing for a certain number of steps at the beginning of training.
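The random-routing warmup can be implemented as a thin wrapper over the gate logits (our own sketch, not taken from a specific paper):

```python
import torch

def warmup_gate_logits(gate_logits: torch.Tensor, step: int,
                       warmup_steps: int = 1000) -> torch.Tensor:
    """Route uniformly at random early in training so every expert receives gradient."""
    if step < warmup_steps:
        # Uniform random scores make the downstream Top-K selection random
        return torch.rand_like(gate_logits)
    return gate_logits

logits = torch.randn(8, 4)
assert warmup_gate_logits(logits, step=0).shape == logits.shape       # random phase
assert torch.equal(warmup_gate_logits(logits, step=5000), logits)     # learned routing
```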

Training Divergence

Symptom: Loss suddenly spikes or NaN occurs during training. This is especially common during bfloat16 training with MoE models.

Applying Router Z-Loss and maintaining separate float32 computation for router logits is effective. Lowering the learning rate by 2-5x compared to Dense models is also recommended.
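Keeping router logits in float32 takes a single explicit upcast before the gate matmul; a commonly used pattern, sketched below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Float32Gate(nn.Module):
    """Gate whose logits are always computed in float32, even for bfloat16 inputs."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast both activations and weights before the matmul; the softmax
        # downstream then also runs in float32.
        return F.linear(x.float(), self.gate.weight.float())

gate = Float32Gate(64, 8)
logits = gate(torch.randn(4, 64, dtype=torch.bfloat16))
print(logits.dtype)  # torch.float32
```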

All-to-All Communication Bottleneck

Symptom: During distributed training, All-to-All communication for expert parallelism accounts for over 30% of total training time.

Set the number of experts as a multiple of the number of devices, and reduce the capacity factor to decrease data transfer volume. Pipeline scheduling that overlaps communication and computation is also effective. The MegaBlocks library provides an approach that avoids All-to-All communication through block-sparse operations.

Inference Memory Shortage

Symptom: OOM occurs because all experts cannot fit in GPU memory.

Apply Expert Offloading, or reduce memory requirements through expert pruning. Combining Tensor Parallelism with Expert Parallelism to distribute across multiple GPUs is also an option.

Failure Cases

Excessive Auxiliary Loss Coefficient

A team training a custom MoE model based on the Mixtral architecture set alpha_balance to 0.1 (high) to prevent expert collapse. Load balancing was perfect, but the model learned to distribute tokens equally among experts regardless of token semantics, resulting in actual task performance lower than a Dense model. The auxiliary loss dominated the main loss, preventing the router from learning meaningful expert specialization. Performance only improved after lowering alpha_balance to 0.005 and applying Router Z-Loss together.

The Pitfall of Switching to Top-1 at Inference

There was a case where a model trained with Top-2 routing was switched to Top-1 to reduce inference costs. Computation was halved, but benchmark scores dropped by 8-12%. Because the outputs of two experts were trained to be complementary, using only one results in significant loss of expressiveness. If reducing inference costs is the goal, either design with Top-1 from the training stage, or approach through quantization and pruning.

Fine-Grained Segmentation Failure Without Shared Experts

There was a case where DeepSeek-MoE's fine-grained expert segmentation was applied without the Shared Expert. All experts redundantly learned basic language patterns (articles, prepositions, punctuation, etc.), reducing the amount of unique knowledge per expert and negating the benefits of segmentation. The shared expert must absorb common knowledge so that routed experts can focus on unique patterns.

Operations Checklist

The checklist below covers items to verify when operating MoE models, from training through serving.

Training Setup

  • Router weights initialized with small standard deviation (0.01 or less)
  • Auxiliary loss coefficient (alpha_balance) started in the 0.005-0.01 range
  • Router Z-Loss activated
  • Learning rate conservative compared to same-size Dense model (2-5x lower)
  • Router logit computation maintained in float32 during bfloat16 training
  • Number of experts is a multiple of the number of devices
  • Capacity Factor set in the 1.0-1.5 range

Monitoring

  • Real-time tracking of per-expert token assignment ratios
  • Alert set for Dead Expert count (assignment ratio under 1%)
  • Tracking average magnitude and variance of router logits
  • Measuring All-to-All communication time vs computation time ratio
  • Automatic detection for sudden training loss spikes

Inference Serving

  • Expert offloading strategy decided (full GPU / partial offloading / disk)
  • Quantization level applied differentially per expert (high precision for frequently used experts)
  • Performance degradation after expert pruning verified via benchmarks
  • Routing overhead measured for different batch sizes
  • Expert cache hit rate monitored
  • Draft model and expert prediction accuracy verified when applying speculative decoding

Fine-tuning

  • Expert dropout rate increased compared to pre-training (ST-MoE recommendation)
  • Batch size reduced compared to pre-training
  • Experimented with freezing the router and fine-tuning only experts
  • Verified that expert specialization distribution changed appropriately for the task after fine-tuning

References