Deep Dive into Mixture of Experts (MoE) Architecture: From Switch Transformer to Mixtral and DeepSeek


Introduction

In the era of large language models (LLMs), endlessly scaling model parameters hits fundamental walls in both training cost and inference cost. Dense Transformers activate all parameters for every input token, meaning that as parameter count grows, FLOPs grow proportionally. The Mixture of Experts (MoE) architecture solves this through conditional computation. The core idea is to maintain large model capacity while keeping actual computation constant by activating only a subset of expert networks based on the input.

Since Shazeer et al. proposed the Sparsely-Gated MoE in their 2017 paper "Outrageously Large Neural Networks," MoE has evolved rapidly -- from Google's Switch Transformer in 2021, to Mistral's Mixtral 8x7B in 2023, and DeepSeek-V2/V3 in 2024. By 2025, Meta adopted MoE with Llama 4, and DeepSeek-R1 built on the V3 architecture to maximize reasoning capabilities, gaining worldwide attention.

This article provides a paper-level deep dive from the mathematical foundations of MoE architecture through the design philosophies of major models, comparative analysis of routing strategies, training stability techniques, and inference optimization.

History and Evolution of MoE Architecture

The concept of MoE was first proposed by Jacobs et al. in their 1991 paper "Adaptive Mixtures of Local Experts." Initially, it was a simple approach of computing a weighted sum of multiple expert network outputs through a gating network.

The key milestones in modern MoE evolution are as follows:

  • 2017: Shazeer et al. proposed LSTM-based Sparsely-Gated MoE with 4,096 experts and 100B+ parameters
  • 2021: Google's Switch Transformer simplified routing to Top-1, achieving 1.6 trillion parameters
  • 2022: Google's ST-MoE (Stable and Transferable MoE) systematized training stability techniques
  • 2022: Expert Choice Routing paper proposed reverse routing where experts select tokens
  • 2023: Mixtral 8x7B opened the era of open-source MoE with Top-2 routing and SwiGLU experts
  • 2024: DeepSeek-V2 introduced Fine-Grained Expert Segmentation and Shared Experts with the DeepSeekMoE architecture
  • 2024: DeepSeek-V3 achieved state-of-the-art with 671B parameters (37B active)
  • 2025: Llama 4 Scout (16 experts, 109B total / 17B active) marked Meta's first MoE adoption

Mathematical Foundations of Sparse MoE

Basic Formulation

The output of an MoE layer is defined as:

y = \sum_{i=1}^{N} g(x)_i \cdot E_i(x)

where x is the hidden representation of the input token, N is the number of experts, E_i is the i-th expert network, and g(x)_i is the weight assigned by the gating function to the i-th expert.

Sparse Gating Function

The Noisy Top-K gating function proposed by Shazeer (2017) works as follows:

g(x) = \text{Softmax}(\text{TopK}(H(x), k))

H(x)_i = (x \cdot W_g)_i + \epsilon \cdot \text{Softplus}((x \cdot W_{noise})_i)

where W_g is the gating weight matrix, W_{noise} is the noise weight matrix, and \epsilon is standard Gaussian noise added during training to encourage exploration. The TopK operation retains only the top k values and sets the rest to -\infty, making them zero after the Softmax.

PyTorch Implementation: Basic Sparse Gating

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGating(nn.Module):
    """Noisy Top-K Gating mechanism for MoE."""

    def __init__(self, input_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(input_dim, num_experts, bias=False)
        self.noise = nn.Linear(input_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x shape: (batch_size, seq_len, input_dim)
        logits = self.gate(x)  # (batch, seq, num_experts)

        # Training noise for exploration
        if self.training:
            noise = torch.randn_like(logits) * F.softplus(self.noise(x))
            logits = logits + noise

        # Top-K selection
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        # (batch, seq, top_k)

        # Sparse softmax: only over selected experts
        top_k_gates = F.softmax(top_k_logits, dim=-1)

        return top_k_gates, top_k_indices
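
A quick sanity check of the sparse softmax step (the token and expert counts below are illustrative): the weights returned for each token form a valid distribution over only the selected experts.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 8)                       # 4 tokens, 8 experts
top_k_logits, top_k_idx = logits.topk(2, dim=-1)
gates = F.softmax(top_k_logits, dim=-1)          # softmax over the 2 winners only

# Each token's two gate values sum to 1; the other six experts
# implicitly get weight 0, which is what makes the layer sparse.
assert gates.shape == (4, 2)
assert torch.allclose(gates.sum(dim=-1), torch.ones(4))
```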

Switch Transformer Analysis

Core Innovation: Top-1 Routing

The core innovation of the Switch Transformer, published by Fedus et al. in 2021, is Top-1 routing. Previous research assumed that at least two experts needed to be activated for stable training, but Switch Transformer proved that selecting exactly one expert per token is sufficient.

g(x) = \text{Softmax}(x \cdot W_r), \quad i^* = \arg\max_i g(x)_i

The routed output is simply the product of the gating probability and the expert output:

y = g(x)_{i^*} \cdot E_{i^*}(x)

Architecture Characteristics

Switch Transformer replaces the FFN (Feed-Forward Network) layers of the T5 architecture with MoE layers. Each MoE layer can accommodate up to 2,048 experts, enabling models at the 1.6 trillion parameter scale. Top-1 routing cuts communication costs in half and simplifies the routing computation itself.

Performance

With 64 experts, Switch Transformer achieves 7x faster pretraining speed compared to T5-Base at equivalent compute. This is because model capacity increases while per-token computation remains constant.

PyTorch Implementation: Switch Transformer MoE Layer

class SwitchMoELayer(nn.Module):
    """Switch Transformer style MoE layer with Top-1 routing."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int,
                 capacity_factor: float = 1.25):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.ReLU(),
                nn.Linear(ffn_dim, hidden_dim)
            ) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        batch_size, seq_len, hidden_dim = x.shape
        x_flat = x.view(-1, hidden_dim)  # (B*S, D)
        num_tokens = x_flat.shape[0]

        # Router: Top-1 selection
        router_logits = self.router(x_flat)  # (B*S, E)
        router_probs = F.softmax(router_logits, dim=-1)
        expert_indices = router_probs.argmax(dim=-1)  # (B*S,)
        expert_gates = router_probs.gather(1, expert_indices.unsqueeze(-1)).squeeze(-1)

        # Capacity: max tokens per expert
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)

        # Dispatch tokens to experts
        output = torch.zeros_like(x_flat)
        for i in range(self.num_experts):
            token_idx = (expert_indices == i).nonzero(as_tuple=True)[0]
            if token_idx.numel() == 0:
                continue
            token_idx = token_idx[:capacity]  # enforce capacity; overflow tokens are dropped
            expert_out = self.experts[i](x_flat[token_idx])
            gates = expert_gates[token_idx].unsqueeze(-1)
            # Integer-index assignment writes into `output` in place
            # (chained boolean indexing would write to a copy).
            output[token_idx] = expert_out * gates

        return output.view(batch_size, seq_len, hidden_dim)

Mixtral 8x7B Architecture Details

Design Philosophy

Mixtral 8x7B, released by Mistral AI in December 2023, is based on the Mistral 7B architecture with each Transformer layer's FFN replaced by an MoE layer consisting of 8 experts. It uses Top-2 routing to activate 2 experts per token.

Key Numbers

  • Total parameters: 46.7B (8 experts x ~5.6B FFN + shared attention parameters)
  • Active parameters: ~13B (2 expert FFNs per token + shared parameters)
  • Expert function: SwiGLU FFN
  • Attention: Grouped Query Attention (GQA)
  • Context length: 32K tokens

Top-2 Routing Formula

The MoE layer output in Mixtral is computed as:

y = \sum_{i \in \text{Top2}(g(x))} g(x)_i \cdot \text{SwiGLU}_i(x)

The gating function g(x) computes a Softmax probability distribution over experts for input x and selects the top 2. The gating weights of the two selected experts are renormalized to sum to 1.

SwiGLU Expert Network

Each expert is an FFN using the SwiGLU activation function:

\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xV) W_2

PyTorch Implementation: Mixtral MoE Block

class MixtralMoEBlock(nn.Module):
    """Mixtral-style MoE block with Top-2 SwiGLU experts."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 8):
        super().__init__()
        self.num_experts = num_experts
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, hidden_dim)
        gate_logits = self.gate(x)  # (batch, seq, num_experts)
        gate_probs = F.softmax(gate_logits, dim=-1)

        # Top-2 selection
        top2_probs, top2_indices = gate_probs.topk(2, dim=-1)
        # Renormalize gates to sum to 1
        top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)

        # Compute expert outputs and combine
        batch, seq, dim = x.shape
        output = torch.zeros_like(x)
        for k in range(2):
            expert_idx = top2_indices[:, :, k]  # (batch, seq)
            gate_val = top2_probs[:, :, k].unsqueeze(-1)  # (batch, seq, 1)
            for i in range(self.num_experts):
                mask = (expert_idx == i)
                if mask.any():
                    expert_input = x[mask]
                    expert_output = self.experts[i](expert_input)
                    output[mask] += gate_val[mask] * expert_output

        return output


class SwiGLUExpert(nn.Module):
    """SwiGLU Feed-Forward Network used as expert."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.v = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor):
        return self.w2(F.silu(self.w1(x)) * self.v(x))

DeepSeek-V2/V3 Innovations: DeepSeekMoE

Fine-Grained Expert Segmentation

DeepSeek-V2 (2024) took a fundamentally different approach from existing MoE models. The core idea is Fine-Grained Expert Segmentation -- splitting experts into smaller and more numerous units.

The original N experts are increased to mN, while each expert's hidden dimension is reduced to 1/m of the original. Simultaneously, the number of activated experts increases proportionally from K to mK, maintaining the same per-token computation while enabling far finer-grained expert combinations.
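
The invariance is easy to check with a back-of-the-envelope count. The dimensions and split factor m below are illustrative, not DeepSeek's actual configuration:

```python
from math import comb

d_model, d_ff = 4096, 11008     # illustrative hidden / FFN dimensions
N, K, m = 16, 2, 4              # 16 experts, 2 active, split factor 4

params_per_expert = 2 * d_model * d_ff            # up- and down-projection
active_coarse = K * params_per_expert             # K full-width experts
active_fine = (m * K) * (params_per_expert // m)  # mK experts at 1/m width
assert active_coarse == active_fine               # per-token compute unchanged

# The number of possible expert combinations explodes:
print(comb(N, K), "->", comb(m * N, m * K))       # 120 -> 4426165368
```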

DeepSeek-V3 Architecture

DeepSeek-V3 (December 2024) features the following key configuration:

  • Total parameters: 671B
  • Active parameters: 37B (per token)
  • Routed experts: 256 (per layer)
  • Shared experts: 1 (per layer, always active)
  • Active routed experts: 8 (per token)
  • Attention: Multi-head Latent Attention (MLA)

Auxiliary-Loss-Free Load Balancing

One of the most innovative contributions of DeepSeek-V3 is its auxiliary-loss-free load balancing strategy. Traditional MoE models use auxiliary losses for load balancing, but calibrating the coefficient is difficult, and excessive values degrade model performance.

Instead, DeepSeek-V3 adds a bias term b_i to each expert, used only for routing decisions:

i^* = \text{TopK}(s(x)_i + b_i)

g(x)_i = \frac{s(x)_i}{\sum_{j \in \text{TopK}} s(x)_j}

The bias term b_i only affects routing decisions and is not included in the actual gating weight computation. Overloaded experts have their b_i decreased while underloaded experts have their b_i increased, achieving load balance without contaminating the loss function.
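
A minimal sketch of this update rule (the step size gamma, the plain-list types, and the mean-load threshold are assumptions for illustration; DeepSeek-V3 applies a fixed bias update speed at the end of each training step):

```python
def update_router_bias(bias, tokens_per_expert, gamma=0.001):
    """Auxiliary-loss-free balancing: nudge the routing-only bias b_i.

    Experts loaded above the mean have their bias decreased by gamma,
    those below the mean have it increased; balanced experts are untouched.
    """
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    return [
        b - gamma if load > mean_load else (b + gamma if load < mean_load else b)
        for b, load in zip(bias, tokens_per_expert)
    ]

# Expert 0 is overloaded, experts 1 and 2 underloaded: the biases move apart.
bias = update_router_bias([0.0, 0.0, 0.0], [900, 100, 20])
assert bias == [-0.001, 0.001, 0.001]
```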

Device-Limited Routing

To limit communication costs, DeepSeek-V3 restricts each token to be sent to at most M nodes. It selects the top M nodes based on the affinity scores of the experts hosted on each node, then performs Top-K routing only among experts within those selected nodes.
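
A sketch of this two-stage selection (the expert-to-node layout and the use of the max expert score as a node's affinity are illustrative assumptions):

```python
import torch

def node_limited_topk(scores, experts_per_node, max_nodes, k):
    """Two-stage routing: pick the top-M nodes, then Top-K among their experts.

    scores: (num_experts,) affinity scores for one token, with experts laid
    out contiguously by node. Taking the max expert score as the node
    affinity is one illustrative choice.
    """
    node_scores = scores.view(-1, experts_per_node).max(dim=-1).values
    top_nodes = node_scores.topk(max_nodes).indices
    masked = torch.full_like(scores, float("-inf"))
    for n in top_nodes.tolist():  # unmask experts on the selected nodes
        lo, hi = n * experts_per_node, (n + 1) * experts_per_node
        masked[lo:hi] = scores[lo:hi]
    return masked.topk(k).indices

torch.manual_seed(0)
chosen = node_limited_topk(torch.randn(16), experts_per_node=4,
                           max_nodes=2, k=4)
# All 4 selected experts live on at most 2 distinct nodes.
assert len(set((chosen // 4).tolist())) <= 2
```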

Routing Strategy Comparison

Top-1 Routing (Switch Transformer)

Activates exactly 1 expert per token. Minimizes communication cost and simplifies implementation, but reliance on a single expert may limit representational power.

Top-2 Routing (Mixtral, GShard)

Activates 2 experts per token with weighted combination. Richer representation than Top-1 but doubles communication cost.

Expert Choice Routing (Zhou et al., 2022)

Reverses the traditional approach: experts select tokens. Since each expert selects a fixed number of tokens, load balance is automatically guaranteed. However, a single token may be selected by zero experts (and effectively dropped) or by several, so per-token compute is no longer uniform.

Soft MoE (Puigcerver et al., 2023)

Instead of discrete routing, passes weighted combinations of all tokens to each expert. Fully differentiable with no token dropping, but not truly sparse since every expert processes information from all tokens.
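
The dispatch/combine mechanics behind this can be sketched in a few lines (the slot count and the identity stand-in for the per-slot expert FFNs are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, dim, num_slots = 10, 16, 4
x = torch.randn(num_tokens, dim)
phi = torch.randn(dim, num_slots)        # learnable slot parameters

logits = x @ phi                         # (tokens, slots)
dispatch = F.softmax(logits, dim=0)      # each slot: a mixture over tokens
combine = F.softmax(logits, dim=1)       # each token: a mixture over slots

slot_inputs = dispatch.t() @ x           # (slots, dim), one input per expert slot
slot_outputs = slot_inputs               # identity stand-in for per-slot expert FFNs
y = combine @ slot_outputs               # (tokens, dim): no token is ever dropped
assert y.shape == (num_tokens, dim)
```

Because every token contributes some weight to every slot, the routing is fully differentiable, but each expert sees a blend of all tokens rather than a sparse subset.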

PyTorch Implementation: Expert Choice Routing

class ExpertChoiceRouter(nn.Module):
    """Expert Choice Routing: experts select tokens."""

    def __init__(self, hidden_dim: int, num_experts: int, capacity_factor: float = 1.0):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        num_tokens = x.shape[0]
        capacity = int(self.capacity_factor * num_tokens / self.num_experts)

        # Compute affinity scores
        scores = self.router(x)  # (num_tokens, num_experts)
        scores = F.softmax(scores, dim=0)  # softmax over tokens (not experts)

        # Each expert selects top-capacity tokens
        # Transpose: (num_experts, num_tokens)
        expert_scores = scores.t()

        # Top-capacity selection per expert
        top_scores, top_indices = expert_scores.topk(capacity, dim=-1)
        # top_scores: (num_experts, capacity)
        # top_indices: (num_experts, capacity)

        return top_scores, top_indices

Training Stability Techniques

Load Balancing Loss

The load balancing loss proposed in Switch Transformer is defined as:

\mathcal{L}_{\text{balance}} = \alpha \cdot N \sum_{i=1}^{N} f_i \cdot P_i

where N is the number of experts, f_i is the fraction of tokens routed to expert i, and P_i is the mean router probability assigned to expert i. The coefficient \alpha is typically set between 0.01 and 0.1.

Under an ideal uniform distribution, f_i = P_i = 1/N, so the loss equals \alpha. The loss increases as imbalance grows.
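
The uniform case is easy to verify numerically (N and alpha here are arbitrary illustrative values):

```python
alpha, N = 0.01, 8
f = [1.0 / N] * N   # each expert receives 1/N of the tokens
P = [1.0 / N] * N   # and 1/N of the mean router probability
loss = alpha * N * sum(fi * Pi for fi, Pi in zip(f, P))
assert abs(loss - alpha) < 1e-12   # uniform routing yields exactly alpha
```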

Router Z-Loss

The Router Z-Loss proposed in ST-MoE (2022) constrains the magnitude of router logits to improve training stability:

\mathcal{L}_{z} = \frac{1}{B} \sum_{x \in B} \left( \log \sum_{i=1}^{N} e^{z_i(x)} \right)^2

where z_i(x) denotes the router logits for token x and B is the batch of tokens. This loss prevents logits from growing excessively large, mitigating numerical instability and convergence issues in routing decisions.

PyTorch Implementation: Load Balancing + Z-Loss

def compute_moe_auxiliary_losses(
    router_logits: torch.Tensor,
    expert_indices: torch.Tensor,
    num_experts: int,
    alpha_balance: float = 0.01,
    alpha_z: float = 0.001
):
    """Compute load balancing loss and router z-loss.

    Args:
        router_logits: Raw router logits (batch*seq, num_experts)
        expert_indices: Selected expert indices (batch*seq, top_k)
        num_experts: Total number of experts
        alpha_balance: Weight for load balancing loss
        alpha_z: Weight for router z-loss
    """
    num_tokens = router_logits.shape[0]
    router_probs = F.softmax(router_logits, dim=-1)

    # --- Load Balancing Loss ---
    # f_i: fraction of tokens routed to expert i
    expert_mask = F.one_hot(expert_indices, num_experts).float()
    if expert_mask.dim() == 3:
        expert_mask = expert_mask.sum(dim=1)  # sum over top_k
    expert_mask = (expert_mask > 0).float()
    f = expert_mask.mean(dim=0)  # (num_experts,)

    # P_i: mean router probability for expert i
    P = router_probs.mean(dim=0)  # (num_experts,)

    balance_loss = alpha_balance * num_experts * (f * P).sum()

    # --- Router Z-Loss ---
    log_z = torch.logsumexp(router_logits, dim=-1)  # (num_tokens,)
    z_loss = alpha_z * (log_z ** 2).mean()

    return balance_loss + z_loss

Inference Optimization

Expert Offloading

Due to the large total parameter count of MoE models, it may be difficult to fit all experts in GPU memory. Expert Offloading stores inactive experts in CPU RAM or on disk and loads them to GPU only when needed.

Key techniques include:

  • LRU Cache: Caches recently used experts on GPU
  • Predictive Prefetch: Asynchronously preloads experts needed for the next layer
  • Speculative Decoding + Offloading: Combines with speculative decoding to hide offloading latency
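
An LRU expert cache can be sketched with an OrderedDict (the capacity and the loader callback are placeholders; a real system would move quantized expert weights between CPU and GPU and overlap the copies with compute):

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keep at most `capacity` experts resident; evict the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # e.g. copies expert weights CPU -> GPU
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the LRU expert
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

# Capacity 2: touching experts 0, 1, 0, 2 evicts expert 1, not expert 0.
cache = ExpertLRUCache(capacity=2, load_fn=lambda i: f"weights[{i}]")
for e in (0, 1, 0, 2):
    cache.get(e)
assert list(cache.cache) == [0, 2]
```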

Quantization

Quantization for MoE models is similar to dense models, but requires additional consideration since weight distributions may differ across experts.

  • GPTQ/AWQ: Independent quantization configurations per expert are possible
  • Mixed Precision: Higher precision for frequently used experts, lower for rarely used ones
  • MiLo (2025): Adds Low-Rank compensators to extremely quantized MoE models to recover accuracy

Expert Parallelism

In MoE inference, Expert Parallelism places each expert on a separate GPU for parallel processing. All-to-All communication sends tokens to the GPU hosting their assigned expert, processes them, and gathers results back.
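
The dispatch step boils down to a permutation that groups tokens by destination expert, plus per-expert counts that become the All-to-All split sizes. A single-process sketch of that bookkeeping (the actual exchange would use a collective such as torch.distributed.all_to_all):

```python
import torch

torch.manual_seed(0)
num_experts, num_tokens = 4, 10
expert_ids = torch.randint(0, num_experts, (num_tokens,))   # router decisions

perm = torch.argsort(expert_ids, stable=True)               # group tokens by expert
splits = torch.bincount(expert_ids, minlength=num_experts)  # all-to-all split sizes

tokens = torch.randn(num_tokens, 8)
dispatched = tokens[perm]        # contiguous chunk per expert, ready to send
# ... each chunk would be processed on the GPU hosting its expert ...
inv_perm = torch.argsort(perm)   # inverse permutation
restored = dispatched[inv_perm]  # gather results back into token order
assert torch.equal(restored, tokens)
```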

Major MoE Model Comparison

| Model | Year | Total Params | Active Params | Experts | Routing | Expert Type | Key Features |
|---|---|---|---|---|---|---|---|
| Sparsely-Gated MoE | 2017 | 137B | - | 4096 | Top-K | MLP | First large-scale sparse MoE |
| Switch Transformer | 2021 | 1.6T | - | 2048 | Top-1 | FFN | Simplified routing, T5-based |
| GLaM | 2022 | 1.2T | 97B | 64 | Top-2 | FFN | 1/3 energy vs GPT-3 |
| ST-MoE | 2022 | 269B | - | 32 | Top-2 | FFN | Z-Loss, stability focus |
| Expert Choice | 2022 | - | - | - | Expert Choice | FFN | Experts select tokens |
| Mixtral 8x7B | 2023 | 46.7B | 13B | 8 | Top-2 | SwiGLU | Open-source, GQA |
| DeepSeek-V2 | 2024 | 236B | 21B | 160 | Top-6 | Fine-Grained | Fine-grained + shared experts |
| DeepSeek-V3 | 2024 | 671B | 37B | 256+1 | Top-8 | Fine-Grained | MLA, aux-loss-free, shared expert |
| Llama 4 Scout | 2025 | 109B | 17B | 16 | Top-1 | - | Meta's first MoE |

Complete Implementation: Custom MoE Transformer Block

Below is a complete implementation of a Transformer block combining an attention layer with an MoE FFN.

class MoETransformerBlock(nn.Module):
    """Complete Transformer block with MoE FFN layer."""

    def __init__(
        self,
        hidden_dim: int = 768,
        num_heads: int = 12,
        ffn_dim: int = 3072,
        num_experts: int = 8,
        top_k: int = 2,
        capacity_factor: float = 1.25,
        dropout: float = 0.1
    ):
        super().__init__()
        # Multi-Head Attention
        self.attn_norm = nn.LayerNorm(hidden_dim)
        self.attention = nn.MultiheadAttention(
            hidden_dim, num_heads, dropout=dropout, batch_first=True
        )

        # MoE FFN
        self.ffn_norm = nn.LayerNorm(hidden_dim)
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            SwiGLUExpert(hidden_dim, ffn_dim)
            for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None):
        # Pre-norm Attention
        residual = x
        x_norm = self.attn_norm(x)
        attn_out, _ = self.attention(x_norm, x_norm, x_norm, attn_mask=mask)
        x = residual + self.dropout(attn_out)

        # Pre-norm MoE FFN
        residual = x
        x_norm = self.ffn_norm(x)
        moe_out, aux_loss = self._moe_forward(x_norm)
        x = residual + self.dropout(moe_out)

        return x, aux_loss

    def _moe_forward(self, x: torch.Tensor):
        B, S, D = x.shape
        x_flat = x.view(-1, D)

        # Router
        logits = self.router(x_flat)
        probs = F.softmax(logits, dim=-1)
        top_k_probs, top_k_idx = probs.topk(self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Dispatch and combine
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            for i in range(self.num_experts):
                mask = (top_k_idx[:, k] == i)
                if mask.any():
                    expert_out = self.experts[i](x_flat[mask])
                    output[mask] += top_k_probs[mask, k].unsqueeze(-1) * expert_out

        # Auxiliary loss
        aux_loss = compute_moe_auxiliary_losses(
            logits, top_k_idx, self.num_experts
        )

        return output.view(B, S, D), aux_loss

Conclusion and Future Directions

MoE architecture has established itself as the most practical approach for simultaneously achieving "model capacity scaling" and "computational efficiency." Starting from Switch Transformer's Top-1 simplification, Mixtral 8x7B brought MoE to the open-source ecosystem, and DeepSeek-V3 set new standards with Fine-Grained Experts and Auxiliary-Loss-Free strategies.

Key future research directions include:

  1. Dynamic expert activation: Adaptive routing that adjusts the number of active experts based on input difficulty
  2. Training-inference consistency: Techniques ensuring routing patterns from training are maintained during inference
  3. Expert specialization analysis: Interpretability research on what knowledge or functions each expert specializes in
  4. MoE for edge devices: Lightweight MoE designs for mobile and edge environments

References

  1. Shazeer, N., et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
  2. Fedus, W., et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.
  3. Jiang, A.Q., et al. "Mixtral of Experts." arXiv:2401.04088, 2024.
  4. DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
  5. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
  6. Zoph, B., et al. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906, 2022.
  7. Zhou, Y., et al. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022.
  8. Puigcerver, J., et al. "From Sparse to Soft Mixtures of Experts." ICLR 2024.
  9. Du, N., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022.
  10. Jacobs, R.A., et al. "Adaptive Mixtures of Local Experts." Neural Computation, 1991.