
Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Paper Analysis


Paper Overview

  • Title: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  • Authors: Albert Gu (Carnegie Mellon University), Tri Dao (Princeton University)
  • Published: December 2023 (arXiv: 2312.00752); later presented at COLM 2024
  • Code: github.com/state-spaces/mamba

Motivation: The Limitations of Transformers

Transformers dominate sequence modeling, but they suffer from fundamental problems:

  • O(N²) Complexity: Self-attention has quadratic complexity with respect to sequence length
  • Long Sequence Bottleneck: Memory and compute explode beyond 128K context length
  • Inference Inefficiency: KV Cache grows proportionally with sequence length

State Space Models (SSMs) theoretically achieve O(N) complexity, but prior SSMs (S4, H3, etc.) fell short of Transformers in language modeling due to their fixed, input-independent parameters.

Core Idea: Selective State Spaces

The Problem with Existing SSMs

Traditional SSMs are Linear Time-Invariant (LTI) systems:

h'(t) = A h(t) + B x(t)
y(t) = C h(t)

Here, A, B, and C are fixed matrices. Whether the input is "hello" or "world," the system processes it the same way. This means content-aware reasoning is impossible.
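To make the time-invariance concrete, here is a toy sketch (my own illustration, not the paper's code) of the discretized LTI recurrence: the same A_bar and B_bar are applied at every step, no matter what the input contains.

```python
import torch

# Hypothetical toy sizes, for illustration only
N, L = 4, 10                  # state dimension, sequence length
A_bar = 0.9 * torch.eye(N)    # fixed (discretized) state matrix
B_bar = torch.ones(N, 1)      # fixed input matrix
C = torch.ones(1, N)          # fixed output matrix

x = torch.randn(L, 1)         # input sequence
h = torch.zeros(N, 1)         # hidden state
ys = []
for k in range(L):
    # The same A_bar and B_bar are used regardless of the content of x[k]
    h = A_bar @ h + B_bar @ x[k:k+1]
    ys.append((C @ h).item())
```

Because the transition is identical for every token, the state update cannot depend on what the token says, which is exactly the limitation the Selection Mechanism removes.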

Mamba's Solution: Selection Mechanism

Mamba's key innovation is making B, C, and Δ input-dependent:

# Traditional SSM: fixed parameters
B = nn.Parameter(torch.randn(N))  # learned, but the same for every input

# Mamba: input-dependent parameters (conceptual sketch; the real code
# uses persistent Linear modules rather than creating them per call)
B = nn.Linear(D, N)(x)      # B varies depending on x
C = nn.Linear(D, N)(x)      # C varies depending on x
delta = nn.Linear(D, D)(x)  # step size also varies depending on x

This is what Selective State Space means. Depending on the input, the model selects which information to store in its state and which to ignore.

Intuitive Understanding

The Selection Mechanism can be understood intuitively as:

  • Δ (delta): "How much attention should this token receive?" — A large Δ strongly reflects the current input, while a small Δ retains the previous state
  • B: "Which parts of the input should be recorded in the state?"
  • C: "Which parts of the state should be extracted as output?"

This is similar to what Transformer attention does, but at O(N) complexity.
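The gating role of Δ can be checked numerically. In a minimal scalar sketch (not the paper's code), the zero-order-hold discretization turns one negative entry a of A into a decay factor exp(Δ·a) on the old state:

```python
import math

a = -1.0  # one (negative) entry of the A matrix

def decay(delta):
    # Discretized state transition exp(delta * a):
    # the fraction of the old state that survives this step
    return math.exp(delta * a)

print(decay(0.01))  # close to 1: old state almost fully retained
print(decay(5.0))   # close to 0: old state almost forgotten
```

A large Δ drives the decay factor toward 0 (reset the state and focus on the current input); a small Δ keeps it near 1 (ignore the current input and retain the state).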

Hardware Optimization: The Selective Scan Algorithm

Since the Selective SSM is input-dependent, it cannot leverage the efficient convolution tricks used by traditional SSMs. Mamba addresses this through a hardware-aware algorithm.
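To see why the convolution view requires time-invariance: unrolling an LTI SSM gives y_k = Σ_j C·Ā^j·B̄·x_{k-j}, i.e. a convolution with the fixed kernel K = (C·B̄, C·Ā·B̄, C·Ā²·B̄, …). A toy sketch with a scalar state (my own illustration, not the paper's code) confirms the equivalence:

```python
import torch

# Scalar-state LTI SSM: h_k = a*h_{k-1} + b*x_k,  y_k = c*h_k
a, b, c = 0.5, 1.0, 2.0
L = 6
x = torch.randn(L)

# Recurrent computation
h, y_rec = 0.0, []
for k in range(L):
    h = a * h + b * x[k].item()
    y_rec.append(c * h)

# Equivalent convolution with the fixed kernel K_j = c * a**j * b
K = torch.tensor([c * a**j * b for j in range(L)])
y_conv = [sum(K[j] * x[k - j] for j in range(k + 1)).item() for k in range(L)]

# The two agree only because a, b, c do not depend on x.
# Once the parameters vary per step, no fixed kernel K exists,
# so the FFT-based parallel convolution is no longer applicable.
```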

The Problem: HBM ↔ SRAM Bottleneck

GPU memory hierarchy:

┌────────────────────┐
│   SRAM (~20MB)     │  ← very fast, very small
├────────────────────┤
│   HBM (~80GB)      │  ← slow, large
└────────────────────┘

A naive implementation repeatedly stores and loads intermediate states to/from HBM, becoming memory-bound.

Solution: Kernel Fusion + Recomputation

Mamba's Selective Scan algorithm:

  1. Compute entirely in SRAM: Fuse discretization, selective scan, and output computation into a single kernel
  2. No intermediate state storage: Keep only the SSM inputs (x, Δ, A, B, C) in HBM during the forward pass; recompute intermediate states during the backward pass
  3. Result: IO-aware optimization similar to FlashAttention

# Simplified Selective Scan (conceptual code)
def selective_scan(x, A, B, C, delta):
    """
    x:     (batch, length, d)   input sequence
    A:     (d, n)               state matrix (input-independent)
    B, C:  (batch, length, n)   input-dependent projections
    delta: (batch, length, d)   input-dependent step size
    Returns: (batch, length, d)
    """
    batch, length, d = x.shape
    n = A.shape[1]  # state dimension

    # Discretize A, B using delta
    deltaA = torch.exp(delta.unsqueeze(-1) * A)  # (B, L, D, N)
    deltaB_x = delta.unsqueeze(-1) * B.unsqueeze(2) * x.unsqueeze(-1)

    # Sequential scan (parallelized via CUDA kernel in practice)
    h = torch.zeros(batch, d, n, device=x.device)
    ys = []
    for i in range(length):
        h = deltaA[:, i] * h + deltaB_x[:, i]
        y = (h * C[:, i].unsqueeze(1)).sum(-1)
        ys.append(y)

    return torch.stack(ys, dim=1)
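A quick shape check of the scan (the function is reproduced here so the snippet runs standalone; the sizes are arbitrary):

```python
import torch

def selective_scan(x, A, B, C, delta):
    # Same logic as the selective_scan above, repeated for a self-contained run
    batch, length, d = x.shape
    n = A.shape[1]
    deltaA = torch.exp(delta.unsqueeze(-1) * A)                        # (batch, L, d, n)
    deltaB_x = delta.unsqueeze(-1) * B.unsqueeze(2) * x.unsqueeze(-1)  # (batch, L, d, n)
    h = torch.zeros(batch, d, n)
    ys = []
    for i in range(length):
        h = deltaA[:, i] * h + deltaB_x[:, i]
        ys.append((h * C[:, i].unsqueeze(1)).sum(-1))
    return torch.stack(ys, dim=1)

batch, length, d, n = 2, 8, 16, 4
x = torch.randn(batch, length, d)
A = -torch.rand(d, n)                  # negative entries keep exp(delta * A) stable
B = torch.randn(batch, length, n)      # input-dependent, hence the selectivity
C = torch.randn(batch, length, n)
delta = torch.rand(batch, length, d)

y = selective_scan(x, A, B, C, delta)
print(y.shape)  # torch.Size([2, 8, 16])
```

The output has the same shape as the input, and the per-step loop makes the O(N) cost in sequence length explicit; the CUDA kernel replaces this loop with a parallel scan.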

Mamba Block Architecture

Mamba draws inspiration from the H3 architecture and uses a simplified block design:

Input (x)
    ├──── Linear projection (expand) ────┐
    │                                    │
    ▼                                    ▼
  Conv1D                             SiLU (gate)
    │                                    │
    ▼                                    │
  SiLU                                   │
    │                                    │
    ▼                                    │
  SSM (selective scan)                   │
    │                                    │
    ▼                                    │
    ×────────────────────────────────────┘
    │
    ▼
  Linear projection (reduce)
    │
    ▼
  Output

Implemented in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlock(nn.Module):
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        d_inner = d_model * expand
        self.d_state = d_state

        # Input projection (produces both the SSM path x and the gate path z)
        self.in_proj = nn.Linear(d_model, d_inner * 2, bias=False)

        # Depthwise causal Conv1D
        self.conv1d = nn.Conv1d(
            d_inner, d_inner, d_conv,
            padding=d_conv - 1, groups=d_inner
        )

        # Input-dependent SSM parameters: B, C, and a scalar dt per position
        self.x_proj = nn.Linear(d_inner, d_state * 2 + 1, bias=False)
        self.dt_proj = nn.Linear(1, d_inner, bias=True)

        # A parameter (structured, stored in log space; the actual A is -exp(A_log))
        A = torch.arange(1, d_state + 1).float().repeat(d_inner, 1)
        self.A_log = nn.Parameter(torch.log(A))

        self.D = nn.Parameter(torch.ones(d_inner))
        self.out_proj = nn.Linear(d_inner, d_model, bias=False)

    def ssm(self, x):
        # Negative A keeps exp(delta * A) stable (a decaying state)
        A = -torch.exp(self.A_log)                    # (d_inner, d_state)
        B, C, dt = self.x_proj(x).split(
            [self.d_state, self.d_state, 1], dim=-1
        )
        delta = F.softplus(self.dt_proj(dt))          # (batch, length, d_inner)
        y = selective_scan(x, A, B, C, delta)         # the scan defined earlier
        return y + x * self.D                         # skip connection via D

    def forward(self, x):
        batch, length, _ = x.shape

        # Dual path: SSM branch (x) and gate branch (z)
        xz = self.in_proj(x)
        x, z = xz.chunk(2, dim=-1)

        # Causal conv + activation (truncate the right-side padding)
        x = self.conv1d(x.transpose(1, 2))[:, :, :length].transpose(1, 2)
        x = F.silu(x)

        # Selective SSM
        y = self.ssm(x)

        # Gate and project back to d_model
        y = y * F.silu(z)
        return self.out_proj(y)
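One detail in the block worth isolating is the causal Conv1d trick: `padding=d_conv - 1` pads both sides, and truncating the output back to `length` drops the right-hand overhang, so position t only ever sees positions ≤ t. A standalone sketch (my own illustration, with arbitrary sizes):

```python
import torch
import torch.nn as nn

d_inner, d_conv, length = 8, 4, 10
conv = nn.Conv1d(d_inner, d_inner, d_conv, padding=d_conv - 1, groups=d_inner)

x = torch.randn(1, length, d_inner)
# Conv1d expects (batch, channels, length); truncate the right-side padding
out = conv(x.transpose(1, 2))[:, :, :length].transpose(1, 2)

# Causality check: changing the last input position must not affect earlier outputs
x2 = x.clone()
x2[:, -1, :] += 100.0
out2 = conv(x2.transpose(1, 2))[:, :, :length].transpose(1, 2)
print(torch.allclose(out[:, :-1], out2[:, :-1]))  # True
```

`groups=d_inner` makes the convolution depthwise (one small filter per channel), which keeps this pre-SSM mixing step cheap.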

Experimental Results

Language Modeling (The Pile)

Model           Parameters   Perplexity
Transformer++   1.3B         8.94
H3              1.3B         9.19
Hyena           1.3B         9.07
RWKV-4          1.5B         9.02
Mamba           1.4B         8.63

Mamba achieved lower perplexity than Transformers of comparable size. Notably, at the 2.8B scale, Mamba improved perplexity by approximately 0.3 over Transformer++.

Inference Speed

Sequence Length   Transformer   Mamba
2K                1x            1x
8K                1x            3x faster
64K               1x            5x faster
1M                OOM           Works

Because Mamba's recurrent state is fixed-size while the Transformer's KV cache keeps growing, Mamba's speed advantage widens as sequence length increases.

Selective Copying & Induction Heads

The paper validates the effectiveness of the Selection Mechanism through two key synthetic tasks:

  1. Selective Copying: Selectively copying specific tokens from a long sequence — Previous SSMs (S4) fail, Mamba succeeds
  2. Induction Heads: Recognizing "A B ... A → B" patterns — Previous SSMs fail, Mamba succeeds

Both tasks require content-aware reasoning and are impossible to solve without the Selection Mechanism.
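As an illustration of what these tasks look like (my own sketch, not the paper's exact setup), a Selective Copying instance scatters a few data tokens among noise tokens, and the model must output the data tokens in order:

```python
import random

def make_selective_copy_example(vocab=8, n_data=4, length=20, noise_token=0, seed=0):
    # Data tokens are drawn from {1..vocab} and placed at random positions;
    # every other position holds the noise token.
    rng = random.Random(seed)
    data = [rng.randint(1, vocab) for _ in range(n_data)]
    positions = sorted(rng.sample(range(length), n_data))
    seq = [noise_token] * length
    for pos, tok in zip(positions, data):
        seq[pos] = tok
    return seq, data  # input sequence, expected output

seq, target = make_selective_copy_example()
# target is exactly the non-noise tokens of seq, in order
```

Solving this requires the state update to depend on token content (store data tokens, skip noise tokens), which a fixed-parameter LTI system cannot do.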

Mamba vs Transformer: When to Use Which?

Criterion              Transformer            Mamba
Long sequences         △ (O(N²))              ◎ (O(N))
Inference speed        △ (KV cache grows)     ◎ (fixed-size state)
In-context learning    ◎                      ○
Training parallelism   ◎ (fully parallel)     ○ (scan parallelism)
Ecosystem/tooling      ◎                      △

Follow-up Research: Mamba-2 and Hybrids

Significant progress has been made since Mamba:

  • Mamba-2 (Dao & Gu, 2024): Proposed the State Space Duality (SSD) framework, revealing the theoretical connection between SSMs and Attention
  • Jamba (AI21, 2024): A Transformer + Mamba hybrid with 256K context
  • Zamba (Zyphra, 2024): Achieved SOTA in small-scale hybrid models

Conclusion

Mamba overcame the limitations of SSMs with the simple yet powerful idea of the Selection Mechanism, achieving performance comparable to or exceeding Transformers at O(N) complexity. The hardware-aware algorithm design was key to translating theoretical efficiency into real-world speed.

While Transformers remain dominant, Mamba and hybrid architectures are gaining increasing traction in domains where long-sequence processing is critical.

Quiz

Q1: What is the key limitation of Transformers that Mamba aims to solve?

The O(N²) time and space complexity of self-attention. As sequence length increases, computation and memory grow quadratically, making long-sequence processing inefficient.

Q2: What does "Selective" mean in Selective State Space?

It means making the SSM parameters (B, C, Δ) input-dependent, so the model can select which information to store in the state and which to ignore.

Q3: What was the fundamental reason previous SSMs (S4, etc.) underperformed in language modeling?

Their Linear Time-Invariant (LTI) nature. Since parameters were fixed and independent of the input, content-aware reasoning was impossible.

Q4: What is the intuitive meaning of the Δ (delta) parameter in Mamba?

It is a "step size" that determines how much the current input is reflected. A large Δ strongly reflects the current input, while a small Δ retains the previous state.

Q5: Why can't Mamba's Selective Scan use the convolution trick from traditional SSMs?

Because the input-dependent parameters mean the system is no longer LTI, so the trick of converting to a convolution and computing it in parallel via FFT no longer applies.

Q6: What is the key strategy that Mamba's hardware optimization shares with FlashAttention?

IO-aware optimization. Computations are fused into a single kernel that runs in SRAM (kernel fusion), and intermediate states are not stored in HBM but recomputed during the backward pass (recomputation).

Q7: What capability does the Selective Copying task validate?

The ability to selectively remember and copy specific tokens from a long sequence, i.e. content-aware reasoning. SSMs with fixed parameters cannot solve this task.

Q8: What is State Space Duality (SSD), the key contribution of Mamba-2?

A framework showing that SSMs and Attention are mathematically different representations of the same operation. It provides a theoretical foundation for combining the strengths of both approaches.