Split View: Mamba: Linear-Time Sequence Modeling with Selective State Spaces 논문 분석

Mamba: Linear-Time Sequence Modeling with Selective State Spaces 논문 분석

논문 개요
동기: Transformer의 한계
핵심 아이디어: Selective State Spaces
하드웨어 최적화: Selective Scan 알고리즘
- 문제: HBM ↔ SRAM 병목
- 해결: Kernel Fusion + Recomputation
Mamba 블록 아키텍처
실험 결과
Mamba vs Transformer: 언제 무엇을?
후속 연구: Mamba-2와 하이브리드
마무리
퀴즈

논문 개요

제목: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
저자: Albert Gu, Tri Dao (Carnegie Mellon University, Princeton University)
발표: 2023년 12월 (arXiv: 2312.00752), ICML 2024 채택
코드: github.com/state-spaces/mamba

동기: Transformer의 한계

Transformer는 시퀀스 모델링의 왕좌를 차지하고 있지만, 근본적인 문제가 있습니다:

O(N²) 복잡도: Self-attention은 시퀀스 길이에 대해 이차 복잡도
긴 시퀀스 처리 한계: 128K 이상의 컨텍스트에서 메모리/연산 폭발
추론 비효율: KV Cache가 시퀀스 길이에 비례하여 증가

State Space Model(SSM)은 이론적으로 O(N) 복잡도를 가지지만, 기존 SSM(S4, H3 등)은 입력에 무관한 고정 파라미터 때문에 언어 모델링에서 Transformer에 뒤졌습니다.

핵심 아이디어: Selective State Spaces

기존 SSM의 문제

기존 SSM은 Linear Time-Invariant(LTI) 시스템입니다:

$h'(t) = Ah(t) + Bx(t)$ $y(t) = Ch(t)$

여기서 A, B, C는 고정된 행렬입니다. 즉, 입력이 "hello"든 "world"든 같은 방식으로 처리합니다. 이것은 content-aware reasoning이 불가능하다는 의미입니다.

Mamba의 해결책: Selection Mechanism

Mamba의 핵심 혁신은 B, C, Δ를 입력에 의존적으로 만드는 것입니다:

# 기존 SSM: 고정 파라미터
B = nn.Parameter(torch.randn(N))  # 학습되지만 입력과 무관

# Mamba: 입력 의존적 파라미터
B = nn.Linear(D, N)(x)      # x에 따라 B가 달라짐
C = nn.Linear(D, N)(x)      # x에 따라 C가 달라짐
delta = nn.Linear(D, D)(x)  # x에 따라 step size도 달라짐

이것이 Selective State Space의 의미입니다. 입력에 따라 어떤 정보를 상태에 저장하고 어떤 정보를 무시할지 선택합니다.

직관적 이해

Selection Mechanism을 직관적으로 이해하면:

Δ(delta): "이 토큰을 얼마나 중요하게 볼 것인가" — 큰 Δ는 현재 입력을 강하게 반영, 작은 Δ는 이전 상태를 유지
B: "이 입력의 어떤 부분을 상태에 기록할 것인가"
C: "상태에서 어떤 부분을 출력으로 꺼낼 것인가"

이는 Transformer의 attention이 하는 역할과 유사하지만, O(N) 복잡도로 수행합니다.

하드웨어 최적화: Selective Scan 알고리즘

Selective SSM은 입력 의존적이므로 기존 SSM의 효율적인 합성곱(convolution) 트릭을 사용할 수 없습니다. Mamba는 이 문제를 하드웨어 인지 알고리즘으로 해결합니다.

문제: HBM ↔ SRAM 병목

GPU의 메모리 계층:

┌────────────────────┐
│   SRAM (20MB)      │  ← 매우 빠름, 매우 작음
├────────────────────┤
│   HBM (80GB)       │  ← 느림, 큼
└────────────────────┘

나이브한 구현은 중간 상태를 HBM에 반복 저장/로드하여 메모리 바운드가 됩니다.

해결: Kernel Fusion + Recomputation

Mamba의 Selective Scan 알고리즘:

SRAM에서 한번에 계산: 이산화(discretization), 선택적 스캔, 출력 계산을 하나의 커널로 융합
중간 상태 저장 안 함: 순전파에서 (N, B, C, Δ)만 HBM에 저장하고, 역전파에서 중간 상태를 재계산
결과: FlashAttention과 유사한 IO-aware 최적화

# 간소화된 Selective Scan (개념 코드)
def selective_scan(x, A, B, C, delta):
    """
    x: (batch, length, d_model)
    Returns: (batch, length, d_model)
    """
    batch, length, d = x.shape
    n = A.shape[1]  # state dimension

    # Discretize A, B using delta
    deltaA = torch.exp(delta.unsqueeze(-1) * A)  # (B, L, D, N)
    deltaB_x = delta.unsqueeze(-1) * B.unsqueeze(2) * x.unsqueeze(-1)

    # Sequential scan (실제로는 CUDA kernel으로 병렬화)
    h = torch.zeros(batch, d, n, device=x.device)
    ys = []
    for i in range(length):
        h = deltaA[:, i] * h + deltaB_x[:, i]
        y = (h * C[:, i].unsqueeze(1)).sum(-1)
        ys.append(y)

    return torch.stack(ys, dim=1)

Mamba 블록 아키텍처

Mamba는 H3 아키텍처에서 영감을 받아 간소화된 블록을 사용합니다:

Input (x)
    │
    ├──── Linear projection (expand) ────┐
    │                                     │
    ▼                                     ▼
  Conv1D                              SiLU (gate)
    │                                     │
    ▼                                     │
  SiLU                                    │
    │                                     │
    ▼                                     │
  SSM (selective scan)                    │
    │                                     │
    ▼                                     │
    ×─────────────────────────────────────┘
    │
    ▼
  Linear projection (reduce)
    │
    ▼
  Output

PyTorch로 구현하면:

class MambaBlock(nn.Module):
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        d_inner = d_model * expand

        # Input projection
        self.in_proj = nn.Linear(d_model, d_inner * 2, bias=False)

        # Conv1D
        self.conv1d = nn.Conv1d(
            d_inner, d_inner, d_conv,
            padding=d_conv - 1, groups=d_inner
        )

        # SSM parameters
        self.x_proj = nn.Linear(d_inner, d_state * 2 + 1, bias=False)  # B, C, delta
        self.dt_proj = nn.Linear(1, d_inner, bias=True)

        # A parameter (structured)
        A = torch.arange(1, d_state + 1).float().repeat(d_inner, 1)
        self.A_log = nn.Parameter(torch.log(A))

        self.D = nn.Parameter(torch.ones(d_inner))
        self.out_proj = nn.Linear(d_inner, d_model, bias=False)

    def forward(self, x):
        batch, length, _ = x.shape

        # Dual path
        xz = self.in_proj(x)
        x, z = xz.chunk(2, dim=-1)

        # Conv + activation
        x = self.conv1d(x.transpose(1, 2))[:, :, :length].transpose(1, 2)
        x = F.silu(x)

        # SSM
        y = self.ssm(x)

        # Gate and project
        y = y * F.silu(z)
        return self.out_proj(y)

실험 결과

언어 모델링 (The Pile)

모델	파라미터	Perplexity
Transformer++	1.3B	8.94
H3	1.3B	9.19
Hyena	1.3B	9.07
RWKV-4	1.5B	9.02
Mamba	1.4B	8.63

Mamba는 같은 크기의 Transformer보다 더 낮은 perplexity를 달성했습니다. 특히 2.8B 규모에서 Mamba는 Transformer++ 대비 약 0.3 perplexity 개선을 보였습니다.

추론 속도

시퀀스 길이	Transformer	Mamba
2K	1x	1x
8K	1x	3x faster
64K	1x	5x faster
1M	OOM	작동

Mamba는 시퀀스 길이가 길어질수록 Transformer 대비 압도적인 속도 이점을 보여줍니다.

Selective Copying & Induction Heads

논문은 두 가지 핵심 합성 태스크로 Selection Mechanism의 효과를 검증합니다:

Selective Copying: 긴 시퀀스에서 특정 토큰만 선택적으로 복사 → 기존 SSM(S4) 실패, Mamba 성공
Induction Heads: "A B ... A → B" 패턴 인식 → 기존 SSM 실패, Mamba 성공

이 두 태스크는 content-aware reasoning이 필수이며, Selection Mechanism 없이는 해결이 불가능합니다.

Mamba vs Transformer: 언제 무엇을?

기준	Transformer	Mamba
긴 시퀀스	△ (O(N²))	◎ (O(N))
추론 속도	△ (KV Cache 증가)	◎ (고정 상태)
In-context learning	◎	○
학습 병렬화	◎ (완전 병렬)	○ (scan 병렬화)
생태계/도구	◎	△

후속 연구: Mamba-2와 하이브리드

Mamba 이후 많은 발전이 있었습니다:

Mamba-2 (Dao & Gu, 2024): SSM과 Attention의 이론적 연결을 보여주는 State Space Duality(SSD) 프레임워크 제안
Jamba (AI21, 2024): Transformer + Mamba 하이브리드, 256K 컨텍스트
Zamba (Zyphra, 2024): 소형 하이브리드 모델에서 SOTA

마무리

Mamba는 Selection Mechanism이라는 단순하지만 강력한 아이디어로 SSM의 한계를 극복하고, Transformer에 필적하거나 능가하는 성능을 O(N) 복잡도로 달성했습니다. 하드웨어 인지 알고리즘 설계는 이론적 효율성을 실제 속도로 전환하는 핵심이었습니다.

Transformer가 여전히 주류이지만, Mamba와 하이브리드 아키텍처는 긴 시퀀스 처리가 중요한 영역에서 점점 더 존재감을 키우고 있습니다.

퀴즈

Q1: Mamba가 해결하려는 Transformer의 핵심 한계는?

Self-attention의 O(N²) 시간/공간 복잡도. 시퀀스 길이가 길어질수록 연산량과 메모리가 제곱으로 증가하여 긴 시퀀스 처리가 비효율적입니다.

Q2: Selective State Space에서 "Selective"의 의미는?

SSM의 파라미터(B, C, Δ)를 입력에 의존적으로 만들어, 어떤 정보를 상태에 저장하고 어떤 정보를 무시할지 선택(select)할 수 있다는 의미입니다.

Q3: 기존 SSM(S4 등)이 언어 모델링에서 부진했던 근본 원인은?

Linear Time-Invariant(LTI) 특성 때문. 파라미터가 입력과 무관하게 고정되어 content-aware reasoning이 불가능했습니다.

Q4: Mamba에서 Δ(delta) 파라미터의 직관적 의미는?

현재 입력을 얼마나 중요하게 반영할지 결정하는 "step size". 큰 Δ는 현재 입력을 강하게 반영하고, 작은 Δ는 이전 상태를 유지합니다.

Q5: Mamba의 Selective Scan이 기존 SSM의 합성곱 트릭을 사용할 수 없는 이유는?

입력 의존적 파라미터 때문에 더 이상 LTI 시스템이 아니므로, 합성곱으로 변환하여 FFT로 병렬 계산하는 트릭이 적용 불가합니다.

Q6: Mamba의 하드웨어 최적화에서 FlashAttention과 유사한 핵심 전략은?

IO-aware 최적화. SRAM에서 모든 계산을 융합(kernel fusion)하고, 중간 상태를 HBM에 저장하지 않고 역전파에서 재계산(recomputation)합니다.

Q7: Selective Copying 태스크로 검증하려는 능력은?

긴 시퀀스에서 특정 토큰만 선택적으로 기억하고 복사하는 content-aware reasoning 능력. 고정 파라미터 SSM은 이 태스크를 해결할 수 없습니다.

Q8: Mamba-2의 핵심 기여인 State Space Duality(SSD)란?

SSM과 Attention이 수학적으로 동일한 연산의 다른 표현임을 보여주는 프레임워크. 이를 통해 두 접근법의 장점을 결합하는 이론적 기반을 제공합니다.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Paper Analysis

Paper Overview
Motivation: The Limitations of Transformers
Core Idea: Selective State Spaces
Hardware Optimization: The Selective Scan Algorithm
- The Problem: HBM ↔ SRAM Bottleneck
- Solution: Kernel Fusion + Recomputation
Mamba Block Architecture
Experimental Results
Mamba vs Transformer: When to Use Which?
Follow-up Research: Mamba-2 and Hybrids
Conclusion
Quiz

Paper Overview

Title: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Authors: Albert Gu, Tri Dao (Carnegie Mellon University, Princeton University)
Published: December 2023 (arXiv: 2312.00752), Accepted at ICML 2024
Code: github.com/state-spaces/mamba

Motivation: The Limitations of Transformers

Transformers dominate sequence modeling, but they suffer from fundamental problems:

O(N²) Complexity: Self-attention has quadratic complexity with respect to sequence length
Long Sequence Bottleneck: Memory and compute explode beyond 128K context length
Inference Inefficiency: KV Cache grows proportionally with sequence length

State Space Models (SSMs) theoretically achieve O(N) complexity, but prior SSMs (S4, H3, etc.) fell short of Transformers in language modeling due to their fixed, input-independent parameters.

Core Idea: Selective State Spaces

The Problem with Existing SSMs

Traditional SSMs are Linear Time-Invariant (LTI) systems:

$h'(t) = Ah(t) + Bx(t)$ $y(t) = Ch(t)$

Here, A, B, and C are fixed matrices. Whether the input is "hello" or "world," the system processes it the same way. This means content-aware reasoning is impossible.

Mamba's Solution: Selection Mechanism

Mamba's key innovation is making B, C, and Δ input-dependent:

# Traditional SSM: fixed parameters
B = nn.Parameter(torch.randn(N))  # Learned but input-independent

# Mamba: input-dependent parameters
B = nn.Linear(D, N)(x)      # B varies depending on x
C = nn.Linear(D, N)(x)      # C varies depending on x
delta = nn.Linear(D, D)(x)  # Step size also varies depending on x

This is what Selective State Space means. Depending on the input, the model selects which information to store in its state and which to ignore.

Intuitive Understanding

The Selection Mechanism can be understood intuitively as:

Δ (delta): "How much attention should this token receive?" — A large Δ strongly reflects the current input, while a small Δ retains the previous state
B: "Which parts of the input should be recorded in the state?"
C: "Which parts of the state should be extracted as output?"

This is similar to what Transformer attention does, but at O(N) complexity.

Hardware Optimization: The Selective Scan Algorithm

Since the Selective SSM is input-dependent, it cannot leverage the efficient convolution tricks used by traditional SSMs. Mamba addresses this through a hardware-aware algorithm.

The Problem: HBM ↔ SRAM Bottleneck

GPU memory hierarchy:

┌────────────────────┐
│   SRAM (20MB)      │  ← Very fast, very small
├────────────────────┤
│   HBM (80GB)       │  ← Slow, large
└────────────────────┘

A naive implementation repeatedly stores and loads intermediate states to/from HBM, becoming memory-bound.

Solution: Kernel Fusion + Recomputation

Mamba's Selective Scan algorithm:

Compute entirely in SRAM: Fuse discretization, selective scan, and output computation into a single kernel
No intermediate state storage: Store only (N, B, C, Δ) in HBM during the forward pass; recompute intermediate states during the backward pass
Result: IO-aware optimization similar to FlashAttention

# Simplified Selective Scan (conceptual code)
def selective_scan(x, A, B, C, delta):
    """
    x: (batch, length, d_model)
    Returns: (batch, length, d_model)
    """
    batch, length, d = x.shape
    n = A.shape[1]  # state dimension

    # Discretize A, B using delta
    deltaA = torch.exp(delta.unsqueeze(-1) * A)  # (B, L, D, N)
    deltaB_x = delta.unsqueeze(-1) * B.unsqueeze(2) * x.unsqueeze(-1)

    # Sequential scan (parallelized via CUDA kernel in practice)
    h = torch.zeros(batch, d, n, device=x.device)
    ys = []
    for i in range(length):
        h = deltaA[:, i] * h + deltaB_x[:, i]
        y = (h * C[:, i].unsqueeze(1)).sum(-1)
        ys.append(y)

    return torch.stack(ys, dim=1)

Mamba Block Architecture

Mamba draws inspiration from the H3 architecture and uses a simplified block design:

Input (x)
    │
    ├──── Linear projection (expand) ────┐
    │                                     │
    ▼                                     ▼
  Conv1D                              SiLU (gate)
    │                                     │
    ▼                                     │
  SiLU                                    │
    │                                     │
    ▼                                     │
  SSM (selective scan)                    │
    │                                     │
    ▼                                     │
    ×─────────────────────────────────────┘
    │
    ▼
  Linear projection (reduce)
    │
    ▼
  Output

Implemented in PyTorch:

class MambaBlock(nn.Module):
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        d_inner = d_model * expand

        # Input projection
        self.in_proj = nn.Linear(d_model, d_inner * 2, bias=False)

        # Conv1D
        self.conv1d = nn.Conv1d(
            d_inner, d_inner, d_conv,
            padding=d_conv - 1, groups=d_inner
        )

        # SSM parameters
        self.x_proj = nn.Linear(d_inner, d_state * 2 + 1, bias=False)  # B, C, delta
        self.dt_proj = nn.Linear(1, d_inner, bias=True)

        # A parameter (structured)
        A = torch.arange(1, d_state + 1).float().repeat(d_inner, 1)
        self.A_log = nn.Parameter(torch.log(A))

        self.D = nn.Parameter(torch.ones(d_inner))
        self.out_proj = nn.Linear(d_inner, d_model, bias=False)

    def forward(self, x):
        batch, length, _ = x.shape

        # Dual path
        xz = self.in_proj(x)
        x, z = xz.chunk(2, dim=-1)

        # Conv + activation
        x = self.conv1d(x.transpose(1, 2))[:, :, :length].transpose(1, 2)
        x = F.silu(x)

        # SSM
        y = self.ssm(x)

        # Gate and project
        y = y * F.silu(z)
        return self.out_proj(y)

Experimental Results

Language Modeling (The Pile)

Model	Parameters	Perplexity
Transformer++	1.3B	8.94
H3	1.3B	9.19
Hyena	1.3B	9.07
RWKV-4	1.5B	9.02
Mamba	1.4B	8.63

Mamba achieved lower perplexity than Transformers of comparable size. Notably, at the 2.8B scale, Mamba improved perplexity by approximately 0.3 over Transformer++.

Inference Speed

Sequence Length	Transformer	Mamba
2K	1x	1x
8K	1x	3x faster
64K	1x	5x faster
1M	OOM	Works

Mamba demonstrates an overwhelming speed advantage over Transformers as sequence length increases.

Selective Copying & Induction Heads

The paper validates the effectiveness of the Selection Mechanism through two key synthetic tasks:

Selective Copying: Selectively copying specific tokens from a long sequence — Previous SSMs (S4) fail, Mamba succeeds
Induction Heads: Recognizing "A B ... A → B" patterns — Previous SSMs fail, Mamba succeeds

Both tasks require content-aware reasoning and are impossible to solve without the Selection Mechanism.

Mamba vs Transformer: When to Use Which?

Criterion	Transformer	Mamba
Long sequences	△ (O(N²))	◎ (O(N))
Inference speed	△ (KV Cache grows)	◎ (Fixed state)
In-context learning	◎	○
Training parallelism	◎ (Fully parallel)	○ (Scan parallelism)
Ecosystem/tooling	◎	△

Follow-up Research: Mamba-2 and Hybrids

Significant progress has been made since Mamba:

Mamba-2 (Dao & Gu, 2024): Proposed the State Space Duality (SSD) framework, revealing the theoretical connection between SSMs and Attention
Jamba (AI21, 2024): A Transformer + Mamba hybrid with 256K context
Zamba (Zyphra, 2024): Achieved SOTA in small-scale hybrid models

Conclusion

Mamba overcame the limitations of SSMs with the simple yet powerful idea of the Selection Mechanism, achieving performance comparable to or exceeding Transformers at O(N) complexity. The hardware-aware algorithm design was key to translating theoretical efficiency into real-world speed.

While Transformers remain dominant, Mamba and hybrid architectures are gaining increasing traction in domains where long-sequence processing is critical.

Quiz

Q1: What is the key limitation of Transformers that Mamba aims to solve?

The O(N²) time and space complexity of self-attention. As sequence length increases, computation and memory grow quadratically, making long-sequence processing inefficient.

Q2: What does "Selective" mean in Selective State Space?

It means making the SSM parameters (B, C, Δ) input-dependent, so the model can select which information to store in the state and which to ignore.

Q3: What was the fundamental reason previous SSMs (S4, etc.) underperformed in language modeling?

Their Linear Time-Invariant (LTI) nature. Since parameters were fixed and independent of the input, content-aware reasoning was impossible.

Q4: What is the intuitive meaning of the Δ (delta) parameter in Mamba?

It is a "step size" that determines how much the current input is reflected. A large Δ strongly reflects the current input, while a small Δ retains the previous state.

Q5: Why can't Mamba's Selective Scan use the convolution trick from traditional SSMs?

Because the input-dependent parameters mean the system is no longer LTI, so the trick of converting to convolution and computing in parallel via FFT is no longer applicable.

Q6: What is the key strategy that Mamba's hardware optimization shares with FlashAttention?

IO-aware optimization. All computations are fused within SRAM (kernel fusion), and intermediate states are not stored in HBM but recomputed during the backward pass (recomputation).

Q7: What capability does the Selective Copying task validate?

The ability to selectively remember and copy specific tokens from a long sequence — a content-aware reasoning capability. SSMs with fixed parameters cannot solve this task.

Q8: What is State Space Duality (SSD), the key contribution of Mamba-2?

A framework showing that SSMs and Attention are mathematically different representations of the same operation. It provides a theoretical foundation for combining the strengths of both approaches.