
Prologue — why model architecture is interesting again in 2026

When Vaswani and seven co-authors submitted "Attention is All You Need" to NeurIPS in June 2017, recurrent and convolutional networks lost their primacy on sequence tasks almost in a single generation. For the seven years that followed, virtually every large language model, vision transformer, speech model and even protein model ran on one architecture: the Transformer.

That changed in December 2023 when Albert Gu and Tri Dao released Mamba. Over 2024, in quick succession, the field got Mamba 2, Jamba, xLSTM, Falcon Mamba 7B, Test-Time Training, Mixture of A Million Experts and Flash Attention 3. In 2025, DeepSeek-V3's 671B MoE drove home the realization: the Transformer was a starting point, not the end.

This article maps the architecture landscape as of May 2026 — through an engineer's eyes, not a paper survey. Who solves what, who should pick which, and what the Korean and Japanese ecosystems are building.

1 · The 2026 architecture map — four camps

Roughly four camps:

| Camp | Examples | Core idea |
| --- | --- | --- |
| Transformer mainline | GPT-4, Claude 4.7, Gemini 2.5, Llama 4 | Self-attention. Maximum expressivity, maximum cost |
| State space / linear RNN | Mamba, Mamba 2, RWKV, RetNet, Hawk, xLSTM | Linear in sequence length. Cheap inference |
| Hybrid | Jamba, Griffin, Zamba, RecurrentGemma | Mix SSM and attention to keep both strengths |
| Sparse / MoE | Mixtral 8x7B, DeepSeek-V3 671B, Million Experts | Huge parameters, small activation |

Two orthogonal axes sit on top:

- DiT (Diffusion Transformer) for image and video generation — the backbone of OpenAI Sora.

- Long-context algorithmics — Flash Attention 3, Ring Attention, Gemini 2M, Magic LTM-2-mini 100M.

                  high expressivity
                         │
    Transformer ─────────┼──── DiT (image/video)
    (GPT, Claude)        │
                         │
                hybrids (Jamba, Griffin)
                         │
    Mamba 2 ─────────────┼──── RWKV, RetNet
    (linear time)        │
                         │
                  cheap inference

The headline — **one-Transformer-fits-all is over.** Different parts of a system pick different architectures.

2 · Transformer (Vaswani 2017) — still the default

Paper: Vaswani et al., "Attention is All You Need", NeurIPS 2017. arXiv:1706.03762.

The core is **scaled dot-product attention**. Project the input X into Query, Key and Value matrices, take the dot products QK^T, scale by 1/sqrt(d_k), apply a row-wise softmax, and use the resulting weights to average the rows of V.

Block formula:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
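The formula is short enough to write out directly. The following is a minimal sketch in plain Python (lists instead of tensors, a single head, no masking), meant only to make the arithmetic visible; the toy matrices are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row = attention-weighted average of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

# Toy inputs (illustrative only): 2 queries, 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
result = attention(Q, K, V)

# With identical keys every weight is 1/3, so the output is the mean of V.
uniform = attention([[1.0, 2.0]], [[0.0, 0.0]] * 3, V)
```

Note that the inner loop touches every key for every query, which is where the quadratic cost discussed under the weaknesses comes from.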

Strengths:

- Every token sees every other token directly. No locality assumption.

- Fully parallel across the sequence at training time: no step-by-step recurrence, ideal for TPU/GPU. (Autoregressive decoding is still one token at a time.)

- Almost no inductive bias — given enough data, anything is learnable.

Weaknesses:

- O(N**2) time and memory in sequence length N. Painful at 32K, devastating at 128K.

- KV cache dominates inference — every decoded token re-reads all past K/V.

- The lack of inductive bias is a double-edged sword: in low-data regimes, SSMs and CNNs often beat Transformers.
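To put a number on the KV-cache point, here is a back-of-the-envelope calculator. The model shape (32 layers, 32 KV heads of dimension 128, fp16) is a hypothetical 7B-class configuration chosen for round numbers, not the spec of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence (the factor 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
at_128k = kv_cache_bytes(32, 32, 128, seq_len=128 * 1024)
# Same shape with grouped-query attention: 8 KV heads instead of 32.
at_128k_gqa = kv_cache_bytes(32, 8, 128, seq_len=128 * 1024)

print(per_token / 2**20, "MiB per token")                      # 0.5
print(at_128k / 2**30, "GiB at 128K tokens")                   # 64.0
print(at_128k_gqa / 2**30, "GiB at 128K tokens, 8 KV heads")   # 16.0
```

Under these assumptions a single 128K-token sequence needs 64 GiB of cache before counting the weights, which is why Grouped Query Attention and the cache-free architectures later in this article matter.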

Yet in 2026, GPT-4o, Claude 4.7, Gemini 2.5, Llama 4, Mistral Large 2 and Qwen 3 are all still Transformer-based. The pieces inside have evolved — RoPE, Grouped Query Attention, SwiGLU, RMSNorm, Flash Attention 3 — but the shell is the same.

3 · Flash Attention 3 (Tri Dao, Jul 2024) — Transformer acceleration peaked

The Flash Attention series is driven by Tri Dao, who started it at Stanford and continued it at Princeton and Together AI.

- **FlashAttention 1** (May 2022, NeurIPS 2022): tiling plus recomputation, memory dropped from O(N**2) to O(N).

- **FlashAttention 2** (Jul 2023): reorganized work across heads and sequence, about 2x faster.

- **FlashAttention 3** (Jul 2024): leveraged H100 asynchronous Tensor Cores and FP8 for another 1.5x–2x.

The recipe never changes: **never materialize the N x N matrix softmax(QK^T) in GPU main memory; process it in blocks in SRAM and accumulate the output with an online softmax.** Accept that memory bandwidth, not flops, is the bottleneck.

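    # Conceptual sketch in NumPy (single head, no masking); the real code is CUDA/CUTLASS
    import numpy as np

    def flash_attention(Q, K, V, block_size=128):
        d_k = Q.shape[-1]
        out = np.zeros((Q.shape[0], V.shape[-1]))
        row_max = np.full(Q.shape[:-1], -np.inf)
        row_sum = np.zeros(Q.shape[:-1])
        for j in range(0, K.shape[0], block_size):
            Kj = K[j:j + block_size]
            Vj = V[j:j + block_size]
            # partial scores for this block (kept in SRAM by the real kernel)
            Sij = Q @ Kj.T / np.sqrt(d_k)
            new_max = np.maximum(row_max, Sij.max(-1))
            # online softmax update: rescale the old accumulators to the new max
            scale = np.exp(row_max - new_max)
            Pij = np.exp(Sij - new_max[:, None])
            row_sum = row_sum * scale + Pij.sum(-1)
            out = out * scale[:, None] + Pij @ Vj
            row_max = new_max
        # normalize once at the end
        return out / row_sum[:, None]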

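FlashAttention 3 reaches up to roughly 740 TFLOPS in FP16 on H100, about 75% of peak, and close to 1.2 PFLOPS in FP8. The same blockwise pattern carries forward to H200 and B200, with kernels retuned per GPU generation.

For an engineer, the key fact: **PyTorch 2.x `scaled_dot_product_attention` dispatches to a fused FlashAttention kernel automatically when the inputs qualify.** No integration work for the common case. The built-in backend is FlashAttention-2-based (since PyTorch 2.2); FlashAttention 3 itself is installed from the flash-attention repository. Open-weight models such as Llama 4 ride on this path in most PyTorch serving stacks.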

4 · Ring Attention — handling very long contexts

Liu et al., "Ring Attention with Blockwise Transformers for Near-Infinite Context", 2023. arXiv:2310.01889.

Problem: how do you process 1M+ token contexts whose KV cache won't fit on a single GPU?

Answer: split the sequence across GPUs and circulate K/V blocks around a ring. Each GPU keeps its Q stationary and sees every K/V once — but only one block at a time.

    GPU0 ──▶ GPU1 ──▶ GPU2 ──▶ GPU3
      ▲                          │
      └──────────────────────────┘

    Each GPU holds its own Q block;
    K/V blocks rotate clockwise.
    After four steps every GPU has seen all K/V.
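The rotation can be simulated on one machine to check that nothing is lost by seeing K/V one block at a time. This is a toy sketch under simplifying assumptions (non-causal attention, a single head, sequence length divisible by the device count, "devices" that are just list indices); it is not the paper's implementation, which overlaps communication with compute.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(Q, K, V):
    """Reference: ordinary softmax attention over all keys at once."""
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for q in Q:
        s = [dot(q, k) * scale for k in K]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) / z for c in range(len(V[0]))])
    return out

def ring_attention(Q, K, V, n_dev):
    """Each simulated device keeps its Q block and receives one K/V block per step."""
    blk = len(Q) // n_dev
    d_v = len(V[0])
    scale = 1.0 / math.sqrt(len(K[0]))
    q_blocks = [Q[i * blk:(i + 1) * blk] for i in range(n_dev)]
    kv_blocks = [(K[i * blk:(i + 1) * blk], V[i * blk:(i + 1) * blk]) for i in range(n_dev)]
    # Per query row: (running max, running sum, unnormalized output).
    state = [[(-math.inf, 0.0, [0.0] * d_v) for _ in range(blk)] for _ in range(n_dev)]
    for step in range(n_dev):
        for dev in range(n_dev):
            Kb, Vb = kv_blocks[(dev - step) % n_dev]  # block currently held by `dev`
            for r, q in enumerate(q_blocks[dev]):
                m, z, acc = state[dev][r]
                s = [dot(q, k) * scale for k in Kb]
                m_new = max(m, max(s))
                corr = math.exp(m - m_new)            # rescale the old accumulators
                p = [math.exp(x - m_new) for x in s]
                z = z * corr + sum(p)
                acc = [a * corr + sum(pi * v[c] for pi, v in zip(p, Vb))
                       for c, a in enumerate(acc)]
                state[dev][r] = (m_new, z, acc)
    return [[a / z for a in acc] for dev in range(n_dev) for (_, z, acc) in state[dev]]

rng = random.Random(0)
n, d = 8, 4
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ref = full_attention(Q, K, V)
ring = ring_attention(Q, K, V, n_dev=4)
max_err = max(abs(a - b) for ra, rb in zip(ref, ring) for a, b in zip(ra, rb))
```

The per-row accumulators are the same online-softmax state FlashAttention keeps per tile, which is why the two techniques compose.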

The gain: context length scales almost linearly with GPU count. Google has not published how Gemini 1.5 Pro reached its 1M-token demo in February 2024, but Ring-Attention-style sequence parallelism is widely assumed to be part of it. The Gemini line later extended the window to 2M tokens.

Related techniques:

- **StreamingLLM** (Xiao et al., 2023): attention sink to bound the KV cache.

- **YaRN** (Peng et al., 2023): RoPE interpolation beyond training length.

- **LongRoPE** (Microsoft, 2024): RoPE extension to 2M tokens.
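Of the three, StreamingLLM's cache policy is the easiest to picture in code. The sketch below shows only which positions stay in the cache (a few initial "sink" tokens plus a sliding window); the function name and the window size are illustrative assumptions, and the paper's handling of positions inside the cache is left out.

```python
def streaming_keep(seq_len, n_sink=4, window=8):
    """Positions kept in the KV cache: the first n_sink tokens plus the last `window`."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

kept = streaming_keep(seq_len=20, n_sink=4, window=8)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]

# The cache never grows past n_sink + window, however long the stream gets.
sizes = [len(streaming_keep(n)) for n in (5, 50, 5000)]
```

Everything between the sinks and the window is dropped, so this bounds memory but does not give the model access to the dropped tokens.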

5 · Mamba (Albert Gu + Tri Dao, Dec 2023) — S6 state space

Paper: Gu and Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", Dec 2023. arXiv:2312.00752.

This is the paper that shook 2024. For the first time, an SSM matched Transformer language-modelling quality while running in **linear time in sequence length**.

State-space models start from continuous-time dynamics, then discretize. One-line summary:

h_t = A h_{t-1} + B x_t, \quad y_t = C h_t

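That is an RNN by definition. The S4 family made it GPU-friendly with two tricks:

1. Choose A in a special structure (HiPPO initialization, diagonal-plus-low-rank or plain diagonal) for stability and expressivity.

2. Train in parallel along the sequence dimension: S4 unrolls the recurrence into one long convolution, while S5 and S6 use a parallel scan.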
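Why a recurrence can be scanned in parallel at all: each step h -> a*h + b is an affine map, and composing affine maps is associative. The sketch below uses scalar states and a recursive split to show the idea; on a GPU the two halves would run concurrently, which this sequential Python does not attempt.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential(steps, h0=0.0):
    """Plain RNN-style loop: N dependent steps."""
    h, out = h0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out

def scan(steps):
    """Inclusive prefix scan by recursive halving (the halves are independent)."""
    if len(steps) == 1:
        return list(steps)
    mid = len(steps) // 2
    left, right = scan(steps[:mid]), scan(steps[mid:])
    carry = left[-1]  # everything that happened in the left half, as one affine map
    return left + [combine(carry, r) for r in right]

steps = [(0.9, 1.0), (0.5, 2.0), (0.8, -1.0), (1.1, 0.5), (0.7, 3.0)]
seq_states = sequential(steps)
# With h0 = 0, the state after each prefix is the `b` part of the composed map.
scan_states = [b for _, b in scan(steps)]
```

Because `combine` is associative, the grouping can be any balanced tree, which is what gives a depth of O(log N) instead of N.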

Mamba (S6) adds the crucial twist:

- **Selection**: B, C and the step size Δ become functions of the input. A itself stays a learned constant, but the discretized transition exp(ΔA) changes per token through Δ.

- **Selective scan kernel**: an input-dependent SSM can no longer be computed as a fixed convolution, so the authors ship a fused, hardware-aware CUDA scan kernel.

    # Minimal usage of the mamba-ssm package (needs a CUDA GPU)
    import torch
    from mamba_ssm import Mamba

    model = Mamba(
        d_model=2560,
        d_state=16,   # SSM hidden state dimension
        d_conv=4,     # width of the local 1D convolution
        expand=2,     # inner width = expand * d_model
    ).cuda()

    x = torch.randn(2, 8192, 2560).cuda()  # (batch, seq, dim)
    y = model(x)                           # same shape as x; cost is linear in seq
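To see what "selection" means at the level of one recurrence step, here is a toy single-channel version in plain Python. The scalar projection weights `w_delta`, `w_B` and `w_C` are hypothetical stand-ins for Mamba's learned linear projections, and the simplified rule for the discretized B is an assumption of this sketch; the real layer works on full vectors with a fused kernel.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_scan(xs, A, w_delta, w_B, w_C):
    """One channel of a selective SSM with diagonal A (a list of negative floats)."""
    h = [0.0] * len(A)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)              # input-dependent step size
        B = [w * x for w in w_B]                   # input-dependent input projection
        C = [w * x for w in w_C]                   # input-dependent output projection
        A_bar = [math.exp(delta * a) for a in A]   # discretized transition exp(delta * A)
        B_bar = [delta * b for b in B]             # simplified discretization of B
        h = [ab * hi + bb * x for ab, hi, bb in zip(A_bar, h, B_bar)]
        ys.append(sum(c * hi for c, hi in zip(C, h)))
    return ys

A = [-1.0, -2.0]
ys = selective_scan([1.0, 0.0, 2.0], A, w_delta=0.5, w_B=[1.0, 1.0], w_C=[1.0, 0.5])
zeros = selective_scan([0.0, 0.0, 0.0], A, w_delta=0.5, w_B=[1.0, 1.0], w_C=[1.0, 0.5])
```

The state `h` has a fixed size no matter how long `xs` is, which is the whole point: there is no cache that grows with the sequence.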

What Mamba gives:

- **O(N) training, O(1) per-token inference.**

- No KV cache — all history compressed into the state h.

- At 1.4B, zero-shot performance matched or beat Pythia 1.4B.

What it doesn't give:

- In-context retrieval is weaker than a Transformer's: exact lookups such as "what is the value at row X, column Y of this table?" have to survive compression into a fixed-size state.

- Above 70B, validation is still thin.

6 · Mamba 2 (May 2024) — unifying SSMs and attention

Paper: Dao and Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality", May 2024. arXiv:2405.21060.

The key insight: **SSMs and self-attention are two faces of the same abstraction.** The authors call this SSD — Structured State Space Duality.

Mathematically:

- SSMs are sequence transforms given by semiseparable matrices: lower-triangular matrices whose blocks on and below the diagonal have bounded rank.

- Linear attention is the same family of matrices under a different parametrization.

- Both fit inside SSD.
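The duality can be checked numerically on a toy case. The sketch below takes an SSM with a scalar decay a_t per step and vector-valued B_t and C_t, and computes the output two ways: as a recurrence, and as multiplication by a lower-triangular matrix M[i][j] = (C_i · B_j) · a_{j+1} ··· a_i, which has the shape of masked attention with C as queries and B as keys. The random inputs are illustrative only.

```python
import random

def ssm_recurrent(a, B, C, x):
    """h_t = a_t * h_{t-1} + B_t * x_t ;  y_t = C_t . h_t"""
    n = len(B[0])
    h = [0.0] * n
    ys = []
    for t in range(len(x)):
        h = [a[t] * h[k] + B[t][k] * x[t] for k in range(n)]
        ys.append(sum(C[t][k] * h[k] for k in range(n)))
    return ys

def ssm_matrix(a, B, C, x):
    """Same map written as y = M x with an explicit T x T lower-triangular M."""
    T = len(x)
    M = [[0.0] * T for _ in range(T)]
    for i in range(T):
        decay = 1.0  # product a_{j+1} * ... * a_i, built up as j moves left
        for j in range(i, -1, -1):
            M[i][j] = sum(c * b for c, b in zip(C[i], B[j])) * decay
            decay *= a[j]
    return [sum(M[i][j] * x[j] for j in range(T)) for i in range(T)]

rng = random.Random(0)
T, n = 6, 3
a = [rng.uniform(0.5, 1.0) for _ in range(T)]
B = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(T)]
C = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(T)]
x = [rng.gauss(0, 1) for _ in range(T)]
y_rec = ssm_recurrent(a, B, C, x)
y_mat = ssm_matrix(a, B, C, x)
```

The recurrent form costs O(T) with a small state; the matrix form costs O(T^2) but is all matmuls. Mamba 2's algorithm mixes the two by using the matrix form within chunks and the recurrence across them.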

Practical consequences:

- The SSD layer runs 2x to 8x faster than Mamba 1's selective scan. Larger head dimensions make the algorithm matmul-friendly.

- Compatible with Grouped Query Attention tricks — Transformer-tier acceleration applies.

- 1% to 3% lower perplexity at matched parameter count vs Mamba 1.

    from mamba_ssm import Mamba2

    model = Mamba2(
        d_model=2560,
        d_state=128,  # much larger state than Mamba 1
        headdim=64,   # head dimension introduced
        expand=2,
    ).cuda()

Mamba 2 also shows that linear attention and RetNet are special cases of SSD, and places RWKV-6, Griffin and GLA as close relatives in the same framework. It is the paper that aligned the field.

7 · Hyena (Stanford) — linear-time alternative

Paper: Poli et al., "Hyena Hierarchy: Towards Larger Convolutional Language Models", ICML 2023. arXiv:2302.10866.

Stanford's H3/Hyena/Mamba group attempted to replace attention with **implicit long convolutions plus gating**, computed via FFT in O(N log N).

    # Conceptual pseudocode: linear, filter_mlp and fft_conv are placeholders
    # x: (batch, seq, dim)
    # v: value (linear projection of x)
    # h: a learnable long filter per channel
    # plus gates g1, g2, ...
    def hyena_operator(x):
        v = linear(x)
        h = filter_mlp(positions)   # position embedding to long filter
        y = fft_conv(v, h)          # FFT-based convolution in O(N log N)
        g = sigmoid(linear(x))      # gate
        return g * y
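The `fft_conv` step is the part that buys the O(N log N) cost, and it is worth seeing once in full. The sketch below is a plain-Python causal convolution through a hand-written radix-2 FFT, checked against the direct O(N^2) sum; it handles one channel, and real implementations call a library FFT instead.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def fft_conv(v, h):
    """Causal convolution y[t] = sum_{s<=t} h[s] * v[t-s], via zero-padded FFT."""
    n = len(v)
    size = 1
    while size < 2 * n:  # pad so the circular convolution does not wrap around
        size *= 2
    Vf = fft([complex(a) for a in v] + [0j] * (size - n))
    Hf = fft([complex(a) for a in h] + [0j] * (size - len(h)))
    y = fft([a * b for a, b in zip(Vf, Hf)], inverse=True)
    return [val.real / size for val in y[:n]]

def direct_conv(v, h):
    """Reference: the same causal convolution as an explicit double loop."""
    return [sum(h[s] * v[t - s] for s in range(min(t + 1, len(h)))) for t in range(len(v))]

v = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
h = [0.9, 0.5, 0.25, 0.1, 0.05, 0.01]  # a filter as long as the sequence
y_fft = fft_conv(v, h)
y_ref = direct_conv(v, h)
```

The zero-padding to at least 2N is what keeps the result causal; without it the FFT computes a circular convolution and late inputs leak into early outputs.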

Strength: faster than attention on very long sequences.

Weakness: no selective mechanism like Mamba's, so information routing is less flexible. After 2024, Mamba absorbed most of the attention, but Stanford's H3, Hyena filter and Striped Hyena keep appearing inside hybrids.

8 · RWKV (Bo Peng) — the RNN, rediscovered

Site: rwkv.com. Paper: Peng et al., "RWKV: Reinventing RNNs for the Transformer Era", EMNLP 2023. arXiv:2305.13048.

Started as a near-solo project by Bo Peng (GitHub handle BlinkDL). The name stands for **R**eceptance, **W**eight, **K**ey, **V**alue. The trick: express the same function so that **training runs in parallel like a Transformer, but inference runs sequentially like an RNN.**

The block has two pieces — "time-mixing" and "channel-mixing".

    input x_t
        │
        ▼
    time-mixing    ──▶ compute R, W, K, V; the decayed weighted combination is the "WKV" state
        │
        ▼
    channel-mixing ──▶ mix channels (gated feed-forward)
        │
        ▼
    output y_t

What's nice:

- No KV cache — state is fixed size.

- Very fast per-token decoding.

- Fully open source. Weights and training code public.

Roadmap 2024–2025:

- **RWKV-5 "Eagle"** — extended to matrix-valued state.

- **RWKV-6 "Finch"** — introduced Mamba-like selective dynamics.

- **RWKV-7 "Goose"** — at 7B scale, competing with Llama 3.

The project is hosted by the LF AI & Data Foundation, part of the Linux Foundation. The Korean and Japanese communities are unusually active here.

9 · RetNet (Microsoft) — Retentive Networks

Paper: Sun et al., "Retentive Network: A Successor to Transformer for Large Language Models", Jul 2023. arXiv:2307.08621.

Microsoft Research Asia's answer. RetNet's attraction is the **triple representation** of its retention mechanism.

- Parallel form — used at training time, single GPU pass over all tokens. Replaces softmax with an exponential decay mask.

- Recurrent form — used at inference, one constant-size state. O(1) per token.

- Chunkwise form — for long inputs, processes by blocks.

training: parallel ──▶ fully utilize GPU compute

inference: recurrent ──▶ one state per token

long input: chunkwise ──▶ block-wise efficient
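The equivalence of the parallel and recurrent forms is easy to verify on small inputs. The sketch below implements bare retention with a single decay gamma and leaves out the paper's extras (position rotation, normalization, multiple heads), so it is a simplified illustration rather than the full RetNet layer.

```python
import random

def retention_parallel(Q, K, V, gamma):
    """(Q K^T * D) V with decay mask D[n][m] = gamma^(n-m) for m <= n, else 0."""
    out = []
    for n in range(len(Q)):
        row = [0.0] * len(V[0])
        for m in range(n + 1):
            w = sum(q * k for q, k in zip(Q[n], K[m])) * gamma ** (n - m)
            row = [r + w * v for r, v in zip(row, V[m])]
        out.append(row)
    return out

def retention_recurrent(Q, K, V, gamma):
    """S_n = gamma * S_{n-1} + K_n^T V_n ;  o_n = Q_n S_n, with one fixed-size state S."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        S = [[gamma * S[i][j] + k[i] * v[j] for j in range(d_v)] for i in range(d_k)]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out

rng = random.Random(0)
T, d = 6, 3
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
o_par = retention_parallel(Q, K, V, gamma=0.9)
o_rec = retention_recurrent(Q, K, V, gamma=0.9)
max_err = max(abs(a - b) for ra, rb in zip(o_par, o_rec) for a, b in zip(ra, rb))
```

The chunkwise form is the obvious mix: the parallel computation inside each block, with the state S carried between blocks.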

This triple-face property is closely related to Mamba 2's SSD.

Follow-ups: Microsoft's **YOCO** (You Only Cache Once, 2024) and **DiffTransformer** (2024) absorb RetNet ideas and push them further.

10 · Griffin (DeepMind) — gated linear RNN

Paper: De et al., "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models", Feb 2024. arXiv:2402.19427.

DeepMind's contribution. Griffin's block is **RG-LRU** (Real-Gated Linear Recurrent Unit) interleaved with **local sliding-window attention**.

Griffin block = RG-LRU (linear RNN) + Local Attention (sliding window)

Highlights:

- Efficient on TPU and GPU at training, comparable to Mamba.

- More stable than Mamba on very long contexts.

- Scaled up to 14B parameters; matches Llama 2 on downstream tasks despite being trained on over 6 times fewer tokens.

The same paper introduces **Hawk**, an attention-less variant that uses only RG-LRU.

In April 2024 Google DeepMind released **RecurrentGemma**, a 2B open model based on Griffin, with weights on Hugging Face. It is reported to match Gemma 2B in quality with much cheaper inference on long sequences.

11 · S5 (Stanford) — improved state space

Paper: Smith et al., "Simplified State Space Layers for Sequence Modeling", ICLR 2023. arXiv:2208.04933.

A follow-on to S4 (Albert Gu's 2021 thesis work). S4 stacks single-input single-output SSMs across channels; S5 processes all channels together as MIMO.

Wins:

- Smaller hidden state at equal expressivity.

- One parallel scan processes all channels — GPU-friendly.

- 87.4% average accuracy on Long Range Arena, and 98.5% on Path-X, its hardest task.

S5 (alongside LRU, GSS and MEGA) carried the field through the year or two before Mamba arrived. Even in 2026, S5-style cores still show up in time-series models.

12 · Linear attention — the Schmidhuber lineage

Schlag, Irie, Schmidhuber, "Linear Transformers Are Secretly Fast Weight Programmers", ICML 2021. arXiv:2102.11174. And Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", ICML 2020. arXiv:2006.16236.

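Softmax attention weights each value by exp(q·k) and then divides by a row normalizer. Replace the exponential kernel with a dot product of non-negative feature maps phi, and the sums over keys can be factored out: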

\text{Attention}(Q, K, V)_i = \frac{\phi(Q_i)^{\top} \sum_j \phi(K_j) V_j^{\top}}{\phi(Q_i)^{\top} \sum_j \phi(K_j)}

For causal attention the sums run over j ≤ i, so both can be kept as running totals: O(1) work per token, independent of how many tokens came before. The thing is now an RNN, exactly the title of the Katharopoulos paper, "Transformers are RNNs".
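A short numerical check of that claim, as a sketch: the causal version of the formula computed the quadratic way, and again from two running sums. The feature map elu(x) + 1 is the one used by Katharopoulos et al.; the random inputs are illustrative.

```python
import math
import random

def phi(x):
    """Feature map elu(x) + 1, which keeps every entry positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention_parallel(Q, K, V):
    """For each i, sum explicitly over all j <= i (quadratic work overall)."""
    out = []
    for i, q in enumerate(Q):
        fq = phi(q)
        num, den = [0.0] * len(V[0]), 0.0
        for j in range(i + 1):
            w = sum(a * b for a, b in zip(fq, phi(K[j])))
            num = [n + w * v for n, v in zip(num, V[j])]
            den += w
        out.append([n / den for n in num])
    return out

def linear_attention_recurrent(Q, K, V):
    """Same outputs from running sums S = sum phi(k) v^T and z = sum phi(k)."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        S = [[S[a][b] + fk[a] * v[b] for b in range(d_v)] for a in range(d_k)]
        z = [za + fa for za, fa in zip(z, fk)]
        fq = phi(q)
        den = sum(a * b for a, b in zip(fq, z))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / den for b in range(d_v)])
    return out

rng = random.Random(1)
T, d = 6, 3
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
o_par = linear_attention_parallel(Q, K, V)
o_rec = linear_attention_recurrent(Q, K, V)
max_err = max(abs(a - b) for ra, rb in zip(o_par, o_rec) for a, b in zip(ra, rb))
```

The state is a d_k x d_v matrix plus a d_k vector, fixed in size, which is the "fast weight" memory of the Schlag, Irie and Schmidhuber paper.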

Descendants: Gated Linear Attention (GLA, 2024), DeltaNet, the matrix-valued RWKV variants. Mamba 2's SSD framework unifies them all under one roof.

13 · xLSTM (Sepp Hochreiter, May 2024) — LSTM returns

Paper: Beck et al., "xLSTM: Extended Long Short-Term Memory", May 2024. arXiv:2405.04517.

Sepp Hochreiter co-authored the original LSTM paper with Jürgen Schmidhuber in 1997. xLSTM is his group's attempt to make LSTMs viable at LLM scale.

Two new blocks:

- **sLSTM** — scalar memory with new exponential gates.

- **mLSTM** — matrix memory with parallelizable covariance update.

The key tricks are **exponential gating** and **memory mixing**, attacking LSTM's two old limits — bounded capacity and the difficulty of parallel training.

xLSTM block = mLSTM (matrix memory, parallel) + sLSTM (scalar memory, exponential gate)

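The original paper compares xLSTM against Llama-style Transformers and Mamba at up to 1.3B parameters; the group later released a 7B xLSTM that it reports as competitive with similar-size Transformer and Mamba models. In Europe, NXAI in Linz is commercializing xLSTM-based models, and there are spinoffs (and inspirations) at Sakana AI.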

14 · Jamba (AI21, Mar 2024) — Mamba + Transformer hybrid

Paper: AI21 Labs, "Jamba: A Hybrid Transformer-Mamba Language Model", Mar 2024. arXiv:2403.19887.

Israel's AI21 Labs released the first large-scale **hybrid open model** — 52B parameters, 12B active MoE. The point: SSM plus attention at production scale.

The block pattern: one attention layer in every block of eight layers (a 1:7 attention-to-Mamba ratio), with an MoE replacing the dense MLP in every second layer.

    [Mamba] [Mamba+MoE] [Mamba] [Mamba+MoE] [Attn] [Mamba+MoE] [Mamba] [Mamba+MoE] ...
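A hybrid stack like this is fully described by two periods, which makes it easy to generate and inspect. The helper below is a hypothetical sketch: the function name and the two offsets (where in each period the attention and MoE layers fall) are assumptions chosen for illustration, with a 1-in-8 attention pattern and MoE on every second layer.

```python
def jamba_style_schedule(n_layers, attn_period=8, attn_offset=4, moe_period=2, moe_offset=1):
    """Label each layer with its sequence mixer (attn or mamba) and its MLP (moe or dense)."""
    layers = []
    for i in range(n_layers):
        mixer = "attn" if i % attn_period == attn_offset else "mamba"
        mlp = "moe" if i % moe_period == moe_offset else "dense"
        layers.append(f"{mixer}+{mlp}")
    return layers

schedule = jamba_style_schedule(32)
n_attn = sum(1 for name in schedule if name.startswith("attn"))
n_moe = sum(1 for name in schedule if name.endswith("moe"))
print(schedule[:8])
print(n_attn, "attention layers,", n_moe, "MoE layers out of", len(schedule))
```

Only the attention layers keep a KV cache, so under these assumptions a 32-layer stack caches K/V for 4 layers instead of 32. That factor of eight is where the long-context memory savings come from.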

Wins:

- A 256K-token context window; AI21 reports about 140K tokens fitting on a single 80GB GPU, out of reach for a similar-size dense Transformer.

- KV cache is tiny, so inference throughput is roughly 3x.

- At matched perplexity, about 2.5x faster than Llama 2 70B.

Follow-ups: **Jamba 1.5 Mini/Large** (Aug 2024), **Jamba 1.6** (2025). NVIDIA Hymba and IBM Bamba follow the same pattern.

15 · Falcon Mamba 7B (Aug 2024) — UAE's pure SSM

Lab: TII (Technology Innovation Institute), Abu Dhabi.

Released August 2024. TII billed it as **the first strong 7B-scale general language model trained as a pure Mamba.** Until then the public Mamba checkpoints topped out at 1.4B and 2.8B; Falcon Mamba moved the needle.

Highlights:

- Zero attention layers, only Mamba blocks.

- 5.5T tokens of training — Llama 3-class budget.

- MMLU and friends comparable to Llama 3.1 8B and Mistral 7B.

- Weights on Hugging Face under the TII Falcon License 2.0, a permissive license based on Apache 2.0.

TII later extended Falcon Mamba into Jamba-style hybrids for comparison studies. The signal: SSMs have left the toy stage.

16 · Test-Time Training (Sun et al, Jul 2024) — learning during inference

Paper: Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States", Jul 2024. arXiv:2407.04620.

TTT's idea — make the **hidden state itself a small, learnable model**, and update its parameters by SGD as the sequence streams in.

input token ──▶ inner-loop SGD ──▶ weights of the hidden-state MLP updated

│

▼

output prediction

Properties:

- Information is compressed into a tiny MLP far better than a plain RNN state.

- In-context learning emerges naturally from explicit inner-loop updates.

- Linear-time inference like Mamba.
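The inner loop is small enough to write out. The sketch below is a deliberately simplified take on the linear variant: the hidden state is the weight matrix W of a linear model, the self-supervised task is plain reconstruction of the current token, and each token triggers one SGD step. The paper's learned projections, normalization and mini-batching are all omitted, and the learning rate is an arbitrary choice.

```python
def ttt_linear(xs, lr=0.1):
    """Hidden state = weights W of a linear model, updated by one SGD step per token."""
    d = len(xs[0])
    W = [[0.0] * d for _ in range(d)]
    outputs, losses = [], []
    for x in xs:
        pred = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]
        err = [p - xi for p, xi in zip(pred, x)]
        losses.append(sum(e * e for e in err))  # loss before this token's update
        # Gradient of ||W x - x||^2 with respect to W is 2 * err * x^T.
        W = [[W[i][j] - lr * 2.0 * err[i] * x[j] for j in range(d)] for i in range(d)]
        # The layer's output uses the freshly updated state.
        outputs.append([sum(W[i][j] * x[j] for j in range(d)) for i in range(d)])
    return outputs, losses

# Feeding the same token repeatedly: the state "learns" it, so the loss shrinks.
tokens = [[1.0, 0.5]] * 5
outputs, losses = ttt_linear(tokens)
print(losses[:2])  # [1.25, 0.703125]
```

The memory is still fixed-size (a d x d matrix), but what gets stored is decided by gradient descent on a loss rather than by a hand-designed gate.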

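The authors (Yu Sun and collaborators at Stanford, UC San Diego, UC Berkeley and Meta, including Tatsunori Hashimoto) evaluated models from 125M to 1.3B parameters and reported matching or beating both a strong Transformer baseline and Mamba, with the gap widening at long context. The paper's two instantiations are TTT-Linear and TTT-MLP; in 2025–2026 these and further variants keep appearing.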

17 · DiT (Diffusion Transformer) — the spine of Sora

Paper: Peebles and Xie, "Scalable Diffusion Models with Transformers", ICCV 2023. arXiv:2212.09748.

William Peebles (then at UC Berkeley) and Saining Xie (NYU) showed that **replacing U-Net with a Transformer backbone in diffusion models works better**. Pieces:

- Patchify the input (in practice a VAE latent, not raw pixels) into a sequence of tokens.

- AdaLN-Zero injects diffusion timestep and conditioning via LayerNorm scale and shift.

- Otherwise a standard Transformer.

image ──▶ patch embedder ──▶ [DiT block] x N ──▶ noise prediction

│

▼

condition (timestep, class, text)
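Two of the pieces are mechanical enough to sketch in a few lines: turning a grid into patch tokens, and the adaptive LayerNorm modulation with a zero-initialized gate. This is a toy single-channel version in plain Python; the function names are made up here, and in the real model the shift, scale and gate vectors come from an MLP over the timestep and conditioning embedding.

```python
import math

def patchify(image, p):
    """Split an H x W grid (list of rows) into flattened p x p patches, row-major."""
    H, W = len(image), len(image[0])
    return [[image[top + dy][left + dx] for dy in range(p) for dx in range(p)]
            for top in range(0, H, p) for left in range(0, W, p)]

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned parameters, as used before the modulation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_modulate(x, shift, scale):
    """Conditioning enters as a per-feature scale and shift on the normalized token."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

def dit_block(x, shift, scale, gate, sublayer):
    """x + gate * sublayer(modulate(norm(x))); with gate = 0 the block is the identity."""
    y = sublayer(adaln_modulate(x, shift, scale))
    return [xi + g * yi for xi, g, yi in zip(x, gate, y)]

image = [[4 * r + c for c in range(4)] for r in range(4)]  # toy 4 x 4 "latent"
tokens = patchify(image, p=2)
print(tokens[0])  # [0, 1, 4, 5]

token = [float(v) for v in tokens[0]]
double = lambda t: [2.0 * v for v in t]  # stand-in for attention or the MLP
at_init = dit_block(token, [0.0] * 4, [0.0] * 4, [0.0] * 4, double)
```

The zero gate is the "Zero" in AdaLN-Zero: every block starts as an identity map, which the DiT paper found helps training.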

The significance — **OpenAI Sora** (Feb 2024), Stable Diffusion 3, Flux, Lumina-T2X and almost every SOTA image/video generator since 2024 are DiT-family. For video, spatio-temporal patches become tokens.

2026 variants:

- **PixArt-Σ** — efficient DiT down to mobile.

- **HunyuanDiT, CogVideoX** — Chinese teams.

- **MovieGen, Veo 2, Sora 2** — US big-tech.

- **Stable Video Diffusion 2** — Stability AI.

18 · MoE — Mixtral / DeepSeek-V3 / Million Experts

Mixture of Experts traces back to Jacobs et al. 1991, "Adaptive Mixtures of Local Experts". Shazeer et al. 2017 revived it as sparsely-gated MoE. By 2023–2025, MoE became the dominant pattern for the biggest LLMs.

Core — **many parameters, few activations.** Split each FFN into N experts and only fire k of them per token.
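"Only fire k of them" comes down to a top-k over router scores. The sketch below follows the common recipe of softmax-normalizing over the selected experts only; the experts are toy functions, the router logits are given rather than computed, and load balancing is ignored.

```python
import math

def top_k_route(logits, k):
    """Return {expert_index: weight} for the k largest logits, softmax-normalized."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)  # the other experts are never evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

# Four toy "experts" that just scale their input by 1, 2, 3 and 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
out, gates = moe_forward([1.0, -2.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(sorted(gates))  # [1, 3]
```

Compute per token scales with k, while the parameter count scales with the number of experts; the gap between those two is the gap between the active and total columns in the table that follows.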

| Model | Total params | Active params | Released |
| --- | --- | --- | --- |
| Switch Transformer | 1.6T | ~7B | 2021 (Google) |
| Mixtral 8x7B | 47B | 13B | Dec 2023 (Mistral) |
| Mixtral 8x22B | 141B | 39B | Apr 2024 |
| DBRX | 132B | 36B | Mar 2024 (Databricks) |
| DeepSeek-V3 | 671B | 37B | Dec 2024 (DeepSeek) |
| DeepSeek-R1 | 671B | 37B | Jan 2025 (reasoning variant) |
| Qwen3-235B | 235B | 22B | 2025 |

**DeepSeek-V3** (Dec 2024) was the shock of the year. 671B total, 37B active, 14.8T tokens of training in roughly 2.79M H800 GPU-hours, about USD 5.6M at the report's assumed USD 2 per GPU-hour, to reach GPT-4 class performance. Two key ingredients: **auxiliary-loss-free load balancing** for routing, new in V3, and **Multi-head Latent Attention (MLA)**, carried over from DeepSeek-V2.

**Mixture of A Million Experts** (Google DeepMind, PEER, Jul 2024). He, "Mixture of A Million Experts", arXiv:2407.04153. A product-key memory retrieves a handful of tiny experts out of more than a million, so routing becomes essentially a dictionary lookup. A glimpse of where sparse models are headed.

19 · Long context — Gemini 2M / Magic LTM-2-mini 100M

The other megatrend of 2024–2026 is the **explosion of context length**.

2023: Claude 2: 100K, GPT-4: 32K

2024: Gemini 1.5 Pro: 1M, Claude 3: 200K

2024.8: Magic LTM-2-mini: 100M tokens

2025: Gemini 2.5: 2M, Claude 4: 200K

2026: 1M+ is standard for many frontier models

The stack that made it possible:

- **Algorithms**: Flash Attention 3, Ring Attention, PagedAttention (vLLM), StreamingLLM.

- **Positional embeddings**: RoPE then YaRN then LongRoPE plus NTK-aware scaling.

- **Architecture**: SSM hybrids (Jamba, Hymba) hold up better than dense Transformers in memory.

- **Data**: long-context fine-tuning, needle-in-a-haystack evaluation.
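The positional-embedding line is the least self-explanatory part of the stack, so here is a sketch of RoPE itself and of the simplest extension trick, linear position interpolation (dividing positions by a constant so that a longer input maps back into the trained range). YaRN, NTK-aware scaling and LongRoPE rescale different frequencies by different amounts; that refinement is not implemented here, and the `scale` parameter is this sketch's own name.

```python
import math

def rope(x, pos, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of x by the angle (pos / scale) * base^(-i/d), i = 0, 2, 4, ..."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = (pos / scale) * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (3 in both cases).
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 103), rope(k, 100))

# Position interpolation: with scale=2, position 8 is encoded like position 4.
squeezed = rope(q, 8, scale=2.0)
original = rope(q, 4)
```

Relative-offset invariance is why RoPE extends at all: the model never sees absolute positions, only rotation angles, so the extension methods only need to keep those angles inside the range seen during training.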

**Magic LTM-2-mini** (Aug 2024) sits slightly off the standard track. To hit 100M tokens, the team built a **new sequence architecture (LTM, long-term memory)** rather than scaling attention. Near-perfect recall on 100M-token needle-in-a-haystack. The architecture is largely undisclosed but widely believed to combine SSM with hash-based retrieval.

20 · Korea — Naver HyperCLOVA X / Kakao Brain / KAIST

The Korean ecosystem is catching up fast.

- **Naver HyperCLOVA X (HCX)**. HCX-Seed went public in 2024; 2025 added HCX-Speech and HCX-Vision for multimodal coverage. Internally it mixes Llama 3-family Transformers fine-tuned on Korean and Japanese with home-grown training. HCX-3.5 in 2025 publicly introduced an MoE.

- **Kakao Brain — KoGPT, mini.kanana**. Korean Stable Diffusion fine-tunes and KakaoTalk integration. KoChat 7B/30B launched in 2024; in 2025 the internal multimodal assistant Kanana shipped.

- **KAIST AI**. Edward Choi's group on medical LLMs, Sung Ju Hwang's group on efficient training, Se-Young Yun's group on distillation. KAIST also leads work like SAIDA (Sparse Attention via Importance Distillation) on efficient attention.

- **Upstage Solar**, **NCSOFT VARCO**, **LG AI Research EXAONE 3.5/4.0** — all Transformer-based.

- **Sionic AI**, **Nota** — on-device compression and quantization.

Korean strengths cluster around (1) Korean and Japanese tokenizer optimization, (2) edge and on-device compression, and (3) domain specialization in medicine and law. Pure SSM research is still mostly in academia.

21 · Japan — Sakana AI / NTT Tsuzumi / ELYZA / PFN

The Japanese scene is its own creature.

- **Sakana AI** (Tokyo, 2023; David Ha and Llion Jones). Famous for **evolutionary model merging**. EvoLLM-JP (2024) evolutionarily combined Japanese math model weights to set SOTA. 2025's The AI Scientist v2 is less about the model than about the autonomous research agent.

- **NTT Tsuzumi**. NTT's Japanese LLM, public from 2023 in 7B and 13B sizes, aimed at on-premise deployment for Japanese enterprises. Tsuzumi 2 in 2025 added multimodal.

- **ELYZA** (a University of Tokyo spinout). The strongest player on Llama-based Japanese tuning. Llama-3-ELYZA-JP-8B; the ELYZA-Tasks-100 evaluation set. Became a KDDI subsidiary in 2024.

- **Preferred Networks (PFN)**. Industrial AI for Toyota's autonomy and for drug discovery. Owns the MN-3 supercomputer; built PLaMo 100B. PLaMo Translate (2025) competes with GPT-4 on Japanese-English-Korean translation.

- **AI Inside, Rinna, Stockmark, Karakuri** — domain-focused mid-size players.

Japan's character clusters around (1) meta-level approaches such as evolution and autonomous research (Sakana), (2) industrial integration in manufacturing, autonomy and pharma (PFN), and (3) rich, high-quality Japanese datasets. University of Tokyo and Kyoto University labs are unusually active.

22 · Who should pick which?

Three personas.

Academic researchers

- **Studying expressivity limits** — Transformer mainline. Anthropic's interpretability tooling and the mechanistic interpretability community.

- **Efficient sequence models** — Mamba 2, RWKV-7, xLSTM, TTT.

- **Theory** — the SSD framework (Dao and Gu 2024) and the linear-attention family papers.

Production teams optimizing inference cost

- **High-concurrency cloud services** — Mixtral 8x22B, DeepSeek-V3, Jamba 1.6. MoE gives many parameters with few active.

- **On-device and edge** — RWKV-7 1.5B/3B, RecurrentGemma, quantized Falcon Mamba 7B. Small or no KV cache.

- **Cutting GPU cost** — SSM hybrids at matched perplexity push throughput by 2x to 3x.

Teams that need long context

- **1M+ context** — Gemini 2.5 or Magic LTM-2-mini through SaaS is the realistic answer.

- **Self-hosted 256K–1M** — Jamba 1.6, Hymba, Bamba — Mamba+Transformer hybrids.

- **Time series and long memory** — TTT, S5, TimeMixer.

Image and video generation

- **Image** — DiT family (Stable Diffusion 3, Flux, PixArt).

- **Video** — Sora 2, Veo 2, MovieGen, CogVideoX, HunyuanVideo, Stable Video 2 — all DiT.

23 · Hands-on — 30 minutes to feel an SSM

The fastest way to feel what an SSM is like is to start with Mamba 2.

    # 1) environment
    conda create -n ssm python=3.11 -y
    conda activate ssm
    pip install torch==2.4.0 transformers accelerate
    pip install mamba-ssm causal-conv1d

    # 2) smallest possible script
    import torch
    from transformers import AutoTokenizer
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

    # Mamba 2 130M: baby checkpoint for tinkering. The state-spaces checkpoints
    # ship without a tokenizer and are loaded through mamba-ssm's own model
    # class; they use the GPT-NeoX tokenizer.
    name = "state-spaces/mamba2-130m"
    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = MambaLMHeadModel.from_pretrained(name, device="cuda", dtype=torch.float16)

    prompt = "State space models are"
    ids = tok(prompt, return_tensors="pt").input_ids.cuda()
    out = model.generate(input_ids=ids, max_length=ids.shape[1] + 128)
    print(tok.decode(out[0]))

    # 3) RWKV-7 feels similar
    pip install rwkv
    # or HuggingFace's RWKV/rwkv-7-world-1.5B

    # 4) Jamba 1.6 (needs big VRAM, 80GB H100 recommended)
    # quote the version spec so the shell does not treat ">" as a redirect
    pip install "transformers>=4.42" mamba-ssm causal-conv1d

    from transformers import AutoModelForCausalLM
    AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")

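At small sizes and short prompts the difference is modest. Push the sequence length up, though, and on the same hardware you should see per-token decode time stay flat where a comparable Transformer's grows, on the order of 2x to 3x faster at long lengths, along with an almost-flat memory curve because there is no KV cache. Compare `torch.cuda.memory_allocated()` across sequence lengths to see it.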

24 · Conclusion — what 2026 architecture means

For seven years one architecture won every sequence problem. To say that era is over would be too strong — the Transformer is still the SOTA centre. But five shifts are clear in 2026:

1. **Frontier LLMs increasingly go MoE.** DeepSeek-V3, Qwen3, Mixtral and the next big closed models.

2. **On-device and edge belong to SSMs and hybrids.** RecurrentGemma, RWKV-7, Falcon Mamba.

3. **Image and video are DiT country.** Sora 2, Veo 2, MovieGen.

4. **Long-context algorithms** (Flash Attention 3, Ring Attention) reshaped both training and inference.

5. **Korean and Japanese ecosystems** are settling into their own character — domain depth, industrial integration, meta-level research.

The engineer's job — don't be loyal to a single architecture. Pick the right tool for the job. And keep an eye on the next five years, because the parade hasn't stopped.

References

- Vaswani et al., "Attention is All You Need", NeurIPS 2017. https://arxiv.org/abs/1706.03762

- Dao et al., "FlashAttention", NeurIPS 2022. https://arxiv.org/abs/2205.14135

- Dao, "FlashAttention-2", 2023. https://arxiv.org/abs/2307.08691

- Shah et al., "FlashAttention-3", 2024. https://arxiv.org/abs/2407.08608

- Liu et al., "Ring Attention", 2023. https://arxiv.org/abs/2310.01889

- Gu and Dao, "Mamba", 2023. https://arxiv.org/abs/2312.00752

- Dao and Gu, "Transformers are SSMs (Mamba 2 / SSD)", 2024. https://arxiv.org/abs/2405.21060

- Poli et al., "Hyena Hierarchy", 2023. https://arxiv.org/abs/2302.10866

- Peng et al., "RWKV", EMNLP 2023. https://arxiv.org/abs/2305.13048

- RWKV Foundation. https://rwkv.com

- Sun et al., "Retentive Network (RetNet)", 2023. https://arxiv.org/abs/2307.08621

- De et al., "Griffin", 2024. https://arxiv.org/abs/2402.19427

- Google RecurrentGemma. https://huggingface.co/google/recurrentgemma-2b

- Smith et al., "S5", ICLR 2023. https://arxiv.org/abs/2208.04933

- Katharopoulos et al., "Linear Transformers / Transformers are RNNs", 2020. https://arxiv.org/abs/2006.16236

- Schlag, Irie, Schmidhuber, "Linear Transformers as Fast Weight Programmers", 2021. https://arxiv.org/abs/2102.11174

- Beck et al., "xLSTM", 2024. https://arxiv.org/abs/2405.04517

- AI21 Labs, "Jamba", 2024. https://arxiv.org/abs/2403.19887

- TII Falcon Mamba 7B. https://huggingface.co/tiiuae/falcon-mamba-7b

- Sun et al., "Test-Time Training (TTT)", 2024. https://arxiv.org/abs/2407.04620

- Peebles and Xie, "DiT", 2022. https://arxiv.org/abs/2212.09748

- DeepSeek-V3 Tech Report. https://arxiv.org/abs/2412.19437

- He, "Mixture of A Million Experts (PEER)", 2024. https://arxiv.org/abs/2407.04153

- Mixtral of Experts. https://arxiv.org/abs/2401.04088

- Magic LTM-2-mini. https://magic.dev/blog/100m-token-context-windows

- Gemini 1.5 Technical Report. https://arxiv.org/abs/2403.05530

- Sakana AI EvoLLM. https://arxiv.org/abs/2403.13187

- NTT Tsuzumi. https://www.rd.ntt/e/research/JN202310_18075.html

- ELYZA Llama-JP. https://huggingface.co/elyza

- Preferred Networks PLaMo. https://www.preferred.jp/en/projects/llm/

- Naver HyperCLOVA X. https://clova.ai/en/ko-llm

- KAIST AI. https://gsai.kaist.ac.kr