- Introduction
- Evolution of the RWKV Series
- Core Mechanism: Dynamic State Evolution
- Full Architecture Overview
- Benchmark Results
- Practical Usage: Working with RWKV-7
- RWKV-7 vs Mamba-2 vs Transformer
- Conclusion

Introduction
Transformers are the foundation of modern LLMs, but they have a fundamental limitation: O(n²) attention cost. RWKV-7 "Goose" is a new sequence modeling architecture that addresses this problem, achieving constant memory usage and consistent inference time per token while delivering performance on par with Transformers.
We present an in-depth analysis of this paper ("RWKV-7 Goose with Expressive Dynamic State Evolution"), published at ICML in July 2025.
Evolution of the RWKV Series
From RWKV-4 to RWKV-7
RWKV (Receptance Weighted Key Value) is an architecture that combines the strengths of RNNs and Transformers:
- RWKV-4 (2023): Introduced the foundational WKV mechanism, a variant of linear attention
- RWKV-5 "Eagle" (2024): Multi-headed WKV, improved performance
- RWKV-6 "Finch" (2024): Data-dependent decay, LoRA integration
- RWKV-7 "Goose" (2025): Dynamic State Evolution, breaking the TC0 barrier
Attention vs Linear Attention vs RWKV-7
# Standard Attention: O(n^2) time, O(n) memory
# output = softmax(Q @ K^T / sqrt(d)) @ V
# Linear Attention: O(n) time, O(d^2) memory (constant)
# output = (Q @ (K^T @ V)) -- reordering matrix multiplication
# RWKV-7: O(n) time, O(d^2) memory (constant)
# + Dynamic State Evolution maximizes expressiveness
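The reordering that linear attention relies on is plain matrix-multiplication associativity: once the softmax is removed, (QKᵀ)V and Q(KᵀV) are the same quantity computed with very different memory footprints. A minimal sanity check with random matrices (no causal masking, for simplicity):

```python
import torch

torch.manual_seed(0)
n, d = 128, 16  # sequence length, head dimension
Q = torch.randn(n, d, dtype=torch.float64)
K = torch.randn(n, d, dtype=torch.float64)
V = torch.randn(n, d, dtype=torch.float64)

# Standard order: materialize the n x n attention matrix -- O(n^2) memory
out_quadratic = (Q @ K.T) @ V

# Linear-attention order: contract K and V into a d x d state first -- O(d^2) memory
out_linear = Q @ (K.T @ V)

# Associativity guarantees the two orders agree (up to rounding)
print(torch.allclose(out_quadratic, out_linear))  # True
```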
Core Mechanism: Dynamic State Evolution
Limitations of Existing Linear Attention
Existing linear attention (including RWKV-4 through 6) lies within TC0 (Threshold Circuit Class 0), meaning there are problems it theoretically cannot solve (under the standard conjecture that TC0 ≠ NC1):
# TC0 limitation example: state tracking problem
# Input: "A is in room1. A moves to room2. B is in room1. Where is A?"
# TC0 models are theoretically incapable of tracking such state changes
RWKV-7's Dynamic State Evolution
RWKV-7 dynamically modifies the state transition matrix itself based on the input:
import torch
import torch.nn as nn

class RWKV7_TimeMix(nn.Module):
    """RWKV-7 Core: Dynamic State Evolution"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Input-dependent parameter generators
        self.W_r = nn.Linear(d_model, d_model)  # Receptance
        self.W_k = nn.Linear(d_model, d_model)  # Key
        self.W_v = nn.Linear(d_model, d_model)  # Value
        self.W_a = nn.Linear(d_model, d_model)  # State transition
        self.W_g = nn.Linear(d_model, d_model)  # Gate
        # Dynamic decay parameters
        self.w_decay = nn.Parameter(torch.randn(n_heads, self.d_head))
        # Learnable scale for the input-dependent part of the transition
        self.A_log = nn.Parameter(torch.randn(n_heads, self.d_head, self.d_head))

    def forward(self, x, state=None):
        B, T, C = x.shape
        H, D = self.n_heads, self.d_head
        r = self.W_r(x).view(B, T, H, D)
        k = self.W_k(x).view(B, T, H, D)
        v = self.W_v(x).view(B, T, H, D)
        a = torch.sigmoid(self.W_a(x).view(B, T, H, D))
        g = torch.sigmoid(self.W_g(x).view(B, T, H, D))
        if state is None:
            state = torch.zeros(B, H, D, D, device=x.device)
        outputs = []
        for t in range(T):
            # Dynamic State Evolution: the core innovation
            # Previous: state = diag(w) @ state + k^T @ v  (fixed, diagonal decay)
            # RWKV-7:   state = A(x_t) @ state + k^T @ v   (dynamic transition matrix)
            A_t = self._compute_transition(a[:, t])  # input-dependent transition matrix
            # State update: a matrix product, so state dimensions can interact
            kv = torch.einsum('bhd,bhe->bhde', k[:, t], v[:, t])
            state = torch.einsum('bhde,bhef->bhdf', A_t, state) + kv
            # Output computation
            out = torch.einsum('bhd,bhde->bhe', r[:, t], state)
            out = out * g[:, t]
            outputs.append(out)
        output = torch.stack(outputs, dim=1).reshape(B, T, C)
        return output, state

    def _compute_transition(self, a_t):
        """Dynamically generate transition matrix based on input"""
        # a_t: [B, H, D] - transition parameters derived from current input
        # A non-diagonal A(x_t) is the key to how RWKV-7 surpasses TC0
        decay = torch.exp(-torch.exp(self.w_decay))  # per-channel decay in (0, 1)
        A = decay.unsqueeze(-1) * torch.eye(self.d_head, device=a_t.device).unsqueeze(0)
        # Input-dependent rank-1 correction
        A = A + torch.einsum('bhd,bhe->bhde', a_t, a_t) * torch.exp(self.A_log).unsqueeze(0)
        return A
Why It Surpasses TC0
# RWKV-6 (previous): state_{t+1} = diag(w) * state_t + k_t * v_t^T
# -> Diagonal transition: each dimension decays independently
# -> Stays within TC0: certain state-tracking computations are inexpressible
# RWKV-7 (new): state_{t+1} = A(x_t) * state_t + k_t * v_t^T
# -> A(x_t) is an input-dependent, non-diagonal transition matrix
# -> Cross-dimension interaction becomes possible: state tracking is solvable
# -> Expressiveness beyond TC0
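The gap between the two updates is easiest to see on a toy two-dimensional state. A diagonal transition can only scale each dimension in place; a non-diagonal one (here a hypothetical swap matrix, standing in for a learned A(x_t)) can move information between dimensions, which is exactly what state tracking requires:

```python
import torch

s = torch.tensor([1.0, 0.0])  # toy state: "the object is in room 1"

# Diagonal transition (RWKV-6 style): dimensions never interact
diag = torch.diag(torch.tensor([0.9, 0.5]))
print(diag @ s)  # tensor([0.9000, 0.0000]) -- no mass ever reaches dim 1

# Non-diagonal transition (RWKV-7 style): a swap encodes "moved to room 2"
swap = torch.tensor([[0.0, 1.0],
                     [1.0, 0.0]])
print(swap @ s)  # tensor([0., 1.])
```

No diagonal matrix, however its entries are chosen, can produce the second result; that is the intuition behind the expressiveness claim.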
Full Architecture Overview
RWKV-7 Block Composition
class RWKV7Block(nn.Module):
    """RWKV-7 basic block"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.time_mix = RWKV7_TimeMix(d_model, n_heads)
        self.channel_mix = RWKV7_ChannelMix(d_model)

    def forward(self, x, state=None):
        # Time Mixing (inter-token interaction)
        h, state = self.time_mix(self.ln1(x), state)
        x = x + h
        # Channel Mixing (serves as FFN)
        x = x + self.channel_mix(self.ln2(x))
        return x, state

class RWKV7_ChannelMix(nn.Module):
    """Channel mixing (gated MLP with squared ReLU)"""
    def __init__(self, d_model, expand=3.5):
        super().__init__()
        hidden = int(d_model * expand)
        self.W_key = nn.Linear(d_model, hidden)
        self.W_value = nn.Linear(hidden, d_model)
        self.W_gate = nn.Linear(d_model, d_model)  # gate matches the output width

    def forward(self, x):
        k = self.W_key(x)
        v = self.W_value(torch.relu(k) ** 2)  # squared ReLU
        g = torch.sigmoid(self.W_gate(x))
        return v * g
Inference Mode: Operating as an RNN
class RWKV7_Inference:
    """Operates in RNN mode during inference: O(d^2) computation per token"""
    def __init__(self, model):
        self.model = model
        # One recurrent state per layer: [n_heads, d_head, d_head]
        self.states = [None] * len(model.blocks)

    def generate_token(self, token_id):
        x = self.model.embed(token_id)   # [d_model]
        x = x.unsqueeze(0).unsqueeze(0)  # [1, 1, d_model]: batch of 1, length 1
        for i, block in enumerate(self.model.blocks):
            x, self.states[i] = block(x, self.states[i])
        x = x.squeeze(0).squeeze(0)
        logits = self.model.head(self.model.ln_out(x))
        return logits

# Memory usage: constant regardless of sequence length!
# Transformer: KV Cache uses O(n * d) memory
# RWKV-7: State uses O(d^2) fixed memory
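To make the O(n·d) vs O(d²) contrast concrete, here is back-of-the-envelope arithmetic for a hypothetical 3B-scale configuration (32 layers, d_model = 2560, 40 heads, fp16); the exact figures vary by model, but the scaling behavior does not:

```python
n_layers, d_model, n_heads, bytes_fp16 = 32, 2560, 40, 2

def kv_cache_bytes(seq_len):
    # Transformer: per layer, K and V each hold seq_len x d_model values
    return n_layers * 2 * seq_len * d_model * bytes_fp16

def rwkv_state_bytes():
    # RWKV-7: per layer, one d_head x d_head state matrix per head
    d_head = d_model // n_heads
    return n_layers * n_heads * d_head * d_head * bytes_fp16

print(f"KV cache @ 1K tokens:      {kv_cache_bytes(1024) / 2**20:.0f} MiB")   # 320 MiB
print(f"KV cache @ 64K tokens:     {kv_cache_bytes(65536) / 2**20:.0f} MiB")  # 20480 MiB
print(f"RWKV-7 state (any length): {rwkv_state_bytes() / 2**20:.0f} MiB")     # 10 MiB
```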
Benchmark Results
3B Model Comparison (Based on Paper Table 1)
| Model | Params | MMLU | HellaSwag | ARC-E | Inference Memory |
|---|---|---|---|---|---|
| LLaMA-3.2 3B | 3.2B | 63.4 | 77.2 | 79.5 | O(n) KV Cache |
| Mamba-2 2.7B | 2.7B | 58.1 | 73.8 | 74.2 | O(1) constant |
| RWKV-6 3B | 3.0B | 58.9 | 74.5 | 75.1 | O(1) constant |
| RWKV-7 3B | 3.0B | 61.2 | 76.8 | 78.3 | O(1) constant |
Multilingual Benchmarks (A Key Strength of RWKV-7)
# RWKV-7 excels particularly in multilingual performance
# Trained on 100+ languages, achieving SOTA-level results for non-English languages
# On Korean/Japanese/Chinese benchmarks,
# outperforms same-size Transformers
# -> Efficient multilingual tokenization + long context handling
Inference Efficiency
# Inference cost comparison by sequence length
#
# Sequence Length | Transformer | RWKV-7
# 1K | 1x | 0.8x
# 4K | 4x | 0.8x
# 16K | 16x | 0.8x
# 64K | 64x | 0.8x
# 1M | OOM | 0.8x
#
# RWKV-7 maintains constant cost regardless of sequence length!
Practical Usage: Working with RWKV-7
Loading Models from Hugging Face
# pip install rwkv torch
from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Load model
model = RWKV(
    model='/path/to/RWKV-7-World-3B',
    strategy='cuda fp16'  # GPU FP16
)
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# Text generation: sampling parameters go through PIPELINE_ARGS
context = "The advantages of Kubernetes VPA are"
args = PIPELINE_ARGS(temperature=0.8, top_p=0.9)
result = pipeline.generate(context, token_count=200, args=args)
print(result)
Serving with vLLM
# RWKV-7 vLLM serving (experimental support)
pip install "vllm>=0.6.0"
python -m vllm.entrypoints.openai.api_server \
  --model RWKV/rwkv-7-world-3b \
  --tokenizer RWKV/rwkv-7-world-3b \
  --dtype float16 \
  --max-model-len 32768
Running Locally with Ollama
# After downloading the RWKV-7 GGUF model
ollama create rwkv7 -f Modelfile
# Modelfile example:
# FROM rwkv-7-world-3b-q4_k_m.gguf
# PARAMETER temperature 0.7
# PARAMETER num_ctx 32768
ollama run rwkv7
RWKV-7 vs Mamba-2 vs Transformer
Architecture Comparison
| Property | Transformer | Mamba-2 | RWKV-7 |
|---|---|---|---|
| Time Complexity | O(n²) | O(n) | O(n) |
| Inference Memory | O(n·d) KV Cache | O(d²) constant | O(d²) constant |
| Parallel Training | Fully parallel | Chunk parallel | Fully parallel |
| Expressiveness | TC0 | TC0 | Beyond TC0 |
| Hardware Optimization | Mature | Evolving | Evolving |
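One note on the parallel-training row: training does not run the token-by-token loop shown earlier. A linear recurrence with multiplicative decay is associative, so it can be evaluated with cumulative products and sums (chunked scans in the matrix-valued case). A simplified scalar-decay sketch, with hypothetical decays a_t and inputs b_t:

```python
import torch

torch.manual_seed(0)
T = 64
a = torch.rand(T, dtype=torch.float64) * 0.2 + 0.8  # per-step decays in (0.8, 1.0)
b = torch.randn(T, dtype=torch.float64)             # per-step inputs (k_t v_t in the real model)

# Sequential RNN-style evaluation: s_t = a_t * s_{t-1} + b_t
s, seq = torch.tensor(0.0, dtype=torch.float64), []
for t in range(T):
    s = a[t] * s + b[t]
    seq.append(s)
seq = torch.stack(seq)

# Parallel evaluation: s_t = P_t * sum_{i<=t} b_i / P_i, where P_t = prod_{j<=t} a_j
P = torch.cumprod(a, dim=0)
par = P * torch.cumsum(b / P, dim=0)

print(torch.allclose(seq, par))  # True
```

The cumprod/cumsum form computes all T states without a sequential dependency, which is what lets these models train on whole sequences at once.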
When Should You Choose RWKV-7?
# Scenarios where RWKV-7 excels:
# 1. Very long context processing (100K+ tokens)
# 2. Edge device inference (constant memory)
# 3. Multilingual services (Korean/Japanese/Chinese, etc.)
# 4. Real-time streaming (constant time per token)
# Scenarios where Transformers are still better:
# 1. When maximum performance is needed (especially in English)
# 2. Short contexts (under 1K)
# 3. Leveraging existing ecosystem/tools
Conclusion
RWKV-7 Goose is a groundbreaking architecture that achieves Transformer-level performance while maintaining the efficiency of constant memory + linear time. Through Dynamic State Evolution, it has broken through the theoretical limitations of existing Linear Attention (TC0), and it particularly excels in multilingual and long-context scenarios.
Quiz (7 Questions)
Q1. What is the inference memory complexity of RWKV-7? O(d²) constant — independent of sequence length
Q2. What is the limitation of TC0 (Threshold Circuit Class 0)? Linear Attention with finite depth theoretically cannot solve problems like state tracking
Q3. What is the core mechanism that allows RWKV-7 to surpass TC0? Dynamic State Evolution — input-dependent transition matrix A(x_t) enables cross-dimension interaction
Q4. Why does Transformer inference memory scale as O(n·d)? Because the KV Cache grows proportionally to sequence length
Q5. Why can RWKV-7 be trained in parallel? The WKV operation has a structure that can be parallelized in chunk units
Q6. Why does RWKV-7 excel particularly in multilingual tasks? World tokenizer trained on 100+ languages + efficient long context processing
Q7. What activation function is used in RWKV-7's Channel Mix? Squared ReLU (the square of ReLU)