RWKV-7 "Goose" Architecture Analysis: A Linear-Time Model Surpassing Transformers
Introduction
Transformers are the foundation of modern LLMs, but they have a fundamental limitation: **O(n²) attention cost**. RWKV-7 "Goose" is a new sequence modeling architecture that addresses this problem, achieving **constant memory usage** and **consistent inference time per token** while delivering performance on par with Transformers.
We present an in-depth analysis of this paper ("RWKV-7 Goose with Expressive Dynamic State Evolution"), published at ICML in July 2025.
Evolution of the RWKV Series
From RWKV-4 to RWKV-7
RWKV (Receptance Weighted Key Value) is an architecture that combines the strengths of RNNs and Transformers:
- **RWKV-4 (2023)**: Introduced the foundational WKV mechanism. A variant of Linear Attention
- **RWKV-5 "Eagle" (2024)**: Multi-headed WKV, improved performance
- **RWKV-6 "Finch" (2024)**: Data-dependent decay, LoRA integration
- **RWKV-7 "Goose" (2025)**: **Dynamic State Evolution**, breaking the TC0 barrier
Attention vs Linear Attention vs RWKV-7
    # Standard Attention: O(n^2) time, O(n) memory
    output = softmax(Q @ K^T / sqrt(d)) @ V

    # Linear Attention: O(n) time, O(d^2) memory (constant)
    output = Q @ (K^T @ V)  # reordered matrix multiplication

    # RWKV-7: O(n) time, O(d^2) memory (constant)
    # + Dynamic State Evolution for greater expressiveness
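The reordering behind Linear Attention can be checked numerically. The sketch below is a minimal NumPy illustration, assuming the softmax is dropped and no causal mask is applied. The function names are chosen here for illustration and do not come from the paper.

```python
import numpy as np

def quadratic_form(Q, K, V):
    """(Q K^T) V: materializes an n x n score matrix."""
    return (Q @ K.T) @ V

def linear_form(Q, K, V):
    """Q (K^T V): only a d x d matrix is materialized."""
    return Q @ (K.T @ V)

rng = np.random.default_rng(0)
n, d = 128, 16  # sequence length and head dimension (arbitrary values)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out_quadratic = quadratic_form(Q, K, V)
out_linear = linear_form(Q, K, V)
print(np.allclose(out_quadratic, out_linear))
```

Both forms give the same result because matrix multiplication is associative. The second form never builds the n x n matrix, which is where the O(d^2) memory figure comes from.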
Core Mechanism: Dynamic State Evolution
Limitations of Existing Linear Attention
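Existing Linear Attention (including RWKV-4 through 6) falls within the complexity class **TC0**, the class of problems solvable by constant-depth threshold circuits. Under the standard conjecture that TC0 is strictly smaller than NC1, such models cannot solve certain inherently sequential problems, such as state tracking over arbitrarily long inputs: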
    # TC0 limitation example: a state tracking problem
    # Input: "A is in room1. A moves to room2. B is in room1. Where is A?"
    # A TC0 model cannot, in theory, track such state changes over arbitrarily long inputs
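State tracking is often formalized as composing a sequence of permutations. The sketch below is a hypothetical toy case, not an example from the paper: a ball starts under cup 0 and the cups are swapped in sequence. Each swap is a permutation matrix applied to a one-hot position vector, so the answer depends on the order of the swaps.

```python
import numpy as np

def swap_matrix(i, j, n=3):
    """Permutation matrix that exchanges positions i and j."""
    P = np.eye(n, dtype=int)
    P[[i, j]] = P[[j, i]]
    return P

def track(swaps, start=0, n=3):
    """Apply the swaps in order and return the final position of the ball."""
    pos = np.zeros(n, dtype=int)
    pos[start] = 1
    for i, j in swaps:
        pos = swap_matrix(i, j, n) @ pos
    return int(np.argmax(pos))

final_a = track([(0, 1), (1, 2)])  # 0 -> 1 -> 2
final_b = track([(1, 2), (0, 1)])  # 0 -> 0 -> 1
print(final_a, final_b)  # 2 1
```

The same two swaps in a different order leave the ball in a different place. That order sensitivity is what a model must capture to track state.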
RWKV-7's Dynamic State Evolution
RWKV-7 **dynamically modifies the state transition matrix itself based on the input**:
    import torch
    import torch.nn as nn


    class RWKV7_TimeMix(nn.Module):
        """RWKV-7 Core: Dynamic State Evolution"""

        def __init__(self, d_model, n_heads):
            super().__init__()
            self.d_model = d_model
            self.n_heads = n_heads
            self.d_head = d_model // n_heads

            # Input-dependent parameter generators
            self.W_r = nn.Linear(d_model, d_model)  # Receptance
            self.W_k = nn.Linear(d_model, d_model)  # Key
            self.W_v = nn.Linear(d_model, d_model)  # Value
            self.W_a = nn.Linear(d_model, d_model)  # State transition
            self.W_g = nn.Linear(d_model, d_model)  # Gate

            # Dynamic decay parameters
            self.w_decay = nn.Parameter(torch.randn(n_heads, self.d_head))

            # Learnable parameters of the state evolution matrix
            self.A_log = nn.Parameter(torch.randn(n_heads, self.d_head, self.d_head))

        def forward(self, x, state=None):
            B, T, C = x.shape
            H = self.n_heads
            D = self.d_head

            r = self.W_r(x).view(B, T, H, D)
            k = self.W_k(x).view(B, T, H, D)
            v = self.W_v(x).view(B, T, H, D)
            a = torch.sigmoid(self.W_a(x).view(B, T, H, D))
            g = torch.sigmoid(self.W_g(x).view(B, T, H, D))

            if state is None:
                state = torch.zeros(B, H, D, D, device=x.device, dtype=x.dtype)

            outputs = []
            for t in range(T):
                # Dynamic State Evolution: the core innovation
                # Previous: state = decay * state + k^T @ v    (diagonal decay)
                # RWKV-7:   state = A(x_t) @ state + k^T @ v   (dynamic transition matrix)

                # Input-dependent transition matrix, shape [B, H, D, D]
                A_t = self._compute_transition(a[:, t], state)

                # State update: a matrix product (not element-wise), so that
                # dimensions of the state can interact
                kv = torch.einsum('bhd,bhe->bhde', k[:, t], v[:, t])
                state = A_t @ state + kv

                # Output computation
                out = torch.einsum('bhd,bhde->bhe', r[:, t], state)
                out = out * g[:, t]
                outputs.append(out)

            output = torch.stack(outputs, dim=1).view(B, T, C)
            return output, state

        def _compute_transition(self, a_t, state):
            """Dynamically generate transition matrix based on input"""
            # a_t: [B, H, D] - transition parameters derived from current input
            # This is the key to how RWKV-7 surpasses TC0

            # Diagonal decay term, shape [H, D, D]
            decay = torch.exp(-torch.exp(self.w_decay))
            A = decay.unsqueeze(-1) * torch.eye(
                self.d_head, device=a_t.device
            ).unsqueeze(0)

            # Input-dependent correction (outer product of a_t with itself)
            A = A + torch.einsum('bhd,bhe->bhde', a_t, a_t) * \
                torch.exp(self.A_log).unsqueeze(0)

            return A
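The recurrence in the loop above can also be checked in isolation. The sketch below is a single-head NumPy version of the same illustrative update, `state = A_t @ state + outer(k_t, v_t)`. The 0.1 scale on the rank-one term and all array sizes are arbitrary assumptions, and this is not the official RWKV-7 kernel. It shows that processing a sequence in two pieces while carrying the state forward gives the same outputs as processing it in one pass.

```python
import numpy as np

def run_recurrence(r, k, v, a, decay, state=None):
    """Single-head recurrence: S_t = A_t @ S_{t-1} + outer(k_t, v_t), o_t = r_t @ S_t."""
    T, D = r.shape
    if state is None:
        state = np.zeros((D, D))
    outputs = np.empty((T, D))
    for t in range(T):
        # Diagonal decay plus an input-dependent rank-one term (scale is arbitrary)
        A_t = np.diag(decay) + 0.1 * np.outer(a[t], a[t])
        state = A_t @ state + np.outer(k[t], v[t])
        outputs[t] = r[t] @ state
    return outputs, state

rng = np.random.default_rng(0)
T, D = 8, 4
r, k, v = (rng.standard_normal((T, D)) for _ in range(3))
a = rng.uniform(0.0, 1.0, (T, D))   # stands in for the sigmoid output
decay = rng.uniform(0.3, 0.6, D)

full_out, full_state = run_recurrence(r, k, v, a, decay)

# Same sequence in two pieces, passing the state across the boundary
first_out, mid_state = run_recurrence(r[:5], k[:5], v[:5], a[:5], decay)
second_out, end_state = run_recurrence(r[5:], k[5:], v[5:], a[5:], decay, mid_state)
split_out = np.concatenate([first_out, second_out])

print(np.allclose(full_out, split_out), np.allclose(full_state, end_state))
```

Because everything the model remembers sits in the fixed-size `state`, token-by-token generation only needs that matrix from the previous step.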
Why It Surpasses TC0
    # RWKV-6 (previous): state_{t+1} = diag(w_t) @ state_t + k_t^T @ v_t
    #   -> Diagonal transition: each dimension decays independently
    #   -> Stays within TC0: cannot track state at constant depth

    # RWKV-7 (new):      state_{t+1} = A(x_t) @ state_t + k_t^T @ v_t
    #   -> A(x_t) is an input-dependent transition matrix (non-diagonal)
    #   -> Cross-dimension interaction possible: can solve state tracking problems
    #   -> Expressiveness beyond TC0
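One way to see the difference is commutativity. Diagonal transitions always commute, so the product of many of them does not depend on the order of the inputs. A diagonal-plus-rank-one transition generally does not commute. The sketch below uses small hand-picked 2 x 2 matrices to show this. The values are illustrative assumptions, not parameters from any trained model.

```python
import numpy as np

def diagonal_transition(w):
    """RWKV-6 style transition: per-dimension decay only."""
    return np.diag(w)

def rank_one_transition(w, a):
    """Diagonal decay plus an outer-product term that mixes dimensions."""
    return np.diag(w) + np.outer(a, a)

w1, w2 = np.array([0.5, 0.5]), np.array([0.9, 0.2])
a1, a2 = np.array([1.0, 1.0]), np.array([1.0, 0.0])

D1, D2 = diagonal_transition(w1), diagonal_transition(w2)
A1, A2 = rank_one_transition(w1, a1), rank_one_transition(w2, a2)

diag_commutes = np.allclose(D1 @ D2, D2 @ D1)
full_commutes = np.allclose(A1 @ A2, A2 @ A1)
print(diag_commutes, full_commutes)  # True False
```

Order-sensitive products of transition matrices are the ingredient needed for problems like permutation tracking. This small check illustrates that ingredient and does not prove the complexity result.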
Full Architecture Overview
RWKV-7 Block Composition
    class RWKV7Block(nn.Module):
        """RWKV-7 basic block"""

        def __init__(self, d_model, n_heads):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.ln2 = nn.LayerNorm(d_model)
            self.time_mix = RWKV7_TimeMix(d_model, n_heads)
            self.channel_mix = RWKV7_ChannelMix(d_model)

        def forward(self, x, state=None):
            # Time Mixing (inter-token interaction)
            h, state = self.time_mix(self.ln1(x), state)
            x = x + h

            # Channel Mixing (serves as FFN)
            x = x + self.channel_mix(self.ln2(x))
            return x, state


    class RWKV7_ChannelMix(nn.Module):
        """Channel mixing: squared-ReLU feed-forward with a sigmoid gate"""

        def __init__(self, d_model, expand=3.5):
            super().__init__()
            hidden = int(d_model * expand)
            self.W_key = nn.Linear(d_model, hidden)
            self.W_value = nn.Linear(hidden, d_model)
            self.W_gate = nn.Linear(d_model, d_model)

        def forward(self, x):
            k = self.W_key(x)
            v = self.W_value(torch.relu(k) ** 2)  # squared ReLU
            g = torch.sigmoid(self.W_gate(x))     # gate over the d_model outputs
            return v * g
Inference Mode: Operating as an RNN
    class RWKV7_Inference:
        """Operates in RNN mode during inference: O(d^2) computation per token"""

        def __init__(self, model):
            self.model = model
            # One state per layer, each [1, n_heads, d_head, d_head].
            # None lets each block create a zero state on first use.
            self.states = [None] * len(model.blocks)

        def generate_token(self, token_id):
            # Assumes token_id is a scalar LongTensor, so x has shape [d_model]
            x = self.model.embed(token_id)
            for i, block in enumerate(self.model.blocks):
                # Add batch and time dimensions: [1, 1, d_model]
                x, self.states[i] = block(x.unsqueeze(0).unsqueeze(0),
                                          self.states[i])
                x = x.squeeze(0).squeeze(0)
            logits = self.model.head(self.model.ln_out(x))
            return logits

    # Memory usage: constant regardless of sequence length!
    # Transformer: KV Cache uses O(n * d) memory
    # RWKV-7: State uses O(d^2) fixed memory
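To put numbers on the two memory formulas, the sketch below estimates both sizes for a hypothetical configuration: 32 layers, `d_model` of 2560, head size 64, and 2-byte values. These figures are assumptions for illustration, not the published RWKV-7 3B configuration. The estimate also ignores grouped-query attention and any small auxiliary states a real implementation keeps.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_value=2):
    """Keys and values: two [seq_len, d_model] tensors per layer."""
    return 2 * n_layers * seq_len * d_model * bytes_per_value

def recurrent_state_bytes(n_layers, n_heads, d_head, bytes_per_value=2):
    """One [d_head, d_head] matrix per head per layer, independent of seq_len."""
    return n_layers * n_heads * d_head * d_head * bytes_per_value

# Hypothetical configuration (assumed for illustration)
N_LAYERS, D_MODEL, D_HEAD = 32, 2560, 64
N_HEADS = D_MODEL // D_HEAD

state_mib = recurrent_state_bytes(N_LAYERS, N_HEADS, D_HEAD) / 2**20
for seq_len in (1_024, 65_536, 1_048_576):
    kv_mib = kv_cache_bytes(seq_len, N_LAYERS, D_MODEL) / 2**20
    print(f"{seq_len:>9} tokens: KV cache {kv_mib:>10.1f} MiB, state {state_mib:.1f} MiB")
```

Under these assumptions the recurrent state stays at 10 MiB. The KV cache is 320 MiB at 1K tokens and grows in proportion to the sequence length from there.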
Benchmark Results
3B Model Comparison (Based on Paper Table 1)
| Model | Params | MMLU | HellaSwag | ARC-E | Inference Memory |
| ------------- | -------- | -------- | --------- | -------- | ----------------- |
| LLaMA-3.2 3B | 3.2B | 63.4 | 77.2 | 79.5 | O(n) KV Cache |
| Mamba-2 2.7B | 2.7B | 58.1 | 73.8 | 74.2 | O(1) constant |
| RWKV-6 3B | 3.0B | 58.9 | 74.5 | 75.1 | O(1) constant |
| **RWKV-7 3B** | **3.0B** | **61.2** | **76.8** | **78.3** | **O(1) constant** |
Multilingual Benchmarks (A Key Strength of RWKV-7)
RWKV-7 is particularly strong on multilingual tasks. It is trained on more than 100 languages and reaches SOTA-level results for non-English languages. On Korean, Japanese, and Chinese benchmarks it outperforms Transformers of the same size. Contributing factors are efficient multilingual tokenization and long-context handling.
Inference Efficiency
Relative per-token inference cost by sequence length:

| Sequence Length | Transformer | RWKV-7 |
| --------------- | ----------- | ------ |
| 1K              | 1x          | 0.8x   |
| 4K              | 4x          | 0.8x   |
| 16K             | 16x         | 0.8x   |
| 64K             | 64x         | 0.8x   |
| 1M              | OOM         | 0.8x   |

RWKV-7 maintains a constant per-token cost regardless of sequence length.
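The shape of this comparison follows from a rough per-token cost model. The sketch below counts only the sequence-mixing work in one layer: attention over the cached tokens, versus a fixed number of sweeps over a `d_head x d_head` state per head. The constants, including the assumed three sweeps per token, are illustrative guesses rather than measurements. They are meant to show the scaling and will not reproduce the 0.8x figure.

```python
def attention_flops_per_token(context_len, d_model):
    """Rough per-layer cost of attending over `context_len` cached tokens."""
    scores = 2 * context_len * d_model        # q . k_i for every cached key
    weighted_sum = 2 * context_len * d_model  # sum_i p_i * v_i
    return scores + weighted_sum

def recurrent_flops_per_token(d_model, d_head, passes=3):
    """Rough per-layer cost of a fixed-size state update and readout.

    `passes` is an assumed number of sweeps over each d_head x d_head state
    (for example transition, write, and read).
    """
    n_heads = d_model // d_head
    return passes * 2 * n_heads * d_head * d_head

D_MODEL, D_HEAD = 2560, 64  # hypothetical sizes
recurrent = recurrent_flops_per_token(D_MODEL, D_HEAD)
for context_len in (1_024, 4_096, 16_384, 65_536):
    ratio = attention_flops_per_token(context_len, D_MODEL) / recurrent
    print(f"{context_len:>6} tokens: attention / recurrent = {ratio:.1f}")
```

The attention term grows linearly with context length and the recurrent term does not, which matches the pattern in the table above.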
Practical Usage: Working with RWKV-7
Loading Models from Hugging Face
pip install rwkv torch transformers
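    import os

    # The rwkv package README asks for this flag to be set before the import
    # when loading RWKV-7 checkpoints
    os.environ["RWKV_V7_ON"] = "1"

    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE, PIPELINE_ARGS

    # Load model
    model = RWKV(
        model='/path/to/RWKV-7-World-3B',
        strategy='cuda fp16'  # GPU FP16
    )
    pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

    # Text generation (sampling options are passed through PIPELINE_ARGS)
    context = "The advantages of Kubernetes VPA are"
    result = pipeline.generate(
        context,
        token_count=200,
        args=PIPELINE_ARGS(temperature=0.8, top_p=0.9),
    )
    print(result)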
Serving with vLLM
    # RWKV-7 vLLM serving (experimental support)
    # Quote the requirement so the shell does not treat ">" as a redirect
    pip install "vllm>=0.6.0"

    python -m vllm.entrypoints.openai.api_server \
        --model RWKV/rwkv-7-world-3b \
        --tokenizer RWKV/rwkv-7-world-3b \
        --dtype float16 \
        --max-model-len 32768
Running Locally with Ollama
    # After downloading the RWKV-7 GGUF model
    ollama create rwkv7 -f Modelfile

    # Modelfile example:
    #   FROM rwkv-7-world-3b-q4_k_m.gguf
    #   PARAMETER temperature 0.7
    #   PARAMETER num_ctx 32768

    ollama run rwkv7
RWKV-7 vs Mamba-2 vs Transformer
Architecture Comparison
| Property | Transformer | Mamba-2 | RWKV-7 |
| --------------------- | --------------- | -------------- | -------------- |
| Time Complexity | O(n²) | O(n) | O(n) |
| Inference Memory | O(n·d) KV Cache | O(d²) constant | O(d²) constant |
| Parallel Training     | Fully parallel  | Chunk parallel | Chunk parallel |
| Expressiveness        | TC0             | TC0            | **Beyond TC0** |
| Hardware Optimization | Mature | Evolving | Evolving |
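Chunk-parallel training for linear-recurrent models can be illustrated on plain causal linear attention. The sketch below is a simplified stand-in with no decay and no RWKV-7 transition matrix, and its function names are chosen here for illustration. Within each chunk the work is a few dense matrix products, and only a `d x d` state is carried from one chunk to the next.

```python
import numpy as np

def recurrent_linear_attention(Q, K, V):
    """Token-by-token reference: S_t = S_{t-1} + outer(k_t, v_t), o_t = q_t @ S_t."""
    T, D = Q.shape
    S = np.zeros((D, V.shape[1]))
    out = np.empty_like(V)
    for t in range(T):
        S = S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def chunked_linear_attention(Q, K, V, chunk=4):
    """Same result, computed one chunk at a time with dense matrix products."""
    T, D = Q.shape
    S = np.zeros((D, V.shape[1]))
    out = np.empty_like(V)
    for s in range(0, T, chunk):
        q, k, v = Q[s:s + chunk], K[s:s + chunk], V[s:s + chunk]
        inter = q @ S                   # contribution of all earlier chunks
        intra = np.tril(q @ k.T) @ v    # causal contribution inside the chunk
        out[s:s + chunk] = inter + intra
        S = S + k.T @ v                 # carry the state to the next chunk
    return out

rng = np.random.default_rng(0)
T, D = 10, 4
Q, K, V = (rng.standard_normal((T, D)) for _ in range(3))
reference = recurrent_linear_attention(Q, K, V)
chunked = chunked_linear_attention(Q, K, V, chunk=4)
print(np.allclose(reference, chunked))
```

Real kernels also have to fold decay and transition terms into the chunk computation, which this sketch leaves out.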
When Should You Choose RWKV-7?
Scenarios where RWKV-7 excels:
1. Very long context processing (100K+ tokens)
2. Edge device inference (constant memory)
3. Multilingual services (Korean/Japanese/Chinese, etc.)
4. Real-time streaming (constant time per token)
Scenarios where Transformers are still better:
1. When maximum performance is needed (especially in English)
2. Short contexts (under 1K tokens)
3. Leveraging existing ecosystem/tools
Conclusion
RWKV-7 Goose is a groundbreaking architecture that achieves **Transformer-level performance** while maintaining the efficiency of **constant memory + linear time**. Through Dynamic State Evolution, it has broken through the theoretical limitations of existing Linear Attention (TC0), and it particularly excels in multilingual and long-context scenarios.
**Q1. What is the inference memory complexity of RWKV-7?**
O(d²) constant — independent of sequence length
**Q2. What is the limitation of TC0 (constant-depth threshold circuits)?**
Models confined to TC0, such as constant-depth Linear Attention, theoretically cannot solve problems like state tracking
**Q3. What is the core mechanism that allows RWKV-7 to surpass TC0?**
Dynamic State Evolution — input-dependent transition matrix A(x_t) enables cross-dimension interaction
**Q4. Why does Transformer inference memory scale as O(n·d)?**
Because the KV Cache grows proportionally to sequence length
**Q5. Why can RWKV-7 be trained in parallel?**
The WKV operation has a structure that can be parallelized in chunk units
**Q6. Why does RWKV-7 excel particularly in multilingual tasks?**
World tokenizer trained on 100+ languages + efficient long context processing
**Q7. What activation function is used in RWKV-7's Channel Mix?**
Squared ReLU (the square of ReLU)