
RWKV-7 "Goose" Architecture Analysis: A Linear-Time Model Surpassing Transformers


Introduction

Transformers are the foundation of modern LLMs, but they have a fundamental limitation: **attention cost grows as O(n²)** with sequence length n. RWKV-7 "Goose" is a sequence modeling architecture that addresses this problem, achieving **constant memory usage** and **constant inference time per token** while delivering performance on par with Transformers.

We present an in-depth analysis of this paper (RWKV-7 "Goose" with Expressive Dynamic State Evolution), published at ICML in July 2025.

Evolution of the RWKV Series

From RWKV-4 to RWKV-7

RWKV (Receptance Weighted Key Value) is an architecture that combines the strengths of RNNs and Transformers:

- **RWKV-4 (2023)**: Introduced the foundational WKV mechanism. A variant of Linear Attention

- **RWKV-5 "Eagle" (2024)**: Multi-headed WKV, improved performance

- **RWKV-6 "Finch" (2024)**: Data-dependent decay, LoRA integration

- **RWKV-7 "Goose" (2025)**: **Dynamic State Evolution**, breaking the TC0 barrier

Attention vs Linear Attention vs RWKV-7

```text
Standard Attention: O(n^2) time, O(n) memory
    output = softmax(Q @ K^T / sqrt(d)) @ V

Linear Attention: O(n) time, O(d^2) memory (constant in n)
    output = Q @ (K^T @ V)    # reorder the matrix multiplication

RWKV-7: O(n) time, O(d^2) memory (constant in n)
    + Dynamic State Evolution for greater expressiveness
```
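The reordering behind linear attention can be checked numerically. The following is a minimal sketch in plain Python, assuming no softmax and no causal mask; the helper names `matmul` and `transpose` and the toy matrices are illustrative and do not come from the paper. Both orderings give the same output, but the second one only ever builds a d x d matrix instead of an n x n one.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


# Toy sizes (assumed for illustration): n = 4 tokens, d = 2 channels
Q = [[1, 0], [0, 1], [1, 1], [2, 1]]
K = [[1, 2], [0, 1], [1, 0], [1, 1]]
V = [[1, 0], [2, 1], [0, 3], [1, 1]]

# Attention order: (Q K^T) V builds an n x n score matrix first
scores = matmul(Q, transpose(K))
out_quadratic = matmul(scores, V)

# Linear-attention order: Q (K^T V) builds a d x d state first
kv_state = matmul(transpose(K), V)
out_linear = matmul(Q, kv_state)

# Matrix multiplication is associative, so the outputs agree
assert out_quadratic == out_linear
print(f"scores: {len(scores)}x{len(scores[0])}, state: {len(kv_state)}x{len(kv_state[0])}")
```

The size of `kv_state` depends only on d, which is why the memory of the reordered form stays constant as the sequence grows.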

Core Mechanism: Dynamic State Evolution

Limitations of Existing Linear Attention

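Existing Linear Attention (including RWKV-4 through 6) belongs to **TC0 (Threshold Circuit Class 0)**, the class of problems solvable by constant-depth, polynomial-size threshold circuits. Under standard complexity conjectures, this means it cannot solve certain problems: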

```text
TC0 limitation example: state tracking problem

Input: "A is in room1. A moves to room2. B is in room1. Where is A?"

TC0 models cannot, in theory, track such state changes over arbitrarily long sequences of updates
```
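To make "state tracking" concrete, here is a small sketch of a harder variant of the room example, in which occupants of rooms are swapped step by step. The swap formulation and the function name `apply_swaps` are assumptions chosen for illustration, not an example from the paper. The point is that the answer depends on composing every update in order, so reordering the same updates changes the result.

```python
def apply_swaps(rooms, swaps):
    """Return the room occupants after applying each (i, j) swap in order."""
    rooms = list(rooms)  # copy so the caller's list is not modified
    for i, j in swaps:
        rooms[i], rooms[j] = rooms[j], rooms[i]
    return rooms


start = ["A", "B", "C"]    # occupant of room 0, room 1, room 2
swaps = [(0, 1), (1, 2)]   # each step swaps the occupants of two rooms

forward = apply_swaps(start, swaps)
backward = apply_swaps(start, list(reversed(swaps)))

# Same updates, different order, different answer to "Where is A?"
print("A ends in room", forward.index("A"))
print("A ends in room", backward.index("A"), "if the swaps are reversed")
```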

RWKV-7's Dynamic State Evolution

RWKV-7 **dynamically modifies the state transition matrix itself based on the input**:

```python
import torch
import torch.nn as nn


class RWKV7_TimeMix(nn.Module):
    """RWKV-7 core idea: Dynamic State Evolution (simplified sketch)"""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        # Input-dependent parameter generators
        self.W_r = nn.Linear(d_model, d_model)  # Receptance
        self.W_k = nn.Linear(d_model, d_model)  # Key
        self.W_v = nn.Linear(d_model, d_model)  # Value
        self.W_a = nn.Linear(d_model, d_model)  # State transition
        self.W_g = nn.Linear(d_model, d_model)  # Gate

        # Dynamic decay parameters
        self.w_decay = nn.Parameter(torch.randn(n_heads, self.d_head))

        # Learnable parameters of the state evolution matrix
        self.A_log = nn.Parameter(torch.randn(n_heads, self.d_head, self.d_head))

    def forward(self, x, state=None):
        B, T, C = x.shape
        H = self.n_heads
        D = self.d_head

        r = self.W_r(x).view(B, T, H, D)
        k = self.W_k(x).view(B, T, H, D)
        v = self.W_v(x).view(B, T, H, D)
        a = torch.sigmoid(self.W_a(x).view(B, T, H, D))
        g = torch.sigmoid(self.W_g(x).view(B, T, H, D))

        if state is None:
            state = torch.zeros(B, H, D, D, device=x.device, dtype=x.dtype)

        outputs = []
        for t in range(T):
            # Dynamic State Evolution: the core innovation
            # Previous: state = diag(w) @ state + k^T @ v  (diagonal decay)
            # RWKV-7:   state = A(x_t) @ state + k^T @ v   (dynamic transition matrix)

            # Input-dependent transition matrix: [B, H, D, D]
            A_t = self._compute_transition(a[:, t], state)

            # State update: a matrix product (not elementwise), so dimensions can mix
            kv = torch.einsum('bhd,bhe->bhde', k[:, t], v[:, t])
            state = A_t @ state + kv

            # Output computation
            out = torch.einsum('bhd,bhde->bhe', r[:, t], state)
            out = out * g[:, t]
            outputs.append(out)

        output = torch.stack(outputs, dim=1).reshape(B, T, C)
        return output, state

    def _compute_transition(self, a_t, state):
        """Dynamically generate the transition matrix based on the input"""
        # a_t: [B, H, D] - transition parameters derived from the current input
        # The non-diagonal, input-dependent term is the key to how RWKV-7 surpasses TC0
        decay = torch.exp(-torch.exp(self.w_decay))  # [H, D], values in (0, 1)
        A = torch.diag_embed(decay)                  # [H, D, D] diagonal decay

        # Input-dependent correction, broadcast to [B, H, D, D]
        A = A + torch.einsum('bhd,bhe->bhde', a_t, a_t) * torch.exp(self.A_log).unsqueeze(0)
        return A
```

Why It Surpasses TC0

```text
RWKV-6 (previous): state_{t+1} = diag(w) * state_t + k_t * v_t^T
  -> Diagonal matrix multiplication: each dimension decays independently
  -> Within TC0: cannot track state with finite depth

RWKV-7 (new): state_{t+1} = A(x_t) * state_t + k_t * v_t^T
  -> A(x_t) is an input-dependent transition matrix (non-diagonal)
  -> Cross-dimension interaction possible: can solve state tracking problems
  -> Expressiveness beyond TC0
```
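The difference between a diagonal and a non-diagonal transition can be seen on a two-slot toy state. This is an illustrative sketch rather than a proof, and the matrices `SWAP` and `DIAG` and the helper `run` are assumptions made for the example. An input-dependent choice between a swap matrix and the identity tracks the parity of the 1-bits seen so far, whereas a diagonal matrix can only rescale each slot and never moves content from one slot to the other.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


IDENTITY = [[1, 0], [0, 1]]
SWAP = [[0, 1], [1, 0]]        # non-diagonal: moves content between slots
DIAG = [[0.9, 0], [0, 0.5]]    # diagonal: only rescales each slot


def run(transition_for, bits, state):
    """Apply state = A(bit) @ state for every input bit, in order."""
    for bit in bits:
        state = matvec(transition_for(bit), state)
    return state


bits = [1, 0, 1, 1]  # three 1-bits, so the parity is odd

# Input-dependent, non-diagonal transition: swap on 1, keep on 0
parity_state = run(lambda b: SWAP if b else IDENTITY, bits, [1, 0])

# Input-dependent but diagonal transition: slot 1 starts empty and stays empty
diag_state = run(lambda b: DIAG if b else IDENTITY, bits, [1, 0])

print("non-diagonal:", parity_state)  # the marker sits in slot 1 when parity is odd
print("diagonal:    ", diag_state)
```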

Full Architecture Overview

RWKV-7 Block Composition

```python
import torch
import torch.nn as nn


class RWKV7Block(nn.Module):
    """RWKV-7 basic block"""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.time_mix = RWKV7_TimeMix(d_model, n_heads)
        self.channel_mix = RWKV7_ChannelMix(d_model)

    def forward(self, x, state=None):
        # Time Mixing (inter-token interaction)
        h, state = self.time_mix(self.ln1(x), state)
        x = x + h

        # Channel Mixing (serves as FFN)
        x = x + self.channel_mix(self.ln2(x))
        return x, state


class RWKV7_ChannelMix(nn.Module):
    """Channel mixing (gated FFN with squared ReLU)"""

    def __init__(self, d_model, expand=3.5):
        super().__init__()
        hidden = int(d_model * expand)
        self.W_key = nn.Linear(d_model, hidden)
        self.W_value = nn.Linear(hidden, d_model)
        self.W_gate = nn.Linear(d_model, hidden)

    def forward(self, x):
        k = self.W_key(x)
        v = self.W_value(torch.relu(k) ** 2)  # squared ReLU
        g = torch.sigmoid(self.W_gate(x))
        return v * g
```

Inference Mode: Operating as an RNN

```python
import torch


class RWKV7_Inference:
    """Operates in RNN mode during inference: O(d^2) computation per token"""

    def __init__(self, model):
        self.model = model
        # One state per layer, each [batch, n_heads, d_head, d_head].
        # None lets each block create a zero state on the first token.
        self.states = [None] * len(model.blocks)

    @torch.no_grad()
    def generate_token(self, token_id):
        x = self.model.embed(token_id)   # [d_model]
        x = x.unsqueeze(0).unsqueeze(0)  # [1, 1, d_model]
        for i, block in enumerate(self.model.blocks):
            x, self.states[i] = block(x, self.states[i])
        x = x.squeeze(0).squeeze(0)
        logits = self.model.head(self.model.ln_out(x))
        return logits


# Memory usage: constant regardless of sequence length!
# Transformer: KV Cache uses O(n * d) memory
# RWKV-7: State uses O(d^2) fixed memory
```
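The O(n * d) versus O(d^2) comparison can be turned into rough byte counts. The sketch below is a back-of-the-envelope estimate under stated assumptions: the layer count, model width, and head count are hypothetical and not taken from the paper, values are stored in 16 bits, and the Transformer uses plain multi-head attention with no key/value sharing across heads.

```python
def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    """Transformer KV cache: one key and one value vector per token per layer."""
    return 2 * n_tokens * n_layers * d_model * bytes_per_value


def rwkv_state_bytes(n_layers, n_heads, d_head, bytes_per_value=2):
    """Recurrent state: one [n_heads, d_head, d_head] matrix state per layer."""
    return n_layers * n_heads * d_head * d_head * bytes_per_value


# Hypothetical configuration, chosen only for illustration
n_layers, d_model, n_heads = 32, 2560, 40
d_head = d_model // n_heads

state_mib = rwkv_state_bytes(n_layers, n_heads, d_head) / 2**20
for n_tokens in (1_000, 16_000, 1_000_000):
    cache_mib = kv_cache_bytes(n_tokens, n_layers, d_model) / 2**20
    print(f"{n_tokens:>9} tokens: KV cache {cache_mib:10.1f} MiB, state {state_mib:.1f} MiB")
```

Under these assumptions the KV cache grows linearly with the number of tokens, while the recurrent state has the same size at every sequence length.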

Benchmark Results

3B Model Comparison (Based on Paper Table 1)

| Model         | Params   | MMLU     | HellaSwag | ARC-E    | Inference Memory  |
| ------------- | -------- | -------- | --------- | -------- | ----------------- |
| LLaMA-3.2 3B  | 3.2B     | 63.4     | 77.2      | 79.5     | O(n) KV Cache     |
| Mamba-2 2.7B  | 2.7B     | 58.1     | 73.8      | 74.2     | O(1) constant     |
| RWKV-6 3B     | 3.0B     | 58.9     | 74.5      | 75.1     | O(1) constant     |
| **RWKV-7 3B** | **3.0B** | **61.2** | **76.8**  | **78.3** | **O(1) constant** |

Multilingual Benchmarks (A Key Strength of RWKV-7)

- RWKV-7 performs particularly well on multilingual benchmarks.
- It is trained on 100+ languages and achieves SOTA-level results for non-English languages.
- On Korean, Japanese, and Chinese benchmarks, it outperforms same-size Transformers.
- Contributing factors: efficient multilingual tokenization and long-context handling.

Inference Efficiency

Relative inference cost by sequence length:

| Sequence Length | Transformer | RWKV-7 |
| --------------- | ----------- | ------ |
| 1K              | 1x          | 0.8x   |
| 4K              | 4x          | 0.8x   |
| 16K             | 16x         | 0.8x   |
| 64K             | 64x         | 0.8x   |
| 1M              | OOM         | 0.8x   |

RWKV-7 maintains constant cost regardless of sequence length!

Practical Usage: Working with RWKV-7

Loading Models from Hugging Face

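```bash
pip install rwkv torch transformers
```

```python
import os

# Recent releases of the rwkv package select the RWKV-7 code path through this
# variable; it must be set before importing rwkv.model (check your installed version)
os.environ["RWKV_V7_ON"] = "1"

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Load model
model = RWKV(
    model='/path/to/RWKV-7-World-3B',
    strategy='cuda fp16'  # GPU FP16
)
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# Text generation: sampling settings are passed through PIPELINE_ARGS
context = "The advantages of Kubernetes VPA are"
args = PIPELINE_ARGS(temperature=0.8, top_p=0.9)
result = pipeline.generate(context, token_count=200, args=args)
print(result)
```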
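The `temperature` and `top_p` settings used above control how the next token is drawn from the model's logits. The following is a minimal sketch of the usual temperature plus nucleus (top-p) sampling recipe in plain Python. It is not the rwkv package's own implementation, and the function name `sample_top_p` is hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Nucleus sampling sketch: return the index of the sampled token."""
    # Temperature rescales the logits before the softmax (lower = sharper)
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices treats the weights as relative, so no renormalization is needed
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# With one dominant logit and a small top_p, only that token survives the cut
token = sample_top_p([5.0, 1.0, 0.0, -3.0], temperature=0.8, top_p=0.5)
print(token)
```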

Serving with vLLM

```bash
# RWKV-7 vLLM serving (experimental support)
# Quote the requirement so the shell does not treat ">=" as a redirect
pip install "vllm>=0.6.0"

python -m vllm.entrypoints.openai.api_server \
    --model RWKV/rwkv-7-world-3b \
    --tokenizer RWKV/rwkv-7-world-3b \
    --dtype float16 \
    --max-model-len 32768
```

Running Locally with Ollama

```bash
# After downloading the RWKV-7 GGUF model
ollama create rwkv7 -f Modelfile

# Modelfile example:
# FROM rwkv-7-world-3b-q4_k_m.gguf
# PARAMETER temperature 0.7
# PARAMETER num_ctx 32768

ollama run rwkv7
```

RWKV-7 vs Mamba-2 vs Transformer

Architecture Comparison

| Property              | Transformer     | Mamba-2        | RWKV-7         |
| --------------------- | --------------- | -------------- | -------------- |
| Time Complexity       | O(n²)           | O(n)           | O(n)           |
| Inference Memory      | O(n·d) KV Cache | O(d²) constant | O(d²) constant |
| Parallel Training     | Fully parallel  | Chunk parallel | Chunk parallel |
| Expressiveness        | TC0             | TC0            | **Beyond TC0** |
| Hardware Optimization | Mature          | Evolving       | Evolving       |

When Should You Choose RWKV-7?

Scenarios where RWKV-7 excels:

1. Very long context processing (100K+ tokens)

2. Edge device inference (constant memory)

3. Multilingual services (Korean/Japanese/Chinese, etc.)

4. Real-time streaming (constant time per token)

Scenarios where Transformers are still better:

1. When maximum performance is needed (especially in English)

2. Short contexts (under 1K)

3. Leveraging existing ecosystem/tools

Conclusion

RWKV-7 Goose is a groundbreaking architecture that achieves **Transformer-level performance** while maintaining the efficiency of **constant memory + linear time**. Through Dynamic State Evolution, it has broken through the theoretical limitations of existing Linear Attention (TC0), and it particularly excels in multilingual and long-context scenarios.

**Q1. What is the inference memory complexity of RWKV-7?**

O(d²) constant — independent of sequence length

**Q2. What is the limitation of TC0 (Threshold Circuit Class 0)?**

Models in TC0, such as existing Linear Attention with finite depth, theoretically cannot solve problems like state tracking over arbitrarily long inputs

**Q3. What is the core mechanism that allows RWKV-7 to surpass TC0?**

Dynamic State Evolution — input-dependent transition matrix A(x_t) enables cross-dimension interaction

**Q4. Why does Transformer inference memory scale as O(n·d)?**

Because the KV Cache grows proportionally to sequence length

**Q5. Why can RWKV-7 be trained in parallel?**

The WKV operation has a structure that can be parallelized in chunk units

**Q6. Why does RWKV-7 excel particularly in multilingual tasks?**

World tokenizer trained on 100+ languages + efficient long context processing

**Q7. What activation function is used in RWKV-7's Channel Mix?**

Squared ReLU (the square of ReLU)
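The chunk-wise parallelism mentioned in Q5 can be illustrated on a scalar recurrence `state = w * state + u`. This is a toy sketch with hypothetical function names. In RWKV-7 the state and transitions are matrices and the real kernels are more involved, but the idea is the same: each chunk is summarized independently, and only a short loop over chunk summaries stays sequential.

```python
def sequential_scan(decays, inputs, state=0.0):
    """Reference: apply state = w * state + u one step at a time."""
    for w, u in zip(decays, inputs):
        state = w * state + u
    return state


def chunk_summary(decays, inputs):
    """Summarize one chunk as (P, L) so that end_state = P * start_state + L."""
    P, L = 1.0, 0.0
    for w, u in zip(decays, inputs):
        P, L = w * P, w * L + u
    return P, L


def chunked_scan(decays, inputs, chunk_size, state=0.0):
    # Each summary depends only on its own chunk, so these could run in parallel
    summaries = [
        chunk_summary(decays[i:i + chunk_size], inputs[i:i + chunk_size])
        for i in range(0, len(inputs), chunk_size)
    ]
    # Only this loop over chunks (not over tokens) is sequential
    for P, L in summaries:
        state = P * state + L
    return state


decays = [0.5, 0.25, 0.5, 1.0, 0.5, 0.25]
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

full = sequential_scan(decays, inputs)
chunked = chunked_scan(decays, inputs, chunk_size=2)
assert abs(full - chunked) < 1e-12
print(full, chunked)
```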

