
RWKV-7 "Goose" Architecture Analysis — A Linear-Time Model Surpassing Transformers


Introduction

Transformers are the foundation of modern LLMs, but they have a fundamental limitation: O(n²) attention cost. RWKV-7 "Goose" is a new sequence modeling architecture that addresses this problem, achieving constant memory usage and consistent inference time per token while delivering performance on par with Transformers.

We present an in-depth analysis of this paper ("RWKV-7 Goose with Expressive Dynamic State Evolution"), published at ICML in July 2025.

Evolution of the RWKV Series

From RWKV-4 to RWKV-7

RWKV (Receptance Weighted Key Value) is an architecture that combines the strengths of RNNs and Transformers:

  • RWKV-4 (2023): Introduced the foundational WKV mechanism, a variant of Linear Attention
  • RWKV-5 "Eagle" (2024): Multi-headed WKV, improved performance
  • RWKV-6 "Finch" (2024): Data-dependent decay, LoRA integration
  • RWKV-7 "Goose" (2025): Dynamic State Evolution, breaking the TC0 barrier

Attention vs Linear Attention vs RWKV-7

# Standard Attention: O(n^2) time, O(n) memory
# output = softmax(Q @ K^T / sqrt(d)) @ V

# Linear Attention: O(n) time, O(d^2) memory (constant)
# output = (Q @ (K^T @ V)) -- reordering matrix multiplication

# RWKV-7: O(n) time, O(d^2) memory (constant)
# + Dynamic State Evolution maximizes expressiveness
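The reordering that makes Linear Attention linear-time is just matrix associativity: once softmax is removed, (Q Kᵀ) V and Q (Kᵀ V) are algebraically identical, but the second form never materializes the n × n matrix. A quick check (toy sizes, chosen for illustration):

```python
import torch

torch.manual_seed(0)
n, d = 8, 4  # sequence length, head dimension
Q = torch.randn(n, d)
K = torch.randn(n, d)
V = torch.randn(n, d)

# Quadratic order: (Q @ K^T) @ V materializes an n x n attention matrix
quadratic = (Q @ K.T) @ V

# Linear order: Q @ (K^T @ V) only materializes a d x d state
linear = Q @ (K.T @ V)

print(torch.allclose(quadratic, linear, atol=1e-5))  # True
```

The d × d product Kᵀ V is exactly the "state" that recurrent formulations like RWKV carry forward token by token.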

Core Mechanism: Dynamic State Evolution

Limitations of Existing Linear Attention

Existing Linear Attention (including RWKV-4 through 6) belongs to TC0 (Threshold Circuit Class 0). This means it theoretically cannot solve certain problems:

# TC0 limitation example: state tracking problem
# Input: "A is in room1. A moves to room2. B is in room1. Where is A?"
# TC0 models are theoretically incapable of tracking such state changes

RWKV-7's Dynamic State Evolution

RWKV-7 dynamically modifies the state transition matrix itself based on the input:

import torch
import torch.nn as nn

class RWKV7_TimeMix(nn.Module):
    """RWKV-7 Core: Dynamic State Evolution"""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        # Input-dependent parameter generators
        self.W_r = nn.Linear(d_model, d_model)  # Receptance
        self.W_k = nn.Linear(d_model, d_model)  # Key
        self.W_v = nn.Linear(d_model, d_model)  # Value
        self.W_a = nn.Linear(d_model, d_model)  # State transition
        self.W_g = nn.Linear(d_model, d_model)  # Gate

        # Dynamic decay parameters
        self.w_decay = nn.Parameter(torch.randn(n_heads, self.d_head))

        # State evolution matrix learnable parameters
        self.A_log = nn.Parameter(torch.randn(n_heads, self.d_head, self.d_head))

    def forward(self, x, state=None):
        B, T, C = x.shape
        H = self.n_heads
        D = self.d_head

        r = self.W_r(x).view(B, T, H, D)
        k = self.W_k(x).view(B, T, H, D)
        v = self.W_v(x).view(B, T, H, D)
        a = torch.sigmoid(self.W_a(x).view(B, T, H, D))
        g = torch.sigmoid(self.W_g(x).view(B, T, H, D))

        if state is None:
            state = torch.zeros(B, H, D, D, device=x.device)

        outputs = []
        for t in range(T):
            # Dynamic State Evolution: the core innovation
            # Previous: state = decay * state + k^T @ v (fixed decay)
            # RWKV-7: state = A(x_t) @ state + k^T @ v (dynamic transition matrix)

            # Input-dependent transition matrix
            A_t = self._compute_transition(a[:, t], state)

            # State update: apply the dynamic transition matrix (matrix
            # product, not elementwise), then add the k^T v outer product
            kv = torch.einsum('bhd,bhe->bhde', k[:, t], v[:, t])
            state = torch.einsum('bhij,bhjk->bhik', A_t, state) + kv

            # Output computation
            out = torch.einsum('bhd,bhde->bhe', r[:, t], state)
            out = out * g[:, t]
            outputs.append(out)

        output = torch.stack(outputs, dim=1).view(B, T, C)
        return output, state

    def _compute_transition(self, a_t, state):
        """Dynamically generate transition matrix based on input"""
        # a_t: [B, H, D] - transition parameters derived from current input
        # This is the key to how RWKV-7 surpasses TC0
        decay = torch.exp(-torch.exp(self.w_decay))
        A = decay.unsqueeze(-1) * torch.eye(
            self.d_head, device=a_t.device
        ).unsqueeze(0)

        # Input-dependent correction
        A = A + torch.einsum('bhd,bhe->bhde', a_t, a_t) * \
            torch.exp(self.A_log).unsqueeze(0)

        return A
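A standalone shape check of one recurrence step helps make the tensor layout concrete. This sketch uses hypothetical sizes and a plain decaying identity as the transition matrix, independent of the class above:

```python
import torch

B, H, D = 2, 4, 16  # batch, heads, head dimension (illustrative sizes)
state = torch.zeros(B, H, D, D)
A_t = torch.eye(D).expand(B, H, D, D) * 0.9  # simple decaying transition
k = torch.randn(B, H, D)
v = torch.randn(B, H, D)
r = torch.randn(B, H, D)

# One step of the recurrence: transition, outer-product write, readout
kv = torch.einsum('bhd,bhe->bhde', k, v)           # [B, H, D, D]
state = torch.einsum('bhij,bhjk->bhik', A_t, state) + kv
out = torch.einsum('bhd,bhde->bhe', r, state)      # [B, H, D]

print(out.shape)  # torch.Size([2, 4, 16])
```

Note that the per-token cost is O(D²) regardless of how many tokens came before, which is exactly the constant-memory property claimed above.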

Why It Surpasses TC0

# RWKV-6 (previous): state_{t+1} = diag(w) * state_t + k_t * v_t^T
# -> Diagonal matrix multiplication: each dimension decays independently
# -> Within TC0: cannot track state with finite depth

# RWKV-7 (new): state_{t+1} = A(x_t) * state_t + k_t * v_t^T
# -> A(x_t) is an input-dependent transition matrix (non-diagonal)
# -> Cross-dimension interaction possible: can solve state tracking problems
# -> Expressiveness beyond TC0
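The diagonal-vs-non-diagonal distinction can be seen in a two-dimensional toy state. A diagonal transition can only decay each dimension in place, so no sequence of diagonal updates ever moves information between dimensions; a non-diagonal transition (here a swap matrix, chosen purely for illustration) can, which is what state tracking requires:

```python
import torch

# State as a one-hot location vector: index 0 = room1, index 1 = room2
state = torch.tensor([1.0, 0.0])  # "A is in room1"

# Diagonal transition (RWKV-6 style): each dimension only decays in place
diagonal = torch.diag(torch.tensor([0.9, 0.9]))

# Non-diagonal transition (RWKV-7 style): "A moves to room2" as a swap
swap = torch.tensor([[0.0, 1.0],
                     [1.0, 0.0]])

print(diagonal @ state)  # tensor([0.9000, 0.0000]) -- still in room1
print(swap @ state)      # tensor([0., 1.])         -- the move is tracked
```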

Full Architecture Overview

RWKV-7 Block Composition

class RWKV7Block(nn.Module):
    """RWKV-7 basic block"""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.time_mix = RWKV7_TimeMix(d_model, n_heads)
        self.channel_mix = RWKV7_ChannelMix(d_model)

    def forward(self, x, state=None):
        # Time Mixing (inter-token interaction)
        h, state = self.time_mix(self.ln1(x), state)
        x = x + h

        # Channel Mixing (serves as FFN)
        x = x + self.channel_mix(self.ln2(x))

        return x, state


class RWKV7_ChannelMix(nn.Module):
    """Channel mixing (SwiGLU variant)"""

    def __init__(self, d_model, expand=3.5):
        super().__init__()
        hidden = int(d_model * expand)
        self.W_key = nn.Linear(d_model, hidden)
        self.W_value = nn.Linear(hidden, d_model)
        self.W_gate = nn.Linear(d_model, hidden)

    def forward(self, x):
        k = self.W_key(x)
        v = self.W_value(torch.relu(k) ** 2)  # squared ReLU
        g = torch.sigmoid(self.W_gate(x))
        return v * g
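The squared-ReLU activation used in the channel mix can be inspected in isolation (a standalone check, independent of the classes above):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 2.0])
sq_relu = torch.relu(x) ** 2

print(sq_relu)  # tensor([0.0000, 0.0000, 0.2500, 4.0000])
# Negative inputs are zeroed; squaring damps sub-1 activations (0.5 -> 0.25)
# and amplifies larger ones (2.0 -> 4.0), sharpening the channel-mix signal.
```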

Inference Mode: Operating as an RNN

class RWKV7_Inference:
    """Operates in RNN mode during inference: O(d^2) computation per token"""

    def __init__(self, model):
        self.model = model
        # One state per layer, each [B, n_heads, d_head, d_head];
        # None until first use (blocks initialize a zero state themselves)
        self.states = [None] * len(model.blocks)

    def generate_token(self, token_id):
        x = self.model.embed(token_id)

        for i, block in enumerate(self.model.blocks):
            x, self.states[i] = block(x.unsqueeze(0).unsqueeze(0),
                                       self.states[i])
            x = x.squeeze(0).squeeze(0)

        logits = self.model.head(self.model.ln_out(x))
        return logits

    # Memory usage: constant regardless of sequence length!
    # Transformer: KV Cache uses O(n * d) memory
    # RWKV-7: State uses O(d^2) fixed memory
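The O(n·d) vs O(d²) gap can be made concrete with back-of-envelope arithmetic. The shapes below are hypothetical 3B-scale values chosen for illustration, not figures from the paper:

```python
# Back-of-envelope memory comparison (hypothetical 3B-scale shapes, fp16)
n_layers, n_heads, d_head = 32, 32, 80
d_model = n_heads * d_head  # 2560
bytes_per_elem = 2          # fp16

# Transformer KV cache: K and V vectors for every past token, every layer
def kv_cache_gib(seq_len):
    return n_layers * 2 * seq_len * d_model * bytes_per_elem / 2**30

# RWKV-7 state: one d_head x d_head matrix per head per layer, fixed size
state_mib = n_layers * n_heads * d_head * d_head * bytes_per_elem / 2**20

print(f"{kv_cache_gib(64_000):.1f} GiB")  # 19.5 GiB at 64K tokens
print(f"{state_mib:.1f} MiB")             # 12.5 MiB at any length
```

At these (assumed) shapes the KV cache grows by roughly 0.3 MiB per token, while the RWKV-7 state never grows at all.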

Benchmark Results

3B Model Comparison (Based on Paper Table 1)

Model        | Params | MMLU | HellaSwag | ARC-E | Inference Memory
LLaMA-3.2 3B | 3.2B   | 63.4 | 77.2      | 79.5  | O(n) KV Cache
Mamba-2 2.7B | 2.7B   | 58.1 | 73.8      | 74.2  | O(1) constant
RWKV-6 3B    | 3.0B   | 58.9 | 74.5      | 75.1  | O(1) constant
RWKV-7 3B    | 3.0B   | 61.2 | 76.8      | 78.3  | O(1) constant

Multilingual Benchmarks (A Key Strength of RWKV-7)

# RWKV-7 excels particularly in multilingual performance
# Trained on 100+ languages, achieving SOTA-level results for non-English languages

# On Korean/Japanese/Chinese benchmarks,
# outperforms same-size Transformers
# -> Efficient multilingual tokenization + long context handling

Inference Efficiency

# Inference cost comparison by sequence length
#
# Sequence Length | Transformer | RWKV-7
# 1K             | 1x          | 0.8x
# 4K             | 4x          | 0.8x
# 16K            | 16x         | 0.8x
# 64K            | 64x         | 0.8x
# 1M             | OOM         | 0.8x
#
# RWKV-7 maintains constant cost regardless of sequence length!

Practical Usage: Working with RWKV-7

Loading Models from Hugging Face

# pip install rwkv torch transformers

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Load model
model = RWKV(
    model='/path/to/RWKV-7-World-3B',
    strategy='cuda fp16'  # GPU FP16
)

pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# Text generation (sampling parameters go through PIPELINE_ARGS)
context = "The advantages of Kubernetes VPA are"
args = PIPELINE_ARGS(temperature=0.8, top_p=0.9)
result = pipeline.generate(context, token_count=200, args=args)
print(result)

Serving with vLLM

# RWKV-7 vLLM serving (experimental support)
pip install "vllm>=0.6.0"   # quotes keep the shell from treating >= as a redirect

python -m vllm.entrypoints.openai.api_server \
  --model RWKV/rwkv-7-world-3b \
  --tokenizer RWKV/rwkv-7-world-3b \
  --dtype float16 \
  --max-model-len 32768

Running Locally with Ollama

# After downloading the RWKV-7 GGUF model
ollama create rwkv7 -f Modelfile

# Modelfile example:
# FROM rwkv-7-world-3b-q4_k_m.gguf
# PARAMETER temperature 0.7
# PARAMETER num_ctx 32768

ollama run rwkv7

RWKV-7 vs Mamba-2 vs Transformer

Architecture Comparison

Property              | Transformer     | Mamba-2        | RWKV-7
Time Complexity       | O(n²)           | O(n)           | O(n)
Inference Memory      | O(n·d) KV Cache | O(d²) constant | O(d²) constant
Parallel Training     | Fully parallel  | Chunk parallel | Fully parallel
Expressiveness        | TC0             | TC0            | Beyond TC0
Hardware Optimization | Mature          | Evolving       | Evolving

When Should You Choose RWKV-7?

# Scenarios where RWKV-7 excels:
# 1. Very long context processing (100K+ tokens)
# 2. Edge device inference (constant memory)
# 3. Multilingual services (Korean/Japanese/Chinese, etc.)
# 4. Real-time streaming (constant time per token)

# Scenarios where Transformers are still better:
# 1. When maximum performance is needed (especially in English)
# 2. Short contexts (under 1K)
# 3. Leveraging existing ecosystem/tools

Conclusion

RWKV-7 Goose is a groundbreaking architecture that achieves Transformer-level performance while maintaining the efficiency of constant memory + linear time. Through Dynamic State Evolution, it has broken through the theoretical limitations of existing Linear Attention (TC0), and it particularly excels in multilingual and long-context scenarios.


Quiz (7 Questions)

Q1. What is the inference memory complexity of RWKV-7? O(d²) constant — independent of sequence length

Q2. What is the limitation of TC0 (Threshold Circuit Class 0)? Models within TC0 (constant-depth threshold circuits, including prior Linear Attention) theoretically cannot solve problems such as state tracking

Q3. What is the core mechanism that allows RWKV-7 to surpass TC0? Dynamic State Evolution — input-dependent transition matrix A(x_t) enables cross-dimension interaction

Q4. Why does Transformer inference memory scale as O(n·d)? Because the KV Cache grows proportionally to sequence length

Q5. Why can RWKV-7 be trained in parallel? The WKV operation has a structure that can be parallelized in chunk units

Q6. Why does RWKV-7 excel particularly in multilingual tasks? World tokenizer trained on 100+ languages + efficient long context processing

Q7. What activation function is used in RWKV-7's Channel Mix? Squared ReLU (the square of ReLU)