RWKV-7 "Goose" Architecture Analysis — A Linear-Time Model Surpassing Transformers

RWKV-7 Goose Architecture

Introduction

Transformers are the foundation of modern LLMs, but they have a fundamental limitation: O(n²) attention cost. RWKV-7 "Goose" is a new sequence modeling architecture that addresses this problem, achieving constant memory usage and consistent inference time per token while delivering performance on par with Transformers.

We present an in-depth analysis of this paper ("RWKV-7 Goose with Expressive Dynamic State Evolution"), published at ICML in July 2025.

Evolution of the RWKV Series

From RWKV-4 to RWKV-7

RWKV (Receptance Weighted Key Value) is an architecture that combines the strengths of RNNs and Transformers:

  • RWKV-4 (2023): Introduced the foundational WKV mechanism. A variant of Linear Attention
  • RWKV-5 "Eagle" (2024): Multi-headed WKV, improved performance
  • RWKV-6 "Finch" (2024): Data-dependent decay, LoRA integration
  • RWKV-7 "Goose" (2025): Dynamic State Evolution, breaking the TC0 barrier

Attention vs Linear Attention vs RWKV-7

# Standard Attention: O(n^2) time, O(n) memory
# output = softmax(Q @ K^T / sqrt(d)) @ V

# Linear Attention: O(n) time, O(d^2) memory (constant)
# output = (Q @ (K^T @ V)) -- reordering matrix multiplication

# RWKV-7: O(n) time, O(d^2) memory (constant)
# + Dynamic State Evolution maximizes expressiveness
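
The complexity claims above can be checked numerically. Below is a minimal sketch of the matmul reordering and its equivalent recurrent form (single head, unnormalized, causal; the shapes are illustrative, not a real model's):

```python
import torch

torch.manual_seed(0)
n, d = 6, 4  # sequence length, head dimension
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

# Causal linear attention, quadratic form: materializes the n x n score matrix
causal = torch.tril(Q @ K.T) @ V  # O(n^2 * d) time, O(n^2) scratch memory

# Same result via a fixed d x d state updated once per token:
# state_t = state_{t-1} + k_t^T v_t, out_t = q_t @ state_t
state = torch.zeros(d, d)
outs = []
for t in range(n):
    state = state + K[t].unsqueeze(1) @ V[t].unsqueeze(0)  # add k_t^T v_t
    outs.append(Q[t] @ state)
recurrent = torch.stack(outs)  # O(n * d^2) time, O(d^2) memory

assert torch.allclose(causal, recurrent, atol=1e-5)
```

The recurrent form is why inference memory stays constant: only the d x d `state` is carried between tokens, regardless of how long the sequence grows.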

Core Mechanism: Dynamic State Evolution

Limitations of Existing Linear Attention

Existing Linear Attention (including RWKV-4 through 6) belongs to TC0 (Threshold Circuit Class 0). This means it theoretically cannot solve certain problems:

# TC0 limitation example: state tracking problem
# Input: "A is in room1. A moves to room2. B is in room1. Where is A?"
# TC0 models are theoretically incapable of tracking such state changes

RWKV-7's Dynamic State Evolution

RWKV-7 dynamically modifies the state transition matrix itself based on the input:

import torch
import torch.nn as nn

class RWKV7_TimeMix(nn.Module):
    """RWKV-7 Core: Dynamic State Evolution"""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        # Input-dependent parameter generators
        self.W_r = nn.Linear(d_model, d_model)  # Receptance
        self.W_k = nn.Linear(d_model, d_model)  # Key
        self.W_v = nn.Linear(d_model, d_model)  # Value
        self.W_a = nn.Linear(d_model, d_model)  # State transition
        self.W_g = nn.Linear(d_model, d_model)  # Gate

        # Dynamic decay parameters
        self.w_decay = nn.Parameter(torch.randn(n_heads, self.d_head))

        # State evolution matrix learnable parameters
        self.A_log = nn.Parameter(torch.randn(n_heads, self.d_head, self.d_head))

    def forward(self, x, state=None):
        B, T, C = x.shape
        H = self.n_heads
        D = self.d_head

        r = self.W_r(x).view(B, T, H, D)
        k = self.W_k(x).view(B, T, H, D)
        v = self.W_v(x).view(B, T, H, D)
        a = torch.sigmoid(self.W_a(x).view(B, T, H, D))
        g = torch.sigmoid(self.W_g(x).view(B, T, H, D))

        if state is None:
            state = torch.zeros(B, H, D, D, device=x.device)

        outputs = []
        for t in range(T):
            # Dynamic State Evolution: the core innovation
            # Previous: state = decay * state + k^T @ v (fixed decay)
            # RWKV-7: state = A(x_t) @ state + k^T @ v (dynamic transition matrix)

            # Input-dependent transition matrix
            A_t = self._compute_transition(a[:, t], state)

            # State update: apply the dynamic transition matrix to the
            # running state (matrix product, matching the comment above),
            # then add the outer product k_t^T v_t
            kv = torch.einsum('bhd,bhe->bhde', k[:, t], v[:, t])
            state = torch.einsum('bhdf,bhfe->bhde', A_t, state) + kv

            # Output computation
            out = torch.einsum('bhd,bhde->bhe', r[:, t], state)
            out = out * g[:, t]
            outputs.append(out)

        output = torch.stack(outputs, dim=1).view(B, T, C)
        return output, state

    def _compute_transition(self, a_t, state):
        """Dynamically generate transition matrix based on input"""
        # a_t: [B, H, D] - transition parameters derived from current input
        # This is the key to how RWKV-7 surpasses TC0
        decay = torch.exp(-torch.exp(self.w_decay))
        A = decay.unsqueeze(-1) * torch.eye(
            self.d_head, device=a_t.device
        ).unsqueeze(0)

        # Input-dependent correction
        A = A + torch.einsum('bhd,bhe->bhde', a_t, a_t) * \
            torch.exp(self.A_log).unsqueeze(0)

        return A

Why It Surpasses TC0

# RWKV-6 (previous): state_{t+1} = diag(w) * state_t + k_t * v_t^T
# -> Diagonal matrix multiplication: each dimension decays independently
# -> Within TC0: cannot track state with finite depth

# RWKV-7 (new): state_{t+1} = A(x_t) * state_t + k_t * v_t^T
# -> A(x_t) is an input-dependent transition matrix (non-diagonal)
# -> Cross-dimension interaction possible: can solve state tracking problems
# -> Expressiveness beyond TC0
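
A toy example makes the diagonal vs. non-diagonal distinction concrete: a diagonal transition can only scale each state dimension in place, while a full matrix can route information across dimensions, which is exactly what state tracking (e.g. swapping two tracked facts) requires. This is an illustrative sketch, not the actual RWKV-7 parameterization:

```python
import torch

state = torch.tensor([1.0, 2.0])  # two state dimensions

# RWKV-6 style: diagonal transition, each dimension decays independently.
# No sequence of diagonal steps can move dim 0's content into dim 1.
diag = torch.diag(torch.tensor([0.9, 0.5]))
after_diag = diag @ state  # only per-dimension scaling

# RWKV-7 style: input-dependent, possibly non-diagonal transition.
# A permutation matrix swaps the two components in a single step.
swap = torch.tensor([[0.0, 1.0],
                     [1.0, 0.0]])
after_swap = swap @ state  # cross-dimension routing: [2.0, 1.0]
```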

Full Architecture Overview

RWKV-7 Block Composition

class RWKV7Block(nn.Module):
    """RWKV-7 basic block"""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.time_mix = RWKV7_TimeMix(d_model, n_heads)
        self.channel_mix = RWKV7_ChannelMix(d_model)

    def forward(self, x, state=None):
        # Time Mixing (inter-token interaction)
        h, state = self.time_mix(self.ln1(x), state)
        x = x + h

        # Channel Mixing (serves as FFN)
        x = x + self.channel_mix(self.ln2(x))

        return x, state


class RWKV7_ChannelMix(nn.Module):
    """Channel mixing (SwiGLU variant)"""

    def __init__(self, d_model, expand=3.5):
        super().__init__()
        hidden = int(d_model * expand)
        self.W_key = nn.Linear(d_model, hidden)
        self.W_value = nn.Linear(hidden, d_model)
        self.W_gate = nn.Linear(d_model, hidden)

    def forward(self, x):
        k = self.W_key(x)
        v = self.W_value(torch.relu(k) ** 2)  # squared ReLU
        g = torch.sigmoid(self.W_gate(x))
        return v * g

Inference Mode: Operating as an RNN

class RWKV7_Inference:
    """Operates in RNN mode during inference: O(d^2) computation per token"""

    def __init__(self, model):
        self.model = model
        self.states = None  # one per layer: [B, n_heads, d_head, d_head]

    def generate_token(self, token_id):
        x = self.model.embed(token_id)

        # Lazily allocate one state slot per block; each block creates its
        # own zero state on the first call (the state=None path in forward)
        if self.states is None:
            self.states = [None] * len(self.model.blocks)

        for i, block in enumerate(self.model.blocks):
            x, self.states[i] = block(x.unsqueeze(0).unsqueeze(0),
                                      self.states[i])
            x = x.squeeze(0).squeeze(0)

        logits = self.model.head(self.model.ln_out(x))
        return logits

    # Memory usage: constant regardless of sequence length!
    # Transformer: KV Cache uses O(n * d) memory
    # RWKV-7: State uses O(d^2) fixed memory
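
The memory comparison in the comments above can be made concrete with rough arithmetic. The layer count and width below are illustrative 3B-class values, not the actual model configuration:

```python
# Back-of-the-envelope memory math for KV cache vs. fixed RWKV state.
# Assumed shapes (illustrative): 32 layers, d_model = 2560, fp16 values.
n_layers, d_model, bytes_per = 32, 2560, 2

def kv_cache_bytes(seq_len):
    # Transformer: K and V each hold seq_len x d_model values per layer
    return n_layers * 2 * seq_len * d_model * bytes_per

def rwkv_state_bytes():
    # RWKV-7: one d_head x d_head state per head; across all heads this
    # totals d_model x d_model values per layer, independent of seq_len
    return n_layers * d_model * d_model * bytes_per

for n in (1_024, 32_768, 1_048_576):
    print(f"{n:>9} tokens | KV cache: {kv_cache_bytes(n) / 2**30:7.2f} GiB"
          f" | RWKV state: {rwkv_state_bytes() / 2**30:5.2f} GiB")
```

Under these assumed shapes, the KV cache reaches 10 GiB at 32K tokens, while the RWKV-7 state stays at about 0.39 GiB at any length.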

Benchmark Results

3B Model Comparison (Based on Paper Table 1)

Model | Params | MMLU | HellaSwag | ARC-E | Inference Memory
LLaMA-3.2 3B | 3.2B | 63.4 | 77.2 | 79.5 | O(n) KV Cache
Mamba-2 2.7B | 2.7B | 58.1 | 73.8 | 74.2 | O(1) constant
RWKV-6 3B | 3.0B | 58.9 | 74.5 | 75.1 | O(1) constant
RWKV-7 3B | 3.0B | 61.2 | 76.8 | 78.3 | O(1) constant

Multilingual Benchmarks (A Key Strength of RWKV-7)

# RWKV-7 excels particularly in multilingual performance
# Trained on 100+ languages, achieving SOTA-level results for non-English languages

# On Korean/Japanese/Chinese benchmarks,
# outperforms same-size Transformers
# -> Efficient multilingual tokenization + long context handling

Inference Efficiency

# Inference cost comparison by sequence length
#
# Sequence Length | Transformer | RWKV-7
# 1K             | 1x          | 0.8x
# 4K             | 4x          | 0.8x
# 16K            | 16x         | 0.8x
# 64K            | 64x         | 0.8x
# 1M             | OOM         | 0.8x
#
# RWKV-7 maintains constant cost regardless of sequence length!
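
The table's scaling follows from per-token decode cost: with a KV cache, token n attends over all n previous positions, while RWKV-7 performs a fixed-size state update. A rough sketch, with illustrative (assumed) dimensions and constants dropped:

```python
# Per-token decode cost, per layer (rough FLOP counts, ignoring constants).
d_model, d_head = 2560, 64  # illustrative model/head widths

def attention_flops_per_token(n):
    # Score one query against n cached keys, then weight n cached values
    return 2 * n * d_model

def rwkv7_flops_per_token():
    # Update and read a fixed d_head-wide state across all heads
    return 2 * d_model * d_head

for n in (1_024, 16_384, 1_048_576):
    ratio = attention_flops_per_token(n) / rwkv7_flops_per_token()
    print(f"n = {n:>9}: attention costs ~{ratio:,.0f}x more per token")
```

The ratio grows linearly with sequence length (here it is simply n / d_head), while the RWKV-7 side of the division never changes.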

Practical Usage: Working with RWKV-7

Loading Models from Hugging Face

# pip install rwkv torch transformers

from rwkv.model import RWKV
from rwkv.utils import PIPELINE

# Load model
model = RWKV(
    model='/path/to/RWKV-7-World-3B',
    strategy='cuda fp16'  # GPU FP16
)

pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# Text generation
context = "The advantages of Kubernetes VPA are"
result = pipeline.generate(
    context,
    token_count=200,
    temperature=0.8,
    top_p=0.9
)
print(result)

Serving with vLLM

# RWKV-7 vLLM serving (experimental support)
pip install "vllm>=0.6.0"  # quoted so the shell does not treat '>' as redirection

python -m vllm.entrypoints.openai.api_server \
  --model RWKV/rwkv-7-world-3b \
  --tokenizer RWKV/rwkv-7-world-3b \
  --dtype float16 \
  --max-model-len 32768

Running Locally with Ollama

# After downloading the RWKV-7 GGUF model
ollama create rwkv7 -f Modelfile

# Modelfile example:
# FROM rwkv-7-world-3b-q4_k_m.gguf
# PARAMETER temperature 0.7
# PARAMETER num_ctx 32768

ollama run rwkv7

RWKV-7 vs Mamba-2 vs Transformer

Architecture Comparison

Property | Transformer | Mamba-2 | RWKV-7
Time Complexity | O(n²) | O(n) | O(n)
Inference Memory | O(n·d) KV Cache | O(d²) constant | O(d²) constant
Parallel Training | Fully parallel | Chunk parallel | Fully parallel
Expressiveness | Beyond TC0 | TC0 | Beyond TC0
Hardware Optimization | Mature | Evolving | Evolving

When Should You Choose RWKV-7?

# Scenarios where RWKV-7 excels:
# 1. Very long context processing (100K+ tokens)
# 2. Edge device inference (constant memory)
# 3. Multilingual services (Korean/Japanese/Chinese, etc.)
# 4. Real-time streaming (constant time per token)

# Scenarios where Transformers are still better:
# 1. When maximum performance is needed (especially in English)
# 2. Short contexts (under 1K)
# 3. Leveraging existing ecosystem/tools

Conclusion

RWKV-7 Goose is a groundbreaking architecture that achieves Transformer-level performance while maintaining the efficiency of constant memory + linear time. Through Dynamic State Evolution, it has broken through the theoretical limitations of existing Linear Attention (TC0), and it particularly excels in multilingual and long-context scenarios.


Quiz (7 Questions)

Q1. What is the inference memory complexity of RWKV-7? O(d²) constant — independent of sequence length

Q2. What is the limitation of TC0 (Threshold Circuit Class 0)? Linear Attention with finite depth theoretically cannot solve problems like state tracking

Q3. What is the core mechanism that allows RWKV-7 to surpass TC0? Dynamic State Evolution — input-dependent transition matrix A(x_t) enables cross-dimension interaction

Q4. Why does Transformer inference memory scale as O(n·d)? Because the KV Cache grows proportionally to sequence length

Q5. Why can RWKV-7 be trained in parallel? The WKV operation has a structure that can be parallelized in chunk units

Q6. Why does RWKV-7 excel particularly in multilingual tasks? World tokenizer trained on 100+ languages + efficient long context processing

Q7. What activation function is used in RWKV-7's Channel Mix? Squared ReLU (the square of ReLU)
