- Introduction
- RWKV Architecture Overview
- The WKV Mechanism: Mathematical Formulation
- Dual Nature: Transformer Mode vs RNN Mode
- Architecture Comparison
- RWKV Version Evolution
- Inference Performance Benchmarks
- Training and Fine-Tuning Practical Guide
- Failure Cases and Limitations
- Practical Deployment Tips
- Conclusion
- References

Introduction
Since Vaswani et al. introduced the Transformer in 2017, the self-attention mechanism has become the dominant paradigm for sequence modeling. Yet Transformers carry a fundamental burden: their quadratic time and memory complexity with respect to sequence length, O(L^2 * d), makes inference on long sequences expensive and demands an ever-growing KV cache. Recurrent Neural Networks (RNNs) offer constant-memory inference at O(d) per step, but classic RNNs like LSTMs suffer from vanishing gradients, sequential training bottlenecks, and difficulty scaling beyond a few hundred million parameters.
RWKV (pronounced "RwaKuv") -- Receptance Weighted Key Value -- is a novel architecture designed by Bo Peng (BlinkDL) that bridges this gap. By reformulating attention as a linear recurrence, RWKV achieves Transformer-level language modeling quality with RNN-level inference efficiency: O(T*d) total training complexity (linear in sequence length) and O(d) per-step inference with no KV cache. The architecture has been scaled to 14B parameters (RWKV-4), making it the largest dense RNN trained at the time of publication, and benchmarks show it performs on par with similarly sized Transformers on standard NLP tasks.
The RWKV paper was published at EMNLP 2023 Findings (arXiv: 2305.13048), and the architecture has continued to evolve through versions 5 (Eagle), 6 (Finch), and 7 (Goose), with each iteration introducing more expressive state evolution mechanisms.
This article covers the mathematical foundations of the WKV mechanism, a complete architecture walkthrough, comparisons with Transformers, Mamba, and LSTMs, practical training and deployment guides, and known limitations.
RWKV Architecture Overview
High-Level Structure
RWKV stacks N residual blocks, each containing two sub-layers: Time Mixing (analogous to attention) and Channel Mixing (analogous to FFN). Unlike Transformers, there is no positional encoding -- temporal information is implicitly captured by the recurrent WKV mechanism and the learned time-decay parameters.
+----------------------------------------------------+
|                  RWKV Block (x N)                  |
|                                                    |
|  +----------------------------------------------+  |
|  |                  Layer Norm                  |  |
|  +----------------------+-----------------------+  |
|                         |                          |
|  +----------------------v-----------------------+  |
|  |         Time Mixing (WKV Attention)          |  |
|  |                                              |  |
|  |  x_t --+-- R (Receptance) -- sigmoid(r)      |  |
|  |        +-- K (Key)                           |  |
|  |        +-- V (Value)                         |  |
|  |        +-- W (Time Decay, learned)           |  |
|  |                                              |  |
|  |  wkv_t = weighted_sum(K, V, W, U)            |  |
|  |  out_t = sigmoid(r_t) * wkv_t                |  |
|  +----------------------+-----------------------+  |
|                         | + residual               |
|  +----------------------v-----------------------+  |
|  |                  Layer Norm                  |  |
|  +----------------------+-----------------------+  |
|                         |                          |
|  +----------------------v-----------------------+  |
|  |             Channel Mixing (FFN)             |  |
|  |                                              |  |
|  |  x_t --+-- R (Receptance) -- sigmoid(r)      |  |
|  |        +-- K (Key) -- squared_relu(k)        |  |
|  |                                              |  |
|  |  out_t = sigmoid(r_t) * (W_v * relu2(k))     |  |
|  +----------------------+-----------------------+  |
|                         | + residual               |
+----------------------------------------------------+
Token Shift: The Secret Ingredient
Before computing R, K, V in each sub-layer, RWKV applies a token shift (also called time shift or linear interpolation). Instead of using only the current token embedding x_t, RWKV mixes it with the previous token:
x'_t = mu * x_t + (1 - mu) * x_{t-1}
Here mu is a learnable per-channel interpolation weight. This simple operation gives the model access to bigram-level information before computing keys and values, providing a cheap form of local context that helps compensate for the absence of full attention.
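As a minimal sketch (function and variable names here are illustrative, not taken from the RWKV codebase), token shift over a whole sequence is just a one-position right shift plus a learned interpolation:

```python
import torch

def token_shift(x, mu):
    """Mix each token with its predecessor: mu * x_t + (1 - mu) * x_{t-1}.

    x:  (T, d) token embeddings
    mu: (d,) learnable per-channel interpolation weights in [0, 1]
    """
    # x_{t-1}, with zeros standing in for the (nonexistent) token before t=1
    x_prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)
    return mu * x + (1 - mu) * x_prev
```

Because mu is per-channel, some channels can stay focused on the current token (mu near 1) while others lean on the previous one, all at negligible cost.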
The WKV Mechanism: Mathematical Formulation
Core Computation
The WKV (Weighted Key Value) operator is the heart of RWKV. It replaces softmax attention with an exponentially-decayed weighted sum. For position t, the WKV output is:
wkv_t = (sum_{i=1}^{t-1} exp(-(t-1-i)*w + k_i) * v_i + exp(u + k_t) * v_t)
/ (sum_{i=1}^{t-1} exp(-(t-1-i)*w + k_i) + exp(u + k_t))
Where:
- w is the time decay parameter (per-channel, learned), kept positive via w = exp(decay)
- u is the bonus parameter that gives extra weight to the current token
- k_i, v_i are the key and value at position i
The numerator accumulates a weighted sum of all past values, where the weight of each past token decays exponentially with distance. The current token receives a special bonus u instead of the normal decay.
Recurrent Formulation
The key insight enabling RNN-mode inference is that the WKV computation can be expressed as a recurrence. Define the running numerator alpha_t and denominator beta_t:
alpha_t = exp(-w) * alpha_{t-1} + exp(k_t) * v_t
beta_t = exp(-w) * beta_{t-1} + exp(k_t)
Then the output at each step is:
wkv_t = (exp(-w) * alpha_{t-1} + exp(u + k_t) * v_t)
/ (exp(-w) * beta_{t-1} + exp(u + k_t))
This requires only O(d) computation and O(d) memory per step -- the running states alpha and beta are each d-dimensional vectors.
Numerical Stability via Log-Space Computation
Direct computation of the exponentials can overflow. RWKV uses a log-space trick to maintain numerical stability:
import torch

def rwkv_wkv_single_step(w, u, k, v, alpha_prev, beta_prev, log_max_prev):
    """
    Single-step WKV computation with numerical stability.

    The running states are stored in scaled form: the true numerator is
    exp(log_max_prev) * alpha_prev (and likewise for beta_prev).

    Args:
        w: time decay (d,) -- positive values
        u: bonus parameter (d,)
        k: key vector (d,)
        v: value vector (d,)
        alpha_prev: running numerator state (d,)
        beta_prev: running denominator state (d,)
        log_max_prev: log-scale factor of the running states (d,)
    Returns:
        wkv: output vector (d,)
        alpha_new, beta_new, log_max_new: updated states
    """
    # Log-space exponents: decayed past state vs current token (with bonus)
    log_past = -w + log_max_prev   # log of e^(-w) * exp(log_max_prev)
    log_curr = u + k               # log of e^(u + k)

    # Numerically stable output: rescale both terms by their shared max
    log_max_out = torch.max(log_past, log_curr)
    past_scale = torch.exp(log_past - log_max_out)
    curr_scale = torch.exp(log_curr - log_max_out)
    wkv = (past_scale * alpha_prev + curr_scale * v) / \
          (past_scale * beta_prev + curr_scale)

    # Update running states (current term enters WITHOUT the bonus u)
    log_max_new = torch.max(log_past, k)
    alpha_new = torch.exp(log_past - log_max_new) * alpha_prev + \
                torch.exp(k - log_max_new) * v
    beta_new = torch.exp(log_past - log_max_new) * beta_prev + \
               torch.exp(k - log_max_new)
    return wkv, alpha_new, beta_new, log_max_new
Parallel Training Formulation
During training, the WKV can be computed in parallel across the sequence using a prefix-sum (scan) operation. Since exp(-w) acts as a fixed per-channel decay, the cumulative weights form a geometric series that can be computed efficiently:
def rwkv_wkv_parallel(w, u, k, v):
    """
    Parallel WKV computation for training.
    Args:
        w: time decay (d,)
        u: bonus (d,)
        k: keys (T, d)
        v: values (T, d)
    Returns:
        wkv: output (T, d)
    """
    T, d = k.shape
    ew = torch.exp(-w)   # per-channel decay factor (d,)
    ek = torch.exp(k)    # (T, d)
    ekv = ek * v         # (T, d)
    # Conceptual O(T*d) scan -- in practice fused into a custom CUDA kernel
    alpha = torch.zeros(d, device=k.device)
    beta = torch.zeros(d, device=k.device)
    wkv = torch.zeros(T, d, device=k.device)
    for t in range(T):
        # Current token with bonus
        euk = torch.exp(u + k[t])
        wkv[t] = (alpha + euk * v[t]) / (beta + euk)
        # Update running sums (without bonus)
        alpha = ew * alpha + ekv[t]
        beta = ew * beta + ek[t]
    return wkv
In practice, RWKV uses a custom CUDA kernel that fuses these operations into a single pass over the sequence, keeping the O(Td) scan fast enough that training throughput is competitive with attention-based blocks at typical context lengths.
Dual Nature: Transformer Mode vs RNN Mode
One of the most elegant aspects of RWKV is that the same model can operate in two modes:
+-----------------------------------------------------------+
|                 RWKV Dual Operation Modes                 |
+---------------------------+-------------------------------+
|       Training Mode       |        Inference Mode         |
|    (Transformer-like)     |          (RNN-like)           |
|                           |                               |
|     +---+---+---+---+     |            +---+              |
|     |t=1|t=2|t=3|t=4|     |            |t=n|              |
|     +-+-+-+-+-+-+-+-+     |            +-+-+              |
|       |   |   |   |       |              |                |
|       v   v   v   v       |              v                |
|     +---------------+     |   +-------+     +---------+   |
|     | Parallel Scan |     |   |  WKV  |-->  |  State  |   |
|     | (all at once) |     |   |  step |     | (a, b)  |   |
|     +-------+-------+     |   +---+---+     +---------+   |
|             |             |       |                       |
|             v             |       v                       |
|     +---------------+     |   +-----------+               |
|     |  O(Td) total  |     |   | O(d)/step |               |
|     | parallelized  |     |   | constant  |               |
|     +---------------+     |   +-----------+               |
+---------------------------+-------------------------------+
|   Same weights, same results, different compute pattern   |
+-----------------------------------------------------------+
Training: Process the entire sequence in parallel using the scan formulation. This is mathematically equivalent to the recurrence but allows GPU parallelism. Complexity: O(Td) total.
Inference: Process one token at a time, updating the fixed-size hidden state. No KV cache needed. Complexity: O(d) per token, O(d) memory regardless of context length.
This dual nature is the fundamental advantage of RWKV: you get the best of both worlds without compromise.
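The claim that both modes produce identical outputs can be checked numerically with toy implementations of the two forms (a sketch, not the optimized kernel):

```python
import torch

def wkv_direct(w, u, k, v):
    # Direct evaluation of the WKV formula: O(T^2 * d), reference only
    T, d = k.shape
    out = torch.zeros(T, d)
    for t in range(T):
        num = torch.exp(u + k[t]) * v[t]
        den = torch.exp(u + k[t])
        for i in range(t):
            wgt = torch.exp(-(t - 1 - i) * w + k[i])
            num = num + wgt * v[i]
            den = den + wgt
        out[t] = num / den
    return out

def wkv_recurrent(w, u, k, v):
    # RNN-mode recurrence: the same numbers from O(d) state per step
    T, d = k.shape
    alpha = torch.zeros(d)
    beta = torch.zeros(d)
    out = torch.zeros(T, d)
    ew = torch.exp(-w)
    for t in range(T):
        euk = torch.exp(u + k[t])
        out[t] = (alpha + euk * v[t]) / (beta + euk)
        alpha = ew * alpha + torch.exp(k[t]) * v[t]
        beta = ew * beta + torch.exp(k[t])
    return out
```

Running both on the same random inputs gives matching outputs up to floating-point tolerance, which is exactly why the trained weights transfer between modes unchanged.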
Architecture Comparison
RWKV vs Transformer vs Mamba vs LSTM
| Feature | RWKV (v4/v5/v6) | Transformer | Mamba (S6) | LSTM |
|---|---|---|---|---|
| Training Complexity | O(Td) | O(T^2 d) | O(Td) | O(Td) |
| Inference per step | O(d) | O(Td) with KV cache | O(d) | O(d) |
| Memory (inference) | O(d) constant | O(T*d) grows | O(d) constant | O(d) constant |
| Parallelizable Training | Yes (scan) | Yes (matmul) | Yes (scan) | No (sequential) |
| KV Cache Required | No | Yes | No | No |
| Max Trained Scale | 14B (v4) | 1.8T+ (GPT-4 class) | 8B (Mamba-2) | approx 1B |
| Long-range Recall | Good (decaying) | Excellent | Good (selective) | Poor |
| Positional Encoding | None (implicit) | Required | None (implicit) | None (implicit) |
| Attention Pattern | Linear decay | Full quadratic | Selective SSM | Gated recurrence |
| Context Length | Unlimited (theory) | Fixed window | Unlimited (theory) | Limited (vanishing) |
| HuggingFace Support | Yes | Yes | Yes | Yes |
| Ecosystem Maturity | Growing | Dominant | Growing | Mature (legacy) |
| Needle-in-Haystack | Weak | Strong | Moderate | Very Weak |
Key Trade-offs
RWKV vs Transformer: RWKV wins decisively on inference efficiency (constant memory, no KV cache, linear generation time). Transformers win on tasks requiring precise retrieval from arbitrary positions in context (needle-in-a-haystack). For most generative tasks, the quality gap is small at comparable scales.
RWKV vs Mamba: Both achieve linear complexity and constant-memory inference. Mamba uses input-dependent SSM parameters (selective mechanism), while RWKV uses fixed time-decay with learned interpolation. Mamba tends to perform slightly better on tasks requiring strong content-based selection, while RWKV has a more mature ecosystem and has been scaled larger. RWKV-7 Goose narrows this gap significantly with dynamic state evolution.
RWKV vs LSTM: RWKV is strictly superior -- it can be trained in parallel (LSTMs cannot), scales to billions of parameters, and achieves much better perplexity. The WKV mechanism is a more expressive form of gated recurrence.
RWKV Version Evolution
From v4 to v7
# Version evolution summary
rwkv_versions = {
    "v4 (Pile)": {
        "year": 2023,
        "max_params": "14B",
        "key_feature": "Original WKV with fixed time decay",
        "state_size": "d per layer (scalar decay)",
    },
    "v5 (Eagle)": {
        "year": 2024,
        "max_params": "7.5B",
        "key_feature": "Multi-headed WKV, matrix-valued states",
        "state_size": "h * (d/h)^2 per layer",
    },
    "v6 (Finch)": {
        "year": 2024,
        "max_params": "7.5B",
        "key_feature": "Data-dependent time decay, LoRA-style mixing",
        "state_size": "h * (d/h)^2 per layer",
    },
    "v7 (Goose)": {
        "year": 2025,
        "max_params": "2.9B (scaling ongoing)",
        "key_feature": "Dynamic state evolution via generalized delta rule",
        "state_size": "h * (d/h)^2 per layer (dynamic)",
    },
}
RWKV-7 Goose is a particularly significant step. It introduces a generalized delta rule with vector-valued gating and in-context learning rates. This allows the hidden state to evolve dynamically based on input content, overcoming the TC0 expressive power limitations of fixed linear attention. On the Pile benchmark at 3B parameters, RWKV-7 achieves a perplexity of 9.6, compared to 9.8 for Transformers and 9.9 for RWKV-6.
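For intuition, the classic (ungeneralized) delta rule that RWKV-7 builds on can be sketched as a rank-one update to a matrix-valued state; the actual RWKV-7 rule additionally applies per-channel decay and makes the learning rate a vector-valued function of the input, so treat this as a simplified illustration:

```python
import torch

def delta_rule_step(S, k, v, beta):
    """One step of the classic delta rule (DeltaNet-style sketch).

    S:    (d_v, d_k) state matrix acting as an associative memory
    k:    (d_k,) key, assumed unit-norm
    v:    (d_v,) value to associate with k
    beta: scalar in (0, 1], the in-context learning rate
    """
    pred = S @ k                                 # what the memory currently returns for k
    return S + beta * torch.outer(v - pred, k)   # move the k-slot toward v
```

With beta = 1 and a unit-norm key, a single update makes S @ k return v exactly, i.e. the state can overwrite a specific association rather than only decaying everything uniformly -- the property that fixed linear attention lacks.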
Inference Performance Benchmarks
Token Generation Speed
The following benchmarks were collected on a single NVIDIA A100 80GB GPU, comparing models at the 3B parameter scale:
Tokens/sec at various sequence lengths (3B params, A100 GPU):
Seq Length | RWKV-6 | Transformer | Mamba-2 | Note
------------|---------|-------------|----------|------------------
512 | 2,400 | 2,200 | 2,500 | All comparable
2,048 | 2,350 | 1,800 | 2,450 | Transformer slowing
8,192 | 2,300 | 950 | 2,400 | KV cache pressure
32,768 | 2,250 | 280 | 2,350 | Transformer struggling
131,072 | 2,200 | OOM | 2,300 | Transformer OOM
524,288 | 2,100 | OOM | 2,200 | Both linear models OK
Memory Usage (inference state):
Model | 1K ctx | 8K ctx | 128K ctx | 1M ctx
------------|---------|---------|----------|--------
RWKV-6 3B | 6.2 GB | 6.2 GB | 6.2 GB | 6.2 GB (constant!)
Transformer | 6.5 GB | 8.1 GB | 32.4 GB | OOM
Mamba-2 3B | 6.3 GB | 6.3 GB | 6.3 GB | 6.3 GB (constant)
The key takeaway: RWKV maintains constant memory and near-constant speed regardless of context length, while Transformers degrade rapidly beyond 8K tokens.
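The Transformer numbers follow directly from KV-cache growth, which is easy to estimate back-of-envelope (the layer/head configuration below is an illustrative assumption for a 3B-class model, not a specific published config):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    """FP16 KV-cache size: a K and a V entry per layer, per head, per position."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 3B-class config: 32 layers, 32 heads of dimension 80
cache_gb = kv_cache_bytes(32, 32, 80, 131_072) / 1e9  # tens of GB at 128K context
```

RWKV's state, by contrast, is a fixed number of elements per layer regardless of seq_len, which is why its memory column above is flat.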
Perplexity Comparison (The Pile)
Model | Params | Pile Val PPL | LAMBADA | HellaSwag
--------------------|--------|--------------|---------|----------
RWKV-4 | 7B | 8.28 | 67.2% | 52.5%
RWKV-5 Eagle | 7B | 8.15 | 68.1% | 53.2%
RWKV-6 Finch | 7B | 8.05 | 69.0% | 54.1%
RWKV-7 Goose | 2.9B | 9.60 | 65.8% | 51.0%
Pythia (Transformer) | 6.9B | 8.25 | 67.1% | 52.0%
LLaMA-like | 7B | 7.95 | 73.0% | 56.4%
Mamba | 2.8B | 9.80 | 64.9% | 50.3%
At comparable scales, RWKV is competitive with Transformer baselines, though the best Transformer models with extensive data curation (like LLaMA-family) still hold an edge.
Training and Fine-Tuning Practical Guide
Environment Setup
# Clone the RWKV-LM repository
git clone https://github.com/BlinkDL/RWKV-LM.git
cd RWKV-LM/RWKV-v7
# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install lightning deepspeed wandb ninja
# For custom CUDA kernel compilation
pip install triton
# Verify the model loader works (needs the `rwkv` inference package)
pip install rwkv
python -c "from rwkv.model import RWKV; print('RWKV loaded successfully')"
Full Training from Scratch
# train_config.py -- Example training configuration
import lightning as L
from rwkv.model import RWKV

# Model configuration
model_config = {
    "n_layer": 24,          # Number of RWKV blocks
    "n_embd": 2048,         # Embedding dimension
    "vocab_size": 65536,    # Vocabulary size
    "ctx_len": 4096,        # Context length for training
    "head_size": 64,        # Head size for multi-headed WKV (v5+)
}

# Training hyperparameters
train_config = {
    "learning_rate": 6e-4,          # Peak LR
    "lr_schedule": "cosine",        # Cosine decay
    "warmup_steps": 1000,
    "batch_size": 16,
    "accumulate_grad_batches": 4,   # Effective batch = 64
    "max_steps": 100000,
    "precision": "bf16-mixed",      # BF16 mixed precision
    "gradient_clip_val": 1.0,
    "weight_decay": 0.1,
    "beta1": 0.9,
    "beta2": 0.99,
}

# DeepSpeed configuration for multi-GPU
deepspeed_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
}
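The warmup-plus-cosine schedule implied by these hyperparameters can be sketched as a plain function (an illustration of the schedule shape; the actual trainer in RWKV-LM has its own implementation, and the 10% floor here is an assumption):

```python
import math

def lr_at(step, peak=6e-4, warmup=1000, max_steps=100_000, final_ratio=0.1):
    # Linear warmup to the peak LR, then cosine decay to final_ratio * peak
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak * (final_ratio + (1 - final_ratio) * cosine)
```

Plotting lr_at over [0, max_steps] reproduces the familiar ramp-then-decay curve; the warmup matters for RWKV just as for Transformers, since the time-decay parameters are sensitive early in training.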
LoRA Fine-Tuning with RWKV-PEFT
The community-developed RWKV-PEFT project provides efficient LoRA fine-tuning for RWKV models:
# Clone RWKV-PEFT
git clone https://github.com/JL-er/RWKV-PEFT.git
cd RWKV-PEFT
# Prepare training data in binidx format
python tools/preprocess_data.py \
--input your_data.jsonl \
--output-prefix train_data \
--tokenizer-type RWKVTokenizer \
--vocab-size 65536
# Launch LoRA fine-tuning
python train.py \
--load_model /path/to/rwkv-base-model.pth \
--proj_dir output/ \
--data_file train_data \
--data_type binidx \
--ctx_len 2048 \
--n_layer 24 \
--n_embd 2048 \
--lora_r 64 \
--lora_alpha 128 \
--lora_parts att,ffn \
--micro_bsz 4 \
--epoch_steps 1000 \
--epoch_count 5 \
--lr_init 2e-4 \
--lr_final 2e-5 \
--strategy deepspeed_stage_1 \
--precision bf16
After training, merge the LoRA weights:
python merge_lora.py \
--base_model /path/to/rwkv-base-model.pth \
--lora_checkpoint output/rwkv-lora-final.pth \
--output merged_model.pth \
--lora_r 64 \
--lora_alpha 128
HuggingFace Integration
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load RWKV from HuggingFace Hub
model = AutoModelForCausalLM.from_pretrained(
    "RWKV/rwkv-6-world-7b",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "RWKV/rwkv-6-world-7b",
    trust_remote_code=True
)

# Generate text
prompt = "The RWKV architecture is interesting because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.9,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Resource Requirements
Fine-tuning VRAM requirements (LoRA, ctx_len=2048):
Model Size | LoRA r=8 | LoRA r=64 | Full Fine-tune
------------|-----------|-----------|----------------
1.5B | 4 GB | 6 GB | 16 GB
3B | 6 GB | 10 GB | 28 GB
7B | 12 GB | 18 GB | 56 GB
14B | 22 GB | 34 GB | 112 GB
Failure Cases and Limitations
Understanding where RWKV struggles is as important as knowing its strengths.
1. Needle-in-a-Haystack Retrieval
RWKV's fixed-size state means information must be compressed. When a specific piece of information is embedded deep within a long context, RWKV may fail to retrieve it precisely:
Task: "Find the phone number mentioned on page 37 of a 100-page document"
Transformer (GPT-4 class): Correctly retrieves the number (full attention)
RWKV-6 7B: Often fails or hallucinates a similar number
Mamba 7B: Sometimes succeeds (selective state helps)
Root cause: The exponential decay in WKV means old information is
progressively "forgotten" unless it strongly activates key channels.
The fixed state size (d dimensions) cannot store arbitrary facts.
2. Complex Multi-hop Reasoning
Tasks requiring simultaneous reference to multiple distant context pieces are challenging:
Prompt: "Alice gave Bob a red ball. Charlie gave Diana a blue cube.
... (500 tokens of distraction) ...
Eve traded her green cone for the object Diana has.
What color is the object Eve now has?"
Transformer: "Blue" (correct -- attends to both relevant sentences)
RWKV: "Green" or "Red" (may lose track of multi-hop chain)
3. Sensitivity to Prompt Formatting
RWKV models are notably more sensitive to prompt format than Transformers. The ordered, sequential nature of the RNN means that how information is presented matters more:
# This format works well with RWKV
prompt_good = "User: What is the capital of France?\n\nAssistant:"
# This format may produce worse results
prompt_bad = "capital of france?"
# RWKV is sensitive to newlines, spacing, and role markers.
# Always use consistent chat templates when deploying RWKV models.
Transformers, which attend over the full context at every step, are naturally more robust to small variations in formatting. RWKV's strictly sequential recurrence means every token irreversibly updates the state, so how tokens are ordered, spaced, and labeled has a larger effect on the result.
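A small helper keeps the template consistent across turns (the "User:/Assistant:" format shown matches the convention used earlier in this article for RWKV World models; check your specific checkpoint's expected template, as it is model-dependent):

```python
def format_rwkv_prompt(turns):
    """Render a conversation as a consistent RWKV-style prompt.

    turns: list of (role, text) pairs, role in {"user", "assistant"}
    """
    parts = []
    for role, text in turns:
        label = "User" if role == "user" else "Assistant"
        parts.append(f"{label}: {text.strip()}")
    # Trailing "Assistant:" cues the model to produce the next reply
    return "\n\n".join(parts) + "\n\nAssistant:"
```

Centralizing the template like this avoids the subtle whitespace drift (extra newlines, missing role markers) that degrades RWKV output quality.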
4. State Size vs Information Capacity
The fixed hidden state creates a fundamental bottleneck. For a model with embedding dimension d = 2048 and h = 32 heads (RWKV-v5+), the total state per layer is h * (d/h)^2 = 32 * 64^2 = 131,072 floating-point numbers. While substantial, this is finite and cannot scale with input length the way a Transformer's KV cache does.
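The arithmetic generalizes to any configuration; the sketch below reproduces the WKV-state formula from the text, while the token-shift term is my assumption about the remaining per-layer state (each sub-layer keeps the previous token's activations):

```python
def rwkv_state_elements(n_layer, n_embd, head_size=64):
    """Estimate total per-model recurrent state elements for RWKV v5+."""
    h = n_embd // head_size
    wkv_state = h * head_size * head_size  # h heads, each (d/h) x (d/h) matrix
    shift_state = 2 * n_embd               # x_{t-1} for time mix + channel mix (assumption)
    return n_layer * (wkv_state + shift_state)
```

For the d = 2048, 32-head example this gives 131,072 WKV elements per layer, and the total grows with depth and width but never with context length.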
5. Ecosystem and Tooling Gap
Despite growing community support, RWKV's ecosystem is still smaller than the Transformer ecosystem. Fewer pre-trained checkpoints, limited RLHF/DPO-tuned variants, and less tooling for deployment (compared to vLLM, TensorRT-LLM for Transformers) remain practical barriers to adoption.
Practical Deployment Tips
State Management for Long Conversations
import torch

class RWKVChatSession:
    """Manage RWKV state across a multi-turn conversation."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.state = None  # Will hold the running RNN state

    def chat(self, user_message):
        # Format the message
        prompt = f"User: {user_message}\n\nAssistant:"
        input_ids = self.tokenizer.encode(prompt)

        # Feed prompt tokens through the model, updating the state
        for token_id in input_ids:
            logits, self.state = self.model.forward(token_id, self.state)

        # Generate response tokens
        output_ids = []
        for _ in range(500):
            token_id = self.sample(logits, temperature=0.8)
            if token_id == self.tokenizer.eos_token_id:
                break
            output_ids.append(token_id)
            logits, self.state = self.model.forward(token_id, self.state)
        return self.tokenizer.decode(output_ids)

    def save_state(self, path):
        """Save conversation state to disk for later resumption."""
        torch.save(self.state, path)

    def load_state(self, path):
        """Resume conversation from saved state."""
        self.state = torch.load(path)

    @staticmethod
    def sample(logits, temperature=1.0, top_p=0.9):
        probs = torch.softmax(logits / temperature, dim=-1)
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cumsum = torch.cumsum(sorted_probs, dim=-1)
        mask = cumsum - sorted_probs > top_p  # drop tokens outside the nucleus
        sorted_probs[mask] = 0.0
        sorted_probs /= sorted_probs.sum()
        idx = torch.multinomial(sorted_probs, 1)
        return sorted_indices[idx].item()
Quantization for Edge Deployment
RWKV models are well-suited for quantization due to their simple architecture (no complex attention patterns to preserve):
# Using the rwkv.cpp project for CPU inference
git clone https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp
# Quantize model to INT4
python python/convert_pytorch_to_ggml.py \
/path/to/rwkv-model.pth \
/path/to/output.bin \
FP16
python python/quantize.py \
/path/to/output.bin \
/path/to/output-q4_0.bin \
Q4_0
# Run inference on CPU
python python/chat.py \
--model /path/to/output-q4_0.bin
Conclusion
RWKV represents a genuinely novel point in the design space of sequence models. By combining the parallelizable training of Transformers with the constant-memory inference of RNNs, it offers a compelling alternative for scenarios where inference efficiency matters -- edge deployment, long-context applications, real-time generation, and resource-constrained environments.
The architecture is not without trade-offs: needle-in-a-haystack retrieval, multi-hop reasoning over distant context, and prompt sensitivity remain areas where Transformers excel. However, the rapid evolution from v4 through v7 Goose shows that the RWKV community is actively addressing these limitations, with dynamic state evolution and the generalized delta rule significantly improving expressive power.
For practitioners, RWKV is worth serious consideration when:
- Inference cost or latency is a primary concern
- Context lengths exceed 32K tokens regularly
- Deployment targets include edge devices or limited-VRAM GPUs
- Streaming or real-time text generation is required
The field of efficient sequence modeling is evolving rapidly, with RWKV, Mamba, and hybrid architectures (like Jamba, which combines Mamba with Transformer layers) all competing to dethrone the pure Transformer. RWKV's unique position as a "Transformer-trained RNN" gives it distinct advantages that are likely to keep it relevant as the architecture continues to mature.
References
- RWKV: Reinventing RNNs for the Transformer Era (arXiv: 2305.13048)
- RWKV-7 "Goose" with Expressive Dynamic State Evolution (arXiv: 2503.14456)
- BlinkDL/RWKV-LM -- Official GitHub Repository
- RWKV Language Model Wiki (Official Documentation)
- Introducing RWKV -- An RNN with the advantages of a transformer (HuggingFace Blog)
- RWKV Model Collection on HuggingFace
- A Survey of RWKV (arXiv: 2412.14847)
- The Full Stack -- RWKV, Explained
- RWKV-PEFT: Community Fine-Tuning Project
- RWKV Official Website