- Introduction
- What is KV Cache
- KV Cache Memory Consumption Analysis
- KV Cache Optimization Techniques
- Context Window Comparison by Model
- Long-Context Performance Benchmarks
- Cost-Performance Tradeoffs
- Latest Research Trends
- Practical Application Guide
- FAQ
- Does the LLM not work without KV Cache?
- Should I choose GQA or MQA?
- Does using 128K context always yield better results?
- Does PagedAttention affect inference quality?
- Can sliding window attention reference information from the distant past?
- Is KV Cache quantization the same as model weight quantization?
- References
- Conclusion
Introduction
Since 2024, LLM context windows have expanded explosively. From GPT-4 Turbo's 128K to Claude 3's 200K and Gemini 1.5 Pro's 1M+ tokens, the ability to process long contexts has become a core competitive differentiator for models. However, expanding context windows is far more than simply increasing a number. Behind the scenes lies a memory bottleneck called the KV Cache (Key-Value Cache), and managing it efficiently is the central challenge of production deployment.
This article covers KV Cache fundamentals, memory consumption calculation formulas, and the latest optimization techniques including Multi-Query Attention (MQA), Grouped-Query Attention (GQA), PagedAttention, sliding window attention, and Ring Attention. We also examine long-context performance evaluation through Needle-in-a-Haystack tests and cost-performance tradeoffs in production environments.
What is KV Cache
Autoregressive Generation and KV Cache in Transformers
Transformer-based LLMs generate tokens one at a time, sequentially (autoregressive generation). Each new token requires attention over all previous tokens. If the Keys and Values are recomputed from scratch at every step, each generation step costs O(n^2), so producing an n-token sequence costs O(n^3) in total.
KV Cache solves this problem. It stores the Key and Value tensors of previous tokens in cache, and when generating a new token, retrieves previous Key/Value from cache instead of recomputing them, performing attention computation only with the current token's Query.
# KV Cache operation in autoregressive generation
Step 1: "The" → Generate K1, V1 and store in cache
Step 2: "cat" → Generate K2, V2 + attention with cache(K1,V1)
Step 3: "sat" → Generate K3, V3 + attention with cache(K1,V1,K2,V2)
Step 4: "on" → Generate K4, V4 + attention with cache(K1..V3)
...
Step N: "mat" → Generate KN, VN + attention with cache(K1..V(N-1))
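The decode loop above can be sketched in a few lines of NumPy. This is a toy single-head example with made-up dimensions; a real model applies the same caching per head and per layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # head dimension (toy)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)       # (1, t): one query vs t cached keys
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V            # (1, d)

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):                   # decode 5 tokens
    x = rng.standard_normal((1, d))     # current token's hidden state
    K_cache = np.vstack([K_cache, x @ Wk])  # append once, never recompute
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attend(x @ Wq, K_cache, V_cache)

print(K_cache.shape)  # → (5, 8): one cached K row per generated token
```

Each step projects only the newest token into K/V and appends it; the quadratic recomputation disappears, at the cost of the growing cache arrays.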
What Happens Without KV Cache
Without KV Cache, the Key and Value must be recomputed for the entire sequence at each token generation step. For a 1000-token sequence:
- With KV Cache: only the new token's K/V are computed, and its query attends over the cache → O(n) per step, O(n^2) total
- Without KV Cache: full self-attention is recomputed at every step → O(n^2) per step, O(n^3) total
As a result, KV Cache improves inference speed by tens of times, but at the cost of consuming large amounts of GPU memory.
KV Cache Memory Consumption Analysis
Memory Calculation Formula
KV Cache memory usage can be calculated precisely with the following formula (for standard multi-head attention; for GQA/MQA models, replace d_model with n_kv_heads x d_head):
KV Cache Memory = 2 x n_layers x d_model x seq_len x batch_size x sizeof(dtype)
Variable descriptions:
| Variable | Description | Example Values |
|---|---|---|
| 2 | Storing both Key and Value | Constant |
| n_layers | Number of Transformer layers | 32 (LLaMA-7B), 80 (LLaMA-70B) |
| d_model | Model hidden dimension | 4096 (LLaMA-7B), 8192 (LLaMA-70B) |
| seq_len | Sequence length (context window) | 4096, 128K, 1M |
| batch_size | Concurrent batch size | 1-64 |
| sizeof(dtype) | Data type size (bytes) | 2 (FP16), 1 (INT8) |
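The formula translates directly into code. A small helper (names are illustrative) reproduces the figures worked out by hand in the next section:

```python
def kv_cache_bytes(n_layers, d_model, seq_len, batch_size, dtype_bytes=2):
    # Leading 2 = one Key tensor plus one Value tensor per layer
    return 2 * n_layers * d_model * seq_len * batch_size * dtype_bytes

GiB = 1024 ** 3
# LLaMA-2 7B-style config: 32 layers, d_model=4096, FP16
print(kv_cache_bytes(32, 4096, 4096, 1) / GiB)    # → 2.0
print(kv_cache_bytes(32, 4096, 131072, 1) / GiB)  # → 64.0
```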
KV Cache Memory Examples by Model
Calculating for LLaMA-2 7B (32 layers, d_model=4096, FP16):
seq_len=4096, batch=1:
2 x 32 x 4096 x 4096 x 1 x 2 bytes = 2 GB
seq_len=128K, batch=1:
2 x 32 x 4096 x 131072 x 1 x 2 bytes = 64 GB ← Impossible with a single GPU!
seq_len=128K, batch=8:
2 x 32 x 4096 x 131072 x 8 x 2 bytes = 512 GB ← Requires large-scale cluster
LLaMA-2 70B (80 layers, d_model=8192, FP16, assuming full MHA):
seq_len=4096, batch=1:
2 x 80 x 8192 x 4096 x 1 x 2 bytes = 10 GB
seq_len=128K, batch=1:
2 x 80 x 8192 x 131072 x 1 x 2 bytes = 320 GB ← Separate from model weights!
(The released LLaMA-2 70B actually uses GQA-8, which cuts these figures by a further 8x; the MHA-equivalent numbers show why that optimization was necessary.)
As shown, KV Cache can consume far more memory than model weights in long-context scenarios, which is why optimization is essential.
KV Cache vs Model Weights Memory Comparison
| Item | LLaMA-7B (FP16) | LLaMA-70B (FP16, MHA-equivalent) |
|---|---|---|
| Model Weights | ~14 GB | ~140 GB |
| KV Cache (4K, batch=1) | ~2 GB | ~10 GB |
| KV Cache (32K, batch=1) | ~16 GB | ~80 GB |
| KV Cache (128K, batch=1) | ~64 GB | ~320 GB |
| KV Cache (128K, batch=8) | ~512 GB | ~2.56 TB |
KV Cache Optimization Techniques
Multi-Query Attention (MQA)
Core Idea: All attention heads share the same Key and Value, while only Query remains different per head.
# Standard Multi-Head Attention (MHA)
# Each head has separate Q, K, V
# With n_heads = 32, KV Cache stores 32 sets of K,V
# Multi-Query Attention (MQA)
# 32 Query heads, but only 1 set of K and V
# KV Cache size reduced to 1/32!
Memory Savings:
| Method | KV Heads | KV Cache Size (Relative) |
|---|---|---|
| MHA | n_heads (e.g., 32) | 1x |
| MQA | 1 | 1/32x (~97% reduction) |
Advantages: Dramatic KV Cache memory reduction, faster decoding.
Disadvantages: Potential quality degradation (especially on complex reasoning tasks).
Applied Models: PaLM, StarCoder, Falcon
Reference: Shazeer (2019), "Fast Transformer Decoding: One Write-Head is All You Need"
Grouped-Query Attention (GQA)
Core Idea: A middle ground between MQA and MHA. Query heads are divided into groups, with each group sharing one Key-Value head.
# Grouped-Query Attention (GQA)
# n_heads = 32, n_kv_heads = 8 (4 Q heads share 1 KV head)
# KV Cache size reduced to 1/4
# GQA-8: 8 KV groups → 1/4 size vs MHA
# GQA-4: 4 KV groups → 1/8 size vs MHA
# GQA-1: 1 KV group → Same as MQA
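The grouping can be sketched as a NumPy decode step (illustrative dimensions, random tensors instead of learned projections): 8 cached KV heads serve 32 query heads, and only the 8-head tensors would ever live in the cache.

```python
import numpy as np

n_heads, n_kv_heads, d_head, t = 32, 8, 128, 16
group = n_heads // n_kv_heads                  # 4 query heads per KV head

q = np.random.randn(n_heads, 1, d_head)        # one decode step's queries
K = np.random.randn(n_kv_heads, t, d_head)     # cached: 8 KV heads, not 32
V = np.random.randn(n_kv_heads, t, d_head)

# Materialize each KV head for its group of query heads (compute-time only)
K_exp = np.repeat(K, group, axis=0)            # (32, t, d_head)
V_exp = np.repeat(V, group, axis=0)

scores = q @ K_exp.transpose(0, 2, 1) / np.sqrt(d_head)   # (32, 1, t)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
out = (w / w.sum(axis=-1, keepdims=True)) @ V_exp         # (32, 1, d_head)

print(K.nbytes / K_exp.nbytes)  # → 0.25: the cache holds 1/4 the KV data
```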
Memory Savings vs Quality Tradeoff:
| Method | n_kv_heads | Memory Savings | Quality Impact |
|---|---|---|---|
| MHA | 32 | 0% | Baseline |
| GQA-8 | 8 | 75% | Negligible |
| GQA-4 | 4 | 87.5% | Minimal |
| GQA-2 | 2 | 93.75% | Slight |
| MQA (GQA-1) | 1 | 96.875% | Noticeable |
Applied Models: LLaMA 2 70B (GQA-8), LLaMA 3, Mistral, Gemma
Reference: Ainslie et al. (2023), "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints"
PagedAttention
Core Idea: Applies the operating system's virtual memory paging concept to KV Cache management. The KV Cache is managed in fixed-size blocks (pages) rather than contiguous memory.
Traditional approach:
[Request 1 KV Cache (contiguous allocation, max length reserved)]
[ Wasted space ]
[Request 2 KV Cache (contiguous allocation, max length reserved)]
[ Wasted space ]
PagedAttention:
[Page 1: Req1] [Page 2: Req2] [Page 3: Req1] [Page 4: Req2]
[Page 5: Req1] [Page 6: Req3] [Page 7: Req2] [Page 8: Req3]
→ Allocates only as needed in block units, minimizing internal fragmentation
Key Benefits:
- Reduced memory waste: near-zero fragmentation, versus the 60-80% of KV memory typically wasted by contiguous max-length allocation
- Higher batch sizes: 2-4x more concurrent requests on the same GPU memory
- Copy-on-Write: KV Cache sharing possible between requests with identical prompts
Applied In: vLLM, SGLang, TensorRT-LLM
Reference: Kwon et al. (2023), "Efficient Memory Management for Large Language Model Serving with PagedAttention"
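A toy allocator can illustrate the block-table idea: pages are handed out on demand, and a logical token position maps to a (physical block, offset) pair. Class and method names here are hypothetical, not vLLM's API.

```python
BLOCK_SIZE = 16  # tokens per KV page (vLLM's default block size is 16)

class PagedKVAllocator:
    """Toy block-table bookkeeping in the spirit of PagedAttention."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # request id -> [block ids]
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:              # last page full (or no page yet)
            self.block_tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def slot(self, req, pos):
        """Map a logical token position to (physical block, offset)."""
        return self.block_tables[req][pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("A")                  # request A: 20 tokens -> 2 pages
for _ in range(5):
    alloc.append_token("B")                  # request B: 5 tokens -> 1 page

print(len(alloc.block_tables["A"]), len(alloc.block_tables["B"]))  # → 2 1
```

The waste per request is bounded by one partially filled page instead of the full max-length reservation, which is where the capacity gains come from.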
KV Cache Quantization
Quantizing KV Cache from FP16 to INT8 or INT4 can immediately save 50-75% memory.
# KV Cache quantization example (conceptual code)
# FP16 KV Cache: 2 bytes per value
# INT8 KV Cache: 1 byte per value (50% savings)
# INT4 KV Cache: 0.5 bytes per value (75% savings)
# With quantization on LLaMA-7B 128K context:
# FP16: 64 GB → INT8: 32 GB → INT4: 16 GB
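Per-token symmetric INT8 quantization, the first technique in the table below, can be sketched as follows (illustrative code; FP32 is used as the reference dtype for simplicity):

```python
import numpy as np

def quantize_per_token(x):
    # One scale per token row: the row's max magnitude maps to 127
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 128)).astype(np.float32)  # 4 tokens of KV data
q, scale = quantize_per_token(kv)
err = np.abs(dequantize(q, scale) - kv).max()

print(q.nbytes / kv.nbytes)  # → 0.25 vs FP32 (would be 0.5 vs FP16)
```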
Key Techniques:
| Technique | Description | Quality Impact |
|---|---|---|
| Per-token INT8 | Scale factor maintained per token | Minimal |
| Per-channel INT8 | Scale factor per channel | Negligible |
| KV Cache INT4 | 4-bit quantization | Task-dependent |
| KIVI | Different bit widths for Key and Value | Minimal |
Note: Key tensors tend to be more sensitive to quantization than Values, so asymmetric quantization approaches like KIVI (INT8 for Keys, INT4 for Values) are effective.
Sliding Window Attention
Core Idea: Computes attention only over tokens within a fixed-size window, rather than the entire sequence.
Full Attention:
Token 100 attends to all Tokens 1-99 → Maintains 100 KV Cache entries
Sliding Window (window_size=32):
Token 100 attends only to Tokens 69-99 → Maintains only 32 KV Cache entries
→ Memory reduction from O(n) to O(window_size)!
Advantages: KV Cache size is fixed regardless of sequence length.
Disadvantages: Cannot directly access information from distant tokens outside the window (it can only propagate indirectly through layer stacking).
Applied Models: Mistral 7B (window_size=4096), Longformer (local + global hybrid)
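The fixed-size cache behaves like a bounded queue, so a deque with maxlen captures the eviction policy directly (toy sketch, storing placeholder strings instead of K/V tensors):

```python
from collections import deque

window = 32
kv_cache = deque(maxlen=window)   # oldest entry is evicted automatically

for pos in range(100):            # pretend to decode 100 tokens
    kv_cache.append(("K%d" % pos, "V%d" % pos))

print(len(kv_cache))      # → 32, regardless of sequence length
print(kv_cache[0][0])     # → 'K68': the oldest retained token
```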
Ring Attention and Sequence Parallelism
Core Idea: Splits the sequence across multiple devices, circulating Key-Value blocks in a ring pattern while computing attention.
Device 0: [Seq chunk 0] ←→ KV blocks rotate
Device 1: [Seq chunk 1] ←→ KV blocks rotate
Device 2: [Seq chunk 2] ←→ KV blocks rotate
Device 3: [Seq chunk 3] ←→ KV blocks rotate
→ Each device computes attention for its sequence chunk
→ KV blocks circulate in a ring to complete full-sequence attention
→ Context length scales proportionally with the number of devices
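The ring schedule can be simulated on a single process with NumPy: KV blocks rotate between "devices" while each device's queries accumulate a blockwise (online) softmax, recovering exactly the full-sequence attention result. Non-causal attention and toy sizes are used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dev, chunk, d = 4, 8, 16
Q = rng.standard_normal((n_dev, chunk, d))   # each device holds one chunk
K = rng.standard_normal((n_dev, chunk, d))
V = rng.standard_normal((n_dev, chunk, d))

acc = np.zeros((n_dev, chunk, d))            # unnormalized output
m = np.full((n_dev, chunk), -np.inf)         # running row max
l = np.zeros((n_dev, chunk))                 # running softmax denominator

kv = list(range(n_dev))                      # which KV block each device holds
for _ in range(n_dev):                       # n_dev rotation steps
    for dev in range(n_dev):
        Kb, Vb = K[kv[dev]], V[kv[dev]]
        s = Q[dev] @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m[dev], s.max(axis=1))
        corr = np.exp(m[dev] - m_new)        # rescale old partial sums
        p = np.exp(s - m_new[:, None])
        l[dev] = l[dev] * corr + p.sum(axis=1)
        acc[dev] = acc[dev] * corr[:, None] + p @ Vb
        m[dev] = m_new
    kv = kv[1:] + kv[:1]                     # pass KV blocks around the ring

ring_out = acc / l[..., None]

# Reference: full attention over the whole concatenated sequence
Qf, Kf, Vf = (x.reshape(-1, d) for x in (Q, K, V))
s = Qf @ Kf.T / np.sqrt(d)
w = np.exp(s - s.max(axis=1, keepdims=True))
full = (w / w.sum(axis=1, keepdims=True)) @ Vf

print(np.allclose(ring_out.reshape(-1, d), full))  # → True
```

In a real deployment the rotation is an actual device-to-device transfer overlapped with the blockwise compute, but the accumulation math is the same.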
Key Benefits:
- Process ultra-long contexts beyond single GPU memory limits
- N devices can handle N times the sequence length
- Communication costs overlapped with computation for maximum efficiency
Applied In: Google's internal model training, academic research stage
Reference: Liu et al. (2023), "Ring Attention with Blockwise Transformers for Near-Infinite Context"
Context Window Comparison by Model
Major LLM Context Window Sizes (as of March 2026)
| Model | Context Window | KV Cache Optimization | Release Date |
|---|---|---|---|
| GPT-4 Turbo | 128K | MQA + internal optimization | 2023.11 |
| GPT-4o | 128K | MQA + internal optimization | 2024.05 |
| Claude 3 Opus/Sonnet | 200K | Undisclosed | 2024.03 |
| Claude 3.5 Sonnet | 200K | Undisclosed | 2024.06 |
| Gemini 1.5 Pro | 1M+ (up to 2M) | Ring Attention family (estimated) | 2024.02 |
| Gemini 2.0 | 1M+ | Internal optimization | 2024.12 |
| LLaMA 3.1 405B | 128K | GQA-8 | 2024.07 |
| Mistral Large | 128K | GQA + SWA | 2024.02 |
| Qwen 2.5 | 128K | GQA | 2024.09 |
| Yi-Lightning | 200K+ | GQA + internal optimization | 2024.05 |
| DeepSeek-V3 | 128K | MLA (Multi-head Latent Attention) | 2024.12 |
| Jamba 1.5 | 256K | SSM-Attention hybrid | 2024.08 |
Context Window Expansion Timeline
2020: GPT-3 2K tokens
2022: GPT-3.5 4K tokens
2023.03: GPT-4 8K / 32K tokens
2023.07: Claude 2 100K tokens
2023.11: GPT-4 Turbo 128K tokens
2024.02: Gemini 1.5 1M tokens
2024.03: Claude 3 200K tokens
2024.07: LLaMA 3.1 128K tokens
2024.12: Gemini 2.0 1M+ tokens
2025+: Various models 128K-2M+ tokens
Long-Context Performance Benchmarks
Needle-in-a-Haystack (NIAH) Test
Needle-in-a-Haystack evaluates how accurately a model can find specific information hidden within a long context.
Test Methodology:
- Prepare a long text (Haystack), e.g., 128K tokens of essays
- Insert a specific fact (Needle) at a particular position in the text
- Ask the model about that fact
- Measure accuracy while varying insertion position (beginning/middle/end) and total length
Key Findings:
| Observation | Description |
|---|---|
| "Lost in the Middle" phenomenon | Retrieval accuracy for information inserted in the middle is lower than beginning/end |
| Inverse relationship between length and accuracy | Overall retrieval accuracy degrades as context grows longer |
| Model-specific differences | Claude 3 and Gemini 1.5 Pro maintain near 100% accuracy on NIAH |
| Multi-needle retrieval | Performance degradation is more pronounced when multiple pieces of information must be found |
Length-Dependent Performance Degradation Patterns
Typical NIAH accuracy curves from 4K to 256K separate models into three tiers:
- Top-performing models (Claude 3, Gemini 1.5 Pro): accuracy stays near its 4K level through 128K and beyond
- Mid-range models: accuracy begins to drop noticeably past 32-64K
- Models without long-context support: accuracy collapses well before 64K
Long-Context Usage Patterns in Production
| Use Case | Recommended Context Length | Considerations |
|---|---|---|
| Code review (single file) | 8-32K | Sufficient with most models |
| Code review (entire repo) | 64-200K | Consider combining with RAG |
| Document summarization | 32-128K | Monitor for quality degradation |
| Conversation history analysis | 16-64K | Combine with summary-based compression |
| Legal document analysis | 128K-1M | Requires longest context models |
| Full codebase analysis | 200K-1M+ | Must evaluate cost-effectiveness |
Cost-Performance Tradeoffs
Cost Impact of Long Context
Costs grow steeply with context length: API billing scales roughly linearly with input tokens, while prefill latency and memory pressure grow super-linearly:
| Item | 4K Context | 128K Context | Multiplier |
|---|---|---|---|
| KV Cache Memory | 1x | 32x | 32x |
| Prefill Latency | ~100ms | ~3-5s | 30-50x |
| API Cost (input tokens) | $0.01 | $0.32 | 32x |
| GPU Memory Footprint | Low | Very High | - |
| Concurrent Request Capacity | High | Very Low | - |
Memory Savings with Combined Optimization Techniques
Baseline: LLaMA-70B, 128K context, batch=1, FP16, MHA-equivalent
Base KV Cache: 320 GB
+ GQA-8 applied (64 query heads → 8 KV heads): 40 GB (87.5% reduction)
+ INT8 quantization: 20 GB (93.75% reduction)
+ INT4 quantization instead of INT8: 10 GB (~96.9% reduction)
+ PagedAttention: further cuts effective usage by eliminating fragmentation waste
+ Sliding window: Limited (task-dependent)
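The reductions stack multiplicatively because they act on different axes (number of KV heads vs. bytes per value), under the assumption that head sharing and quantization compose independently. A quick check with the memory formula, using a LLaMA-7B-style config at 128K:

```python
GiB = 1024 ** 3

# LLaMA-7B-style config: 32 layers, d_model=4096, 32 Q heads, FP16, 128K ctx
base = 2 * 32 * 4096 * 131072 * 1 * 2 / GiB   # MHA FP16 baseline

gqa8 = base / (32 / 8)   # share KV across head groups: 1/4 the KV heads
int8 = gqa8 / 2          # FP16 -> INT8 halves bytes per value
int4 = gqa8 / 4          # FP16 -> INT4 quarters bytes per value

print(base, gqa8, int8, int4)  # → 64.0 16.0 8.0 4.0
```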
Latest Research Trends
Multi-head Latent Attention (MLA)
MLA, introduced in DeepSeek-V2, compresses KV Cache into a lower-dimensional latent space. Keys and Values are compressed into combined latent vectors for storage and reconstructed during inference.
Traditional GQA: Store K, V in groups
MLA: Compress K+V into combined low-dimensional latent vectors
→ Additional 50-70% memory savings over GQA
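A conceptual sketch of the compression (illustrative dimensions and random projections, not DeepSeek's actual architecture): cache one low-dimensional latent per token and re-expand it into K and V at attention time.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_latent, t = 4096, 512, 16       # toy sizes: one layer, t tokens
W_down = rng.standard_normal((d_model, d_latent)) * 0.02  # compression
W_uk = rng.standard_normal((d_latent, d_model)) * 0.02    # K up-projection
W_uv = rng.standard_normal((d_latent, d_model)) * 0.02    # V up-projection

x = rng.standard_normal((t, d_model))      # hidden states for t tokens
latent_cache = x @ W_down                  # cached: (t, 512), not 2x(t, 4096)
K = latent_cache @ W_uk                    # re-expanded at attention time
V = latent_cache @ W_uv

ratio = latent_cache.nbytes / (K.nbytes + V.nbytes)
print(ratio)  # → 0.0625: the cache is 16x smaller than storing K and V
```

The tradeoff is extra up-projection compute per attention call in exchange for the much smaller cache footprint.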
StreamingLLM
Enables infinite streaming without completely discarding the KV Cache. Leverages the "Attention Sink" phenomenon to maintain only the KV Cache for the first few tokens plus a recent window.
# StreamingLLM KV Cache management (conceptual)
# Full sequence: [t1, t2, t3, ..., t100, ..., t10000]
# Actually retained: [t1, t2, t3, t4] + [t9997, t9998, t9999, t10000]
# ↑ Attention Sink ↑ Sliding Window
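The retention policy reduces to keeping two index ranges; a minimal sketch with assumed sink and window sizes:

```python
def streaming_keep(seq_len, n_sink=4, window=4):
    """Return the token positions whose KV entries are retained."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))        # short sequence: keep everything
    # Attention-sink tokens at the start + the most recent window
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

print(streaming_keep(10000))
# → [0, 1, 2, 3, 9996, 9997, 9998, 9999]
```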
Infini-Attention
Proposed by Google Research, this approach combines compressive memory with local attention. Past information accumulates in compressed memory while local attention handles recent context.
Practical Application Guide
Recommended Optimization Strategy by Scenario
Single GPU, Short Context (4-8K):
- Use models with GQA
- FP16 KV Cache is sufficient
- vLLM + PagedAttention default settings
Single GPU, Medium Context (32-64K):
- GQA + KV Cache INT8 quantization
- Memory optimization with PagedAttention
- Set --gpu-memory-utilization=0.9 in vLLM
Multi-GPU, Long Context (128K+):
- GQA + KV Cache INT4/INT8 quantization
- Tensor parallelism + PagedAttention
- Ring Attention (research/experimental stage)
Ultra-Long Context (1M+):
- Use managed services like Gemini API
- RAG + Long-context hybrid approach
- Carefully evaluate cost-effectiveness
KV Cache Monitoring Checklist
# Real-time GPU memory usage monitoring
watch -n 1 nvidia-smi
# Check KV Cache utilization in vLLM
# Check from vLLM server's /metrics endpoint
curl http://localhost:8000/metrics | grep kv_cache
# Key monitoring metrics
# - gpu_cache_usage_perc: KV Cache GPU memory utilization
# - num_running_requests: Number of concurrent requests being processed
# - prompt_tokens_total: Total prompt tokens processed
FAQ
Does the LLM not work without KV Cache?
It does work, but it is not practical. Without KV Cache, the entire sequence must be recomputed at every generation step, making a 1000-token generation tens to hundreds of times slower. Outside of small research experiments, production environments always use KV Cache.
Should I choose GQA or MQA?
Most major models released since 2024 have adopted GQA. GQA provides a good balance between MQA's extreme memory savings and MHA's quality. In practice, simply use models that already have GQA applied (LLaMA 3, Mistral, etc.).
Does using 128K context always yield better results?
No. As shown in Needle-in-a-Haystack tests, the "Lost in the Middle" phenomenon can degrade information retrieval performance in the middle sections as context grows longer. Additionally, costs increase linearly, and prefill latency increases significantly. Using only as much context as needed is more efficient.
Does PagedAttention affect inference quality?
No. PagedAttention only changes the memory management approach, not the attention computation itself. It produces mathematically identical results while only improving memory efficiency.
Can sliding window attention reference information from the distant past?
Not directly, but information can propagate indirectly if there are enough layers. For example, with a window size of 4096 and 32 layers, information from a range of 4096 x 32 = 131,072 tokens can theoretically propagate indirectly. However, in practice, information dilutes as it passes through layers, so there are limitations for accurate retrieval of distant past information.
Is KV Cache quantization the same as model weight quantization?
They are different. Model weight quantization (GPTQ, AWQ, GGUF, etc.) compresses trained parameters, while KV Cache quantization compresses cache data dynamically generated during inference. Both can be applied simultaneously, and the memory savings effects are cumulative.
References
- Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150
- Ainslie, J. et al. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv:2305.13245
- Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180
- Liu, H. et al. (2023). "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv:2310.01889
- Xiao, G. et al. (2023). "Efficient Streaming Language Models with Attention Sinks." arXiv:2309.17453
- Liu, Z. et al. (2024). "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." arXiv:2402.02750
- DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434
- Munkhdalai, T. et al. (2024). "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention." arXiv:2404.07143
- Kamradt, G. (2023). "Needle In A Haystack - Pressure Testing LLMs." GitHub Repository
- vLLM Project. "vLLM: Easy, Fast, and Cheap LLM Serving." GitHub Repository
Conclusion
KV Cache optimization is the key technology for leveraging LLM long-context capabilities in production. No single technique is sufficient alone, and combining GQA + PagedAttention + KV Cache quantization is currently the most practical approach.
Key takeaways:
- KV Cache is essential for inference speed, but it is the primary memory bottleneck in long-context scenarios.
- GQA is the current industry standard. It provides the optimal balance between MHA quality and MQA efficiency.
- PagedAttention is essential. It can be applied immediately through vLLM, SGLang, and similar tools.
- KV Cache quantization is an effective means for additional memory savings.
- Longer context is not always the answer. Cost, latency, and accuracy must be considered holistically.
- A hybrid approach combining RAG and long context is the optimal choice for most production scenarios.
As new approaches like MLA and Infini-Attention become production-ready, the cost-performance tradeoff of long context will continue to improve. Staying current with technology trends and continuous benchmarking remain important.