
Complete Guide to LLM Long-Context Performance and KV Cache Optimization: From MQA to Ring Attention


Introduction

Since 2024, LLM context windows have expanded explosively. From GPT-4 Turbo's 128K to Claude 3's 200K and Gemini 1.5 Pro's 1M+ tokens, the ability to process long contexts has become a core competitive differentiator for models. However, expanding context windows is far more than simply increasing a number. Behind the scenes lies a memory bottleneck called the KV Cache (Key-Value Cache), and managing it efficiently is the central challenge of production deployment.

This article covers KV Cache fundamentals, memory consumption calculation formulas, and the latest optimization techniques including Multi-Query Attention (MQA), Grouped-Query Attention (GQA), PagedAttention, sliding window attention, and Ring Attention. We also examine long-context performance evaluation through Needle-in-a-Haystack tests and cost-performance tradeoffs in production environments.

What is KV Cache

Autoregressive Generation and KV Cache in Transformers

Transformer-based LLMs generate tokens one at a time (autoregressive generation). Producing each new token requires attention over all previous tokens; if the Keys and Values are recomputed from scratch at every step, each decoding step costs O(n^2), and generating an n-token sequence costs O(n^3) in total.

KV Cache solves this problem. It stores the Key and Value tensors of previous tokens in cache, and when generating a new token, retrieves previous Key/Value from cache instead of recomputing them, performing attention computation only with the current token's Query.

# KV Cache operation in autoregressive generation
Step 1: "The" → Generate K1, V1 and store in cache
Step 2: "cat" → Generate K2, V2 + attention with cache(K1, V1)
Step 3: "sat" → Generate K3, V3 + attention with cache(K1..V2)
Step 4: "on"  → Generate K4, V4 + attention with cache(K1..V3)
...
Step N: "mat" → Generate KN, VN + attention with cache(K1..V(N-1))
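The decode loop above can be sketched in plain Python. This is a toy single-head illustration (the `attend` helper and the 2-dimensional example vectors are made up for the sketch): each step appends the new token's K/V to the cache and attends with only the current Query, recomputing nothing.

```python
import math

def attend(q, keys, values):
    # Scaled dot-product attention of one query against all cached K/V.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += (w / z) * vi
    return out

# Toy decode loop: per step, append the new token's K/V to the cache,
# then attend with only the current token's Query.
k_cache, v_cache = [], []
steps = [([1.0, 0.0], [1.0, 0.0], [0.5, 0.5]),   # token "The"
         ([0.0, 1.0], [0.0, 1.0], [0.2, 0.8])]   # token "cat"
for q, k, v in steps:
    k_cache.append(k)
    v_cache.append(v)
    out = attend(q, k_cache, v_cache)
```

After N steps the cache holds N Key/Value pairs, and each step's attention cost is linear in the cache length rather than quadratic.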

What Happens Without KV Cache

Without KV Cache, the Key and Value must be recomputed for the entire sequence at each token generation step. For a 1000-token sequence:

  • With KV Cache: only the new token's K/V are computed → O(n) attention per decoding step
  • Without KV Cache: the full sequence is recomputed at every step → O(n^2) per step, O(n^3) for the whole generation

As a result, KV Cache improves inference speed by tens of times, but at the cost of consuming large amounts of GPU memory.

KV Cache Memory Consumption Analysis

Memory Calculation Formula

For standard multi-head attention, KV Cache memory usage can be calculated precisely with the following formula (for GQA/MQA models, replace d_model with n_kv_heads x d_head):

KV Cache Memory = 2 x n_layers x d_model x seq_len x batch_size x sizeof(dtype)

Variable descriptions:

| Variable | Description | Example Values |
|---|---|---|
| 2 | Storing both Key and Value | Constant |
| n_layers | Number of Transformer layers | 32 (LLaMA-7B), 80 (LLaMA-70B) |
| d_model | Model hidden dimension | 4096 (LLaMA-7B), 8192 (LLaMA-70B) |
| seq_len | Sequence length (context window) | 4096, 128K, 1M |
| batch_size | Concurrent batch size | 1-64 |
| sizeof(dtype) | Data type size (bytes) | 2 (FP16), 1 (INT8) |
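As a sanity check, the formula translates directly into a small Python helper. The function name and the optional GQA parameters are illustrative, not from any library:

```python
def kv_cache_bytes(n_layers, d_model, seq_len, batch_size=1,
                   dtype_bytes=2, n_heads=None, n_kv_heads=None):
    """KV Cache size in bytes. If n_heads and n_kv_heads are given
    (GQA/MQA), only that fraction of d_model is cached per token."""
    d_kv = d_model
    if n_heads and n_kv_heads:
        d_kv = d_model * n_kv_heads // n_heads
    return 2 * n_layers * d_kv * seq_len * batch_size * dtype_bytes

# LLaMA-2 7B (MHA), 4K context, FP16:
print(kv_cache_bytes(32, 4096, 4096) / 2**30)     # → 2.0 (GiB)

# Same model at 128K context:
print(kv_cache_bytes(32, 4096, 131072) / 2**30)   # → 64.0 (GiB)
```

The GQA path simply shrinks the cached dimension: with 8 KV heads out of 64, the same call returns one eighth of the full-MHA figure.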

KV Cache Memory Examples by Model

Calculating for LLaMA-2 7B (32 layers, d_model=4096, FP16):

seq_len=4096, batch=1:
2 x 32 x 4096 x 4096 x 1 x 2 bytes = 2 GB

seq_len=128K, batch=1:
2 x 32 x 4096 x 131072 x 1 x 2 bytes = 64 GB → Too large for a single GPU!

seq_len=128K, batch=8:
2 x 32 x 4096 x 131072 x 8 x 2 bytes = 512 GB → Requires a multi-GPU cluster

LLaMA-2 70B (80 layers, d_model=8192, FP16):

seq_len=4096, batch=1:
2 x 80 x 8192 x 4096 x 1 x 2 bytes = 10 GB

seq_len=128K, batch=1:
2 x 80 x 8192 x 131072 x 1 x 2 bytes = 320 GB → On top of the ~140 GB of model weights!

(These figures assume full MHA; the released LLaMA-2 70B uses GQA-8, which divides them by 8.)

As shown, KV Cache can consume far more memory than model weights in long-context scenarios, which is why optimization is essential.

KV Cache vs Model Weights Memory Comparison

| Item | LLaMA-7B (FP16) | LLaMA-70B (FP16) |
|---|---|---|
| Model Weights | ~14 GB | ~140 GB |
| KV Cache (4K, batch=1) | ~2 GB | ~10 GB |
| KV Cache (32K, batch=1) | ~16 GB | ~80 GB |
| KV Cache (128K, batch=1) | ~64 GB | ~320 GB |
| KV Cache (128K, batch=8) | ~512 GB | ~2.56 TB |

(70B figures assume full MHA as a baseline; GQA-8 divides them by 8.)

KV Cache Optimization Techniques

Multi-Query Attention (MQA)

Core Idea: All attention heads share the same Key and Value, while only Query remains different per head.

# Standard Multi-Head Attention (MHA)
# Each head has separate Q, K, V
# With n_heads = 32, KV Cache stores 32 sets of K,V

# Multi-Query Attention (MQA)
# 32 Query heads, but only 1 set of K and V
# KV Cache size reduced to 1/32!

Memory Savings:

| Method | KV Heads | KV Cache Size (Relative) |
|---|---|---|
| MHA | n_heads (e.g., 32) | 1x |
| MQA | 1 | 1/32x (~97% reduction) |

Advantages: Dramatic KV Cache memory reduction, faster decoding
Disadvantages: Potential quality degradation (especially on complex reasoning tasks)

Applied Models: PaLM, StarCoder, Falcon

Reference: Shazeer (2019), "Fast Transformer Decoding: One Write-Head is All You Need"

Grouped-Query Attention (GQA)

Core Idea: A middle ground between MQA and MHA. Query heads are divided into groups, with each group sharing one Key-Value head.

# Grouped-Query Attention (GQA)
# n_heads = 32, n_kv_heads = 8 (4 Q heads share 1 KV head)
# KV Cache size reduced to 1/4

# GQA-8: 8 KV groups → 1/4 size vs MHA
# GQA-4: 4 KV groups → 1/8 size vs MHA
# GQA-1: 1 KV group → Same as MQA
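The head-grouping rule reduces to a one-line index mapping: consecutive query heads form groups, and each group reads the same KV head. A minimal sketch (the helper name is hypothetical):

```python
def kv_head_for_query_head(q_head, n_heads, n_kv_heads):
    # Consecutive query heads form groups; each group shares one KV head.
    group_size = n_heads // n_kv_heads
    return q_head // group_size

# GQA-8 with 32 query heads: Q heads 0-3 -> KV head 0, heads 4-7 -> KV head 1, ...
gqa8 = [kv_head_for_query_head(h, n_heads=32, n_kv_heads=8) for h in range(32)]

# MQA is the n_kv_heads=1 special case: every query head shares KV head 0.
mqa = [kv_head_for_query_head(h, n_heads=32, n_kv_heads=1) for h in range(32)]
```

The KV Cache then stores only n_kv_heads head-slices per layer instead of n_heads, which is where the 1/4 (GQA-8) or 1/32 (MQA) savings come from.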

Memory Savings vs Quality Tradeoff:

| Method | n_kv_heads | Memory Savings | Quality Impact |
|---|---|---|---|
| MHA | 32 | 0% | Baseline |
| GQA-8 | 8 | 75% | Negligible |
| GQA-4 | 4 | 87.5% | Minimal |
| GQA-2 | 2 | 93.75% | Slight |
| MQA (GQA-1) | 1 | 96.875% | Noticeable |

Applied Models: LLaMA 2 70B (GQA-8), LLaMA 3, Mistral, Gemma

Reference: Ainslie et al. (2023), "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints"

PagedAttention

Core Idea: Applies the operating system's virtual memory paging concept to KV Cache management. The KV Cache is managed in fixed-size blocks (pages) rather than contiguous memory.

Traditional approach:
[Request 1 KV Cache (contiguous allocation, max length reserved)]
[         Wasted space                                          ]
[Request 2 KV Cache (contiguous allocation, max length reserved)]
[         Wasted space                                          ]

PagedAttention:
[Page 1: Req1] [Page 2: Req2] [Page 3: Req1] [Page 4: Req2]
[Page 5: Req1] [Page 6: Req3] [Page 7: Req2] [Page 8: Req3]
Allocates only as needed in block units, minimizing internal fragmentation
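The block-table bookkeeping can be illustrated with a toy allocator. The `PagedKVCache` class below is hypothetical; real systems like vLLM track far more state (reference counts for copy-on-write, per-layer tensors, GPU block pools):

```python
class PagedKVCache:
    """Toy block-table allocator: pages of `block_size` token slots are
    handed out on demand, so a request never reserves its max length."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical page ids
        self.tables = {}                      # request id -> list of page ids
        self.lengths = {}                     # request id -> tokens stored

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:          # current page full (or first token)
            self.tables.setdefault(req, []).append(self.free.pop(0))
        self.lengths[req] = n + 1

    def release(self, req):
        # Freed pages immediately become available to other requests
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("req1")   # 5 tokens occupy 2 pages, not a max-length slab
```

Because pages are allocated lazily and returned on completion, the only wasted space is the unused tail of each request's last page.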

Key Benefits:

  • Reduced memory waste: cuts KV Cache waste from the 60-80% typical of contiguous pre-allocation down to a few percent
  • Higher batch sizes: 2-4x more concurrent requests on the same GPU memory
  • Copy-on-Write: KV Cache sharing possible between requests with identical prompts

Applied In: vLLM, SGLang, TensorRT-LLM

Reference: Kwon et al. (2023), "Efficient Memory Management for Large Language Model Serving with PagedAttention"

KV Cache Quantization

Quantizing KV Cache from FP16 to INT8 or INT4 can immediately save 50-75% memory.

# KV Cache quantization example (conceptual code)
# FP16 KV Cache: 2 bytes per value
# INT8 KV Cache: 1 byte per value (50% savings)
# INT4 KV Cache: 0.5 bytes per value (75% savings)

# With quantization on LLaMA-7B 128K context:
# FP16: 64 GB → INT8: 32 GB → INT4: 16 GB
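A minimal sketch of per-token symmetric INT8 quantization (helper names are illustrative; production kernels fuse this into the attention op and store the scales alongside the cache blocks):

```python
def quantize_per_token(vec):
    # Symmetric INT8: one scale per token vector, values mapped to [-127, 127].
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale

def dequantize(q, scale):
    return [x * scale for x in q]

k = [0.5, -1.27, 0.03]           # one token's (tiny) key vector
q, s = quantize_per_token(k)     # stored: 1 byte per value + one FP scale
k_hat = dequantize(q, s)         # approximate reconstruction at attention time
```

Each cached token costs one byte per value plus a single scale factor, which is where the ~50% savings over FP16 comes from; INT4 packs two values per byte for ~75%.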

Key Techniques:

| Technique | Description | Quality Impact |
|---|---|---|
| Per-token INT8 | Scale factor maintained per token | Minimal |
| Per-channel INT8 | Scale factor per channel | Negligible |
| KV Cache INT4 | 4-bit quantization | Task-dependent |
| KIVI | Per-channel Keys, per-token Values (down to 2-bit) | Minimal |

Note: Key and Value caches have different statistics: Key tensors show pronounced outlier channels while Values do not, so asymmetric schemes like KIVI (per-channel quantization for Keys, per-token for Values) retain quality even at very low bit widths.

Sliding Window Attention

Core Idea: Computes attention only over tokens within a fixed-size window, rather than the entire sequence.

Full Attention:
Token 100 attends to all Tokens 1-99 → maintains 100 KV Cache entries

Sliding Window (window_size=32):
Token 100 attends only to Tokens 69-99 → maintains only 32 KV Cache entries
Memory reduction from O(n) to O(window_size)!

Advantages: KV Cache size is fixed regardless of sequence length
Disadvantages: Cannot directly access information from distant tokens outside the window (indirectly propagated through layer stacking)

Applied Models: Mistral 7B (window_size=4096), Longformer (local + global hybrid)
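The eviction policy amounts to a bounded queue over cache entries. The class below is a hypothetical illustration (not Mistral's implementation), using a `deque` with `maxlen` so the oldest entry is dropped automatically:

```python
from collections import deque

class SlidingWindowKVCache:
    """Caches K/V only for the most recent `window` tokens."""
    def __init__(self, window):
        # deque with maxlen evicts the oldest entry automatically on append
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

cache = SlidingWindowKVCache(window=32)
for t in range(100):
    cache.append(f"K{t}", f"V{t}")
# Only the K/V for tokens 68..99 remain; memory is O(window), not O(n)
```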

Ring Attention and Sequence Parallelism

Core Idea: Splits the sequence across multiple devices, circulating Key-Value blocks in a ring pattern while computing attention.

Device 0: [Seq chunk 0] ←→ KV blocks rotate
Device 1: [Seq chunk 1] ←→ KV blocks rotate
Device 2: [Seq chunk 2] ←→ KV blocks rotate
Device 3: [Seq chunk 3] ←→ KV blocks rotate

Each device computes attention for its sequence chunk
KV blocks circulate in a ring to complete full-sequence attention
Context length scales proportionally with the number of devices

Key Benefits:

  • Process ultra-long contexts beyond single GPU memory limits
  • N devices can handle N times the sequence length
  • Communication costs overlapped with computation for maximum efficiency

Applied In: Google's internal model training, academic research stage

Reference: Liu et al. (2023), "Ring Attention with Blockwise Transformers for Near-Infinite Context"
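The ring schedule can be sketched in pure Python under simplifying assumptions: full (non-causal) attention, no real inter-device communication ("devices" are simulated as list indices), and tiny dimensions. Each device folds the visiting K/V chunks into a running online-softmax accumulator (running max, denominator, weighted sum), so no device ever materializes the full sequence's K/V:

```python
import math

def chunk_scores(Q, K):
    # Scaled dot products of local queries against one visiting K chunk.
    d = len(Q[0])
    return [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
            for q in Q]

def ring_attention(q_chunks, k_chunks, v_chunks):
    n_dev = len(q_chunks)
    out = []
    for dev in range(n_dev):               # each "device" owns one Q chunk
        Q = q_chunks[dev]
        m = [float("-inf")] * len(Q)       # running max per query row
        l = [0.0] * len(Q)                 # running softmax denominator
        acc = [[0.0] * len(v_chunks[0][0]) for _ in Q]
        for step in range(n_dev):          # which K/V chunk visits at this step
            src = (dev + step) % n_dev
            S = chunk_scores(Q, k_chunks[src])
            for i, row in enumerate(S):
                new_m = max(m[i], max(row))
                scale = math.exp(m[i] - new_m) if m[i] != float("-inf") else 0.0
                l[i] *= scale              # rescale old accumulators to new max
                acc[i] = [a * scale for a in acc[i]]
                for s, v in zip(row, v_chunks[src]):
                    w = math.exp(s - new_m)
                    l[i] += w
                    acc[i] = [a + w * vj for a, vj in zip(acc[i], v)]
                m[i] = new_m
        out.append([[a / l[i] for a in acc[i]] for i in range(len(Q))])
    return out

q_chunks = [[[1.0, 0.0]], [[0.0, 1.0]]]    # one query per "device"
k_chunks = [[[1.0, 0.0]], [[0.0, 1.0]]]
v_chunks = [[[1.0, 0.0]], [[0.0, 1.0]]]
out = ring_attention(q_chunks, k_chunks, v_chunks)
```

The result matches ordinary full-sequence softmax attention; in a real implementation the chunk rotation is an asynchronous device-to-device transfer overlapped with the blockwise computation.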

Context Window Comparison by Model

Major LLM Context Window Sizes (as of March 2026)

| Model | Context Window | KV Cache Optimization | Release Date |
|---|---|---|---|
| GPT-4 Turbo | 128K | MQA + internal optimization | 2023.11 |
| GPT-4o | 128K | MQA + internal optimization | 2024.05 |
| Claude 3 Opus/Sonnet | 200K | Undisclosed | 2024.03 |
| Claude 3.5 Sonnet | 200K | Undisclosed | 2024.06 |
| Gemini 1.5 Pro | 1M+ (up to 2M) | Ring Attention family (estimated) | 2024.02 |
| Gemini 2.0 | 1M+ | Internal optimization | 2024.12 |
| LLaMA 3.1 405B | 128K | GQA-8 | 2024.07 |
| Mistral Large | 128K | GQA + SWA | 2024.02 |
| Qwen 2.5 | 128K | GQA | 2024.09 |
| Yi-Lightning | 200K+ | GQA + internal optimization | 2024.05 |
| DeepSeek-V3 | 128K | MLA (Multi-head Latent Attention) | 2024.12 |
| Jamba 1.5 | 256K | SSM-Attention hybrid | 2024.08 |

Context Window Expansion Timeline

2020: GPT-3          2K tokens
2022: GPT-3.5        4K tokens
2023.03: GPT-4       8K / 32K tokens
2023.07: Claude 2    100K tokens
2023.11: GPT-4 Turbo 128K tokens
2024.02: Gemini 1.5  1M tokens
2024.03: Claude 3    200K tokens
2024.07: LLaMA 3.1   128K tokens
2024.12: Gemini 2.0  1M+ tokens
2025+: Various models 128K-2M+ tokens

Long-Context Performance Benchmarks

Needle-in-a-Haystack (NIAH) Test

Needle-in-a-Haystack evaluates how accurately a model can find specific information hidden within a long context.

Test Methodology:

  1. Prepare a long text (Haystack), e.g., 128K tokens of essays
  2. Insert a specific fact (Needle) at a particular position in the text
  3. Ask the model about that fact
  4. Measure accuracy while varying insertion position (beginning/middle/end) and total length
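Steps 1-3 can be sketched as a small prompt builder. The needle text, filler paragraphs, and function name are all made up for illustration:

```python
def build_niah_prompt(haystack_paragraphs, needle, depth_frac):
    """Insert `needle` at roughly depth_frac of the haystack
    (0.0 = very beginning, 1.0 = very end), then ask about it."""
    docs = list(haystack_paragraphs)
    docs.insert(int(depth_frac * len(docs)), needle)
    context = "\n\n".join(docs)
    return context + "\n\nQuestion: What is the secret number?"

needle = "The secret number is 7481."
haystack = [f"Filler paragraph number {i}." for i in range(100)]

# Sweep insertion depth; in a real test you would also sweep total length
# and score each model answer for exact retrieval of the needle.
prompts = {d: build_niah_prompt(haystack, needle, d)
           for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Scoring accuracy over the (depth, length) grid yields the familiar NIAH heatmap.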

Key Findings:

| Observation | Description |
|---|---|
| "Lost in the Middle" phenomenon | Retrieval accuracy for information inserted in the middle is lower than at the beginning or end |
| Inverse relationship between length and accuracy | Overall retrieval accuracy degrades as context grows longer |
| Model-specific differences | Claude 3 and Gemini 1.5 Pro maintain near-100% accuracy on NIAH |
| Multi-needle retrieval | Degradation is more pronounced when multiple pieces of information must be found |

Length-Dependent Performance Degradation Patterns

(Chart: retrieval accuracy vs. context length, 4K-256K. Top-performing models such as Claude 3 and Gemini 1.5 Pro hold near-100% accuracy across the range; mid-range models degrade noticeably at longer lengths; models without long-context support fall off sharply.)

Long-Context Usage Patterns in Production

| Use Case | Recommended Context Length | Considerations |
|---|---|---|
| Code review (single file) | 8-32K | Sufficient with most models |
| Code review (entire repo) | 64-200K | Consider combining with RAG |
| Document summarization | 32-128K | Monitor for quality degradation |
| Conversation history analysis | 16-64K | Combine with summary-based compression |
| Legal document analysis | 128K-1M | Requires longest context models |
| Full codebase analysis | 200K-1M+ | Must evaluate cost-effectiveness |

Cost-Performance Tradeoffs

Cost Impact of Long Context

Costs grow quickly with context length: KV Cache memory and input-token cost grow linearly, while prefill compute grows quadratically:

| Item | 4K Context | 128K Context | Multiplier |
|---|---|---|---|
| KV Cache Memory | 1x | 32x | 32x |
| Prefill Latency | ~100ms | ~3-5s | 30-50x |
| API Cost (input tokens) | $0.01 | $0.32 | 32x |
| GPU Memory Footprint | Low | Very High | - |
| Concurrent Request Capacity | High | Very Low | - |

Memory Savings with Combined Optimization Techniques

Baseline: LLaMA-70B, 128K context, batch=1, FP16, full MHA
Base KV Cache:           320 GB

+ GQA-8 (8 of 64 heads):  40 GB (87.5% reduction)
+ INT8 quantization:      20 GB (93.75% reduction)
+ INT4 quantization:      10 GB (~97% reduction)
+ PagedAttention:         Improves effective utilization (less fragmentation) rather than raw size
+ Sliding window:         Additional savings possible, but task-dependent

Multi-head Latent Attention (MLA)

MLA, introduced in DeepSeek-V2, compresses KV Cache into a lower-dimensional latent space. Keys and Values are compressed into combined latent vectors for storage and reconstructed during inference.

Traditional GQA: Store K, V in groups
MLA: Compress K+V into combined low-dimensional latent vectors
Additional 50-70% memory savings over GQA
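The compression idea can be illustrated with a toy low-rank projection. The tiny matrices and helper below are made up for the sketch (real MLA learns these projections during training and handles RoPE through a separate decoupled path):

```python
def matvec(M, x):
    # Plain matrix-vector product over nested lists.
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

# Per token, cache one low-dimensional latent c instead of full K and V;
# reconstruct K (and V) from c with up-projections at attention time.
d_model, d_latent = 4, 2
W_down = [[0.5, 0.0, 0.5, 0.0],
          [0.0, 0.5, 0.0, 0.5]]              # (d_latent, d_model) down-projection
W_up_k = [[1.0, 0.0], [0.0, 1.0],
          [1.0, 0.0], [0.0, 1.0]]            # (d_model, d_latent) up-projection

h = [1.0, 2.0, 3.0, 4.0]                     # hidden state for one token
c = matvec(W_down, h)                        # cached: d_latent floats, not 2*d_model
k_hat = matvec(W_up_k, c)                    # reconstructed key at attention time
```

The cache shrinks from 2 x d_model values per token to d_latent values; the up-projections can additionally be folded into the attention matmuls so reconstruction is nearly free.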

StreamingLLM

Enables infinite streaming without completely discarding the KV Cache. Leverages the "Attention Sink" phenomenon to maintain only the KV Cache for the first few tokens plus a recent window.

# StreamingLLM KV Cache management (conceptual)
# Full sequence: [t1, t2, t3, ..., t100, ..., t10000]
# Actually retained: [t1, t2, t3, t4] + [t9997, t9998, t9999, t10000]
#                     ↑ Attention Sink    ↑ Sliding Window
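The retention policy above can be sketched as follows (the class name is hypothetical; the real StreamingLLM also re-indexes positions within the cache):

```python
class StreamingKVCache:
    """Keep the first `n_sink` tokens (attention sinks) plus the most
    recent `window` tokens; everything in between is evicted."""
    def __init__(self, n_sink=4, window=4):
        self.n_sink, self.window = n_sink, window
        self.entries = []                      # (position, key, value)

    def append(self, pos, k, v):
        self.entries.append((pos, k, v))
        sinks = self.entries[:self.n_sink]
        recent = self.entries[self.n_sink:][-self.window:]
        self.entries = sinks + recent

cache = StreamingKVCache(n_sink=4, window=4)
for t in range(1, 10001):
    cache.append(t, f"K{t}", f"V{t}")
positions = [e[0] for e in cache.entries]
# positions == [1, 2, 3, 4, 9997, 9998, 9999, 10000]
```

Cache size stays constant at n_sink + window entries no matter how long the stream runs.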

Infini-Attention

Proposed by Google Research, this approach combines compressive memory with local attention. Past information accumulates in compressed memory while local attention handles recent context.

Practical Application Guide

Single GPU, Short Context (4-8K):

  • Use models with GQA
  • FP16 KV Cache is sufficient
  • vLLM + PagedAttention default settings

Single GPU, Medium Context (32-64K):

  • GQA + KV Cache INT8 quantization
  • Memory optimization with PagedAttention
  • Set gpu-memory-utilization=0.9

Multi-GPU, Long Context (128K+):

  • GQA + KV Cache INT4/INT8 quantization
  • Tensor parallelism + PagedAttention
  • Ring Attention (research/experimental stage)

Ultra-Long Context (1M+):

  • Use managed services like Gemini API
  • RAG + Long-context hybrid approach
  • Carefully evaluate cost-effectiveness

KV Cache Monitoring Checklist

# Real-time GPU memory usage monitoring
watch -n 1 nvidia-smi

# Check KV Cache utilization in vLLM
# (metric names below follow recent vLLM /metrics output)
curl http://localhost:8000/metrics | grep cache_usage

# Key monitoring metrics
# - vllm:gpu_cache_usage_perc: KV Cache GPU memory utilization
# - vllm:num_requests_running: Number of concurrent requests being processed
# - vllm:prompt_tokens_total: Total prompt tokens processed

FAQ

Does the LLM not work without KV Cache?

It does work, but it is not practical. Without KV Cache, the entire sequence must be recomputed for each token generation, making it tens to hundreds of times slower than with KV Cache for generating 1000 tokens. Unless working with small research models, KV Cache is always used in production environments.

Should I choose GQA or MQA?

Most major models released since 2024 have adopted GQA. GQA provides a good balance between MQA's extreme memory savings and MHA's quality. In practice, simply use models that already have GQA applied (LLaMA 3, Mistral, etc.).

Does using 128K context always yield better results?

No. As shown in Needle-in-a-Haystack tests, the "Lost in the Middle" phenomenon can degrade retrieval of information in the middle sections as context grows longer. In addition, input-token costs grow linearly and prefill latency grows quadratically with context length. Using only as much context as needed is more efficient.

Does PagedAttention affect inference quality?

No. PagedAttention only changes the memory management approach, not the attention computation itself. It produces mathematically identical results while only improving memory efficiency.

Can sliding window attention reference information from the distant past?

Not directly, but information can propagate indirectly if there are enough layers. For example, with a window size of 4096 and 32 layers, information from a range of 4096 x 32 = 131,072 tokens can theoretically propagate indirectly. However, in practice, information dilutes as it passes through layers, so there are limitations for accurate retrieval of distant past information.

Is KV Cache quantization the same as model weight quantization?

They are different. Model weight quantization (GPTQ, AWQ, GGUF, etc.) compresses trained parameters, while KV Cache quantization compresses cache data dynamically generated during inference. Both can be applied simultaneously, and the memory savings effects are cumulative.

References

  • Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150
  • Ainslie, J. et al. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv:2305.13245
  • Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180
  • Liu, H. et al. (2023). "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv:2310.01889
  • Xiao, G. et al. (2023). "Efficient Streaming Language Models with Attention Sinks." arXiv:2309.17453
  • Liu, Z. et al. (2024). "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." arXiv:2402.02750
  • DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434
  • Munkhdalai, T. et al. (2024). "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention." arXiv:2404.07143
  • Nelson, G. et al. (2024). "Needle In A Haystack - Pressure Testing LLMs." GitHub Repository
  • vLLM Project. "vLLM: Easy, Fast, and Cheap LLM Serving." GitHub Repository

Conclusion

KV Cache optimization is the key technology for leveraging LLM long-context capabilities in production. No single technique is sufficient alone, and combining GQA + PagedAttention + KV Cache quantization is currently the most practical approach.

Key takeaways:

  1. KV Cache is essential for inference speed, but it is the primary memory bottleneck in long-context scenarios.
  2. GQA is the current industry standard. It provides the optimal balance between MHA quality and MQA efficiency.
  3. PagedAttention is essential. It can be applied immediately through vLLM, SGLang, and similar tools.
  4. KV Cache quantization is an effective means for additional memory savings.
  5. Longer context is not always the answer. Cost, latency, and accuracy must be considered holistically.
  6. A hybrid approach combining RAG and long context is the optimal choice for most production scenarios.

As new approaches like MLA and Infini-Attention become production-ready, the cost-performance tradeoff of long context will continue to improve. Staying current with technology trends and continuous benchmarking remain important.