- Introduction
- Understanding the KV Cache Problem
- PagedAttention
- Continuous Batching
- vLLM Installation and Basic Usage
- Parallelism Strategies
- Performance Optimization Tips
- Benchmarks
- Production Deployment
- Quiz
- Conclusion
- References

Introduction
Serving LLMs efficiently is a balancing act between cost and performance. Without careful management of the KV Cache, which can occupy 60-80% of GPU memory in typical serving setups, expensive GPUs end up underutilized. vLLM is an open-source LLM inference engine developed at UC Berkeley that attacks this problem with an innovative memory management technique called PagedAttention.
Understanding the KV Cache Problem
Why KV Cache is a Problem
# KV Cache size calculation during Transformer inference
def kv_cache_size_gb(
    num_layers: int,
    num_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # float16
) -> float:
    """
    KV Cache memory = 2 * L * H * D * S * B * dtype
    (2 = one each for K and V)
    """
    total_bytes = 2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes
    return total_bytes / (1024**3)
# Llama 3 70B example (80 layers, GQA with 8 KV heads)
print(kv_cache_size_gb(
    num_layers=80,
    num_heads=8,  # GQA: the 8 KV heads count here, not the 64 query heads
    head_dim=128,
    seq_len=4096,
    batch_size=1
))
# → 1.25GB per request at a 4096-token context
# Without GQA (64 KV heads) this would be 10GB per request,
# and at batch_size=32 about 320GB: not enough even with 4x A100 80GB!
Waste in Traditional Approaches
Traditional KV Cache allocation (contiguous memory):
Request 1: [████████████░░░░░░░░] actual 1024 tokens, 4096 reserved
Request 2: [██████░░░░░░░░░░░░░░] actual 512 tokens, 4096 reserved
Request 3: [████████████████░░░░] actual 3072 tokens, 4096 reserved
Total reserved: 4096 * 3 = 12,288 slots
Actually used: 4,608 slots
Waste ratio: 62.5%!
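The waste figures in the diagram can be reproduced in a few lines (token counts taken from the diagram above):

```python
# Reproduce the waste numbers from the diagram
reserved_per_request = 4096
actual_tokens = [1024, 512, 3072]  # R1, R2, R3

total_reserved = reserved_per_request * len(actual_tokens)
total_used = sum(actual_tokens)
waste_ratio = 1 - total_used / total_reserved

print(total_reserved, total_used, f"{waste_ratio:.1%}")  # → 12288 4608 62.5%
```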
PagedAttention
Core Idea
It applies the OS virtual memory paging concept to KV Cache:
PagedAttention KV Cache management:
Logical Blocks:
Request 1: [B0] → [B1] → [B2] → [B3]
Physical Blocks (GPU memory):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B2 │ B5 │ B1 │ B3 │ B6 │ B4 │ B7 │
│ R1 │ R1 │ R2 │ R1 │ R1 │ R2 │ R2 │FREE│
└────┴────┴────┴────┴────┴────┴────┴────┘
Page Table:
Request 1: [0→0, 1→3, 2→1, 3→4]
Request 2: [0→6, 1→2, 2→5]
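Looking a token up through a page table is a single indirection. A minimal sketch, using Request 1's page table from the diagram and an assumed block size of 16 tokens:

```python
BLOCK_SIZE = 16  # assumed tokens per block

# Request 1's page table from the diagram: logical block → physical block
page_table_r1 = [0, 3, 1, 4]

def physical_location(page_table, token_idx):
    """Translate a token's logical position into (physical_block, slot)."""
    return page_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

print(physical_location(page_table_r1, 0))   # → (0, 0)
print(physical_location(page_table_r1, 20))  # → (3, 4): token 20 is in logical block 1
```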
# PagedAttention core structure (simplified)
import torch

class PagedAttentionManager:
    def __init__(self, num_blocks: int, block_size: int, num_heads: int, head_dim: int):
        self.block_size = block_size  # e.g., 16 tokens per block
        self.num_blocks = num_blocks
        # Physical block pool (pre-allocated on GPU memory)
        self.k_cache = torch.zeros(
            num_blocks, block_size, num_heads, head_dim,
            dtype=torch.float16, device='cuda'
        )
        self.v_cache = torch.zeros_like(self.k_cache)
        # Free list
        self.free_blocks = list(range(num_blocks))
        # Per-request bookkeeping
        self.page_tables: dict[int, list[int]] = {}
        self.token_counts: dict[int, int] = {}

    def allocate_block(self) -> int:
        """Allocate a single physical block"""
        if not self.free_blocks:
            raise MemoryError("No free blocks")
        return self.free_blocks.pop()

    def free_block(self, block_id: int):
        """Return a block to the free list"""
        self.free_blocks.append(block_id)

    def append_token(self, request_id: int, key: torch.Tensor, value: torch.Tensor):
        """Add a token — allocate a new block when the current block is full"""
        page_table = self.page_tables.setdefault(request_id, [])
        count = self.token_counts.setdefault(request_id, 0)
        slot_in_block = count % self.block_size
        if slot_in_block == 0:
            # First token of the request, or the current block is full
            page_table.append(self.allocate_block())
        last_block = page_table[-1]
        self.k_cache[last_block, slot_in_block] = key
        self.v_cache[last_block, slot_in_block] = value
        self.token_counts[request_id] = count + 1
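The block-boundary bookkeeping can be exercised without a GPU. A simplified version of the same page-table logic, with a block size of 4 for brevity (vLLM's default is 16):

```python
BLOCK_SIZE = 4  # small block size for illustration
free_blocks = list(range(8))
page_table = []
token_count = 0

def append_token():
    """Page-table bookkeeping only: allocate a block whenever the slot wraps to 0."""
    global token_count
    slot = token_count % BLOCK_SIZE
    if slot == 0:
        page_table.append(free_blocks.pop())  # block full (or first token of request)
    token_count += 1
    return page_table[-1], slot

for _ in range(9):  # 9 tokens fill blocks as 4 + 4 + 1
    append_token()

print(len(page_table), token_count)  # → 3 9
```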
Prefix Sharing with Copy-on-Write
# When multiple requests share the same system prompt
# Save memory with Copy-on-Write
system_prompt = "You are a helpful assistant..."
# Share KV Cache blocks of the system prompt
# Request 1: system_prompt + "What is Python?"
# Request 2: system_prompt + "Explain Docker"
# Request 3: system_prompt + "How to use Git?"
# Shared blocks:
# [System Block 0] ← shared by 3 requests (ref_count=3)
# [System Block 1] ← shared by 3 requests (ref_count=3)
# [System Block 2] ← shared by 3 requests (ref_count=3)
# Individual blocks:
# [R1 Block 3] [R2 Block 3] [R3 Block 3] ← each unique
# Memory savings: no need to copy system prompt blocks 3 times!
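The ref-counting behavior sketched in the comments can be illustrated with a toy Copy-on-Write block (all names hypothetical, not vLLM's internal API):

```python
class SharedBlock:
    """A KV Cache block shared between requests via reference counting."""
    def __init__(self, data):
        self.data = data
        self.ref_count = 1

def fork(block):
    """A new request reuses the block: no copy, just bump the refcount."""
    block.ref_count += 1
    return block

def write(block, idx, value):
    """Copy-on-Write: copy only if another request still references the block."""
    if block.ref_count > 1:
        block.ref_count -= 1                   # detach from the shared block
        block = SharedBlock(list(block.data))  # private copy, ref_count=1
    block.data[idx] = value
    return block

system = SharedBlock([1, 2, 3, 4])  # stands in for a system-prompt KV block
r1 = fork(system)                   # ref_count → 2, nothing copied
r2 = write(fork(system), 0, 99)     # ref_count → 3, then CoW copy before the write
print(system.ref_count, system.data[0], r2.data[0])  # → 2 1 99
```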
Continuous Batching
# Traditional Static Batching
# Wait until all requests finish
def static_batching(requests):
    """
    R1: ████████████ (done)
    R2: ████████████████████ (done)
    R3: ████ (done, but waiting...)
                        ↑ New batch starts only here
    """
    max_len = max(r.output_len for r in requests)
    for step in range(max_len):
        # Already finished requests still occupy GPU
        outputs = model.forward(batch)
    return outputs

# vLLM's Continuous Batching
def continuous_batching(scheduler):
    """
    Step 1: [R1, R2, R3] → process all
    Step 2: [R1, R2, R3] → R3 done! → insert R4 in empty slot
    Step 3: [R1, R2, R4] → R1 done! → insert R5
    Step 4: [R5, R2, R4] → ...
    GPU utilization: ~95% (vs Static's ~50-60%)
    """
    while requests_exist():
        # Remove completed requests, add new ones
        batch = scheduler.schedule()
        # Separate Prefill and Decode processing
        prefill_batch = [r for r in batch if r.is_prefill]
        decode_batch = [r for r in batch if r.is_decode]
        if prefill_batch:
            model.forward(prefill_batch, mode="prefill")
        if decode_batch:
            model.forward(decode_batch, mode="decode")
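The scheduling difference can be simulated step by step with toy request lengths (no model involved; the numbers are illustrative, not from vLLM):

```python
from collections import deque

# Each request needs this many decode steps (toy numbers)
pending = deque([("R1", 3), ("R2", 4), ("R3", 1), ("R4", 2), ("R5", 2)])
MAX_BATCH = 3  # at most 3 sequences in flight
running = {}

step = 0
while pending or running:
    # Continuous batching: refill freed slots before every step
    while pending and len(running) < MAX_BATCH:
        name, steps_left = pending.popleft()
        running[name] = steps_left
    step += 1
    for name in list(running):
        running[name] -= 1
        if running[name] == 0:
            del running[name]  # finished mid-stream, slot frees immediately

print(step)  # → 5 (static batching on the same requests takes 6 steps: max(3,4,1) + max(2,2))
```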
vLLM Installation and Basic Usage
Installation
# pip install
pip install vllm
# Requires CUDA 12.1+, PyTorch 2.4+
# GPU: Compute Capability 7.0+ (V100, A100, H100, L40S, etc.)
Offline Inference
from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    max_model_len=8192,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.1,
)

# Batch inference
prompts = [
    "Explain the lifecycle of a Kubernetes Pod.",
    "What are the differences between Docker and Podman?",
    "Tell me about Redis caching strategies.",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text[:100]}...")
    # finished_time and arrival_time are timestamps, so take their difference
    elapsed = output.metrics.finished_time - output.metrics.arrival_time
    print(f"Tokens/s: {len(output.outputs[0].token_ids) / elapsed:.1f}")
    print()
OpenAI-Compatible Server
# Run vLLM server (OpenAI API compatible)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --max-num-seqs 256
# API call (OpenAI SDK compatible)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is vLLM?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'
# Python SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM requires no authentication unless --api-key is set
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Parallelism Strategies
Tensor Parallelism
# Distribute model across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype bfloat16
Pipeline Parallelism
# 8 GPUs: 4-way TP × 2-way PP
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2
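The GPU count and a rough per-GPU weight footprint follow directly from TP × PP. A back-of-the-envelope helper (a sketch: it ignores embedding replication, KV Cache, and activation memory):

```python
def gpus_needed(tp: int, pp: int) -> int:
    """Total GPUs = tensor-parallel size x pipeline-parallel size."""
    return tp * pp

def weights_per_gpu_gb(num_params_b: float, bytes_per_param: float, tp: int, pp: int) -> float:
    """Rough per-GPU share of the model weights only."""
    return num_params_b * bytes_per_param / (tp * pp)

# The 405B example above: 4-way TP x 2-way PP in bfloat16
print(gpus_needed(4, 2))                 # → 8
print(weights_per_gpu_gb(405, 2, 4, 2))  # → 101.25 (GB of weights per GPU)
# Over 100GB of weights per 80GB GPU: FP8 quantization or more GPUs would be needed in practice
```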
Performance Optimization Tips
1. GPU Memory Utilization
# Default is 0.9 (90%), can be set more aggressively
--gpu-memory-utilization 0.95
# Check number of KV Cache blocks
# Log: "# GPU blocks: 12345, # CPU blocks: 0"
2. Prefix Caching
# Effective when many requests share the same system prompt
--enable-prefix-caching
3. Quantization
# Use AWQ quantized model
vllm serve TheBloke/Llama-3.1-70B-AWQ \
  --quantization awq \
  --dtype auto
# GPTQ
vllm serve TheBloke/Llama-3.1-70B-GPTQ \
  --quantization gptq
# FP8 (optimal on H100)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8
4. Speculative Decoding
# Speculate with small model → verify with large model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5
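The expected payoff per verification step can be estimated from the draft model's acceptance rate. A simple expected-value sketch (the 80% acceptance rate is an assumed number, not a measurement):

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """
    Expected tokens produced per large-model forward pass when the draft
    model proposes k tokens and each is accepted independently with
    probability accept_rate. Accepted prefix length contributes
    sum(a^i for i in 1..k); the verifier itself always emits one more
    token, giving sum(a^i for i in 0..k).
    """
    return sum(accept_rate ** i for i in range(k + 1))

# 5 speculative tokens at an assumed 80% per-token acceptance
print(f"{expected_tokens_per_step(0.8, 5):.2f}")  # tokens per verify pass, vs 1.00 without speculation
```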
Benchmarks
# vLLM benchmark tool
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct &
# Benchmark with ShareGPT dataset
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --request-rate 10
Benchmark results example (A100 80GB, Llama-3.1-8B):
| Metric                | vLLM  | TGI   | Pure HF |
|-----------------------|-------|-------|---------|
| Throughput (tokens/s) | 2,400 | 1,800 | 400     |
| TTFT p50 (ms)         | 45    | 60    | 200     |
| TTFT p99 (ms)         | 120   | 180   | 500     |
| ITL p50 (ms)          | 8     | 10    | 25      |
| Max Batch Size        | 256   | 128   | 16      |
| Memory Util.          | 95%   | 85%   | 60%     |
Production Deployment
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --tensor-parallel-size=1
        - --gpu-memory-utilization=0.9
        - --max-model-len=8192
        - --enable-prefix-caching
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            nvidia.com/gpu: 1
            memory: 24Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
Quiz
Q1. What is the core idea behind PagedAttention?
It applies the OS virtual memory paging concept to KV Cache. Instead of contiguous memory allocation, it manages KV Cache in small block units, eliminating memory fragmentation and waste.
Q2. Why is Continuous Batching more efficient than Static Batching?
When a request completes, it is immediately removed from the batch and a new request is added. Static Batching occupies all slots until the longest request finishes, wasting GPU resources.
Q3. In what situations is Prefix Caching effective?
It is effective when multiple requests use the same system prompt. It saves redundant computation and memory by sharing the KV Cache of the common prefix.
Q4. What is the difference between Tensor Parallelism and Pipeline Parallelism?
Tensor Parallelism splits a single layer across multiple GPUs, while Pipeline Parallelism places different layers on different GPUs. TP optimizes latency, while PP optimizes throughput.
Q5. What is the principle behind Speculative Decoding?
A small model quickly generates (speculates) multiple tokens, and a large model verifies them all at once. Only verified tokens are accepted, improving speed while maintaining quality.
Q6. What does the gpu-memory-utilization parameter control?
It sets the fraction of total GPU memory vLLM is allowed to use (default 0.9). After the model weights are loaded, vLLM pre-allocates the remaining budget as KV Cache blocks, so a higher value means more blocks and more concurrent requests.
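How that fraction translates into KV Cache blocks can be estimated roughly (all numbers are illustrative; vLLM computes the real figure by profiling the model at startup):

```python
def estimate_kv_blocks(gpu_mem_gb, util, weights_gb, block_size,
                       num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV blocks ≈ (util * GPU memory - weights) / bytes per block."""
    budget_bytes = (gpu_mem_gb * util - weights_gb) * 1024**3
    block_bytes = 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes  # K and V
    return int(budget_bytes // block_bytes)

# A Llama 3.1 8B-like shape (32 layers, 8 KV heads, head_dim 128, ~16GB of
# FP16 weights) on an 80GB GPU at 0.9 utilization, with 16-token blocks
print(estimate_kv_blocks(80, 0.9, 16, 16, 32, 8, 128))
# → 28672 blocks of 16 tokens, roughly 458k cacheable tokens
```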
Q7. How many GPUs are needed at minimum to serve Llama 3.1 70B on A100 80GB?
At FP16, approximately 140GB is required, so at least 2 GPUs are needed. With AWQ/GPTQ 4-bit quantization, it is possible with just 1 GPU.
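The arithmetic behind that answer, as a quick check:

```python
def weight_memory_gb(num_params_b: float, bits_per_param: float) -> float:
    """Model weight memory in GB (parameters given in billions)."""
    return num_params_b * bits_per_param / 8

print(weight_memory_gb(70, 16))  # FP16 → 140.0 GB: needs at least 2x A100 80GB
print(weight_memory_gb(70, 4))   # 4-bit AWQ/GPTQ → 35.0 GB: fits on a single GPU
```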
Conclusion
vLLM has established itself as the standard tool for LLM inference through PagedAttention, Continuous Batching, and various parallelism strategies. It provides an OpenAI API-compatible server, allowing adoption without changing existing code, and facilitates production deployment in Kubernetes environments.