- Introduction
- Understanding the KV Cache Problem
- PagedAttention
- Continuous Batching
- vLLM Installation and Basic Usage
- Parallelism Strategies
- Performance Optimization Tips
- Benchmarks
- Production Deployment
- Quiz
- Conclusion
- References

Introduction
Serving LLMs efficiently is a balancing act between cost and performance. Without careful management of the KV Cache, which can occupy 60-80% of GPU memory in typical serving setups, expensive GPUs end up underutilized. vLLM is an open-source LLM inference engine developed at UC Berkeley that attacks this problem with an innovative memory management technique called PagedAttention.
Understanding the KV Cache Problem
Why KV Cache is a Problem
# KV Cache size calculation during Transformer inference
def kv_cache_size_gb(
    num_layers: int,
    num_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # float16
) -> float:
    """
    KV Cache memory = 2 * L * H * D * S * B * dtype
    (2 = one each for K and V)
    """
    total_bytes = 2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes
    return total_bytes / (1024**3)
# Llama 3 70B example (80 layers, GQA with 8 KV heads)
print(kv_cache_size_gb(
    num_layers=80,
    num_heads=8,  # GQA: the 8 KV heads count here, not the 64 query heads
    head_dim=128,
    seq_len=4096,
    batch_size=1
))
# → 1.25GB per request at a 4096-token context
# Without GQA (64 KV heads) this would be 10GB per request,
# and at batch_size=32 about 320GB: not enough even with 4x A100 80GB!
Waste in Traditional Approaches
Traditional KV Cache allocation (contiguous memory):
Request 1: [████████████░░░░░░░░] actual 1024 tokens, 4096 reserved
Request 2: [██████░░░░░░░░░░░░░░] actual 512 tokens, 4096 reserved
Request 3: [████████████████░░░░] actual 3072 tokens, 4096 reserved
Total reserved: 4096 * 3 = 12,288 slots
Actually used: 4,608 slots
Waste ratio: 62.5%!
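The waste figures in the diagram can be reproduced in a few lines (token counts taken from the diagram above):

```python
# Reproduce the waste numbers from the diagram
reserved_per_request = 4096
actual_tokens = [1024, 512, 3072]  # R1, R2, R3

total_reserved = reserved_per_request * len(actual_tokens)
total_used = sum(actual_tokens)
waste_ratio = 1 - total_used / total_reserved

print(total_reserved, total_used, f"{waste_ratio:.1%}")  # → 12288 4608 62.5%
```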
PagedAttention
Core Idea
It applies the OS virtual memory paging concept to KV Cache:
PagedAttention KV Cache management:
Logical Blocks:
Request 1: [B0] → [B1] → [B2] → [B3]
Physical Blocks (GPU memory):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B2 │ B5 │ B1 │ B3 │ B6 │ B4 │ B7 │
│ R1 │ R1 │ R2 │ R1 │ R1 │ R2 │ R2 │FREE│
└────┴────┴────┴────┴────┴────┴────┴────┘
Page Table:
Request 1: [0→0, 1→3, 2→1, 3→4]
Request 2: [0→6, 1→2, 2→5]
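Looking a token up through a page table is a single indirection. A minimal sketch, using Request 1's page table from the diagram and an assumed block size of 16 tokens:

```python
BLOCK_SIZE = 16  # assumed tokens per block

# Request 1's page table from the diagram: logical block → physical block
page_table_r1 = [0, 3, 1, 4]

def physical_location(page_table, token_idx):
    """Translate a token's logical position into (physical_block, slot)."""
    return page_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

print(physical_location(page_table_r1, 0))   # → (0, 0)
print(physical_location(page_table_r1, 20))  # → (3, 4): token 20 is in logical block 1
```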
# PagedAttention core structure (simplified)
import torch

class PagedAttentionManager:
    def __init__(self, num_blocks: int, block_size: int, num_heads: int, head_dim: int):
        self.block_size = block_size  # e.g., 16 tokens per block
        self.num_blocks = num_blocks
        # Physical block pool (pre-allocated on GPU memory)
        self.k_cache = torch.zeros(
            num_blocks, block_size, num_heads, head_dim,
            dtype=torch.float16, device='cuda'
        )
        self.v_cache = torch.zeros_like(self.k_cache)
        # Free list
        self.free_blocks = list(range(num_blocks))
        # Per-request bookkeeping
        self.page_tables: dict[int, list[int]] = {}
        self.token_counts: dict[int, int] = {}

    def allocate_block(self) -> int:
        """Allocate a single physical block"""
        if not self.free_blocks:
            raise MemoryError("No free blocks")
        return self.free_blocks.pop()

    def free_block(self, block_id: int):
        """Return a block to the free list"""
        self.free_blocks.append(block_id)

    def append_token(self, request_id: int, key: torch.Tensor, value: torch.Tensor):
        """Add a token — allocate a new block when the current block is full"""
        page_table = self.page_tables.setdefault(request_id, [])
        count = self.token_counts.setdefault(request_id, 0)
        slot_in_block = count % self.block_size
        if slot_in_block == 0:
            # First token of the request, or the current block is full
            page_table.append(self.allocate_block())
        last_block = page_table[-1]
        self.k_cache[last_block, slot_in_block] = key
        self.v_cache[last_block, slot_in_block] = value
        self.token_counts[request_id] = count + 1
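The block-boundary bookkeeping can be exercised without a GPU. A simplified version of the same page-table logic, with a block size of 4 for brevity (vLLM's default is 16):

```python
BLOCK_SIZE = 4  # small block size for illustration
free_blocks = list(range(8))
page_table = []
token_count = 0

def append_token():
    """Page-table bookkeeping only: allocate a block whenever the slot wraps to 0."""
    global token_count
    slot = token_count % BLOCK_SIZE
    if slot == 0:
        page_table.append(free_blocks.pop())  # block full (or first token of request)
    token_count += 1
    return page_table[-1], slot

for _ in range(9):  # 9 tokens fill blocks as 4 + 4 + 1
    append_token()

print(len(page_table), token_count)  # → 3 9
```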
Prefix Sharing with Copy-on-Write
# When multiple requests share the same system prompt
# Save memory with Copy-on-Write
system_prompt = "You are a helpful assistant..."
# Share KV Cache blocks of the system prompt
# Request 1: system_prompt + "What is Python?"
# Request 2: system_prompt + "Explain Docker"
# Request 3: system_prompt + "How to use Git?"
# Shared blocks:
# [System Block 0] ← shared by 3 requests (ref_count=3)
# [System Block 1] ← shared by 3 requests (ref_count=3)
# [System Block 2] ← shared by 3 requests (ref_count=3)
# Individual blocks:
# [R1 Block 3] [R2 Block 3] [R3 Block 3] ← each unique
# Memory savings: no need to copy system prompt blocks 3 times!
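The ref-counting behavior sketched in the comments can be illustrated with a toy Copy-on-Write block (all names hypothetical, not vLLM's internal API):

```python
class SharedBlock:
    """A KV Cache block shared between requests via reference counting."""
    def __init__(self, data):
        self.data = data
        self.ref_count = 1

def fork(block):
    """A new request reuses the block: no copy, just bump the refcount."""
    block.ref_count += 1
    return block

def write(block, idx, value):
    """Copy-on-Write: copy only if another request still references the block."""
    if block.ref_count > 1:
        block.ref_count -= 1                   # detach from the shared block
        block = SharedBlock(list(block.data))  # private copy, ref_count=1
    block.data[idx] = value
    return block

system = SharedBlock([1, 2, 3, 4])  # stands in for a system-prompt KV block
r1 = fork(system)                   # ref_count → 2, nothing copied
r2 = write(fork(system), 0, 99)     # ref_count → 3, then CoW copy before the write
print(system.ref_count, system.data[0], r2.data[0])  # → 2 1 99
```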
Continuous Batching
# Traditional Static Batching
# Wait until all requests finish
def static_batching(requests):
    """
    R1: ████████████ (done)
    R2: ████████████████████ (done)
    R3: ████ (done, but waiting...)
                        ↑ New batch starts only here
    """
    max_len = max(r.output_len for r in requests)
    for step in range(max_len):
        # Already finished requests still occupy GPU
        outputs = model.forward(batch)
    return outputs

# vLLM's Continuous Batching
def continuous_batching(scheduler):
    """
    Step 1: [R1, R2, R3] → process all
    Step 2: [R1, R2, R3] → R3 done! → insert R4 in empty slot
    Step 3: [R1, R2, R4] → R1 done! → insert R5
    Step 4: [R5, R2, R4] → ...
    GPU utilization: ~95% (vs Static's ~50-60%)
    """
    while requests_exist():
        # Remove completed requests, add new ones
        batch = scheduler.schedule()
        # Separate Prefill and Decode processing
        prefill_batch = [r for r in batch if r.is_prefill]
        decode_batch = [r for r in batch if r.is_decode]
        if prefill_batch:
            model.forward(prefill_batch, mode="prefill")
        if decode_batch:
            model.forward(decode_batch, mode="decode")
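The scheduling difference can be simulated step by step with toy request lengths (no model involved; the numbers are illustrative, not from vLLM):

```python
from collections import deque

# Each request needs this many decode steps (toy numbers)
pending = deque([("R1", 3), ("R2", 4), ("R3", 1), ("R4", 2), ("R5", 2)])
MAX_BATCH = 3  # at most 3 sequences in flight
running = {}

step = 0
while pending or running:
    # Continuous batching: refill freed slots before every step
    while pending and len(running) < MAX_BATCH:
        name, steps_left = pending.popleft()
        running[name] = steps_left
    step += 1
    for name in list(running):
        running[name] -= 1
        if running[name] == 0:
            del running[name]  # finished mid-stream, slot frees immediately

print(step)  # → 5 (static batching on the same requests takes 6 steps: max(3,4,1) + max(2,2))
```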
vLLM Installation and Basic Usage
Installation
# pip install
pip install vllm
# Requires CUDA 12.1+, PyTorch 2.4+
# GPU: Compute Capability 7.0+ (V100, A100, H100, L40S, etc.)
Offline Inference
from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    max_model_len=8192,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.1,
)

# Batch inference
prompts = [
    "Explain the lifecycle of a Kubernetes Pod.",
    "What are the differences between Docker and Podman?",
    "Tell me about Redis caching strategies.",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text[:100]}...")
    # finished_time and arrival_time are timestamps, so take their difference
    elapsed = output.metrics.finished_time - output.metrics.arrival_time
    print(f"Tokens/s: {len(output.outputs[0].token_ids) / elapsed:.1f}")
    print()
OpenAI-Compatible Server
# Run vLLM server (OpenAI API compatible)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --max-num-seqs 256
# API call (OpenAI SDK compatible)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is vLLM?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'
# Python SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM requires no authentication unless --api-key is set
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Parallelism Strategies
Tensor Parallelism
# Distribute model across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype bfloat16
Pipeline Parallelism
# 8 GPUs: 4-way TP × 2-way PP
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2
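The GPU count and a rough per-GPU weight footprint follow directly from TP × PP. A back-of-the-envelope helper (a sketch: it ignores embedding replication, KV Cache, and activation memory):

```python
def gpus_needed(tp: int, pp: int) -> int:
    """Total GPUs = tensor-parallel size x pipeline-parallel size."""
    return tp * pp

def weights_per_gpu_gb(num_params_b: float, bytes_per_param: float, tp: int, pp: int) -> float:
    """Rough per-GPU share of the model weights only."""
    return num_params_b * bytes_per_param / (tp * pp)

# The 405B example above: 4-way TP x 2-way PP in bfloat16
print(gpus_needed(4, 2))                 # → 8
print(weights_per_gpu_gb(405, 2, 4, 2))  # → 101.25 (GB of weights per GPU)
# Over 100GB of weights per 80GB GPU: FP8 quantization or more GPUs would be needed in practice
```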
Performance Optimization Tips
1. GPU Memory Utilization
# Default is 0.9 (90%), can be set more aggressively
--gpu-memory-utilization 0.95
# Check number of KV Cache blocks
# Log: "# GPU blocks: 12345, # CPU blocks: 0"
2. Prefix Caching
# Effective when many requests share the same system prompt
--enable-prefix-caching
3. Quantization
# Use AWQ quantized model
vllm serve TheBloke/Llama-3.1-70B-AWQ \
  --quantization awq \
  --dtype auto
# GPTQ
vllm serve TheBloke/Llama-3.1-70B-GPTQ \
  --quantization gptq
# FP8 (optimal on H100)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8
4. Speculative Decoding
# Speculate with small model → verify with large model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5
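The expected payoff per verification step can be estimated from the draft model's acceptance rate. A simple expected-value sketch (the 80% acceptance rate is an assumed number, not a measurement):

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """
    Expected tokens produced per large-model forward pass when the draft
    model proposes k tokens and each is accepted independently with
    probability accept_rate. Accepted prefix length contributes
    sum(a^i for i in 1..k); the verifier itself always emits one more
    token, giving sum(a^i for i in 0..k).
    """
    return sum(accept_rate ** i for i in range(k + 1))

# 5 speculative tokens at an assumed 80% per-token acceptance
print(f"{expected_tokens_per_step(0.8, 5):.2f}")  # tokens per verify pass, vs 1.00 without speculation
```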
Benchmarks
# vLLM benchmark tool
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct &
# Benchmark with ShareGPT dataset
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --request-rate 10
Benchmark results example (A100 80GB, Llama-3.1-8B):
| Metric                | vLLM  | TGI   | Pure HF |
|-----------------------|-------|-------|---------|
| Throughput (tokens/s) | 2,400 | 1,800 | 400     |
| TTFT p50 (ms)         | 45    | 60    | 200     |
| TTFT p99 (ms)         | 120   | 180   | 500     |
| ITL p50 (ms)          | 8     | 10    | 25      |
| Max Batch Size        | 256   | 128   | 16      |
| Memory Util.          | 95%   | 85%   | 60%     |
Production Deployment
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --tensor-parallel-size=1
        - --gpu-memory-utilization=0.9
        - --max-model-len=8192
        - --enable-prefix-caching
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            nvidia.com/gpu: 1
            memory: 24Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
Quiz
Q1. What is the core idea behind PagedAttention?
It applies the OS virtual memory paging concept to KV Cache. Instead of contiguous memory allocation, it manages KV Cache in small block units, eliminating memory fragmentation and waste.
Q2. Why is Continuous Batching more efficient than Static Batching?
When a request completes, it is immediately removed from the batch and a new request is added. Static Batching occupies all slots until the longest request finishes, wasting GPU resources.
Q3. In what situations is Prefix Caching effective?
It is effective when multiple requests use the same system prompt. It saves redundant computation and memory by sharing the KV Cache of the common prefix.
Q4. What is the difference between Tensor Parallelism and Pipeline Parallelism?
Tensor Parallelism splits a single layer across multiple GPUs, while Pipeline Parallelism places different layers on different GPUs. TP optimizes latency, while PP optimizes throughput.
Q5. What is the principle behind Speculative Decoding?
A small model quickly generates (speculates) multiple tokens, and a large model verifies them all at once. Only verified tokens are accepted, improving speed while maintaining quality.
Q6. What does the gpu-memory-utilization parameter control?
It sets the fraction of total GPU memory vLLM is allowed to use (default 0.9). After the model weights are loaded, vLLM pre-allocates the remaining budget as KV Cache blocks, so a higher value means more blocks and more concurrent requests.
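How that fraction translates into KV Cache blocks can be estimated roughly (all numbers are illustrative; vLLM computes the real figure by profiling the model at startup):

```python
def estimate_kv_blocks(gpu_mem_gb, util, weights_gb, block_size,
                       num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV blocks ≈ (util * GPU memory - weights) / bytes per block."""
    budget_bytes = (gpu_mem_gb * util - weights_gb) * 1024**3
    block_bytes = 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes  # K and V
    return int(budget_bytes // block_bytes)

# A Llama 3.1 8B-like shape (32 layers, 8 KV heads, head_dim 128, ~16GB of
# FP16 weights) on an 80GB GPU at 0.9 utilization, with 16-token blocks
print(estimate_kv_blocks(80, 0.9, 16, 16, 32, 8, 128))
# → 28672 blocks of 16 tokens, roughly 458k cacheable tokens
```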
Q7. How many GPUs are needed at minimum to serve Llama 3.1 70B on A100 80GB?
At FP16, approximately 140GB is required, so at least 2 GPUs are needed. With AWQ/GPTQ 4-bit quantization, it is possible with just 1 GPU.
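The arithmetic behind that answer, as a quick check:

```python
def weight_memory_gb(num_params_b: float, bits_per_param: float) -> float:
    """Model weight memory in GB (parameters given in billions)."""
    return num_params_b * bits_per_param / 8

print(weight_memory_gb(70, 16))  # FP16 → 140.0 GB: needs at least 2x A100 80GB
print(weight_memory_gb(70, 4))   # 4-bit AWQ/GPTQ → 35.0 GB: fits on a single GPU
```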
Conclusion
vLLM has established itself as the standard tool for LLM inference through PagedAttention, Continuous Batching, and various parallelism strategies. It provides an OpenAI API-compatible server, allowing adoption without changing existing code, and facilitates production deployment in Kubernetes environments.