- Introduction
- How PagedAttention Works
- vLLM Architecture
- vLLM Installation and Basic Usage
- Production Deployment Configuration
- Performance Optimization Techniques
- Kubernetes Deployment Patterns
- Inference Engine Comparison
- Operational Considerations and Monitoring
- Failure Cases and Recovery Procedures
- Checklist
- References

Introduction
The first wall you hit when serving LLMs in production is GPU memory. Loading Llama 3.1 70B in FP16 requires 140GB for the model weights alone, and processing requests concurrently demands tens to hundreds of additional gigabytes for the KV Cache. Real production environments routinely handle tens to hundreds of concurrent requests, which makes KV Cache memory management the decisive factor for overall system throughput and latency.
In traditional Transformer inference implementations, the KV Cache is pre-allocated at the maximum sequence length for each request. If a service allows up to 4,096 tokens but the actual average output is 512 tokens, roughly 87% of the allocated memory is wasted. PagedAttention, proposed by researchers at UC Berkeley, solves this problem at its root, and vLLM is the open-source inference engine built around it.
This article covers the full spectrum of production LLM serving: how PagedAttention works, vLLM's architecture, production deployment configuration, performance optimization techniques, Kubernetes-based autoscaling, comparisons with SGLang and TensorRT-LLM, operational monitoring, and failure cases with recovery procedures.
How PagedAttention Works
Applying Virtual Memory Paging
PagedAttention is inspired by virtual memory management in operating systems. The OS provides processes with a contiguous virtual address space, but physical memory is allocated non-contiguously in fixed-size pages. PagedAttention applies this concept directly to the KV Cache.
The KV Cache is divided into fixed-size blocks, where each block stores Key-Value tensors for a fixed number of tokens. The default block size is 16 tokens. As new tokens are generated during request processing, when the current block is full, a new physical block is allocated and a mapping is added to the block table.
Traditional contiguous allocation:
Request A: [████████░░░░░░░░░░░░░░░░░░░░░░░░] Actual 512, max 2048 allocated
Request B: [██████████████░░░░░░░░░░░░░░░░░░] Actual 896, max 2048 allocated
Request C: [██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] Actual 128, max 2048 allocated
→ Total 6,144 slots allocated, only 1,536 used. 75% waste
PagedAttention block approach (block size = 16):
Request A: [B0][B1]...[B31] → Only 32 blocks allocated (512 tokens)
Request B: [B0][B1]...[B55] → Only 56 blocks allocated (896 tokens)
Request C: [B0]...[B7] → Only 8 blocks allocated (128 tokens)
→ Only internal fragmentation in the last block. Average waste under 4%
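The block-table bookkeeping described above can be sketched in a few lines of Python. This is a simplified illustration with hypothetical names, not vLLM's actual implementation:

```python
BLOCK_SIZE = 16  # tokens per KV Cache block (vLLM default)

class BlockTable:
    """Toy model of a request's logical-to-physical block mapping."""

    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.table: list[int] = []      # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free_blocks.pop())
        self.num_tokens += 1

pool = list(range(1024))   # 1,024 free physical blocks in the pool
req = BlockTable(pool)
for _ in range(512):       # generate 512 tokens
    req.append_token()

print(len(req.table))      # 32 blocks: exactly ceil(512 / 16), no over-allocation
```

Note that memory is claimed lazily, one block at a time, which is what eliminates the pre-allocation waste of the contiguous scheme above.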
Copy-on-Write and Prefix Sharing
Another strength of PagedAttention is its Copy-on-Write (CoW) mechanism. When generating multiple sequences from the same prompt, such as in beam search or parallel sampling, KV Cache blocks for the common prefix can be physically shared. New blocks are allocated only at the point of divergence, reducing beam search memory usage by up to 55%.
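The reference-counting behind CoW can also be sketched in toy form (hypothetical names; vLLM's block manager handles this internally):

```python
# Toy copy-on-write bookkeeping for shared prefix blocks.
# Each physical block carries a reference count; writing to a block
# with refcount > 1 first copies it (the "copy" in copy-on-write).

ref_count: dict[int, int] = {}   # physical block id -> number of sequences

def fork_sequence(blocks: list[int]) -> list[int]:
    """Fork a sequence: share all prefix blocks and bump their refcounts."""
    for b in blocks:
        ref_count[b] = ref_count.get(b, 1) + 1
    return list(blocks)          # both sequences point at the same physical blocks

def write_block(blocks: list[int], idx: int, next_free_id: int) -> int:
    """Mutate block `idx`; copy it first if it is shared."""
    b = blocks[idx]
    if ref_count.get(b, 1) > 1:
        ref_count[b] -= 1
        blocks[idx] = next_free_id   # private copy with refcount 1
        ref_count[next_free_id] = 1
    return blocks[idx]

parent = [0, 1, 2]                       # 3 shared prefix blocks
child = fork_sequence(parent)
write_block(child, 2, next_free_id=3)    # the fork diverges at the last block
print(parent, child)                     # [0, 1, 2] [0, 1, 3] — only one block copied
```

Only the diverging block is duplicated; the shared prefix stays physically deduplicated, which is where the beam-search memory savings come from.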
KV Cache Size Calculation
To accurately estimate GPU memory in production environments, you need to predict the KV Cache size. The following function calculates the KV Cache size for each model.
def estimate_kv_cache_memory(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # FP16
    block_size: int = 16,
) -> dict:
    """
    KV Cache memory estimation based on vLLM PagedAttention.
    For GQA models, num_kv_heads is smaller than num_attention_heads.
    """
    # KV Cache bytes per token (factor of 2 = Key and Value)
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    # Bytes per block
    per_block_bytes = per_token_bytes * block_size
    # Total blocks needed for max concurrent processing
    total_tokens = max_seq_len * max_batch_size
    total_blocks = (total_tokens + block_size - 1) // block_size
    total_bytes = total_blocks * per_block_bytes
    total_gb = total_bytes / (1024 ** 3)
    return {
        "per_token_kv_bytes": per_token_bytes,
        "per_block_kv_bytes": per_block_bytes,
        "total_blocks": total_blocks,
        "total_kv_cache_gb": round(total_gb, 2),
    }

# Llama 3.1 70B (GQA: 8 KV heads, 80 layers, head_dim=128)
result = estimate_kv_cache_memory(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    max_seq_len=4096,
    max_batch_size=64,
    dtype_bytes=2,
)
print(result)
# {'per_token_kv_bytes': 327680, 'per_block_kv_bytes': 5242880,
#  'total_blocks': 16384, 'total_kv_cache_gb': 80.0}
# → 64 concurrent requests × 4096 tokens requires 80GB for KV Cache alone
# → 64 concurrent requests × 4096 tokens requires 80GB for KV Cache alone
Looking at this calculation, serving a 70B model with 64 concurrent requests requires over 220GB of VRAM: 140GB for model weights + 80GB for KV Cache. That is a scale requiring three or more A100 80GB GPUs. PagedAttention maximizes efficiency by dynamically allocating this memory based on actual usage.
vLLM Architecture
V1 Engine and Core Components
The V1 engine, introduced in the vLLM 0.7.x series, significantly reworked the architecture compared to previous versions. The core components are as follows.
Scheduler: Determines the priority of pending requests and decides which to execute based on available GPU memory. Through its preemption mechanism, when memory runs short it evicts the KV Cache of lower-priority requests, either swapping it to CPU memory or discarding it for later recomputation.
Block Manager: The core of PagedAttention, managing allocation, deallocation, and sharing of physical blocks. It maintains the block table and tracks mappings between logical and physical blocks.
Worker: The process that performs actual model inference on GPUs. When Tensor Parallelism is enabled, multiple Workers collaborate to process a single request.
Model Runner: Executes the model's forward pass and calls optimized attention kernels like FlashAttention or FlashInfer.
Continuous Batching
In traditional static batching, new requests cannot be added until all requests in the batch are completed. Requests generating short responses must wait for long responses, reducing GPU utilization.
vLLM's Continuous Batching removes completed requests from the batch at each iteration and immediately adds waiting new requests. This maintains GPU utilization above 90% while significantly reducing average latency.
Static Batching:
Time → ████████████████████████████████████
Req 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] Generation complete
Req 2: [■■■■■■■■■■░░░░░░░░░░░░░░░░░░░░░░] Finished early, waiting
Req 3: [■■■■░░░░░░░░░░░░░░░░░░░░░░░░░░░░] Finished early, waiting
→ Cannot add new requests in Req 2, 3 slots
Continuous Batching:
Time → ████████████████████████████████████
Req 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]
Req 2: [■■■■■■■■■■] ← Slot returned immediately upon completion
Req 4: [■■■■■■■■■■■■■■■■■■■■■■] ← Immediately added
Req 3: [■■■■] ← Complete
Req 5: [■■■■■■■■■■■■■■] ← Immediately added
→ GPU always operates at maximum load
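The difference can be made concrete with a toy simulation. This is illustrative only; vLLM's real scheduler also handles prefill, preemption, and memory pressure:

```python
def steps_to_finish(lengths, batch_size, continuous):
    """Count decode iterations until all requests finish.

    continuous=True refills freed batch slots every iteration;
    continuous=False (static) admits a new batch only when the batch is empty.
    """
    pending = list(lengths)   # remaining output lengths of queued requests
    running = []
    steps = 0
    while pending or running:
        if continuous or not running:
            while pending and len(running) < batch_size:
                running.append(pending.pop(0))
        steps += 1
        # one decode step for every running request; drop finished ones
        running = [n - 1 for n in running if n > 1]
    return steps

lengths = [32, 10, 4, 32, 10, 4]   # hypothetical output lengths
static = steps_to_finish(lengths, batch_size=3, continuous=False)
cont = steps_to_finish(lengths, batch_size=3, continuous=True)
print(f"static: {static} steps, continuous: {cont} steps")
```

With a mix of long and short requests, the continuous variant finishes the same work in far fewer decode iterations, because short requests return their slots immediately instead of idling until the longest request in the batch completes.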
vLLM Installation and Basic Usage
Installation
vLLM can be easily installed via pip. It requires CUDA 12.1 or higher, and as of March 2026, the latest stable version is the 0.7.x series.
# Basic installation (for CUDA 12.4)
pip install vllm
# Install specific version
pip install vllm==0.7.2
# When using FlashInfer backend (recommended)
pip install flashinfer-python
pip install vllm
# Build from source (custom CUDA version)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
# Verify installation
python -c "import vllm; print(vllm.__version__)"
Running an OpenAI-Compatible API Server
One of vLLM's greatest advantages is providing a server fully compatible with the OpenAI API. Existing OpenAI client code can be used without modification.
# Basic server launch
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--dtype auto
# Run 70B model with 4-GPU Tensor Parallel
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 4096 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--max-num-seqs 256 \
--disable-log-requests
Offline Inference with Python SDK
You can also use the vLLM engine directly from Python code without a server. This is useful for batch processing or benchmarking.
from vllm import LLM, SamplingParams

# Load model (PagedAttention applied automatically)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enable_prefix_caching=True,
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.1,
    stop=["<|eot_id|>"],
)

# Batch inference
prompts = [
    "Explain best practices for managing GPU nodes in Kubernetes.",
    "Describe how Python asyncio's event loop works.",
    "Compare the pros and cons of PostgreSQL partitioning strategies.",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated: {generated_text[:200]}...")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print("---")
Production Deployment Configuration
Docker-Based Deployment
In production environments, deploying via Docker containers is standard. You need to properly configure model cache and GPU settings based on the official vLLM image.
# Production deployment using official image
docker run -d \
--name vllm-server \
--gpus '"device=0,1,2,3"' \
--shm-size=16g \
-p 8000:8000 \
-v /data/models:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
--restart unless-stopped \
vllm/vllm-openai:v0.7.2 \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--max-num-seqs 256 \
--served-model-name llama-70b \
--disable-log-requests \
--uvicorn-log-level warning
Note: If `--shm-size` is not set sufficiently large, NCCL communication errors occur in Tensor Parallel mode. A minimum of 8GB is recommended; 16GB is preferable.
GPU Memory Configuration Strategy
The --gpu-memory-utilization parameter determines the proportion of GPU memory vLLM will use. Setting it too high causes CUDA OOM, while setting it too low reduces concurrent throughput.
| Environment | gpu-memory-utilization | Reason |
|---|---|---|
| Development/Testing | 0.80 | Possible GPU sharing with other processes |
| Production (Dedicated GPU) | 0.90~0.92 | Maximum throughput with slight headroom |
| Production (with Monitoring) | 0.88~0.90 | Prometheus exporter etc. consume VRAM |
| Unstable Workloads | 0.85 | Buffer for sudden long sequences |
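To turn the table into concrete numbers, the VRAM left over for KV Cache at a given utilization setting can be back-of-enveloped as follows. This is a sketch: the fixed `overhead_gb` allowance for activations, CUDA Graphs, and framework buffers is an assumption, not a figure vLLM reports.

```python
def kv_cache_budget_gb(total_vram_gb: float, weights_gb: float,
                       utilization: float, overhead_gb: float = 2.0) -> float:
    """Approximate VRAM left for KV Cache after model weights and a
    fixed overhead allowance (assumed, not vLLM-reported)."""
    usable = total_vram_gb * utilization
    return max(usable - weights_gb - overhead_gb, 0.0)

# Llama 3.1 8B in FP16 (~16 GB of weights) on a single A100 80GB
for util in (0.80, 0.90, 0.92):
    budget = kv_cache_budget_gb(80, 16, util)
    print(f"utilization={util:.2f} -> ~{budget:.1f} GB for KV Cache")
```

The jump from 0.80 to 0.92 adds roughly 10 GB of KV Cache headroom on an 80GB card, which translates directly into more concurrent sequences.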
Model Loading Optimization
Loading large models can take several minutes. Using the Safetensors format and local caching can significantly improve loading speed.
# Pre-download model to reduce loading time
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
--local-dir /data/models/llama-3.1-70b \
--local-dir-use-symlinks False
# Load directly from local path (no download needed)
vllm serve /data/models/llama-3.1-70b \
--tensor-parallel-size 4 \
--load-format safetensors \
--max-model-len 8192
Performance Optimization Techniques
Prefix Caching (Automatic Prefix Caching)
In many production workloads, the system prompt is included in every request. Prefix Caching caches and reuses the KV Cache for this common prefix. If the system prompt is 2,000 tokens and 100 requests per second come in, without Prefix Caching you must recompute the KV Cache for 2,000 tokens with every request. With Prefix Caching enabled, it is computed only for the first request and subsequent requests read from cache instantly.
# Enable Prefix Caching (disabled by default in vLLM 0.7.x)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-prefix-caching \
--max-model-len 8192
In benchmarks, Prefix Caching has been observed to improve TTFT (Time-To-First-Token) by up to 8x for workloads with a common system prompt. However, for workloads where prompts are completely different each time, cache hit rates are low and the effect is minimal.
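As a back-of-envelope check on that claim, the prefill work avoided scales directly with prompt length, request rate, and cache hit rate. The numbers below are hypothetical, and the 95% hit rate is an assumption:

```python
def prefill_tokens_saved_per_s(prompt_tokens: int, req_per_s: float,
                               hit_rate: float) -> float:
    """Prefill tokens per second avoided by prefix cache hits (rough model)."""
    return prompt_tokens * req_per_s * hit_rate

# 2,000-token shared system prompt at 100 req/s, assuming a 95% hit rate
saved = prefill_tokens_saved_per_s(2_000, 100, 0.95)
print(f"~{saved:,.0f} prefill tokens/s avoided")
```

At these assumed numbers, nearly 200K prefill tokens per second never need to be recomputed, which is why the TTFT improvement is so large for system-prompt-heavy workloads and negligible when every prompt is unique.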
Speculative Decoding
Speculative Decoding is a technique where a small draft model predicts multiple tokens in advance, and the original model (target) verifies them in a single forward pass. vLLM supports both the draft model approach and the ngram-based approach.
# Speculative Decoding with draft model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.1-8B-Instruct \
--num-speculative-tokens 5 \
--speculative-disable-mqa-scorer \
--tensor-parallel-size 4
# ngram-based Speculative Decoding (no additional model needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-model "[ngram]" \
--num-speculative-tokens 5 \
--ngram-prompt-lookup-max 4
Speculative Decoding is most effective with greedy decoding or low temperature. At high temperature, the draft model's prediction acceptance rate drops, potentially adding overhead instead.
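The sensitivity to acceptance rate can be quantified with the standard speculative-sampling expectation: with k draft tokens and a per-token acceptance rate α, one target forward pass yields (1 − α^(k+1)) / (1 − α) tokens on average. A sketch of that formula (it ignores the draft model's own cost, so real speedups are lower):

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass with k draft
    tokens and per-token acceptance rate alpha (draft cost ignored)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.9, 0.6, 0.3):
    e = expected_tokens_per_verify(alpha, k=5)
    print(f"acceptance={alpha:.1f} -> {e:.2f} tokens per target pass")
```

At α = 0.9 a target pass yields over 4.6 tokens, but at α = 0.3 it yields barely more than 1 — at which point the extra draft-model work can make the system slower than plain decoding, matching the high-temperature caveat above.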
Quantization and Serving
GPTQ, AWQ, and FP8 quantized models can be served directly with vLLM. Quantization is used to reduce model size for serving on fewer GPUs or to increase concurrent throughput.
| Quantization Method | Bits | 70B VRAM | Relative Quality | vLLM Support |
|---|---|---|---|---|
| FP16 (baseline) | 16 | ~140GB | 100% | Default |
| FP8 (W8A8) | 8 | ~70GB | 99.5% | Supported |
| AWQ (W4) | 4 | ~35GB | 98.5% | Supported |
| GPTQ (W4) | 4 | ~35GB | 98.0% | Supported |
| GGUF (Q4_K_M) | 4 | ~35GB | 97.5% | Limited |
# Serving AWQ quantized model
vllm serve TheBloke/Llama-3.1-70B-Instruct-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
# Serving FP8 quantized model (GPUs supporting FP8 like H100/L40S)
vllm serve neuralmagic/Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 \
--tensor-parallel-size 4 \
--max-model-len 8192
Note: When serving AWQ/GPTQ models with Tensor Parallel, you must verify the split compatibility of the quantized model. Some quantized models only work correctly with specific TP sizes.
Kubernetes Deployment Patterns
Basic Deployment Configuration
When deploying vLLM on Kubernetes, you need to properly configure GPU resource requests, health checks, and graceful shutdown.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: llm-serving
  labels:
    app: vllm
    model: llama-70b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
      model: llama-70b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-70b
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8000'
        prometheus.io/path: '/metrics'
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.2
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-70B-Instruct'
            - '--tensor-parallel-size'
            - '4'
            - '--max-model-len'
            - '8192'
            - '--gpu-memory-utilization'
            - '0.90'
            - '--enable-prefix-caching'
            - '--max-num-seqs'
            - '256'
            - '--served-model-name'
            - 'llama-70b'
            - '--disable-log-requests'
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_ATTENTION_BACKEND
              value: 'FLASHINFER'
          resources:
            requests:
              cpu: '8'
              memory: '32Gi'
              nvidia.com/gpu: '4'
            limits:
              cpu: '16'
              memory: '64Gi'
              nvidia.com/gpu: '4'
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
          startupProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 60  # Allow up to 10 minutes for model loading
          readinessProbe:
            httpGet:
              path: /health
              port: http
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: http
            periodSeconds: 15
            failureThreshold: 5
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
      nodeSelector:
        nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-70b
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model: llama-70b
  ports:
    - port: 8000
      targetPort: http
      name: http
  type: ClusterIP
KEDA-Based Autoscaling
LLM serving requires autoscaling due to significant traffic fluctuations. Using KEDA (Kubernetes Event-Driven Autoscaling), pods can be automatically scaled based on Prometheus metrics.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-llama-70b
  minReplicaCount: 1
  maxReplicaCount: 8
  pollingInterval: 15
  cooldownPeriod: 300  # Wait 5 minutes before scale down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_pending_requests
        query: |
          avg(vllm:num_requests_waiting{model_name="llama-70b"})
        threshold: '10'
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_gpu_cache_usage
        query: |
          avg(vllm:gpu_cache_usage_perc{model_name="llama-70b"})
        threshold: '85'
Note: GPU pod scale-up is much slower than for regular pods. Model loading takes 2-10 minutes, so set `cooldownPeriod` sufficiently long and keep `minReplicaCount` at 1 or above. Be sure to account for the gap between when scale-up is triggered and when the new pod can actually serve traffic.
Inference Engine Comparison
vLLM vs SGLang vs TensorRT-LLM vs LMDeploy
Here is a comparison of the major LLM inference engines as of 2026 from multiple perspectives. Benchmark numbers are based on Llama 3.1 8B, A100 80GB, with 1024 input tokens and 512 output tokens.
| Category | vLLM (0.7.x) | SGLang (0.4.x) | TensorRT-LLM | LMDeploy |
|---|---|---|---|---|
| Throughput (req/s) | ~42 | ~48 | ~55 | ~40 |
| TTFT (ms) | ~85 | ~72 | ~60 | ~90 |
| ITL (ms/token) | ~12 | ~11 | ~9 | ~13 |
| Model Support Range | Very wide | Wide | Medium | Wide |
| Installation Difficulty | Easy | Easy | Difficult | Medium |
| OpenAI API Compatible | Full support | Full support | Partial | Full support |
| Multimodal Support | Supported | Supported | Limited | Supported |
| FP8 Quantization | Supported | Supported | Native | Supported |
| Prefix Caching | Supported | RadixAttention | Supported | Supported |
| LoRA Serving | Supported | Supported | Limited | Supported |
| Speculative Decoding | Supported | Supported | Supported | Limited |
| Community Activity | Very high | High | Medium | Medium |
| Production Maturity | High | High | Very high | Medium |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Engine Selection Guide
Choose vLLM when: You need the widest model compatibility, want to prototype quickly and carry through to production, or when community ecosystem and plugins matter. The OpenAI-compatible API is smooth, and support for new model architectures is the fastest.
Choose SGLang when: You need advanced prompt caching based on RadixAttention, have many structured outputs, or when TTFT optimization is critical for interactive workloads. SGLang's RadixTree-based caching is more efficient than vLLM's Prefix Caching for tree-structured multi-turn conversations.
Choose TensorRT-LLM when: Maximum throughput and minimum latency are absolutely critical, and you use NVIDIA GPUs exclusively and can tolerate the complexity of the engine build process. Since TensorRT engines must be pre-built per model, the deployment pipeline becomes more complex.
Choose LMDeploy when: You need high performance from the TurboMind engine along with deep integration with the PyTorch ecosystem, especially when serving InternLM family models.
Throughput Comparison by Batch Size
| Concurrent Requests | vLLM (tok/s) | SGLang (tok/s) | TensorRT-LLM (tok/s) |
|---|---|---|---|
| 1 | 85 | 92 | 105 |
| 8 | 620 | 680 | 750 |
| 32 | 2,100 | 2,350 | 2,600 |
| 64 | 3,800 | 4,200 | 4,500 |
| 128 | 5,200 | 5,800 | 6,100 |
These numbers are based on Llama 3.1 8B, A100 80GB, FP16 and can vary significantly depending on workload characteristics. Always run your own benchmarks in actual production environments.
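When running those benchmarks, report percentiles rather than averages — a handful of very long requests can hide badly behind a mean. A minimal, dependency-free helper using the nearest-rank method (the sample values are hypothetical):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile -- simple and dependency-free."""
    ranked = sorted(samples)
    rank = round(p / 100 * len(ranked)) - 1
    return ranked[max(0, min(len(ranked) - 1, rank))]

# Hypothetical per-request TTFT samples (seconds) from a load test
ttft = [0.06, 0.07, 0.07, 0.08, 0.09, 0.10, 0.12, 0.15, 0.40, 1.90]
print(f"mean={sum(ttft) / len(ttft):.2f}s  "
      f"p50={percentile(ttft, 50):.2f}s  p99={percentile(ttft, 99):.2f}s")
```

In this sample the mean (0.30s) sits far above the median (0.09s) because of one slow request — exactly the tail behavior that p95/p99 tracking is meant to expose.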
Operational Considerations and Monitoring
Prometheus Metrics Collection
vLLM exposes Prometheus-format metrics through the /metrics endpoint. Key monitoring indicators are as follows.
| Metric | Description | Alert Threshold |
|---|---|---|
| `vllm:num_requests_running` | Currently running requests | Over 90% of `max_num_seqs` |
| `vllm:num_requests_waiting` | Requests in queue | Consistently above 50 |
| `vllm:gpu_cache_usage_perc` | GPU KV Cache utilization | Alert when over 95% |
| `vllm:cpu_cache_usage_perc` | CPU swap cache utilization | Frequent swapping when over 50% |
| `vllm:avg_prompt_throughput_toks_per_s` | Prompt processing tokens/sec | Under 50% of baseline |
| `vllm:avg_generation_throughput_toks_per_s` | Generation tokens/sec | Under 50% of baseline |
| `vllm:e2e_request_latency_seconds` | End-to-end request latency | p99 exceeds SLA |
| `vllm:time_to_first_token_seconds` | Time to first token | p99 exceeds 2 seconds |
GPU Memory Monitoring
GPU-level memory monitoring is performed through the DCGM (Data Center GPU Manager) Exporter. You need to monitor both vLLM's own metrics and GPU hardware metrics to get the full picture.
# Prometheus query examples: for Grafana dashboards
# 1. GPU KV Cache utilization (vLLM internal)
# vllm:gpu_cache_usage_perc{model_name="llama-70b"}
# 2. Actual GPU memory usage (DCGM)
# DCGM_FI_DEV_FB_USED{gpu="0"} / DCGM_FI_DEV_FB_TOTAL{gpu="0"} * 100
# 3. Waiting request count trend
# rate(vllm:num_requests_waiting{model_name="llama-70b"}[5m])
# 4. p99 TTFT monitoring
# histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))
# 5. Tokens generated per second
# rate(vllm:avg_generation_throughput_toks_per_s[1m])
# AlertManager rule example
# ALERT VLLMHighCacheUsage
# IF vllm:gpu_cache_usage_perc > 95
# FOR 5m
# LABELS { severity = "warning" }
# ANNOTATIONS {
# summary = "vLLM KV Cache utilization exceeds 95%",
# description = "KV Cache utilization for model {{ $labels.model_name }}
# is {{ $value }}%. Consider scaling up."
# }
Key Operational Guidelines
- **Log Level Management**: Always use `--disable-log-requests` in production. Logging prompts for every request creates disk I/O bottlenecks and risks personal data exposure.
- **Graceful Shutdown**: Allow sufficient time to complete in-progress requests when receiving SIGTERM. Set Kubernetes `terminationGracePeriodSeconds` to 120 seconds or more.
- **Health Check Separation**: Separate `startupProbe` and `readinessProbe`. Since model loading takes several minutes, set `startupProbe`'s `failureThreshold` sufficiently high; `readinessProbe` checks whether inference is actually possible.
- **Shared Memory Configuration**: In Tensor Parallel mode, NCCL communicates between GPUs via `/dev/shm`. You must set `--shm-size=16g` in Docker and an `emptyDir` with `medium: Memory` in Kubernetes.
- **Model Cache Persistence**: Use a PVC (PersistentVolumeClaim) to persist the HuggingFace model cache. Re-downloading tens of GB of model weights on every pod restart wastes both cost and time.
Failure Cases and Recovery Procedures
Case 1: CUDA Out of Memory (OOM)
Symptom: Some time after the server starts, `torch.cuda.OutOfMemoryError` occurs and all requests fail.
Cause: Occurs when --gpu-memory-utilization is set too high or --max-model-len is excessively large for the actual workload. KV Cache allocation exceeds physical GPU memory.
Recovery Procedure:
- Lower `--gpu-memory-utilization` to 0.85.
- Limit `--max-model-len` to the actual maximum length needed.
- Reduce `--max-num-seqs` to limit concurrent requests.
- Apply quantization (AWQ/FP8) to reduce model memory.
- Increase the Tensor Parallel count to distribute load per GPU.
Prevention: Always perform stress testing at 120% of expected maximum load before production deployment.
# Memory profiling for OOM debugging
VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.85 \
--max-model-len 4096 \
--max-num-seqs 64 \
--enforce-eager # Disable CUDA Graph to reduce memory usage
Case 2: Model Loading Failure
Symptom: At server startup, ValueError: The model's max seq len is larger than the maximum number of tokens that can be stored in KV cache or Not enough memory error occurs.
Cause: The KV Cache required for the specified --max-model-len exceeds available GPU memory. After loading model weights, the remaining memory is insufficient to allocate even minimal KV Cache.
Recovery Procedure:
- Reduce `--max-model-len` (e.g., from 8192 to 4096).
- Increase `--tensor-parallel-size`.
- Use a quantized model.
- Add the `--enforce-eager` flag to reclaim memory occupied by CUDA Graphs.
Case 3: NCCL Timeout (Tensor Parallel)
Symptom: In multi-GPU environments, RuntimeError: NCCL communicator was aborted or Watchdog caught collective operation timeout occurs.
Cause: Insufficient GPU-to-GPU communication (NVLink/PCIe) bandwidth, insufficient /dev/shm size, or mixed use of heterogeneous GPUs.
Recovery Procedure:
- Set `--shm-size` to 16GB or more.
- Configure nodeSelector to use only GPUs with identical specifications.
- Check detailed logs with the `NCCL_DEBUG=INFO` environment variable.
- If there is no NVLink connection between GPUs, reduce the Tensor Parallel count considering PCIe bandwidth limitations.
Case 4: Response Quality Degradation
Symptom: After serving a quantized model, response quality noticeably drops. Repetitive sentences, context-free responses, and decreased code generation accuracy are observed.
Cause: Excessive quantization (INT4) has damaged the model's critical weights. Quality degradation with INT4 quantization is particularly pronounced for coding and math tasks.
Recovery Procedure:
- Switch to FP8 quantization (minimal quality loss).
- Try GPTQ instead of AWQ or vice versa.
- Consider FP16 serving by increasing the Tensor Parallel count without quantization.
- Perform benchmarks of the quantized model by major task type to establish quality baselines.
Case 5: Kubernetes Pod Infinite Restart
Symptom: vLLM pod falls into CrashLoopBackOff state. The livenessProbe fails before model loading completes, causing kubelet to force-terminate the container.
Cause: startupProbe was not configured, or failureThreshold is too low to cover the model loading time.
Recovery Procedure:
- Add a `startupProbe` and set `failureThreshold` to 60 or higher (at 10-second intervals, this allows 10 minutes).
- Set the `livenessProbe`'s `initialDelaySeconds` sufficiently long.
- Pre-download the model to PVC to shorten loading time.
Checklist
Items that must be verified before deploying vLLM to production.
Infrastructure Preparation
- Verify GPU driver and CUDA version compatibility with vLLM requirements
- Confirm by calculation that GPU memory is sufficient for model weights + KV Cache
- Check NVLink/NVSwitch connection status (when using Tensor Parallel)
- Set `/dev/shm` size to 16GB or more (when using Tensor Parallel)
- Verify model files are pre-downloaded to local PVC
Serving Configuration
- Adjust the `--gpu-memory-utilization` value for your workload
- Set `--max-model-len` to the actual maximum sequence length needed
- Set `--max-num-seqs` based on load test results
- Decide whether to enable `--enable-prefix-caching` (essential when sharing system prompts)
- Enable `--disable-log-requests` in production
- Set `--dtype auto` or an explicit dtype
- Choose an attention backend (FLASHINFER recommended)
Kubernetes Deployment
- Configure a `startupProbe` and set `failureThreshold` sufficiently high
- Separate `readinessProbe` and `livenessProbe`
- Set `terminationGracePeriodSeconds` to 120 seconds or more
- Configure a GPU nodeSelector or nodeAffinity
- Persist the model cache with a PVC
- Mount `/dev/shm` with `emptyDir` `medium: Memory`
- Configure KEDA or HPA autoscaling
- Set `minReplicaCount` to 1 or higher (prevent cold starts)
Monitoring and Alerting
- Configure the Prometheus metrics collection path (`/metrics`)
- Set up Grafana dashboards (KV Cache utilization, TTFT, throughput)
- Configure OOM alerting
- Configure waiting request count threshold alerts
- GPU temperature and power monitoring (DCGM)
- Build response quality monitoring pipeline (sampling-based)
Performance Validation
- Stress test completed at 120% of expected maximum load
- TTFT and ITL p50/p95/p99 measured and SLA compliance confirmed
- Acceptance rate confirmed when applying Speculative Decoding (60% or higher recommended)
- Quality benchmarks completed by major task type when applying quantization
- Long-duration (24+ hours) stability test completed
References
- Efficient Memory Management for Large Language Model Serving with PagedAttention - The original PagedAttention paper, providing a detailed explanation of virtual memory paging for KV Cache.
- vLLM Official Documentation - The official reference documentation for vLLM, from installation to advanced configuration.
- vLLM GitHub Repository - Source code, issue tracker, and release notes.
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving - A paper on optimizing serving efficiency by separating the prefill and decoding stages.
- vLLM vs SGLang vs LMDeploy: Fastest LLM Inference Engine in 2026 - Performance benchmark comparison analysis of major inference engines as of 2026.
- SGLang: Efficient Execution of Structured Language Model Programs - A paper covering SGLang's RadixAttention and structured output optimization techniques.
- NVIDIA TensorRT-LLM Documentation - Official TensorRT-LLM documentation, including engine build and optimization guides.