vLLM PagedAttention Production Serving Optimization and Inference Engine Comparison Guide

Introduction

The first wall you hit when serving LLMs in production is GPU memory. Loading Llama 3.1 70B in FP16 requires 140GB for model weights alone, and processing requests concurrently demands tens to hundreds of additional gigabytes for the KV Cache. With production concurrency commonly in the tens to hundreds of requests, KV Cache memory management becomes the decisive factor for overall system throughput and latency.

In traditional Transformer inference implementations, KV Cache is pre-allocated for the maximum sequence length for each request. If a service allows up to 4,096 tokens but the actual average output is 512 tokens, 87% of the allocated memory is wasted. PagedAttention, published by UC Berkeley, fundamentally solves this problem, and vLLM is the open-source inference engine that implements it.

This article covers the full spectrum of production LLM serving: from how PagedAttention works, vLLM architecture, production deployment configuration, performance optimization techniques, Kubernetes-based autoscaling, comparison with SGLang/TensorRT-LLM, operational monitoring, and failure cases with recovery procedures.

How PagedAttention Works

Applying Virtual Memory Paging

PagedAttention is inspired by virtual memory management in operating systems. The OS provides processes with a contiguous virtual address space, but physical memory is allocated non-contiguously in fixed-size pages. PagedAttention applies this concept directly to the KV Cache.

The KV Cache is divided into fixed-size blocks, where each block stores Key-Value tensors for a fixed number of tokens. The default block size is 16 tokens. As new tokens are generated during request processing, when the current block is full, a new physical block is allocated and a mapping is added to the block table.

Traditional contiguous allocation:
Request A: [████████░░░░░░░░░░░░░░░░░░░░░░░░]  Actual 512, max 2048 allocated
Request B: [██████████████░░░░░░░░░░░░░░░░░░]  Actual 896, max 2048 allocated
Request C: [██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]  Actual 128, max 2048 allocated
Total 6,144 slots allocated, only 1,536 used. 75% waste

PagedAttention block approach (block size = 16):
Request A: [B0][B1]...[B31]  32 blocks allocated (512 tokens)
Request B: [B0][B1]...[B55]  56 blocks allocated (896 tokens)
Request C: [B0]...[B7]       8 blocks allocated (128 tokens)
Only internal fragmentation in the last block; average waste under 4%
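The block-table bookkeeping described above can be sketched in a few lines. This is a toy model for illustration only; the names (`ToyBlockManager`, `append_token`) are not vLLM's internal API.

```python
class ToyBlockManager:
    """Toy model of PagedAttention block allocation (illustrative, not vLLM internals)."""

    def __init__(self, block_size: int = 16, num_physical_blocks: int = 1024):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request -> physical block IDs
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> int:
        """Record one generated token; allocate a new physical block only when
        the current block is full (or on the very first token)."""
        tokens = self.num_tokens.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if tokens % self.block_size == 0:
            table.append(self.free_blocks.pop())  # add mapping to the block table
        self.num_tokens[request_id] = tokens + 1
        return table[-1]

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

mgr = ToyBlockManager(block_size=16)
for _ in range(40):                      # 40 tokens -> ceil(40/16) = 3 blocks
    mgr.append_token("req-A")
print(len(mgr.block_tables["req-A"]))    # 3
```

Memory is consumed block by block as tokens are actually generated, which is exactly why the waste shrinks from 75% to the last partially filled block.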

Copy-on-Write and Prefix Sharing

Another strength of PagedAttention is its Copy-on-Write (CoW) mechanism. When generating multiple sequences from the same prompt, such as in beam search or parallel sampling, KV Cache blocks for the common prefix can be physically shared. New blocks are allocated only at the point of divergence, reducing beam search memory usage by up to 55%.
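The CoW bookkeeping can be sketched with per-block reference counts. `fork` and `write_block` are illustrative names under these assumptions, not vLLM internals.

```python
# Toy copy-on-write bookkeeping for shared prefix blocks (illustrative only).
ref_count: dict[int, int] = {}  # physical block -> number of sequences referencing it

def fork(parent_blocks: list[int]) -> list[int]:
    """A forked sequence initially shares all of the parent's blocks."""
    for b in parent_blocks:
        ref_count[b] = ref_count.get(b, 1) + 1
    return list(parent_blocks)

def write_block(blocks: list[int], idx: int, alloc) -> None:
    """Before writing into a shared block, copy it (copy-on-write)."""
    b = blocks[idx]
    if ref_count.get(b, 1) > 1:
        ref_count[b] -= 1
        new_b = alloc()          # a real implementation also copies the K/V contents
        ref_count[new_b] = 1
        blocks[idx] = new_b

# Two beam-search branches share a 4-block prefix until one diverges:
free = iter(range(100, 200))
parent = [0, 1, 2, 3]
for b in parent:
    ref_count[b] = 1
child = fork(parent)
write_block(child, 3, lambda: next(free))  # child diverges in the last block
print(parent, child)  # [0, 1, 2, 3] [0, 1, 2, 100]
```

Only the diverging block is duplicated; the three prefix blocks stay physically shared, which is where the memory savings for beam search come from.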

KV Cache Size Calculation

To accurately estimate GPU memory in production environments, you need to predict the KV Cache size. The following function calculates the KV Cache size for each model.

def estimate_kv_cache_memory(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # FP16
    block_size: int = 16,
) -> dict:
    """
    KV Cache memory estimation based on vLLM PagedAttention.
    For GQA models, num_kv_heads is smaller than num_attention_heads.
    """
    # KV Cache bytes per token
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    # Bytes per block
    per_block_bytes = per_token_bytes * block_size
    # Total blocks needed for max concurrent processing
    total_tokens = max_seq_len * max_batch_size
    total_blocks = (total_tokens + block_size - 1) // block_size
    total_bytes = total_blocks * per_block_bytes
    total_gb = total_bytes / (1024 ** 3)

    return {
        "per_token_kv_bytes": per_token_bytes,
        "per_block_kv_bytes": per_block_bytes,
        "total_blocks": total_blocks,
        "total_kv_cache_gb": round(total_gb, 2),
    }

# Llama 3.1 70B (GQA: 8 KV heads, 80 layers, head_dim=128)
result = estimate_kv_cache_memory(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    max_seq_len=4096,
    max_batch_size=64,
    dtype_bytes=2,
)
print(result)
# {'per_token_kv_bytes': 327680, 'per_block_kv_bytes': 5242880,
#  'total_blocks': 16384, 'total_kv_cache_gb': 80.0}
# → 64 concurrent requests × 4096 tokens requires 80GB for KV Cache alone

Looking at this calculation, serving a 70B model with 64 concurrent requests requires over 220GB of VRAM: 140GB for model weights + 80GB for KV Cache. That is a scale requiring three or more A100 80GB GPUs. PagedAttention maximizes efficiency by dynamically allocating this memory based on actual usage.

vLLM Architecture

V1 Engine and Core Components

The V1 engine, introduced with the vLLM 0.7.x series, substantially reworks the architecture relative to earlier versions. The core components are as follows.

Scheduler: Determines the priority of pending requests and decides which to execute based on GPU memory conditions. When memory runs short, the preemption mechanism either swaps the KV Cache of lower-priority requests out to CPU memory or drops it and recomputes it later.

Block Manager: The core of PagedAttention, managing allocation, deallocation, and sharing of physical blocks. It maintains the block table and tracks mappings between logical and physical blocks.

Worker: The process that performs actual model inference on GPUs. When Tensor Parallelism is enabled, multiple Workers collaborate to process a single request.

Model Runner: Executes the model's forward pass and calls optimized attention kernels like FlashAttention or FlashInfer.

Continuous Batching

In traditional static batching, new requests cannot be added until all requests in the batch are completed. Requests generating short responses must wait for long responses, reducing GPU utilization.

vLLM's Continuous Batching removes completed requests from the batch at each iteration and immediately adds waiting new requests. This maintains GPU utilization above 90% while significantly reducing average latency.

Static Batching:
Time →  ████████████████████████████████████
Req 1:  [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]  Generation complete
Req 2:  [■■■■■■■■■■░░░░░░░░░░░░░░░░░░░░░░]  Finished early, waiting
Req 3:  [■■■■░░░░░░░░░░░░░░░░░░░░░░░░░░░░]  Finished early, waiting
Cannot add new requests in Req 2, 3 slots

Continuous Batching:
Time →  ████████████████████████████████████
Req 1:  [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]
Req 2:  [■■■■■■■■■■]  Slot returned immediately upon completion
Req 4:             [■■■■■■■■■■■■■■■■■■■■■■]  Added immediately
Req 3:  [■■■■]  Complete
Req 5:       [■■■■■■■■■■■■■■]  Added immediately
GPU always operates at maximum load
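The gain is easy to see in a toy simulation, where one "step" stands in for one decode iteration (illustrative only, not vLLM's scheduler):

```python
from collections import deque

def simulate(lengths: dict[str, int], max_batch: int) -> int:
    """Toy continuous-batching loop: finished requests free their slot
    immediately and waiting requests join at the next iteration."""
    waiting = deque(lengths)
    running: dict[str, int] = {}
    steps = 0
    while waiting or running:
        # Admit waiting requests into free slots (continuous batching)
        while waiting and len(running) < max_batch:
            rid = waiting.popleft()
            running[rid] = lengths[rid]
        # One decode iteration: every running request emits one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot returned immediately
        steps += 1
    return steps

# 4 requests of lengths 32/10/4/22 tokens, 2 slots:
print(simulate({"r1": 32, "r2": 10, "r3": 4, "r4": 22}, max_batch=2))  # 36
```

Static batching for the same workload would take 32 + 22 = 54 iterations (two full batches), versus 36 here, because short requests hand their slot to waiting requests mid-flight.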

vLLM Installation and Basic Usage

Installation

vLLM can be easily installed via pip. It requires CUDA 12.1 or higher, and as of March 2026, the latest stable version is the 0.7.x series.

# Basic installation (for CUDA 12.4)
pip install vllm

# Install specific version
pip install vllm==0.7.2

# When using FlashInfer backend (recommended)
pip install flashinfer-python
pip install vllm

# Build from source (custom CUDA version)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

# Verify installation
python -c "import vllm; print(vllm.__version__)"

Running an OpenAI-Compatible API Server

One of vLLM's greatest advantages is providing a server fully compatible with the OpenAI API. Existing OpenAI client code can be used without modification.

# Basic server launch
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --dtype auto

# Run 70B model with 4-GPU Tensor Parallel
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --disable-log-requests
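Because the server speaks the OpenAI API, the official `openai` Python package works unchanged; only the `base_url` (and a placeholder API key) change. The endpoint and model name below assume the 8B server launched above on localhost:

```python
from openai import OpenAI

# Point the standard OpenAI client at the vLLM server; vLLM ignores the key
# unless --api-key is set, so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # or the --served-model-name alias
    messages=[{"role": "user", "content": "What does PagedAttention optimize?"}],
    temperature=0.7,
    max_tokens=256,
)
print(resp.choices[0].message.content)

# Streaming works exactly as with the OpenAI API
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize continuous batching in one line."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Existing OpenAI-based application code therefore migrates by changing only the `base_url` and model name.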

Offline Inference with Python SDK

You can also use the vLLM engine directly from Python code without a server. This is useful for batch processing or benchmarking.

from vllm import LLM, SamplingParams

# Load model (PagedAttention applied automatically)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enable_prefix_caching=True,
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.1,
    stop=["<|eot_id|>"],
)

# Batch inference
prompts = [
    "Explain best practices for managing GPU nodes in Kubernetes.",
    "Describe how Python asyncio's event loop works.",
    "Compare the pros and cons of PostgreSQL partitioning strategies.",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated: {generated_text[:200]}...")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print("---")

Production Deployment Configuration

Docker-Based Deployment

In production environments, deploying via Docker containers is standard. You need to properly configure model cache and GPU settings based on the official vLLM image.

# Production deployment using official image
docker run -d \
  --name vllm-server \
  --gpus '"device=0,1,2,3"' \
  --shm-size=16g \
  -p 8000:8000 \
  -v /data/models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  --restart unless-stopped \
  vllm/vllm-openai:v0.7.2 \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --served-model-name llama-70b \
  --disable-log-requests \
  --uvicorn-log-level warning

Note: If --shm-size is not set sufficiently, NCCL communication errors occur in Tensor Parallel mode. A minimum of 8GB is recommended, preferably 16GB.

GPU Memory Configuration Strategy

The --gpu-memory-utilization parameter determines the proportion of GPU memory vLLM will use. Setting it too high causes CUDA OOM, while setting it too low reduces concurrent throughput.

| Environment                  | gpu-memory-utilization | Reason                                    |
| ---------------------------- | ---------------------- | ----------------------------------------- |
| Development/Testing          | 0.80                   | Possible GPU sharing with other processes |
| Production (dedicated GPU)   | 0.90-0.92              | Maximum throughput with slight headroom   |
| Production (with monitoring) | 0.88-0.90              | Prometheus exporter etc. consume VRAM     |
| Unstable workloads           | 0.85                   | Buffer for sudden long sequences          |
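Combining this parameter with the per-token KV size computed earlier gives a rough concurrency bound. The sketch ignores activation and CUDA Graph overhead, so treat the result as optimistic; the 2,048-token average sequence length is an assumption:

```python
def max_concurrent_seqs(
    gpu_total_gb: float,
    weight_gb: float,
    utilization: float,
    per_token_kv_bytes: int,
    avg_seq_len: int,
) -> int:
    """Optimistic upper bound on sequences whose KV Cache fits in the budget
    left after weights; ignores activation and CUDA Graph overhead."""
    kv_budget_bytes = (gpu_total_gb * utilization - weight_gb) * 1024**3
    return max(0, int(kv_budget_bytes // (per_token_kv_bytes * avg_seq_len)))

# Llama 3.1 70B on 4x A100 80GB: ~140GB FP16 weights, 327,680 KV bytes/token
# (from the earlier estimate), assuming a 2,048-token average sequence
print(max_concurrent_seqs(320, 140, 0.92, 327_680, 2048))  # 247
```

Dropping utilization from 0.92 to 0.85 shrinks the KV budget, and with it the concurrency ceiling, which is the trade-off the table above captures.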

Model Loading Optimization

Loading large models can take several minutes. Using the Safetensors format and local caching can significantly improve loading speed.

# Pre-download model to reduce loading time
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
  --local-dir /data/models/llama-3.1-70b \
  --local-dir-use-symlinks False

# Load directly from local path (no download needed)
vllm serve /data/models/llama-3.1-70b \
  --tensor-parallel-size 4 \
  --load-format safetensors \
  --max-model-len 8192

Performance Optimization Techniques

Prefix Caching (Automatic Prefix Caching)

In many production workloads, the system prompt is included in every request. Prefix Caching caches and reuses the KV Cache for this common prefix. If the system prompt is 2,000 tokens and 100 requests per second come in, without Prefix Caching you must recompute the KV Cache for 2,000 tokens with every request. With Prefix Caching enabled, it is computed only for the first request and subsequent requests read from cache instantly.

# Enable Prefix Caching (disabled by default in vLLM 0.7.x)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --max-model-len 8192

In benchmarks, Prefix Caching has been observed to improve TTFT (Time-To-First-Token) by up to 8x for workloads with a common system prompt. However, for workloads where prompts are completely different each time, cache hit rates are low and the effect is minimal.
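A back-of-envelope calculation shows why the effect is so large for shared-prefix workloads. The prefill throughput figure below is an assumed ballpark for an 8B model on one A100, not a measurement:

```python
# Prefill work avoided by Prefix Caching for the workload described above:
# a 2,000-token shared system prompt at 100 requests per second.
system_prompt_tokens = 2_000
requests_per_sec = 100
prefill_throughput_toks_per_s = 20_000  # assumed ballpark, not measured

saved_tokens_per_sec = system_prompt_tokens * requests_per_sec
gpu_seconds_saved = saved_tokens_per_sec / prefill_throughput_toks_per_s
print(f"Prefill avoided: {saved_tokens_per_sec:,} tok/s "
      f"(~{gpu_seconds_saved:.0f} GPU-seconds of prefill per second of traffic)")
```

Under these assumptions the cache eliminates roughly 10 GPU-seconds of prefill per wall-clock second; without it, that compute has to come from somewhere.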

Speculative Decoding

Speculative Decoding is a technique where a small draft model predicts multiple tokens in advance, and the original model (target) verifies them in a single forward pass. vLLM supports both the draft model approach and the ngram-based approach.

# Speculative Decoding with draft model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-disable-mqa-scorer \
  --tensor-parallel-size 4

# ngram-based Speculative Decoding (no additional model needed)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4

Speculative Decoding is most effective with greedy decoding or low temperature. At high temperature, the draft model's prediction acceptance rate drops, potentially adding overhead instead.
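The sensitivity to acceptance rate follows from the standard speculative decoding analysis: with k draft tokens and per-token acceptance probability alpha (independence assumed), the expected tokens emitted per target forward pass is (1 - alpha^(k+1)) / (1 - alpha):

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    # Equals the closed form (1 - alpha**(k + 1)) / (1 - alpha) for alpha < 1
    return sum(alpha**i for i in range(k + 1))

# Acceptance drops at high temperature, and the speedup drops with it:
for alpha in (0.9, 0.7, 0.5):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, 5):.2f} tokens per target pass")
```

At alpha near 0.9 a 5-token draft yields almost 4.7 tokens per target pass; at alpha 0.5 it yields under 2, at which point the draft model's own cost can outweigh the gain.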

Quantization and Serving

GPTQ, AWQ, and FP8 quantized models can be served directly with vLLM. Quantization is used to reduce model size for serving on fewer GPUs or to increase concurrent throughput.

| Quantization Method | Bits | 70B VRAM | Relative Quality | vLLM Support |
| ------------------- | ---- | -------- | ---------------- | ------------ |
| FP16 (baseline)     | 16   | ~140GB   | 100%             | Default      |
| FP8 (W8A8)          | 8    | ~70GB    | 99.5%            | Supported    |
| AWQ (W4)            | 4    | ~35GB    | 98.5%            | Supported    |
| GPTQ (W4)           | 4    | ~35GB    | 98.0%            | Supported    |
| GGUF (Q4_K_M)       | 4    | ~35GB    | 97.5%            | Limited      |

# Serving AWQ quantized model
vllm serve TheBloke/Llama-3.1-70B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# Serving FP8 quantized model (GPUs supporting FP8 like H100/L40S)
vllm serve neuralmagic/Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --max-model-len 8192

Note: When serving AWQ/GPTQ models with Tensor Parallel, you must verify the split compatibility of the quantized model. Some quantized models only work correctly with specific TP sizes.
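The VRAM figures in the table above follow from simple arithmetic: weights take roughly params x bits / 8 bytes. The 5% overhead for quantization scales, zero-points, and non-quantized layers is an assumption:

```python
def weight_size_gb(num_params_b: float, bits: int, overhead: float = 1.05) -> float:
    """Rough VRAM footprint of model weights: params x bits/8, plus ~5% for
    scales/zero-points and layers kept in higher precision (assumed overhead)."""
    return num_params_b * 1e9 * bits / 8 * overhead / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ/GPTQ W4", 4)]:
    print(f"{name:12s} ~{weight_size_gb(70, bits):.0f} GB")
```

For a 70B model this lands near 147 / 74 / 37 GB, matching the ~140 / ~70 / ~35 GB figures in the table once rounding and per-model packing differences are allowed for.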

Kubernetes Deployment Patterns

Basic Deployment Configuration

When deploying vLLM on Kubernetes, you need to properly configure GPU resource requests, health checks, and graceful shutdown.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: llm-serving
  labels:
    app: vllm
    model: llama-70b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
      model: llama-70b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-70b
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8000'
        prometheus.io/path: '/metrics'
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.2
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-70B-Instruct'
            - '--tensor-parallel-size'
            - '4'
            - '--max-model-len'
            - '8192'
            - '--gpu-memory-utilization'
            - '0.90'
            - '--enable-prefix-caching'
            - '--max-num-seqs'
            - '256'
            - '--served-model-name'
            - 'llama-70b'
            - '--disable-log-requests'
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_ATTENTION_BACKEND
              value: 'FLASHINFER'
          resources:
            requests:
              cpu: '8'
              memory: '32Gi'
              nvidia.com/gpu: '4'
            limits:
              cpu: '16'
              memory: '64Gi'
              nvidia.com/gpu: '4'
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
          startupProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 60 # Allow up to 10 minutes for model loading
          readinessProbe:
            httpGet:
              path: /health
              port: http
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: http
            periodSeconds: 15
            failureThreshold: 5
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
      nodeSelector:
        nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-70b
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model: llama-70b
  ports:
    - port: 8000
      targetPort: http
      name: http
  type: ClusterIP

KEDA-Based Autoscaling

LLM serving requires autoscaling due to significant traffic fluctuations. Using KEDA (Kubernetes Event-Driven Autoscaling), pods can be automatically scaled based on Prometheus metrics.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-llama-70b
  minReplicaCount: 1
  maxReplicaCount: 8
  pollingInterval: 15
  cooldownPeriod: 300 # Wait 5 minutes before scale down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_pending_requests
        query: |
          avg(vllm:num_requests_waiting{model_name="llama-70b"})
        threshold: '10'
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_gpu_cache_usage
        query: |
          avg(vllm:gpu_cache_usage_perc{model_name="llama-70b"})
        threshold: '85'

Note: GPU pod scale-up is much slower than regular pods. Model loading takes 2-10 minutes, so set cooldownPeriod sufficiently long and maintain minReplicaCount at 1 or above. Be sure to account for the time gap between when scale-up is needed and when traffic can actually be processed.

Inference Engine Comparison

vLLM vs SGLang vs TensorRT-LLM vs LMDeploy

Here is a comparison of the major LLM inference engines as of 2026 from multiple perspectives. Benchmark numbers are based on Llama 3.1 8B, A100 80GB, with 1024 input tokens and 512 output tokens.

| Category                | vLLM (0.7.x) | SGLang (0.4.x) | TensorRT-LLM | LMDeploy     |
| ----------------------- | ------------ | -------------- | ------------ | ------------ |
| Throughput (req/s)      | ~42          | ~48            | ~55          | ~40          |
| TTFT (ms)               | ~85          | ~72            | ~60          | ~90          |
| ITL (ms/token)          | ~12          | ~11            | ~9           | ~13          |
| Model support range     | Very wide    | Wide           | Medium       | Wide         |
| Installation difficulty | Easy         | Easy           | Difficult    | Medium       |
| OpenAI API compatible   | Full support | Full support   | Partial      | Full support |
| Multimodal support      | Supported    | Supported      | Limited      | Supported    |
| FP8 quantization        | Supported    | Supported      | Native       | Supported    |
| Prefix caching          | Supported    | RadixAttention | Supported    | Supported    |
| LoRA serving            | Supported    | Supported      | Limited      | Supported    |
| Speculative decoding    | Supported    | Supported      | Supported    | Limited      |
| Community activity      | Very high    | High           | Medium       | Medium       |
| Production maturity     | High         | High           | Very high    | Medium       |
| License                 | Apache 2.0   | Apache 2.0     | Apache 2.0   | Apache 2.0   |

Engine Selection Guide

Choose vLLM when: You need the widest model compatibility, want to prototype quickly and carry through to production, or when community ecosystem and plugins matter. The OpenAI-compatible API is smooth, and support for new model architectures is the fastest.

Choose SGLang when: You need advanced prompt caching based on RadixAttention, have many structured outputs, or when TTFT optimization is critical for interactive workloads. SGLang's RadixTree-based caching is more efficient than vLLM's Prefix Caching for tree-structured multi-turn conversations.

Choose TensorRT-LLM when: Maximum throughput and minimum latency are absolutely critical, and you use NVIDIA GPUs exclusively and can tolerate the complexity of the engine build process. Since TensorRT engines must be pre-built per model, the deployment pipeline becomes more complex.

Choose LMDeploy when: You need high performance from the TurboMind engine along with deep integration with the PyTorch ecosystem, especially when serving InternLM family models.

Throughput Comparison by Batch Size

| Concurrent Requests | vLLM (tok/s) | SGLang (tok/s) | TensorRT-LLM (tok/s) |
| ------------------- | ------------ | -------------- | -------------------- |
| 1                   | 85           | 92             | 105                  |
| 8                   | 620          | 680            | 750                  |
| 32                  | 2,100        | 2,350          | 2,600                |
| 64                  | 3,800        | 4,200          | 4,500                |
| 128                 | 5,200        | 5,800          | 6,100                |

These numbers are based on Llama 3.1 8B, A100 80GB, FP16 and can vary significantly depending on workload characteristics. Always run your own benchmarks in actual production environments.

Operational Considerations and Monitoring

Prometheus Metrics Collection

vLLM exposes Prometheus-format metrics through the /metrics endpoint. Key monitoring indicators are as follows.

| Metric                                     | Description                  | Alert Threshold              |
| ------------------------------------------ | ---------------------------- | ---------------------------- |
| vllm:num_requests_running                  | Currently running requests   | Over 90% of max_num_seqs     |
| vllm:num_requests_waiting                  | Requests in queue            | Consistently above 50        |
| vllm:gpu_cache_usage_perc                  | GPU KV Cache utilization     | Over 95%                     |
| vllm:cpu_cache_usage_perc                  | CPU swap cache utilization   | Over 50% (frequent swapping) |
| vllm:avg_prompt_throughput_toks_per_s      | Prompt processing tokens/sec | Under 50% of baseline        |
| vllm:avg_generation_throughput_toks_per_s  | Generation tokens/sec        | Under 50% of baseline        |
| vllm:e2e_request_latency_seconds           | End-to-end request latency   | p99 exceeds SLA              |
| vllm:time_to_first_token_seconds           | Time to first token          | p99 exceeds 2 seconds        |

GPU Memory Monitoring

GPU-level memory monitoring is performed through the DCGM (Data Center GPU Manager) Exporter. You need to monitor both vLLM's own metrics and GPU hardware metrics to get the full picture.

# Prometheus query examples: for Grafana dashboards

# 1. GPU KV Cache utilization (vLLM internal)
# vllm:gpu_cache_usage_perc{model_name="llama-70b"}

# 2. Actual GPU memory usage (DCGM)
# DCGM_FI_DEV_FB_USED{gpu="0"} / DCGM_FI_DEV_FB_TOTAL{gpu="0"} * 100

# 3. Waiting request count trend
# rate(vllm:num_requests_waiting{model_name="llama-70b"}[5m])

# 4. p99 TTFT monitoring
# histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))

# 5. Tokens generated per second
# rate(vllm:avg_generation_throughput_toks_per_s[1m])

# AlertManager rule example
# ALERT VLLMHighCacheUsage
#   IF vllm:gpu_cache_usage_perc > 95
#   FOR 5m
#   LABELS { severity = "warning" }
#   ANNOTATIONS {
#     summary = "vLLM KV Cache utilization exceeds 95%",
#     description = "KV Cache utilization for model {{ $labels.model_name }}
#       is {{ $value }}%. Consider scaling up."
#   }

Key Operational Guidelines

  1. Log Level Management: Always use --disable-log-requests in production. Logging prompts for every request creates disk I/O bottlenecks and risks personal data exposure.

  2. Graceful Shutdown: Allow sufficient time to complete in-progress requests when receiving SIGTERM. Set Kubernetes terminationGracePeriodSeconds to 120 seconds or more.

  3. Health Check Separation: Separate startupProbe and readinessProbe. Since model loading takes several minutes, set startupProbe's failureThreshold sufficiently high. readinessProbe checks whether inference is actually possible.

  4. Shared Memory Configuration: In Tensor Parallel mode, NCCL communicates between GPUs via /dev/shm. You must set --shm-size=16g in Docker and emptyDir with medium: Memory in Kubernetes.

  5. Model Cache Persistence: Use PVC (PersistentVolumeClaim) to persist the HuggingFace model cache. Re-downloading models of tens of GBs every time a pod restarts is inefficient in both cost and time.

Failure Cases and Recovery Procedures

Case 1: CUDA Out of Memory (OOM)

Symptom: After the server starts, after some time torch.cuda.OutOfMemoryError occurs and all requests fail.

Cause: Occurs when --gpu-memory-utilization is set too high or --max-model-len is excessively large for the actual workload. KV Cache allocation exceeds physical GPU memory.

Recovery Procedure:

  1. Lower --gpu-memory-utilization to 0.85.
  2. Limit --max-model-len to the actual maximum length needed.
  3. Reduce --max-num-seqs to limit concurrent requests.
  4. Apply quantization (AWQ/FP8) to reduce model memory.
  5. Increase the Tensor Parallel count to distribute load per GPU.

Prevention: Always perform stress testing at 120% of expected maximum load before production deployment.

# Memory profiling for OOM debugging
VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --max-num-seqs 64 \
  --enforce-eager  # Disable CUDA Graph to reduce memory usage
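A load test at 120% of expected peak can be as simple as an asyncio loop that paces request submission. The sketch below stubs out the HTTP call; `fake_request` is a placeholder to replace with a real client call (e.g. via the OpenAI SDK) against the server:

```python
import asyncio
import random

async def fake_request() -> float:
    """Placeholder for an HTTP call to the serving endpoint; swap in a real
    client for an actual stress test."""
    latency = random.uniform(0.01, 0.05)  # simulated response time
    await asyncio.sleep(latency)
    return latency

async def load_test(expected_peak_rps: int, duration_s: int) -> list[float]:
    """Pace request submission at 120% of the expected peak rate, then
    collect per-request latencies."""
    rate = round(expected_peak_rps * 1.2)
    tasks = []
    for _ in range(rate * duration_s):
        tasks.append(asyncio.create_task(fake_request()))
        await asyncio.sleep(1 / rate)  # submission pacing
    return await asyncio.gather(*tasks)

latencies = sorted(asyncio.run(load_test(expected_peak_rps=50, duration_s=1)))
p99 = latencies[int(len(latencies) * 0.99)]
print(f"{len(latencies)} requests, p99 latency {p99 * 1000:.1f} ms")
```

In a real run, record p50/p95/p99 TTFT and end-to-end latency while watching vllm:gpu_cache_usage_perc to see how close the test pushes the KV Cache to its limit.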

Case 2: Model Loading Failure

Symptom: At server startup, ValueError: The model's max seq len is larger than the maximum number of tokens that can be stored in KV cache or Not enough memory error occurs.

Cause: The KV Cache required for the specified --max-model-len exceeds available GPU memory. After loading model weights, the remaining memory is insufficient to allocate even minimal KV Cache.

Recovery Procedure:

  1. Reduce --max-model-len (e.g., from 8192 to 4096).
  2. Increase --tensor-parallel-size.
  3. Use a quantized model.
  4. Add the --enforce-eager flag to reclaim memory occupied by CUDA Graphs.
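The largest feasible --max-model-len can be estimated from the memory left after weights. The per-token KV size for Llama 3.1 8B below is computed from its published config (32 layers, 8 KV heads, head_dim 128, FP16); the weight size and GPU capacity are rough assumptions:

```python
def max_feasible_model_len(
    gpu_total_gb: float,
    weight_gb: float,
    utilization: float,
    per_token_kv_bytes: int,
) -> int:
    """Largest --max-model-len for which one sequence's KV Cache still fits
    in the memory left after weights (ignores activation overhead)."""
    kv_budget_bytes = (gpu_total_gb * utilization - weight_gb) * 1024**3
    return int(kv_budget_bytes // per_token_kv_bytes)

# Llama 3.1 8B on a 24GB GPU: ~16GB FP16 weights (assumption), and
# 131,072 KV bytes per token (2 x 32 layers x 8 KV heads x head_dim 128 x 2 bytes)
print(max_feasible_model_len(24, 16, 0.90, 131_072))
```

If the requested --max-model-len exceeds this bound, vLLM raises exactly the startup error above; lowering the length, raising TP, or quantizing all grow the budget.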

Case 3: NCCL Timeout (Tensor Parallel)

Symptom: In multi-GPU environments, RuntimeError: NCCL communicator was aborted or Watchdog caught collective operation timeout occurs.

Cause: Insufficient GPU-to-GPU communication (NVLink/PCIe) bandwidth, insufficient /dev/shm size, or mixed use of heterogeneous GPUs.

Recovery Procedure:

  1. Set --shm-size to 16GB or more.
  2. Configure nodeSelector to use only GPUs with identical specifications.
  3. Check detailed logs with the NCCL_DEBUG=INFO environment variable.
  4. If there is no NVLink connection between GPUs, reduce the Tensor Parallel count considering PCIe bandwidth limitations.

Case 4: Response Quality Degradation

Symptom: After serving a quantized model, response quality noticeably drops. Repetitive sentences, context-free responses, and decreased code generation accuracy are observed.

Cause: Excessive quantization (INT4) has damaged the model's critical weights. Quality degradation with INT4 quantization is particularly pronounced for coding and math tasks.

Recovery Procedure:

  1. Switch to FP8 quantization (minimal quality loss).
  2. Try GPTQ instead of AWQ or vice versa.
  3. Consider FP16 serving by increasing the Tensor Parallel count without quantization.
  4. Perform benchmarks of the quantized model by major task type to establish quality baselines.

Case 5: Kubernetes Pod Infinite Restart

Symptom: vLLM pod falls into CrashLoopBackOff state. The livenessProbe fails before model loading completes, causing kubelet to force-terminate the container.

Cause: startupProbe was not configured, or failureThreshold is too low to cover the model loading time.

Recovery Procedure:

  1. Add startupProbe and set failureThreshold to 60 or higher (at 10-second intervals, this allows 10 minutes).
  2. Set livenessProbe's initialDelaySeconds sufficiently long.
  3. Pre-download the model to PVC to shorten loading time.

Checklist

Items that must be verified before deploying vLLM to production.

Infrastructure Preparation

  • Verify GPU driver and CUDA version compatibility with vLLM requirements
  • Confirm via calculation that GPU memory is sufficient for model weights + KV Cache
  • Check NVLink/NVSwitch connection status (when using Tensor Parallel)
  • Set /dev/shm size to 16GB or more (when using Tensor Parallel)
  • Verify model files are pre-downloaded to local PVC

Serving Configuration

  • Adjust --gpu-memory-utilization value for your workload
  • Set --max-model-len to the actual maximum sequence length needed
  • Set --max-num-seqs based on load test results
  • Decide on --enable-prefix-caching activation (essential when sharing system prompts)
  • Enable --disable-log-requests in production
  • Set --dtype auto or explicit dtype
  • Choose attention backend (FLASHINFER recommended)

Kubernetes Deployment

  • Configure startupProbe and set failureThreshold sufficiently high
  • Separate readinessProbe and livenessProbe
  • Set terminationGracePeriodSeconds to 120 seconds or more
  • Configure GPU nodeSelector or nodeAffinity
  • Persist model cache with PVC
  • Mount /dev/shm with emptyDir medium: Memory
  • Configure KEDA or HPA autoscaling
  • Set minReplicaCount to 1 or higher (prevent cold starts)

Monitoring and Alerting

  • Configure Prometheus metrics collection path (/metrics)
  • Set up Grafana dashboards (KV Cache utilization, TTFT, throughput)
  • Configure OOM alerting
  • Configure waiting request count threshold alerts
  • GPU temperature and power monitoring (DCGM)
  • Build response quality monitoring pipeline (sampling-based)

Performance Validation

  • Stress test completed at 120% of expected maximum load
  • TTFT and ITL p50/p95/p99 measured and SLA compliance confirmed
  • Acceptance rate confirmed when applying Speculative Decoding (60% or higher recommended)
  • Quality benchmarks completed by major task type when applying quantization
  • Long-duration (24+ hours) stability test completed

References

  1. Efficient Memory Management for Large Language Model Serving with PagedAttention - The original PagedAttention paper, providing a detailed explanation of virtual memory paging for KV Cache.
  2. vLLM Official Documentation - The official reference documentation for vLLM, from installation to advanced configuration.
  3. vLLM GitHub Repository - Source code, issue tracker, and release notes.
  4. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving - A paper on optimizing serving efficiency by separating the prefill and decoding stages.
  5. vLLM vs SGLang vs LMDeploy: Fastest LLM Inference Engine in 2026 - Performance benchmark comparison analysis of major inference engines as of 2026.
  6. SGLang: Efficient Execution of Structured Language Model Programs - A paper covering SGLang's RadixAttention and structured output optimization techniques.
  7. NVIDIA TensorRT-LLM Documentation - Official TensorRT-LLM documentation, including engine build and optimization guides.