Complete Guide to vLLM Production Serving Optimization: From PagedAttention to Kubernetes Deployment


Introduction

Serving LLMs in production environments is a complex engineering challenge that goes far beyond simply loading a model and exposing an API. It is a multidimensional problem that requires consideration of GPU memory management, concurrent request handling, latency optimization, and cost efficiency.

vLLM is a high-performance LLM serving engine developed by UC Berkeley's Sky Computing Lab, built around the innovative PagedAttention algorithm. Since its initial release in 2023, it has rapidly established itself as an industry-standard serving framework, and is now used in production by numerous companies and research institutions.

This article comprehensively covers vLLM's architecture and core optimization techniques, detailed configuration guide, comparison with competing frameworks, Kubernetes deployment patterns, monitoring strategies, and common issues with their solutions.

vLLM Core Architecture

What is PagedAttention

vLLM's most innovative contribution is PagedAttention. Inspired by the operating system's virtual memory management, it divides the KV Cache into fixed-size blocks (pages) and stores them in non-contiguous memory spaces.

Problems with Traditional Approaches:

Traditional LLM Serving (HuggingFace Transformers, etc.):
├─ Pre-allocates contiguous memory for max sequence length per request
├─ Max length memory is occupied even when actual sequence is short → internal fragmentation
├─ No memory sharing between requests → external fragmentation
└─ Result: 60-80% of GPU memory is wasted

PagedAttention's Solution:

vLLM PagedAttention:
├─ Divides KV Cache into fixed-size blocks (e.g., 16 tokens)
├─ Blocks can be stored in non-contiguous memory
├─ Dynamically allocates/frees blocks as needed
├─ Manages logical→physical mapping via block tables
└─ Result: Reduces GPU memory waste to under 5%
# PagedAttention Block Table Concept
# Logical Block → Physical Block Mapping

# Request 1: "The cat sat on the mat"
# Logical:  [Block 0] [Block 1]
# Physical: [GPU Block 3] [GPU Block 7]

# Request 2: "Hello world"
# Logical:  [Block 0]
# Physical: [GPU Block 1]

# Copy-on-Write when system prompts are identical:
# Request 3 and Request 4 use the same system prompt
# → Share physical blocks for system prompt (no additional memory consumption)
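The mapping above can be sketched as a toy block allocator. This is a simplified illustration only, not vLLM's actual implementation; the class and function names are invented for the example:

```python
# Toy sketch of PagedAttention-style block management (illustration only).
BLOCK_SIZE = 16  # tokens per block

class BlockAllocator:
    """Hands out free physical GPU block IDs and tracks ref counts for sharing."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # Copy-on-write sharing: same physical block, higher ref count
        self.ref_count[block] += 1
        return block

    def free_block(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)

def blocks_needed(num_tokens):
    # Ceiling division: a 24-token sequence needs 2 blocks of 16
    return -(-num_tokens // BLOCK_SIZE)

allocator = BlockAllocator(num_blocks=8)

# Request 1: 24 tokens -> 2 logical blocks mapped to arbitrary physical blocks
block_table_req1 = [allocator.allocate() for _ in range(blocks_needed(24))]

# Request 2 shares request 1's first block (e.g., an identical prefix)
block_table_req2 = [allocator.share(block_table_req1[0])]

print(block_table_req1, block_table_req2)
```

Freeing a shared block only decrements its reference count; the physical memory is reclaimed once no sequence points at it.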

Continuous Batching

In traditional static batching, all requests wait until the longest sequence in the batch completes. vLLM's Continuous Batching eliminates this inefficiency.

Static Batching:
Time →
Req A: [████████████████████] ← generation complete
Req B: [████████]             ← finished early but still waits for A
Req C: [████████████]         ← longer than B but still waits for A
Batch start                   ↑ Batch end (when A completes)

Continuous Batching:
Time →
Req A: [████████████████████]
Req B: [████████][Req D starts immediately: ████████████]
Req C: [████████████][Req E starts immediately: ██████]
A finished request's slot is immediately refilled with a new one

Effect: Throughput improves 2-5x on the same GPU.
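A minimal step-count simulation makes the difference concrete. This is a toy model (one token per request per step, instant slot refill), not vLLM's scheduler:

```python
# Contrast static vs. continuous batching by counting decoding steps
# needed to finish a set of requests with the given output lengths.

def static_batching(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching(lengths, batch_size):
    """A finished request's slot is refilled immediately from the queue."""
    pending = list(lengths)
    running = [pending.pop(0) for _ in range(min(batch_size, len(pending)))]
    steps = 0
    while running:
        steps += 1
        # every running request emits one token; finished ones free their slot
        running = [r - 1 for r in running if r > 1]
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
    return steps

lengths = [20, 8, 12, 12, 6, 20, 8, 12]  # output tokens per request
print(static_batching(lengths, 4), continuous_batching(lengths, 4))
```

With mixed output lengths, static batching pays the longest request's cost for every batch, while continuous batching keeps all slots busy.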

Overall Architecture Overview

┌─────────────────────────────────────────────┐
│                 vLLM Engine                 │
├─────────────────────────────────────────────┤
│ API Server (OpenAI Compatible)              │
│  ├─ /v1/completions                         │
│  ├─ /v1/chat/completions                    │
│  └─ /v1/embeddings                          │
├─────────────────────────────────────────────┤
│ Scheduler                                   │
│  ├─ Continuous Batching                     │
│  ├─ Priority Queue                          │
│  └─ Preemption (Swap/Recompute)             │
├─────────────────────────────────────────────┤
│ KV Cache Manager (PagedAttention)           │
│  ├─ Block Allocator                         │
│  ├─ Block Table                             │
│  └─ Copy-on-Write                           │
├─────────────────────────────────────────────┤
│ Model Executor                              │
│  ├─ Tensor Parallelism (Ray/NCCL)           │
│  ├─ Quantization (GPTQ/AWQ/FP8)             │
│  └─ Speculative Decoding                    │
└─────────────────────────────────────────────┘

Core Optimization Techniques in Detail

Tensor Parallelism

Distributes a single model across multiple GPUs for execution. vLLM supports Megatron-LM style tensor parallelism.

# Single GPU (default)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1

# 4 GPU tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# 8 GPU (A100 80GB x 8 node)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1

Tensor Parallelism Sizing Guide:

| Model Size | No Quantization (FP16) | INT8 Quantization | INT4 Quantization |
| --- | --- | --- | --- |
| 7B | 1x A100 80GB | 1x A100 40GB | 1x RTX 4090 |
| 13B | 1x A100 80GB | 1x A100 80GB | 1x A100 40GB |
| 34B | 2x A100 80GB | 1x A100 80GB | 1x A100 80GB |
| 70B | 4x A100 80GB | 2x A100 80GB | 1-2x A100 80GB |
| 405B | 16x A100 80GB (2 nodes) | 8x A100 80GB | 4x A100 80GB |
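The sizing table follows directly from bytes-per-parameter arithmetic. This estimate covers model weights only; KV cache, activations, and CUDA overhead come on top, which is why real deployments need headroom:

```python
# Back-of-the-envelope GPU memory needed for model weights alone.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params_billion, dtype):
    """Weight memory in GB: parameter count times bytes per parameter."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for params in (7, 70, 405):
    print(f"{params}B:", {d: weight_memory_gb(params, d) for d in BYTES_PER_PARAM})
```

For example, a 70B model needs roughly 140 GB in FP16 (hence 2-4x A100 80GB), but only ~35 GB in INT4.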

Speculative Decoding

A small "draft model" quickly generates speculative tokens, and the large "target model" verifies them in a single forward pass.

# Enable Speculative Decoding
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1

How It Works:

Traditional autoregressive generation (1 token/step):
Step 1 → "The"
Step 2 → "weather"
Step 3 → "is"
Step 4 → "sunny"
Step 5 → "today"
= 5 forward passes of the large model

Speculative Decoding:
Draft model (fast, small): "The weather is sunny today" (5 tokens speculated)
Target model (1 forward pass): "The weather is sunny today" → all accepted!
= 5 cheap draft-model passes + 1 forward pass of the large model
Potentially 2.5-3x speed improvement

Suitable Scenarios:

  • GPU compute-bound environments (when batch size is small)
  • When draft and target model tokenizers are compatible
  • When acceptance rate is high (general text generation)
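The dependence on acceptance rate can be quantified with the standard simplified model from the speculative decoding literature: assuming each draft token is accepted independently with probability p, one target-model pass over k draft tokens yields (1 - p^(k+1)) / (1 - p) tokens on average (accepted draft tokens plus one "bonus" token). This is an idealized analysis, not vLLM's exact acceptance logic:

```python
# Expected tokens produced per target-model forward pass when verifying
# k speculative tokens with per-token acceptance probability p.

def expected_tokens_per_pass(p, k):
    if p == 1.0:
        return k + 1  # everything accepted, plus the bonus token
    # Geometric series: 1 + p + p^2 + ... + p^k
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.5, 0.8, 0.9):
    print(p, round(expected_tokens_per_pass(p, 5), 2))
```

At p = 0.8 with 5 draft tokens, each expensive pass yields about 3.7 tokens; at p = 0.5 it drops below 2, which is why low-acceptance domains see little benefit.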

Prefix Caching

Reuses KV Cache across requests that share the same system prompt or prefix.

# Enable Prefix Caching
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching

Effect:

Scenario: All requests use the same 2000-token system prompt

Prefix Caching disabled:
Req 1: [Process 2000-token system prompt] + [Process user input] → TTFT: 500ms
Req 2: [Reprocess 2000-token system prompt] + [Process user input] → TTFT: 500ms
Req 3: [Reprocess 2000-token system prompt] + [Process user input] → TTFT: 500ms

Prefix Caching enabled:
Req 1: [Process 2000-token system prompt] + [Process user input] → TTFT: 500ms
Req 2: [Cache hit!] + [Process user input] → TTFT: 50ms (10x improvement!)
Req 3: [Cache hit!] + [Process user input] → TTFT: 50ms
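One subtlety worth knowing: cached KV state is reused in whole PagedAttention blocks, so only the shared prefix up to the last full block boundary is a cache hit. A rough sketch (assuming the default 16-token block size; not vLLM's internal logic):

```python
# How much of a shared prompt prefix can be served from the prefix cache.

BLOCK_SIZE = 16  # default PagedAttention block size

def cached_prompt_tokens(shared_prefix_tokens):
    """Tokens servable from cache: the shared prefix rounded down to a block boundary."""
    return (shared_prefix_tokens // BLOCK_SIZE) * BLOCK_SIZE

# A 2000-token system prompt aligns exactly (125 blocks of 16),
# so repeat requests prefill only the user input.
print(cached_prompt_tokens(2000))  # 2000
print(cached_prompt_tokens(2005))  # 2000 (the trailing 5 tokens are recomputed)
```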

Chunked Prefill

Divides the prefill stage of long prompts into chunks and interleaves them with decoding. This prevents the Time Between Tokens (TBT) for existing decoding requests from spiking when long prompts arrive.

# Enable Chunked Prefill
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048

Detailed Configuration Guide

Core Configuration Parameters

# GPU memory:
#   --gpu-memory-utilization  fraction of GPU memory vLLM may use (0.0-1.0)
#   --max-model-len           maximum context length
# Batching and concurrency:
#   --max-num-seqs            maximum concurrent sequences
#   --max-num-batched-tokens  maximum tokens per batch
# Parallelism:
#   --tensor-parallel-size    number of tensor-parallel GPUs
#   --pipeline-parallel-size  number of pipeline stages
# Quantization:
#   --quantization            method (awq, gptq, fp8, etc.)
#   --dtype                   data type (auto, float16, bfloat16)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 32768 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 1 \
  --quantization awq \
  --dtype auto \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "your-secret-key"

Configuration Parameter Details

| Parameter | Default | Description | Recommended Range |
| --- | --- | --- | --- |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use (weights + KV Cache) | 0.85-0.95 |
| --max-model-len | Model config | Maximum processable sequence length | Adjust per task |
| --max-num-seqs | 256 | Maximum concurrent sequences | 64-512 |
| --max-num-batched-tokens | None | Maximum tokens per iteration | 2048-65536 |
| --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism | 1, 2, 4, 8 |
| --block-size | 16 | PagedAttention block size (tokens) | 8, 16, 32 |
| --swap-space | 4 | CPU swap space (GB) | 4-16 |
| --enforce-eager | False | Use eager mode instead of CUDA graphs | True for debugging |
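These parameters interact through KV cache arithmetic: the memory left after weights, divided by the per-token KV footprint, bounds how many tokens (and thus sequences) can be resident at once. A sketch with Llama-3.1-8B-style GQA values (32 layers, 8 KV heads, head_dim 128, fp16) — check the model config for the exact numbers of your model:

```python
# Per-token KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes.

def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(kv_cache_gb, **kwargs):
    """How many tokens of KV state fit in the given cache budget."""
    return int(kv_cache_gb * 1024**3 / kv_bytes_per_token(**kwargs))

per_token = kv_bytes_per_token()          # 131072 bytes = 128 KiB per token
print(per_token, max_cached_tokens(40))   # tokens resident in ~40 GB of KV cache
```

With ~40 GB left for KV cache, roughly 327k tokens fit; at a 4096-token context that supports about 80 fully-loaded concurrent sequences, which is the kind of estimate behind the max-num-seqs recommendation.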

Configuration Examples by Scenario

High Throughput Optimization:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill

Low Latency Optimization:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 32 \
  --max-model-len 8192 \
  --num-scheduler-steps 1

Long Context Optimization:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 16 \
  --max-model-len 131072 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096

Performance Comparison: vLLM vs Competing Frameworks

Major LLM Serving Framework Comparison

| Feature | vLLM | TGI (HuggingFace) | TensorRT-LLM (NVIDIA) | Triton + TensorRT-LLM |
| --- | --- | --- | --- | --- |
| Developer | UC Berkeley | HuggingFace | NVIDIA | NVIDIA |
| Core Technology | PagedAttention | Continuous Batching | FasterTransformer-based | Model serving framework |
| Installation Difficulty | Very Easy | Easy | Difficult | Very Difficult |
| Model Compatibility | Very Broad | Broad | Limited (conversion required) | Limited |
| API Compatibility | OpenAI Compatible | Custom + OpenAI Compatible | Custom API | gRPC + HTTP |
| Quantization Support | GPTQ, AWQ, FP8, GGUF | GPTQ, AWQ, EETQ | FP8, INT8, INT4 | FP8, INT8, INT4 |
| Multi-GPU | Tensor/Pipeline | Tensor | Tensor/Pipeline | Tensor/Pipeline |
| Speculative Decoding | Supported | Supported | Supported | Supported |
| Production Stability | High | High | Very High | Very High |
| Community | Very Active | Active | NVIDIA-led | NVIDIA-led |

Throughput Benchmarks (LLaMA-3.1-8B, A100 80GB)

| Metric | vLLM | TGI | TensorRT-LLM |
| --- | --- | --- | --- |
| Throughput (tokens/s), batch=1 | ~120 | ~100 | ~150 |
| Throughput (tokens/s), batch=32 | ~2,800 | ~2,200 | ~3,500 |
| Throughput (tokens/s), batch=128 | ~5,500 | ~4,000 | ~7,000 |
| TTFT (ms), 512-token input | ~35 | ~40 | ~25 |
| TBT (ms), batch=1 | ~8 | ~10 | ~6 |
| Memory Efficiency | 95%+ | ~80% | ~90% |

Note: Benchmark results can vary significantly depending on hardware, model, and configuration. The figures above are approximate comparisons for reference.

Framework Selection Guide

Want to get started quickly?
vLLM (pip install vllm → serve immediately)

Need maximum performance?
TensorRT-LLM (requires significant effort for model conversion and configuration)

Already using the HuggingFace ecosystem?
TGI (natural integration with HuggingFace Hub)

Need enterprise deployment?
Triton + TensorRT-LLM (official NVIDIA support, multi-model serving)

Deployment Patterns

Single GPU Deployment

The simplest form, suitable for serving small models.

# Single GPU deployment with Docker
docker run --runtime nvidia --gpus '"device=0"' \
  -v /path/to/model:/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /model \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192

Multi-GPU Deployment

# 4 GPU tensor parallel deployment
docker run --runtime nvidia --gpus '"device=0,1,2,3"' \
  -v /path/to/model:/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /model \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9

Kubernetes + Ray Deployment

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-serving
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-serving
  template:
    metadata:
      labels:
        app: vllm-serving
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-8B-Instruct'
            - '--gpu-memory-utilization'
            - '0.9'
            - '--max-model-len'
            - '8192'
            - '--max-num-seqs'
            - '256'
            - '--enable-prefix-caching'
            - '--port'
            - '8000'
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '32Gi'
              cpu: '8'
            requests:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: 'production'
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-serving
spec:
  selector:
    app: vllm-serving
  ports:
    - port: 80
      targetPort: 8000
      name: http
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-serving
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '80'

Multi-Model Serving Pattern

# When serving multiple models on the same cluster
# Deploy separate vLLM instances per model + router in front

# router.py (simple example)
from fastapi import FastAPI, HTTPException, Request
import httpx

app = FastAPI()

MODEL_ENDPOINTS = {
    "llama-8b": "http://vllm-8b:8000",
    "llama-70b": "http://vllm-70b:8000",
    "codellama-34b": "http://vllm-code:8000",
}

@app.post("/v1/chat/completions")
async def route_request(request: Request):
    body = await request.json()
    model = body.get("model", "llama-8b")
    endpoint = MODEL_ENDPOINTS.get(model)
    if endpoint is None:
        raise HTTPException(status_code=404, detail=f"Unknown model: {model}")

    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{endpoint}/v1/chat/completions",
            json=body,
            timeout=120.0,
        )
        return response.json()

Monitoring Strategy

Core Metric Definitions

Key metrics for measuring LLM serving performance:

| Metric | Description | Target Range |
| --- | --- | --- |
| TTFT (Time to First Token) | Time until the first token is generated | < 200ms (interactive) |
| TBT (Time Between Tokens) | Gap between successive tokens (= Inter-Token Latency) | < 30ms |
| E2E Latency | Total request processing time | Task-dependent |
| Throughput | Tokens generated per second (tokens/s) | Model/GPU dependent |
| GPU Utilization | GPU compute unit usage | 70-95% |
| KV Cache Usage | KV Cache memory utilization | < 95% |
| Queue Depth | Number of waiting requests | < max_num_seqs |
| Request Success Rate | Fraction of requests completing successfully | > 99.5% |
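TTFT and TBT can also be measured client-side from per-token arrival timestamps, as a cross-check of server metrics. A sketch with illustrative timestamps (a real client would record them while consuming a streamed response):

```python
# Derive TTFT and mean TBT from token arrival times (seconds).

def ttft_ms(request_start, token_times):
    """Time to first token, in milliseconds."""
    return (token_times[0] - request_start) * 1000

def tbt_ms(token_times):
    """Mean gap between consecutive tokens, in milliseconds."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps) * 1000

start = 0.0
tokens = [0.180, 0.205, 0.231, 0.256, 0.282]  # illustrative arrival times
print(round(ttft_ms(start, tokens)), round(tbt_ms(tokens), 1))
```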

Prometheus + Grafana Monitoring

vLLM natively exposes Prometheus metrics via the /metrics endpoint.

# Key Prometheus metrics
# vllm:num_requests_running - Currently processing requests
# vllm:num_requests_waiting - Waiting requests
# vllm:gpu_cache_usage_perc - KV Cache GPU utilization
# vllm:cpu_cache_usage_perc - KV Cache CPU swap utilization
# vllm:avg_prompt_throughput_toks_per_s - Prompt processing throughput
# vllm:avg_generation_throughput_toks_per_s - Generation throughput
# vllm:e2e_request_latency_seconds - E2E request latency histogram
# vllm:time_to_first_token_seconds - TTFT histogram
# vllm:time_per_output_token_seconds - TBT histogram

# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'vllm'
    scrape_interval: 15s
    static_configs:
      - targets: ['vllm-service:8000']
    metrics_path: '/metrics'

Alert Rules Example

# prometheus-alert-rules.yaml
groups:
  - name: vllm_alerts
    rules:
      - alert: HighKVCacheUsage
        expr: vllm:gpu_cache_usage_perc > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'KV Cache usage exceeds 95%'

      - alert: HighRequestLatency
        expr: histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'P99 request latency exceeds 30 seconds'

      - alert: HighQueueDepth
        expr: vllm:num_requests_waiting > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'Waiting requests exceed 100'

Common Issues and Solutions

OOM (Out of Memory) Errors

Symptoms: CUDA out of memory error occurs

Causes and Solutions:

# 1. Lower gpu-memory-utilization
--gpu-memory-utilization 0.80  # Reduce from default 0.9

# 2. Reduce max-model-len
--max-model-len 4096  # Limit unnecessarily long contexts

# 3. Reduce max-num-seqs
--max-num-seqs 64  # Decrease concurrent processing

# 4. Apply quantization
--quantization awq  # Or gptq, fp8

# 5. Tensor parallelism (add GPUs)
--tensor-parallel-size 2

Slow First Token (Slow TTFT)

Symptoms: TTFT is abnormally high (several seconds or more)

Causes and Solutions:

# 1. Long prompts → Enable Chunked Prefill
--enable-chunked-prefill
--max-num-batched-tokens 2048

# 2. Enable Prefix Caching (when repeated prompts exist)
--enable-prefix-caching

# 3. Check CUDA graph optimization
# Remove --enforce-eager (enable CUDA graphs)

# 4. Optimize model loading
--load-format auto  # Use safetensors when possible

Throughput Degradation

Symptoms: tokens/s is lower than expected

Checklist:

# 1. Check batch size
--max-num-seqs 256  # Too small leads to low GPU utilization

# 2. Check memory utilization
--gpu-memory-utilization 0.92  # An overly conservative value limits batch size

# 3. Try Speculative Decoding
--speculative-model <small-model> --num-speculative-tokens 5

# 4. Apply quantization
--quantization awq  # Or fp8 (A100/H100)

Request Timeouts

Symptoms: Some requests fail due to timeouts

# 1. Limit maximum tokens
# Set max_tokens appropriately in API requests

# 2. Allocate swap space
--swap-space 16  # Allow swapping to CPU memory

# 3. Set preemption strategy
--preemption-mode recompute  # Or swap

Advanced Optimization Tips

FP8 Quantization (H100/A100)

# Leverage FP8 quantization on NVIDIA H100
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92

# FP8 compared to INT8:
# - Higher accuracy (maintains FP range)
# - Maximum performance with H100 FP8 Tensor Cores
# - No separate calibration required

Multi-LoRA Serving

# Serve base model + multiple LoRA adapters simultaneously
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules \
    korean-chat=/path/to/korean-lora \
    code-assist=/path/to/code-lora \
    medical-qa=/path/to/medical-lora \
  --max-loras 3 \
  --max-lora-rank 64
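With LoRA serving enabled, the OpenAI-compatible API selects an adapter through the request's "model" field. A sketch of the request body, using the adapter names from the (hypothetical) --lora-modules example above; no network call is made here:

```python
import json

def chat_payload(adapter, user_message):
    """Build a chat completion body that targets a specific LoRA adapter."""
    return {
        "model": adapter,  # adapter name from --lora-modules, or the base model name
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }

payload = chat_payload("code-assist", "Write a binary search in Python.")
print(json.dumps(payload, indent=2))
```

Requests naming different adapters can be served in the same continuous batch against the shared base model, which is what makes multi-LoRA serving memory-efficient.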

Benchmarking Tools

# The benchmark scripts ship in the vLLM source repository (benchmarks/ directory);
# benchmark_throughput.py and benchmark_latency.py run the engine offline,
# so no API server process is needed.

# Throughput benchmark
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --num-prompts 1000

# Latency benchmark
python benchmarks/benchmark_latency.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --batch-size 1

FAQ

What is the difference between vLLM and Ollama?

Ollama is a convenient tool for local development and experimentation, while vLLM is a high-performance engine for production-level serving. Ollama is extremely simple to install and use but does not provide advanced optimizations like PagedAttention or Continuous Batching. If you need to handle production traffic, vLLM is recommended.

What GPUs does vLLM support?

vLLM runs on CUDA-supported NVIDIA GPUs. A100 and H100 are optimal, and RTX 3090/4090 are also usable. AMD GPUs (ROCm) are experimentally supported. Minimum requirements vary depending on model size and quantization level.

Can I set gpu-memory-utilization to 1.0?

Not recommended. GPUs need memory for CUDA kernels, cuBLAS workspaces, temporary tensors, and other allocations beyond the KV Cache. 0.9-0.95 is a safe upper bound for most cases, and setting it to 1.0 frequently causes OOM errors.

What is the difference between Tensor Parallelism and Pipeline Parallelism?

Tensor Parallelism (TP) splits a single layer across multiple GPUs for parallel processing. Since inter-layer communication is required, fast GPU interconnect (NVLink) is important. Pipeline Parallelism (PP) assigns groups of layers to each GPU. It has fewer communication requirements but suffers from inefficiency due to "pipeline bubbles." Generally, TP is used within a single node, and PP is used across nodes.

Is Speculative Decoding always faster?

No. Speculative Decoding is most effective when batch size is small and the GPU is compute-bound. With large batch sizes, the overhead of the draft model can offset the benefits. Additionally, if the draft model's acceptance rate is low (specialized domains, code generation, etc.), performance gains are minimal.

Should I use streaming responses in vLLM?

Recommended in most cases. Streaming significantly reduces perceived latency for users. Setting stream=true in the OpenAI-compatible API delivers tokens in real-time via SSE (Server-Sent Events). However, additional logic for streaming response completeness validation and error handling is required.
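The SSE chunks are small JSON deltas, and the client's job is mostly parsing them in order. A minimal sketch of that parsing, with sample chunks inlined (a real client would read these lines from the HTTP response body):

```python
import json

def extract_text(sse_line):
    """Parse one 'data: {...}' SSE line into its token text, or None at [DONE]."""
    payload = sse_line.removeprefix("data: ").strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content", "")

sample = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
text = "".join(t for line in sample if (t := extract_text(line)) is not None)
print(text)  # Hello world
```

Production clients additionally need to handle mid-stream errors and truncated streams, which is the validation logic the answer above refers to.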

Conclusion

vLLM has established itself as the de facto standard for LLM production serving, with continuously expanding features thanks to its rapid development pace and active community.

Key takeaways:

  1. PagedAttention is a game-changer for memory efficiency. It reduces KV Cache waste to under 5%, handling far more concurrent requests on the same hardware.
  2. Continuous Batching dramatically improves throughput. It delivers 2-5x improvement over static batching.
  3. Configuration optimization is key to performance. The right combination of gpu-memory-utilization, max-model-len, and max-num-seqs is critical.
  4. Prefix Caching and Speculative Decoding provide additional performance gains depending on the scenario.
  5. Monitoring is essential. TTFT, TBT, throughput, and KV Cache utilization must be continuously tracked.
  6. GPU resource management and autoscaling strategies must be carefully designed for Kubernetes deployments.

When choosing a framework, vLLM offers the best balance of "quick start + high performance + broad compatibility," and is recommended as the first choice unless there are specific reasons otherwise.