Split View: vLLM 프로덕션 서빙 최적화 완전 가이드: PagedAttention부터 Kubernetes 배포까지

vLLM 프로덕션 서빙 최적화 완전 가이드: PagedAttention부터 Kubernetes 배포까지

서론
vLLM 핵심 아키텍처
핵심 최적화 기법 상세
상세 설정 가이드
성능 비교: vLLM vs 경쟁 프레임워크
배포 패턴
모니터링 전략
자주 발생하는 문제와 해결 방법
고급 최적화 팁
FAQ
vLLM과 Ollama의 차이는 무엇인가요?
vLLM은 어떤 GPU에서 동작하나요?
gpu-memory-utilization을 1.0으로 설정하면 안 되나요?
Tensor Parallelism과 Pipeline Parallelism의 차이는 무엇인가요?
Speculative Decoding은 항상 빠른가요?
vLLM에서 스트리밍 응답을 사용하는 것이 좋은가요?
참고 자료
마무리

서론

LLM을 프로덕션 환경에서 서빙하는 것은 단순히 모델을 로드하고 API를 노출하는 것을 넘어선 복잡한 엔지니어링 과제입니다. GPU 메모리 관리, 동시 요청 처리, 지연 시간 최적화, 비용 효율성까지 고려해야 하는 다차원적 문제입니다.

vLLM은 UC Berkeley의 Sky Computing Lab에서 개발한 고성능 LLM 서빙 엔진으로, 혁신적인 PagedAttention 알고리즘을 핵심으로 합니다. 2023년 첫 공개 이후 빠르게 업계 표준 수준의 서빙 프레임워크로 자리잡았으며, 현재 수많은 기업과 연구 기관에서 프로덕션에 활용하고 있습니다.

이 글에서는 vLLM의 아키텍처와 핵심 최적화 기법, 상세 설정 가이드, 경쟁 프레임워크와의 비교, Kubernetes 배포 패턴, 모니터링 전략, 그리고 실무에서 자주 만나는 문제와 해결 방법까지 포괄적으로 다룹니다.

vLLM 핵심 아키텍처

PagedAttention이란

vLLM의 가장 혁신적인 기여는 PagedAttention입니다. 운영체제의 가상 메모리 관리에서 영감을 받아, KV Cache를 고정 크기의 블록(페이지)으로 나누어 비연속적인 메모리 공간에 저장합니다.

기존 방식의 문제점:

기존 LLM 서빙 (HuggingFace Transformers 등):
├─ 각 요청에 최대 시퀀스 길이만큼 연속 메모리를 사전 할당
├─ 실제 시퀀스가 짧아도 최대 길이 메모리 점유 → 내부 단편화
├─ 요청 간 메모리 공유 불가 → 외부 단편화
└─ 결과: GPU 메모리의 60~80%가 낭비됨

PagedAttention의 해결 방식:

vLLM PagedAttention:
├─ KV Cache를 고정 크기 블록(예: 16토큰)으로 분할
├─ 블록은 비연속적 메모리에 저장 가능
├─ 필요할 때 블록을 동적 할당/해제
├─ 블록 테이블로 논리→물리 매핑 관리
└─ 결과: GPU 메모리 낭비를 5% 미만으로 줄임

# PagedAttention의 블록 테이블 개념
# Logical Block → Physical Block 매핑

# Request 1: "The cat sat on the mat"
# Logical:  [Block 0] [Block 1]
# Physical: [GPU Block 3] [GPU Block 7]

# Request 2: "Hello world"
# Logical:  [Block 0]
# Physical: [GPU Block 1]

# 시스템 프롬프트가 동일한 경우 Copy-on-Write:
# Request 3과 Request 4가 같은 시스템 프롬프트 사용
# → 시스템 프롬프트의 물리 블록을 공유 (메모리 추가 소비 없음)

Continuous Batching (지속적 배칭)

기존의 정적 배칭(static batching)에서는 배치 내 가장 긴 시퀀스가 완료될 때까지 모든 요청이 대기합니다. vLLM의 Continuous Batching은 이 비효율을 제거합니다.

정적 배칭 (Static Batching):
Time →
Req A: [████████████████████]     ← 생성 완료
Req B: [████████]                 ← 일찍 끝났지만 A를 기다림
Req C: [████████████]             ← B보다 길지만 A를 기다림
       ↑ 배치 시작     ↑ 배치 끝 (A 완료 시점)

Continuous Batching:
Time →
Req A: [████████████████████]
Req B: [████████] → Req D 즉시 시작: [████████████]
Req C: [████████████] → Req E 즉시 시작: [██████]
       ↑ 요청이 끝나면 즉시 새 요청으로 슬롯 교체

효과: 동일 GPU에서 처리량(throughput)이 2~5배 향상됩니다.

전체 아키텍처 개요

┌─────────────────────────────────────────────┐
│                 vLLM Engine                  │
├─────────────────────────────────────────────┤
│  API Server (OpenAI Compatible)              │
│  ├─ /v1/completions                          │
│  ├─ /v1/chat/completions                     │
│  └─ /v1/embeddings                           │
├─────────────────────────────────────────────┤
│  Scheduler                                   │
│  ├─ Continuous Batching                      │
│  ├─ Priority Queue                           │
│  └─ Preemption (Swap/Recompute)             │
├─────────────────────────────────────────────┤
│  KV Cache Manager (PagedAttention)           │
│  ├─ Block Allocator                          │
│  ├─ Block Table                              │
│  └─ Copy-on-Write                            │
├─────────────────────────────────────────────┤
│  Model Executor                              │
│  ├─ Tensor Parallelism (Ray/NCCL)           │
│  ├─ Quantization (GPTQ/AWQ/FP8)            │
│  └─ Speculative Decoding                    │
└─────────────────────────────────────────────┘

핵심 최적화 기법 상세

Tensor Parallelism (텐서 병렬화)

하나의 모델을 여러 GPU에 분산하여 실행합니다. vLLM은 Megatron-LM 스타일의 텐서 병렬화를 지원합니다.

# 단일 GPU (기본)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1

# 4 GPU 텐서 병렬화
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# 8 GPU (A100 80GB x 8 노드)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1

텐서 병렬화 크기 결정 가이드:

모델 크기	양자화 없음 (FP16)	INT8 양자화	INT4 양자화
7B	1x A100 80GB	1x A100 40GB	1x RTX 4090
13B	1x A100 80GB	1x A100 80GB	1x A100 40GB
34B	2x A100 80GB	1x A100 80GB	1x A100 80GB
70B	4x A100 80GB	2x A100 80GB	1~2x A100 80GB
405B	8x A100 80GB	4~8x A100 80GB	4x A100 80GB

Speculative Decoding (추측적 디코딩)

작은 "드래프트 모델"이 여러 토큰을 빠르게 추측 생성하고, 큰 "타겟 모델"이 한 번의 포워드 패스로 이를 검증합니다.

# Speculative Decoding 활성화
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1

동작 원리:

기존 자기회귀 생성 (1 토큰/스텝):
Step 1 → "The"
Step 2 → "weather"
Step 3 → "is"
Step 4 → "sunny"
Step 5 → "today"
= 5 forward passes (대형 모델)

Speculative Decoding:
Draft model (빠른 소형 모델): "The weather is sunny today" (5 토큰 추측)
Target model (1 forward pass): "The weather is sunny today" ✓ 전부 수락!
= 1 forward pass (소형 모델) + 1 forward pass (대형 모델)
→ 잠재적으로 2.5~3x 속도 향상

적합한 시나리오:

GPU compute-bound 환경 (배치 크기가 작을 때)
드래프트 모델과 타겟 모델의 토크나이저가 호환될 때
수락률이 높은 경우 (일반적인 텍스트 생성)

Prefix Caching (프리픽스 캐싱)

동일한 시스템 프롬프트나 접두사를 가진 요청들 사이에서 KV Cache를 재사용합니다.

# Prefix Caching 활성화
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching

효과:

시나리오: 모든 요청에 동일한 2000 토큰 시스템 프롬프트 사용

Prefix Caching 비활성화:
Req 1: [시스템 프롬프트 2000토큰 처리] + [사용자 입력 처리] → TTFT: 500ms
Req 2: [시스템 프롬프트 2000토큰 재처리] + [사용자 입력 처리] → TTFT: 500ms
Req 3: [시스템 프롬프트 2000토큰 재처리] + [사용자 입력 처리] → TTFT: 500ms

Prefix Caching 활성화:
Req 1: [시스템 프롬프트 2000토큰 처리] + [사용자 입력 처리] → TTFT: 500ms
Req 2: [캐시 히트!] + [사용자 입력 처리] → TTFT: 50ms (10x 개선!)
Req 3: [캐시 히트!] + [사용자 입력 처리] → TTFT: 50ms

Chunked Prefill

긴 프롬프트의 Prefill 단계를 청크로 나누어 디코딩과 인터리빙합니다. 이를 통해 긴 프롬프트가 들어와도 기존 디코딩 중인 요청의 지연시간(TBT)이 급증하지 않습니다.

# Chunked Prefill 활성화
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048

상세 설정 가이드

핵심 설정 파라미터

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  \
  # GPU 메모리 설정
  --gpu-memory-utilization 0.90 \          # GPU 메모리 사용 비율 (0.0~1.0)
  --max-model-len 32768 \                  # 최대 컨텍스트 길이
  \
  # 배치 및 동시성 설정
  --max-num-seqs 256 \                     # 동시 처리 최대 시퀀스 수
  --max-num-batched-tokens 32768 \         # 배치당 최대 토큰 수
  \
  # 병렬화 설정
  --tensor-parallel-size 1 \               # 텐서 병렬 GPU 수
  --pipeline-parallel-size 1 \             # 파이프라인 병렬 단계 수
  \
  # 양자화 설정
  --quantization awq \                     # 양자화 방식 (awq, gptq, fp8 등)
  --dtype auto \                           # 데이터 타입 (auto, float16, bfloat16)
  \
  # 서버 설정
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "your-secret-key"

설정 파라미터 상세 설명

파라미터	기본값	설명	권장 범위
`--gpu-memory-utilization`	0.9	KV Cache에 할당할 GPU 메모리 비율	0.85~0.95
`--max-model-len`	모델 설정값	처리할 수 있는 최대 시퀀스 길이	태스크에 따라 조정
`--max-num-seqs`	256	동시에 처리할 수 있는 최대 시퀀스 수	64~512
`--max-num-batched-tokens`	없음	한 번의 iteration에서 처리할 최대 토큰 수	2048~65536
`--tensor-parallel-size`	1	텐서 병렬화에 사용할 GPU 수	1, 2, 4, 8
`--block-size`	16	PagedAttention 블록 크기 (토큰 수)	8, 16, 32
`--swap-space`	4	CPU 스왑 공간 (GB)	4~16
`--enforce-eager`	False	CUDA 그래프 대신 Eager 모드 사용	디버깅 시 True

시나리오별 설정 예시

높은 처리량 (Throughput) 최적화:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill

낮은 지연 시간 (Latency) 최적화:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 32 \
  --max-model-len 8192 \
  --num-scheduler-steps 1

긴 컨텍스트 (Long Context) 최적화:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 16 \
  --max-model-len 131072 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096

성능 비교: vLLM vs 경쟁 프레임워크

주요 LLM 서빙 프레임워크 비교

특성	vLLM	TGI (HuggingFace)	TensorRT-LLM (NVIDIA)	Triton + TensorRT-LLM
개발사	UC Berkeley	HuggingFace	NVIDIA	NVIDIA
핵심 기술	PagedAttention	Continuous Batching	FasterTransformer 기반	모델 서빙 프레임워크
설치 난이도	매우 쉬움	쉬움	어려움	매우 어려움
모델 호환성	매우 넓음	넓음	제한적 (변환 필요)	제한적
API 호환성	OpenAI 호환	자체 + OpenAI 호환	자체 API	gRPC + HTTP
양자화 지원	GPTQ, AWQ, FP8, GGUF	GPTQ, AWQ, EETQ	FP8, INT8, INT4	FP8, INT8, INT4
멀티 GPU	Tensor/Pipeline	Tensor	Tensor/Pipeline	Tensor/Pipeline
Speculative Decoding	지원	지원	지원	지원
프로덕션 안정성	높음	높음	매우 높음	매우 높음
커뮤니티	매우 활발	활발	NVIDIA 주도	NVIDIA 주도

처리량 벤치마크 (LLaMA-3.1-8B, A100 80GB 기준)

메트릭	vLLM	TGI	TensorRT-LLM
Throughput (tokens/s) - batch=1	~120	~100	~150
Throughput (tokens/s) - batch=32	~2,800	~2,200	~3,500
Throughput (tokens/s) - batch=128	~5,500	~4,000	~7,000
TTFT (ms) - 512 토큰 입력	~35	~40	~25
TBT (ms) - batch=1	~8	~10	~6
메모리 효율성	95%+	~80%	~90%

참고: 벤치마크 결과는 하드웨어, 모델, 설정에 따라 크게 달라질 수 있습니다. 위 수치는 참고용 대략적 비교입니다.

프레임워크 선택 가이드

빠르게 시작하고 싶다면?
→ vLLM (pip install vllm → 바로 서빙)

최대 성능이 필요하다면?
→ TensorRT-LLM (단, 모델 변환과 설정에 상당한 노력 필요)

HuggingFace 생태계를 이미 사용 중이라면?
→ TGI (HuggingFace Hub과 자연스러운 통합)

엔터프라이즈 배포가 필요하다면?
→ Triton + TensorRT-LLM (NVIDIA 공식 지원, 멀티모델 서빙)

배포 패턴

단일 GPU 배포

가장 단순한 형태로, 소규모 모델 서빙에 적합합니다.

# Docker로 단일 GPU 배포
docker run --runtime nvidia --gpus '"device=0"' \
  -v /path/to/model:/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /model \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192

멀티 GPU 배포

# 4 GPU 텐서 병렬 배포
docker run --runtime nvidia --gpus '"device=0,1,2,3"' \
  -v /path/to/model:/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /model \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9

Kubernetes + Ray 배포

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-serving
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-serving
  template:
    metadata:
      labels:
        app: vllm-serving
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-8B-Instruct'
            - '--gpu-memory-utilization'
            - '0.9'
            - '--max-model-len'
            - '8192'
            - '--max-num-seqs'
            - '256'
            - '--enable-prefix-caching'
            - '--port'
            - '8000'
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '32Gi'
              cpu: '8'
            requests:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: 'production'
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-serving
spec:
  selector:
    app: vllm-serving
  ports:
    - port: 80
      targetPort: 8000
      name: http
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-serving
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '80'

멀티 모델 서빙 패턴

# 여러 모델을 동일 클러스터에서 서빙하는 경우
# 각 모델에 대해 별도의 vLLM 인스턴스 + 앞단에 라우터 배치

# router.py (간단한 예시)
from fastapi import FastAPI, Request
import httpx

app = FastAPI()

MODEL_ENDPOINTS = {
    "llama-8b": "http://vllm-8b:8000",
    "llama-70b": "http://vllm-70b:8000",
    "codellama-34b": "http://vllm-code:8000",
}

@app.post("/v1/chat/completions")
async def route_request(request: Request):
    body = await request.json()
    model = body.get("model", "llama-8b")
    endpoint = MODEL_ENDPOINTS.get(model)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{endpoint}/v1/chat/completions",
            json=body,
            timeout=120.0
        )
        return response.json()

모니터링 전략

핵심 메트릭 정의

LLM 서빙의 성능을 측정하는 핵심 지표는 다음과 같습니다:

메트릭	설명	목표 범위
TTFT (Time to First Token)	첫 번째 토큰 생성까지의 시간	< 200ms (대화형)
TBT (Time Between Tokens)	토큰 간 생성 시간 (= Inter-Token Latency)	< 30ms
E2E Latency	전체 요청 처리 시간	태스크 의존
Throughput	초당 생성 토큰 수 (tokens/s)	모델/GPU 의존
GPU Utilization	GPU 연산 유닛 사용률	70~95%
KV Cache Usage	KV Cache 메모리 사용률	< 95%
Queue Depth	대기 중인 요청 수	< max_num_seqs
Request Success Rate	요청 성공률	> 99.5%

Prometheus + Grafana 모니터링

vLLM은 기본적으로 Prometheus 메트릭을 /metrics 엔드포인트로 노출합니다.

# 주요 Prometheus 메트릭
# vllm:num_requests_running - 현재 처리 중인 요청 수
# vllm:num_requests_waiting - 대기 중인 요청 수
# vllm:gpu_cache_usage_perc - KV Cache GPU 사용률
# vllm:cpu_cache_usage_perc - KV Cache CPU 스왑 사용률
# vllm:avg_prompt_throughput_toks_per_s - 프롬프트 처리 처리량
# vllm:avg_generation_throughput_toks_per_s - 생성 처리량
# vllm:e2e_request_latency_seconds - E2E 요청 지연 히스토그램
# vllm:time_to_first_token_seconds - TTFT 히스토그램
# vllm:time_per_output_token_seconds - TBT 히스토그램

# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'vllm'
    scrape_interval: 15s
    static_configs:
      - targets: ['vllm-service:8000']
    metrics_path: '/metrics'

알림 규칙 예시

# prometheus-alert-rules.yaml
groups:
  - name: vllm_alerts
    rules:
      - alert: HighKVCacheUsage
        expr: vllm_gpu_cache_usage_perc > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'KV Cache 사용률이 95%를 초과했습니다'

      - alert: HighRequestLatency
        expr: histogram_quantile(0.99, vllm_e2e_request_latency_seconds_bucket) > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'P99 요청 지연 시간이 30초를 초과했습니다'

      - alert: HighQueueDepth
        expr: vllm_num_requests_waiting > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: '대기 중인 요청이 100개를 초과했습니다'

자주 발생하는 문제와 해결 방법

OOM (Out of Memory) 에러

증상: CUDA out of memory 에러 발생

원인 및 해결:

# 1. gpu-memory-utilization 낮추기
--gpu-memory-utilization 0.80  # 기본 0.9에서 낮춤

# 2. max-model-len 줄이기
--max-model-len 4096  # 불필요하게 긴 컨텍스트 제한

# 3. max-num-seqs 줄이기
--max-num-seqs 64  # 동시 처리 수 감소

# 4. 양자화 적용
--quantization awq  # 또는 gptq, fp8

# 5. 텐서 병렬화 (GPU 추가)
--tensor-parallel-size 2

느린 첫 번째 토큰 (Slow TTFT)

증상: TTFT가 비정상적으로 높음 (수 초 이상)

원인 및 해결:

# 1. 긴 프롬프트 → Chunked Prefill 활성화
--enable-chunked-prefill
--max-num-batched-tokens 2048

# 2. Prefix Caching 활성화 (반복 프롬프트가 있는 경우)
--enable-prefix-caching

# 3. CUDA 그래프 최적화 확인
# --enforce-eager를 제거 (CUDA 그래프 활성화)

# 4. 모델 로드 최적화
--load-format auto  # 가능한 경우 safetensors 사용

처리량 저하

증상: tokens/s가 기대치보다 낮음

체크리스트:

# 1. 배치 크기 확인
--max-num-seqs 256  # 너무 작으면 GPU 활용률 저하

# 2. 메모리 활용률 확인
--gpu-memory-utilization 0.92  # 너무 보수적이면 배치 크기 제한

# 3. Speculative Decoding 시도
--speculative-model <small-model> --num-speculative-tokens 5

# 4. 양자화 적용
--quantization awq  # 또는 fp8 (A100/H100)

요청 타임아웃

증상: 일부 요청이 타임아웃으로 실패

# 1. 최대 토큰 수 제한
# API 요청 시 max_tokens를 적절히 설정

# 2. 스왑 공간 확보
--swap-space 16  # CPU 메모리로 스왑 허용

# 3. 프리엠션 전략 설정
--preemption-mode recompute  # 또는 swap

고급 최적화 팁

FP8 양자화 (H100/A100)

# NVIDIA H100에서 FP8 양자화 활용
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92

# FP8은 INT8 대비:
# - 더 높은 정확도 (FP 범위 유지)
# - H100 FP8 Tensor Core 활용으로 최대 성능
# - 별도 캘리브레이션 불필요

다중 LoRA 서빙

# 베이스 모델 + 여러 LoRA 어댑터 동시 서빙
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules \
    korean-chat=/path/to/korean-lora \
    code-assist=/path/to/code-lora \
    medical-qa=/path/to/medical-lora \
  --max-loras 3 \
  --max-lora-rank 64

벤치마킹 도구

# vLLM 내장 벤치마크 도구
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct &

# 처리량 벤치마크
python -m vllm.benchmark_throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --num-prompts 1000

# 지연시간 벤치마크
python -m vllm.benchmark_latency \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --batch-size 1

FAQ

vLLM과 Ollama의 차이는 무엇인가요?

Ollama는 로컬 개발과 실험을 위한 간편한 도구이고, vLLM은 프로덕션 수준의 서빙을 위한 고성능 엔진입니다. Ollama는 설치와 사용이 매우 간단하지만, PagedAttention이나 Continuous Batching 같은 고급 최적화를 제공하지 않습니다. 프로덕션 트래픽을 처리해야 한다면 vLLM을 권장합니다.

vLLM은 어떤 GPU에서 동작하나요?

vLLM은 CUDA 지원 NVIDIA GPU에서 동작합니다. A100, H100이 최적이며, RTX 3090/4090에서도 사용 가능합니다. AMD GPU (ROCm)도 실험적으로 지원합니다. 최소 요구 사항은 모델 크기와 양자화 수준에 따라 다릅니다.

gpu-memory-utilization을 1.0으로 설정하면 안 되나요?

권장하지 않습니다. GPU에는 KV Cache 외에도 CUDA 커널, cuBLAS 워크스페이스, 임시 텐서 등을 위한 메모리가 필요합니다. 0.9~0.95가 대부분의 경우 안전한 상한선이며, 1.0으로 설정하면 OOM이 자주 발생합니다.

Tensor Parallelism과 Pipeline Parallelism의 차이는 무엇인가요?

Tensor Parallelism(TP)은 하나의 레이어를 여러 GPU에 분할하여 병렬 처리합니다. 레이어 간 통신이 필요하므로 GPU 간 빠른 인터커넥트(NVLink)가 중요합니다. Pipeline Parallelism(PP)은 레이어 그룹을 각 GPU에 할당합니다. 통신 요구 사항이 적지만 "파이프라인 버블"로 인한 비효율이 있습니다. 일반적으로 단일 노드 내에서는 TP를, 노드 간에서는 PP를 사용합니다.

Speculative Decoding은 항상 빠른가요?

아닙니다. Speculative Decoding은 배치 크기가 작고 GPU가 compute-bound일 때 가장 효과적입니다. 배치 크기가 클 때는 드래프트 모델의 오버헤드가 이점을 상쇄할 수 있습니다. 또한 드래프트 모델의 수락률이 낮으면 (특수한 도메인, 코드 생성 등) 성능 향상이 미미합니다.

vLLM에서 스트리밍 응답을 사용하는 것이 좋은가요?

대부분의 경우 권장합니다. 스트리밍은 사용자 체감 지연 시간을 크게 줄여줍니다. OpenAI 호환 API에서 stream=true로 설정하면 SSE(Server-Sent Events)로 토큰을 실시간 전달합니다. 단, 스트리밍 응답의 완전성 검증과 에러 처리 로직이 추가로 필요합니다.

참고 자료

Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180
vLLM Project. "vLLM: Easy, Fast, and Cheap LLM Serving." GitHub Repository
vLLM Documentation. Official Docs
Leviathan, Y. et al. (2023). "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192
NVIDIA. "TensorRT-LLM." GitHub Repository
HuggingFace. "Text Generation Inference (TGI)." GitHub Repository
Zheng, L. et al. (2023). "Efficiently Programming Large Language Models using SGLang." arXiv:2312.07104
Agrawal, A. et al. (2024). "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve." arXiv:2403.02310
Anyscale. "Ray Serve for LLM Serving." Documentation
Kubernetes. "GPU Scheduling." Documentation

마무리

vLLM은 LLM 프로덕션 서빙의 사실상 표준으로 자리잡았으며, 빠른 개발 속도와 활발한 커뮤니티 덕분에 기능이 지속적으로 확장되고 있습니다.

핵심 요약:

PagedAttention은 메모리 효율의 게임체인저입니다. KV Cache 낭비를 5% 미만으로 줄여 동일 하드웨어에서 훨씬 많은 동시 요청을 처리합니다.
Continuous Batching으로 처리량을 극적으로 개선합니다. 정적 배칭 대비 2~5배 향상됩니다.
설정 최적화가 성능의 핵심입니다. gpu-memory-utilization, max-model-len, max-num-seqs의 적절한 조합이 중요합니다.
Prefix Caching과 Speculative Decoding은 시나리오에 따라 추가적인 성능 향상을 제공합니다.
모니터링은 필수입니다. TTFT, TBT, 처리량, KV Cache 사용률을 지속적으로 추적해야 합니다.
Kubernetes 배포 시 GPU 리소스 관리와 오토스케일링 전략을 신중하게 설계해야 합니다.

프레임워크 선택 시 vLLM은 "빠른 시작 + 높은 성능 + 넓은 호환성"의 균형이 가장 뛰어나며, 특별한 이유가 없다면 첫 번째 선택지로 추천합니다.

Complete Guide to vLLM Production Serving Optimization: From PagedAttention to Kubernetes Deployment

Introduction
vLLM Core Architecture
Core Optimization Techniques in Detail
Detailed Configuration Guide
Performance Comparison: vLLM vs Competing Frameworks
Deployment Patterns
Monitoring Strategy
Common Issues and Solutions
Advanced Optimization Tips
FAQ
What is the difference between vLLM and Ollama?
What GPUs does vLLM support?
Can I set gpu-memory-utilization to 1.0?
What is the difference between Tensor Parallelism and Pipeline Parallelism?
Is Speculative Decoding always faster?
Should I use streaming responses in vLLM?
References
Conclusion
Quiz

Introduction

Serving LLMs in production environments is a complex engineering challenge that goes far beyond simply loading a model and exposing an API. It is a multidimensional problem that requires consideration of GPU memory management, concurrent request handling, latency optimization, and cost efficiency.

vLLM is a high-performance LLM serving engine developed by UC Berkeley's Sky Computing Lab, built around the innovative PagedAttention algorithm. Since its initial release in 2023, it has rapidly established itself as an industry-standard serving framework, and is now used in production by numerous companies and research institutions.

This article comprehensively covers vLLM's architecture and core optimization techniques, detailed configuration guide, comparison with competing frameworks, Kubernetes deployment patterns, monitoring strategies, and common issues with their solutions.

vLLM Core Architecture

What is PagedAttention

vLLM's most innovative contribution is PagedAttention. Inspired by the operating system's virtual memory management, it divides the KV Cache into fixed-size blocks (pages) and stores them in non-contiguous memory spaces.

Problems with Traditional Approaches:

Traditional LLM Serving (HuggingFace Transformers, etc.):
├─ Pre-allocates contiguous memory for max sequence length per request
├─ Max length memory is occupied even when actual sequence is short → internal fragmentation
├─ No memory sharing between requests → external fragmentation
└─ Result: 60-80% of GPU memory is wasted

PagedAttention's Solution:

vLLM PagedAttention:
├─ Divides KV Cache into fixed-size blocks (e.g., 16 tokens)
├─ Blocks can be stored in non-contiguous memory
├─ Dynamically allocates/frees blocks as needed
├─ Manages logical→physical mapping via block tables
└─ Result: Reduces GPU memory waste to under 5%

# PagedAttention Block Table Concept
# Logical Block → Physical Block Mapping

# Request 1: "The cat sat on the mat"
# Logical:  [Block 0] [Block 1]
# Physical: [GPU Block 3] [GPU Block 7]

# Request 2: "Hello world"
# Logical:  [Block 0]
# Physical: [GPU Block 1]

# Copy-on-Write when system prompts are identical:
# Request 3 and Request 4 use the same system prompt
# → Share physical blocks for system prompt (no additional memory consumption)

Continuous Batching

In traditional static batching, all requests wait until the longest sequence in the batch completes. vLLM's Continuous Batching eliminates this inefficiency.

Static Batching:
Time →
Req A: [████████████████████]     ← Generation complete
Req B: [████████]                 ← Finished early but waits for A
Req C: [████████████]             ← Longer than B but waits for A
       ↑ Batch start    ↑ Batch end (when A completes)

Continuous Batching:
Time →
Req A: [████████████████████]
Req B: [████████] → Req D starts immediately: [████████████]
Req C: [████████████] → Req E starts immediately: [██████]
       ↑ Slot is replaced with new request as soon as one finishes

Effect: Throughput improves 2-5x on the same GPU.

Overall Architecture Overview

┌─────────────────────────────────────────────┐
│                 vLLM Engine                  │
├─────────────────────────────────────────────┤
│  API Server (OpenAI Compatible)              │
│  ├─ /v1/completions                          │
│  ├─ /v1/chat/completions                     │
│  └─ /v1/embeddings                           │
├─────────────────────────────────────────────┤
│  Scheduler                                   │
│  ├─ Continuous Batching                      │
│  ├─ Priority Queue                           │
│  └─ Preemption (Swap/Recompute)             │
├─────────────────────────────────────────────┤
│  KV Cache Manager (PagedAttention)           │
│  ├─ Block Allocator                          │
│  ├─ Block Table                              │
│  └─ Copy-on-Write                            │
├─────────────────────────────────────────────┤
│  Model Executor                              │
│  ├─ Tensor Parallelism (Ray/NCCL)           │
│  ├─ Quantization (GPTQ/AWQ/FP8)            │
│  └─ Speculative Decoding                    │
└─────────────────────────────────────────────┘

Core Optimization Techniques in Detail

Tensor Parallelism

Distributes a single model across multiple GPUs for execution. vLLM supports Megatron-LM style tensor parallelism.

# Single GPU (default)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1

# 4 GPU tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# 8 GPU (A100 80GB x 8 node)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1

Tensor Parallelism Sizing Guide:

Model Size	No Quantization (FP16)	INT8 Quantization	INT4 Quantization
7B	1x A100 80GB	1x A100 40GB	1x RTX 4090
13B	1x A100 80GB	1x A100 80GB	1x A100 40GB
34B	2x A100 80GB	1x A100 80GB	1x A100 80GB
70B	4x A100 80GB	2x A100 80GB	1-2x A100 80GB
405B	8x A100 80GB	4-8x A100 80GB	4x A100 80GB

Speculative Decoding

A small "draft model" quickly generates speculative tokens, and the large "target model" verifies them in a single forward pass.

# Enable Speculative Decoding
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1

How It Works:

Traditional autoregressive generation (1 token/step):
Step 1 → "The"
Step 2 → "weather"
Step 3 → "is"
Step 4 → "sunny"
Step 5 → "today"
= 5 forward passes (large model)

Speculative Decoding:
Draft model (fast, small): "The weather is sunny today" (5 tokens speculated)
Target model (1 forward pass): "The weather is sunny today" ✓ All accepted!
= 1 forward pass (small model) + 1 forward pass (large model)
→ Potentially 2.5-3x speed improvement

Suitable Scenarios:

GPU compute-bound environments (when batch size is small)
When draft and target model tokenizers are compatible
When acceptance rate is high (general text generation)

Prefix Caching

Reuses KV Cache across requests that share the same system prompt or prefix.

# Enable Prefix Caching
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching

Effect:

Scenario: All requests use the same 2000-token system prompt

Prefix Caching disabled:
Req 1: [Process 2000-token system prompt] + [Process user input] → TTFT: 500ms
Req 2: [Reprocess 2000-token system prompt] + [Process user input] → TTFT: 500ms
Req 3: [Reprocess 2000-token system prompt] + [Process user input] → TTFT: 500ms

Prefix Caching enabled:
Req 1: [Process 2000-token system prompt] + [Process user input] → TTFT: 500ms
Req 2: [Cache hit!] + [Process user input] → TTFT: 50ms (10x improvement!)
Req 3: [Cache hit!] + [Process user input] → TTFT: 50ms

Chunked Prefill

Divides the prefill stage of long prompts into chunks and interleaves them with decoding. This prevents the Time Between Tokens (TBT) for existing decoding requests from spiking when long prompts arrive.

# Enable Chunked Prefill
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048

Detailed Configuration Guide

Core Configuration Parameters

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  \
  # GPU memory settings
  --gpu-memory-utilization 0.90 \          # GPU memory usage ratio (0.0-1.0)
  --max-model-len 32768 \                  # Maximum context length
  \
  # Batch and concurrency settings
  --max-num-seqs 256 \                     # Maximum concurrent sequences
  --max-num-batched-tokens 32768 \         # Maximum tokens per batch
  \
  # Parallelism settings
  --tensor-parallel-size 1 \               # Number of tensor parallel GPUs
  --pipeline-parallel-size 1 \             # Number of pipeline parallel stages
  \
  # Quantization settings
  --quantization awq \                     # Quantization method (awq, gptq, fp8, etc.)
  --dtype auto \                           # Data type (auto, float16, bfloat16)
  \
  # Server settings
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "your-secret-key"

Configuration Parameter Details

Parameter	Default	Description	Recommended Range
`--gpu-memory-utilization`	0.9	GPU memory ratio allocated to KV Cache	0.85-0.95
`--max-model-len`	Model config	Maximum processable sequence length	Adjust per task
`--max-num-seqs`	256	Maximum concurrent sequences	64-512
`--max-num-batched-tokens`	None	Maximum tokens per iteration	2048-65536
`--tensor-parallel-size`	1	Number of GPUs for tensor parallelism	1, 2, 4, 8
`--block-size`	16	PagedAttention block size (tokens)	8, 16, 32
`--swap-space`	4	CPU swap space (GB)	4-16
`--enforce-eager`	False	Use eager mode instead of CUDA graphs	True for debugging

Configuration Examples by Scenario

High Throughput Optimization:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill

Low Latency Optimization:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 32 \
  --max-model-len 8192 \
  --num-scheduler-steps 1

Long Context Optimization:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 16 \
  --max-model-len 131072 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096

Performance Comparison: vLLM vs Competing Frameworks

Major LLM Serving Framework Comparison

Feature	vLLM	TGI (HuggingFace)	TensorRT-LLM (NVIDIA)	Triton + TensorRT-LLM
Developer	UC Berkeley	HuggingFace	NVIDIA	NVIDIA
Core Technology	PagedAttention	Continuous Batching	FasterTransformer-based	Model serving framework
Installation Difficulty	Very Easy	Easy	Difficult	Very Difficult
Model Compatibility	Very Broad	Broad	Limited (conversion required)	Limited
API Compatibility	OpenAI Compatible	Custom + OpenAI Compatible	Custom API	gRPC + HTTP
Quantization Support	GPTQ, AWQ, FP8, GGUF	GPTQ, AWQ, EETQ	FP8, INT8, INT4	FP8, INT8, INT4
Multi-GPU	Tensor/Pipeline	Tensor	Tensor/Pipeline	Tensor/Pipeline
Speculative Decoding	Supported	Supported	Supported	Supported
Production Stability	High	High	Very High	Very High
Community	Very Active	Active	NVIDIA-led	NVIDIA-led

Throughput Benchmarks (LLaMA-3.1-8B, A100 80GB)

Metric	vLLM	TGI	TensorRT-LLM
Throughput (tokens/s) - batch=1	~120	~100	~150
Throughput (tokens/s) - batch=32	~2,800	~2,200	~3,500
Throughput (tokens/s) - batch=128	~5,500	~4,000	~7,000
TTFT (ms) - 512 token input	~35	~40	~25
TBT (ms) - batch=1	~8	~10	~6
Memory Efficiency	95%+	~80%	~90%

Note: Benchmark results can vary significantly depending on hardware, model, and configuration. The figures above are approximate comparisons for reference.

Framework Selection Guide

Want to get started quickly?
→ vLLM (pip install vllm → serve immediately)

Need maximum performance?
→ TensorRT-LLM (requires significant effort for model conversion and configuration)

Already using the HuggingFace ecosystem?
→ TGI (natural integration with HuggingFace Hub)

Need enterprise deployment?
→ Triton + TensorRT-LLM (official NVIDIA support, multi-model serving)

Deployment Patterns

Single GPU Deployment

The simplest form, suitable for serving small models.

# Single GPU deployment with Docker
docker run --runtime nvidia --gpus '"device=0"' \
  -v /path/to/model:/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /model \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192

Multi-GPU Deployment

# 4 GPU tensor parallel deployment
docker run --runtime nvidia --gpus '"device=0,1,2,3"' \
  -v /path/to/model:/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /model \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9

Kubernetes + Ray Deployment

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-serving
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-serving
  template:
    metadata:
      labels:
        app: vllm-serving
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-8B-Instruct'
            - '--gpu-memory-utilization'
            - '0.9'
            - '--max-model-len'
            - '8192'
            - '--max-num-seqs'
            - '256'
            - '--enable-prefix-caching'
            - '--port'
            - '8000'
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '32Gi'
              cpu: '8'
            requests:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: 'production'
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-serving
spec:
  selector:
    app: vllm-serving
  ports:
    - port: 80
      targetPort: 8000
      name: http
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-serving
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '80'

Multi-Model Serving Pattern

# When serving multiple models on the same cluster
# Deploy separate vLLM instances per model + router in front

# router.py (simple example)
from fastapi import FastAPI, Request
import httpx

app = FastAPI()

MODEL_ENDPOINTS = {
    "llama-8b": "http://vllm-8b:8000",
    "llama-70b": "http://vllm-70b:8000",
    "codellama-34b": "http://vllm-code:8000",
}

@app.post("/v1/chat/completions")
async def route_request(request: Request):
    body = await request.json()
    model = body.get("model", "llama-8b")
    endpoint = MODEL_ENDPOINTS.get(model)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{endpoint}/v1/chat/completions",
            json=body,
            timeout=120.0
        )
        return response.json()

Monitoring Strategy

Core Metric Definitions

Key metrics for measuring LLM serving performance:

Metric	Description	Target Range
TTFT (Time to First Token)	Time until first token generation	< 200ms (interactive)
TBT (Time Between Tokens)	Time between token generations (= Inter-Token Latency)	< 30ms
E2E Latency	Total request processing time	Task-dependent
Throughput	Tokens generated per second (tokens/s)	Model/GPU dependent
GPU Utilization	GPU compute unit usage	70-95%
KV Cache Usage	KV Cache memory utilization	< 95%
Queue Depth	Number of waiting requests	< max_num_seqs
Request Success Rate	Request success rate	> 99.5%

Prometheus + Grafana Monitoring

vLLM natively exposes Prometheus metrics via the /metrics endpoint.

# Key Prometheus metrics
# vllm:num_requests_running - Currently processing requests
# vllm:num_requests_waiting - Waiting requests
# vllm:gpu_cache_usage_perc - KV Cache GPU utilization
# vllm:cpu_cache_usage_perc - KV Cache CPU swap utilization
# vllm:avg_prompt_throughput_toks_per_s - Prompt processing throughput
# vllm:avg_generation_throughput_toks_per_s - Generation throughput
# vllm:e2e_request_latency_seconds - E2E request latency histogram
# vllm:time_to_first_token_seconds - TTFT histogram
# vllm:time_per_output_token_seconds - TBT histogram

# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'vllm'
    scrape_interval: 15s
    static_configs:
      - targets: ['vllm-service:8000']
    metrics_path: '/metrics'

Alert Rules Example

# prometheus-alert-rules.yaml
groups:
  - name: vllm_alerts
    rules:
      - alert: HighKVCacheUsage
        expr: vllm_gpu_cache_usage_perc > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'KV Cache usage exceeds 95%'

      - alert: HighRequestLatency
        expr: histogram_quantile(0.99, vllm_e2e_request_latency_seconds_bucket) > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'P99 request latency exceeds 30 seconds'

      - alert: HighQueueDepth
        expr: vllm_num_requests_waiting > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'Waiting requests exceed 100'

Common Issues and Solutions

OOM (Out of Memory) Errors

Symptoms: CUDA out of memory error occurs

Causes and Solutions:

# 1. Lower gpu-memory-utilization
--gpu-memory-utilization 0.80  # Reduce from default 0.9

# 2. Reduce max-model-len
--max-model-len 4096  # Limit unnecessarily long contexts

# 3. Reduce max-num-seqs
--max-num-seqs 64  # Decrease concurrent processing

# 4. Apply quantization
--quantization awq  # Or gptq, fp8

# 5. Tensor parallelism (add GPUs)
--tensor-parallel-size 2

Slow First Token (Slow TTFT)

Symptoms: TTFT is abnormally high (several seconds or more)

Causes and Solutions:

# 1. Long prompts → Enable Chunked Prefill
--enable-chunked-prefill
--max-num-batched-tokens 2048

# 2. Enable Prefix Caching (when repeated prompts exist)
--enable-prefix-caching

# 3. Check CUDA graph optimization
# Remove --enforce-eager (enable CUDA graphs)

# 4. Optimize model loading
--load-format auto  # Use safetensors when possible

Throughput Degradation

Symptoms: tokens/s is lower than expected

Checklist:

# 1. Check batch size
--max-num-seqs 256  # Too small leads to low GPU utilization

# 2. Check memory utilization
--gpu-memory-utilization 0.92  # Too conservative limits batch size

# 3. Try Speculative Decoding
--speculative-model <small-model> --num-speculative-tokens 5

# 4. Apply quantization
--quantization awq  # Or fp8 (A100/H100)

Request Timeouts

Symptoms: Some requests fail due to timeouts

# 1. Limit maximum tokens
# Set max_tokens appropriately in API requests

# 2. Allocate swap space
--swap-space 16  # Allow swapping to CPU memory

# 3. Set preemption strategy
--preemption-mode recompute  # Or swap

Advanced Optimization Tips

FP8 Quantization (H100/A100)

# Leverage FP8 quantization on NVIDIA H100
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92

# FP8 compared to INT8:
# - Higher accuracy (maintains FP range)
# - Maximum performance with H100 FP8 Tensor Cores
# - No separate calibration required

Multi-LoRA Serving

# Serve base model + multiple LoRA adapters simultaneously
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules \
    korean-chat=/path/to/korean-lora \
    code-assist=/path/to/code-lora \
    medical-qa=/path/to/medical-lora \
  --max-loras 3 \
  --max-lora-rank 64

Benchmarking Tools

# vLLM built-in benchmark tools
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct &

# Throughput benchmark
python -m vllm.benchmark_throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --num-prompts 1000

# Latency benchmark
python -m vllm.benchmark_latency \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --batch-size 1

FAQ

What is the difference between vLLM and Ollama?

Ollama is a convenient tool for local development and experimentation, while vLLM is a high-performance engine for production-level serving. Ollama is extremely simple to install and use but does not provide advanced optimizations like PagedAttention or Continuous Batching. If you need to handle production traffic, vLLM is recommended.

What GPUs does vLLM support?

vLLM runs on CUDA-supported NVIDIA GPUs. A100 and H100 are optimal, and RTX 3090/4090 are also usable. AMD GPUs (ROCm) are experimentally supported. Minimum requirements vary depending on model size and quantization level.

Can I set gpu-memory-utilization to 1.0?

Not recommended. GPUs need memory for CUDA kernels, cuBLAS workspaces, temporary tensors, and other allocations beyond the KV Cache. 0.9-0.95 is a safe upper bound for most cases, and setting it to 1.0 frequently causes OOM errors.

What is the difference between Tensor Parallelism and Pipeline Parallelism?

Tensor Parallelism (TP) splits a single layer across multiple GPUs for parallel processing. Since inter-layer communication is required, fast GPU interconnect (NVLink) is important. Pipeline Parallelism (PP) assigns groups of layers to each GPU. It has fewer communication requirements but suffers from inefficiency due to "pipeline bubbles." Generally, TP is used within a single node, and PP is used across nodes.

Is Speculative Decoding always faster?

No. Speculative Decoding is most effective when batch size is small and the GPU is compute-bound. With large batch sizes, the overhead of the draft model can offset the benefits. Additionally, if the draft model's acceptance rate is low (specialized domains, code generation, etc.), performance gains are minimal.

Should I use streaming responses in vLLM?

Recommended in most cases. Streaming significantly reduces perceived latency for users. Setting stream=true in the OpenAI-compatible API delivers tokens in real-time via SSE (Server-Sent Events). However, additional logic for streaming response completeness validation and error handling is required.

References

Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180
vLLM Project. "vLLM: Easy, Fast, and Cheap LLM Serving." GitHub Repository
vLLM Documentation. Official Docs
Leviathan, Y. et al. (2023). "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192
NVIDIA. "TensorRT-LLM." GitHub Repository
HuggingFace. "Text Generation Inference (TGI)." GitHub Repository
Zheng, L. et al. (2023). "Efficiently Programming Large Language Models using SGLang." arXiv:2312.07104
Agrawal, A. et al. (2024). "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve." arXiv:2403.02310
Anyscale. "Ray Serve for LLM Serving." Documentation
Kubernetes. "GPU Scheduling." Documentation

Conclusion

vLLM has established itself as the de facto standard for LLM production serving, with continuously expanding features thanks to its rapid development pace and active community.

Key takeaways:

PagedAttention is a game-changer for memory efficiency. It reduces KV Cache waste to under 5%, handling far more concurrent requests on the same hardware.
Continuous Batching dramatically improves throughput. It delivers 2-5x improvement over static batching.
Configuration optimization is key to performance. The right combination of gpu-memory-utilization, max-model-len, and max-num-seqs is critical.
Prefix Caching and Speculative Decoding provide additional performance gains depending on the scenario.
Monitoring is essential. TTFT, TBT, throughput, and KV Cache utilization must be continuously tracked.
GPU resource management and autoscaling strategies must be carefully designed for Kubernetes deployments.

When choosing a framework, vLLM offers the best balance of "quick start + high performance + broad compatibility," and is recommended as the first choice unless there are specific reasons otherwise.

Quiz

Q1: What is the main topic covered in "Complete Guide to vLLM Production Serving Optimization: From PagedAttention to Kubernetes Deployment"?

A comprehensive production-focused guide covering vLLM core architecture including PagedAttention, optimization techniques such as Continuous Batching, Tensor Parallelism, Speculative Decoding, and Prefix Caching, detailed configuration guide, performance comparison with TGI and...

Q2: Describe the vLLM Core Architecture.

What is PagedAttention vLLM's most innovative contribution is PagedAttention. Inspired by the operating system's virtual memory management, it divides the KV Cache into fixed-size blocks (pages) and stores them in non-contiguous memory spaces.

Q3: How can Core Optimization Techniques in Detail be achieved effectively?

Tensor Parallelism Distributes a single model across multiple GPUs for execution. vLLM supports Megatron-LM style tensor parallelism.

Q4: What are the key steps for Detailed Configuration Guide?

Core Configuration Parameters Configuration Parameter Details Configuration Examples by Scenario High Throughput Optimization: Low Latency Optimization: Long Context Optimization:

Q5: What are the key differences in Performance Comparison: vLLM vs Competing Frameworks?

Major LLM Serving Framework Comparison Throughput Benchmarks (LLaMA-3.1-8B, A100 80GB) Framework Selection Guide