vLLM 프로덕션 서빙 최적화 완전 가이드: PagedAttention부터 Kubernetes 배포까지

서론
vLLM 핵심 아키텍처
핵심 최적화 기법 상세
상세 설정 가이드
성능 비교: vLLM vs 경쟁 프레임워크
배포 패턴
모니터링 전략
자주 발생하는 문제와 해결 방법
고급 최적화 팁
FAQ
vLLM과 Ollama의 차이는 무엇인가요?
vLLM은 어떤 GPU에서 동작하나요?
gpu-memory-utilization을 1.0으로 설정하면 안 되나요?
Tensor Parallelism과 Pipeline Parallelism의 차이는 무엇인가요?
Speculative Decoding은 항상 빠른가요?
vLLM에서 스트리밍 응답을 사용하는 것이 좋은가요?
참고 자료
마무리

서론

LLM을 프로덕션 환경에서 서빙하는 것은 단순히 모델을 로드하고 API를 노출하는 것을 넘어선 복잡한 엔지니어링 과제입니다. GPU 메모리 관리, 동시 요청 처리, 지연 시간 최적화, 비용 효율성까지 고려해야 하는 다차원적 문제입니다.

vLLM은 UC Berkeley의 Sky Computing Lab에서 개발한 고성능 LLM 서빙 엔진으로, 혁신적인 PagedAttention 알고리즘을 핵심으로 합니다. 2023년 첫 공개 이후 빠르게 업계 표준 수준의 서빙 프레임워크로 자리잡았으며, 현재 수많은 기업과 연구 기관에서 프로덕션에 활용하고 있습니다.

이 글에서는 vLLM의 아키텍처와 핵심 최적화 기법, 상세 설정 가이드, 경쟁 프레임워크와의 비교, Kubernetes 배포 패턴, 모니터링 전략, 그리고 실무에서 자주 만나는 문제와 해결 방법까지 포괄적으로 다룹니다.

vLLM 핵심 아키텍처

PagedAttention이란

vLLM의 가장 혁신적인 기여는 PagedAttention입니다. 운영체제의 가상 메모리 관리에서 영감을 받아, KV Cache를 고정 크기의 블록(페이지)으로 나누어 비연속적인 메모리 공간에 저장합니다.

기존 방식의 문제점:

기존 LLM 서빙 (HuggingFace Transformers 등):
├─ 각 요청에 최대 시퀀스 길이만큼 연속 메모리를 사전 할당
├─ 실제 시퀀스가 짧아도 최대 길이 메모리 점유 → 내부 단편화
├─ 요청 간 메모리 공유 불가 → 외부 단편화
└─ 결과: GPU 메모리의 60~80%가 낭비됨

PagedAttention의 해결 방식:

vLLM PagedAttention:
├─ KV Cache를 고정 크기 블록(예: 16토큰)으로 분할
├─ 블록은 비연속적 메모리에 저장 가능
├─ 필요할 때 블록을 동적 할당/해제
├─ 블록 테이블로 논리→물리 매핑 관리
└─ 결과: GPU 메모리 낭비를 5% 미만으로 줄임

# PagedAttention의 블록 테이블 개념
# Logical Block → Physical Block 매핑

# Request 1: "The cat sat on the mat"
# Logical:  [Block 0] [Block 1]
# Physical: [GPU Block 3] [GPU Block 7]

# Request 2: "Hello world"
# Logical:  [Block 0]
# Physical: [GPU Block 1]

# 시스템 프롬프트가 동일한 경우 Copy-on-Write:
# Request 3과 Request 4가 같은 시스템 프롬프트 사용
# → 시스템 프롬프트의 물리 블록을 공유 (메모리 추가 소비 없음)

Continuous Batching (지속적 배칭)

기존의 정적 배칭(static batching)에서는 배치 내 가장 긴 시퀀스가 완료될 때까지 모든 요청이 대기합니다. vLLM의 Continuous Batching은 이 비효율을 제거합니다.

정적 배칭 (Static Batching):
Time →
Req A: [████████████████████]     ← 생성 완료
Req B: [████████]                 ← 일찍 끝났지만 A를 기다림
Req C: [████████████]             ← B보다 길지만 A를 기다림
       ↑ 배치 시작     ↑ 배치 끝 (A 완료 시점)

Continuous Batching:
Time →
Req A: [████████████████████]
Req B: [████████] → Req D 즉시 시작: [████████████]
Req C: [████████████] → Req E 즉시 시작: [██████]
       ↑ 요청이 끝나면 즉시 새 요청으로 슬롯 교체

효과: 동일 GPU에서 처리량(throughput)이 2~5배 향상됩니다.

전체 아키텍처 개요

┌─────────────────────────────────────────────┐
│                 vLLM Engine                  │
├─────────────────────────────────────────────┤
│  API Server (OpenAI Compatible)              │
│  ├─ /v1/completions                          │
│  ├─ /v1/chat/completions                     │
│  └─ /v1/embeddings                           │
├─────────────────────────────────────────────┤
│  Scheduler                                   │
│  ├─ Continuous Batching                      │
│  ├─ Priority Queue                           │
│  └─ Preemption (Swap/Recompute)             │
├─────────────────────────────────────────────┤
│  KV Cache Manager (PagedAttention)           │
│  ├─ Block Allocator                          │
│  ├─ Block Table                              │
│  └─ Copy-on-Write                            │
├─────────────────────────────────────────────┤
│  Model Executor                              │
│  ├─ Tensor Parallelism (Ray/NCCL)           │
│  ├─ Quantization (GPTQ/AWQ/FP8)            │
│  └─ Speculative Decoding                    │
└─────────────────────────────────────────────┘

핵심 최적화 기법 상세

Tensor Parallelism (텐서 병렬화)

하나의 모델을 여러 GPU에 분산하여 실행합니다. vLLM은 Megatron-LM 스타일의 텐서 병렬화를 지원합니다.

# 단일 GPU (기본)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1

# 4 GPU 텐서 병렬화
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# 8 GPU (A100 80GB x 8 노드)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1

텐서 병렬화 크기 결정 가이드:

모델 크기	양자화 없음 (FP16)	INT8 양자화	INT4 양자화
7B	1x A100 80GB	1x A100 40GB	1x RTX 4090
13B	1x A100 80GB	1x A100 80GB	1x A100 40GB
34B	2x A100 80GB	1x A100 80GB	1x A100 80GB
70B	4x A100 80GB	2x A100 80GB	1~2x A100 80GB
405B	8x A100 80GB	4~8x A100 80GB	4x A100 80GB

Speculative Decoding (추측적 디코딩)

작은 "드래프트 모델"이 여러 토큰을 빠르게 추측 생성하고, 큰 "타겟 모델"이 한 번의 포워드 패스로 이를 검증합니다.

# Speculative Decoding 활성화
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1

동작 원리:

기존 자기회귀 생성 (1 토큰/스텝):
Step 1 → "The"
Step 2 → "weather"
Step 3 → "is"
Step 4 → "sunny"
Step 5 → "today"
= 5 forward passes (대형 모델)

Speculative Decoding:
Draft model (빠른 소형 모델): "The weather is sunny today" (5 토큰 추측)
Target model (1 forward pass): "The weather is sunny today" ✓ 전부 수락!
= 1 forward pass (소형 모델) + 1 forward pass (대형 모델)
→ 잠재적으로 2.5~3x 속도 향상

적합한 시나리오:

GPU compute-bound 환경 (배치 크기가 작을 때)
드래프트 모델과 타겟 모델의 토크나이저가 호환될 때
수락률이 높은 경우 (일반적인 텍스트 생성)

Prefix Caching (프리픽스 캐싱)

동일한 시스템 프롬프트나 접두사를 가진 요청들 사이에서 KV Cache를 재사용합니다.

# Prefix Caching 활성화
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching

효과:

시나리오: 모든 요청에 동일한 2000 토큰 시스템 프롬프트 사용

Prefix Caching 비활성화:
Req 1: [시스템 프롬프트 2000토큰 처리] + [사용자 입력 처리] → TTFT: 500ms
Req 2: [시스템 프롬프트 2000토큰 재처리] + [사용자 입력 처리] → TTFT: 500ms
Req 3: [시스템 프롬프트 2000토큰 재처리] + [사용자 입력 처리] → TTFT: 500ms

Prefix Caching 활성화:
Req 1: [시스템 프롬프트 2000토큰 처리] + [사용자 입력 처리] → TTFT: 500ms
Req 2: [캐시 히트!] + [사용자 입력 처리] → TTFT: 50ms (10x 개선!)
Req 3: [캐시 히트!] + [사용자 입력 처리] → TTFT: 50ms

Chunked Prefill

긴 프롬프트의 Prefill 단계를 청크로 나누어 디코딩과 인터리빙합니다. 이를 통해 긴 프롬프트가 들어와도 기존 디코딩 중인 요청의 지연시간(TBT)이 급증하지 않습니다.

# Chunked Prefill 활성화
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048

상세 설정 가이드

핵심 설정 파라미터

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  \
  # GPU 메모리 설정
  --gpu-memory-utilization 0.90 \          # GPU 메모리 사용 비율 (0.0~1.0)
  --max-model-len 32768 \                  # 최대 컨텍스트 길이
  \
  # 배치 및 동시성 설정
  --max-num-seqs 256 \                     # 동시 처리 최대 시퀀스 수
  --max-num-batched-tokens 32768 \         # 배치당 최대 토큰 수
  \
  # 병렬화 설정
  --tensor-parallel-size 1 \               # 텐서 병렬 GPU 수
  --pipeline-parallel-size 1 \             # 파이프라인 병렬 단계 수
  \
  # 양자화 설정
  --quantization awq \                     # 양자화 방식 (awq, gptq, fp8 등)
  --dtype auto \                           # 데이터 타입 (auto, float16, bfloat16)
  \
  # 서버 설정
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "your-secret-key"

설정 파라미터 상세 설명

파라미터	기본값	설명	권장 범위
`--gpu-memory-utilization`	0.9	KV Cache에 할당할 GPU 메모리 비율	0.85~0.95
`--max-model-len`	모델 설정값	처리할 수 있는 최대 시퀀스 길이	태스크에 따라 조정
`--max-num-seqs`	256	동시에 처리할 수 있는 최대 시퀀스 수	64~512
`--max-num-batched-tokens`	없음	한 번의 iteration에서 처리할 최대 토큰 수	2048~65536
`--tensor-parallel-size`	1	텐서 병렬화에 사용할 GPU 수	1, 2, 4, 8
`--block-size`	16	PagedAttention 블록 크기 (토큰 수)	8, 16, 32
`--swap-space`	4	CPU 스왑 공간 (GB)	4~16
`--enforce-eager`	False	CUDA 그래프 대신 Eager 모드 사용	디버깅 시 True

시나리오별 설정 예시

높은 처리량 (Throughput) 최적화:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill

낮은 지연 시간 (Latency) 최적화:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 32 \
  --max-model-len 8192 \
  --num-scheduler-steps 1

긴 컨텍스트 (Long Context) 최적화:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 16 \
  --max-model-len 131072 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096

성능 비교: vLLM vs 경쟁 프레임워크

주요 LLM 서빙 프레임워크 비교

특성	vLLM	TGI (HuggingFace)	TensorRT-LLM (NVIDIA)	Triton + TensorRT-LLM
개발사	UC Berkeley	HuggingFace	NVIDIA	NVIDIA
핵심 기술	PagedAttention	Continuous Batching	FasterTransformer 기반	모델 서빙 프레임워크
설치 난이도	매우 쉬움	쉬움	어려움	매우 어려움
모델 호환성	매우 넓음	넓음	제한적 (변환 필요)	제한적
API 호환성	OpenAI 호환	자체 + OpenAI 호환	자체 API	gRPC + HTTP
양자화 지원	GPTQ, AWQ, FP8, GGUF	GPTQ, AWQ, EETQ	FP8, INT8, INT4	FP8, INT8, INT4
멀티 GPU	Tensor/Pipeline	Tensor	Tensor/Pipeline	Tensor/Pipeline
Speculative Decoding	지원	지원	지원	지원
프로덕션 안정성	높음	높음	매우 높음	매우 높음
커뮤니티	매우 활발	활발	NVIDIA 주도	NVIDIA 주도

처리량 벤치마크 (LLaMA-3.1-8B, A100 80GB 기준)

메트릭	vLLM	TGI	TensorRT-LLM
Throughput (tokens/s) - batch=1	~120	~100	~150
Throughput (tokens/s) - batch=32	~2,800	~2,200	~3,500
Throughput (tokens/s) - batch=128	~5,500	~4,000	~7,000
TTFT (ms) - 512 토큰 입력	~35	~40	~25
TBT (ms) - batch=1	~8	~10	~6
메모리 효율성	95%+	~80%	~90%

참고: 벤치마크 결과는 하드웨어, 모델, 설정에 따라 크게 달라질 수 있습니다. 위 수치는 참고용 대략적 비교입니다.

프레임워크 선택 가이드

빠르게 시작하고 싶다면?
→ vLLM (pip install vllm → 바로 서빙)

최대 성능이 필요하다면?
→ TensorRT-LLM (단, 모델 변환과 설정에 상당한 노력 필요)

HuggingFace 생태계를 이미 사용 중이라면?
→ TGI (HuggingFace Hub과 자연스러운 통합)

엔터프라이즈 배포가 필요하다면?
→ Triton + TensorRT-LLM (NVIDIA 공식 지원, 멀티모델 서빙)

배포 패턴

단일 GPU 배포

가장 단순한 형태로, 소규모 모델 서빙에 적합합니다.

# Docker로 단일 GPU 배포
docker run --runtime nvidia --gpus '"device=0"' \
  -v /path/to/model:/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /model \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192

멀티 GPU 배포

# 4 GPU 텐서 병렬 배포
docker run --runtime nvidia --gpus '"device=0,1,2,3"' \
  -v /path/to/model:/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /model \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9

Kubernetes + Ray 배포

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-serving
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-serving
  template:
    metadata:
      labels:
        app: vllm-serving
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-8B-Instruct'
            - '--gpu-memory-utilization'
            - '0.9'
            - '--max-model-len'
            - '8192'
            - '--max-num-seqs'
            - '256'
            - '--enable-prefix-caching'
            - '--port'
            - '8000'
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '32Gi'
              cpu: '8'
            requests:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: 'production'
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-serving
spec:
  selector:
    app: vllm-serving
  ports:
    - port: 80
      targetPort: 8000
      name: http
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-serving
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '80'

멀티 모델 서빙 패턴

# 여러 모델을 동일 클러스터에서 서빙하는 경우
# 각 모델에 대해 별도의 vLLM 인스턴스 + 앞단에 라우터 배치

# router.py (간단한 예시)
from fastapi import FastAPI, Request
import httpx

app = FastAPI()

MODEL_ENDPOINTS = {
    "llama-8b": "http://vllm-8b:8000",
    "llama-70b": "http://vllm-70b:8000",
    "codellama-34b": "http://vllm-code:8000",
}

@app.post("/v1/chat/completions")
async def route_request(request: Request):
    body = await request.json()
    model = body.get("model", "llama-8b")
    endpoint = MODEL_ENDPOINTS.get(model)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{endpoint}/v1/chat/completions",
            json=body,
            timeout=120.0
        )
        return response.json()

모니터링 전략

핵심 메트릭 정의

LLM 서빙의 성능을 측정하는 핵심 지표는 다음과 같습니다:

메트릭	설명	목표 범위
TTFT (Time to First Token)	첫 번째 토큰 생성까지의 시간	< 200ms (대화형)
TBT (Time Between Tokens)	토큰 간 생성 시간 (= Inter-Token Latency)	< 30ms
E2E Latency	전체 요청 처리 시간	태스크 의존
Throughput	초당 생성 토큰 수 (tokens/s)	모델/GPU 의존
GPU Utilization	GPU 연산 유닛 사용률	70~95%
KV Cache Usage	KV Cache 메모리 사용률	< 95%
Queue Depth	대기 중인 요청 수	< max_num_seqs
Request Success Rate	요청 성공률	> 99.5%

Prometheus + Grafana 모니터링

vLLM은 기본적으로 Prometheus 메트릭을 /metrics 엔드포인트로 노출합니다.

# 주요 Prometheus 메트릭
# vllm:num_requests_running - 현재 처리 중인 요청 수
# vllm:num_requests_waiting - 대기 중인 요청 수
# vllm:gpu_cache_usage_perc - KV Cache GPU 사용률
# vllm:cpu_cache_usage_perc - KV Cache CPU 스왑 사용률
# vllm:avg_prompt_throughput_toks_per_s - 프롬프트 처리 처리량
# vllm:avg_generation_throughput_toks_per_s - 생성 처리량
# vllm:e2e_request_latency_seconds - E2E 요청 지연 히스토그램
# vllm:time_to_first_token_seconds - TTFT 히스토그램
# vllm:time_per_output_token_seconds - TBT 히스토그램

# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'vllm'
    scrape_interval: 15s
    static_configs:
      - targets: ['vllm-service:8000']
    metrics_path: '/metrics'

알림 규칙 예시

# prometheus-alert-rules.yaml
groups:
  - name: vllm_alerts
    rules:
      - alert: HighKVCacheUsage
        expr: vllm_gpu_cache_usage_perc > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'KV Cache 사용률이 95%를 초과했습니다'

      - alert: HighRequestLatency
        expr: histogram_quantile(0.99, vllm_e2e_request_latency_seconds_bucket) > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'P99 요청 지연 시간이 30초를 초과했습니다'

      - alert: HighQueueDepth
        expr: vllm_num_requests_waiting > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: '대기 중인 요청이 100개를 초과했습니다'

자주 발생하는 문제와 해결 방법

OOM (Out of Memory) 에러

증상: CUDA out of memory 에러 발생

원인 및 해결:

# 1. gpu-memory-utilization 낮추기
--gpu-memory-utilization 0.80  # 기본 0.9에서 낮춤

# 2. max-model-len 줄이기
--max-model-len 4096  # 불필요하게 긴 컨텍스트 제한

# 3. max-num-seqs 줄이기
--max-num-seqs 64  # 동시 처리 수 감소

# 4. 양자화 적용
--quantization awq  # 또는 gptq, fp8

# 5. 텐서 병렬화 (GPU 추가)
--tensor-parallel-size 2

느린 첫 번째 토큰 (Slow TTFT)

증상: TTFT가 비정상적으로 높음 (수 초 이상)

원인 및 해결:

# 1. 긴 프롬프트 → Chunked Prefill 활성화
--enable-chunked-prefill
--max-num-batched-tokens 2048

# 2. Prefix Caching 활성화 (반복 프롬프트가 있는 경우)
--enable-prefix-caching

# 3. CUDA 그래프 최적화 확인
# --enforce-eager를 제거 (CUDA 그래프 활성화)

# 4. 모델 로드 최적화
--load-format auto  # 가능한 경우 safetensors 사용

처리량 저하

증상: tokens/s가 기대치보다 낮음

체크리스트:

# 1. 배치 크기 확인
--max-num-seqs 256  # 너무 작으면 GPU 활용률 저하

# 2. 메모리 활용률 확인
--gpu-memory-utilization 0.92  # 너무 보수적이면 배치 크기 제한

# 3. Speculative Decoding 시도
--speculative-model <small-model> --num-speculative-tokens 5

# 4. 양자화 적용
--quantization awq  # 또는 fp8 (A100/H100)

요청 타임아웃

증상: 일부 요청이 타임아웃으로 실패

# 1. 최대 토큰 수 제한
# API 요청 시 max_tokens를 적절히 설정

# 2. 스왑 공간 확보
--swap-space 16  # CPU 메모리로 스왑 허용

# 3. 프리엠션 전략 설정
--preemption-mode recompute  # 또는 swap

고급 최적화 팁

FP8 양자화 (H100/A100)

# NVIDIA H100에서 FP8 양자화 활용
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92

# FP8은 INT8 대비:
# - 더 높은 정확도 (FP 범위 유지)
# - H100 FP8 Tensor Core 활용으로 최대 성능
# - 별도 캘리브레이션 불필요

다중 LoRA 서빙

# 베이스 모델 + 여러 LoRA 어댑터 동시 서빙
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules \
    korean-chat=/path/to/korean-lora \
    code-assist=/path/to/code-lora \
    medical-qa=/path/to/medical-lora \
  --max-loras 3 \
  --max-lora-rank 64

벤치마킹 도구

# vLLM 내장 벤치마크 도구
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct &

# 처리량 벤치마크
python -m vllm.benchmark_throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --num-prompts 1000

# 지연시간 벤치마크
python -m vllm.benchmark_latency \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --batch-size 1

FAQ

vLLM과 Ollama의 차이는 무엇인가요?

Ollama는 로컬 개발과 실험을 위한 간편한 도구이고, vLLM은 프로덕션 수준의 서빙을 위한 고성능 엔진입니다. Ollama는 설치와 사용이 매우 간단하지만, PagedAttention이나 Continuous Batching 같은 고급 최적화를 제공하지 않습니다. 프로덕션 트래픽을 처리해야 한다면 vLLM을 권장합니다.

vLLM은 어떤 GPU에서 동작하나요?

vLLM은 CUDA 지원 NVIDIA GPU에서 동작합니다. A100, H100이 최적이며, RTX 3090/4090에서도 사용 가능합니다. AMD GPU (ROCm)도 실험적으로 지원합니다. 최소 요구 사항은 모델 크기와 양자화 수준에 따라 다릅니다.

gpu-memory-utilization을 1.0으로 설정하면 안 되나요?

권장하지 않습니다. GPU에는 KV Cache 외에도 CUDA 커널, cuBLAS 워크스페이스, 임시 텐서 등을 위한 메모리가 필요합니다. 0.9~0.95가 대부분의 경우 안전한 상한선이며, 1.0으로 설정하면 OOM이 자주 발생합니다.

Tensor Parallelism과 Pipeline Parallelism의 차이는 무엇인가요?

Tensor Parallelism(TP)은 하나의 레이어를 여러 GPU에 분할하여 병렬 처리합니다. 레이어 간 통신이 필요하므로 GPU 간 빠른 인터커넥트(NVLink)가 중요합니다. Pipeline Parallelism(PP)은 레이어 그룹을 각 GPU에 할당합니다. 통신 요구 사항이 적지만 "파이프라인 버블"로 인한 비효율이 있습니다. 일반적으로 단일 노드 내에서는 TP를, 노드 간에서는 PP를 사용합니다.

Speculative Decoding은 항상 빠른가요?

아닙니다. Speculative Decoding은 배치 크기가 작고 GPU가 compute-bound일 때 가장 효과적입니다. 배치 크기가 클 때는 드래프트 모델의 오버헤드가 이점을 상쇄할 수 있습니다. 또한 드래프트 모델의 수락률이 낮으면 (특수한 도메인, 코드 생성 등) 성능 향상이 미미합니다.

vLLM에서 스트리밍 응답을 사용하는 것이 좋은가요?

대부분의 경우 권장합니다. 스트리밍은 사용자 체감 지연 시간을 크게 줄여줍니다. OpenAI 호환 API에서 stream=true로 설정하면 SSE(Server-Sent Events)로 토큰을 실시간 전달합니다. 단, 스트리밍 응답의 완전성 검증과 에러 처리 로직이 추가로 필요합니다.

참고 자료

Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180
vLLM Project. "vLLM: Easy, Fast, and Cheap LLM Serving." GitHub Repository
vLLM Documentation. Official Docs
Leviathan, Y. et al. (2023). "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192
NVIDIA. "TensorRT-LLM." GitHub Repository
HuggingFace. "Text Generation Inference (TGI)." GitHub Repository
Zheng, L. et al. (2023). "Efficiently Programming Large Language Models using SGLang." arXiv:2312.07104
Agrawal, A. et al. (2024). "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve." arXiv:2403.02310
Anyscale. "Ray Serve for LLM Serving." Documentation
Kubernetes. "GPU Scheduling." Documentation

마무리

vLLM은 LLM 프로덕션 서빙의 사실상 표준으로 자리잡았으며, 빠른 개발 속도와 활발한 커뮤니티 덕분에 기능이 지속적으로 확장되고 있습니다.

핵심 요약:

PagedAttention은 메모리 효율의 게임체인저입니다. KV Cache 낭비를 5% 미만으로 줄여 동일 하드웨어에서 훨씬 많은 동시 요청을 처리합니다.
Continuous Batching으로 처리량을 극적으로 개선합니다. 정적 배칭 대비 2~5배 향상됩니다.
설정 최적화가 성능의 핵심입니다. gpu-memory-utilization, max-model-len, max-num-seqs의 적절한 조합이 중요합니다.
Prefix Caching과 Speculative Decoding은 시나리오에 따라 추가적인 성능 향상을 제공합니다.
모니터링은 필수입니다. TTFT, TBT, 처리량, KV Cache 사용률을 지속적으로 추적해야 합니다.
Kubernetes 배포 시 GPU 리소스 관리와 오토스케일링 전략을 신중하게 설계해야 합니다.

프레임워크 선택 시 vLLM은 "빠른 시작 + 높은 성능 + 넓은 호환성"의 균형이 가장 뛰어나며, 특별한 이유가 없다면 첫 번째 선택지로 추천합니다.