The Complete vLLM Guide — From PagedAttention to Production Optimization
- Introduction
- Understanding the KV Cache Problem
- PagedAttention
- Continuous Batching
- vLLM Installation and Basic Usage
- Parallelism Strategies
- Performance Optimization Tips
- Benchmarks
- Production Deployment
- Quiz
- Conclusion

Introduction
The key to LLM inference serving lies in balancing cost and performance. Without efficient management of the KV Cache, which occupies 60-80% of GPU memory, expensive GPUs end up being used inefficiently. vLLM is an open-source LLM inference engine developed at UC Berkeley that solves this problem with an innovative memory management technique called PagedAttention.
Understanding the KV Cache Problem
Why KV Cache is a Problem
# KV Cache size calculation during Transformer inference
def kv_cache_size_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # float16
) -> float:
    """
    KV Cache memory = 2 * L * H * D * S * B * dtype
    (2 = one each for K and V; H = number of KV heads)
    """
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes
    return total_bytes / (1024**3)

# Llama 3 70B example (GQA: 8 KV heads, not the 64 query heads)
print(kv_cache_size_gb(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    seq_len=4096,
    batch_size=1
))
# → 1.25 GB per request
# With batch_size=32?
# → 40 GB of KV Cache alone — on top of ~140 GB of FP16 weights,
#   even 2x A100 80GB leaves almost no headroom!
# (Without GQA, i.e. 64 KV heads, batch 32 would need 320 GB.)
Waste in Traditional Approaches
Traditional KV Cache allocation (contiguous memory):
Request 1: [████████████░░░░░░░░] actual 1024 tokens, 4096 reserved
Request 2: [██████░░░░░░░░░░░░░░] actual 512 tokens, 4096 reserved
Request 3: [████████████████░░░░] actual 3072 tokens, 4096 reserved
Total reserved: 4096 * 3 = 12,288 slots
Actually used: 4,608 slots
Waste ratio: 62.5%!
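The waste in the diagram is simple arithmetic; a quick sketch makes it explicit (the token counts are the ones from the diagram):

```python
def waste_ratio(reserved_per_request: int, used_tokens: list[int]) -> float:
    """Fraction of reserved KV Cache slots that go unused."""
    reserved = reserved_per_request * len(used_tokens)
    used = sum(used_tokens)
    return 1 - used / reserved

# The three requests above, each reserving 4096 slots
print(waste_ratio(4096, [1024, 512, 3072]))  # → 0.625
```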
PagedAttention
Core Idea
It applies the OS virtual memory paging concept to KV Cache:
PagedAttention KV Cache management:
Logical Blocks:
Request 1: [B0] → [B1] → [B2] → [B3]
Physical Blocks (GPU memory):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B2 │ B5 │ B1 │ B3 │ B6 │ B4 │ B7 │
│ R1 │ R1 │ R2 │ R1 │ R1 │ R2 │ R2 │FREE│
└────┴────┴────┴────┴────┴────┴────┴────┘
Page Table:
Request 1: [0→0, 1→3, 2→1, 3→4]
Request 2: [0→2, 1→5, 2→6]
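The page-table lookup in the diagram is just integer arithmetic; a minimal translation helper (the tables are the ones from the diagram, and a block size of 16 tokens is assumed here, matching vLLM's default):

```python
def translate(page_table: list[int], token_idx: int, block_size: int = 16):
    """Map a token's logical position to (physical_block, slot_in_block)."""
    logical_block, slot = divmod(token_idx, block_size)
    return page_table[logical_block], slot

# Page tables from the diagram (logical index -> physical block)
r1 = [0, 3, 1, 4]
r2 = [2, 5, 6]
print(translate(r1, 17))  # token 17 → logical block 1 → (3, 1)
print(translate(r2, 0))   # → (2, 0)
```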
# PagedAttention core structure
import torch

class PagedAttentionManager:
    def __init__(self, num_blocks: int, block_size: int, num_heads: int, head_dim: int):
        self.block_size = block_size  # e.g., 16 tokens per block
        self.num_blocks = num_blocks
        # Physical block pool (pre-allocated on GPU memory)
        self.k_cache = torch.zeros(
            num_blocks, block_size, num_heads, head_dim,
            dtype=torch.float16, device='cuda'
        )
        self.v_cache = torch.zeros_like(self.k_cache)
        # Free list
        self.free_blocks = list(range(num_blocks))
        # Per-request bookkeeping
        self.page_tables: dict[int, list[int]] = {}  # request_id -> physical block ids
        self.token_counts: dict[int, int] = {}       # request_id -> tokens stored

    def allocate_block(self) -> int:
        """Allocate a single physical block"""
        if not self.free_blocks:
            raise MemoryError("No free blocks")
        return self.free_blocks.pop()

    def free_block(self, block_id: int):
        """Return a block to the free list"""
        self.free_blocks.append(block_id)

    def append_token(self, request_id: int, key: torch.Tensor, value: torch.Tensor):
        """Add a token — allocate a new block when the current one is full"""
        page_table = self.page_tables.setdefault(request_id, [])
        count = self.token_counts.get(request_id, 0)
        slot_in_block = count % self.block_size
        if slot_in_block == 0:
            # First token of the request, or the last block is full: grab a new block
            page_table.append(self.allocate_block())
        block = page_table[-1]
        self.k_cache[block, slot_in_block] = key
        self.v_cache[block, slot_in_block] = value
        self.token_counts[request_id] = count + 1
Prefix Sharing with Copy-on-Write
# When multiple requests share the same system prompt
# Save memory with Copy-on-Write
system_prompt = "You are a helpful assistant..."
# Share KV Cache blocks of the system prompt
# Request 1: system_prompt + "What is Python?"
# Request 2: system_prompt + "Explain Docker"
# Request 3: system_prompt + "How to use Git?"
# Shared blocks:
# [System Block 0] ← shared by 3 requests (ref_count=3)
# [System Block 1] ← shared by 3 requests (ref_count=3)
# [System Block 2] ← shared by 3 requests (ref_count=3)
# Individual blocks:
# [R1 Block 3] [R2 Block 3] [R3 Block 3] ← each unique
# Memory savings: no need to copy system prompt blocks 3 times!
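The ref-count bookkeeping described in the comments above can be sketched as a toy block pool (a simplified model for illustration, not vLLM's actual implementation):

```python
class BlockPool:
    """Toy copy-on-write block pool: blocks are shared until written to."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = {}  # block_id -> number of requests sharing it

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        self.ref_count[block] += 1  # another request reuses the same prefix block
        return block

    def write(self, block: int) -> int:
        """Return a block safe to mutate, copying first if it is shared."""
        if self.ref_count[block] == 1:
            return block             # sole owner: write in place
        self.ref_count[block] -= 1   # detach from the shared copy...
        return self.allocate()       # ...and get a private block to write into

pool = BlockPool(8)
shared = pool.allocate()
pool.share(shared)               # two requests share a system-prompt block
private = pool.write(shared)     # the second writer triggers a copy
print(shared != private, pool.ref_count[shared])  # → True 1
```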
Continuous Batching
# Traditional Static Batching
# Wait until all requests finish
def static_batching(requests):
    """
    R1: ████████████ (done)
    R2: ████████████████████ (done)
    R3: ████ (done, but waiting...)
                              ↑ New batch starts only here
    """
    max_len = max(r.output_len for r in requests)
    for step in range(max_len):
        # Already finished requests still occupy GPU slots
        outputs = model.forward(requests)
    return outputs

# vLLM's Continuous Batching
def continuous_batching(scheduler):
    """
    Step 1: [R1, R2, R3] → process all
    Step 2: [R1, R2, R3] → R3 done! → insert R4 in the empty slot
    Step 3: [R1, R2, R4] → R1 done! → insert R5
    Step 4: [R5, R2, R4] → ...
    GPU utilization: ~95% (vs Static's ~50-60%)
    """
    while requests_exist():
        # Remove completed requests, add new ones
        batch = scheduler.schedule()
        # Handle Prefill and Decode separately
        prefill_batch = [r for r in batch if r.is_prefill]
        decode_batch = [r for r in batch if r.is_decode]
        if prefill_batch:
            model.forward(prefill_batch, mode="prefill")
        if decode_batch:
            model.forward(decode_batch, mode="decode")
vLLM Installation and Basic Usage
Installation
# pip install
pip install vllm
# Requires CUDA 12.1+, PyTorch 2.4+
# GPU: Compute Capability 7.0+ (V100, A100, H100, L40S, etc.)
Offline Inference
from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    max_model_len=8192,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.1,
)

# Batch inference
prompts = [
    "Explain the lifecycle of a Kubernetes Pod.",
    "What are the differences between Docker and Podman?",
    "Tell me about Redis caching strategies.",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text[:100]}...")
    # metrics.finished_time and arrival_time are timestamps;
    # their difference gives the request's wall-clock duration
    elapsed = output.metrics.finished_time - output.metrics.arrival_time
    print(f"Tokens/s: {len(output.outputs[0].token_ids) / elapsed:.1f}")
    print()
OpenAI-Compatible Server
# Run vLLM server (OpenAI API compatible)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 2 \
--enable-prefix-caching \
--max-num-seqs 256
# API call (OpenAI SDK compatible)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is vLLM?"}
],
"temperature": 0.7,
"max_tokens": 512
}'
# Python SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM does not require authentication
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Parallelism Strategies
Tensor Parallelism
# Distribute model across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--dtype bfloat16
Pipeline Parallelism
# 8 GPUs: 4-way TP × 2-way PP
vllm serve meta-llama/Llama-3.1-405B-Instruct \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2
Performance Optimization Tips
1. GPU Memory Utilization
# Fraction of total GPU memory vLLM may use (weights + activations + KV Cache).
# Default is 0.9 (90%); can be set more aggressively on a dedicated GPU
--gpu-memory-utilization 0.95
# Check number of KV Cache blocks
# Log: "# GPU blocks: 12345, # CPU blocks: 0"
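The block count in that log line translates directly into token capacity; a quick sketch (the figure below is the example value from the log line, with vLLM's default block size of 16 tokens):

```python
def kv_token_capacity(num_gpu_blocks: int, block_size: int = 16) -> int:
    """Total KV Cache token slots available across all concurrent sequences."""
    return num_gpu_blocks * block_size

# The "# GPU blocks: 12345" example from the log above
capacity = kv_token_capacity(12345)
print(capacity)          # → 197520
print(capacity // 8192)  # → 24 full-length 8K sequences at once
```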
2. Prefix Caching
# Effective when many requests share the same system prompt
--enable-prefix-caching
3. Quantization
# AWQ / GPTQ quantized checkpoints (the repo names below are illustrative —
# substitute a real quantized checkpoint of the model you serve)
vllm serve TheBloke/Llama-3.1-70B-AWQ \
--quantization awq \
--dtype auto
# GPTQ
vllm serve TheBloke/Llama-3.1-70B-GPTQ \
--quantization gptq
# FP8 (optimal on H100)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8
4. Speculative Decoding
# Speculate with small model → verify with large model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.1-8B-Instruct \
--num-speculative-tokens 5
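The speculate-then-verify loop can be illustrated with a greedy toy version. Both "models" here are stand-in functions, and real speculative decoding uses a probabilistic acceptance rule rather than exact matching; the sketch only shows why each step yields between 1 and k+1 tokens:

```python
def speculative_step(draft_next, target_next, prefix: list[str], k: int = 5):
    """Draft proposes k tokens; target keeps the longest agreeing prefix
    plus one corrected token on the first mismatch."""
    # Draft phase: k cheap autoregressive steps
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # Verify phase: target checks each proposed position
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's own token replaces the miss
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Stand-in models: draft guesses the alphabet, target wants "a b c X e"
draft = lambda ctx: "abcde"[len(ctx)]
target = lambda ctx: "abcXe"[len(ctx)]
print(speculative_step(draft, target, [], k=5))  # → ['a', 'b', 'c', 'X']
```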
Benchmarks
# vLLM benchmark tool
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct &
# Benchmark with ShareGPT dataset
python benchmarks/benchmark_serving.py \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--num-prompts 1000 \
--request-rate 10
Benchmark results example (A100 80GB, Llama-3.1-8B):
| Metric                 | vLLM  | TGI   | Pure HF |
|------------------------|-------|-------|---------|
| Throughput (tokens/s)  | 2,400 | 1,800 | 400     |
| TTFT p50 (ms)          | 45    | 60    | 200     |
| TTFT p99 (ms)          | 120   | 180   | 500     |
| ITL p50 (ms)           | 8     | 10    | 25      |
| Max Batch Size         | 256   | 128   | 16      |
| Memory Util.           | 95%   | 85%   | 60%     |
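One practical use of throughput numbers like these is estimating serving cost. A back-of-the-envelope sketch (the $/hour figure is an assumed example price, not a quote):

```python
def cost_per_million_tokens(tokens_per_sec: float, gpu_usd_per_hour: float) -> float:
    """USD per 1M generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

# vLLM's 2,400 tok/s from the table, at an assumed $4/hour for one A100
print(round(cost_per_million_tokens(2400, 4.0), 2))  # → 0.46
```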
Production Deployment
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --tensor-parallel-size=1
        - --gpu-memory-utilization=0.9
        - --max-model-len=8192
        - --enable-prefix-caching
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            nvidia.com/gpu: 1
            memory: 24Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
Quiz
Q1. What is the core idea behind PagedAttention?
It applies the OS virtual memory paging concept to KV Cache. Instead of contiguous memory allocation, it manages KV Cache in small block units, eliminating memory fragmentation and waste.
Q2. Why is Continuous Batching more efficient than Static Batching?
When a request completes, it is immediately removed from the batch and a new request is added. Static Batching occupies all slots until the longest request finishes, wasting GPU resources.
Q3. In what situations is Prefix Caching effective?
It is effective when multiple requests use the same system prompt. It saves redundant computation and memory by sharing the KV Cache of the common prefix.
Q4. What is the difference between Tensor Parallelism and Pipeline Parallelism?
Tensor Parallelism splits a single layer across multiple GPUs, while Pipeline Parallelism places different layers on different GPUs. TP optimizes latency, while PP optimizes throughput.
Q5. What is the principle behind Speculative Decoding?
A small model quickly generates (speculates) multiple tokens, and a large model verifies them all at once. Only verified tokens are accepted, improving speed while maintaining quality.
Q6. What does the gpu-memory-utilization parameter control?
It controls the fraction of total GPU memory vLLM is allowed to use. After model weights and activation workspace are accounted for, the remainder of that budget is pre-allocated as KV Cache blocks. At 0.9, vLLM claims up to 90% of GPU memory, leaving a larger KV Cache and thus more concurrent requests.
Q7. How many GPUs are needed at minimum to serve Llama 3.1 70B on A100 80GB?
At FP16, approximately 140GB is required, so at least 2 GPUs are needed. With AWQ/GPTQ 4-bit quantization, it is possible with just 1 GPU.
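The arithmetic behind that answer, for reference (weights only; activations and KV Cache need additional headroom on top):

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Model weight footprint in GB: parameter count times bytes per parameter."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(70, 2))    # FP16  → 140.0 GB: needs at least 2x A100 80GB
print(weight_memory_gb(70, 0.5))  # 4-bit → 35.0 GB: fits on a single A100 80GB
```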
Conclusion
vLLM has established itself as a de facto standard engine for LLM inference through PagedAttention, Continuous Batching, and various parallelism strategies. It provides an OpenAI API-compatible server, allowing adoption without changing existing client code, and its container image makes production deployment on Kubernetes straightforward.