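Complete LLM Inference Optimization Guide 2025: vLLM, TensorRT-LLM, KV Cache, Speculative Decoding

1. Understanding LLM Inference Bottlenecks: Compute-Bound vs Memory-Bound

Before optimizing LLM inference, we need to understand exactly where the bottlenecks occur.

1.1 Arithmetic Intensity and the Roofline Model

GPU performance is governed by two resources: compute throughput and memory bandwidth.

Resource	Unit	A100 80GB	H100 80GB	H200 141GB
Compute (FP16)	TFLOPS	312	989	989
Memory Bandwidth	TB/s	2.0	3.35	4.8
Arithmetic Intensity Boundary	FLOP/byte	156	295	206

Arithmetic Intensity = Total FLOPs / Total Memory Transfers (Bytes)

The boundary is peak compute divided by memory bandwidth, for example 312 TFLOPS / 2.0 TB/s = 156 FLOP/byte on the A100.

Compute-Bound: Arithmetic intensity is above the boundary. Matrix-matrix multiplication (GEMM) is the canonical example
Memory-Bound: Arithmetic intensity is below the boundary. Attention and token-by-token decoding are typical examples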
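The boundary values in the table can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the GEMM traffic model (each operand read once, the output written once) is a simplifying assumption that ignores caching effects.

```python
# Peak numbers copied from the table above: (FP16 FLOP/s, bytes/s).
GPUS = {
    "A100 80GB": (312e12, 2.0e12),
    "H100 80GB": (989e12, 3.35e12),
    "H200 141GB": (989e12, 4.8e12),
}

def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) at which compute and memory time are equal."""
    return peak_flops / bandwidth

def gemm_intensity(m, n, k, dtype_bytes=2):
    """Intensity of an (m x k) @ (k x n) matmul, assuming each matrix moves once."""
    flops = 2 * m * n * k
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)
    return flops / bytes_moved

def classify(intensity, peak_flops, bandwidth):
    """Label a kernel by comparing its intensity with the ridge point."""
    if intensity > ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"

if __name__ == "__main__":
    peak, bw = GPUS["H100 80GB"]
    # Decode-like shape: one token (m=1) against an 8192 x 8192 weight matrix.
    print(classify(gemm_intensity(1, 8192, 8192), peak, bw))
    # Prefill-like shape: 4096 tokens against the same weights.
    print(classify(gemm_intensity(4096, 8192, 8192), peak, bw))
```

With m=1 the intensity is about 1 FLOP/byte, far below every boundary in the table, which is why single-token decoding ends up memory-bound.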

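1.2 Prefill vs Decode Phases

LLM inference consists of two main phases.

┌──────────────────────────────────────────────────────┐
│                 LLM Inference Pipeline                │
├──────────────────┬───────────────────────────────────┤
│  Prefill Phase   │          Decode Phase             │
│ (prompt process) │       (token generation)          │
├──────────────────┼───────────────────────────────────┤
│ - Parallel over  │ - One token at a time             │
│   input tokens   │ - Memory-bound                    │
│ - Compute-bound  │ - Low GPU utilization (often      │
│ - High GPU util. │   5-15%)                          │
│ - Runs once      │ - Repeats per output token        │
│ - Builds KV Cache│ - Reads + appends KV Cache        │
└──────────────────┴───────────────────────────────────┘

Prefill phase: Processes the entire prompt at once. It is dominated by matrix-matrix multiplication (GEMM), which makes it compute-bound.

Decode phase: Generates tokens one at a time. It is dominated by matrix-vector multiplication (GEMV), which makes it memory-bound at small batch sizes. Every step has to read all model weights from memory, yet performs very little computation per byte read.

1.3 Why Decode Is Slow

For Llama-2 70B:

Model weights: ~140 GB (FP16)
Per decode step: all 140 GB must be read from memory
At A100 bandwidth of 2 TB/s: 140 GB / 2 TB/s = 70 ms per token
Actual computation time: ~1 ms

Reading the weights takes roughly 70x longer than the computation itself. (140 GB does not fit on a single 80 GB A100, so in practice the weights are sharded across GPUs; the ratio per GPU is the same.) This gap is the core motivation for LLM inference optimization.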
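The estimate above can be written as a small calculation. This is a rough sketch under stated assumptions: it counts only weight reads (no KV Cache traffic), assumes about 2 FLOPs per parameter per token, and uses peak hardware numbers. The function names are illustrative.

```python
def weight_read_ms(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to stream all weights once."""
    return weight_bytes / bandwidth_bytes_per_s * 1e3

def compute_ms(num_params, peak_flops, batch_size=1):
    """Approximate matmul time for one step: ~2 FLOPs per parameter per sequence."""
    return 2 * num_params * batch_size / peak_flops * 1e3

def break_even_batch(num_params, weight_bytes, peak_flops, bandwidth_bytes_per_s):
    """Batch size at which compute time catches up with the weight-read time."""
    return weight_read_ms(weight_bytes, bandwidth_bytes_per_s) / compute_ms(num_params, peak_flops)

if __name__ == "__main__":
    params, weights = 70e9, 140e9          # Llama-2 70B in FP16
    a100_flops, a100_bw = 312e12, 2.0e12   # A100 peak numbers from section 1.1
    print(f"weight read: {weight_read_ms(weights, a100_bw):.1f} ms")
    print(f"compute:     {compute_ms(params, a100_flops):.2f} ms")
    print(f"break-even batch: {break_even_batch(params, weights, a100_flops, a100_bw):.0f}")
```

The break-even batch size comes out near the A100 boundary of 156 from section 1.1, because one weight read is shared by every sequence in the batch. That is the reason batching is so effective during decode.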

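2. KV Cache: The Core Data Structure of LLM Inference

2.1 What Is the KV Cache

Transformer self-attention needs the Key (K) and Value (V) of every previous token. The KV Cache stores the K and V tensors that have already been computed so they are not recomputed at each step.

# Without KV Cache: step t recomputes K and V for all t tokens
# K/V entries computed over n generated tokens: O(n^2 * d)

# With KV Cache: each step computes K and V only for the new token
# K/V entries computed over n generated tokens: O(n * d)
# Cost: O(n * d) extra memory per layer; attention still reads all cached K/V each step

2.2 KV Cache Memory Calculation

KV Cache Size = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_size

The leading 2 accounts for storing both K and V.

Example: Llama-2 70B, seq_len=4096, batch_size=1, FP16
= 2 * 80 * 8 * 128 * 4096 * 1 * 2 bytes
= 1.34 GB (for a single sequence!)

With batch_size=32: 1.34 * 32 = 42.9 GB

Model	Parameters	KV Cache/token (FP16)	4K seq x1	4K seq x32
Llama-2 7B	7B	512 KB	2.15 GB	68.7 GB
Llama-2 70B	70B	320 KB	1.34 GB	42.9 GB
Mixtral 8x7B	46.7B	128 KB	0.54 GB	17.2 GB
Llama-3.1 405B	405B	504 KB	2.11 GB	67.6 GB

Each per-token value follows from the formula with the model's layer count, KV-head count, and a head_dim of 128. Llama-2 7B uses MHA (32 layers, 32 KV heads), so its cache per token is larger than that of the 70B model, which uses GQA with 8 KV heads. Mixtral 8x7B (32 layers) and Llama-3.1 405B (126 layers) also use 8 KV heads.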
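The formula in section 2.2 translates directly into code. The sketch below reproduces the Llama-2 70B worked example; the function and constant names are illustrative, and the architecture values are the ones used in that example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, dtype_bytes=2):
    """KV Cache size in bytes; the leading 2 covers both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Values from the worked example: 80 layers, 8 KV heads, head_dim 128.
LLAMA2_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    per_token = kv_cache_bytes(**LLAMA2_70B, seq_len=1)
    one_seq = kv_cache_bytes(**LLAMA2_70B, seq_len=4096)
    batch_32 = kv_cache_bytes(**LLAMA2_70B, seq_len=4096, batch_size=32)
    print(f"per token: {per_token / 1024:.0f} KB")
    print(f"4K seq x1: {one_seq / 1e9:.2f} GB")
    print(f"4K seq x32: {batch_32 / 1e9:.1f} GB")
```

Passing dtype_bytes=1 models an FP8 cache and halves every figure.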

2.3 PagedAttention (Core of vLLM)

The problem with traditional approaches: each sequence pre-allocates contiguous memory for its maximum possible length. In practice, 60-80% of that memory is wasted.

┌─────────────────────────────────────────────┐
│    Traditional KV Cache Allocation           │
│                                             │
│  Request 1: [████████░░░░░░░░░░░░]  40% used│
│  Request 2: [████████████░░░░░░░░]  60% used│
│  Request 3: [██░░░░░░░░░░░░░░░░░░]  10% used│
│              ^^^^^^^^ wasted memory          │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│       PagedAttention KV Cache Allocation     │
│                                             │
│  Physical: [B0][B1][B2][B3][B4][B5][B6][B7] │
│                                             │
│  Request 1 -> Page Table: [B0, B3, B5]      │
│  Request 2 -> Page Table: [B1, B4, B6, B7]  │
│  Request 3 -> Page Table: [B2]              │
│                                             │
│  Near-zero internal fragmentation           │
│  Non-contiguous memory blocks utilized      │
│  Copy-on-Write for shared prompts           │
└─────────────────────────────────────────────┘

Key ideas of PagedAttention:

Split the KV Cache into fixed-size blocks (pages)
Use page tables to logically link non-contiguous blocks, like OS virtual memory
Allocate blocks only on demand, so waste is limited to the last, partially filled block of each sequence
Copy-on-Write: requests that share the same prompt share its KV Cache blocks, and a block is copied only when one of them writes to it
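The page-table idea can be illustrated with a toy allocator. This is a simplified sketch, not vLLM's block manager: the class and method names are hypothetical, it tracks only block ids rather than tensors, and it leaves out Copy-on-Write sharing.

```python
class PagedKVCache:
    """Toy block allocator: each request maps to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.page_tables = {}   # request id -> physical block ids, in logical order
        self.num_tokens = {}    # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve space for one more token, allocating a block only when needed."""
        table = self.page_tables.setdefault(req_id, [])
        count = self.num_tokens.get(req_id, 0)
        if count % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the request must wait")
            table.append(self.free_blocks.pop(0))
        self.num_tokens[req_id] = count + 1

    def free(self, req_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.page_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)

    def wasted_slots(self, req_id):
        """Unused token slots; always smaller than one block."""
        return len(self.page_tables[req_id]) * self.block_size - self.num_tokens[req_id]

if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=8, block_size=16)
    for _ in range(40):
        cache.append_token("req-1")
    for _ in range(5):
        cache.append_token("req-2")
    print(cache.page_tables, cache.wasted_slots("req-1"))
```

A 40-token request occupies three 16-token blocks and wastes 8 slots, instead of reserving memory for the maximum sequence length up front.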

2.4 Prefix Caching

Reuses the KV Cache for repeated system prompts or other common prefixes.

# Enable Prefix Caching in vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_prefix_caching=True,  # turn on Prefix Caching
    max_model_len=8192,
)

# Requests that use the same system prompt
# share the KV Cache for the system prompt portion

3. Attention Optimization: FlashAttention and MQA/GQA

3.1 FlashAttention: IO-Aware Attention

The problem with standard attention:

Read the Q, K, V matrices from HBM (High Bandwidth Memory)
Compute S = Q @ K^T and write it to HBM
Compute P = softmax(S) and write it to HBM
Compute O = P @ V and write it to HBM

That is four rounds of HBM traffic, and S and P are N x N matrices for a sequence of length N. This memory traffic, not the arithmetic, is the bottleneck.

┌──────────────────────────────────────────────┐
│          FlashAttention: Core Idea           │
│                                              │
│  GPU memory hierarchy:                       │
│  ┌─────────┐  19 TB/s   ┌─────────────────┐ │
│  │  SRAM   │◄──────────►│  Compute Units   │ │
│  │ (20 MB) │            └─────────────────┘ │
│  └────┬────┘                                 │
│       │ 2-4.8 TB/s                           │
│  ┌────▼────────────────┐                     │
│  │    HBM (80-141 GB)  │                     │
│  └─────────────────────┘                     │
│                                              │
│  Strategy: split Q, K, V into tiles (blocks),│
│  do all computation in SRAM, and write only  │
│  the final result to HBM                     │
└──────────────────────────────────────────────┘
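The tiling strategy relies on computing softmax incrementally, so the full score row never has to be stored. The sketch below shows that online-softmax trick for a single query vector in plain Python. It illustrates the math only, not the CUDA kernel, and omits the 1/sqrt(d) scaling for brevity; the function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Standard attention for one query: materializes every score at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def attention_tiled(q, keys, values, tile=2):
    """Same result, processing K/V one tile at a time with running statistics."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first tile
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

if __name__ == "__main__":
    q = [0.5, -1.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(attention_reference(q, keys, values))
    print(attention_tiled(q, keys, values, tile=2))
```

Only the running maximum, the running sum, and the output accumulator persist between tiles. That small state is what lets the real kernel keep the working set in SRAM.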

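3.2 FlashAttention Version Comparison

Feature	FlashAttention-1	FlashAttention-2	FlashAttention-3
Released	2022	2023	2024
Speedup	2-4x over standard attention	About 2x over FlashAttention-1	1.5-2x over FlashAttention-2
GPU Support	A100	A100, H100	H100 (Hopper-optimized)
Key Optimizations	Tiling, recomputation	Better parallelism, warp partitioning	FP8, asynchronous copies, pipelining
Utilization of peak FLOPS	25-40% (A100)	50-73% (A100)	Up to 740 TFLOPS FP16 (75%, H100)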

3.3 Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)

These are architecture-level optimizations that shrink the KV Cache.

┌─────────────────────────────────────────────────────┐
│    Multi-Head Attention (MHA)                       │
│    Q heads: [H1][H2][H3][H4][H5][H6][H7][H8]      │
│    K heads: [H1][H2][H3][H4][H5][H6][H7][H8]      │
│    V heads: [H1][H2][H3][H4][H5][H6][H7][H8]      │
│    KV Cache: 8x                                     │
├─────────────────────────────────────────────────────┤
│    Multi-Query Attention (MQA)                      │
│    Q heads: [H1][H2][H3][H4][H5][H6][H7][H8]      │
│    K heads: [        H_shared         ]             │
│    V heads: [        H_shared         ]             │
│    KV Cache: 1x (8x smaller)                        │
├─────────────────────────────────────────────────────┤
│    Grouped-Query Attention (GQA, 2 groups)          │
│    Q heads: [H1][H2][H3][H4] | [H5][H6][H7][H8]   │
│    K heads: [   K_group1   ] | [   K_group2   ]    │
│    V heads: [   V_group1   ] | [   V_group2   ]    │
│    KV Cache: 2x (4x smaller)                        │
└─────────────────────────────────────────────────────┘

Model	Attention Type	KV Heads	Q Heads	KV Cache Reduction
GPT-J 6B	MHA	16	16	1x
Falcon-7B	MQA	1	71	71x
Llama-2 70B	GQA	8	64	8x
Llama-3 70B	GQA	8	64	8x
Mistral 7B	GQA	8	32	4x
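The reduction column is simply the ratio of query heads to KV heads. The sketch below also shows which shared KV head each query head reads, assuming consecutive query heads form a group, as in the diagram. The function names are illustrative.

```python
def kv_cache_reduction(num_q_heads, num_kv_heads):
    """How many times smaller the KV Cache is compared with MHA."""
    return num_q_heads // num_kv_heads

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the shared K/V head used by a given query head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

if __name__ == "__main__":
    # Llama-2 70B: 64 query heads share 8 KV heads.
    print(kv_cache_reduction(64, 8))
    print([kv_head_for_query_head(h, 64, 8) for h in (0, 7, 8, 63)])
```

MHA is the case num_kv_heads == num_q_heads, and MQA is the case num_kv_heads == 1.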

4. Batching Strategies: Static vs Continuous

4.1 Limits of Static Batching

Static Batching (traditional):
Time ──────────────────────────────────►

Req 1: [████████████████████████████████]  (long response)
Req 2: [████████░░░░░░░░░░░░░░░░░░░░░░]  (short response)
Req 3: [██████████████░░░░░░░░░░░░░░░░]  (medium response)
Req 4: [WAIT WAIT WAIT WAIT WAIT WAIT ]  (queued)

░ = idle GPU slot (padding), WAIT = waiting for the batch to finish
The next batch starts only after the whole batch finishes → very low throughput

4.2 Continuous Batching (In-Flight Batching)

Continuous Batching:
Time ──────────────────────────────────►

Req 1: [████████████████████████████████]
Req 2: [████████]
Req 3:          [██████████████]
Req 4:                  [████████████████]
Req 5:                          [████████]

Completed requests leave immediately → new requests join immediately
Minimal GPU idle time → throughput can improve by 10-20x

Key principles of Continuous Batching:

Remove finished requests from the batch at every iteration
Add waiting requests to the batch right away
Keep the GPU close to fully loaded at all times
Improve per-request latency as well (less time spent waiting in the queue)
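A small simulation makes the difference concrete. It is a deliberately simplified model: every decode iteration is assumed to cost the same regardless of batch occupancy, prefill is ignored, and the function names are illustrative.

```python
def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """Finished requests free their slot at every iteration."""
    waiting = list(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))  # admit new requests into free slots
        steps += 1  # one decode iteration produces one token per running request
        running = [left - 1 for left in running if left > 1]
    return steps

if __name__ == "__main__":
    lens = [32, 8, 14, 16, 8]  # output lengths in tokens
    print(static_batching_steps(lens, batch_size=3))
    print(continuous_batching_steps(lens, batch_size=3))
```

For this workload the static schedule needs 48 iterations and the continuous one 32, the length of the longest request. The gap widens as output lengths become more variable.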

4.3 Chunked Prefill

Solves the problem of a long prompt's prefill blocking in-flight decode requests.

# vLLM chunked prefill configuration
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # maximum tokens processed per scheduler step
)

# A long prompt (e.g. 32K tokens) is processed in 2048-token chunks
# Decode requests can be scheduled between chunks
# TTFT of the long request rises slightly, but overall throughput and ITL improve

5. Speculative Decoding: A Game Changer for Inference Speed

5.1 Core Idea

A small draft model quickly proposes several tokens, and the large target model verifies all of them in a single forward pass.

┌────────────────────────────────────────────────────┐
│              Speculative Decoding Flow             │
│                                                    │
│  Step 1: Draft Model (small and fast)              │
│  "The capital of France is" → [Paris][,][a][city]  │
│  Predicts 4 tokens very quickly (4 ms)             │
│                                                    │
│  Step 2: Target Model (large and accurate)         │
│  Verifies all 4 tokens in one forward pass         │
│  [Paris ✅] [, ✅] [a ❌→ "known"] [city ❌]         │
│                                                    │
│  Result: "Paris, known" (2 accepted + 1 corrected) │
│  Before: 3 target passes → now 1 target pass       │
│  Speedup: roughly 2-3x                             │
└────────────────────────────────────────────────────┘

5.2 Mathematical Guarantee: Output Quality Is Preserved

The key advantage of speculative decoding is that it preserves the target model's output distribution exactly.

Accept/reject rule:

For a draft token x, acceptance probability = min(1, p_target(x) / p_draft(x))
On rejection: resample from the normalized residual distribution norm(max(0, p_target(x) - p_draft(x)))

With this rule, the final output follows exactly the same distribution as sampling from the target model alone.
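The exactness claim can be checked numerically for a single token position. The sketch below implements the accept/reject rule over a toy vocabulary and computes the resulting output distribution in closed form. The names and the example probabilities are illustrative, and a real system applies this rule position by position to a sequence of draft tokens.

```python
import random

def residual_distribution(p_target, p_draft):
    """Normalized max(0, p_target - p_draft), used when a draft token is rejected."""
    raw = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    total = sum(raw)
    return [r / total for r in raw]

def speculative_sample(p_target, p_draft, rng=random):
    """Draw one token: propose from the draft, accept or resample from the residual."""
    vocab = range(len(p_target))
    x = rng.choices(vocab, weights=p_draft)[0]
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x
    return rng.choices(vocab, weights=residual_distribution(p_target, p_draft))[0]

def output_distribution(p_target, p_draft):
    """Exact distribution of speculative_sample, computed without sampling."""
    # P(propose x and accept) = p_draft(x) * min(1, p_target(x)/p_draft(x)) = min(p, q)
    accepted = [min(p, q) for p, q in zip(p_target, p_draft)]
    reject_prob = 1.0 - sum(accepted)
    if reject_prob < 1e-12:  # draft equals target: nothing is ever rejected
        return accepted
    residual = residual_distribution(p_target, p_draft)
    return [a + reject_prob * r for a, r in zip(accepted, residual)]

if __name__ == "__main__":
    p_target = [0.6, 0.3, 0.1]
    p_draft = [0.3, 0.5, 0.2]
    print(output_distribution(p_target, p_draft))
```

The printed distribution matches p_target even though the draft distribution is quite different. A poor draft model lowers the acceptance rate and therefore the speedup, but not the output quality.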

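5.3 Speculative Decoding Variants

# 1. Separate draft model
# Note: the argument names for speculative decoding have changed across vLLM
# releases (newer ones take a single speculative_config dict), so check the
# documentation for your installed version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# 2. Medusa heads (extra MLP heads predict several positions at once)
# No draft model: lightweight heads are added to the target model itself
# Requires training, but the memory overhead is minimal

# 3. EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)
# The draft module reuses the target model's hidden states
# Higher acceptance rate than a separate draft model

5.4 Tree Attention

Verifies multiple candidate sequences at once by arranging them as a tree.

Token position: 1        2        3
            ┌── Paris ─┬── is ── ...
The ────────┤          └── was ── ...
            ├── Lyon ── is ── ...
            └── capital ── of ── ...

All paths in the tree are verified in a single forward pass
→ higher chance that some candidate is accepted, higher throughput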
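Verifying a whole tree in one pass works because each candidate token attends only to its own ancestors. A minimal sketch of building that attention mask from parent pointers follows; the node numbering and the function name are illustrative, and the shared prompt prefix, which every node can see, is left out.

```python
def tree_attention_mask(parents):
    """mask[i][j] is True when node i may attend to node j (itself or an ancestor).

    parents[i] is the index of node i's parent, or -1 for a first-level candidate.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking every ancestor
            mask[i][j] = True
            j = parents[j]
    return mask

if __name__ == "__main__":
    # Candidate tree from the diagram:
    # 0=Paris, 1=Lyon, 2=capital, 3=is (under Paris), 4=was (under Paris),
    # 5=is (under Lyon), 6=of (under capital)
    parents = [-1, -1, -1, 0, 0, 1, 2]
    for row in tree_attention_mask(parents):
        print("".join("x" if cell else "." for cell in row))
```

Sibling branches such as "is" and "was" never see each other, so a single batched forward pass scores every path independently.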

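6. Accelerating Inference with Quantization

6.1 Data Type Comparison

Data Type	Bits	Range	Memory Savings	Quality Impact
FP32	32	Very wide	Baseline	Baseline
FP16	16	Wide	2x	Negligible
BF16	16	Same as FP32	2x	Negligible
FP8 (E4M3)	8	Moderate	4x	Very small
INT8	8	-128 to 127	4x	Small
INT4	4	-8 to 7	8x	Moderate
NF4	4	Optimized for normally distributed weights	8x	Smaller than INT4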
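To make the INT8 row concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is illustrative only: production libraries use per-channel or per-group scales and fused kernels, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-128, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid a zero scale
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    print(q, dequantize(q, scale))

    # A single outlier stretches the scale and wipes out the small values.
    q_outlier, _ = quantize_int8([0.01, 0.02, 100.0])
    print(q_outlier)
```

The second call shows why a fixed integer grid is sensitive to outliers: one large value sets the scale, and the small values all round to zero.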

6.2 Quantization Method Comparison

┌───────────────────────────────────────────────────────┐
│              Quantization Method Taxonomy             │
├─────────────────────┬─────────────────────────────────┤
│  Post-Training      │  Training-Aware                 │
│  Quantization(PTQ)  │  Quantization                   │
├─────────────────────┼─────────────────────────────────┤
│  - GPTQ (INT4)      │  - QLoRA + Merge                │
│  - AWQ (INT4)       │  - QAT (Quantization-Aware      │
│  - GGUF (various)   │    Training)                    │
│  - bitsandbytes     │                                 │
│  - SmoothQuant      │                                 │
│  - FP8 Dynamic      │                                 │
└─────────────────────┴─────────────────────────────────┘

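6.3 Major Quantization Formats in Detail

# GPTQ: layer-wise quantization that minimizes reconstruction error (based on OBQ)
# Pros: good quality even at INT4, optimized for GPU inference
# Cons: needs calibration data, and quantization itself takes a long time

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",         # calibration dataset
    desc_act=True,
    tokenizer=tokenizer,  # used to tokenize the calibration dataset
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# AWQ: Activation-aware Weight Quantization
# Key idea: find the important weight channels (by activation magnitude) and protect them
# Faster to quantize than GPTQ, with similar quality

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)  # needed by quantize()
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)

# bitsandbytes: simple INT8/NF4 quantization applied at load time
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)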

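6.4 GGUF: A Format for CPU/Metal Inference

GGUF is the quantization format used by llama.cpp. It supports a wide range of quantization levels.

GGUF Quantization	Bits	Method	Quality	Speed
Q2_K	2-3	K-quant mix	Low	Very fast
Q4_K_M	4-5	K-quant medium	Good	Fast
Q5_K_M	5-6	K-quant medium	Very good	Moderate
Q6_K	6	K-quant	Near original	Slow
Q8_0	8	Uniform quantization	Nearly identical to original	Slow
F16	16	No quantization	Original	Slowest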

7. Serving Framework Comparison: vLLM vs TensorRT-LLM vs TGI

7.1 Overall Comparison

In the table below, O means supported and X means not supported.

Feature	vLLM	TensorRT-LLM	TGI	Ollama	llama.cpp
Developer	UC Berkeley	NVIDIA	Hugging Face	Ollama	ggerganov
Language	Python/C++	C++/Python	Rust/Python	Go	C/C++
PagedAttention	O	O	O	X	X
Continuous Batching	O	O	O	X	X
Tensor Parallelism	O	O	O	X	X
FP8 Support	O	O (best)	O	X	X
Speculative Decoding	O	O	Limited	X	O
LoRA Serving	O (multi-adapter)	O	O	O	O
Vision Models	O	O	O	O	O (partial)
CPU Inference	Limited	X	X	O	O (best)
Metal (Apple)	X	X	X	O	O
Installation Difficulty	Easy	Hard	Easy	Very easy	Moderate
Production Readiness	High	High	High	Low	Medium

7.2 Throughput Benchmark (Llama-3.1 8B, A100 80GB)

Framework	Throughput (tok/s)	TTFT (ms)	ITL (ms)	Memory Usage
vLLM (FP16)	4,200	45	12	18 GB
vLLM (AWQ-4bit)	6,800	32	8	7 GB
TensorRT-LLM (FP16)	4,800	38	10	17 GB
TensorRT-LLM (FP8)	7,500	28	7	10 GB
TGI (FP16)	3,600	52	14	18 GB
llama.cpp (Q4_K_M)	120	200	35	5 GB

8. vLLM Deep Dive: From Architecture to LoRA Serving

8.1 vLLM Architecture

┌──────────────────────────────────────────┐
│              vLLM Architecture            │
│                                          │
│  ┌─────────┐     ┌──────────────────┐   │
│  │ FastAPI  │────►│   LLM Engine     │   │
│  │ Server   │     │                  │   │
│  └─────────┘     │  ┌────────────┐  │   │
│                  │  │ Scheduler   │  │   │
│  ┌─────────┐    │  │ (batching)  │  │   │
│  │ OpenAI  │────►│  └─────┬──────┘  │   │
│  │ compat  │     │        │         │   │
│  └─────────┘     │  ┌─────▼──────┐  │   │
│                  │  │ Block Mgr   │  │   │
│                  │  │ (PagedAttn) │  │   │
│                  │  └─────┬──────┘  │   │
│                  │        │         │   │
│                  │  ┌─────▼──────┐  │   │
│                  │  │  Worker(s)  │  │   │
│                  │  │ (GPU exec.) │  │   │
│                  │  └────────────┘  │   │
│                  └──────────────────┘   │
└──────────────────────────────────────────┘

8.2 Deploying vLLM in Practice

# Start the vLLM server (OpenAI API compatible)
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
#     --tensor-parallel-size 4 \
#     --max-model-len 32768 \
#     --gpu-memory-utilization 0.90 \
#     --enable-prefix-caching \
#     --enable-chunked-prefill \
#     --max-num-batched-tokens 4096 \
#     --port 8000

# Call the API from Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"},
    ],
    max_tokens=512,
    temperature=0.7,
)

8.3 Multi-LoRA Serving with vLLM

A single base model can serve several LoRA adapters at the same time.

# vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora \
#     --lora-modules \
#         sql-lora=./adapters/sql-lora \
#         code-lora=./adapters/code-lora \
#         chat-lora=./adapters/chat-lora \
#     --max-loras 3 \
#     --max-lora-rank 64

# Select a LoRA adapter by passing its name as the model in the API call
response = client.chat.completions.create(
    model="sql-lora",  # LoRA adapter name
    messages=[{"role": "user", "content": "SELECT ..."}],
)

8.4 Serving Vision Models with vLLM

# Serve a multimodal model
# vllm serve Qwen/Qwen2-VL-7B-Instruct \
#     --max-model-len 8192 \
#     --limit-mm-per-prompt image=4

from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="key")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]
    }],
)

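9. TensorRT-LLM Deep Dive: The Choice for Peak Performance

9.1 TensorRT-LLM Build Pipeline

┌────────┐     ┌───────────┐     ┌──────────┐     ┌──────────┐
│HF Model│────►│ Convert   │────►│ TRT-LLM  │────►│ Triton   │
│(source)│     │ Checkpoint│     │ Engine   │     │ Serving  │
└────────┘     └───────────┘     └──────────┘     └──────────┘
               quantization      compilation      API server

# Step 1: Convert the checkpoint and apply FP8 quantization
# (script and flag names differ between TensorRT-LLM releases; recent releases
#  produce FP8 checkpoints with examples/quantization/quantize.py --qformat fp8)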
python convert_checkpoint.py \
    --model_dir meta-llama/Llama-3.1-70B-Instruct \
    --output_dir ./checkpoint_fp8 \
    --dtype bfloat16 \
    --tp_size 4 \
    --pp_size 1 \
    --use_fp8

# Step 2: Build the TensorRT engine
trtllm-build \
    --checkpoint_dir ./checkpoint_fp8 \
    --output_dir ./engine_fp8 \
    --gemm_plugin auto \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --workers 4

9.2 TensorRT-LLM FP8 Optimization

FP8 makes full use of the FP8 Tensor Cores on H100 GPUs.

Configuration	Throughput (Llama-3.1 70B, 4xH100)	Latency
FP16, TP=4	2,400 tok/s	16ms ITL
FP8, TP=4	4,200 tok/s	9ms ITL
FP8 + Speculative	5,800 tok/s	6ms ITL
INT4 AWQ, TP=2	3,800 tok/s	11ms ITL

9.3 Inflight Batching (TensorRT-LLM)

Inflight Batching is TensorRT-LLM's implementation of Continuous Batching.

# Triton Inference Server + TensorRT-LLM backend
# config.pbtxt settings (parameter names vary across tensorrtllm_backend versions)
"""
backend: "tensorrtllm"
max_batch_size: 64

model_transaction_policy {
  decoupled: True    # enables streaming responses
}

parameters: {
  key: "gpt_model_type"
  value: {string_value: "inflight_fused_batching"}  # enable Inflight Batching
}

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {string_value: "131072"}   # cap on tokens held in the KV Cache
}
"""

10. Model Parallelism: Multi-GPU Strategies

10.1 Tensor Parallelism (TP)

Splits a single layer across multiple GPUs.

Tensor Parallelism (TP=4):

         Weight matrix W of layer N
    ┌──────┬──────┬──────┬──────┐
    │ W_1  │ W_2  │ W_3  │ W_4  │
    │GPU 0 │GPU 1 │GPU 2 │GPU 3 │
    └──┬───┴──┬───┴──┬───┴──┬───┘
       │      │      │      │
       ▼      ▼      ▼      ▼
    [part1] [part2] [part3] [part4]
       │      │      │      │
       └──────┴──────┴──────┘
              All-Reduce
         (sum partial results)

Pros: lower latency (all GPUs compute at the same time)
Cons: requires inter-GPU communication (NVLink recommended)
Best for: GPUs within the same node (latency-sensitive serving)
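The split-then-All-Reduce pattern in the diagram can be mimicked on the CPU. In the sketch below each "GPU" holds a slice of the weight matrix along the input dimension and produces a partial output, and the All-Reduce is modeled as an element-wise sum. It is a toy illustration with hypothetical function names, not a distributed implementation.

```python
def matvec(weight, x):
    """Dense y = W @ x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def row_parallel_matvec(weight, x, tp):
    """Shard W and x along the input dimension, then sum the partial outputs."""
    shard = len(x) // tp  # assumes the input dimension is divisible by tp
    partials = []
    for rank in range(tp):
        lo, hi = rank * shard, (rank + 1) * shard
        weight_shard = [row[lo:hi] for row in weight]  # this rank's slice of W
        partials.append(matvec(weight_shard, x[lo:hi]))
    # All-Reduce (sum): every rank ends up with the full result.
    return [sum(values) for values in zip(*partials)]

if __name__ == "__main__":
    W = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
    x = [1, 1, 2, 2]
    print(matvec(W, x))
    print(row_parallel_matvec(W, x, tp=2))
```

Each rank touches only 1/tp of the weights, which is where the latency gain comes from. The All-Reduce after the layer is the communication cost listed under "Cons".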

10.2 Pipeline Parallelism (PP)

Assigns consecutive groups of layers to different GPUs.

Pipeline Parallelism (PP=4, 80 layers):

GPU 0: [Layer 0-19]  → GPU 1: [Layer 20-39]
                              → GPU 2: [Layer 40-59]
                                       → GPU 3: [Layer 60-79]

Pros: minimal inter-GPU communication (one direction only)
Cons: pipeline bubbles (idle GPU time)
Best for: distribution across nodes (higher latency is acceptable)

10.3 Expert Parallelism (EP) for MoE Models

Distributes the experts of a Mixture of Experts model across GPUs.

Expert Parallelism (Mixtral 8x7B, EP=4):

GPU 0: Expert 0, 1 + Shared Layers
GPU 1: Expert 2, 3 + Shared Layers
GPU 2: Expert 4, 5 + Shared Layers
GPU 3: Expert 6, 7 + Shared Layers

Token routing: each token is sent to its Top-2 experts
→ requires All-to-All communication between GPUs

10.4 Combining Parallelism Strategies in Practice

# Serving Llama-3.1 405B (8x H100 80GB)
# Model size: ~810 GB (FP16) → ~405 GB with FP8

# Option 1: TP=8 (every layer sharded across all 8 GPUs)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --max-model-len 16384

# Option 2: TP=4, PP=2 (two pipeline stages of 4 GPUs each)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2 \
    --max-model-len 16384

11. GPU Memory Optimization in Depth

11.1 KV Cache Quantization

# FP8 KV Cache quantization in vLLM
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8 \
    --quantization fp8

# KV Cache memory savings
# FP16 KV Cache: 1.34 GB / sequence (Llama-2 70B, 4K)
# FP8 KV Cache:  0.67 GB / sequence (50% smaller)
# The same GPU can hold twice as many concurrent requests

11.2 Memory Allocation Strategy

Per-GPU memory breakdown (A100 80GB, Llama-3.1 70B FP16):

┌─────────────────────────────────┐
│  Model weights: ~35 GB (TP=4)   │  43.75%
├─────────────────────────────────┤
│  KV Cache: ~35 GB               │  43.75%
│  (gpu_memory_utilization=0.90)  │
├─────────────────────────────────┤
│  Activations: ~2 GB             │  2.5%
├─────────────────────────────────┤
│  System reserve: ~8 GB          │  10%
└─────────────────────────────────┘

The 140 GB of FP16 weights split four ways gives about 35 GB per GPU. The size of the KV Cache budget determines the maximum number of concurrent requests.
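The link between the KV Cache budget and concurrency can be estimated as follows. This is a back-of-the-envelope sketch under stated assumptions: a 35 GB per-GPU KV budget as in the diagram, a tensor-parallel group of 4 GPUs (35 GB of weights per GPU corresponds to four shards of the 140 GB FP16 model), KV heads split evenly across that group, every sequence using its full 4K context, and the 80-layer, 8-KV-head, head_dim-128 architecture from section 2.2. The function names are illustrative.

```python
def kv_bytes_per_token_per_gpu(num_layers, num_kv_heads, head_dim, dtype_bytes, tp):
    """Per-GPU KV Cache bytes for one token when KV heads are sharded over tp GPUs."""
    return 2 * num_layers * (num_kv_heads // tp) * head_dim * dtype_bytes

def max_concurrent_sequences(kv_budget_bytes_per_gpu, bytes_per_token_per_gpu, seq_len):
    """How many full-length sequences fit in the per-GPU KV Cache budget."""
    return kv_budget_bytes_per_gpu // (bytes_per_token_per_gpu * seq_len)

if __name__ == "__main__":
    budget = 35 * 10**9  # ~35 GB of KV Cache per GPU
    fp16 = kv_bytes_per_token_per_gpu(80, 8, 128, dtype_bytes=2, tp=4)
    fp8 = kv_bytes_per_token_per_gpu(80, 8, 128, dtype_bytes=1, tp=4)
    print(max_concurrent_sequences(budget, fp16, seq_len=4096))
    print(max_concurrent_sequences(budget, fp8, seq_len=4096))
```

Under these assumptions an FP8 KV Cache doubles the number of 4K sequences that fit. Real deployments usually fit more, because most requests do not use their full context and PagedAttention allocates blocks on demand.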

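11.3 Strategies When Memory Runs Out

Strategy	How	Effect	Side Effect
Quantization	FP16→INT4	Weights 4x smaller	Slight quality loss
KV Cache quantization	FP16→FP8	KV Cache 2x smaller	Negligible
Reduce max_model_len	32K→8K	KV Cache per sequence shrinks proportionally	No long-context support
Increase TP	TP=2→TP=4	Half the memory per GPU	Cost of additional GPUs
Prefix Caching	Share system prompts	Large savings on repeated prefixes	No benefit for unique requests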
12. Cost Analysis: Tokens per Dollar by Platform

12.1 Self-Hosting Cost Comparison

tokens/dollar = throughput (tok/s) * 3600 / hourly cost (USD)

GPU	Cloud Cost per Hour	Llama-3.1 70B Throughput	tokens/dollar
A100 80GB x1	~3.0 USD	800 tok/s (FP16)	960K
A100 80GB x4 (TP=4)	~12.0 USD	2,800 tok/s	840K
H100 80GB x1	~4.5 USD	1,500 tok/s (FP8)	1,200K
H100 80GB x4 (TP=4)	~18.0 USD	5,000 tok/s (FP8)	1,000K
L40S x1	~1.5 USD	600 tok/s (INT4)	1,440K
4090 x1 (own server)	~0.3 USD (electricity)	400 tok/s (INT4)	4,800K
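The tokens/dollar column is throughput multiplied by 3600 seconds and divided by the hourly cost. A small helper reproduces it from the throughput and price figures in the table; the function name is illustrative.

```python
def tokens_per_dollar(tokens_per_second, usd_per_hour):
    """Tokens generated per dollar at sustained throughput."""
    return tokens_per_second * 3600 / usd_per_hour

if __name__ == "__main__":
    rows = {
        "A100 80GB x1": (800, 3.0),
        "A100 80GB x4 (TP=4)": (2800, 12.0),
        "H100 80GB x1": (1500, 4.5),
        "H100 80GB x4 (TP=4)": (5000, 18.0),
        "L40S x1": (600, 1.5),
        "4090 x1": (400, 0.3),
    }
    for name, (tps, cost) in rows.items():
        print(f"{name}: {tokens_per_dollar(tps, cost) / 1000:,.0f}K tokens/dollar")
```

The formula assumes the GPU is busy around the clock. At 50% utilization the effective tokens/dollar halves, which matters when comparing against pay-per-token APIs.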

12.2 API vs Self-Hosting Break-Even Point

Cost by monthly token volume (Llama-3.1 70B class):

┌─────────────────────────────────────────────────┐
│ Cost                                            │
│ ($)                                             │
│ 5000│                                    /API   │
│     │                                  /        │
│ 3000│                    ────────── Self-Hosted  │
│     │              /  (H100x4 monthly fixed)     │
│ 1000│      / API                                │
│     │   /                                       │
│    0├──┬────┬────┬────┬────┬────┬────►          │
│     0  2B   5B  10B  20B  50B 100B  tokens/mo   │
│                                                 │
│  Break-even: about 10B tokens/month             │
└─────────────────────────────────────────────────┘

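13. Benchmarking: Measuring the Right Way

13.1 Key Metrics

Metric	Definition	Why It Matters
TTFT (Time To First Token)	Time until the first token is generated	When the user sees the response start
ITL (Inter-Token Latency)	Time between consecutive tokens	Perceived speed while streaming
E2E Latency	Time to complete the whole request	Total waiting time
Throughput	Tokens generated per second	Overall system capacity
TPS/User	Tokens per second per user	Speed perceived by an individual user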
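These metrics can all be derived from per-token arrival timestamps collected on the client. The sketch below is illustrative: the function names and the nearest-rank percentile are choices made for this example, and timestamps are in milliseconds.

```python
import math
import statistics

def request_metrics(send_time, token_times):
    """Compute TTFT, mean ITL, and E2E latency for one streamed request."""
    ttft = token_times[0] - send_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft": ttft,
        "itl_mean": statistics.mean(gaps) if gaps else 0.0,
        "e2e": token_times[-1] - send_time,
    }

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

if __name__ == "__main__":
    # Request sent at t=0 ms; tokens arrive at 50, 60, 70 and 80 ms.
    print(request_metrics(0, [50, 60, 70, 80]))
    ttfts = [42, 45, 48, 51, 125]
    print(percentile(ttfts, 50), percentile(ttfts, 99))
```

Reporting percentiles next to the mean matters because a few slow requests, such as the 125 ms TTFT here, can dominate user-perceived latency while barely moving the median.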

13.2 Benchmarking Tools and Methods

# vLLM's built-in benchmark (recommended)
# After starting python -m vllm.entrypoints.openai.api_server:

# python benchmarks/benchmark_serving.py \
#     --backend vllm \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --dataset-name sharegpt \
#     --dataset-path ShareGPT_V3_unfiltered.json \
#     --num-prompts 1000 \
#     --request-rate 10 \
#     --endpoint /v1/completions

# Example output:
# Successful requests:                     1000
# Benchmark duration (s):                  105.23
# Total input tokens:                      215000
# Total generated tokens:                  180000
# Request throughput (req/s):              9.50
# Output token throughput (tok/s):         1710.5
# Mean TTFT (ms):                          48.2
# Median TTFT (ms):                        42.1
# P99 TTFT (ms):                           125.3
# Mean ITL (ms):                           11.8
# Median ITL (ms):                         10.2
# P99 ITL (ms):                            35.7

13.3 Performance Under Load

Throughput and latency as concurrency increases:

Throughput                      Latency
(tok/s)                        (ms)
  │        ┌──────────           │              /
  │       /                      │            /
  │      /                       │          /
  │     /                        │        /
  │    /                         │      /
  │   /                          │    /
  │  /                           │  /
  │ /                            │/
  ├──────────────► concurrency   ├──────────────► concurrency

  Best operating point: just before throughput saturates (the knee point)
  Typically around 70-80% GPU utilization

14. Production Deployment Architecture

14.1 Production Serving Architecture

┌──────────────────────────────────────────────────┐
│                Production Architecture            │
│                                                  │
│  Client → Load Balancer → API Gateway            │
│                              │                   │
│                    ┌─────────┼─────────┐         │
│                    ▼         ▼         ▼         │
│              ┌─────────┐┌────────┐┌────────┐    │
│              │ vLLM    ││ vLLM   ││ vLLM   │    │
│              │ Pod 1   ││ Pod 2  ││ Pod 3  │    │
│              │ (4xH100)││(4xH100)││(4xH100)│    │
│              └────┬────┘└───┬────┘└───┬────┘    │
│                   │         │         │          │
│              ┌────▼─────────▼─────────▼────┐    │
│              │     Prometheus + Grafana     │    │
│              │   (metrics and dashboards)   │    │
│              └─────────────────────────────┘    │
│                                                  │
│  Autoscaling: based on queue length / GPU util.  │
│  Health Check: /health endpoint                  │
│  Graceful Shutdown: finish in-flight requests    │
└──────────────────────────────────────────────────┘

14.2 Kubernetes Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
          - "--model=meta-llama/Llama-3.1-70B-Instruct"
          - "--tensor-parallel-size=4"
          - "--max-model-len=16384"
          - "--gpu-memory-utilization=0.90"
          - "--enable-prefix-caching"
          - "--enable-chunked-prefill"
        resources:
          limits:
            nvidia.com/gpu: "4"
            memory: "64Gi"
          requests:
            nvidia.com/gpu: "4"
            memory: "32Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      nodeSelector:
        gpu-type: h100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

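15. Quiz

Q1. What core problem does vLLM's PagedAttention solve?

Answer: It solves the memory fragmentation problem of the KV Cache.

Traditional approaches pre-allocated contiguous memory for the maximum length of every sequence, wasting 60-80% of it. PagedAttention splits the KV Cache into fixed-size blocks (pages) and links non-contiguous blocks through a page table, like OS virtual memory. This gives:

Near-zero internal fragmentation
Use of non-contiguous memory blocks
Shared KV Cache for common prefixes through Copy-on-Write

As a result, the same GPU memory can serve 2-4x more concurrent requests.

Q2. Why does Continuous Batching achieve higher throughput than Static Batching?

Answer: Static Batching waits for every request in the batch to finish before starting the next batch. Even when short responses finish early, their GPU slots sit idle.

Continuous Batching:

Removes finished requests at every iteration
Admits new requests from the waiting queue immediately
Keeps the GPU close to fully loaded

This can yield 10-20x higher throughput than Static Batching. Per-request latency also improves because requests spend less time waiting in the queue.

Q3. Why does Speculative Decoding not degrade output quality?

Answer: Because it mathematically preserves the target model's output distribution.

For a token x proposed by the draft model:

Acceptance probability = min(1, p_target(x) / p_draft(x))
On rejection: resample from the normalized residual distribution norm(max(0, p_target - p_draft))

With this rule, the final output follows exactly the same distribution as sampling from the target model alone. Only speed changes; there is no quality loss.

Q4. Why is the LLM decode phase memory-bound?

Answer: The decode phase generates one token at a time. Each step must read all model weights from memory (a matrix-vector multiplication), while the amount of computation is very small.

Llama-2 70B example:

140 GB of model weights must be read at every step
At A100 bandwidth of 2 TB/s: 70 ms (memory reads)
Actual computation time: ~1 ms

Memory bandwidth is the bottleneck, so decode is memory-bound. This is why quantization (smaller weights) and batching (one weight read serves many requests) are so effective.

Q5. Why is FP8 quantization a better fit for LLM inference than INT8?

Answer: FP8 is a floating-point format, so it has a wide dynamic range. LLM weights and activations span many orders of magnitude, so FP8 handles them better than INT8, an integer format with uniform spacing between representable values.

Specifically:

FP8 E4M3: 4 exponent bits, 3 mantissa bits → wide range, moderate precision
INT8: fixed range of -128 to 127 → sensitive to outliers
H100 GPUs include FP8 Tensor Cores with 2x the compute throughput of FP16
FP8 supports dynamic quantization without a separate calibration step

As a result, FP8 delivers INT8-class memory and speed gains while keeping quality close to FP16.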

16. References

FlashAttention: Fast and Memory-Efficient Exact Attention - Dao et al., 2022
- https://arxiv.org/abs/2205.14135
FlashAttention-2: Faster Attention with Better Parallelism - Dao, 2023
- https://arxiv.org/abs/2307.08691
Efficient Memory Management for Large Language Model Serving with PagedAttention - Kwon et al., 2023
- https://arxiv.org/abs/2309.06180
Fast Inference from Transformers via Speculative Decoding - Leviathan et al., 2023
- https://arxiv.org/abs/2211.17192
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Frantar et al., 2023
- https://arxiv.org/abs/2210.17323
AWQ: Activation-aware Weight Quantization - Lin et al., 2024
- https://arxiv.org/abs/2306.00978
TensorRT-LLM - NVIDIA Official Documentation
- https://nvidia.github.io/TensorRT-LLM/
Orca: A Distributed Serving System for Transformer-Based Generative Models - Yu et al., 2022
- https://www.usenix.org/conference/osdi22/presentation/yu
GQA: Training Generalized Multi-Query Transformer Models - Ainslie et al., 2023
- https://arxiv.org/abs/2305.13245
Medusa: Simple LLM Inference Acceleration Framework - Cai et al., 2024
- https://arxiv.org/abs/2401.10774
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty - Li et al., 2024
- https://arxiv.org/abs/2401.15077
SmoothQuant: Accurate and Efficient Post-Training Quantization - Xiao et al., 2023
- https://arxiv.org/abs/2211.10438

Complete LLM Inference Optimization Guide 2025: vLLM, TensorRT-LLM, KV Cache, Speculative Decoding

1. Understanding LLM Inference Bottlenecks: Compute-Bound vs Memory-Bound

Before diving into LLM inference optimization, we must first understand exactly where bottlenecks occur.

1.1 Arithmetic Intensity and the Roofline Model

GPU performance is governed by two resources:

Resource	Unit	A100 80GB	H100 80GB	H200 141GB
Compute (FP16)	TFLOPS	312	989	989
Memory Bandwidth	TB/s	2.0	3.35	4.8
Arithmetic Intensity Boundary	FLOP/byte	156	295	206

Arithmetic Intensity = Total FLOPs / Total Memory Transfers (Bytes)

Compute-Bound: When arithmetic intensity exceeds the boundary. Matrix-matrix multiplication (GEMM) is the canonical example
Memory-Bound: When arithmetic intensity is below the boundary. Attention and decoding are typical examples

1.2 Prefill vs Decode Phases

LLM inference consists of two main phases:

┌──────────────────────────────────────────────────────┐
│                 LLM Inference Pipeline                │
├──────────────────┬───────────────────────────────────┤
│   Prefill Phase  │         Decode Phase              │
│  (Prompt Proc.)  │      (Token Generation)           │
├──────────────────┼───────────────────────────────────┤
│ - Input tokens   │ - Generates 1 token at a time     │
│   in parallel    │ - Memory-Bound                    │
│ - Compute-Bound  │ - Low GPU utilization (5-15%)     │
│ - High GPU util  │ - Repeats for output length       │
│ - Runs once      │ - KV Cache read + append          │
│ - Creates KV $   │                                   │
└──────────────────┴───────────────────────────────────┘

Prefill Phase: Processes the entire prompt at once. Dominated by matrix-matrix multiplication (GEMM), making it compute-bound.

Decode Phase: Generates tokens one at a time. Dominated by matrix-vector multiplication (GEMV), making it memory-bound. The entire model weights must be read from memory each step, but actual computation is minimal.

1.3 Why Decode Is Slow

For Llama-2 70B:

Model weights: ~140 GB (FP16)
Per decode step: must read 140 GB from memory
At A100 bandwidth of 2 TB/s: 140 GB / 2 TB/s = 70ms per token
Actual computation time: ~1ms

Memory reading takes 70x longer than computation. This is the core motivation for LLM inference optimization.

2. KV Cache: The Core Data Structure of LLM Inference

2.1 What Is KV Cache

Transformer Self-Attention requires the Key (K) and Value (V) of all previous tokens. KV Cache stores previously computed K and V tensors to avoid recomputation.

# Without KV Cache (full recomputation each step)
# Total compute for n tokens: O(n^2 * d)

# With KV Cache (reuse previous results)
# Total compute for n tokens: O(n * d)
# But KV Cache memory: O(n * d) additional

2.2 KV Cache Memory Calculation

KV Cache Size = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_size

Example: Llama-2 70B, seq_len=4096, batch_size=1, FP16
= 2 * 80 * 8 * 128 * 4096 * 1 * 2 bytes
= 1.34 GB (for a single sequence!)

With batch_size=32: 1.34 * 32 = 42.9 GB

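Model	Parameters	KV Cache/token (FP16)	4K seq x1	4K seq x32
Llama-2 7B	7B	512 KB	2.15 GB	68.7 GB
Llama-2 70B	70B	320 KB	1.34 GB	42.9 GB
Mixtral 8x7B	46.7B	128 KB	0.54 GB	17.2 GB
Llama-3.1 405B	405B	504 KB	2.11 GB	67.6 GB

Llama-2 7B uses MHA (32 KV heads), so its per-token cache is larger than that of the GQA models, which keep only 8 KV heads.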

2.3 PagedAttention (Core of vLLM)

The problem with traditional approaches: each sequence pre-allocates contiguous memory for the maximum possible length. In practice, 60-80% of that KV Cache memory is wasted on fragmentation and over-reservation.

┌─────────────────────────────────────────────┐
│    Traditional KV Cache Allocation           │
│                                             │
│  Request 1: [████████░░░░░░░░░░░░]  40% used│
│  Request 2: [████████████░░░░░░░░]  60% used│
│  Request 3: [██░░░░░░░░░░░░░░░░░░]  10% used│
│              ^^^^^^^^ wasted memory          │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│       PagedAttention KV Cache Allocation     │
│                                             │
│  Physical Blocks: [B0][B1][B2]...[B6][B7]   │
│                                             │
│  Request 1 -> Page Table: [B0, B3, B5]      │
│  Request 2 -> Page Table: [B1, B4, B6, B7]  │
│  Request 3 -> Page Table: [B2]              │
│                                             │
│  Near-zero internal fragmentation           │
│  Non-contiguous memory blocks utilized      │
│  Copy-on-Write for shared prompts           │
└─────────────────────────────────────────────┘

Key ideas of PagedAttention:

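Split KV Cache into fixed-size blocks (pages)
Use page tables to logically link non-contiguous blocks, like OS virtual memory
Allocate blocks only on demand, so internal fragmentation is limited to the last, partially filled block of each sequence
Copy-on-Write: Requests sharing the same prompt share KV Cache blocks, and a block is copied only when one request needs to write to it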
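The ideas above can be sketched as a toy block allocator. This is not vLLM's block manager: the class and method names are hypothetical, it tracks only block ids (no tensors, so the actual data copy in copy-on-write is omitted), and the block size is tiny for readability.

```python
class PagedKVAllocator:
    """Toy allocator mapping each sequence to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}      # physical block id -> sequences using it
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.seq_lens = {}       # sequence id -> tokens stored so far

    def _allocate_block(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV cache blocks")
        block = self.free_blocks.pop(0)
        self.ref_count[block] = 1
        return block

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # Last block is full (or the sequence is new): allocate on demand.
            table.append(self._allocate_block())
        elif self.ref_count[table[-1]] > 1:
            # Copy-on-write: the last block is shared, so switch to a private one.
            self.ref_count[table[-1]] -= 1
            table[-1] = self._allocate_block()
        self.seq_lens[seq_id] = length + 1

    def fork(self, parent: str, child: str) -> None:
        """Let a new sequence share all of the parent's blocks (same prompt)."""
        self.block_tables[child] = list(self.block_tables[parent])
        self.seq_lens[child] = self.seq_lens[parent]
        for block in self.block_tables[child]:
            self.ref_count[block] += 1

    def free(self, seq_id: str) -> None:
        for block in self.block_tables.pop(seq_id):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                del self.ref_count[block]
                self.free_blocks.append(block)
        del self.seq_lens[seq_id]


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(6):
    alloc.append_token("req1")   # 6 tokens -> two blocks
alloc.fork("req1", "req2")       # req2 shares both blocks
alloc.append_token("req2")       # shared last block -> copy-on-write
print(alloc.block_tables)
```

At most `block_size - 1` slots are left unused per sequence, which is the "near-zero internal fragmentation" property.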

2.4 Prefix Caching

Reuses KV Cache for repeated system prompts or common prefixes.

# Enable Prefix Caching in vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_prefix_caching=True,
    max_model_len=8192,
)

# Requests using the same system prompt
# share the KV Cache for the system prompt portion
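One way to picture prefix caching is block-level hashing: each full block of tokens is hashed together with everything before it, so two requests hit the same cached blocks only while their prefixes agree. The sketch below is a simplified illustration under that assumption, not vLLM's implementation; `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; small for illustration)


def block_hashes(token_ids):
    """Hash each full block together with all tokens that precede it."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        chunk = ",".join(map(str, token_ids[start:start + BLOCK_SIZE])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self):
        self.cached = set()  # hashes of blocks whose KV tensors are resident

    def process(self, token_ids):
        """Return (tokens reused from cache, tokens that still need prefill)."""
        hashes = block_hashes(token_ids)
        hits = 0
        for h in hashes:
            if h not in self.cached:
                break
            hits += 1
        self.cached.update(hashes)
        reused = hits * BLOCK_SIZE
        return reused, len(token_ids) - reused


cache = PrefixCache()
system_prompt = list(range(100, 110))                 # 10 shared tokens
first = cache.process(system_prompt + [1, 2, 3])      # cold cache
second = cache.process(system_prompt + [7, 8, 9, 10]) # shares two full blocks
print(first, second)
```

Only whole blocks are reused, so the last two tokens of the shared prompt are prefilled again in the second request.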

3. Attention Optimization: FlashAttention and MQA/GQA

3.1 FlashAttention: IO-Aware Attention

Problems with standard attention:

Read Q, K, V matrices from HBM (High Bandwidth Memory)
Compute S = Q @ K^T and write to HBM
Compute P = softmax(S) and write to HBM
Compute O = P @ V and write to HBM

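Each of these four steps is a round-trip to HBM, and the intermediate matrices S and P are N x N in the sequence length. This memory traffic, not the arithmetic, is the bottleneck.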

┌──────────────────────────────────────────────┐
│          FlashAttention Core Idea             │
│                                              │
│  GPU Memory Hierarchy:                       │
│  ┌─────────┐  19 TB/s   ┌─────────────────┐ │
│  │  SRAM   │<---------->│  Compute Units   │ │
│  │ (20 MB) │            └─────────────────┘ │
│  └────┬────┘                                 │
│       | 2-4.8 TB/s                           │
│  ┌────v────────────────┐                     │
│  │    HBM (80-141 GB)  │                     │
│  └─────────────────────┘                     │
│                                              │
│  Strategy: Split Q,K,V into tiles (blocks),  │
│  perform all computation in SRAM,            │
│  write only final results to HBM             │
└──────────────────────────────────────────────┘
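The tiling strategy relies on an online softmax: K/V are consumed tile by tile while a running maximum, a running sum, and an output accumulator are rescaled as needed. The sketch below shows this for a single query row in plain Python. It illustrates the numerical trick only, not the GPU kernel, and the sample vectors are made up.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention: materialize all scores, then softmax, then weight V."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


def attention_tiled(q, keys, values, tile=2):
    """Process K/V in tiles, keeping only O(1) softmax state per query row."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, -0.3, 1.5], [-1.0, 2.0, 0.1],
        [0.7, 0.7, 0.7], [3.0, -2.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.9]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
print(ref, tiled)
```

Because the full score row is never stored, the N x N matrices S and P never need to be written to HBM.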

3.2 FlashAttention Version Comparison

Feature	FlashAttention-1	FlashAttention-2	FlashAttention-3
Release	2022	2023	2024
Speedup	2-4x over standard attention	~2x over FlashAttention-1	1.5-2x over FlashAttention-2 (FP16)
GPU Support	A100	A100, H100	H100 (Hopper optimized)
Key Optimization	Tiling, recomputation	Improved parallelism, better work partitioning across warps	FP8, asynchronous copy, pipelining
Peak FLOPS Utilization	25-40% (A100)	50-73% (A100)	Up to 740 TFLOPS FP16 (75%, H100)

3.3 Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)

Architecture-level optimization to reduce KV Cache size:

┌─────────────────────────────────────────────────────┐
│    Multi-Head Attention (MHA)                       │
│    Q heads: [H1][H2][H3][H4][H5][H6][H7][H8]      │
│    K heads: [H1][H2][H3][H4][H5][H6][H7][H8]      │
│    V heads: [H1][H2][H3][H4][H5][H6][H7][H8]      │
│    KV Cache: 8x                                     │
├─────────────────────────────────────────────────────┤
│    Multi-Query Attention (MQA)                      │
│    Q heads: [H1][H2][H3][H4][H5][H6][H7][H8]      │
│    K heads: [        H_shared         ]             │
│    V heads: [        H_shared         ]             │
│    KV Cache: 1x (8x reduction)                      │
├─────────────────────────────────────────────────────┤
│    Grouped-Query Attention (GQA, 2 groups)          │
│    Q heads: [H1][H2][H3][H4] | [H5][H6][H7][H8]   │
│    K heads: [   K_group1   ] | [   K_group2   ]    │
│    V heads: [   V_group1   ] | [   V_group2   ]    │
│    KV Cache: 2x (4x reduction)                      │
└─────────────────────────────────────────────────────┘

Model	Attention Type	KV Heads	Q Heads	KV Cache Reduction
GPT-J 6B	MHA	16	16	1x
Falcon-7B	MQA	1	71	71x
Llama-2 70B	GQA	8	64	8x
Llama-3 70B	GQA	8	64	8x
Mistral 7B	GQA	8	32	4x
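The grouping can be expressed as a simple index mapping from query heads to the K/V head they read. The helper names below are illustrative; the head counts are the ones shown in the diagram and table above.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Index of the K/V head that a given query head attends with."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_cache_reduction(num_q_heads: int, num_kv_heads: int) -> int:
    """How many times smaller the KV Cache is compared with MHA."""
    return num_q_heads // num_kv_heads


mha = [kv_head_for_query_head(h, 8, 8) for h in range(8)]
gqa = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
mqa = [kv_head_for_query_head(h, 8, 1) for h in range(8)]
print(mha, gqa, mqa)
print(kv_cache_reduction(64, 8), kv_cache_reduction(32, 8))
```

MHA and MQA are the two extremes of the same mapping: one KV head per query head, or one KV head for all of them.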

4. Batching Strategies: Static vs Continuous

4.1 Limitations of Static Batching

Static Batching (traditional):
Time ──────────────────────────────────>

Req 1: [████████████████████████████████]  (long response)
Req 2: [████████░░░░░░░░░░░░░░░░░░░░░░]  (short response)
Req 3: [██████████████░░░░░░░░░░░░░░░░]  (medium response)
Req 4: [WAIT WAIT WAIT WAIT WAIT WAIT ]  (waiting)

░ = GPU idle (padding), WAIT = waiting for batch to complete
Entire batch must finish before next batch starts -> very low throughput

4.2 Continuous Batching (In-Flight Batching)

Continuous Batching:
Time ──────────────────────────────────>

Req 1: [████████████████████████████████]
Req 2: [████████]
Req 3:          [██████████████]
Req 4:                  [████████████████]
Req 5:                          [████████]

Completed requests are removed immediately -> new requests are added immediately
GPU idle time minimized -> up to 10-20x throughput improvement, depending on workload

Core principles of Continuous Batching:

At every iteration, remove completed requests from the batch
Immediately add waiting requests to the freed slots
The GPU stays busy as long as requests are waiting in the queue
Individual request latency also improves (reduced wait time)
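A small simulation shows the effect of these rules. It is a deliberately simple model: every iteration produces one token for each running request, a fixed number of batch slots is available, and prefill cost is ignored. The request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave and queued requests join at every iteration."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Generate one token per request and drop those that just finished.
        active = [r - 1 for r in active if r > 1]
    return steps


lengths = [32, 8, 14, 16, 8]  # output lengths in tokens (illustrative)
static_steps = static_batching_steps(lengths, batch_size=3)
continuous_steps = continuous_batching_steps(lengths, batch_size=3)
print(static_steps, continuous_steps)
```

In this toy case the gain is modest; it grows with the variance in response lengths and with the depth of the queue.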

4.3 Chunked Prefill

Solves the problem of long prompt prefill blocking decode requests.

# vLLM chunked prefill configuration
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # max tokens per iteration
)

# A long prompt (e.g., 32K tokens) is split into 2048-token chunks
# Decode requests can also be processed between chunks
# Slightly increases TTFT but improves overall system throughput and ITL
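The interleaving described in the comments can be sketched as a per-iteration token budget. The function below is a simplified model, not vLLM's scheduler: it assumes running decode requests are scheduled first (one token each) and the rest of the budget goes to the next prompt chunk. The function name and the 48 concurrent decodes are hypothetical.

```python
def schedule_chunked_prefill(prompt_len, num_decoding, max_num_batched_tokens):
    """Return a list of (prefill_tokens, decode_tokens) per iteration."""
    plan = []
    remaining = prompt_len
    while remaining > 0:
        decode_tokens = min(num_decoding, max_num_batched_tokens)
        budget = max_num_batched_tokens - decode_tokens
        if budget <= 0:
            raise ValueError("no token budget left for prefill")
        chunk = min(remaining, budget)
        plan.append((chunk, decode_tokens))
        remaining -= chunk
    return plan


plan = schedule_chunked_prefill(prompt_len=32768, num_decoding=48,
                                max_num_batched_tokens=2048)
print(len(plan), plan[0], plan[-1])
```

The long prompt now takes several iterations to prefill, but none of the running decode requests is stalled while it does.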

5. Speculative Decoding: A Game Changer for Inference Speed

5.1 Core Idea

A small Draft Model quickly predicts multiple tokens, and a large Target Model verifies them all in a single forward pass.

┌────────────────────────────────────────────────────┐
│           Speculative Decoding Flow                │
│                                                    │
│  Step 1: Draft Model (small, fast)                 │
│  "The capital of France is" -> [Paris][,][a][city] │
│  4 tokens predicted very quickly (4ms)             │
│                                                    │
│  Step 2: Target Model (large, accurate)            │
│  Single forward pass verifies all 4 tokens         │
│  [Paris OK] [, OK] [a FAIL->"known"] [city FAIL]  │
│                                                    │
│  Result: "Paris, known" (2 accepted + 1 corrected) │
│  Before: 3 forward passes needed -> now 1          │
│  Speedup: ~2-3x                                    │
└────────────────────────────────────────────────────┘

5.2 Mathematical Guarantee: Preserving Output Quality

The key advantage of Speculative Decoding is that it exactly preserves the target model's output distribution.

Acceptance/rejection probability:

For draft token x: acceptance probability = min(1, p_target(x) / p_draft(x))
On rejection: resample from the residual distribution max(0, p_target(x) - p_draft(x)), renormalized to sum to 1

With this accept/reject rule, every emitted token is distributed exactly as if it had been sampled from the target model alone.
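The guarantee can be checked empirically for a single position. The sketch below uses made-up three-token distributions, accepts a draft token with probability min(1, p_target / p_draft), and on rejection samples from the positive part of p_target - p_draft after renormalizing it. The function name is illustrative.

```python
import random


def speculative_accept(draft_token, p_target, p_draft, rng):
    """Accept or replace one draft token; distributions are dicts token -> prob."""
    accept_prob = min(1.0, p_target.get(draft_token, 0.0) / p_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # Rejected: sample from the renormalized residual max(0, p_target - p_draft).
    residual = {t: max(0.0, p - p_draft.get(t, 0.0)) for t, p in p_target.items()}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False


# Illustrative distributions, not taken from any real model.
p_target = {"Paris": 0.6, "Lyon": 0.1, "the": 0.3}
p_draft = {"Paris": 0.3, "Lyon": 0.5, "the": 0.2}

rng = random.Random(0)
trials = 100_000
counts = {t: 0 for t in p_target}
for _ in range(trials):
    draft = rng.choices(list(p_draft), weights=list(p_draft.values()), k=1)[0]
    token, _accepted = speculative_accept(draft, p_target, p_draft, rng)
    counts[token] += 1

empirical = {t: c / trials for t, c in counts.items()}
print(empirical)
```

The empirical frequencies track p_target rather than p_draft, even though every candidate token came from the draft distribution.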

5.3 Speculative Decoding Variants

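# 1. Separate Draft Model
# (argument names below follow older vLLM releases; newer releases configure
#  this through a speculative_config dict instead, so check your version)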
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# 2. Medusa Heads (additional MLP heads predict multiple positions)
# No draft model needed - adds lightweight heads to target model itself
# Requires training but minimal memory overhead

# 3. EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)
# Draft model reuses target model's hidden states
# Higher acceptance rate than separate draft models

5.4 Tree Attention

Verifies multiple candidate sequences simultaneously in a tree structure.

Token position:    1        2        3
                 +-- Paris --+-- is -- ...
The -------------+           +-- was -- ...
                 +-- Lyon --- is -- ...
                 +-- capital - of -- ...

All tree paths verified in a single forward pass
-> Maximizes acceptance rate, improves throughput
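Verifying a tree in one pass works because each candidate token attends only to itself and its ancestors. The sketch below builds that mask from a parent array. The node layout mirrors part of the diagram above, and the attention to the already accepted prefix, which every node also sees, is left out.

```python
def tree_attention_mask(parents):
    """parents[i] is the parent index of node i, or -1 for a first-level candidate.

    Returns mask[i][j] == True when node i may attend to node j,
    i.e. when j is node i itself or one of its ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# 0: "Paris", 1: "is" (after Paris), 2: "was" (after Paris),
# 3: "Lyon",  4: "is" (after Lyon)
parents = [-1, 0, 0, -1, 3]
mask = tree_attention_mask(parents)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Sibling branches never see each other, so each root-to-leaf path is scored as if it had been run alone.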

6. Quantization for Inference Acceleration

6.1 Data Type Comparison

Data Type	Bits	Range	Memory Savings	Quality Impact
FP32	32	Very wide	Baseline	Baseline
FP16	16	Wide	2x	Negligible
BF16	16	Same as FP32	2x	Negligible
FP8 (E4M3)	8	Medium	4x	Very small
INT8	8	-128 to 127	4x	Small
INT4	4	-8 to 7	8x	Moderate
NF4	4	Normal dist. optimized	8x	Less than INT4
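The integer rows of the table can be illustrated with symmetric round-to-nearest quantization. This is a minimal sketch with a single shared scale and made-up weights. Production methods such as GPTQ or AWQ use per-group scales and calibration, which is not shown here.

```python
def quantize_symmetric(values, bits):
    """Round values to signed integers of the given width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax or 1.0
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


def max_abs_error(values, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    quantized, scale = quantize_symmetric(values, bits)
    restored = dequantize(quantized, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


weights = [0.02, -0.5, 0.31, 1.27, -0.9, 0.004]  # illustrative values
q8, scale8 = quantize_symmetric(weights, 8)
q4, scale4 = quantize_symmetric(weights, 4)
print(q8, q4)
print(max_abs_error(weights, 8), max_abs_error(weights, 4))
```

With only 16 levels, INT4 collapses the small weights to zero, which is why 4-bit methods add group-wise scales and protect sensitive channels.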

6.2 Quantization Technique Comparison

┌───────────────────────────────────────────────────────┐
│            Quantization Technique Classification      │
├─────────────────────┬─────────────────────────────────┤
│  Post-Training      │  Training-Aware                 │
│  Quantization(PTQ)  │  Quantization                   │
├─────────────────────┼─────────────────────────────────┤
│  - GPTQ (INT4)      │  - QLoRA + Merge                │
│  - AWQ (INT4)       │  - QAT (Quantization-Aware      │
│  - GGUF (various)   │    Training)                    │
│  - bitsandbytes     │                                 │
│  - SmoothQuant      │                                 │
│  - FP8 Dynamic      │                                 │
└─────────────────────┴─────────────────────────────────┘

6.3 Major Quantization Formats in Detail

# GPTQ: Layer-wise optimal quantization (OBQ-based)
# Pros: Good quality even at INT4, optimized for GPU inference
# Cons: Requires calibration data, slow quantization

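from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"
# A tokenizer is needed to prepare the named calibration dataset
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",          # calibration data
    desc_act=True,         # process columns in order of decreasing activation size
    tokenizer=tokenizer,
)
# Passing a GPTQConfig with a dataset quantizes the model while loading it
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)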

# AWQ: Activation-aware Weight Quantization
# Key: Finds and protects important weight channels (based on activation magnitude)
# Faster quantization than GPTQ, similar quality

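from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoAWQForCausalLM.from_pretrained(model_path)
# The tokenizer is used to tokenize the calibration text
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)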

# bitsandbytes: Simple INT8/NF4 quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)

6.4 GGUF: Format for CPU/Metal Inference

GGUF is the model file format used by llama.cpp. It supports a range of quantization levels.

GGUF Quantization	Bits	Method	Quality	Speed
Q2_K	2-3	K-quant mixed	Low	Very fast
Q4_K_M	4-5	K-quant medium	Good	Fast
Q5_K_M	5-6	K-quant medium	Very good	Medium
Q6_K	6	K-quant	Near original	Slow
Q8_0	8	Uniform quant	Nearly identical to original	Slow
F16	16	No quantization	Original	Slowest

7. Serving Framework Comparison: vLLM vs TensorRT-LLM vs TGI

7.1 Comprehensive Comparison

Feature	vLLM	TensorRT-LLM	TGI	Ollama	llama.cpp
Developer	UC Berkeley	NVIDIA	Hugging Face	Ollama	ggerganov
Language	Python/C++	C++/Python	Rust/Python	Go	C/C++
PagedAttention	Yes	Yes	Yes	No	No
Continuous Batching	Yes	Yes	Yes	No	Yes (server)
Tensor Parallelism	Yes	Yes	Yes	No	No
FP8 Support	Yes	Yes (optimal)	Yes	No	No
Speculative Decoding	Yes	Yes	Limited	No	Yes
LoRA Serving	Yes (multi)	Yes	Yes	Yes	Yes
Vision Models	Yes	Yes	Yes	Yes	Yes (some)
CPU Inference	Limited	No	No	Yes	Yes (optimal)
Metal (Apple)	No	No	No	Yes	Yes
Install Difficulty	Easy	Hard	Easy	Very easy	Medium
Production Ready	High	High	High	Low	Medium

7.2 Throughput Benchmarks (Llama-3.1 8B, A100 80GB)

Framework	Throughput (tok/s)	TTFT (ms)	ITL (ms)	Memory Usage
vLLM (FP16)	4,200	45	12	18 GB
vLLM (AWQ-4bit)	6,800	32	8	7 GB
TensorRT-LLM (FP16)	4,800	38	10	17 GB
TensorRT-LLM (FP8)	7,500	28	7	10 GB
TGI (FP16)	3,600	52	14	18 GB
llama.cpp (Q4_K_M)	120	200	35	5 GB

8. vLLM Deep Dive: Architecture to LoRA Serving

8.1 vLLM Architecture

┌──────────────────────────────────────────┐
│              vLLM Architecture            │
│                                          │
│  ┌─────────┐     ┌──────────────────┐   │
│  │ FastAPI  │---->│   LLM Engine     │   │
│  │ Server   │     │                  │   │
│  └─────────┘     │  ┌────────────┐  │   │
│                  │  │ Scheduler   │  │   │
│  ┌─────────┐    │  │ (Batching)  │  │   │
│  │ OpenAI  │---->│  └─────┬──────┘  │   │
│  │ compat  │     │        |         │   │
│  └─────────┘     │  ┌─────v──────┐  │   │
│                  │  │ Block Mgr   │  │   │
│                  │  │ (PagedAttn) │  │   │
│                  │  └─────┬──────┘  │   │
│                  │        |         │   │
│                  │  ┌─────v──────┐  │   │
│                  │  │  Worker(s)  │  │   │
│                  │  │ (GPU exec)  │  │   │
│                  │  └────────────┘  │   │
│                  └──────────────────┘   │
└──────────────────────────────────────────┘

8.2 vLLM Production Deployment

# Start vLLM server (OpenAI API compatible)
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
#     --tensor-parallel-size 4 \
#     --max-model-len 32768 \
#     --gpu-memory-utilization 0.90 \
#     --enable-prefix-caching \
#     --enable-chunked-prefill \
#     --max-num-batched-tokens 4096 \
#     --port 8000

# API call via Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"},
    ],
    max_tokens=512,
    temperature=0.7,
)

8.3 vLLM Multi-LoRA Serving

Serve multiple LoRA adapters simultaneously from a single base model.

# vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora \
#     --lora-modules \
#         sql-lora=./adapters/sql-lora \
#         code-lora=./adapters/code-lora \
#         chat-lora=./adapters/chat-lora \
#     --max-loras 3 \
#     --max-lora-rank 64

# Select LoRA adapter by model name in API call
response = client.chat.completions.create(
    model="sql-lora",  # LoRA adapter name
    messages=[{"role": "user", "content": "SELECT ..."}],
)

8.4 vLLM Vision Model Serving

# Multimodal model serving
# vllm serve Qwen/Qwen2-VL-7B-Instruct \
#     --max-model-len 8192 \
#     --limit-mm-per-prompt image=4

from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="key")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]
    }],
)

9. TensorRT-LLM Deep Dive: The Choice for Maximum Performance

9.1 TensorRT-LLM Build Pipeline

┌────────┐     ┌───────────┐     ┌──────────┐     ┌──────────┐
│HF Model│---->│ Convert   │---->│ TRT-LLM  │---->│ Triton   │
│(source)│     │ Checkpoint│     │ Engine   │     │ Serving  │
└────────┘     └───────────┘     └──────────┘     └──────────┘
                 Apply quant      Compile optim    API server

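# Step 1: Checkpoint conversion + FP8 quantization
# (script and flag names differ between TensorRT-LLM releases; recent releases
#  run FP8 quantization through examples/quantization/quantize.py --qformat fp8)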
python convert_checkpoint.py \
    --model_dir meta-llama/Llama-3.1-70B-Instruct \
    --output_dir ./checkpoint_fp8 \
    --dtype bfloat16 \
    --tp_size 4 \
    --pp_size 1 \
    --use_fp8

# Step 2: Build TensorRT engine
trtllm-build \
    --checkpoint_dir ./checkpoint_fp8 \
    --output_dir ./engine_fp8 \
    --gemm_plugin auto \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --workers 4

9.2 TensorRT-LLM FP8 Optimization

Maximizes utilization of H100 GPU's FP8 Tensor Cores.

Configuration	Throughput (Llama-3.1 70B, 4xH100)	Latency
FP16, TP=4	2,400 tok/s	16ms ITL
FP8, TP=4	4,200 tok/s	9ms ITL
FP8 + Speculative	5,800 tok/s	6ms ITL
INT4 AWQ, TP=2	3,800 tok/s	11ms ITL

9.3 Inflight Batching (TensorRT-LLM)

TensorRT-LLM's implementation of Continuous Batching.

# Triton Inference Server + TensorRT-LLM backend
# config.pbtxt (model configuration)
"""
backend: "tensorrtllm"
max_batch_size: 64

model_transaction_policy {
  decoupled: True    # Streaming response support
}

parameters: {
  key: "gpt_model_type"
  value: {string_value: "inflight_fused_batching"}  # Enable Inflight Batching
}

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {string_value: "131072"}   # Limit KV Cache token count
}
"""

10. Model Parallelism: Multi-GPU Strategies

10.1 Tensor Parallelism (TP)

Splits a single layer across multiple GPUs.

Tensor Parallelism (TP=4):

         Layer N weight matrix W
    ┌──────┬──────┬──────┬──────┐
    │ W_1  │ W_2  │ W_3  │ W_4  │
    │GPU 0 │GPU 1 │GPU 2 │GPU 3 │
    └──┬───┴──┬───┴──┬───┴──┬───┘
       |      |      |      |
       v      v      v      v
    [part1] [part2] [part3] [part4]
       |      |      |      |
       └──────┴──────┴──────┘
              All-Reduce
              (aggregate results)

Pros: Reduces latency (all GPUs compute simultaneously)
Cons: Requires inter-GPU communication (NVLink recommended)
Best for: GPUs within same node (low latency needed)
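The diagram corresponds to splitting a weight matrix along its input dimension: each GPU multiplies its slice of the input by its slice of W, and an all-reduce sums the partial outputs. The sketch below simulates this on one machine with plain lists. The "GPUs" are list slices and the all-reduce is an element-wise sum, so it only demonstrates that the result matches the unsplit computation.

```python
def matvec(x, w):
    """y = x @ w for x of length n_in and w of shape [n_in][n_out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def row_parallel_matvec(x, w, tp):
    """Split the input dimension over tp shards, then sum the partial outputs."""
    if len(x) % tp != 0:
        raise ValueError("input dimension must divide evenly across shards")
    shard = len(x) // tp
    partials = [
        matvec(x[r * shard:(r + 1) * shard], w[r * shard:(r + 1) * shard])
        for r in range(tp)
    ]
    # All-reduce: every rank ends up with the element-wise sum of all partials.
    return [sum(p[j] for p in partials) for j in range(len(w[0]))]


x = [1, 2, 3, 4, 5, 6, 7, 8]
w = [[(i + 2 * j) % 5 for j in range(3)] for i in range(8)]  # illustrative weights

full = matvec(x, w)
sharded = row_parallel_matvec(x, w, tp=4)
print(full, sharded)
```

Each shard holds a quarter of the weights, which is the memory saving; the price is one all-reduce per split layer.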

10.2 Pipeline Parallelism (PP)

Distributes layers sequentially across GPUs.

Pipeline Parallelism (PP=4, 80 layers):

GPU 0: [Layer 0-19]  -> GPU 1: [Layer 20-39]
                              -> GPU 2: [Layer 40-59]
                                       -> GPU 3: [Layer 60-79]

Pros: Minimal inter-GPU communication (one direction)
Cons: Pipeline bubbles (GPU idle time)
Best for: Cross-node distribution (higher latency tolerable)
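The size of the pipeline bubble can be estimated with a standard idealized formula: with p stages and m micro-batches of equal cost, the idle fraction is (p - 1) / (m + p - 1). This formula is an assumption brought in for illustration; it is not stated in the article and ignores communication time.

```python
def pipeline_bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of an idealized pipeline with equal-cost stages."""
    return (stages - 1) / (micro_batches + stages - 1)


for m in (1, 4, 8, 32):
    print(m, round(pipeline_bubble_fraction(4, m), 3))
```

With a single micro-batch, three of the four GPUs are idle at any time; keeping many micro-batches or requests in flight shrinks the bubble.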

10.3 Expert Parallelism (EP) - For MoE Models

Distributes experts in Mixture of Experts models.

Expert Parallelism (Mixtral 8x7B, EP=4):

GPU 0: Expert 0, 1 + Shared Layers
GPU 1: Expert 2, 3 + Shared Layers
GPU 2: Expert 4, 5 + Shared Layers
GPU 3: Expert 6, 7 + Shared Layers

Token routing: Each token sent to Top-2 Experts
-> Requires All-to-All communication between GPUs
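Top-2 routing with the expert placement above can be sketched as follows. The router logits are made up, and the gate weights are a softmax over the two selected logits, which is one common convention and is assumed here rather than taken from the article.

```python
import math


def route_token(router_logits, top_k=2, experts_per_gpu=2):
    """Return (expert, gpu, gate_weight) for the top_k experts of one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the selected logits only (shifted by the max for stability).
    exps = [math.exp(router_logits[e] - router_logits[chosen[0]]) for e in chosen]
    total = sum(exps)
    return [(e, e // experts_per_gpu, w / total) for e, w in zip(chosen, exps)]


logits = [0.1, 2.0, -1.0, 0.5, 1.0, -0.3, 0.0, 1.5]  # one token, 8 experts
routes = route_token(logits)
for expert, gpu, weight in routes:
    print(f"expert {expert} on GPU {gpu}, gate weight {weight:.3f}")
```

This token has to be sent to two different GPUs and its outputs gathered back, which is the All-to-All traffic mentioned above.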

10.4 Practical Parallelism Combinations

# Serving Llama-3.1 405B (8x H100 80GB)
# Model size: ~810 GB (FP16) -> FP8 ~405 GB

# Option 1: TP=8 (all layers split across all GPUs)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --max-model-len 16384

# Option 2: TP=4, PP=2 (4 GPUs per pipeline stage, 2 stages)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2 \
    --max-model-len 16384

11. Advanced GPU Memory Optimization

11.1 KV Cache Quantization

# KV Cache FP8 quantization in vLLM
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
#     --tensor-parallel-size 4 \
#     --kv-cache-dtype fp8 \
#     --quantization fp8

# KV Cache memory savings:
# FP16 KV Cache: 1.34 GB / sequence (Llama-2 70B, 4K)
# FP8 KV Cache:  0.67 GB / sequence (50% savings)
# 2x more concurrent requests on same GPU

11.2 Memory Allocation Strategy

Per-GPU Memory Distribution (A100 80GB, Llama-3.1 70B FP16, TP=4):

┌─────────────────────────────────┐
│  Model Weights: ~35 GB (TP=4)   │  43.75%
├─────────────────────────────────┤
│  KV Cache: ~35 GB               │  43.75%
│  (gpu_memory_utilization=0.90)  │
├─────────────────────────────────┤
│  Activation Memory: ~2 GB       │  2.5%
├─────────────────────────────────┤
│  System Reserved: ~8 GB         │  10%
└─────────────────────────────────┘

KV Cache determines the maximum number of concurrent requests.
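The link between the KV Cache budget and concurrency can be estimated directly. The sketch below assumes a 35 GB per-GPU budget, the 70B-class shape from section 2.2 (80 layers, 8 KV heads, head_dim 128, FP16), and KV heads sharded evenly over the tensor-parallel ranks; `tp=4` is chosen for illustration, and the function name is hypothetical.

```python
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # K and V, all layers, FP16


def max_concurrent_sequences(kv_budget_bytes, kv_bytes_per_token,
                             tokens_per_sequence, tp=1):
    """Sequences that fit in one GPU's KV budget with KV heads split over tp ranks."""
    per_gpu_token_bytes = kv_bytes_per_token // tp
    total_tokens = kv_budget_bytes // per_gpu_token_bytes
    return total_tokens // tokens_per_sequence


budget = 35 * 10**9
at_4k = max_concurrent_sequences(budget, KV_BYTES_PER_TOKEN, 4096, tp=4)
at_16k = max_concurrent_sequences(budget, KV_BYTES_PER_TOKEN, 16384, tp=4)
print(at_4k, at_16k)
```

These are worst-case figures for sequences that all reach full length; with paged allocation, shorter sequences leave room for more concurrent requests.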

11.3 Strategies When Running Low on Memory

Strategy	Implementation	Effect	Side Effect
Quantization	FP16 to INT4	4x weight reduction	Slight quality loss
KV Cache Quant	FP16 to FP8	2x KV Cache reduction	Negligible
Reduce max_model_len	32K to 8K	Proportional KV Cache reduction	No long contexts
Increase TP	TP=2 to TP=4	Half memory per GPU	Extra GPU cost
Prefix Caching	Shared system prompts	Large savings for repeated requests	No effect on unique requests

12. Cost Analysis: tokens/dollar Across Platforms

12.1 Self-Hosting Cost Comparison

GPU	Cloud Hourly Cost	Llama-3.1 70B Throughput	tokens/dollar
A100 80GB x1	~3.0 USD	800 tok/s (FP16)	960K
A100 80GB x4 (TP=4)	~12.0 USD	2,800 tok/s	840K
H100 80GB x1	~4.5 USD	1,500 tok/s (FP8)	1,200K
H100 80GB x4 (TP=4)	~18.0 USD	5,000 tok/s (FP8)	1,000K
L40S x1	~1.5 USD	600 tok/s (INT4)	1,440K
4090 x1 (own server)	~0.3 USD (power)	400 tok/s (INT4)	4,800K
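The tokens/dollar column follows from throughput and hourly cost. A minimal helper, using rows from the table above as inputs:

```python
def tokens_per_dollar(tokens_per_second: float, hourly_cost_usd: float) -> float:
    """Tokens generated per dollar at sustained throughput."""
    return tokens_per_second * 3600 / hourly_cost_usd


a100_x1 = tokens_per_dollar(800, 3.0)
h100_x4 = tokens_per_dollar(5000, 18.0)
l40s_x1 = tokens_per_dollar(600, 1.5)
print(f"{a100_x1:,.0f} {h100_x4:,.0f} {l40s_x1:,.0f}")
```

The calculation assumes the GPU is fully loaded for the whole hour; at lower utilization the effective tokens/dollar drops proportionally.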

12.2 API vs Self-Hosting Break-Even Point

Monthly token usage cost comparison (Llama-3.1 70B class):

┌─────────────────────────────────────────────────┐
│ Cost                                            │
│ ($)                                             │
│ 5000|                                    /API   │
│     |                                  /        │
│ 3000|                    ---------- Self-Hosted  │
│     |              /  (H100x4 monthly fixed)    │
│ 1000|      / API                                │
│     |   /                                       │
│    0+--+----+----+----+----+----+------>        │
│     0  2B   5B  10B  20B  50B 100B  tokens/mo  │
│                                                 │
│  Break-even: ~10B tokens/month                  │
└─────────────────────────────────────────────────┘

13. Benchmarking: How to Measure Correctly

13.1 Core Metrics

Metric	Definition	Why It Matters
TTFT (Time To First Token)	Time until first token generated	User-perceived response start
ITL (Inter-Token Latency)	Time between tokens	Perceived streaming speed
E2E Latency	Total request completion time	Total wait time
Throughput	Tokens generated per second	Overall system capacity
TPS/User	Tokens per second per user	Individual perceived speed
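These metrics can be derived from per-token timestamps recorded on the client. The sketch below takes timestamps in milliseconds; the function name and the sample values are made up for illustration.

```python
def request_metrics(sent_at_ms, token_times_ms):
    """Compute TTFT, mean ITL, E2E latency and per-user TPS for one request."""
    ttft = token_times_ms[0] - sent_at_ms
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    e2e = token_times_ms[-1] - sent_at_ms
    return {
        "ttft_ms": ttft,
        "itl_ms": itl,
        "e2e_ms": e2e,
        "tps_user": len(token_times_ms) * 1000 / e2e,
    }


metrics = request_metrics(sent_at_ms=0, token_times_ms=[48, 60, 72, 84, 96])
print(metrics)
```

System throughput is then the total number of generated tokens across all requests divided by the benchmark duration, which is a separate quantity from per-user TPS.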

13.2 Benchmarking Tools and Methods

# vLLM built-in benchmark (recommended)
# After running: python -m vllm.entrypoints.openai.api_server

# python benchmarks/benchmark_serving.py \
#     --backend vllm \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --dataset-name sharegpt \
#     --dataset-path ShareGPT_V3_unfiltered.json \
#     --num-prompts 1000 \
#     --request-rate 10 \
#     --endpoint /v1/completions

# Example results:
# Successful requests:                     1000
# Benchmark duration (s):                  105.23
# Total input tokens:                      215000
# Total generated tokens:                  180000
# Request throughput (req/s):              9.50
# Output token throughput (tok/s):         1710.5
# Mean TTFT (ms):                          48.2
# Median TTFT (ms):                        42.1
# P99 TTFT (ms):                           125.3
# Mean ITL (ms):                           11.8
# Median ITL (ms):                         10.2
# P99 ITL (ms):                            35.7

13.3 Performance Characteristics Under Load

Throughput vs Latency relationship (as concurrency increases):

Throughput                          Latency
(tok/s)                            (ms)
  |        ┌──────────              |              /
  |       /                         |            /
  |      /                          |          /
  |     /                           |        /
  |    /                            |      /
  |   /                             |    /
  |  /                              |  /
  | /                               |/
  +──────────────> concurrency      +──────────────> concurrency

  Optimal operating point: Just before throughput saturates (Knee point)
  Usually GPU utilization of 70-80%

14. Production Deployment Architecture

14.1 Production Serving Architecture

┌──────────────────────────────────────────────────┐
│                Production Architecture            │
│                                                  │
│  Client -> Load Balancer -> API Gateway          │
│                              |                   │
│                    ┌─────────┼─────────┐         │
│                    v         v         v         │
│              ┌─────────┐┌────────┐┌────────┐    │
│              │ vLLM    ││ vLLM   ││ vLLM   │    │
│              │ Pod 1   ││ Pod 2  ││ Pod 3  │    │
│              │ (4xH100)││(4xH100)││(4xH100)│    │
│              └────┬────┘└───┬────┘└───┬────┘    │
│                   |         |         |          │
│              ┌────v─────────v─────────v────┐    │
│              │     Prometheus + Grafana     │    │
│              │    (Metrics collection)      │    │
│              └─────────────────────────────┘    │
│                                                  │
│  Autoscaling: Based on queue length / GPU util   │
│  Health Check: /health endpoint                  │
│  Graceful Shutdown: Complete in-flight requests  │
└──────────────────────────────────────────────────┘

14.2 Kubernetes Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
          - "--model=meta-llama/Llama-3.1-70B-Instruct"
          - "--tensor-parallel-size=4"
          - "--max-model-len=16384"
          - "--gpu-memory-utilization=0.90"
          - "--enable-prefix-caching"
          - "--enable-chunked-prefill"
        resources:
          limits:
            nvidia.com/gpu: "4"
            memory: "64Gi"
          requests:
            nvidia.com/gpu: "4"
            memory: "32Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      nodeSelector:
        gpu-type: h100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

15. Quiz

Q1. What core problem does vLLM's PagedAttention solve?

Answer: It solves the memory fragmentation problem of KV Cache.

Traditional approaches pre-allocate contiguous memory for the maximum sequence length per request, wasting 60-80%. PagedAttention splits KV Cache into fixed-size blocks (pages) and uses page tables to logically link non-contiguous blocks, like OS virtual memory:

Near-zero internal fragmentation
Non-contiguous memory blocks utilized
Copy-on-Write for shared prompt KV Cache

This enables 2-4x more concurrent requests on the same GPU memory.

Q2. Why does Continuous Batching achieve higher throughput than Static Batching?

Answer: Static Batching waits for all requests in a batch to complete before starting the next batch. Even when short responses finish early, the GPU sits idle.

Continuous Batching:

Removes completed requests every iteration
Immediately adds new requests from the queue
Keeps GPU at maximum utilization

Reported gains reach 10-20x over Static Batching, depending on how much response lengths vary. Individual request latency also improves due to reduced waiting time.

Q3. Why does Speculative Decoding not degrade output quality?

Answer: Because it exactly preserves the target model's output distribution mathematically.

For a draft token x:

Acceptance probability = min(1, p_target(x) / p_draft(x))
On rejection: resample from the residual distribution max(0, p_target - p_draft), renormalized to sum to 1

This process ensures the final output has a mathematically identical distribution to sampling from the target model alone. Speed improves with no loss in output quality.

Q4. Why is the LLM Decode phase Memory-Bound?

Answer: In the decode phase, only one token is generated at a time. All model weights must be read from memory at every step (matrix-vector multiplication), but each weight is used for only a few operations.

Llama-2 70B example:

Must read 140 GB model weights each step
At A100 bandwidth of 2 TB/s: 70ms (memory reading)
Actual computation time: ~1ms

Memory bandwidth is the bottleneck, making it Memory-Bound. This is why quantization (reducing weight size) and batching (reading weights once for multiple requests) are effective.

Q5. Why is FP8 quantization more suitable for LLM inference than INT8?

Answer: FP8 is a floating-point format with a wide dynamic range. LLM weights and activations have highly varied magnitudes, so FP8 fits them better than INT8, whose levels are evenly spaced over a fixed range.

Specifically:

FP8 E4M3: 4-bit exponent, 3-bit mantissa -- wide range, decent precision
INT8: Fixed range of -128 to 127 -- vulnerable to outliers
H100 GPUs have dedicated FP8 Tensor Cores with 2x FP16 compute
FP8 supports dynamic quantization without calibration

As a result, FP8 provides performance gains close to INT4 while maintaining quality close to FP16.

16. References

FlashAttention: Fast and Memory-Efficient Exact Attention - Dao et al., 2022
- https://arxiv.org/abs/2205.14135
FlashAttention-2: Faster Attention with Better Parallelism - Dao, 2023
- https://arxiv.org/abs/2307.08691
Efficient Memory Management for Large Language Model Serving with PagedAttention - Kwon et al., 2023
- https://arxiv.org/abs/2309.06180
Fast Inference from Transformers via Speculative Decoding - Leviathan et al., 2023
- https://arxiv.org/abs/2211.17192
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Frantar et al., 2023
- https://arxiv.org/abs/2210.17323
AWQ: Activation-aware Weight Quantization - Lin et al., 2024
- https://arxiv.org/abs/2306.00978
TensorRT-LLM - NVIDIA Official Documentation
- https://nvidia.github.io/TensorRT-LLM/
Orca: A Distributed Serving System for Transformer-Based Generative Models - Yu et al., 2022
- https://www.usenix.org/conference/osdi22/presentation/yu
GQA: Training Generalized Multi-Query Transformer Models - Ainslie et al., 2023
- https://arxiv.org/abs/2305.13245
Medusa: Simple LLM Inference Acceleration Framework - Cai et al., 2024
- https://arxiv.org/abs/2401.10774
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty - Li et al., 2024
- https://arxiv.org/abs/2401.15077
SmoothQuant: Accurate and Efficient Post-Training Quantization - Xiao et al., 2023
- https://arxiv.org/abs/2211.10438

Complete LLM Inference Optimization Guide 2025: vLLM, TensorRT-LLM, KV Cache, Speculative Decoding

Table of Contents

1. Understanding LLM Inference Bottlenecks: Compute-Bound vs Memory-Bound

1.1 Arithmetic Intensity and the Roofline Model

1.2 Prefill vs Decode Phases

1.3 Why Decode Is Slow

2. KV Cache: The Core Data Structure of LLM Inference

2.1 What Is KV Cache

2.2 KV Cache Memory Calculation

2.3 PagedAttention (Core of vLLM)

2.4 Prefix Caching

3. Attention Optimization: FlashAttention and MQA/GQA

3.1 FlashAttention: IO-Aware Attention

3.2 FlashAttention Version Comparison

3.3 Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)

4. Batching Strategies: Static vs Continuous

4.1 Limitations of Static Batching

4.2 Continuous Batching (In-Flight Batching)