Complete LLM Serving Optimization Guide: KV Cache, PagedAttention, and Quantization
Author: Youngju Kim (@fjvbn20031)
- The Two Phases of LLM Inference: Prefill and Decode
- KV Cache: The Memory Dilemma of Attention
- PagedAttention (vLLM): Virtual Memory Saves LLMs
- Continuous Batching: Maximizing Throughput
- Quantization: Trade Precision for Speed and Memory
- Speculative Decoding: Free Lunch Exists
- Tensor Parallelism and Pipeline Parallelism
- vLLM vs TGI vs TensorRT-LLM: Framework Comparison
- Production LLM Serving Stack
The Two Phases of LLM Inference: Prefill and Decode
LLM text generation splits into two fundamentally different phases. Without understanding this split, optimization is impossible.
Phase 1: PREFILL (process the input)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: "What is the capital of France?"
└─ all 9 tokens processed at once
What happens:
- All input tokens are processed in parallel (big matrix multiply!)
- Q, K, V are computed for each input token
- KV cache is created (saves K, V for later reuse)
- First output token is generated
Characteristics:
- GPU operation: COMPUTE-BOUND (matrix × matrix)
- GPU utilization: HIGH ✅
- Latency metric: TTFT (Time To First Token)
Phase 2: DECODE (generate tokens one-by-one)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Generates: "Paris" → "is" → "the" → "capital" → ...
What happens:
- Generate exactly one token per forward pass
- Compute Q for the new token, attend over cached K, V
- Must read ALL model weights for every single token
Characteristics:
- GPU operation: MEMORY-BOUND (matrix × vector)
- GPU utilization: LOW (often 5–20%!)
- Throughput metric: TBT (Time Between Tokens)
This is why LLM serving is hard to optimize:
the two phases have completely different bottlenecks!
Measuring it in practice:
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

def measure_llm_phases(model_name="meta-llama/Llama-3.2-1B"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="cuda"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    prompt = "Explain the transformer architecture in detail:"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    input_len = inputs["input_ids"].shape[1]

    # Measure prefill time
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        _ = model(**inputs)  # forward pass on input only
    torch.cuda.synchronize()
    t_prefill = time.perf_counter() - t0

    # Measure decode time (generate() re-runs prefill internally,
    # so subtracting the separately measured prefill is an approximation)
    t0 = time.perf_counter()
    with torch.no_grad():
        generated = model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
            do_sample=False,
        )
    torch.cuda.synchronize()
    t_total = time.perf_counter() - t0
    t_decode = t_total - t_prefill

    n_new = generated.shape[1] - input_len
    print(f"Input tokens: {input_len}")
    print(f"Prefill time (TTFT): {t_prefill*1000:.1f}ms")
    print(f"Generated tokens: {n_new}")
    print(f"Decode time: {t_decode*1000:.1f}ms")
    print(f"Per-token decode time: {t_decode/n_new*1000:.1f}ms/token")
# Llama-1B on H100 (~):
# Prefill: ~5ms (linear in input length)
# Decode: ~3ms/token (proportional to model size, batch-dependent)
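The memory-bound claim can be sanity-checked with back-of-envelope arithmetic: at batch size 1, every decode step must stream all model weights from HBM at least once, so memory bandwidth sets a hard latency floor. A minimal sketch, assuming round numbers (FP16 weights, ~3.35 TB/s H100 HBM3 bandwidth):

```python
def decode_time_floor_ms(n_params, bytes_per_param=2, hbm_bandwidth_gbs=3350):
    """Lower bound on per-token decode latency at batch size 1:
    every decode step reads all weights from HBM at least once."""
    model_bytes = n_params * bytes_per_param
    return model_bytes / (hbm_bandwidth_gbs * 1e9) * 1000  # ms

# Assumed H100 SXM bandwidth: ~3.35 TB/s (HBM3)
for name, n in [("1B", 1e9), ("8B", 8e9), ("70B", 70e9)]:
    print(f"{name}: decode floor ≈ {decode_time_floor_ms(n):.2f} ms/token")
```

Batching amortizes this: the same weight read serves every sequence in the batch, which is exactly why continuous batching (below) matters so much.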
KV Cache: The Memory Dilemma of Attention
What Happens Without a KV Cache?
Autoregressive generation WITHOUT KV cache:
Step 1: [token_1] → generate token_2
- Compute Q1,K1,V1 for token_1
- Compute Q2,K2,V2 for token_2 (partial)
- Attention: Q2 × [K1,K2]^T
- Ops: 2^2 = 4 dot products
Step 2: [token_1, token_2] → generate token_3
- Re-compute K1,V1 (wasted work!)
- Re-compute K2,V2 (wasted work!)
- Compute Q3,K3,V3
- Attention: Q3 × [K1,K2,K3]^T
- Ops: 3^2 = 9 dot products
Step N: O(N^2) operations per token
Total for L tokens: O(L^3) total compute
100 tokens: 1,000,000 dot products
1000 tokens: 1,000,000,000 dot products
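The op counts above can be reproduced with a small counter. A toy sketch (dot products only, ignoring constant factors):

```python
def attention_dot_products(L, use_kv_cache):
    """Count query-key dot products over L generation steps.
    Without a cache, step n re-attends with n queries over n keys (n^2 ops);
    with a cache, step n computes 1 new query over n cached keys (n ops)."""
    if use_kv_cache:
        return sum(n for n in range(1, L + 1))   # O(L^2) total
    return sum(n * n for n in range(1, L + 1))   # O(L^3) total

L = 1000
no_cache = attention_dot_products(L, use_kv_cache=False)
cached = attention_dot_products(L, use_kv_cache=True)
print(f"no cache: {no_cache:,}  with cache: {cached:,}  "
      f"ratio: {no_cache / cached:.0f}x")
```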
KV Cache: Reuse Previous Computation
import torch
import torch.nn as nn
import math

class AttentionWithKVCache(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        # KV cache storage
        self.k_cache = None  # (batch, heads, past_len, d_k)
        self.v_cache = None

    def forward(self, x, use_cache=True):
        batch, seq, d = x.shape
        Q = self.W_q(x).view(batch, seq, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch, seq, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch, seq, self.n_heads, self.d_k).transpose(1, 2)
        if use_cache and self.k_cache is not None:
            # Append new K, V to the cache
            K = torch.cat([self.k_cache, K], dim=2)
            V = torch.cat([self.v_cache, V], dim=2)
        if use_cache:
            self.k_cache = K.detach()
            self.v_cache = V.detach()
        # Q covers only the current token(s); K, V cover the full sequence.
        # (Decoding one token at a time needs no causal mask;
        # multi-token prefill would additionally require one.)
        scale = math.sqrt(self.d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / scale
        weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(weights, V)
        output = output.transpose(1, 2).contiguous().view(batch, seq, d)
        return self.W_o(output)  # output projection
def compute_kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                           batch_size, dtype_bytes=2):
    """
    KV cache memory in bytes.
    Factor of 2: one tensor for K, one for V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Llama 3.1 70B (uses GQA: 8 KV heads, 64 Q heads):
size = compute_kv_cache_bytes(
    seq_len=4096, n_layers=80, n_kv_heads=8,
    head_dim=128, batch_size=1, dtype_bytes=2
)
print(f"KV cache (Llama-70B, seq=4096, batch=1): {size/1e9:.1f} GB")
# Result: ~1.3 GB per request
# batch=32: ~43 GB of KV cache alone, on top of 140 GB of FP16 weights:
# far beyond a single H100 (80 GB)!
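The same formula also shows why modern models adopt GQA: the cache scales with the number of KV heads, so shrinking them shrinks the cache proportionally. A quick comparison, assuming a hypothetical 70B-shaped model with full MHA (64 KV heads) versus GQA (8 KV heads):

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, batch, dtype_bytes=2):
    # Factor of 2: one tensor for K, one for V
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes / 1e9

# Hypothetical 70B geometry: 80 layers, head_dim 128, seq 4096, batch 1
mha = kv_cache_gb(4096, 80, n_kv_heads=64, head_dim=128, batch=1)
gqa = kv_cache_gb(4096, 80, n_kv_heads=8, head_dim=128, batch=1)
print(f"MHA: {mha:.1f} GB  GQA: {gqa:.1f} GB  ({mha/gqa:.0f}x smaller)")
```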
The Memory Fragmentation Problem
Traditional KV cache allocation (pre-allocate max_seq_len per request):
┌───────────────────────────────────────────────────────┐
│ Request 1: current_len=100, reserved=512 │
│ ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ [used: 100] [wasted: 412 slots = 80%!] │
├───────────────────────────────────────────────────────┤
│ Request 2: current_len=50, reserved=512 │
│ ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ [used: 50] [wasted: 462 slots = 90%!] │
├───────────────────────────────────────────────────────┤
│ Request 3: current_len=300, reserved=512 │
│ █████████████████████████████████████░░░░░░░░ │
│ [used: 300] [wasted: 212 slots = 41%!] │
└───────────────────────────────────────────────────────┘
Total allocated: 3 × 512 = 1536 slots
Total used: 450 slots
Wasted: 1086 slots = 71%
Internal fragmentation (allocated but unused) +
External fragmentation (gaps between allocations)
→ Typical real-world GPU memory utilization: 20–40%
PagedAttention (vLLM): Virtual Memory Saves LLMs
The Core Insight: OS Virtual Memory Applied to KV Cache
Kwon et al. (UC Berkeley, 2023): "Operating systems solved memory fragmentation decades ago. Apply the same idea to KV cache."
The OS Virtual Memory Lesson:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
virtual address → page table → physical address
Process sees contiguous virtual space
Physical memory can be non-contiguous
→ No fragmentation, efficient RAM usage
PagedAttention Analogy:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
virtual KV slot → block table → physical block
Sequence sees contiguous virtual slots
Physical GPU blocks can be non-contiguous
→ Near-zero fragmentation, efficient GPU VRAM usage
PagedAttention Memory Layout:
GPU memory split into fixed-size blocks (default: 16 tokens each):
┌─────────────────────────────────────────────────────────┐
│ Physical KV Cache Blocks │
│ Block 0: [tok0–15] Block 1: [tok16–31] │
│ Block 2: [tok32–47] Block 3: [tok48–63] │
│ Block 4: [tok64–79] Block 5: FREE │
│ Block 6: FREE Block 7: FREE │
└─────────────────────────────────────────────────────────┘
Block Table (same role as OS page table):
┌──────────┬──────────────────────────────────────────────┐
│ Request │ Virtual block → Physical block mapping │
├──────────┼──────────────────────────────────────────────┤
│ Req 1 │ virt[0]→phys[0], virt[1]→phys[2] │
│ │ (tokens 0–15: block 0; tokens 32–47: block 2) │
├──────────┼──────────────────────────────────────────────┤
│ Req 2 │ virt[0]→phys[1], virt[1]→phys[3] │
│ │ (tokens 0–15: block 1; tokens 16–31: block 3) │
└──────────┴──────────────────────────────────────────────┘
Key properties:
- Blocks are allocated ON-DEMAND as the sequence grows
- Internal fragmentation < 1 block = at most 15 wasted slots
- Blocks can be SHARED across requests (same prefix)!
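The on-demand allocation scheme can be illustrated with a toy allocator. All names here are illustrative sketches of the idea, not vLLM internals:

```python
class BlockAllocator:
    """Toy PagedAttention-style allocator: fixed-size physical blocks
    handed out on demand, tracked per request in a block table."""
    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))  # free physical block ids
        self.table = {}    # request id -> list of physical block ids
        self.length = {}   # request id -> tokens stored so far

    def append_token(self, req):
        n = self.length.get(req, 0)
        if n % self.block_size == 0:  # current block full: allocate a new one
            if not self.free:
                raise MemoryError("no free blocks (would trigger preemption)")
            self.table.setdefault(req, []).append(self.free.pop(0))
        self.length[req] = n + 1

    def release(self, req):
        # Finished request: its blocks return to the free pool immediately
        self.free.extend(self.table.pop(req, []))
        self.length.pop(req, None)

alloc = BlockAllocator(n_blocks=8)
for _ in range(20):
    alloc.append_token("req1")  # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("req2")  # 5 tokens -> 1 block
print(alloc.table)  # e.g. {'req1': [0, 1], 'req2': [2]}
```

Note how waste is bounded: "req2" wastes only the 11 unused slots of its last block, regardless of how long it might eventually grow.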
# vLLM with PagedAttention:
from vllm import LLM, SamplingParams
import time

def benchmark_vllm():
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        gpu_memory_utilization=0.9,
        max_model_len=8192,
        block_size=16,
        max_num_seqs=256,
    )
    prompts = [
        "Short question: What is Python?",
        "Medium question: " + "Explain the history of machine learning. " * 5,
        "Long question: " + "How do you build a transformer from scratch? " * 10,
    ] * 20  # 60 requests of varying lengths
    params = SamplingParams(temperature=0.0, max_tokens=100)
    t0 = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - t0
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"Requests: {len(prompts)}")
    print(f"Tokens generated: {total_tokens}")
    print(f"Elapsed: {elapsed:.1f}s")
    print(f"Throughput: {total_tokens/elapsed:.0f} tokens/s")
# Memory efficiency improvement:
# Traditional: 20–40% of GPU memory used for actual KV cache
# PagedAttention: >95% of GPU memory used for actual KV cache
# Result: 2–3× more concurrent requests on the same GPU
Prefix Caching: Share Common Prompts
# Enable prefix caching in vLLM:
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)
# Many requests sharing a long system prompt:
system = "You are an expert software engineer. " * 100 # long system prompt
requests = [
system + "User: What is a binary search tree?",
system + "User: How does garbage collection work?",
system + "User: Explain ACID properties in databases.",
]
# The KV cache for `system` is computed ONCE and shared across all 3 requests.
# Prefill cost: computed 1 time instead of 3 times (3× savings on prefill!)
# This matters hugely for RAG pipelines where context is repeated.
Continuous Batching: Maximizing Throughput
The Problem with Static Batching
Static (request-level) batching:
GPU batch at each step:
Step 1: [Req1: running] [Req2: running] [Req3: running]
Step 2: [Req1: running] [Req2: DONE ] [Req3: running]
Step 3: [Req1: running] [ idle/wait ] [Req3: running] ← GPU waste!
Step 4: [Req1: DONE ] [ idle/wait ] [Req3: running] ← GPU waste!
Step 5: [ idle/wait ] [ idle/wait ] [Req3: DONE ] ← GPU waste!
New requests must wait until the ENTIRE batch finishes.
GPU waste rate: often 50%+
Continuous Batching: Dynamic Scheduling Every Token Step
Continuous (iteration-level) batching:
Step 1: [Req1] [Req2] [Req3]
Step 2: [Req1] [Req2] [Req3]
Step 3: [Req1] [Req4] [Req3] ← Req2 done → Req4 inserted immediately!
Step 4: [Req5] [Req4] [Req3] ← Req1 done → Req5 inserted!
Step 5: [Req5] [Req4] [Req6] ← Req3 done → Req6 inserted!
GPU is at maximum utilization at every step.
Throughput improvement over static batching: 2–4×
from vllm import SamplingParams
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
import asyncio

async def run_continuous_batching_server():
    engine_args = AsyncEngineArgs(
        model="meta-llama/Llama-3.1-8B-Instruct",
        max_num_seqs=256,             # max concurrent sequences
        max_num_batched_tokens=8192,  # max tokens per batch step
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def generate_one(prompt, req_id):
        params = SamplingParams(temperature=0.7, max_tokens=200)
        async for output in engine.generate(prompt, params, request_id=req_id):
            if output.finished:
                return output.outputs[0].text

    # Requests submitted concurrently; the engine handles continuous batching
    results = await asyncio.gather(
        generate_one("Explain quantum entanglement.", "r1"),
        generate_one("Write a Python quicksort.", "r2"),
        generate_one("Summarize the French Revolution.", "r3"),
    )
    for r in results:
        print(r[:80])
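The 2–4× figure can be sanity-checked with a toy scheduler simulation. This is an illustrative model (random decode lengths, no prefill cost, one token per slot per step), not a benchmark:

```python
import random

def batching_utilization(batch_size, n_requests, continuous, seed=0):
    """Fraction of GPU slot-steps doing useful work under each policy.
    Toy model: each request needs a random number of decode steps."""
    rng = random.Random(seed)
    pending = [rng.randint(20, 200) for _ in range(n_requests)]
    busy = wasted = 0
    if continuous:
        slots = [0] * batch_size          # remaining tokens per slot
        while any(slots) or pending:
            for i in range(batch_size):   # refill freed slots immediately
                if slots[i] == 0 and pending:
                    slots[i] = pending.pop()
            for i in range(batch_size):
                if slots[i] > 0:
                    slots[i] -= 1
                    busy += 1
                else:
                    wasted += 1
    else:
        while pending:                    # wait for the WHOLE batch to drain
            take = min(batch_size, len(pending))
            batch = [pending.pop() for _ in range(take)]
            busy += sum(batch)
            wasted += batch_size * max(batch) - sum(batch)
    return busy / (busy + wasted)

static = batching_utilization(8, 64, continuous=False)
cont = batching_utilization(8, 64, continuous=True)
print(f"static: {static:.0%}  continuous: {cont:.0%}")
```

Static batching pays for the longest request in every batch; continuous batching only wastes slots during the final drain.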
Quantization: Trade Precision for Speed and Memory
Why Quantization?
LLM memory footprint (FP16):
Llama 3.1 8B: 16 GB
Llama 3.1 70B: 140 GB
Llama 3.1 405B: 810 GB
Common GPU memory:
RTX 4090 (24 GB): tight even for 8B
A100 (80 GB): 70B does not fit on one card
H100 (80 GB): 70B does not fit on one card
H100 ×8 (640 GB): 70B fine, 405B barely
Memory savings with quantization:
FP16 (16-bit): baseline
INT8 (8-bit): 50% saved, ~1% accuracy loss
INT4 (4-bit): 75% saved, ~2–3% accuracy loss
INT3 (3-bit): 81% saved, use cautiously
INT2 (2-bit): 88% saved, usually unacceptable
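The savings column is simple arithmetic over the weight bits. A sketch for a 70B-parameter model (weights only; KV cache and activations excluded):

```python
def model_memory_gb(n_params, bits):
    """Weight memory only (excludes KV cache and activations)."""
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4, 3, 2):
    gb = model_memory_gb(70e9, bits)
    print(f"{bits:>2}-bit: {gb:6.1f} GB  ({1 - bits/16:.0%} saved vs FP16)")
```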
Post-Training Quantization: INT8 (LLM.int8())
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# INT8 quantization (Dettmers et al., 2022, bitsandbytes):
config_int8 = BitsAndBytesConfig(
    load_in_8bit=True,
    # Optionally keep certain layers in FP16 (e.g., output head)
    llm_int8_skip_modules=["lm_head"],
)
model_int8 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=config_int8,
    device_map="auto",
)
# 70B model: 140 GB (FP16) → 70 GB (INT8), ~1% accuracy loss
# The key insight behind LLM.int8():
# Problem: activation outliers in certain channels ruin naive INT8 quality
# Solution: "Mixed-precision decomposition"
# - Detect outlier channels (top ~1% by magnitude)
# - Keep those channels in FP16
# - Quantize all other channels to INT8
# → Near-lossless quality with ~50% memory savings
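The decomposition can be sketched end-to-end in a few lines. This is a simplified version (per-tensor scales rather than LLM.int8()'s vector-wise scales, and the high-precision path kept in FP32 for CPU portability); the outlier threshold of 6.0 follows the paper:

```python
import torch

def mixed_precision_matmul(x, W, outlier_thresh=6.0):
    """Simplified LLM.int8()-style decomposition: input channels with
    outlier activations stay high-precision; the rest go through INT8."""
    outliers = (x.abs() > outlier_thresh).any(dim=0)  # outlier input channels
    # High-precision path for the few outlier channels
    y_hp = x[:, outliers] @ W[outliers, :]
    # INT8 path (symmetric absmax quantization) for everything else
    x_r, W_r = x[:, ~outliers], W[~outliers, :]
    sx = x_r.abs().max() / 127
    sw = W_r.abs().max() / 127
    xq = (x_r / sx).round().clamp(-127, 127)
    wq = (W_r / sw).round().clamp(-127, 127)
    y_int8 = (xq @ wq) * sx * sw   # dequantize the INT8 result
    return y_hp + y_int8

torch.manual_seed(0)
x = torch.randn(4, 64)
x[:, 3] += 20                      # inject one outlier channel
W = torch.randn(64, 32)
y = mixed_precision_matmul(x, W)
ref = x @ W
err = ((y - ref).abs().max() / ref.abs().max()).item()
print(f"max relative error: {err:.3f}")  # small despite the outlier
```

Quantizing the outlier channel naively would dominate `sx` and crush the resolution of every other channel; routing it around the INT8 path is the whole trick.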
4-bit Quantization: NF4 and GPTQ
# NF4 quantization (QLoRA paper, Dettmers et al. 2023):
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # quantize the scales too!
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=config_4bit,
    device_map="auto",
)
# 70B: 140 GB → 35 GB, ~2–3% accuracy loss
# Why NF4?
# Neural network weights are approximately normally distributed.
# NF4 uses 16 codepoints placed at equal-probability quantiles
# of a standard normal distribution.
# Each codepoint covers an equal probability mass → minimal quantization error
# vs. uniform INT4 which distributes points evenly on the number line.
# Double quantization:
# Quantization scale factors are FP32: 1 per group of 64 weights
# Double-quant quantizes those scale factors to INT8 too
# Net savings: ~0.5 additional bits per weight
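The equal-probability-mass idea can be made concrete with the standard normal quantile function. This is a simplified construction (the actual NF4 codebook is asymmetric and pins an exact zero codepoint), using only the standard library:

```python
from statistics import NormalDist

def equal_mass_codepoints(n_levels=16):
    """Simplified NF4-style codebook: place levels at the midpoints of
    equal-probability bins of N(0, 1), then normalize to [-1, 1]."""
    nd = NormalDist()
    pts = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    m = max(abs(p) for p in pts)
    return [p / m for p in pts]

levels = equal_mass_codepoints()
print([round(p, 3) for p in levels])
# Levels cluster densely near 0 (where weights concentrate)
# and spread out in the tails:
gaps = [levels[i + 1] - levels[i] for i in range(len(levels) - 1)]
print(f"smallest gap {min(gaps):.3f} (center) vs largest {max(gaps):.3f} (tail)")
```

Contrast with uniform INT4, where every gap would be identical regardless of where the weight mass actually sits.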
# GPTQ (Frantar et al., 2022) — layer-wise optimal quantization:
from auto_gptq import AutoGPTQForCausalLM

model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device="cuda:0",
    use_triton=True,  # Triton kernels for faster inference
)
# GPTQ uses the Hessian of each layer's loss to minimize quantization error.
# Generally highest accuracy among INT4 methods.
AWQ: Activation-Aware Weight Quantization
# AWQ (Lin et al., 2023) key insight:
# Not all weights are equally important!
# ~1% of channels produce large activations — these are "salient"
# Naively quantizing them to INT4 crushes quality
# AWQ solution:
# 1. Run calibration data, record activation magnitudes per channel
# 2. Scale up salient channels in the weight matrix (per-channel scaling)
# 3. Quantize everything to INT4 — the scaling absorbs the error for salient channels
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
# AWQ vs GPTQ:
# AWQ: faster inference (hand-tuned CUDA/Triton kernels)
# ~25% of FP16 memory
# accuracy: slightly below GPTQ
# GPTQ: higher accuracy (Hessian-based error minimization)
# similar inference speed
# same memory as AWQ
Quantization Comparison Table
Llama 3.1 70B quantization comparison (single H100):
┌──────────┬──────────┬────────────┬──────────┬───────────────────┐
│ Format │ Memory │ Throughput │ MMLU │ Hardware needed │
├──────────┼──────────┼────────────┼──────────┼───────────────────┤
│ FP16 │ 140 GB │ baseline │ 80.9% │ 8× H100 │
│ BF16 │ 140 GB │ +5% │ 80.9% │ 8× H100 │
│ INT8 │ 70 GB │ +10% │ 80.2% │ 2× H100 │
│ GPTQ-4b │ 36 GB │ +30% │ 79.8% │ 1× H100 │
│ AWQ-4b │ 36 GB │ +35% │ 79.5% │ 1× H100 │
│ GGUF-Q4 │ 38 GB │ CPU ok │ 79.1% │ CPU or 1× H100 │
└──────────┴──────────┴────────────┴──────────┴───────────────────┘
Speculative Decoding: Free Lunch Exists
The Idea: Draft Fast, Verify in Parallel
Standard decode:
70B model generates 1 token = 1 forward pass = ~10ms
100 tokens = ~1000ms = 1 second
Speculative decoding (Leviathan et al., 2023):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Draft model (7B) generates K tokens quickly
["Paris"] ["is"] ["the"] ["capital"]
4 tokens in ~2ms (7B model)
Step 2: Target model (70B) verifies all K tokens in ONE forward pass!
Process all 4 draft tokens in parallel → ~10ms
(same cost as generating just 1 token normally)
Step 3: Verify each draft token:
"Paris" ✅ "is" ✅ "the" ✅ "capital" ❌
→ accept 3 tokens, reject from position 4
Step 4: Resample from target model distribution at rejection point
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Result: 3 accepted tokens in ~12ms (vs 30ms standard decode)
Speedup: 2.5× (varies with acceptance rate ~70–90%)
Quality: ZERO degradation (target model is the arbiter)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def speculative_decode(
    target_model,
    draft_model,
    input_ids,
    max_new_tokens=100,
    K=4,               # number of draft tokens per speculation
    temperature=1.0,
):
    """
    Speculative decoding: draft model proposes K tokens,
    target model verifies them all in one forward pass.
    Guarantees exactly the same distribution as target-only decoding.
    """
    generated = input_ids.clone()
    while generated.shape[1] < input_ids.shape[1] + max_new_tokens:
        # --- Phase 1: Draft model generates K candidates ---
        draft_ids = []
        draft_probs = []  # full draft distribution at each position
        ctx = generated.clone()
        for _ in range(K):
            with torch.no_grad():
                out = draft_model(ctx)
            logits = out.logits[:, -1, :] / max(temperature, 1e-5)
            probs = torch.softmax(logits, dim=-1)
            tok = torch.multinomial(probs, 1)
            draft_ids.append(tok)
            draft_probs.append(probs[0])
            ctx = torch.cat([ctx, tok], dim=1)
        # --- Phase 2: Target model verifies K positions simultaneously ---
        candidate = torch.cat([generated] + draft_ids, dim=1)
        with torch.no_grad():
            tgt_out = target_model(candidate)
        # logits at the positions that predict each draft token
        tgt_logits = tgt_out.logits[:, generated.shape[1] - 1:-1, :]
        tgt_probs = torch.softmax(tgt_logits / max(temperature, 1e-5), dim=-1)
        # --- Phase 3: Accept/reject each draft token ---
        n_accepted = 0
        for i in range(K):
            token_id = draft_ids[i][0, 0].item()
            p_target = tgt_probs[0, i, token_id].item()
            p_draft = draft_probs[i][token_id].item()
            # Acceptance probability: min(1, p_target / p_draft)
            accept_p = min(1.0, p_target / max(p_draft, 1e-8))
            if torch.rand(1).item() < accept_p:
                generated = torch.cat([generated, draft_ids[i]], dim=1)
                n_accepted += 1
            else:
                # Reject: resample from the residual distribution
                # max(0, p_target(x) - p_draft(x)), renormalized
                diff = torch.clamp(tgt_probs[0, i] - draft_probs[i], min=0)
                diff = diff / diff.sum().clamp(min=1e-8)
                new_tok = torch.multinomial(diff.unsqueeze(0), 1)
                generated = torch.cat([generated, new_tok], dim=1)
                break
        if n_accepted == K:
            # All accepted: also take the target model's bonus token
            bonus_logits = tgt_out.logits[:, -1, :] / max(temperature, 1e-5)
            bonus_probs = torch.softmax(bonus_logits, dim=-1)
            bonus_tok = torch.multinomial(bonus_probs, 1)
            generated = torch.cat([generated, bonus_tok], dim=1)
    return generated
# Real speedups observed (A100, Llama-2 70B + Llama-2 7B draft):
# K=4: 2.3× speedup, acceptance rate ~80%
# K=8: 2.7× speedup, acceptance rate ~75%
# Optimal K depends on draft/target quality ratio
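The K/acceptance trade-off has a closed form: with per-token acceptance rate a, each iteration emits (1 - a^(K+1)) / (1 - a) tokens in expectation (accepted tokens plus the resampled or bonus token), as derived by Leviathan et al. A sketch, where c is the draft/target cost ratio (the values below are illustrative):

```python
def expected_speedup(a, K, c):
    """Expected speedup from speculative decoding:
    a = per-token acceptance rate, K = draft tokens per iteration,
    c = draft/target cost ratio. One iteration costs K draft passes
    plus 1 target pass (in units of a target forward pass)."""
    tokens_per_iter = (1 - a ** (K + 1)) / (1 - a)
    cost_per_iter = K * c + 1
    return tokens_per_iter / cost_per_iter

# Illustrative: 7B draft for a 70B target -> c ≈ 0.1
for K in (2, 4, 8):
    print(f"K={K}: {expected_speedup(a=0.8, K=K, c=0.1):.2f}x")
```

Past a point, larger K mostly adds draft cost: each extra draft token is accepted with probability a^K, which shrinks fast.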
Tensor Parallelism and Pipeline Parallelism
Tensor Parallelism: Split Layers Across GPUs
Tensor Parallelism (Shoeybi et al., 2019 — Megatron-LM):
70B model, 8 GPUs, 64 attention heads:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GPU 0: heads 0–7 (W_q slice: 1/8 of full matrix)
GPU 1: heads 8–15
GPU 2: heads 16–23
GPU 3: heads 24–31
GPU 4: heads 32–39
GPU 5: heads 40–47
GPU 6: heads 48–55
GPU 7: heads 56–63
Each GPU computes its heads independently,
then All-Reduce merges results.
Communication cost:
1 All-Reduce per attention layer
1 All-Reduce per FFN layer
NVLink (H100): 900 GB/s bidirectional → viable
PCIe: 64 GB/s → too slow for TP>2
import torch
import torch.distributed as dist

def tensor_parallel_linear(x, W_local, rank, world_size):
    """
    Column-parallel linear (W split along the output dimension).
    x: (batch, seq, d_model) -- replicated on all GPUs
    W_local: (d_model, d_out//world_size) -- each GPU holds a shard
    """
    # Each GPU computes its output shard
    out_local = x @ W_local  # (batch, seq, d_out//world_size)
    # All-Gather to reconstruct the full output on every GPU
    out_list = [torch.zeros_like(out_local) for _ in range(world_size)]
    dist.all_gather(out_list, out_local)
    out_full = torch.cat(out_list, dim=-1)  # (batch, seq, d_out)
    return out_full

# For row-parallel (W split along the input dimension):
def tensor_parallel_linear_row(x_local, W_local, rank, world_size):
    """
    Row-parallel linear: x is already sharded across GPUs.
    x_local: (batch, seq, d_in//world_size)
    W_local: (d_in//world_size, d_out)
    """
    partial = x_local @ W_local  # (batch, seq, d_out) -- partial sum
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # sum partial results
    return partial
Pipeline Parallelism: Split Layers Sequentially
Pipeline Parallelism:
70B model, 80 layers, 4 GPUs:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GPU 0: layers 0–19 (embedding + first 20 transformer layers)
GPU 1: layers 20–39
GPU 2: layers 40–59
GPU 3: layers 60–79 + LM head
With micro-batching to hide pipeline bubbles (m=5 micro-batches, forward passes only):
        step:  1     2     3     4     5     6     7     8
GPU 0:       [f1]  [f2]  [f3]  [f4]  [f5]
GPU 1:             [f1]  [f2]  [f3]  [f4]  [f5]
GPU 2:                   [f1]  [f2]  [f3]  [f4]  [f5]
GPU 3:                         [f1]  [f2]  [f3]  [f4]  [f5]
(idle slots during fill and drain are the "pipeline bubble")
Pipeline bubble ratio = (p - 1) / (m + p - 1)
p = number of pipeline stages
m = number of micro-batches
→ Larger m = smaller bubble = better efficiency
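The bubble formula in numbers, for a 4-stage pipeline:

```python
def bubble_ratio(p, m):
    """Fraction of pipeline slot-time lost to fill/drain bubbles:
    p = pipeline stages, m = micro-batches."""
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"p=4 stages, m={m:>2} micro-batches: bubble = {bubble_ratio(4, m):.0%}")
```

With m=1 (no micro-batching) a 4-stage pipeline idles 75% of the time; with m=64 the bubble drops below 5%.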
vLLM vs TGI vs TensorRT-LLM: Framework Comparison
LLM serving framework comparison (as of early 2026):
┌───────────────────┬────────────────────────────────────────────────┐
│ Framework │ vLLM │
├───────────────────┼────────────────────────────────────────────────┤
│ Developer │ UC Berkeley / vLLM community │
│ Key innovations │ PagedAttention, continuous batching │
│ Quantization │ AWQ, GPTQ, INT8, FP8 │
│ Throughput │ ★★★★☆ High │
│ TTFT latency │ ★★★☆☆ Medium │
│ Ease of use │ ★★★★★ Very easy (Python-native) │
│ Customizability │ ★★★☆☆ Medium │
│ License │ Apache 2.0 │
│ Notes │ Most active OSS community, OpenAI-compat API │
└───────────────────┴────────────────────────────────────────────────┘
┌───────────────────┬────────────────────────────────────────────────┐
│ Framework │ TGI (Text Generation Inference) │
├───────────────────┼────────────────────────────────────────────────┤
│ Developer │ Hugging Face │
│ Key innovations │ Continuous batching, FlashAttention │
│ Quantization │ GPTQ, AWQ, bitsandbytes │
│ Throughput │ ★★★☆☆ Medium │
│ TTFT latency │ ★★★☆☆ Medium │
│ Ease of use │ ★★★★☆ Easy (Docker-first) │
│ Customizability │ ★★★★☆ High │
│ License           │ Apache 2.0 (re-licensed from HFOIL in 2024)    │
│ Notes │ Native HF ecosystem integration │
└───────────────────┴────────────────────────────────────────────────┘
┌───────────────────┬────────────────────────────────────────────────┐
│ Framework │ TensorRT-LLM │
├───────────────────┼────────────────────────────────────────────────┤
│ Developer │ NVIDIA │
│ Key innovations │ TensorRT graph optimization, in-flight batching│
│ Quantization │ INT8, INT4, FP8, SmoothQuant, AWQ │
│ Throughput │ ★★★★★ Highest (NVIDIA GPUs only) │
│ TTFT latency │ ★★★★★ Lowest │
│ Ease of use │ ★★☆☆☆ Complex (C++ heavy) │
│ Customizability │ ★★☆☆☆ Difficult │
│ License │ Apache 2.0 │
│ Notes │ Best raw performance; use via Triton Server │
└───────────────────┴────────────────────────────────────────────────┘
┌───────────────────┬────────────────────────────────────────────────┐
│ Framework │ llama.cpp / Ollama │
├───────────────────┼────────────────────────────────────────────────┤
│ Developer │ ggerganov / Ollama Inc. │
│ Key innovations │ GGUF quantization, CPU+GPU hybrid │
│ Quantization │ Q2–Q8 (GGUF format) │
│ Throughput │ ★★☆☆☆ Low (on CPU) │
│ TTFT latency │ ★★☆☆☆ High │
│ Ease of use │ ★★★★★ Simplest possible │
│ Customizability │ ★★☆☆☆ Limited │
│ License │ MIT │
│ Notes │ Ideal for local dev, CPU inference, demos │
└───────────────────┴────────────────────────────────────────────────┘
Decision Guide
Choose vLLM if:
- Production serving, Python team, open source preferred
- Need OpenAI-compatible API drop-in replacement
- Want the best community support and newest features fastest
Choose TGI if:
- Deep HuggingFace ecosystem integration
- Docker-first deployment culture
- Need robust SSE streaming out-of-the-box
Choose TensorRT-LLM if:
- Maximum raw performance on NVIDIA hardware
- Have a team comfortable with C++/CUDA tooling
- Enterprise production with dedicated MLOps
Choose Ollama / llama.cpp if:
- Local development, prototyping
- CPU inference required
- Simplicity over performance
Production LLM Serving Stack
Production LLM serving architecture:
Clients
│
▼
Load Balancer (Nginx / AWS ALB / Cloudflare)
│
▼
API Gateway (FastAPI / Kong)
│ Rate limiting, auth, logging, request validation
▼
Router (model selection, priority queue)
│
├──→ vLLM server A: 8B model (fast/cheap requests)
│
├──→ vLLM server B: 70B model (high-quality requests)
│
└──→ vLLM server C: domain-specific fine-tune
│
▼
Observability (Prometheus + Grafana)
Key metrics:
- TTFT p50/p95/p99
- TBT p50/p95/p99
- Throughput (tokens/s)
- GPU utilization %
- KV cache utilization %
- Request queue depth
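Percentile metrics like these can be computed directly from recorded samples. A sketch with synthetic TTFT data (the log-normal shape and its parameters are assumptions for illustration, chosen because latency distributions typically have a long right tail):

```python
import random
import statistics

# Hypothetical TTFT samples in ms (log-normal: a few slow outliers)
rng = random.Random(0)
ttft = [rng.lognormvariate(4.0, 0.5) for _ in range(10_000)]

# n=100 cut points -> the 50th/95th/99th cuts estimate p50/p95/p99
cuts = statistics.quantiles(ttft, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"TTFT p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

In production these come from a Prometheus histogram rather than raw samples, but the interpretation is the same: alert on the tail (p99), not the median.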
# Production vLLM server launch command:
import subprocess

cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Llama-3.1-8B-Instruct",
    "--tensor-parallel-size", "2",      # 2-GPU tensor parallelism
    "--gpu-memory-utilization", "0.9",
    "--max-model-len", "8192",
    "--max-num-seqs", "256",
    "--max-num-batched-tokens", "8192",
    "--quantization", "awq",
    "--enable-prefix-caching",
    "--block-size", "16",
    "--port", "8000",
    "--disable-log-requests",           # reduce logging overhead
]
server = subprocess.Popen(cmd)  # starts the OpenAI-compatible server
# Calling the server from a client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is the GIL in Python?"},
    ],
    temperature=0.0,
    max_tokens=500,
    stream=True,  # streaming response
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
LLM serving optimization sits at the intersection of hardware, algorithms, and systems software. PagedAttention borrowed from operating system design. FlashAttention rediscovered the principle of tiling from numerical linear algebra. Speculative decoding revived draft-and-verify ideas from branch prediction. The engineers who will build the next generation of LLM serving systems are those who understand not just the current tools but the first-principles reasoning behind them.