Complete LLM Inference Optimization Guide 2025: vLLM, TensorRT-LLM, KV Cache, Speculative Decoding
Author: Youngju Kim (@fjvbn20031)
1. Understanding LLM Inference Bottlenecks: Compute-Bound vs Memory-Bound
Before diving into LLM inference optimization, we must first understand exactly where bottlenecks occur.
1.1 Arithmetic Intensity and the Roofline Model
GPU performance is governed by two resources:
| Resource | Unit | A100 80GB | H100 80GB | H200 141GB |
|---|---|---|---|---|
| Compute (FP16) | TFLOPS | 312 | 989 | 989 |
| Memory Bandwidth | TB/s | 2.0 | 3.35 | 4.8 |
| Arithmetic Intensity Boundary | FLOP/byte | 156 | 295 | 206 |
Arithmetic Intensity = Total FLOPs / Total Memory Transfers (Bytes)
- Compute-Bound: When arithmetic intensity exceeds the boundary. Matrix-matrix multiplication (GEMM) is the canonical example
- Memory-Bound: When arithmetic intensity is below the boundary. Attention and decoding are typical examples
1.2 Prefill vs Decode Phases
LLM inference consists of two main phases:
┌──────────────────────────────────────────────────────┐
│ LLM Inference Pipeline │
├──────────────────┬───────────────────────────────────┤
│ Prefill Phase │ Decode Phase │
│ (Prompt Proc.) │ (Token Generation) │
├──────────────────┼───────────────────────────────────┤
│ - Input tokens │ - Generates 1 token at a time │
│ in parallel │ - Memory-Bound │
│ - Compute-Bound │ - Low GPU utilization (5-15%) │
│ - High GPU util │ - Repeats for output length │
│ - Runs once │ - KV Cache read + append │
│ - Creates KV $ │ │
└──────────────────┴───────────────────────────────────┘
Prefill Phase: Processes the entire prompt at once. Dominated by matrix-matrix multiplication (GEMM), making it compute-bound.
Decode Phase: Generates tokens one at a time. Dominated by matrix-vector multiplication (GEMV), making it memory-bound. The entire model weights must be read from memory each step, but actual computation is minimal.
1.3 Why Decode Is Slow
For Llama-2 70B:
- Model weights: ~140 GB (FP16)
- Per decode step: must read 140 GB from memory
- At A100 bandwidth of 2 TB/s: 140 GB / 2 TB/s = 70ms per token
- Actual computation time: ~1ms
Memory reading takes 70x longer than computation. This is the core motivation for LLM inference optimization.
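This back-of-envelope arithmetic is worth sanity-checking in code; the hardware figures come from the table in Section 1.1:

```python
def decode_step_time_ms(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on per-token decode latency: every weight byte must
    cross the memory bus once per decode step."""
    return weight_bytes / bandwidth_bytes_per_s * 1000

# Llama-2 70B FP16 (~140 GB of weights) on an A100 (2 TB/s)
print(f"{decode_step_time_ms(140e9, 2e12):.0f} ms/token")  # 70 ms

# Roofline boundary = peak compute / bandwidth (A100: 312 TFLOPS FP16)
print(f"{312e12 / 2e12:.0f} FLOP/byte")  # 156
```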
2. KV Cache: The Core Data Structure of LLM Inference
2.1 What Is KV Cache
Transformer Self-Attention requires the Key (K) and Value (V) of all previous tokens. KV Cache stores previously computed K and V tensors to avoid recomputation.
# Without KV Cache (full recomputation each step)
# Attention cost of decode step t: O(t^2 * d)
# With KV Cache (reuse previously computed K/V)
# Attention cost of decode step t: O(t * d)
# But KV Cache adds O(n * d) memory for an n-token sequence
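To make the trade concrete, here is a minimal single-head decode loop in NumPy (illustrative only; real implementations batch this across heads and requests and run on GPU). Each step projects only the new token and appends one K/V row to the cache:

```python
import numpy as np

d = 8  # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache = np.empty((0, d))
v_cache = np.empty((0, d))

def decode_step(x):
    """One decode step: project only the NEW token, reuse cached K/V."""
    global k_cache, v_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache = np.vstack([k_cache, k])    # one new row, O(n*d) total memory
    v_cache = np.vstack([v_cache, v])
    scores = q @ k_cache.T / np.sqrt(d)  # O(n*d) attention compute per step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(k_cache.shape)  # (5, 8): one cached K row per generated token
```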
2.2 KV Cache Memory Calculation
KV Cache Size = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_size
Example: Llama-2 70B, seq_len=4096, batch_size=1, FP16
= 2 * 80 * 8 * 128 * 4096 * 1 * 2 bytes
= 1.34 GB (for a single sequence!)
With batch_size=32: 1.34 * 32 = 42.9 GB
| Model | Parameters | KV Cache/token (FP16) | 4K seq x1 | 4K seq x32 |
|---|---|---|---|---|
| Llama-2 7B | 7B | 512 KB | 2.15 GB | 68.7 GB |
| Llama-2 70B | 70B | 320 KB | 1.34 GB | 42.9 GB |
| Mixtral 8x7B | 46.7B | 128 KB | 0.54 GB | 17.2 GB |
| Llama-3.1 405B | 405B | ~504 KB | 2.11 GB | 67.6 GB |
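The formula in 2.2 reduces to a one-line helper (the Llama-2 70B figures below are from its public config: 80 layers, 8 KV heads under GQA, head_dim 128):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, dtype_bytes=2):
    """KV Cache size in bytes; the factor 2 covers K and V."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * dtype_bytes)

# Llama-2 70B, 4K context, batch 1, FP16
size = kv_cache_bytes(80, 8, 128, 4096, 1)
print(f"{size / 1e9:.2f} GB")  # 1.34 GB
```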
2.3 PagedAttention (Core of vLLM)
The problem with traditional approaches: each sequence pre-allocates contiguous memory for the maximum length. In practice, 60-80% is wasted.
┌─────────────────────────────────────────────┐
│ Traditional KV Cache Allocation │
│ │
│ Request 1: [████████░░░░░░░░░░░░] 40% used│
│ Request 2: [████████████░░░░░░░░] 60% used│
│ Request 3: [██░░░░░░░░░░░░░░░░░░] 10% used│
│ ^^^^^^^^ wasted memory │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ PagedAttention KV Cache Allocation │
│ │
│ Physical Blocks: [B0][B1][B2][B3][B4][B5] │
│ │
│ Request 1 -> Page Table: [B0, B3, B5] │
│ Request 2 -> Page Table: [B1, B4]         │
│ Request 3 -> Page Table: [B2] │
│ │
│ Near-zero internal fragmentation │
│ Non-contiguous memory blocks utilized │
│ Copy-on-Write for shared prompts │
└─────────────────────────────────────────────┘
Key ideas of PagedAttention:
- Split KV Cache into fixed-size blocks (pages)
- Use page tables to logically link non-contiguous blocks, like OS virtual memory
- Allocate blocks only on demand, eliminating internal fragmentation
- Copy-on-Write: Requests sharing the same prompt share KV Cache
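These ideas can be sketched as a toy block manager (a hypothetical class for illustration, not vLLM's actual implementation): physical blocks are handed out only when a sequence crosses a block boundary, and a per-request page table maps logical positions to physical blocks.

```python
class BlockManager:
    """Toy page-table allocator: fixed-size KV blocks allocated on demand."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.page_tables: dict[str, list[int]] = {}

    def append_token(self, req_id: str, pos: int) -> int:
        """Return the physical block for token `pos`, allocating a new
        block only when the sequence crosses a block boundary."""
        table = self.page_tables.setdefault(req_id, [])
        if pos % self.block_size == 0:   # entered a new logical block
            table.append(self.free.pop())
        return table[pos // self.block_size]

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free.extend(self.page_tables.pop(req_id, []))

mgr = BlockManager(num_blocks=8, block_size=16)
for pos in range(40):            # a 40-token sequence
    mgr.append_token("req-1", pos)
print(len(mgr.page_tables["req-1"]))  # 3 blocks = ceil(40 / 16)
```

Worst-case waste is under one block per sequence, versus a full max-length allocation in the traditional scheme.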
2.4 Prefix Caching
Reuses KV Cache for repeated system prompts or common prefixes.
# Enable Prefix Caching in vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_prefix_caching=True,
    max_model_len=8192,
)
# Requests using the same system prompt
# share the KV Cache for the system prompt portion
3. Attention Optimization: FlashAttention and MQA/GQA
3.1 FlashAttention: IO-Aware Attention
Problems with standard attention:
- Read Q, K, V matrices from HBM (High Bandwidth Memory)
- Compute S = Q @ K^T and write to HBM
- Compute P = softmax(S) and write to HBM
- Compute O = P @ V and write to HBM
4 HBM read/write round-trips -- this is the bottleneck.
┌──────────────────────────────────────────────┐
│ FlashAttention Core Idea │
│ │
│ GPU Memory Hierarchy: │
│ ┌─────────┐ 19 TB/s ┌─────────────────┐ │
│ │ SRAM │<---------->│ Compute Units │ │
│ │ (20 MB) │ └─────────────────┘ │
│ └────┬────┘ │
│ | 2-4.8 TB/s │
│ ┌────v────────────────┐ │
│ │ HBM (80-141 GB) │ │
│ └─────────────────────┘ │
│ │
│ Strategy: Split Q,K,V into tiles (blocks), │
│ perform all computation in SRAM, │
│ write only final results to HBM │
└──────────────────────────────────────────────┘
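The trick that makes tiling possible is the online softmax: partial results can be rescaled as new tiles arrive, so the full n x n score matrix never needs to exist. A NumPy sketch of the rescaling for a single query vector (illustrative; the real kernel also tiles over queries and runs in SRAM):

```python
import numpy as np

def online_softmax_attention(q, K, V, tile=4):
    """Process K/V in tiles, keeping only a running max, running softmax
    denominator, and running weighted output."""
    m = -np.inf            # running max of scores
    s = 0.0                # running softmax denominator
    o = np.zeros_like(q)   # running (unnormalized) output
    d = q.shape[-1]
    for i in range(0, K.shape[0], tile):
        scores = q @ K[i:i + tile].T / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)     # rescale earlier partial results
        p = np.exp(scores - m_new)
        s = s * scale + p.sum()
        o = o * scale + p @ V[i:i + tile]
        m = m_new
    return o / s

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K, V = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))

# Matches the standard (full-matrix) softmax attention exactly
ref = np.exp(q @ K.T / np.sqrt(8) - (q @ K.T / np.sqrt(8)).max())
ref = (ref / ref.sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), ref)
```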
3.2 FlashAttention Version Comparison
| Feature | FlashAttention-1 | FlashAttention-2 | FlashAttention-3 |
|---|---|---|---|
| Release | 2022 | 2023 | 2024 |
| Speedup | 2-4x | Additional 2x | Additional 1.5-2x |
| GPU Support | A100 | A100, H100 | H100 (Hopper optimized) |
| Key Optimization | Tiling, recomputation | Improved parallelism, warp splitting | FP8, async copy, pipelining |
| Peak FLOPS Utilization | 25-40% | 50-73% | Up to 740 TFLOPS (~75%) |
3.3 Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)
Architecture-level optimization to reduce KV Cache size:
┌─────────────────────────────────────────────────────┐
│ Multi-Head Attention (MHA) │
│ Q heads: [H1][H2][H3][H4][H5][H6][H7][H8] │
│ K heads: [H1][H2][H3][H4][H5][H6][H7][H8] │
│ V heads: [H1][H2][H3][H4][H5][H6][H7][H8] │
│ KV Cache: 8x │
├─────────────────────────────────────────────────────┤
│ Multi-Query Attention (MQA) │
│ Q heads: [H1][H2][H3][H4][H5][H6][H7][H8] │
│ K heads: [ H_shared ] │
│ V heads: [ H_shared ] │
│ KV Cache: 1x (8x reduction) │
├─────────────────────────────────────────────────────┤
│ Grouped-Query Attention (GQA, 2 groups) │
│ Q heads: [H1][H2][H3][H4] | [H5][H6][H7][H8] │
│ K heads: [ K_group1 ] | [ K_group2 ] │
│ V heads: [ V_group1 ] | [ V_group2 ] │
│ KV Cache: 2x (4x reduction) │
└─────────────────────────────────────────────────────┘
| Model | Attention Type | KV Heads | Q Heads | KV Cache Reduction |
|---|---|---|---|---|
| GPT-J 6B | MHA | 16 | 16 | 1x |
| Falcon-40B | MQA | 1 | 64 | 64x |
| Llama-2 70B | GQA | 8 | 64 | 8x |
| Llama-3 70B | GQA | 8 | 64 | 8x |
| Mistral 7B | GQA | 8 | 32 | 4x |
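In implementation terms, GQA simply replicates each cached KV head across its group of query heads before the usual dot product. A NumPy sketch with illustrative shapes (8 query heads, 2 KV heads):

```python
import numpy as np

def gqa_scores(q, k, num_q_heads=8, num_kv_heads=2):
    """Each group of Q heads attends against one shared K head:
    replicate the KV heads across the group, then take dot products."""
    group = num_q_heads // num_kv_heads
    k_expanded = np.repeat(k, group, axis=0)  # (num_q_heads, seq, head_dim)
    return np.einsum("hd,hsd->hs", q, k_expanded)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))      # 8 query heads
k = rng.standard_normal((2, 10, 64))  # only 2 K heads cached (4x smaller cache)
print(gqa_scores(q, k).shape)         # (8, 10): full score matrix per Q head
```

The cache stores only `num_kv_heads` heads, which is exactly where the table's reduction factors come from.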
4. Batching Strategies: Static vs Continuous
4.1 Limitations of Static Batching
Static Batching (traditional):
Time ──────────────────────────────────>
Req 1: [████████████████████████████████] (long response)
Req 2: [████████░░░░░░░░░░░░░░░░░░░░░░] (short response)
Req 3: [██████████████░░░░░░░░░░░░░░░░] (medium response)
Req 4: [WAIT WAIT WAIT WAIT WAIT WAIT ] (waiting)
░ = GPU idle (padding), WAIT = waiting for batch to complete
Entire batch must finish before next batch starts -> very low throughput
4.2 Continuous Batching (In-Flight Batching)
Continuous Batching:
Time ──────────────────────────────────>
Req 1: [████████████████████████████████]
Req 2: [████████]
Req 3: [██████████████]
Req 4: [████████████████]
Req 5: [████████]
Completed requests immediately removed -> new requests immediately added
GPU idle time minimized -> 10-20x throughput improvement
Core principles of Continuous Batching:
- Every iteration, remove completed requests from the batch
- Immediately add waiting requests to the batch
- GPU always operates at maximum load
- Individual request latency also improves (reduced wait time)
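The scheduling loop above can be mimicked with a toy simulator (hypothetical code; real schedulers also track KV block budgets and preemption). Each iteration decodes one token per active request, evicts finished ones, and admits waiting ones:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching scheduler; returns total decode iterations.
    `requests` is a list of (req_id, tokens_to_generate) pairs."""
    queue = deque(requests)
    active, iterations = {}, 0
    while queue or active:
        while queue and len(active) < max_batch:  # admit work immediately
            rid, need = queue.popleft()
            active[rid] = need
        for rid in list(active):                  # one decode step each
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]                   # slot freed this iteration
        iterations += 1
    return iterations

# 4 short requests + 1 long one: the long request no longer
# holds the whole batch hostage, as it would under static batching
print(continuous_batching([("a", 2), ("b", 2), ("c", 2), ("d", 2), ("e", 8)]))  # 10
```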
4.3 Chunked Prefill
Solves the problem of long prompt prefill blocking decode requests.
# vLLM chunked prefill configuration
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # max tokens per iteration
)
# A long prompt (e.g., 32K tokens) is split into 2048-token chunks
# Decode requests can also be processed between chunks
# Slightly increases TTFT but improves overall system throughput and ITL
5. Speculative Decoding: A Game Changer for Inference Speed
5.1 Core Idea
A small Draft Model quickly predicts multiple tokens, and a large Target Model verifies them all in a single forward pass.
┌────────────────────────────────────────────────────┐
│ Speculative Decoding Flow │
│ │
│ Step 1: Draft Model (small, fast) │
│ "The capital of France is" -> [Paris][,][a][city] │
│ 4 tokens predicted very quickly (4ms) │
│ │
│ Step 2: Target Model (large, accurate) │
│ Single forward pass verifies all 4 tokens │
│ [Paris OK] [, OK] [a FAIL->"known"] [city FAIL] │
│ │
│ Result: "Paris, known" (2 accepted + 1 corrected) │
│ Before: 3 forward passes needed -> now 1 │
│ Speedup: ~2-3x │
└────────────────────────────────────────────────────┘
5.2 Mathematical Guarantee: Preserving Output Quality
The key advantage of Speculative Decoding is that it exactly preserves the target model's output distribution.
Acceptance/rejection probability:
- For draft token x: acceptance probability = min(1, p_target(x) / p_draft(x))
- On rejection: resample from the normalized residual distribution max(0, p_target(x) - p_draft(x))
Through this process, the final output has a mathematically identical distribution to using only the target model.
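The acceptance rule is short enough to write out directly. A sketch for a single draft token over a toy 3-token vocabulary; empirically the output frequencies match p_target, which is the whole point:

```python
import numpy as np

def verify_draft_token(x, p_target, p_draft, rng):
    """Accept draft token x with prob min(1, p_t(x)/p_d(x));
    on rejection, resample from the normalized residual max(p_t - p_d, 0)."""
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x
    residual = np.maximum(p_target - p_draft, 0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual)

p_target = np.array([0.6, 0.3, 0.1])
p_draft  = np.array([0.3, 0.5, 0.2])
rng = np.random.default_rng(0)
samples = [verify_draft_token(rng.choice(3, p=p_draft), p_target, p_draft, rng)
           for _ in range(20_000)]
print(np.bincount(samples) / len(samples))  # ~[0.6, 0.3, 0.1]
```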
5.3 Speculative Decoding Variants
# 1. Separate Draft Model
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
# 2. Medusa Heads (additional MLP heads predict multiple positions)
# No draft model needed - adds lightweight heads to target model itself
# Requires training but minimal memory overhead
# 3. EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)
# Draft model reuses target model's hidden states
# Higher acceptance rate than separate draft models
5.4 Tree Attention
Verifies multiple candidate sequences simultaneously in a tree structure.
Token position: 1 2 3
+-- Paris --+-- is -- ...
The -------------+ +-- was -- ...
+-- Lyon --- is -- ...
+-- capital - of -- ...
All tree paths verified in a single forward pass
-> Maximizes acceptance rate, improves throughput
6. Quantization for Inference Acceleration
6.1 Data Type Comparison
| Data Type | Bits | Range | Memory Savings | Quality Impact |
|---|---|---|---|---|
| FP32 | 32 | Very wide | Baseline | Baseline |
| FP16 | 16 | Wide | 2x | Negligible |
| BF16 | 16 | Same as FP32 | 2x | Negligible |
| FP8 (E4M3) | 8 | Medium | 4x | Very small |
| INT8 | 8 | -128 to 127 | 4x | Small |
| INT4 | 4 | -8 to 7 | 8x | Moderate |
| NF4 | 4 | Normal dist. optimized | 8x | Less than INT4 |
6.2 Quantization Technique Comparison
┌───────────────────────────────────────────────────────┐
│ Quantization Technique Classification │
├─────────────────────┬─────────────────────────────────┤
│ Post-Training │ Training-Aware │
│ Quantization(PTQ) │ Quantization │
├─────────────────────┼─────────────────────────────────┤
│ - GPTQ (INT4) │ - QLoRA + Merge │
│ - AWQ (INT4) │ - QAT (Quantization-Aware │
│ - GGUF (various) │ Training) │
│ - bitsandbytes │ │
│ - SmoothQuant │ │
│ - FP8 Dynamic │ │
└─────────────────────┴─────────────────────────────────┘
6.3 Major Quantization Formats in Detail
# GPTQ: Layer-wise optimal quantization (OBQ-based)
# Pros: Good quality even at INT4, optimized for GPU inference
# Cons: Requires calibration data, slow quantization
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    desc_act=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=gptq_config,
    device_map="auto",
)
# AWQ: Activation-aware Weight Quantization
# Key: Finds and protects important weight channels (based on activation magnitude)
# Faster quantization than GPTQ, similar quality
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct"
)
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct"
)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}
model.quantize(tokenizer, quant_config=quant_config)
# bitsandbytes: Simple INT8/NF4 quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)
6.4 GGUF: Format for CPU/Metal Inference
The quantization format used by llama.cpp, supporting various quantization levels.
| GGUF Quantization | Bits | Method | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2-3 | K-quant mixed | Low | Very fast |
| Q4_K_M | 4-5 | K-quant medium | Good | Fast |
| Q5_K_M | 5-6 | K-quant medium | Very good | Medium |
| Q6_K | 6 | K-quant | Near original | Slow |
| Q8_0 | 8 | Uniform quant | Same as original | Slow |
| F16 | 16 | No quantization | Original | Slowest |
7. Serving Framework Comparison: vLLM vs TensorRT-LLM vs TGI
7.1 Comprehensive Comparison
| Feature | vLLM | TensorRT-LLM | TGI | Ollama | llama.cpp |
|---|---|---|---|---|---|
| Developer | UC Berkeley | NVIDIA | Hugging Face | Ollama | ggerganov |
| Language | Python/C++ | C++/Python | Rust/Python | Go | C/C++ |
| PagedAttention | Yes | Yes | Yes | No | No |
| Continuous Batching | Yes | Yes | Yes | No | No |
| Tensor Parallelism | Yes | Yes | Yes | No | No |
| FP8 Support | Yes | Yes (optimal) | Yes | No | No |
| Speculative Decoding | Yes | Yes | Limited | No | Yes |
| LoRA Serving | Yes (multi) | Yes | Yes | Yes | Yes |
| Vision Models | Yes | Yes | Yes | Yes | Yes (some) |
| CPU Inference | Limited | No | No | Yes | Yes (optimal) |
| Metal (Apple) | No | No | No | Yes | Yes |
| Install Difficulty | Easy | Hard | Easy | Very easy | Medium |
| Production Ready | High | High | High | Low | Medium |
7.2 Throughput Benchmarks (Llama-3.1 8B, A100 80GB)
| Framework | Throughput (tok/s) | TTFT (ms) | ITL (ms) | Memory Usage |
|---|---|---|---|---|
| vLLM (FP16) | 4,200 | 45 | 12 | 18 GB |
| vLLM (AWQ-4bit) | 6,800 | 32 | 8 | 7 GB |
| TensorRT-LLM (FP16) | 4,800 | 38 | 10 | 17 GB |
| TensorRT-LLM (FP8) | 7,500 | 28 | 7 | 10 GB |
| TGI (FP16) | 3,600 | 52 | 14 | 18 GB |
| llama.cpp (Q4_K_M) | 120 | 200 | 35 | 5 GB |
8. vLLM Deep Dive: Architecture to LoRA Serving
8.1 vLLM Architecture
┌──────────────────────────────────────────┐
│ vLLM Architecture │
│ │
│ ┌─────────┐ ┌──────────────────┐ │
│ │ FastAPI │---->│ LLM Engine │ │
│ │ Server │ │ │ │
│ └─────────┘ │ ┌────────────┐ │ │
│ │ │ Scheduler │ │ │
│ ┌─────────┐ │ │ (Batching) │ │ │
│ │ OpenAI │---->│ └─────┬──────┘ │ │
│ │ compat │ │ | │ │
│ └─────────┘ │ ┌─────v──────┐ │ │
│ │ │ Block Mgr │ │ │
│ │ │ (PagedAttn) │ │ │
│ │ └─────┬──────┘ │ │
│ │ | │ │
│ │ ┌─────v──────┐ │ │
│ │ │ Worker(s) │ │ │
│ │ │ (GPU exec) │ │ │
│ │ └────────────┘ │ │
│ └──────────────────┘ │
└──────────────────────────────────────────┘
8.2 vLLM Production Deployment
# Start vLLM server (OpenAI API compatible)
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
# --tensor-parallel-size 4 \
# --max-model-len 32768 \
# --gpu-memory-utilization 0.90 \
# --enable-prefix-caching \
# --enable-chunked-prefill \
# --max-num-batched-tokens 4096 \
# --port 8000
# API call via Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"},
    ],
    max_tokens=512,
    temperature=0.7,
)
8.3 vLLM Multi-LoRA Serving
Serve multiple LoRA adapters simultaneously from a single base model.
# vllm serve meta-llama/Llama-3.1-8B-Instruct \
# --enable-lora \
# --lora-modules \
# sql-lora=./adapters/sql-lora \
# code-lora=./adapters/code-lora \
# chat-lora=./adapters/chat-lora \
# --max-loras 3 \
# --max-lora-rank 64
# Select LoRA adapter by model name in API call
response = client.chat.completions.create(
    model="sql-lora",  # LoRA adapter name
    messages=[{"role": "user", "content": "SELECT ..."}],
)
8.4 vLLM Vision Model Serving
# Multimodal model serving
# vllm serve Qwen/Qwen2-VL-7B-Instruct \
# --max-model-len 8192 \
# --limit-mm-per-prompt image=4
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="key")
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)
9. TensorRT-LLM Deep Dive: The Choice for Maximum Performance
9.1 TensorRT-LLM Build Pipeline
┌────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐
│HF Model│---->│ Convert │---->│ TRT-LLM │---->│ Triton │
│(source) │ │ Checkpoint│ │ Engine │ │ Serving │
└────────┘ └───────────┘ └──────────┘ └──────────┘
Apply quant Compile optim API server
# Step 1: Checkpoint conversion + FP8 quantization
python convert_checkpoint.py \
--model_dir meta-llama/Llama-3.1-70B-Instruct \
--output_dir ./checkpoint_fp8 \
--dtype bfloat16 \
--tp_size 4 \
--pp_size 1 \
--use_fp8
# Step 2: Build TensorRT engine
trtllm-build \
--checkpoint_dir ./checkpoint_fp8 \
--output_dir ./engine_fp8 \
--gemm_plugin auto \
--max_batch_size 64 \
--max_input_len 4096 \
--max_seq_len 8192 \
--paged_kv_cache enable \
--use_paged_context_fmha enable \
--workers 4
9.2 TensorRT-LLM FP8 Optimization
Maximizes utilization of H100 GPU's FP8 Tensor Cores.
| Configuration | Throughput (Llama-3.1 70B, 4xH100) | Latency |
|---|---|---|
| FP16, TP=4 | 2,400 tok/s | 16ms ITL |
| FP8, TP=4 | 4,200 tok/s | 9ms ITL |
| FP8 + Speculative | 5,800 tok/s | 6ms ITL |
| INT4 AWQ, TP=2 | 3,800 tok/s | 11ms ITL |
9.3 Inflight Batching (TensorRT-LLM)
TensorRT-LLM's implementation of Continuous Batching.
# Triton Inference Server + TensorRT-LLM backend
# model_config.pbtxt configuration
"""
backend: "tensorrtllm"
max_batch_size: 64
model_transaction_policy {
decoupled: True # Streaming response support
}
parameters: {
key: "batching_type"
value: {string_value: "inflight"} # Enable Inflight Batching
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {string_value: "131072"} # Limit KV Cache token count
}
"""
10. Model Parallelism: Multi-GPU Strategies
10.1 Tensor Parallelism (TP)
Splits a single layer across multiple GPUs.
Tensor Parallelism (TP=4):
Layer N weight matrix W
┌──────┬──────┬──────┬──────┐
│ W_1 │ W_2 │ W_3 │ W_4 │
│GPU 0 │GPU 1 │GPU 2 │GPU 3 │
└──┬───┴──┬───┴──┬───┴──┬───┘
| | | |
v v v v
[part1] [part2] [part3] [part4]
| | | |
└──────┴──────┴──────┘
All-Reduce
(aggregate results)
Pros: Reduces latency (all GPUs compute simultaneously)
Cons: Requires inter-GPU communication (NVLink recommended)
Best for: GPUs within same node (low latency needed)
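The diagram's split-then-All-Reduce pattern reduces to a few lines of NumPy (a row-parallel sketch under simulated "GPUs"; real systems use NCCL collectives over device memory):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16))   # activation, replicated on every GPU
W = rng.standard_normal((16, 32))  # full weight matrix of one layer

# Row-parallel split across 4 "GPUs": each device holds a slice of the
# input dimension and produces a partial output; summing the partials
# is the All-Reduce step shown in the diagram.
x_shards = np.split(x, 4, axis=1)
w_shards = np.split(W, 4, axis=0)
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
y = sum(partials)                  # All-Reduce (sum) over devices

assert np.allclose(y, x @ W)       # identical to the unsplit computation
```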
10.2 Pipeline Parallelism (PP)
Distributes layers sequentially across GPUs.
Pipeline Parallelism (PP=4, 80 layers):
GPU 0: [Layer 0-19] -> GPU 1: [Layer 20-39]
-> GPU 2: [Layer 40-59]
-> GPU 3: [Layer 60-79]
Pros: Minimal inter-GPU communication (one direction)
Cons: Pipeline bubbles (GPU idle time)
Best for: Cross-node distribution (higher latency tolerable)
10.3 Expert Parallelism (EP) - For MoE Models
Distributes experts in Mixture of Experts models.
Expert Parallelism (Mixtral 8x7B, EP=4):
GPU 0: Expert 0, 1 + Shared Layers
GPU 1: Expert 2, 3 + Shared Layers
GPU 2: Expert 4, 5 + Shared Layers
GPU 3: Expert 6, 7 + Shared Layers
Token routing: Each token sent to Top-2 Experts
-> Requires All-to-All communication between GPUs
10.4 Practical Parallelism Combinations
# Serving Llama-3.1 405B (8x H100 80GB)
# Model size: ~810 GB (FP16) -> FP8 ~405 GB
# Option 1: TP=8 (all layers split across all GPUs)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 8 \
--max-model-len 16384
# Option 2: TP=4, PP=2 (4 GPUs per pipeline stage, 2 stages)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--max-model-len 16384
11. Advanced GPU Memory Optimization
11.1 KV Cache Quantization
# KV Cache FP8 quantization in vLLM
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
# --tensor-parallel-size 4 \
# --kv-cache-dtype fp8 \
# --quantization fp8
# KV Cache memory savings:
# FP16 KV Cache: 1.34 GB / sequence (Llama-2 70B, 4K)
# FP8 KV Cache: 0.67 GB / sequence (50% savings)
# 2x more concurrent requests on same GPU
11.2 Memory Allocation Strategy
GPU Memory Distribution (A100 80GB, Llama-3.1 70B FP16):
┌─────────────────────────────────┐
│ Model Weights: ~35 GB (TP=4) │ 43.75%
├─────────────────────────────────┤
│ KV Cache: ~35 GB │ 43.75%
│ (gpu_memory_utilization=0.90) │
├─────────────────────────────────┤
│ Activation Memory: ~2 GB │ 2.5%
├─────────────────────────────────┤
│ System Reserved: ~8 GB │ 10%
└─────────────────────────────────┘
KV Cache determines the maximum number of concurrent requests.
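Given the budget above, the maximum concurrency falls out of simple division (a hypothetical helper; vLLM computes the equivalent figure from its block pool at startup):

```python
def max_concurrent_seqs(gpu_mem_gb, util, weight_gb, kv_per_seq_gb,
                        overhead_gb=2.0):
    """Sequences that fit once weights and runtime overhead are subtracted."""
    kv_budget_gb = gpu_mem_gb * util - weight_gb - overhead_gb
    return int(kv_budget_gb // kv_per_seq_gb)

# Figures from the diagram above: 80 GB GPU at 90% utilization,
# ~35 GB of weights per GPU, 1.34 GB of KV Cache per 4K sequence
print(max_concurrent_seqs(80, 0.90, 35, 1.34))  # 26 concurrent sequences
```

Halving KV Cache per sequence (FP8 KV Cache, Section 11.1) roughly doubles this number.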
11.3 Strategies When Running Low on Memory
| Strategy | Implementation | Effect | Side Effect |
|---|---|---|---|
| Quantization | FP16 to INT4 | 4x weight reduction | Slight quality loss |
| KV Cache Quant | FP16 to FP8 | 2x KV Cache reduction | Negligible |
| Reduce max_model_len | 32K to 8K | Proportional KV Cache reduction | No long contexts |
| Increase TP | TP=2 to TP=4 | Half memory per GPU | Extra GPU cost |
| Prefix Caching | Shared system prompts | Large savings for repeated requests | No effect on unique requests |
12. Cost Analysis: tokens/dollar Across Platforms
12.1 Self-Hosting Cost Comparison
| GPU | Cloud Hourly Cost | Llama-3.1 70B Throughput | tokens/dollar |
|---|---|---|---|
| A100 80GB x1 | ~3.0 USD | 800 tok/s (INT4) | 960K |
| A100 80GB x4 (TP=4) | ~12.0 USD | 2,800 tok/s | 840K |
| H100 80GB x1 | ~4.5 USD | 1,500 tok/s (FP8) | 1,200K |
| H100 80GB x4 (TP=4) | ~18.0 USD | 5,000 tok/s (FP8) | 1,000K |
| L40S x1 | ~1.5 USD | 600 tok/s (INT4) | 1,440K |
| 4090 x1 (own server) | ~0.3 USD (power) | 400 tok/s (INT4) | 4,800K |
12.2 API vs Self-Hosting Break-Even Point
Monthly token usage cost comparison (Llama-3.1 70B class):
┌─────────────────────────────────────────────────┐
│ Cost │
│ ($) │
│ 5000| /API │
│ | / │
│ 3000| ---------- Self-Hosted │
│ | / (H100x4 monthly fixed) │
│ 1000| / API │
│ | / │
│ 0+--+----+----+----+----+----+------> │
│ 0 2B 5B 10B 20B 50B 100B tokens/mo │
│ │
│ Break-even: ~10B tokens/month │
└─────────────────────────────────────────────────┘
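The break-even point in the chart is just fixed cost divided by the per-token API price. With illustrative prices (4x H100 at ~$18/hr, API at ~$1.30 per 1M tokens; neither is a quote):

```python
# Self-hosted: fixed monthly cost regardless of volume
fixed_usd_per_month = 18 * 24 * 30        # ~$12,960/month
# API: pay-as-you-go per million tokens
api_usd_per_mtok = 1.30

breakeven_mtok = fixed_usd_per_month / api_usd_per_mtok
print(f"~{breakeven_mtok / 1000:.0f}B tokens/month")  # ~10B
```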
13. Benchmarking: How to Measure Correctly
13.1 Core Metrics
| Metric | Definition | Why It Matters |
|---|---|---|
| TTFT (Time To First Token) | Time until first token generated | User-perceived response start |
| ITL (Inter-Token Latency) | Time between tokens | Perceived streaming speed |
| E2E Latency | Total request completion time | Total wait time |
| Throughput | Tokens generated per second | Overall system capacity |
| TPS/User | Tokens per second per user | Individual perceived speed |
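Given per-token timestamps from a streaming client, these metrics reduce to simple differences (a hypothetical helper, not part of any benchmark tool):

```python
def latency_metrics(request_start, token_times):
    """TTFT = first token minus request start; ITL = gaps between tokens;
    E2E = last token minus request start. Times in seconds."""
    ttft = token_times[0] - request_start
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    e2e = token_times[-1] - request_start
    return {"ttft": ttft, "mean_itl": sum(itl) / len(itl), "e2e": e2e}

m = latency_metrics(0.0, [0.05, 0.06, 0.08, 0.09])
print(m)  # ttft 0.05 s, mean ITL ~13 ms, e2e 0.09 s
```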
13.2 Benchmarking Tools and Methods
# vLLM built-in benchmark (recommended)
# After running: python -m vllm.entrypoints.openai.api_server
# python benchmarks/benchmark_serving.py \
# --backend vllm \
# --model meta-llama/Llama-3.1-8B-Instruct \
# --dataset-name sharegpt \
# --dataset-path ShareGPT_V3_unfiltered.json \
# --num-prompts 1000 \
# --request-rate 10 \
# --endpoint /v1/completions
# Example results:
# Successful requests: 1000
# Benchmark duration (s): 105.23
# Total input tokens: 215000
# Total generated tokens: 180000
# Request throughput (req/s): 9.50
# Output token throughput (tok/s): 1710.5
# Mean TTFT (ms): 48.2
# Median TTFT (ms): 42.1
# P99 TTFT (ms): 125.3
# Mean ITL (ms): 11.8
# Median ITL (ms): 10.2
# P99 ITL (ms): 35.7
13.3 Performance Characteristics Under Load
Throughput vs Latency relationship (as concurrency increases):
Throughput Latency
(tok/s) (ms)
| ┌────────── | /
| / | /
| / | /
| / | /
| / | /
| / | /
| / | /
| / |/
+──────────────> concurrency +──────────────> concurrency
Optimal operating point: Just before throughput saturates (Knee point)
Usually GPU utilization of 70-80%
14. Production Deployment Architecture
14.1 Production Serving Architecture
┌──────────────────────────────────────────────────┐
│ Production Architecture │
│ │
│ Client -> Load Balancer -> API Gateway │
│ | │
│ ┌─────────┼─────────┐ │
│ v v v │
│ ┌─────────┐┌────────┐┌────────┐ │
│ │ vLLM ││ vLLM ││ vLLM │ │
│ │ Pod 1 ││ Pod 2 ││ Pod 3 │ │
│ │ (4xH100)││(4xH100)││(4xH100)│ │
│ └────┬────┘└───┬────┘└───┬────┘ │
│ | | | │
│ ┌────v─────────v─────────v────┐ │
│ │ Prometheus + Grafana │ │
│ │ (Metrics collection) │ │
│ └─────────────────────────────┘ │
│ │
│ Autoscaling: Based on queue length / GPU util │
│ Health Check: /health endpoint │
│ Graceful Shutdown: Complete in-flight requests │
└──────────────────────────────────────────────────┘
14.2 Kubernetes Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=meta-llama/Llama-3.1-70B-Instruct"
        - "--tensor-parallel-size=4"
        - "--max-model-len=16384"
        - "--gpu-memory-utilization=0.90"
        - "--enable-prefix-caching"
        - "--enable-chunked-prefill"
        resources:
          limits:
            nvidia.com/gpu: "4"
            memory: "64Gi"
          requests:
            nvidia.com/gpu: "4"
            memory: "32Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      nodeSelector:
        gpu-type: h100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
15. Quiz
Q1. What core problem does vLLM's PagedAttention solve?
Answer: It solves the memory fragmentation problem of KV Cache.
Traditional approaches pre-allocate contiguous memory for the maximum sequence length per request, wasting 60-80%. PagedAttention splits KV Cache into fixed-size blocks (pages) and uses page tables to logically link non-contiguous blocks, like OS virtual memory:
- Near-zero internal fragmentation
- Non-contiguous memory blocks utilized
- Copy-on-Write for shared prompt KV Cache
This enables 2-4x more concurrent requests on the same GPU memory.
Q2. Why does Continuous Batching achieve higher throughput than Static Batching?
Answer: Static Batching waits for all requests in a batch to complete before starting the next batch. Even when short responses finish early, the GPU sits idle.
Continuous Batching:
- Removes completed requests every iteration
- Immediately adds new requests from the queue
- Keeps GPU at maximum utilization
This achieves 10-20x higher throughput compared to Static Batching. Individual request latency also improves due to reduced waiting time.
Q3. Why does Speculative Decoding not degrade output quality?
Answer: Because it exactly preserves the target model's output distribution mathematically.
For a draft token x:
- Acceptance probability = min(1, p_target(x) / p_draft(x))
- On rejection: resample from the normalized residual distribution max(0, p_target - p_draft)
This process ensures the final output has a mathematically identical distribution to using only the target model. Speed improves while quality loss is zero.
Q4. Why is the LLM Decode phase Memory-Bound?
Answer: In the decode phase, only one token is generated at a time. The entire model weights must be read from memory (matrix-vector multiplication), but actual computation is minimal.
Llama-2 70B example:
- Must read 140 GB model weights each step
- At A100 bandwidth of 2 TB/s: 70ms (memory reading)
- Actual computation time: ~1ms
Memory bandwidth is the bottleneck, making it Memory-Bound. This is why quantization (reducing weight size) and batching (reading weights once for multiple requests) are effective.
Q5. Why is FP8 quantization more suitable for LLM inference than INT8?
Answer: FP8 is a floating-point format with a wide dynamic range. LLM weights and activations have highly varied magnitudes, making FP8 more suitable than fixed-point INT8.
Specifically:
- FP8 E4M3: 4-bit exponent, 3-bit mantissa -- wide range, decent precision
- INT8: Fixed range of -128 to 127 -- vulnerable to outliers
- H100 GPUs have dedicated FP8 Tensor Cores with 2x FP16 compute
- FP8 supports dynamic quantization without calibration
As a result, FP8 matches INT8's throughput gains while maintaining quality close to FP16.
16. References
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - Kwon et al., 2023
- FlashAttention: Fast and Memory-Efficient Exact Attention - Dao et al., 2022
- FlashAttention-2: Faster Attention with Better Parallelism - Dao, 2023
- Efficient Memory Management for Large Language Model Serving with PagedAttention - Kwon et al., 2023
- Fast Inference from Transformers via Speculative Decoding - Leviathan et al., 2023
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Frantar et al., 2023
- AWQ: Activation-aware Weight Quantization - Lin et al., 2024
- TensorRT-LLM - NVIDIA Official Documentation
- Orca: A Distributed Serving System for Transformer-Based Generative Models - Yu et al., 2022
- GQA: Training Generalized Multi-Query Transformer Models - Ainslie et al., 2023
- Medusa: Simple LLM Inference Acceleration Framework - Cai et al., 2024
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty - Li et al., 2024
- SmoothQuant: Accurate and Efficient Post-Training Quantization - Xiao et al., 2023