Complete LLM Inference Optimization Guide 2025: vLLM, TensorRT-LLM, KV Cache, Speculative Decoding
Author: Youngju Kim (@fjvbn20031)
1. Understanding LLM Inference Bottlenecks: Compute-Bound vs Memory-Bound
Before diving into LLM inference optimization, we must first understand exactly where bottlenecks occur.
1.1 Arithmetic Intensity and the Roofline Model
GPU performance is governed by two resources:
| Resource | Unit | A100 80GB | H100 80GB | H200 141GB |
|---|---|---|---|---|
| Compute (FP16) | TFLOPS | 312 | 989 | 989 |
| Memory Bandwidth | TB/s | 2.0 | 3.35 | 4.8 |
| Arithmetic Intensity Boundary | FLOP/byte | 156 | 295 | 206 |
Arithmetic Intensity = Total FLOPs / Total Memory Transfers (Bytes)
- Compute-Bound: When arithmetic intensity exceeds the boundary. Matrix-matrix multiplication (GEMM) is the canonical example
- Memory-Bound: When arithmetic intensity is below the boundary. Attention and decoding are typical examples
1.2 Prefill vs Decode Phases
LLM inference consists of two main phases:
┌──────────────────────────────────────────────────────┐
│ LLM Inference Pipeline │
├──────────────────┬───────────────────────────────────┤
│ Prefill Phase │ Decode Phase │
│ (Prompt Proc.) │ (Token Generation) │
├──────────────────┼───────────────────────────────────┤
│ - Input tokens │ - Generates 1 token at a time │
│ in parallel │ - Memory-Bound │
│ - Compute-Bound │ - Low GPU utilization (5-15%) │
│ - High GPU util │ - Repeats for output length │
│ - Runs once │ - KV Cache read + append │
│ - Creates KV $ │ │
└──────────────────┴───────────────────────────────────┘
Prefill Phase: Processes the entire prompt at once. Dominated by matrix-matrix multiplication (GEMM), making it compute-bound.
Decode Phase: Generates tokens one at a time. Dominated by matrix-vector multiplication (GEMV), making it memory-bound. The entire model weights must be read from memory each step, but actual computation is minimal.
1.3 Why Decode Is Slow
For Llama-2 70B:
- Model weights: ~140 GB (FP16)
- Per decode step: must read 140 GB from memory
- At A100 bandwidth of 2 TB/s: 140 GB / 2 TB/s = 70ms per token
- Actual computation time: ~1ms
Memory reading takes 70x longer than computation. This is the core motivation for LLM inference optimization.
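This back-of-envelope arithmetic is worth sanity-checking in code; the hardware figures come from the table in Section 1.1:

```python
def decode_step_time_ms(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on per-token decode latency: every weight byte must
    cross the memory bus once per decode step."""
    return weight_bytes / bandwidth_bytes_per_s * 1000

# Llama-2 70B FP16 (~140 GB of weights) on an A100 (2 TB/s)
print(f"{decode_step_time_ms(140e9, 2e12):.0f} ms/token")  # 70 ms

# Roofline boundary = peak compute / bandwidth (A100: 312 TFLOPS FP16)
print(f"{312e12 / 2e12:.0f} FLOP/byte")  # 156
```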
2. KV Cache: The Core Data Structure of LLM Inference
2.1 What Is KV Cache
Transformer Self-Attention requires the Key (K) and Value (V) of all previous tokens. KV Cache stores previously computed K and V tensors to avoid recomputation.
# Without KV Cache (full recomputation each step)
# Attention cost of decode step t: O(t^2 * d)
# With KV Cache (reuse previously computed K/V)
# Attention cost of decode step t: O(t * d)
# But KV Cache adds O(n * d) memory for an n-token sequence
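To make the trade concrete, here is a minimal single-head decode loop in NumPy (illustrative only; real implementations batch this across heads and requests and run on GPU). Each step projects only the new token and appends one K/V row to the cache:

```python
import numpy as np

d = 8  # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache = np.empty((0, d))
v_cache = np.empty((0, d))

def decode_step(x):
    """One decode step: project only the NEW token, reuse cached K/V."""
    global k_cache, v_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache = np.vstack([k_cache, k])    # one new row, O(n*d) total memory
    v_cache = np.vstack([v_cache, v])
    scores = q @ k_cache.T / np.sqrt(d)  # O(n*d) attention compute per step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(k_cache.shape)  # (5, 8): one cached K row per generated token
```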
2.2 KV Cache Memory Calculation
KV Cache Size = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_size
Example: Llama-2 70B, seq_len=4096, batch_size=1, FP16
= 2 * 80 * 8 * 128 * 4096 * 1 * 2 bytes
= 1.34 GB (for a single sequence!)
With batch_size=32: 1.34 * 32 = 42.9 GB
| Model | Parameters | KV Cache/token (FP16) | 4K seq x1 | 4K seq x32 |
|---|---|---|---|---|
| Llama-2 7B | 7B | 512 KB | 2.15 GB | 68.7 GB |
| Llama-2 70B | 70B | 320 KB | 1.34 GB | 42.9 GB |
| Mixtral 8x7B | 46.7B | 128 KB | 0.54 GB | 17.2 GB |
| Llama-3.1 405B | 405B | ~504 KB | 2.11 GB | 67.6 GB |
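The formula in 2.2 reduces to a one-line helper (the Llama-2 70B figures below are from its public config: 80 layers, 8 KV heads under GQA, head_dim 128):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, dtype_bytes=2):
    """KV Cache size in bytes; the factor 2 covers K and V."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * dtype_bytes)

# Llama-2 70B, 4K context, batch 1, FP16
size = kv_cache_bytes(80, 8, 128, 4096, 1)
print(f"{size / 1e9:.2f} GB")  # 1.34 GB
```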
2.3 PagedAttention (Core of vLLM)
The problem with traditional approaches: each sequence pre-allocates contiguous memory for the maximum length. In practice, 60-80% is wasted.
┌─────────────────────────────────────────────┐
│ Traditional KV Cache Allocation │
│ │
│ Request 1: [████████░░░░░░░░░░░░] 40% used│
│ Request 2: [████████████░░░░░░░░] 60% used│
│ Request 3: [██░░░░░░░░░░░░░░░░░░] 10% used│
│ ^^^^^^^^ wasted memory │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ PagedAttention KV Cache Allocation │
│ │
│ Physical Blocks: [B0][B1][B2][B3][B4][B5] │
│ │
│ Request 1 -> Page Table: [B0, B3, B5] │
│ Request 2 -> Page Table: [B1, B4]         │
│ Request 3 -> Page Table: [B2] │
│ │
│ Near-zero internal fragmentation │
│ Non-contiguous memory blocks utilized │
│ Copy-on-Write for shared prompts │
└─────────────────────────────────────────────┘
Key ideas of PagedAttention:
- Split KV Cache into fixed-size blocks (pages)
- Use page tables to logically link non-contiguous blocks, like OS virtual memory
- Allocate blocks only on demand, eliminating internal fragmentation
- Copy-on-Write: Requests sharing the same prompt share KV Cache
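These ideas can be sketched as a toy block manager (a hypothetical class for illustration, not vLLM's actual implementation): physical blocks are handed out only when a sequence crosses a block boundary, and a per-request page table maps logical positions to physical blocks.

```python
class BlockManager:
    """Toy page-table allocator: fixed-size KV blocks allocated on demand."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.page_tables: dict[str, list[int]] = {}

    def append_token(self, req_id: str, pos: int) -> int:
        """Return the physical block for token `pos`, allocating a new
        block only when the sequence crosses a block boundary."""
        table = self.page_tables.setdefault(req_id, [])
        if pos % self.block_size == 0:   # entered a new logical block
            table.append(self.free.pop())
        return table[pos // self.block_size]

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free.extend(self.page_tables.pop(req_id, []))

mgr = BlockManager(num_blocks=8, block_size=16)
for pos in range(40):            # a 40-token sequence
    mgr.append_token("req-1", pos)
print(len(mgr.page_tables["req-1"]))  # 3 blocks = ceil(40 / 16)
```

Worst-case waste is under one block per sequence, versus a full max-length allocation in the traditional scheme.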
2.4 Prefix Caching
Reuses KV Cache for repeated system prompts or common prefixes.
# Enable Prefix Caching in vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_prefix_caching=True,
    max_model_len=8192,
)
# Requests using the same system prompt
# share the KV Cache for the system prompt portion
3. Attention Optimization: FlashAttention and MQA/GQA
3.1 FlashAttention: IO-Aware Attention
Problems with standard attention:
- Read Q, K, V matrices from HBM (High Bandwidth Memory)
- Compute S = Q @ K^T and write to HBM
- Compute P = softmax(S) and write to HBM
- Compute O = P @ V and write to HBM
4 HBM read/write round-trips -- this is the bottleneck.
┌──────────────────────────────────────────────┐
│ FlashAttention Core Idea │
│ │
│ GPU Memory Hierarchy: │
│ ┌─────────┐ 19 TB/s ┌─────────────────┐ │
│ │ SRAM │<---------->│ Compute Units │ │
│ │ (20 MB) │ └─────────────────┘ │
│ └────┬────┘ │
│ | 2-4.8 TB/s │
│ ┌────v────────────────┐ │
│ │ HBM (80-141 GB) │ │
│ └─────────────────────┘ │
│ │
│ Strategy: Split Q,K,V into tiles (blocks), │
│ perform all computation in SRAM, │
│ write only final results to HBM │
└──────────────────────────────────────────────┘
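The trick that makes tiling possible is the online softmax: partial results can be rescaled as new tiles arrive, so the full n x n score matrix never needs to exist. A NumPy sketch of the rescaling for a single query vector (illustrative; the real kernel also tiles over queries and runs in SRAM):

```python
import numpy as np

def online_softmax_attention(q, K, V, tile=4):
    """Process K/V in tiles, keeping only a running max, running softmax
    denominator, and running weighted output."""
    m = -np.inf            # running max of scores
    s = 0.0                # running softmax denominator
    o = np.zeros_like(q)   # running (unnormalized) output
    d = q.shape[-1]
    for i in range(0, K.shape[0], tile):
        scores = q @ K[i:i + tile].T / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)     # rescale earlier partial results
        p = np.exp(scores - m_new)
        s = s * scale + p.sum()
        o = o * scale + p @ V[i:i + tile]
        m = m_new
    return o / s

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K, V = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))

# Matches the standard (full-matrix) softmax attention exactly
ref = np.exp(q @ K.T / np.sqrt(8) - (q @ K.T / np.sqrt(8)).max())
ref = (ref / ref.sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), ref)
```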
3.2 FlashAttention Version Comparison
| Feature | FlashAttention-1 | FlashAttention-2 | FlashAttention-3 |
|---|---|---|---|
| Release | 2022 | 2023 | 2024 |
| Speedup | 2-4x | Additional 2x | Additional 1.5-2x |
| GPU Support | A100 | A100, H100 | H100 (Hopper optimized) |
| Key Optimization | Tiling, recomputation | Improved parallelism, warp splitting | FP8, async copy, pipelining |
| Peak FLOPS Utilization | 25-40% | 50-73% | Up to 740 TFLOPS (~75%) |
3.3 Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)
Architecture-level optimization to reduce KV Cache size:
┌─────────────────────────────────────────────────────┐
│ Multi-Head Attention (MHA) │
│ Q heads: [H1][H2][H3][H4][H5][H6][H7][H8] │
│ K heads: [H1][H2][H3][H4][H5][H6][H7][H8] │
│ V heads: [H1][H2][H3][H4][H5][H6][H7][H8] │
│ KV Cache: 8x │
├─────────────────────────────────────────────────────┤
│ Multi-Query Attention (MQA) │
│ Q heads: [H1][H2][H3][H4][H5][H6][H7][H8] │
│ K heads: [ H_shared ] │
│ V heads: [ H_shared ] │
│ KV Cache: 1x (8x reduction) │
├─────────────────────────────────────────────────────┤
│ Grouped-Query Attention (GQA, 2 groups) │
│ Q heads: [H1][H2][H3][H4] | [H5][H6][H7][H8] │
│ K heads: [ K_group1 ] | [ K_group2 ] │
│ V heads: [ V_group1 ] | [ V_group2 ] │
│ KV Cache: 2x (4x reduction) │
└─────────────────────────────────────────────────────┘
| Model | Attention Type | KV Heads | Q Heads | KV Cache Reduction |
|---|---|---|---|---|
| GPT-J 6B | MHA | 16 | 16 | 1x |
| Falcon-40B | MQA | 1 | 64 | 64x |
| Llama-2 70B | GQA | 8 | 64 | 8x |
| Llama-3 70B | GQA | 8 | 64 | 8x |
| Mistral 7B | GQA | 8 | 32 | 4x |
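In implementation terms, GQA simply replicates each cached KV head across its group of query heads before the usual dot product. A NumPy sketch with illustrative shapes (8 query heads, 2 KV heads):

```python
import numpy as np

def gqa_scores(q, k, num_q_heads=8, num_kv_heads=2):
    """Each group of Q heads attends against one shared K head:
    replicate the KV heads across the group, then take dot products."""
    group = num_q_heads // num_kv_heads
    k_expanded = np.repeat(k, group, axis=0)  # (num_q_heads, seq, head_dim)
    return np.einsum("hd,hsd->hs", q, k_expanded)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))      # 8 query heads
k = rng.standard_normal((2, 10, 64))  # only 2 K heads cached (4x smaller cache)
print(gqa_scores(q, k).shape)         # (8, 10): full score matrix per Q head
```

The cache stores only `num_kv_heads` heads, which is exactly where the table's reduction factors come from.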
4. Batching Strategies: Static vs Continuous
4.1 Limitations of Static Batching
Static Batching (traditional):
Time ──────────────────────────────────>
Req 1: [████████████████████████████████] (long response)
Req 2: [████████░░░░░░░░░░░░░░░░░░░░░░] (short response)
Req 3: [██████████████░░░░░░░░░░░░░░░░] (medium response)
Req 4: [WAIT WAIT WAIT WAIT WAIT WAIT ] (waiting)
░ = GPU idle (padding), WAIT = waiting for batch to complete
Entire batch must finish before next batch starts -> very low throughput
4.2 Continuous Batching (In-Flight Batching)
Continuous Batching:
Time ──────────────────────────────────>
Req 1: [████████████████████████████████]
Req 2: [████████]
Req 3: [██████████████]
Req 4: [████████████████]
Req 5: [████████]
Completed requests immediately removed -> new requests immediately added
GPU idle time minimized -> 10-20x throughput improvement
Core principles of Continuous Batching:
- Every iteration, remove completed requests from the batch
- Immediately add waiting requests to the batch
- GPU always operates at maximum load
- Individual request latency also improves (reduced wait time)
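The scheduling loop above can be mimicked with a toy simulator (hypothetical code; real schedulers also track KV block budgets and preemption). Each iteration decodes one token per active request, evicts finished ones, and admits waiting ones:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching scheduler; returns total decode iterations.
    `requests` is a list of (req_id, tokens_to_generate) pairs."""
    queue = deque(requests)
    active, iterations = {}, 0
    while queue or active:
        while queue and len(active) < max_batch:  # admit work immediately
            rid, need = queue.popleft()
            active[rid] = need
        for rid in list(active):                  # one decode step each
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]                   # slot freed this iteration
        iterations += 1
    return iterations

# 4 short requests + 1 long one: the long request no longer
# holds the whole batch hostage, as it would under static batching
print(continuous_batching([("a", 2), ("b", 2), ("c", 2), ("d", 2), ("e", 8)]))  # 10
```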
4.3 Chunked Prefill
Solves the problem of long prompt prefill blocking decode requests.
# vLLM chunked prefill configuration
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # max tokens per iteration
)
# A long prompt (e.g., 32K tokens) is split into 2048-token chunks
# Decode requests can also be processed between chunks
# Slightly increases TTFT but improves overall system throughput and ITL
5. Speculative Decoding: A Game Changer for Inference Speed
5.1 Core Idea
A small Draft Model quickly predicts multiple tokens, and a large Target Model verifies them all in a single forward pass.
┌────────────────────────────────────────────────────┐
│ Speculative Decoding Flow │
│ │
│ Step 1: Draft Model (small, fast) │
│ "The capital of France is" -> [Paris][,][a][city] │
│ 4 tokens predicted very quickly (4ms) │
│ │
│ Step 2: Target Model (large, accurate) │
│ Single forward pass verifies all 4 tokens │
│ [Paris OK] [, OK] [a FAIL->"known"] [city FAIL] │
│ │
│ Result: "Paris, known" (2 accepted + 1 corrected) │
│ Before: 3 forward passes needed -> now 1 │
│ Speedup: ~2-3x │
└────────────────────────────────────────────────────┘
5.2 Mathematical Guarantee: Preserving Output Quality
The key advantage of Speculative Decoding is that it exactly preserves the target model's output distribution.
Acceptance/rejection probability:
- For draft token x: acceptance probability = min(1, p_target(x) / p_draft(x))
- On rejection: resample from the normalized residual distribution max(0, p_target(x) - p_draft(x))
Through this process, the final output has a mathematically identical distribution to using only the target model.
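The acceptance rule is short enough to write out directly. A sketch for a single draft token over a toy 3-token vocabulary; empirically the output frequencies match p_target, which is the whole point:

```python
import numpy as np

def verify_draft_token(x, p_target, p_draft, rng):
    """Accept draft token x with prob min(1, p_t(x)/p_d(x));
    on rejection, resample from the normalized residual max(p_t - p_d, 0)."""
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x
    residual = np.maximum(p_target - p_draft, 0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual)

p_target = np.array([0.6, 0.3, 0.1])
p_draft  = np.array([0.3, 0.5, 0.2])
rng = np.random.default_rng(0)
samples = [verify_draft_token(rng.choice(3, p=p_draft), p_target, p_draft, rng)
           for _ in range(20_000)]
print(np.bincount(samples) / len(samples))  # ~[0.6, 0.3, 0.1]
```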
5.3 Speculative Decoding Variants
# 1. Separate Draft Model
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
# 2. Medusa Heads (additional MLP heads predict multiple positions)
# No draft model needed - adds lightweight heads to target model itself
# Requires training but minimal memory overhead
# 3. EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)
# Draft model reuses target model's hidden states
# Higher acceptance rate than separate draft models
5.4 Tree Attention
Verifies multiple candidate sequences simultaneously in a tree structure.
Token position: 1 2 3
+-- Paris --+-- is -- ...
The -------------+ +-- was -- ...
+-- Lyon --- is -- ...
+-- capital - of -- ...
All tree paths verified in a single forward pass
-> Maximizes acceptance rate, improves throughput
6. Quantization for Inference Acceleration
6.1 Data Type Comparison
| Data Type | Bits | Range | Memory Savings | Quality Impact |
|---|---|---|---|---|
| FP32 | 32 | Very wide | Baseline | Baseline |
| FP16 | 16 | Wide | 2x | Negligible |
| BF16 | 16 | Same as FP32 | 2x | Negligible |
| FP8 (E4M3) | 8 | Medium | 4x | Very small |
| INT8 | 8 | -128 to 127 | 4x | Small |
| INT4 | 4 | -8 to 7 | 8x | Moderate |
| NF4 | 4 | Normal dist. optimized | 8x | Less than INT4 |
6.2 Quantization Technique Comparison
┌───────────────────────────────────────────────────────┐
│ Quantization Technique Classification │
├─────────────────────┬─────────────────────────────────┤
│ Post-Training │ Training-Aware │
│ Quantization(PTQ) │ Quantization │
├─────────────────────┼─────────────────────────────────┤
│ - GPTQ (INT4) │ - QLoRA + Merge │
│ - AWQ (INT4) │ - QAT (Quantization-Aware │
│ - GGUF (various) │ Training) │
│ - bitsandbytes │ │
│ - SmoothQuant │ │
│ - FP8 Dynamic │ │
└─────────────────────┴─────────────────────────────────┘
6.3 Major Quantization Formats in Detail
# GPTQ: Layer-wise optimal quantization (OBQ-based)
# Pros: Good quality even at INT4, optimized for GPU inference
# Cons: Requires calibration data, slow quantization
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    desc_act=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=gptq_config,
    device_map="auto",
)
# AWQ: Activation-aware Weight Quantization
# Key: Finds and protects important weight channels (based on activation magnitude)
# Faster quantization than GPTQ, similar quality
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct"
)
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct"
)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}
model.quantize(tokenizer, quant_config=quant_config)
# bitsandbytes: Simple INT8/NF4 quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)
6.4 GGUF: Format for CPU/Metal Inference
The quantization format used by llama.cpp, supporting various quantization levels.
| GGUF Quantization | Bits | Method | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2-3 | K-quant mixed | Low | Very fast |
| Q4_K_M | 4-5 | K-quant medium | Good | Fast |
| Q5_K_M | 5-6 | K-quant medium | Very good | Medium |
| Q6_K | 6 | K-quant | Near original | Slow |
| Q8_0 | 8 | Uniform quant | Same as original | Slow |
| F16 | 16 | No quantization | Original | Slowest |
7. Serving Framework Comparison: vLLM vs TensorRT-LLM vs TGI
7.1 Comprehensive Comparison
| Feature | vLLM | TensorRT-LLM | TGI | Ollama | llama.cpp |
|---|---|---|---|---|---|
| Developer | UC Berkeley | NVIDIA | Hugging Face | Ollama | ggerganov |
| Language | Python/C++ | C++/Python | Rust/Python | Go | C/C++ |
| PagedAttention | Yes | Yes | Yes | No | No |
| Continuous Batching | Yes | Yes | Yes | No | No |
| Tensor Parallelism | Yes | Yes | Yes | No | No |
| FP8 Support | Yes | Yes (optimal) | Yes | No | No |
| Speculative Decoding | Yes | Yes | Limited | No | Yes |
| LoRA Serving | Yes (multi) | Yes | Yes | Yes | Yes |
| Vision Models | Yes | Yes | Yes | Yes | Yes (some) |
| CPU Inference | Limited | No | No | Yes | Yes (optimal) |
| Metal (Apple) | No | No | No | Yes | Yes |
| Install Difficulty | Easy | Hard | Easy | Very easy | Medium |
| Production Ready | High | High | High | Low | Medium |
7.2 Throughput Benchmarks (Llama-3.1 8B, A100 80GB)
| Framework | Throughput (tok/s) | TTFT (ms) | ITL (ms) | Memory Usage |
|---|---|---|---|---|
| vLLM (FP16) | 4,200 | 45 | 12 | 18 GB |
| vLLM (AWQ-4bit) | 6,800 | 32 | 8 | 7 GB |
| TensorRT-LLM (FP16) | 4,800 | 38 | 10 | 17 GB |
| TensorRT-LLM (FP8) | 7,500 | 28 | 7 | 10 GB |
| TGI (FP16) | 3,600 | 52 | 14 | 18 GB |
| llama.cpp (Q4_K_M) | 120 | 200 | 35 | 5 GB |
8. vLLM Deep Dive: Architecture to LoRA Serving
8.1 vLLM Architecture
┌──────────────────────────────────────────┐
│ vLLM Architecture │
│ │
│ ┌─────────┐ ┌──────────────────┐ │
│ │ FastAPI │---->│ LLM Engine │ │
│ │ Server │ │ │ │
│ └─────────┘ │ ┌────────────┐ │ │
│ │ │ Scheduler │ │ │
│ ┌─────────┐ │ │ (Batching) │ │ │
│ │ OpenAI │---->│ └─────┬──────┘ │ │
│ │ compat │ │ | │ │
│ └─────────┘ │ ┌─────v──────┐ │ │
│ │ │ Block Mgr │ │ │
│ │ │ (PagedAttn) │ │ │
│ │ └─────┬──────┘ │ │
│ │ | │ │
│ │ ┌─────v──────┐ │ │
│ │ │ Worker(s) │ │ │
│ │ │ (GPU exec) │ │ │
│ │ └────────────┘ │ │
│ └──────────────────┘ │
└──────────────────────────────────────────┘
8.2 vLLM Production Deployment
# Start vLLM server (OpenAI API compatible)
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
# --tensor-parallel-size 4 \
# --max-model-len 32768 \
# --gpu-memory-utilization 0.90 \
# --enable-prefix-caching \
# --enable-chunked-prefill \
# --max-num-batched-tokens 4096 \
# --port 8000
# API call via Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"},
    ],
    max_tokens=512,
    temperature=0.7,
)
8.3 vLLM Multi-LoRA Serving
Serve multiple LoRA adapters simultaneously from a single base model.
# vllm serve meta-llama/Llama-3.1-8B-Instruct \
# --enable-lora \
# --lora-modules \
# sql-lora=./adapters/sql-lora \
# code-lora=./adapters/code-lora \
# chat-lora=./adapters/chat-lora \
# --max-loras 3 \
# --max-lora-rank 64
# Select LoRA adapter by model name in API call
response = client.chat.completions.create(
    model="sql-lora",  # LoRA adapter name
    messages=[{"role": "user", "content": "SELECT ..."}],
)
8.4 vLLM Vision Model Serving
# Multimodal model serving
# vllm serve Qwen/Qwen2-VL-7B-Instruct \
# --max-model-len 8192 \
# --limit-mm-per-prompt image=4
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="key")
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)
9. TensorRT-LLM Deep Dive: The Choice for Maximum Performance
9.1 TensorRT-LLM Build Pipeline
┌────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐
│HF Model│---->│ Convert │---->│ TRT-LLM │---->│ Triton │
│(source) │ │ Checkpoint│ │ Engine │ │ Serving │
└────────┘ └───────────┘ └──────────┘ └──────────┘
Apply quant Compile optim API server
# Step 1: Checkpoint conversion + FP8 quantization
python convert_checkpoint.py \
--model_dir meta-llama/Llama-3.1-70B-Instruct \
--output_dir ./checkpoint_fp8 \
--dtype bfloat16 \
--tp_size 4 \
--pp_size 1 \
--use_fp8
# Step 2: Build TensorRT engine
trtllm-build \
--checkpoint_dir ./checkpoint_fp8 \
--output_dir ./engine_fp8 \
--gemm_plugin auto \
--max_batch_size 64 \
--max_input_len 4096 \
--max_seq_len 8192 \
--paged_kv_cache enable \
--use_paged_context_fmha enable \
--workers 4
9.2 TensorRT-LLM FP8 Optimization
Maximizes utilization of H100 GPU's FP8 Tensor Cores.
| Configuration | Throughput (Llama-3.1 70B, 4xH100) | Latency |
|---|---|---|
| FP16, TP=4 | 2,400 tok/s | 16ms ITL |
| FP8, TP=4 | 4,200 tok/s | 9ms ITL |
| FP8 + Speculative | 5,800 tok/s | 6ms ITL |
| INT4 AWQ, TP=2 | 3,800 tok/s | 11ms ITL |
9.3 Inflight Batching (TensorRT-LLM)
TensorRT-LLM's implementation of Continuous Batching.
# Triton Inference Server + TensorRT-LLM backend
# model_config.pbtxt configuration
"""
backend: "tensorrtllm"
max_batch_size: 64
model_transaction_policy {
decoupled: True # Streaming response support
}
parameters: {
key: "batching_type"
value: {string_value: "inflight"} # Enable Inflight Batching
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {string_value: "131072"} # Limit KV Cache token count
}
"""
10. Model Parallelism: Multi-GPU Strategies
10.1 Tensor Parallelism (TP)
Splits a single layer across multiple GPUs.
Tensor Parallelism (TP=4):
Layer N weight matrix W
┌──────┬──────┬──────┬──────┐
│ W_1 │ W_2 │ W_3 │ W_4 │
│GPU 0 │GPU 1 │GPU 2 │GPU 3 │
└──┬───┴──┬───┴──┬───┴──┬───┘
| | | |
v v v v
[part1] [part2] [part3] [part4]
| | | |
└──────┴──────┴──────┘
All-Reduce
(aggregate results)
Pros: Reduces latency (all GPUs compute simultaneously)
Cons: Requires inter-GPU communication (NVLink recommended)
Best for: GPUs within same node (low latency needed)
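The diagram's split-then-All-Reduce pattern reduces to a few lines of NumPy (a row-parallel sketch under simulated "GPUs"; real systems use NCCL collectives over device memory):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16))   # activation, replicated on every GPU
W = rng.standard_normal((16, 32))  # full weight matrix of one layer

# Row-parallel split across 4 "GPUs": each device holds a slice of the
# input dimension and produces a partial output; summing the partials
# is the All-Reduce step shown in the diagram.
x_shards = np.split(x, 4, axis=1)
w_shards = np.split(W, 4, axis=0)
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
y = sum(partials)                  # All-Reduce (sum) over devices

assert np.allclose(y, x @ W)       # identical to the unsplit computation
```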
10.2 Pipeline Parallelism (PP)
Distributes layers sequentially across GPUs.
Pipeline Parallelism (PP=4, 80 layers):
GPU 0: [Layer 0-19] -> GPU 1: [Layer 20-39]
-> GPU 2: [Layer 40-59]
-> GPU 3: [Layer 60-79]
Pros: Minimal inter-GPU communication (one direction)
Cons: Pipeline bubbles (GPU idle time)
Best for: Cross-node distribution (higher latency tolerable)
10.3 Expert Parallelism (EP) - For MoE Models
Distributes experts in Mixture of Experts models.
Expert Parallelism (Mixtral 8x7B, EP=4):
GPU 0: Expert 0, 1 + Shared Layers
GPU 1: Expert 2, 3 + Shared Layers
GPU 2: Expert 4, 5 + Shared Layers
GPU 3: Expert 6, 7 + Shared Layers
Token routing: Each token sent to Top-2 Experts
-> Requires All-to-All communication between GPUs
10.4 Practical Parallelism Combinations
# Serving Llama-3.1 405B (8x H100 80GB)
# Model size: ~810 GB (FP16) -> FP8 ~405 GB
# Option 1: TP=8 (all layers split across all GPUs)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 8 \
--max-model-len 16384
# Option 2: TP=4, PP=2 (4 GPUs per pipeline stage, 2 stages)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--max-model-len 16384
11. Advanced GPU Memory Optimization
11.1 KV Cache Quantization
# KV Cache FP8 quantization in vLLM
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
# --tensor-parallel-size 4 \
# --kv-cache-dtype fp8 \
# --quantization fp8
# KV Cache memory savings:
# FP16 KV Cache: 1.34 GB / sequence (Llama-2 70B, 4K)
# FP8 KV Cache: 0.67 GB / sequence (50% savings)
# 2x more concurrent requests on same GPU
11.2 Memory Allocation Strategy
GPU Memory Distribution (A100 80GB, Llama-3.1 70B FP16):
┌─────────────────────────────────┐
│ Model Weights: ~35 GB (TP=4) │ 43.75%
├─────────────────────────────────┤
│ KV Cache: ~35 GB │ 43.75%
│ (gpu_memory_utilization=0.90) │
├─────────────────────────────────┤
│ Activation Memory: ~2 GB │ 2.5%
├─────────────────────────────────┤
│ System Reserved: ~8 GB │ 10%
└─────────────────────────────────┘
KV Cache determines the maximum number of concurrent requests.
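Given the budget above, the maximum concurrency falls out of simple division (a hypothetical helper; vLLM computes the equivalent figure from its block pool at startup):

```python
def max_concurrent_seqs(gpu_mem_gb, util, weight_gb, kv_per_seq_gb,
                        overhead_gb=2.0):
    """Sequences that fit once weights and runtime overhead are subtracted."""
    kv_budget_gb = gpu_mem_gb * util - weight_gb - overhead_gb
    return int(kv_budget_gb // kv_per_seq_gb)

# Figures from the diagram above: 80 GB GPU at 90% utilization,
# ~35 GB of weights per GPU, 1.34 GB of KV Cache per 4K sequence
print(max_concurrent_seqs(80, 0.90, 35, 1.34))  # 26 concurrent sequences
```

Halving KV Cache per sequence (FP8 KV Cache, Section 11.1) roughly doubles this number.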
11.3 Strategies When Running Low on Memory
| Strategy | Implementation | Effect | Side Effect |
|---|---|---|---|
| Quantization | FP16 to INT4 | 4x weight reduction | Slight quality loss |
| KV Cache Quant | FP16 to FP8 | 2x KV Cache reduction | Negligible |
| Reduce max_model_len | 32K to 8K | Proportional KV Cache reduction | No long contexts |
| Increase TP | TP=2 to TP=4 | Half memory per GPU | Extra GPU cost |
| Prefix Caching | Shared system prompts | Large savings for repeated requests | No effect on unique requests |
12. Cost Analysis: tokens/dollar Across Platforms
12.1 Self-Hosting Cost Comparison
| GPU | Cloud Hourly Cost | Llama-3.1 70B Throughput | tokens/dollar |
|---|---|---|---|
| A100 80GB x1 | ~3.0 USD | 800 tok/s (INT4) | 960K |
| A100 80GB x4 (TP=4) | ~12.0 USD | 2,800 tok/s | 840K |
| H100 80GB x1 | ~4.5 USD | 1,500 tok/s (FP8) | 1,200K |
| H100 80GB x4 (TP=4) | ~18.0 USD | 5,000 tok/s (FP8) | 1,000K |
| L40S x1 | ~1.5 USD | 600 tok/s (INT4) | 1,440K |
| 4090 x1 (own server) | ~0.3 USD (power) | 400 tok/s (INT4) | 4,800K |
12.2 API vs Self-Hosting Break-Even Point
Monthly token usage cost comparison (Llama-3.1 70B class):
┌─────────────────────────────────────────────────┐
│ Cost │
│ ($) │
│ 5000| /API │
│ | / │
│ 3000| ---------- Self-Hosted │
│ | / (H100x4 monthly fixed) │
│ 1000| / API │
│ | / │
│ 0+--+----+----+----+----+----+------> │
│ 0 2B 5B 10B 20B 50B 100B tokens/mo │
│ │
│ Break-even: ~10B tokens/month │
└─────────────────────────────────────────────────┘
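The break-even point in the chart is just fixed cost divided by the per-token API price. With illustrative prices (4x H100 at ~$18/hr, API at ~$1.30 per 1M tokens; neither is a quote):

```python
# Self-hosted: fixed monthly cost regardless of volume
fixed_usd_per_month = 18 * 24 * 30        # ~$12,960/month
# API: pay-as-you-go per million tokens
api_usd_per_mtok = 1.30

breakeven_mtok = fixed_usd_per_month / api_usd_per_mtok
print(f"~{breakeven_mtok / 1000:.0f}B tokens/month")  # ~10B
```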
13. Benchmarking: How to Measure Correctly
13.1 Core Metrics
| Metric | Definition | Why It Matters |
|---|---|---|
| TTFT (Time To First Token) | Time until first token generated | User-perceived response start |
| ITL (Inter-Token Latency) | Time between tokens | Perceived streaming speed |
| E2E Latency | Total request completion time | Total wait time |
| Throughput | Tokens generated per second | Overall system capacity |
| TPS/User | Tokens per second per user | Individual perceived speed |
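Given per-token timestamps from a streaming client, these metrics reduce to simple differences (a hypothetical helper, not part of any benchmark tool):

```python
def latency_metrics(request_start, token_times):
    """TTFT = first token minus request start; ITL = gaps between tokens;
    E2E = last token minus request start. Times in seconds."""
    ttft = token_times[0] - request_start
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    e2e = token_times[-1] - request_start
    return {"ttft": ttft, "mean_itl": sum(itl) / len(itl), "e2e": e2e}

m = latency_metrics(0.0, [0.05, 0.06, 0.08, 0.09])
print(m)  # ttft 0.05 s, mean ITL ~13 ms, e2e 0.09 s
```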
13.2 Benchmarking Tools and Methods
# vLLM built-in benchmark (recommended)
# After running: python -m vllm.entrypoints.openai.api_server
# python benchmarks/benchmark_serving.py \
# --backend vllm \
# --model meta-llama/Llama-3.1-8B-Instruct \
# --dataset-name sharegpt \
# --dataset-path ShareGPT_V3_unfiltered.json \
# --num-prompts 1000 \
# --request-rate 10 \
# --endpoint /v1/completions
# Example results:
# Successful requests: 1000
# Benchmark duration (s): 105.23
# Total input tokens: 215000
# Total generated tokens: 180000
# Request throughput (req/s): 9.50
# Output token throughput (tok/s): 1710.5
# Mean TTFT (ms): 48.2
# Median TTFT (ms): 42.1
# P99 TTFT (ms): 125.3
# Mean ITL (ms): 11.8
# Median ITL (ms): 10.2
# P99 ITL (ms): 35.7
13.3 Performance Characteristics Under Load
Throughput vs Latency relationship (as concurrency increases):
Throughput Latency
(tok/s) (ms)
| ┌────────── | /
| / | /
| / | /
| / | /
| / | /
| / | /
| / | /
| / |/
+──────────────> concurrency +──────────────> concurrency
Optimal operating point: Just before throughput saturates (Knee point)
Usually GPU utilization of 70-80%
14. Production Deployment Architecture
14.1 Production Serving Architecture
┌──────────────────────────────────────────────────┐
│ Production Architecture │
│ │
│ Client -> Load Balancer -> API Gateway │
│ | │
│ ┌─────────┼─────────┐ │
│ v v v │
│ ┌─────────┐┌────────┐┌────────┐ │
│ │ vLLM ││ vLLM ││ vLLM │ │
│ │ Pod 1 ││ Pod 2 ││ Pod 3 │ │
│ │ (4xH100)││(4xH100)││(4xH100)│ │
│ └────┬────┘└───┬────┘└───┬────┘ │
│ | | | │
│ ┌────v─────────v─────────v────┐ │
│ │ Prometheus + Grafana │ │
│ │ (Metrics collection) │ │
│ └─────────────────────────────┘ │
│ │
│ Autoscaling: Based on queue length / GPU util │
│ Health Check: /health endpoint │
│ Graceful Shutdown: Complete in-flight requests │
└──────────────────────────────────────────────────┘
14.2 Kubernetes Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=meta-llama/Llama-3.1-70B-Instruct"
        - "--tensor-parallel-size=4"
        - "--max-model-len=16384"
        - "--gpu-memory-utilization=0.90"
        - "--enable-prefix-caching"
        - "--enable-chunked-prefill"
        resources:
          limits:
            nvidia.com/gpu: "4"
            memory: "64Gi"
          requests:
            nvidia.com/gpu: "4"
            memory: "32Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      nodeSelector:
        gpu-type: h100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
15. Quiz
Q1. What core problem does vLLM's PagedAttention solve?
Answer: It solves the memory fragmentation problem of KV Cache.
Traditional approaches pre-allocate contiguous memory for the maximum sequence length per request, wasting 60-80%. PagedAttention splits KV Cache into fixed-size blocks (pages) and uses page tables to logically link non-contiguous blocks, like OS virtual memory:
- Near-zero internal fragmentation
- Non-contiguous memory blocks utilized
- Copy-on-Write for shared prompt KV Cache
This enables 2-4x more concurrent requests on the same GPU memory.
Q2. Why does Continuous Batching achieve higher throughput than Static Batching?
Answer: Static Batching waits for all requests in a batch to complete before starting the next batch. Even when short responses finish early, the GPU sits idle.
Continuous Batching:
- Removes completed requests every iteration
- Immediately adds new requests from the queue
- Keeps GPU at maximum utilization
This achieves 10-20x higher throughput compared to Static Batching. Individual request latency also improves due to reduced waiting time.
Q3. Why does Speculative Decoding not degrade output quality?
Answer: Because it exactly preserves the target model's output distribution mathematically.
For a draft token x:
- Acceptance probability = min(1, p_target(x) / p_draft(x))
- On rejection: resample from the normalized residual distribution max(0, p_target - p_draft)
This process ensures the final output has a mathematically identical distribution to using only the target model. Speed improves while quality loss is zero.
Q4. Why is the LLM Decode phase Memory-Bound?
Answer: In the decode phase, only one token is generated at a time. The entire model weights must be read from memory (matrix-vector multiplication), but actual computation is minimal.
Llama-2 70B example:
- Must read 140 GB model weights each step
- At A100 bandwidth of 2 TB/s: 70ms (memory reading)
- Actual computation time: ~1ms
Memory bandwidth is the bottleneck, making it Memory-Bound. This is why quantization (reducing weight size) and batching (reading weights once for multiple requests) are effective.
Q5. Why is FP8 quantization more suitable for LLM inference than INT8?
Answer: FP8 is a floating-point format with a wide dynamic range. LLM weights and activations have highly varied magnitudes, making FP8 more suitable than fixed-point INT8.
Specifically:
- FP8 E4M3: 4-bit exponent, 3-bit mantissa -- wide range, decent precision
- INT8: Fixed range of -128 to 127 -- vulnerable to outliers
- H100 GPUs have dedicated FP8 Tensor Cores with 2x FP16 compute
- FP8 supports dynamic quantization without calibration
As a result, FP8 matches INT8's throughput gains while maintaining quality close to FP16.
16. References
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - Kwon et al., 2023
- FlashAttention: Fast and Memory-Efficient Exact Attention - Dao et al., 2022
- FlashAttention-2: Faster Attention with Better Parallelism - Dao, 2023
- Efficient Memory Management for Large Language Model Serving with PagedAttention - Kwon et al., 2023
- Fast Inference from Transformers via Speculative Decoding - Leviathan et al., 2023
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Frantar et al., 2023
- AWQ: Activation-aware Weight Quantization - Lin et al., 2024
- TensorRT-LLM - NVIDIA Official Documentation
- Orca: A Distributed Serving System for Transformer-Based Generative Models - Yu et al., 2022
- GQA: Training Generalized Multi-Query Transformer Models - Ainslie et al., 2023
- Medusa: Simple LLM Inference Acceleration Framework - Cai et al., 2024
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty - Li et al., 2024
- SmoothQuant: Accurate and Efficient Post-Training Quantization - Xiao et al., 2023