Complete LLM Inference Optimization Guide 2025: vLLM, TensorRT-LLM, KV Cache, Speculative Decoding


1. Understanding LLM Inference Bottlenecks: Compute-Bound vs Memory-Bound

Before diving into LLM inference optimization, we must first understand exactly where bottlenecks occur.

1.1 Arithmetic Intensity and the Roofline Model

GPU performance is governed by two resources:

| Resource | Unit | A100 80GB | H100 80GB | H200 141GB |
|---|---|---|---|---|
| Compute (FP16) | TFLOPS | 312 | 989 | 989 |
| Memory Bandwidth | TB/s | 2.0 | 3.35 | 4.8 |
| Arithmetic Intensity Boundary | FLOP/byte | 156 | 295 | 206 |

Arithmetic Intensity = Total FLOPs / Total Memory Transfers (Bytes)

  • Compute-Bound: When arithmetic intensity exceeds the boundary. Matrix-matrix multiplication (GEMM) is the canonical example
  • Memory-Bound: When arithmetic intensity is below the boundary. Attention and decoding are typical examples

1.2 Prefill vs Decode Phases

LLM inference consists of two main phases:

┌──────────────────────────────────────────────────────┐
│                LLM Inference Pipeline                │
├──────────────────────┬───────────────────────────────┤
│    Prefill Phase     │         Decode Phase          │
│   (Prompt Proc.)     │      (Token Generation)       │
├──────────────────────┼───────────────────────────────┤
│ - Input tokens       │ - Generates 1 token at a time │
│   in parallel        │ - Memory-Bound                │
│ - Compute-Bound      │ - Low GPU utilization (5-15%) │
│ - High GPU util      │ - Repeats for output length   │
│ - Runs once          │ - KV Cache read + append      │
│ - Creates KV Cache   │                               │
└──────────────────────┴───────────────────────────────┘

Prefill Phase: Processes the entire prompt at once. Dominated by matrix-matrix multiplication (GEMM), making it compute-bound.

Decode Phase: Generates tokens one at a time. Dominated by matrix-vector multiplication (GEMV), making it memory-bound. The entire model weights must be read from memory each step, but actual computation is minimal.

1.3 Why Decode Is Slow

For Llama-2 70B:

  • Model weights: ~140 GB (FP16)
  • Per decode step: must read 140 GB from memory
  • At A100 bandwidth of 2 TB/s: 140 GB / 2 TB/s = 70ms per token
  • Actual computation time: ~1ms

Memory reading takes 70x longer than computation. This is the core motivation for LLM inference optimization.
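The 70 ms figure falls out of a one-line calculation. A quick sketch (the helper name is ours):

```python
def decode_latency_ms(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token decode latency at batch size 1:
    every step streams all model weights from HBM at least once."""
    return weight_bytes / bandwidth_bytes_per_s * 1000

# Llama-2 70B in FP16 (~140 GB) on an A100 (2 TB/s)
latency = decode_latency_ms(140e9, 2e12)  # ~70 ms/token
```

The same formula explains why quantization helps: halving `weight_bytes` halves the memory-bound floor on per-token latency.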


2. KV Cache: The Core Data Structure of LLM Inference

2.1 What Is KV Cache

Transformer Self-Attention requires the Key (K) and Value (V) of all previous tokens. KV Cache stores previously computed K and V tensors to avoid recomputation.

# Without KV Cache: step t recomputes K/V for all t previous tokens
# -> total K/V computation for n tokens: O(n^2 * d)

# With KV Cache: each step computes K/V for the 1 new token only
# -> total K/V computation for n tokens: O(n * d)
# Cost: O(n * d) additional memory for the cache
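A toy single-head sketch of the idea (projections collapsed to identity for brevity): each decode step appends one K/V row to the cache and attends over everything cached so far, instead of recomputing K/V for the whole prefix.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 64
rng = np.random.default_rng(0)
K_cache = np.zeros((0, d))  # grows by one row per decoded token
V_cache = np.zeros((0, d))

for step in range(5):
    x = rng.standard_normal(d)         # hidden state of the newest token
    K_cache = np.vstack([K_cache, x])  # append only the new K/V (no recompute)
    V_cache = np.vstack([V_cache, x])
    out = attend(x, K_cache, V_cache)  # attends over all cached tokens
```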

2.2 KV Cache Memory Calculation

KV Cache Size = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_size

Example: Llama-2 70B, seq_len=4096, batch_size=1, FP16
= 2 * 80 * 8 * 128 * 4096 * 1 * 2 bytes
= 1.34 GB (for a single sequence!)

With batch_size=32: 1.34 * 32 = 42.9 GB

| Model | Parameters | KV Cache/token (FP16) | 4K seq ×1 | 4K seq ×32 |
|---|---|---|---|---|
| Llama-2 7B | 7B | 800 KB | 3.2 GB | 102 GB |
| Llama-2 70B | 70B | 320 KB | 1.34 GB | 42.9 GB |
| Mixtral 8x7B | 46.7B | 640 KB | 2.56 GB | 81.9 GB |
| Llama-3 405B | 405B | 1.6 MB | 6.4 GB | 204 GB |
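The formula is mechanical enough to script. A small helper (our naming) reproduces the Llama-2 70B figure:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, dtype_bytes=2):
    """2x for K and V, per layer, per KV head, per token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * dtype_bytes)

# Llama-2 70B: 80 layers, 8 KV heads (GQA), head_dim 128, 4K context, FP16
size = kv_cache_bytes(80, 8, 128, 4096, 1)
print(f"{size / 1e9:.2f} GB")  # -> 1.34 GB
```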

2.3 PagedAttention (Core of vLLM)

The problem with traditional approaches: each sequence pre-allocates contiguous memory for the maximum length. In practice, 60-80% is wasted.

┌─────────────────────────────────────────────┐
│ Traditional KV Cache Allocation             │
│                                             │
│ Request 1: [████████░░░░░░░░░░░░]  40% used │
│ Request 2: [████████████░░░░░░░░]  60% used │
│ Request 3: [██░░░░░░░░░░░░░░░░░░]  10% used │
│                 ^^^^^^^^ wasted memory      │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ PagedAttention KV Cache Allocation          │
│                                             │
│ Physical Blocks: [B0][B1][B2][B3][B4][B5]...│
│                                             │
│ Request 1 -> Page Table: [B0, B3, B5]       │
│ Request 2 -> Page Table: [B1, B4, B6, B7]   │
│ Request 3 -> Page Table: [B2]               │
│                                             │
│ Near-zero internal fragmentation            │
│ Non-contiguous memory blocks utilized       │
│ Copy-on-Write for shared prompts            │
└─────────────────────────────────────────────┘

Key ideas of PagedAttention:

  1. Split KV Cache into fixed-size blocks (pages)
  2. Use page tables to logically link non-contiguous blocks, like OS virtual memory
  3. Allocate blocks only on demand, eliminating internal fragmentation
  4. Copy-on-Write: Requests sharing the same prompt share KV Cache
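The ideas above can be sketched as a toy block manager (class and method names are ours, not vLLM's API): physical blocks are allocated lazily, one at a time, and returned to the pool when a request finishes.

```python
class BlockManager:
    """Toy page-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.page_tables = {}                # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens written

    def append_token(self, req_id):
        """Allocate a new physical block only when the last one is full."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.page_tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.page_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

mgr = BlockManager(num_blocks=8, block_size=4)
for _ in range(9):                        # request r1 decodes 9 tokens
    mgr.append_token("r1")
blocks_used = len(mgr.page_tables["r1"])  # ceil(9/4) = 3 blocks, no padding
mgr.release("r1")                         # all 3 blocks return to the pool
```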

2.4 Prefix Caching

Reuses KV Cache for repeated system prompts or common prefixes.

# Enable Prefix Caching in vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_prefix_caching=True,
    max_model_len=8192,
)

# Requests using the same system prompt
# share the KV Cache for the system prompt portion

3. Attention Optimization: FlashAttention and MQA/GQA

3.1 FlashAttention: IO-Aware Attention

Problems with standard attention:

  1. Read Q, K, V matrices from HBM (High Bandwidth Memory)
  2. Compute S = Q @ K^T and write to HBM
  3. Compute P = softmax(S) and write to HBM
  4. Compute O = P @ V and write to HBM

4 HBM read/write round-trips -- this is the bottleneck.

┌──────────────────────────────────────────────┐
│           FlashAttention Core Idea           │
│                                              │
│  GPU Memory Hierarchy:                       │
│  ┌─────────┐  19 TB/s   ┌─────────────────┐  │
│  │  SRAM   │<---------->│  Compute Units  │  │
│  │ (20 MB) │            └─────────────────┘  │
│  └────┬────┘                                 │
│       | 2-4.8 TB/s                           │
│  ┌────v────────────────┐                     │
│  │    HBM (80-141 GB)  │                     │
│  └─────────────────────┘                     │
│                                              │
│  Strategy: Split Q,K,V into tiles (blocks),  │
│  perform all computation in SRAM,            │
│  write only final results to HBM             │
└──────────────────────────────────────────────┘
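The tiling strategy hinges on the online softmax: partial results can be rescaled as new tiles arrive, so the full score matrix never has to be materialized. A NumPy sketch for a single query row (illustrative, not the CUDA kernel):

```python
import numpy as np

def flash_attention_row(q, K, V, tile=128):
    """Attention for one query, streaming over K/V tiles with a
    running max (m) and running softmax denominator (l)."""
    d = q.shape[0]
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for i in range(0, K.shape[0], tile):
        s = K[i:i+tile] @ q / np.sqrt(d)  # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)         # rescale previous partial sums
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i+tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((512, 64))
V = rng.standard_normal((512, 64))
out = flash_attention_row(q, K, V)  # matches standard softmax attention
```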

3.2 FlashAttention Version Comparison

| Feature | FlashAttention-1 | FlashAttention-2 | FlashAttention-3 |
|---|---|---|---|
| Release | 2022 | 2023 | 2024 |
| Speedup | 2-4x | Additional 2x | Additional 1.5-2x |
| GPU Support | A100 | A100, H100 | H100 (Hopper optimized) |
| Key Optimization | Tiling, recomputation | Improved parallelism, warp splitting | FP8, async copy, pipelining |
| FLOPS Utilization | 50-70% | 70-80% | Up to 740 TFLOPS (75%) |

3.3 Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)

Architecture-level optimization to reduce KV Cache size:

┌─────────────────────────────────────────────────────┐
│ Multi-Head Attention (MHA)                          │
│ Q heads: [H1][H2][H3][H4][H5][H6][H7][H8]           │
│ K heads: [H1][H2][H3][H4][H5][H6][H7][H8]           │
│ V heads: [H1][H2][H3][H4][H5][H6][H7][H8]           │
│ KV Cache: 8x                                        │
├─────────────────────────────────────────────────────┤
│ Multi-Query Attention (MQA)                         │
│ Q heads: [H1][H2][H3][H4][H5][H6][H7][H8]           │
│ K heads: [            H_shared              ]       │
│ V heads: [            H_shared              ]       │
│ KV Cache: 1x (8x reduction)                         │
├─────────────────────────────────────────────────────┤
│ Grouped-Query Attention (GQA, 2 groups)             │
│ Q heads: [H1][H2][H3][H4] | [H5][H6][H7][H8]        │
│ K heads: [   K_group1   ] | [   K_group2   ]        │
│ V heads: [   V_group1   ] | [   V_group2   ]        │
│ KV Cache: 2x (4x reduction)                         │
└─────────────────────────────────────────────────────┘
| Model | Attention Type | KV Heads | Q Heads | KV Cache Reduction |
|---|---|---|---|---|
| GPT-J 6B | MHA | 16 | 16 | 1x |
| Falcon-40B | MQA | 1 | 64 | 64x |
| Llama-2 70B | GQA | 8 | 64 | 8x |
| Llama-3 70B | GQA | 8 | 64 | 8x |
| Mistral 7B | GQA | 8 | 32 | 4x |
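A minimal NumPy sketch of the GQA score computation (the names are illustrative): each cached KV head is broadcast across its group of query heads, which is exactly what shrinks the cache relative to MHA.

```python
import numpy as np

def gqa_scores(Q, K):
    """Q: (num_q_heads, d); K: (num_kv_heads, seq_len, d).
    Each group of num_q_heads // num_kv_heads query heads
    shares one cached K head."""
    num_q_heads, d = Q.shape
    group = num_q_heads // K.shape[0]
    K_rep = np.repeat(K, group, axis=0)  # broadcast: (num_q_heads, seq_len, d)
    return np.einsum('hd,hsd->hs', Q, K_rep) / np.sqrt(d)

Q = np.ones((8, 64))      # 8 query heads
K = np.ones((2, 16, 64))  # only 2 KV heads cached -> 4x KV Cache reduction
scores = gqa_scores(Q, K)
```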

4. Batching Strategies: Static vs Continuous

4.1 Limitations of Static Batching

Static Batching (traditional):
Time ──────────────────────────────────>

Req 1: [████████████████████████████████]  (long response)
Req 2: [████████░░░░░░░░░░░░░░░░░░░░░░]  (short response)
Req 3: [██████████████░░░░░░░░░░░░░░░░]  (medium response)
Req 4: [WAIT WAIT WAIT WAIT WAIT WAIT ]  (waiting)

░ = GPU idle (padding), WAIT = waiting for batch to complete
Entire batch must finish before next batch starts -> very low throughput

4.2 Continuous Batching (In-Flight Batching)

Continuous Batching:
Time ──────────────────────────────────>

Req 1: [████████████████████████████████]
Req 2: [████████]
Req 3:          [██████████████]
Req 4:                  [████████████████]
Req 5:                          [████████]

Completed requests immediately removed -> new requests immediately added
GPU idle time minimized -> 10-20x throughput improvement

Core principles of Continuous Batching:

  1. Every iteration, remove completed requests from the batch
  2. Immediately add waiting requests to the batch
  3. GPU always operates at maximum load
  4. Individual request latency also improves (reduced wait time)
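The scheduling loop above can be sketched in a few lines (a toy model, ignoring prefill and KV memory limits; names are ours): one token per in-flight request per iteration, with eviction and admission at every step.

```python
from collections import deque

def run_continuous_batching(requests, max_batch=4):
    """Toy iteration-level scheduler: every step decodes one token for
    each request in the batch, evicts finished requests immediately,
    and admits waiting ones. Returns total decode iterations."""
    queue = deque(requests)  # (request id, tokens left to generate)
    batch, iterations = {}, 0
    while queue or batch:
        while queue and len(batch) < max_batch:  # admit up to capacity
            rid, remaining = queue.popleft()
            batch[rid] = remaining
        for rid in list(batch):                  # one token per request
            batch[rid] -= 1
            if batch[rid] == 0:
                del batch[rid]  # slot freed now, not at end of batch
        iterations += 1
    return iterations

# One long + three short requests share the batch; the long request
# does not block the short ones from finishing and freeing their slots.
iters = run_continuous_batching([("long", 32), ("s1", 8), ("s2", 8), ("s3", 8)])
```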

4.3 Chunked Prefill

Solves the problem of long prompt prefill blocking decode requests.

# vLLM chunked prefill configuration
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # max tokens per iteration
)

# A long prompt (e.g., 32K tokens) is split into 2048-token chunks
# Decode requests can also be processed between chunks
# Slightly increases TTFT but improves overall system throughput and ITL

5. Speculative Decoding: A Game Changer for Inference Speed

5.1 Core Idea

A small Draft Model quickly predicts multiple tokens, and a large Target Model verifies them all in a single forward pass.

┌────────────────────────────────────────────────────┐
│ Speculative Decoding Flow                          │
│                                                    │
│ Step 1: Draft Model (small, fast)                  │
│ "The capital of France is" -> [Paris][,][a][city]  │
│ 4 tokens predicted very quickly (4ms)              │
│                                                    │
│ Step 2: Target Model (large, accurate)             │
│ Single forward pass verifies all 4 tokens          │
│ [Paris OK] [, OK] [a FAIL->"known"] [city FAIL]    │
│                                                    │
│ Result: "Paris, known" (2 accepted + 1 corrected)  │
│ Before: 3 forward passes needed -> now 1           │
│ Speedup: ~2-3x                                     │
└────────────────────────────────────────────────────┘

5.2 Mathematical Guarantee: Preserving Output Quality

The key advantage of Speculative Decoding is that it exactly preserves the target model's output distribution.

Acceptance/rejection probability:

  • For draft token x: acceptance probability = min(1, p_target(x) / p_draft(x))
  • On rejection: resample from (p_target(x) - p_draft(x)) distribution

Through this process, the final output has a mathematically identical distribution to using only the target model.
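The acceptance rule is easy to verify empirically. A NumPy sketch of a single draft-token step (sample the draft token from p_draft, then accept or resample from the residual):

```python
import numpy as np

def speculative_step(p_target, p_draft, rng):
    """Draft one token from p_draft, then accept/reject so the emitted
    token is distributed exactly according to p_target."""
    x = rng.choice(len(p_draft), p=p_draft)
    # Accept with probability min(1, p_target(x) / p_draft(x))
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x
    # On rejection, resample from the normalized residual (p_target - p_draft)+
    residual = np.maximum(p_target - p_draft, 0.0)
    return rng.choice(len(p_target), p=residual / residual.sum())

p_target = np.array([0.5, 0.3, 0.2])
p_draft = np.array([0.2, 0.3, 0.5])  # deliberately mismatched draft
rng = np.random.default_rng(0)
samples = [speculative_step(p_target, p_draft, rng) for _ in range(20_000)]
# The empirical distribution of `samples` matches p_target, not p_draft.
```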

5.3 Speculative Decoding Variants

# 1. Separate Draft Model
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# 2. Medusa Heads (additional MLP heads predict multiple positions)
# No draft model needed - adds lightweight heads to target model itself
# Requires training but minimal memory overhead

# 3. EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)
# Draft model reuses target model's hidden states
# Higher acceptance rate than separate draft models

5.4 Tree Attention

Verifies multiple candidate sequences simultaneously in a tree structure.

Token position:    1        2        3
                 +-- Paris --+-- is -- ...
The -------------+           +-- was -- ...
                 +-- Lyon --- is -- ...
                 +-- capital - of -- ...

All tree paths verified in a single forward pass
-> Maximizes acceptance rate, improves throughput

6. Quantization for Inference Acceleration

6.1 Data Type Comparison

| Data Type | Bits | Range | Memory Savings | Quality Impact |
|---|---|---|---|---|
| FP32 | 32 | Very wide | Baseline | Baseline |
| FP16 | 16 | Wide | 2x | Negligible |
| BF16 | 16 | Same as FP32 | 2x | Negligible |
| FP8 (E4M3) | 8 | Medium | 4x | Very small |
| INT8 | 8 | -128 to 127 | 4x | Small |
| INT4 | 4 | -8 to 7 | 8x | Moderate |
| NF4 | 4 | Normal dist. optimized | 8x | Less than INT4 |
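To make the INT8 row concrete, here is a sketch of symmetric per-tensor quantization, the simplest possible scheme (production methods like GPTQ/AWQ use per-group scales and smarter rounding):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: the scale maps max|w| onto 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in weight tensor
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()  # worst-case rounding error
```

The round-trip error is bounded by half a quantization step (scale / 2), which is why outliers hurt: one large weight inflates the scale, and thus the error, for the whole tensor.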

6.2 Quantization Technique Comparison

┌───────────────────────────────────────────────────────┐
│         Quantization Technique Classification         │
├─────────────────────┬─────────────────────────────────┤
│ Post-Training       │ Training-Aware                  │
│ Quantization (PTQ)  │ Quantization                    │
├─────────────────────┼─────────────────────────────────┤
│ - GPTQ (INT4)       │ - QLoRA + Merge                 │
│ - AWQ (INT4)        │ - QAT (Quantization-Aware       │
│ - GGUF (various)    │   Training)                     │
│ - bitsandbytes      │                                 │
│ - SmoothQuant       │                                 │
│ - FP8 Dynamic       │                                 │
└─────────────────────┴─────────────────────────────────┘

6.3 Major Quantization Formats in Detail

# GPTQ: Layer-wise optimal quantization (OBQ-based)
# Pros: Good quality even at INT4, optimized for GPU inference
# Cons: Requires calibration data, slow quantization

from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    desc_act=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=gptq_config,
    device_map="auto",
)

# AWQ: Activation-aware Weight Quantization
# Key: Finds and protects important weight channels (based on activation magnitude)
# Faster quantization than GPTQ, similar quality

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct"
)
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct"
)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}
model.quantize(tokenizer, quant_config=quant_config)

# bitsandbytes: Simple INT8/NF4 quantization
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)

6.4 GGUF: Format for CPU/Metal Inference

The quantization format used by llama.cpp, supporting various quantization levels.

| GGUF Quantization | Bits | Method | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2-3 | K-quant mixed | Low | Very fast |
| Q4_K_M | 4-5 | K-quant medium | Good | Fast |
| Q5_K_M | 5-6 | K-quant medium | Very good | Medium |
| Q6_K | 6 | K-quant | Near original | Slow |
| Q8_0 | 8 | Uniform quant | Same as original | Slow |
| F16 | 16 | No quantization | Original | Slowest |

7. Serving Framework Comparison: vLLM vs TensorRT-LLM vs TGI

7.1 Comprehensive Comparison

| Feature | vLLM | TensorRT-LLM | TGI | Ollama | llama.cpp |
|---|---|---|---|---|---|
| Developer | UC Berkeley | NVIDIA | Hugging Face | Ollama | ggerganov |
| Language | Python/C++ | C++/Python | Rust/Python | Go | C/C++ |
| PagedAttention | Yes | Yes | Yes | No | No |
| Continuous Batching | Yes | Yes | Yes | No | No |
| Tensor Parallelism | Yes | Yes | Yes | No | No |
| FP8 Support | Yes | Yes (optimal) | Yes | No | No |
| Speculative Decoding | Yes | Yes | Limited | No | Yes |
| LoRA Serving | Yes (multi) | Yes | Yes | Yes | Yes |
| Vision Models | Yes | Yes | Yes | Yes | Yes (some) |
| CPU Inference | Limited | No | No | Yes | Yes (optimal) |
| Metal (Apple) | No | No | No | Yes | Yes |
| Install Difficulty | Easy | Hard | Easy | Very easy | Medium |
| Production Ready | High | High | High | Low | Medium |

7.2 Throughput Benchmarks (Llama-3.1 8B, A100 80GB)

| Framework | Throughput (tok/s) | TTFT (ms) | ITL (ms) | Memory Usage |
|---|---|---|---|---|
| vLLM (FP16) | 4,200 | 45 | 12 | 18 GB |
| vLLM (AWQ-4bit) | 6,800 | 32 | 8 | 7 GB |
| TensorRT-LLM (FP16) | 4,800 | 38 | 10 | 17 GB |
| TensorRT-LLM (FP8) | 7,500 | 28 | 7 | 10 GB |
| TGI (FP16) | 3,600 | 52 | 14 | 18 GB |
| llama.cpp (Q4_K_M) | 120 | 200 | 35 | 5 GB |

8. vLLM Deep Dive: Architecture to LoRA Serving

8.1 vLLM Architecture

┌──────────────────────────────────────────┐
│            vLLM Architecture             │
│                                          │
│  ┌─────────┐     ┌──────────────────┐    │
│  │ FastAPI │---->│    LLM Engine    │    │
│  │ Server  │     │                  │    │
│  └─────────┘     │  ┌────────────┐  │    │
│                  │  │ Scheduler  │  │    │
│  ┌─────────┐     │  │ (Batching) │  │    │
│  │ OpenAI  │---->│  └─────┬──────┘  │    │
│  │ compat  │     │        |         │    │
│  └─────────┘     │  ┌─────v──────┐  │    │
│                  │  │ Block Mgr  │  │    │
│                  │  │ (PagedAttn)│  │    │
│                  │  └─────┬──────┘  │    │
│                  │        |         │    │
│                  │  ┌─────v──────┐  │    │
│                  │  │ Worker(s)  │  │    │
│                  │  │ (GPU exec) │  │    │
│                  │  └────────────┘  │    │
│                  └──────────────────┘    │
└──────────────────────────────────────────┘

8.2 vLLM Production Deployment

# Start vLLM server (OpenAI API compatible)
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
#     --tensor-parallel-size 4 \
#     --max-model-len 32768 \
#     --gpu-memory-utilization 0.90 \
#     --enable-prefix-caching \
#     --enable-chunked-prefill \
#     --max-num-batched-tokens 4096 \
#     --port 8000

# API call via Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"},
    ],
    max_tokens=512,
    temperature=0.7,
)

8.3 vLLM Multi-LoRA Serving

Serve multiple LoRA adapters simultaneously from a single base model.

# vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora \
#     --lora-modules \
#         sql-lora=./adapters/sql-lora \
#         code-lora=./adapters/code-lora \
#         chat-lora=./adapters/chat-lora \
#     --max-loras 3 \
#     --max-lora-rank 64

# Select LoRA adapter by model name in API call
response = client.chat.completions.create(
    model="sql-lora",  # LoRA adapter name
    messages=[{"role": "user", "content": "SELECT ..."}],
)

8.4 vLLM Vision Model Serving

# Multimodal model serving
# vllm serve Qwen/Qwen2-VL-7B-Instruct \
#     --max-model-len 8192 \
#     --limit-mm-per-prompt image=4

from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="key")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]
    }],
)

9. TensorRT-LLM Deep Dive: The Choice for Maximum Performance

9.1 TensorRT-LLM Build Pipeline

┌────────┐     ┌───────────┐     ┌──────────┐     ┌──────────┐
│HF Model│---->│  Convert  │---->│ TRT-LLM  │---->│  Triton  │
│(source)│     │Checkpoint │     │  Engine  │     │ Serving  │
└────────┘     └───────────┘     └──────────┘     └──────────┘
                Apply quant       Compile optim    API server

# Step 1: Checkpoint conversion + FP8 quantization
python convert_checkpoint.py \
    --model_dir meta-llama/Llama-3.1-70B-Instruct \
    --output_dir ./checkpoint_fp8 \
    --dtype bfloat16 \
    --tp_size 4 \
    --pp_size 1 \
    --use_fp8

# Step 2: Build TensorRT engine
trtllm-build \
    --checkpoint_dir ./checkpoint_fp8 \
    --output_dir ./engine_fp8 \
    --gemm_plugin auto \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --workers 4

9.2 TensorRT-LLM FP8 Optimization

Maximizes utilization of H100 GPU's FP8 Tensor Cores.

| Configuration | Throughput (Llama-3.1 70B, 4xH100) | Latency |
|---|---|---|
| FP16, TP=4 | 2,400 tok/s | 16ms ITL |
| FP8, TP=4 | 4,200 tok/s | 9ms ITL |
| FP8 + Speculative | 5,800 tok/s | 6ms ITL |
| INT4 AWQ, TP=2 | 3,800 tok/s | 11ms ITL |

9.3 Inflight Batching (TensorRT-LLM)

TensorRT-LLM's implementation of Continuous Batching.

# Triton Inference Server + TensorRT-LLM backend
# model_config.pbtxt configuration
"""
backend: "tensorrtllm"
max_batch_size: 64

model_transaction_policy {
  decoupled: True    # Streaming response support
}

parameters: {
  key: "batching_type"
  value: {string_value: "inflight"}  # Enable Inflight Batching
}

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {string_value: "131072"}   # Limit KV Cache token count
}
"""

10. Model Parallelism: Multi-GPU Strategies

10.1 Tensor Parallelism (TP)

Splits a single layer across multiple GPUs.

Tensor Parallelism (TP=4):

         Layer N weight matrix W
    ┌──────┬──────┬──────┬──────┐
    │ W_1  │ W_2  │ W_3  │ W_4  │
    │GPU 0 │GPU 1 │GPU 2 │GPU 3 │
    └──┬───┴──┬───┴──┬───┴──┬───┘
       |      |      |      |
       v      v      v      v
    [part1] [part2] [part3] [part4]
       |      |      |      |
       └──────┴──────┴──────┘
              All-Reduce
              (aggregate results)

Pros: Reduces latency (all GPUs compute simultaneously)
Cons: Requires inter-GPU communication (NVLink recommended)
Best for: GPUs within same node (low latency needed)

10.2 Pipeline Parallelism (PP)

Distributes layers sequentially across GPUs.

Pipeline Parallelism (PP=4, 80 layers):

GPU 0: [Layer 0-19]  -> GPU 1: [Layer 20-39]
                              -> GPU 2: [Layer 40-59]
                                       -> GPU 3: [Layer 60-79]

Pros: Minimal inter-GPU communication (one direction)
Cons: Pipeline bubbles (GPU idle time)
Best for: Cross-node distribution (higher latency tolerable)

10.3 Expert Parallelism (EP) - For MoE Models

Distributes experts in Mixture of Experts models.

Expert Parallelism (Mixtral 8x7B, EP=4):

GPU 0: Expert 0, 1 + Shared Layers
GPU 1: Expert 2, 3 + Shared Layers
GPU 2: Expert 4, 5 + Shared Layers
GPU 3: Expert 6, 7 + Shared Layers

Token routing: Each token sent to Top-2 Experts
-> Requires All-to-All communication between GPUs

10.4 Practical Parallelism Combinations

# Serving Llama-3.1 405B (8x H100 80GB)
# Model size: ~810 GB (FP16) -> FP8 ~405 GB

# Option 1: TP=8 (all layers split across all GPUs)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --max-model-len 16384

# Option 2: TP=4, PP=2 (4 GPUs per pipeline stage, 2 stages)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2 \
    --max-model-len 16384

11. Advanced GPU Memory Optimization

11.1 KV Cache Quantization

# KV Cache FP8 quantization in vLLM
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
#     --tensor-parallel-size 4 \
#     --kv-cache-dtype fp8 \
#     --quantization fp8

# KV Cache memory savings:
# FP16 KV Cache: 1.34 GB / sequence (Llama-2 70B, 4K)
# FP8 KV Cache:  0.67 GB / sequence (50% savings)
# 2x more concurrent requests on same GPU

11.2 Memory Allocation Strategy

GPU Memory Distribution (A100 80GB, Llama-3.1 70B FP16):

┌─────────────────────────────────┐
│ Model Weights: ~35 GB (TP=2)    │ 43.75%
├─────────────────────────────────┤
│ KV Cache: ~35 GB                │ 43.75%
│ (gpu_memory_utilization=0.90)   │
├─────────────────────────────────┤
│ Activation Memory: ~2 GB        │  2.5%
├─────────────────────────────────┤
│ System Reserved: ~8 GB          │ 10%
└─────────────────────────────────┘

KV Cache determines the maximum number of concurrent requests.
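A back-of-envelope helper (our naming; it ignores TP sharding of the cache) shows how the KV budget caps concurrency:

```python
def max_concurrent_seqs(gpu_mem_gb, weight_gb, seq_len,
                        kv_bytes_per_token, utilization=0.90):
    """Usable memory minus weights is the KV Cache budget;
    divide by the per-sequence KV footprint at full context."""
    kv_budget_gb = gpu_mem_gb * utilization - weight_gb
    per_seq_gb = seq_len * kv_bytes_per_token / 1e9
    return int(kv_budget_gb // per_seq_gb)

# A100 80GB, ~35 GB of weights, 4K context, 320 KB/token
# FP16 KV Cache (the Llama-2 70B figure from section 2.2)
n_fp16 = max_concurrent_seqs(80, 35, 4096, 327_680)
n_fp8 = max_concurrent_seqs(80, 35, 4096, 327_680 // 2)  # FP8 KV: ~2x more
```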

11.3 Strategies When Running Low on Memory

| Strategy | Implementation | Effect | Side Effect |
|---|---|---|---|
| Quantization | FP16 to INT4 | 4x weight reduction | Slight quality loss |
| KV Cache Quant | FP16 to FP8 | 2x KV Cache reduction | Negligible |
| Reduce max_model_len | 32K to 8K | Proportional KV Cache reduction | No long contexts |
| Increase TP | TP=2 to TP=4 | Half memory per GPU | Extra GPU cost |
| Prefix Caching | Shared system prompts | Large savings for repeated requests | No effect on unique requests |

12. Cost Analysis: tokens/dollar Across Platforms

12.1 Self-Hosting Cost Comparison

| GPU | Cloud Hourly Cost | Llama-3.1 70B Throughput | tokens/dollar |
|---|---|---|---|
| A100 80GB x1 | ~3.0 USD | 800 tok/s (FP16) | 960K |
| A100 80GB x4 (TP=4) | ~12.0 USD | 2,800 tok/s | 840K |
| H100 80GB x1 | ~4.5 USD | 1,500 tok/s (FP8) | 1,200K |
| H100 80GB x4 (TP=4) | ~18.0 USD | 5,000 tok/s (FP8) | 1,000K |
| L40S x1 | ~1.5 USD | 600 tok/s (INT4) | 1,440K |
| 4090 x1 (own server) | ~0.3 USD (power) | 400 tok/s (INT4) | 4,800K |
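The tokens/dollar column is just sustained throughput × 3600 divided by hourly cost; a one-liner (ours) reproduces the table:

```python
def tokens_per_dollar(throughput_tok_s, hourly_cost_usd):
    """Sustained throughput converted to tokens generated per dollar."""
    return throughput_tok_s * 3600 / hourly_cost_usd

# Rows from the table above
a100 = tokens_per_dollar(800, 3.0)      # A100 x1: 960K tok/$
h100x4 = tokens_per_dollar(5000, 18.0)  # H100 x4: 1,000K tok/$
```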

12.2 API vs Self-Hosting Break-Even Point

Monthly token usage cost comparison (Llama-3.1 70B class):

  • API cost grows linearly with monthly token volume
  • Self-hosted cost (4x H100) is roughly fixed per month
  • Break-even: ~10B tokens/month -- above this, self-hosting is cheaper

13. Benchmarking: How to Measure Correctly

13.1 Core Metrics

| Metric | Definition | Why It Matters |
|---|---|---|
| TTFT (Time To First Token) | Time until first token generated | User-perceived response start |
| ITL (Inter-Token Latency) | Time between tokens | Perceived streaming speed |
| E2E Latency | Total request completion time | Total wait time |
| Throughput | Tokens generated per second | Overall system capacity |
| TPS/User | Tokens per second per user | Individual perceived speed |

13.2 Benchmarking Tools and Methods

# vLLM built-in benchmark (recommended)
# After running: python -m vllm.entrypoints.openai.api_server

# python benchmarks/benchmark_serving.py \
#     --backend vllm \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --dataset-name sharegpt \
#     --dataset-path ShareGPT_V3_unfiltered.json \
#     --num-prompts 1000 \
#     --request-rate 10 \
#     --endpoint /v1/completions

# Example results:
# Successful requests:                     1000
# Benchmark duration (s):                  105.23
# Total input tokens:                      215000
# Total generated tokens:                  180000
# Request throughput (req/s):              9.50
# Output token throughput (tok/s):         1710.5
# Mean TTFT (ms):                          48.2
# Median TTFT (ms):                        42.1
# P99 TTFT (ms):                           125.3
# Mean ITL (ms):                           11.8
# Median ITL (ms):                         10.2
# P99 ITL (ms):                            35.7

13.3 Performance Characteristics Under Load

Throughput vs Latency relationship (as concurrency increases):

Throughput                          Latency
(tok/s)                            (ms)
  |        ┌──────────              |              /
  |       /                         |            /
  |      /                          |          /
  |     /                           |        /
  |    /                            |      /
  |   /                             |    /
  |  /                              |  /
  | /                               |/
  +──────────────> concurrency      +──────────────> concurrency

  Optimal operating point: Just before throughput saturates (Knee point)
  Usually GPU utilization of 70-80%

14. Production Deployment Architecture

14.1 Production Serving Architecture

┌──────────────────────────────────────────────────┐
│             Production Architecture              │
│                                                  │
│ Client -> Load Balancer -> API Gateway           │
│                    ┌─────────┼─────────┐         │
│                    v         v         v         │
│              ┌─────────┐┌────────┐┌────────┐     │
│              │ vLLM    ││ vLLM   ││ vLLM   │     │
│              │ Pod 1   ││ Pod 2  ││ Pod 3  │     │
│              │ (4xH100)││(4xH100)││(4xH100)│     │
│              └────┬────┘└───┬────┘└───┬────┘     │
│                   └─────────┼─────────┘          │
│              ┌──────────────v──────────────┐     │
│              │    Prometheus + Grafana     │     │
│              │    (Metrics collection)     │     │
│              └─────────────────────────────┘     │
│                                                  │
│ Autoscaling: Based on queue length / GPU util    │
│ Health Check: /health endpoint                   │
│ Graceful Shutdown: Complete in-flight requests   │
└──────────────────────────────────────────────────┘

14.2 Kubernetes Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
          - "--model=meta-llama/Llama-3.1-70B-Instruct"
          - "--tensor-parallel-size=4"
          - "--max-model-len=16384"
          - "--gpu-memory-utilization=0.90"
          - "--enable-prefix-caching"
          - "--enable-chunked-prefill"
        resources:
          limits:
            nvidia.com/gpu: "4"
            memory: "64Gi"
          requests:
            nvidia.com/gpu: "4"
            memory: "32Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      nodeSelector:
        gpu-type: h100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

15. Quiz

Q1. What core problem does vLLM's PagedAttention solve?

Answer: It solves the memory fragmentation problem of KV Cache.

Traditional approaches pre-allocate contiguous memory for the maximum sequence length per request, wasting 60-80%. PagedAttention splits KV Cache into fixed-size blocks (pages) and uses page tables to logically link non-contiguous blocks, like OS virtual memory:

  • Near-zero internal fragmentation
  • Non-contiguous memory blocks utilized
  • Copy-on-Write for shared prompt KV Cache

This enables 2-4x more concurrent requests on the same GPU memory.

Q2. Why does Continuous Batching achieve higher throughput than Static Batching?

Answer: Static Batching waits for all requests in a batch to complete before starting the next batch. Even when short responses finish early, the GPU sits idle.

Continuous Batching:

  1. Removes completed requests every iteration
  2. Immediately adds new requests from the queue
  3. Keeps GPU at maximum utilization

This achieves 10-20x higher throughput compared to Static Batching. Individual request latency also improves due to reduced waiting time.

Q3. Why does Speculative Decoding not degrade output quality?

Answer: Because it exactly preserves the target model's output distribution mathematically.

For a draft token x:

  • Acceptance probability = min(1, p_target(x) / p_draft(x))
  • On rejection: resample from (p_target - p_draft) distribution

This process ensures the final output has a mathematically identical distribution to using only the target model. Speed improves while quality loss is zero.

Q4. Why is the LLM Decode phase Memory-Bound?

Answer: In the decode phase, only one token is generated at a time. The entire model weights must be read from memory (matrix-vector multiplication), but actual computation is minimal.

Llama-2 70B example:

  • Must read 140 GB model weights each step
  • At A100 bandwidth of 2 TB/s: 70ms (memory reading)
  • Actual computation time: ~1ms

Memory bandwidth is the bottleneck, making it Memory-Bound. This is why quantization (reducing weight size) and batching (reading weights once for multiple requests) are effective.

Q5. Why is FP8 quantization more suitable for LLM inference than INT8?

Answer: FP8 is a floating-point format with a wide dynamic range. LLM weights and activations have highly varied magnitudes, making FP8 more suitable than fixed-point INT8.

Specifically:

  • FP8 E4M3: 4-bit exponent, 3-bit mantissa -- wide range, decent precision
  • INT8: Fixed range of -128 to 127 -- vulnerable to outliers
  • H100 GPUs have dedicated FP8 Tensor Cores with 2x FP16 compute
  • FP8 supports dynamic quantization without calibration

As a result, FP8 provides performance gains close to INT4 while maintaining quality close to FP16.


16. References

  1. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - Kwon et al., 2023
  2. FlashAttention: Fast and Memory-Efficient Exact Attention - Dao et al., 2022
  3. FlashAttention-2: Faster Attention with Better Parallelism - Dao, 2023
  4. Efficient Memory Management for Large Language Model Serving with PagedAttention - Kwon et al., 2023
  5. Fast Inference from Transformers via Speculative Decoding - Leviathan et al., 2023
  6. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Frantar et al., 2023
  7. AWQ: Activation-aware Weight Quantization - Lin et al., 2024
  8. TensorRT-LLM - NVIDIA Official Documentation
  9. Orca: A Distributed Serving System for Transformer-Based Generative Models - Yu et al., 2022
  10. GQA: Training Generalized Multi-Query Transformer Models - Ainslie et al., 2023
  11. Medusa: Simple LLM Inference Acceleration Framework - Cai et al., 2024
  12. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty - Li et al., 2024
  13. SmoothQuant: Accurate and Efficient Post-Training Quantization - Xiao et al., 2023