Complete LLM Serving Optimization Guide: KV Cache, PagedAttention, and Quantization

The Two Phases of LLM Inference: Prefill and Decode

LLM text generation splits into two fundamentally different phases. Without understanding this split, optimization is impossible.

Phase 1: PREFILL (process the input)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: "What is the capital of France?"
       └─ all prompt tokens processed at once

What happens:
  - All input tokens are processed in parallel (big matrix multiply!)
  - Q, K, V are computed for each input token
  - KV cache is created (saves K, V for later reuse)
  - First output token is generated

Characteristics:
  - GPU operation: COMPUTE-BOUND (matrix × matrix)
  - GPU utilization: HIGH
  - Latency metric: TTFT (Time To First Token)

Phase 2: DECODE (generate tokens one-by-one)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Generates: "Paris" → "is" → "the" → "capital" → ...

What happens:
  - Generate exactly one token per forward pass
  - Compute Q for the new token, attend over cached K, V
  - Must read ALL model weights for every single token

Characteristics:
  - GPU operation: MEMORY-BOUND (matrix × vector)
  - GPU utilization: LOW (often 5–20%!)
  - Throughput metric: TBT (Time Between Tokens)

This is why LLM serving is hard to optimize:
the two phases have completely different bottlenecks!
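The compute-vs-memory split can be made concrete with a back-of-the-envelope arithmetic-intensity estimate for a single weight matrix (a sketch with illustrative numbers; `d_model=4096` is an assumption, and real fused kernels differ):

```python
def arithmetic_intensity(batch_tokens, d_model, dtype_bytes=2):
    """FLOPs per byte of HBM traffic for one (d_model x d_model) matmul."""
    flops = 2 * batch_tokens * d_model * d_model            # multiply-adds
    bytes_moved = dtype_bytes * (d_model * d_model          # weight matrix
                                 + 2 * batch_tokens * d_model)  # activations
    return flops / bytes_moved

# Prefill: hundreds of tokens amortize each weight read -> compute-bound
print(f"prefill (512 tokens): {arithmetic_intensity(512, 4096):.0f} FLOPs/byte")
# Decode: one token per weight read -> ~1 FLOP/byte, memory-bound
print(f"decode  (1 token):    {arithmetic_intensity(1, 4096):.2f} FLOPs/byte")
```

A modern GPU needs on the order of a few hundred FLOPs per byte to stay compute-bound, so prefill saturates the ALUs while decode mostly waits on HBM bandwidth.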

Measuring it in practice:

import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

def measure_llm_phases(model_name="meta-llama/Llama-3.2-1B"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="cuda"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    prompt = "Explain the transformer architecture in detail:"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    input_len = inputs["input_ids"].shape[1]

    # Measure prefill time
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        _ = model(**inputs)   # forward pass on input only
    torch.cuda.synchronize()
    t_prefill = time.perf_counter() - t0

    # Measure end-to-end generation (generate() redoes the prefill internally)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        generated = model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
            do_sample=False
        )
    torch.cuda.synchronize()
    t_total = time.perf_counter() - t0

    # Approximation: subtract the separately measured prefill time
    t_decode = t_total - t_prefill
    n_new = generated.shape[1] - input_len

    print(f"Input tokens:          {input_len}")
    print(f"Prefill time (TTFT):   {t_prefill*1000:.1f}ms")
    print(f"Generated tokens:      {n_new}")
    print(f"Decode time:           {t_decode*1000:.1f}ms")
    print(f"Per-token decode time: {t_decode/n_new*1000:.1f}ms/token")
    # Llama-1B on H100 (~):
    # Prefill: ~5ms (linear in input length)
    # Decode:  ~3ms/token (proportional to model size, batch-dependent)

KV Cache: The Memory Dilemma of Attention

What Happens Without a KV Cache?

Autoregressive generation WITHOUT KV cache:

Step 1: [token_1] → generate token_2
  - Compute Q1,K1,V1 for token_1
  - Compute Q2,K2,V2 for token_2 (partial)
  - Attention: Q2 × [K1,K2]^T
  - Ops: 2^2 = 4 dot products

Step 2: [token_1, token_2] → generate token_3
  - Re-compute K1,V1 (wasted work!)
  - Re-compute K2,V2 (wasted work!)
  - Compute Q3,K3,V3
  - Attention: Q3 × [K1,K2,K3]^T
  - Ops: 3^2 = 9 dot products

Step N: O(N^2) operations per token
Total for L tokens: O(L^3) compute
  100 tokens:  ~10^6 dot products
  1000 tokens: ~10^9 dot products
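Counting the dot products directly confirms the asymptotics (a toy tally per attention step, ignoring heads and hidden size):

```python
def dot_products_without_cache(L):
    # step n re-encodes all n tokens and forms an n x n score matrix
    return sum(n * n for n in range(1, L + 1))

def dot_products_with_cache(L):
    # step n computes Q for one new token and attends over n cached keys
    return sum(n for n in range(1, L + 1))

for L in (100, 1000):
    print(L, dot_products_without_cache(L), dot_products_with_cache(L))
# Without the cache the total grows ~L^3/3; with it ~L^2/2,
# a roughly 667x gap at L=1000
```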

KV Cache: Reuse Previous Computation

import torch
import torch.nn as nn
import math

class AttentionWithKVCache(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        # KV cache storage
        self.k_cache = None  # (batch, heads, past_len, d_k)
        self.v_cache = None

    def forward(self, x, use_cache=True):
        batch, seq, d = x.shape

        Q = self.W_q(x).view(batch, seq, self.n_heads, self.d_k).transpose(1,2)
        K = self.W_k(x).view(batch, seq, self.n_heads, self.d_k).transpose(1,2)
        V = self.W_v(x).view(batch, seq, self.n_heads, self.d_k).transpose(1,2)

        if use_cache and self.k_cache is not None:
            # Append new K, V to the cache
            K = torch.cat([self.k_cache, K], dim=2)
            V = torch.cat([self.v_cache, V], dim=2)

        if use_cache:
            self.k_cache = K.detach()
            self.v_cache = V.detach()

        # Q covers only the new token(s); K, V span the full cached sequence.
        # NOTE: multi-token prefill would also need a causal mask (omitted here).
        scale = math.sqrt(self.d_k)
        scores  = torch.matmul(Q, K.transpose(-2,-1)) / scale
        weights = torch.softmax(scores, dim=-1)
        output  = torch.matmul(weights, V)

        return output.transpose(1,2).contiguous().view(batch, seq, d)
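As a numerical sanity check that caching changes nothing, here is a minimal single-head comparison in plain torch (random data, assumed shapes): decoding incrementally against cached K/V reproduces full causal attention exactly.

```python
import math
import torch

torch.manual_seed(0)
d_k, L = 16, 10
Q, K, V = (torch.randn(L, d_k) for _ in range(3))

def attend(q, K, V):
    w = torch.softmax(q @ K.T / math.sqrt(d_k), dim=-1)
    return w @ V

# Incremental decode: at step t only row t of Q is new; K[:t+1] is "the cache"
cached_out = torch.stack([attend(Q[t], K[:t + 1], V[:t + 1]) for t in range(L)])

# Full causal attention, recomputed from scratch
mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
scores = (Q @ K.T / math.sqrt(d_k)).masked_fill(~mask, float("-inf"))
full_out = torch.softmax(scores, dim=-1) @ V

print(torch.allclose(cached_out, full_out, atol=1e-5))  # True
```

The cache is a pure compute-reuse optimization: the output distribution is identical, token for token.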


def compute_kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                            batch_size, dtype_bytes=2):
    """
    KV cache memory in bytes.
    Factor of 2: one tensor for K, one for V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes


# Llama 3.1 70B (uses GQA: 8 KV heads, 64 Q heads):
size = compute_kv_cache_bytes(
    seq_len=4096, n_layers=80, n_kv_heads=8,
    head_dim=128, batch_size=1, dtype_bytes=2
)
print(f"KV cache (Llama-70B, seq=4096, batch=1): {size/1e9:.1f} GB")
# Result: ~1.3 GB per request (GQA keeps this small; with 64 MHA heads it
# would be ~10.7 GB). At batch=32 that is ~43 GB of KV cache alone,
# on top of 140 GB of FP16 weights.

The Memory Fragmentation Problem

Traditional KV cache allocation (pre-allocate max_seq_len per request):

Request 1: current_len=100, reserved=512
  ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  [used: 100]  [wasted: 412 slots = 80%!]
Request 2: current_len=50,  reserved=512
  ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  [used: 50]   [wasted: 462 slots = 90%!]
Request 3: current_len=300, reserved=512
  ███████████████████████░░░░░░░░░░░░░░░░░  [used: 300]  [wasted: 212 slots = 41%!]

Total allocated: 3 × 512 = 1536 slots
Total used:      450 slots
Wasted:          1086 slots = 71%

Internal fragmentation (allocated but unused) +
External fragmentation (gaps between allocations)
Typical real-world GPU memory utilization: 20–40%
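A quick simulation illustrates the waste; uniformly distributed request lengths are an assumed (and optimistic) workload:

```python
import random

def simulated_utilization(n_requests=10_000, max_seq_len=512, seed=0):
    """Fraction of pre-allocated KV slots actually used when every
    request reserves max_seq_len but finishes at a random length."""
    rng = random.Random(seed)
    used = sum(rng.randint(1, max_seq_len) for _ in range(n_requests))
    return used / (n_requests * max_seq_len)

print(f"utilization: {simulated_utilization():.0%}")
# ~50% even in this generous model; real traffic (short chats against
# long reservations) lands in the 20-40% range
```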

PagedAttention (vLLM): Virtual Memory Saves LLMs

The Core Insight: OS Virtual Memory Applied to KV Cache

Kwon et al. (UC Berkeley, 2023): "Operating systems solved memory fragmentation decades ago. Apply the same idea to KV cache."

The OS Virtual Memory Lesson:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
virtual address → page table → physical address
Process sees contiguous virtual space
Physical memory can be non-contiguous
No fragmentation, efficient RAM usage

PagedAttention Analogy:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
virtual KV slot → block table → physical block
Sequence sees contiguous virtual slots
Physical GPU blocks can be non-contiguous
Near-zero fragmentation, efficient GPU VRAM usage
PagedAttention Memory Layout:

GPU memory split into fixed-size blocks (default: 16 tokens each):
┌─────────────────────────────────────────────┐
│ Physical KV Cache Blocks                    │
│ Block 0: [tok 0–15]    Block 1: [tok 16–31] │
│ Block 2: [tok 32–47]   Block 3: [tok 48–63] │
│ Block 4: [tok 64–79]   Block 5: FREE        │
│ Block 6: FREE          Block 7: FREE        │
└─────────────────────────────────────────────┘

Block Table (same role as OS page table):
┌─────────┬────────────────────────────────────────────────┐
│ Request │ Virtual block → Physical block mapping         │
├─────────┼────────────────────────────────────────────────┤
│ Req 1   │ virt[0]→phys[0] (tokens 0–15)                  │
│         │ virt[1]→phys[2] (tokens 16–31)                 │
├─────────┼────────────────────────────────────────────────┤
│ Req 2   │ virt[0]→phys[1] (tokens 0–15)                  │
│         │ virt[1]→phys[3] (tokens 16–31)                 │
└─────────┴────────────────────────────────────────────────┘

Key properties:
- Blocks are allocated ON-DEMAND as the sequence grows
- Internal fragmentation < 1 block = at most 15 wasted slots
- Blocks can be SHARED across requests (same prefix)!
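The key properties above can be sketched as a toy block table (hypothetical class and method names; a drastic simplification of vLLM's actual allocator):

```python
class ToyBlockTable:
    """On-demand mapping of a request's virtual blocks to physical blocks."""
    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.tables = {}    # request_id -> list of physical block ids
        self.lengths = {}   # request_id -> tokens written so far

    def append_token(self, req):
        table = self.tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n == len(table) * self.block_size:      # current block is full
            table.append(self.free_blocks.pop(0))  # allocate on demand
        self.lengths[req] = n + 1

    def release(self, req):
        self.free_blocks.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

bt = ToyBlockTable(num_physical_blocks=8)
for _ in range(20):
    bt.append_token("req1")   # 20 tokens -> spills into a 2nd block
for _ in range(5):
    bt.append_token("req2")   # 5 tokens -> a single block
print(bt.tables)              # {'req1': [0, 1], 'req2': [2]}
print(len(bt.free_blocks))    # 5 blocks still free; waste < 1 block per request
```

On release, a request's blocks go straight back to the free list, so memory is reused immediately instead of sitting inside an oversized reservation.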

# vLLM with PagedAttention:
from vllm import LLM, SamplingParams
import time

def benchmark_vllm():
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        gpu_memory_utilization=0.9,
        max_model_len=8192,
        block_size=16,
        max_num_seqs=256,
    )

    prompts = [
        "Short question: What is Python?",
        "Medium question: " + "Explain the history of machine learning. " * 5,
        "Long question: " + "How do you build a transformer from scratch? " * 10,
    ] * 20  # 60 requests of varying lengths

    params = SamplingParams(temperature=0.0, max_tokens=100)

    t0 = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - t0

    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"Requests:         {len(prompts)}")
    print(f"Tokens generated: {total_tokens}")
    print(f"Elapsed:          {elapsed:.1f}s")
    print(f"Throughput:       {total_tokens/elapsed:.0f} tokens/s")


# Memory efficiency improvement:
# Traditional:     20–40% of GPU memory used for actual KV cache
# PagedAttention:  >95% of GPU memory used for actual KV cache
# Result: 2–3× more concurrent requests on the same GPU

Prefix Caching: Share Common Prompts

# Enable prefix caching in vLLM:
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)

# Many requests sharing a long system prompt:
system = "You are an expert software engineer. " * 100  # long system prompt

requests = [
    system + "User: What is a binary search tree?",
    system + "User: How does garbage collection work?",
    system + "User: Explain ACID properties in databases.",
]

# The KV cache for `system` is computed ONCE and shared across all 3 requests.
# The shared prefix is prefilled 1 time instead of 3 (roughly 3× prefill
# savings here, since the prefix dominates the prompt length).
# This matters hugely for RAG pipelines where context is repeated.
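The savings are simple arithmetic; the 700-token prefix / 10-token question split below is an assumed example:

```python
def prefill_tokens(n_requests, prefix_len, suffix_len, prefix_cached):
    """Prompt tokens that must actually be prefilled across all requests."""
    if prefix_cached:
        return prefix_len + n_requests * suffix_len  # prefix computed once
    return n_requests * (prefix_len + suffix_len)

without = prefill_tokens(3, 700, 10, prefix_cached=False)
with_pc = prefill_tokens(3, 700, 10, prefix_cached=True)
print(f"{without} -> {with_pc} prefill tokens ({without / with_pc:.1f}x fewer)")
# The gain approaches n_requests as the shared prefix dominates the prompt
```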

Continuous Batching: Maximizing Throughput

The Problem with Static Batching

Static (request-level) batching:

GPU batch at each step:
Step 1:  [Req1: running] [Req2: running] [Req3: running]
Step 2:  [Req1: running] [Req2: DONE   ] [Req3: running]
Step 3:  [Req1: running] [  idle/wait  ] [Req3: running]   ← GPU waste!
Step 4:  [Req1: DONE   ] [  idle/wait  ] [Req3: running]   ← GPU waste!
Step 5:  [  idle/wait  ] [  idle/wait  ] [Req3: DONE   ]   ← GPU waste!

New requests must wait until the ENTIRE batch finishes.
GPU waste rate: often 50%+

Continuous Batching: Dynamic Scheduling Every Token Step

Continuous (iteration-level) batching:

Step 1:  [Req1] [Req2] [Req3]
Step 2:  [Req1] [Req2] [Req3]
Step 3:  [Req1] [Req4] [Req3]   ← Req2 done → Req4 inserted immediately!
Step 4:  [Req5] [Req4] [Req3]   ← Req1 done → Req5 inserted!
Step 5:  [Req5] [Req4] [Req6]   ← Req3 done → Req6 inserted!

GPU is at maximum utilization at every step.
Throughput improvement over static batching: 2–4×

from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
import asyncio

async def run_continuous_batching_server():
    engine_args = AsyncEngineArgs(
        model="meta-llama/Llama-3.1-8B-Instruct",
        max_num_seqs=256,               # max concurrent sequences
        max_num_batched_tokens=8192,    # max tokens per batch step
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def generate_one(prompt, req_id):
        from vllm import SamplingParams
        params = SamplingParams(temperature=0.7, max_tokens=200)
        async for output in engine.generate(prompt, params, request_id=req_id):
            if output.finished:
                return output.outputs[0].text

    # Requests submitted concurrently — engine handles continuous batching
    results = await asyncio.gather(
        generate_one("Explain quantum entanglement.", "r1"),
        generate_one("Write a Python quicksort.", "r2"),
        generate_one("Summarize the French Revolution.", "r3"),
    )
    for r in results:
        print(r[:80])

Quantization: Trade Precision for Speed and Memory

Why Quantization?

LLM memory footprint (FP16):
  Llama 3.1 8B:    16 GB
  Llama 3.1 70B:   140 GB
  Llama 3.1 405B:  810 GB

Common GPU memory:
  RTX 4090:    24 GB   → 8B fits (16 GB weights) with little KV-cache headroom
  A100 80 GB:          → 70B in FP16 impossible on one card (140 GB)
  H100 80 GB:          → 70B in FP16 impossible on one card
  H100 ×8 (640 GB):    → 70B comfortable; 405B (810 GB) still needs quantization

Memory savings with quantization:
  FP16  (16-bit): baseline
  INT8   (8-bit): 50% saved, ~1%  accuracy loss
  INT4   (4-bit): 75% saved, ~2–3% accuracy loss
  INT3   (3-bit): 81% saved, use cautiously
  INT2   (2-bit): 88% saved, usually unacceptable
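The memory numbers above follow directly from bits per weight (this sketch ignores the small per-group scale-factor overhead):

```python
def model_memory_gb(n_params_billion, bits_per_weight):
    # params (1e9) x bits, divided by 8 bits per byte -> gigabytes
    return n_params_billion * bits_per_weight / 8

for bits in (16, 8, 4, 3, 2):
    gb = model_memory_gb(70, bits)
    saved = 1 - gb / model_memory_gb(70, 16)
    print(f"Llama 70B @ {bits:>2}-bit: {gb:6.1f} GB  ({saved:.0%} saved)")
```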

Post-Training Quantization: INT8 (LLM.int8())

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# INT8 quantization — Dettmers et al., 2022 (bitsandbytes):
config_int8 = BitsAndBytesConfig(
    load_in_8bit=True,
    # Optionally keep certain layers in FP16 (e.g., output head)
    llm_int8_skip_modules=["lm_head"],
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=config_int8,
    device_map="auto",
)
# 70B model: 140 GB (FP16) → 70 GB (INT8), ~1% accuracy loss

# The key insight behind LLM.int8():
# Problem: activation outliers in certain channels ruin naive INT8 quality
# Solution: "Mixed-precision decomposition"
#   - Detect outlier channels (top ~1% by magnitude)
#   - Keep those channels in FP16
#   - Quantize all other channels to INT8
#   → Near-lossless quality with ~50% memory savings

4-bit Quantization: NF4 and GPTQ

# NF4 quantization (QLoRA paper, Dettmers et al. 2023):
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,   # compute in FP16
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_use_double_quant=True,         # quantize the scale too!
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=config_4bit,
    device_map="auto",
)
# 70B: 140 GB → 35 GB, ~2–3% accuracy loss

# Why NF4?
# Neural network weights are approximately normally distributed.
# NF4 uses 16 codepoints placed at equal-probability quantiles
# of a standard normal distribution.
# Each codepoint covers an equal probability mass → minimal quantization error
# vs. uniform INT4 which distributes points evenly on the number line.

# Double quantization:
# Quantization scale factors are FP32: 1 per group of 64 weights (0.5 bits/weight)
# Double-quant quantizes those scale factors to INT8 as well
# Net savings: ~0.37 bits per weight (scale overhead drops from 0.5 to ~0.13)
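The double-quantization accounting works out as follows (QLoRA-style block sizes of 64 and 256 are assumed; a back-of-the-envelope sketch):

```python
# Per-weight overhead of the quantization scales themselves:
block1 = 64                    # weights per quantization group
fp32_scales = 32 / block1      # one FP32 scale per group -> 0.5 bits/weight

# Double quantization: store scales in INT8, plus one FP32 "scale of scales"
# per block2 first-level scales
block2 = 256
double_quant = 8 / block1 + 32 / (block1 * block2)

print(f"scale overhead: {fp32_scales:.3f} -> {double_quant:.3f} bits/weight "
      f"(saves {fp32_scales - double_quant:.2f} bits/weight)")
```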


# GPTQ (Frantar et al., 2022) — layer-wise optimal quantization:
from auto_gptq import AutoGPTQForCausalLM

model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device="cuda:0",
    use_triton=True,     # Triton kernels for faster inference
)
# GPTQ uses the Hessian of each layer's loss to minimize quantization error.
# Generally highest accuracy among INT4 methods.

AWQ: Activation-Aware Weight Quantization

# AWQ (Lin et al., 2023) key insight:
# Not all weights are equally important!
# ~1% of channels produce large activations — these are "salient"
# Naively quantizing them to INT4 crushes quality

# AWQ solution:
# 1. Run calibration data, record activation magnitudes per channel
# 2. Scale up salient channels in the weight matrix (per-channel scaling)
# 3. Quantize everything to INT4 — the scaling absorbs the error for salient channels

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path  = "llama-3.1-8b-awq"

model     = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)

# AWQ vs GPTQ:
# AWQ:  faster inference (hand-tuned CUDA/Triton kernels)
#       ~25% of FP16 memory
#       accuracy: slightly below GPTQ
# GPTQ: higher accuracy (Hessian-based error minimization)
#       similar inference speed
#       same memory as AWQ

Quantization Comparison Table

Llama 3.1 70B quantization comparison (single H100):

┌──────────┬─────────┬────────────┬───────┬─────────────────┐
│ Format   │ Memory  │ Throughput │ MMLU  │ Hardware needed │
├──────────┼─────────┼────────────┼───────┼─────────────────┤
│ FP16     │ 140 GB  │ baseline   │ 80.9% │ 8× H100         │
│ BF16     │ 140 GB  │ +5%        │ 80.9% │ 8× H100         │
│ INT8     │  70 GB  │ +10%       │ 80.2% │ 2× H100         │
│ GPTQ-4b  │  36 GB  │ +30%       │ 79.8% │ 1× H100         │
│ AWQ-4b   │  36 GB  │ +35%       │ 79.5% │ 1× H100         │
│ GGUF-Q4  │  38 GB  │ CPU ok     │ 79.1% │ CPU or 1× H100  │
└──────────┴─────────┴────────────┴───────┴─────────────────┘

Speculative Decoding: Free Lunch Exists

The Idea: Draft Fast, Verify in Parallel

Standard decode:
  70B model generates 1 token = 1 forward pass = ~10ms
  100 tokens = ~1000ms = 1 second

Speculative decoding (Leviathan et al., 2023):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Draft model (7B) generates K tokens quickly
        ["Paris"] ["is"] ["the"] ["capital"]
        4 tokens in ~2ms (7B model)

Step 2: Target model (70B) verifies all K tokens in ONE forward pass!
        Process all 4 draft tokens in parallel → ~10ms
        (same cost as generating just 1 token normally)

Step 3: Verify each draft token:
        "Paris" ✓  "is" ✓  "the" ✓  "capital" ✗
        → accept 3 tokens, reject from position 4

Step 4: Resample from target model distribution at rejection point
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Result: 3 accepted tokens in ~12ms (vs 30ms standard decode)
Speedup: 2.5× (varies with acceptance rate ~70–90%)
Quality: ZERO degradation (target model is the arbiter)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def speculative_decode(
    target_model,
    draft_model,
    input_ids,
    max_new_tokens=100,
    K=4,              # number of draft tokens per speculation
    temperature=1.0,
):
    """
    Speculative decoding: draft model proposes K tokens,
    target model verifies them all in one forward pass.
    Guarantees exactly the same distribution as target-only decoding.
    """
    generated = input_ids.clone()

    while generated.shape[1] < input_ids.shape[1] + max_new_tokens:
        # --- Phase 1: Draft model generates K candidates ---
        draft_ids   = []
        draft_probs = []

        ctx = generated.clone()
        for _ in range(K):
            with torch.no_grad():
                out    = draft_model(ctx)
                logits = out.logits[:, -1, :] / max(temperature, 1e-5)
                probs  = torch.softmax(logits, dim=-1)
                tok    = torch.multinomial(probs, 1)
                draft_ids.append(tok)
                draft_probs.append(probs[0, tok[0, 0]])
                ctx = torch.cat([ctx, tok], dim=1)

        # --- Phase 2: Target model verifies K positions simultaneously ---
        candidate = torch.cat([generated] + draft_ids, dim=1)
        with torch.no_grad():
            tgt_out    = target_model(candidate)
            # logits for positions where draft tokens are placed
            tgt_logits = tgt_out.logits[:, len(generated[0])-1:-1, :]
            tgt_probs  = torch.softmax(tgt_logits / max(temperature, 1e-5), dim=-1)

        # --- Phase 3: Accept/reject each draft token ---
        n_accepted = 0
        for i in range(K):
            token_id  = draft_ids[i][0, 0].item()
            p_target  = tgt_probs[0, i, token_id].item()
            p_draft   = draft_probs[i].item()

            # Acceptance probability: min(1, p_target / p_draft)
            accept_p = min(1.0, p_target / max(p_draft, 1e-8))
            if torch.rand(1).item() < accept_p:
                generated = torch.cat([generated, draft_ids[i]], dim=1)
                n_accepted += 1
            else:
                # Reject: resample from an adjusted target distribution.
                # The exact residual is norm(max(0, p_target - p_draft)) over
                # the full vocab; having stored only the sampled token's draft
                # prob, we subtract it at that position as an approximation.
                diff = tgt_probs[0, i].clone()
                diff[token_id] = max(0.0, diff[token_id] - p_draft)
                diff = diff / diff.sum().clamp(min=1e-8)
                new_tok = torch.multinomial(diff.unsqueeze(0), 1)
                generated = torch.cat([generated, new_tok], dim=1)
                break

        if n_accepted == K:
            # All accepted: also take the target model's bonus token
            bonus_logits = tgt_out.logits[:, -1, :] / max(temperature, 1e-5)
            bonus_probs  = torch.softmax(bonus_logits, dim=-1)
            bonus_tok    = torch.multinomial(bonus_probs, 1)
            generated    = torch.cat([generated, bonus_tok], dim=1)

    return generated


# Real speedups observed (A100, Llama-2 70B + Llama-2 7B draft):
# K=4: 2.3× speedup, acceptance rate ~80%
# K=8: 2.7× speedup, acceptance rate ~75%
# Optimal K depends on draft/target quality ratio
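The dependence on K can be estimated from the acceptance rate: with per-token acceptance probability α and K draft tokens, the expected number of tokens emitted per target pass is (1 − α^(K+1)) / (1 − α) (Leviathan et al., 2023). A sketch, with an assumed draft/target cost ratio:

```python
def expected_tokens_per_cycle(alpha, K):
    """Expected tokens emitted per target forward pass, given per-token
    acceptance probability alpha and K draft tokens (Leviathan et al. 2023)."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

def estimated_speedup(alpha, K, draft_cost_ratio=0.1):
    """draft_cost_ratio: draft pass cost / target pass cost (assumed 0.1,
    e.g. a 7B draft for a 70B target)."""
    return expected_tokens_per_cycle(alpha, K) / (1 + K * draft_cost_ratio)

for K in (2, 4, 8, 16):
    print(f"K={K:>2}: ~{estimated_speedup(0.8, K):.2f}x at 80% acceptance")
# Speedup flattens (then falls) as K grows: extra drafts are rarely all
# accepted but still cost draft passes, matching the optimum around K=4-8
```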

Tensor Parallelism and Pipeline Parallelism

Tensor Parallelism: Split Layers Across GPUs

Tensor Parallelism (Shoeybi et al., 2019, Megatron-LM):

70B model, 8 GPUs, 64 attention heads:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GPU 0: heads 0–7    (W_q slice: 1/8 of full matrix)
GPU 1: heads 8–15
GPU 2: heads 16–23
GPU 3: heads 24–31
GPU 4: heads 32–39
GPU 5: heads 40–47
GPU 6: heads 48–55
GPU 7: heads 56–63

Each GPU computes its heads independently,
then All-Reduce merges results.

Communication cost:
  1 All-Reduce per attention layer
  1 All-Reduce per FFN layer
  NVLink (H100): 900 GB/s bidirectional → viable
  PCIe:           64 GB/s              → too slow for TP>2

import torch
import torch.distributed as dist

def tensor_parallel_linear(x, W_local, rank, world_size):
    """
    Column-parallel linear (W split along output dimension).
    x:       (batch, seq, d_model)   -- replicated on all GPUs
    W_local: (d_model, d_out//world_size) -- each GPU holds a shard
    """
    # Each GPU computes its output shard
    out_local = x @ W_local    # (batch, seq, d_out//world_size)

    # All-Gather to reconstruct full output on every GPU
    out_list = [torch.zeros_like(out_local) for _ in range(world_size)]
    dist.all_gather(out_list, out_local)
    out_full = torch.cat(out_list, dim=-1)   # (batch, seq, d_out)

    return out_full


# For row-parallel (W split along input dimension):
def tensor_parallel_linear_row(x_local, W_local, rank, world_size):
    """
    Row-parallel linear: x is already sharded across GPUs.
    x_local: (batch, seq, d_in//world_size)
    W_local: (d_in//world_size, d_out)
    """
    partial = x_local @ W_local    # (batch, seq, d_out) — partial sum
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # sum partial results
    return partial

Pipeline Parallelism: Split Layers Sequentially

Pipeline Parallelism:

70B model, 80 layers, 4 GPUs:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GPU 0: layers  0–19   (embedding + first 20 transformer layers)
GPU 1: layers 20–39
GPU 2: layers 40–59
GPU 3: layers 60–79  + LM head

With micro-batching to hide pipeline bubbles:

(GPipe-style schedule; f = forward, b = backward, numbers = micro-batches)

GPU 0: [f1][f2][f3][f4][f5]                    [b5][b4][b3][b2][b1]
GPU 1:     [f1][f2][f3][f4][f5]            [b5][b4][b3][b2][b1]
GPU 2:         [f1][f2][f3][f4][f5]    [b5][b4][b3][b2][b1]
GPU 3:             [f1][f2][f3][f4][f5][b5][b4][b3][b2][b1]

Pipeline bubble ratio = (p - 1) / (m + p - 1)
  p = number of pipeline stages
  m = number of micro-batches
Larger m = smaller bubble = better efficiency
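Plugging the schedule above into the formula (a quick check with p=4 stages and m=5 micro-batches):

```python
def bubble_ratio(p, m):
    # fraction of each pipeline cycle spent idle
    return (p - 1) / (m + p - 1)

print(f"p=4, m=5:  {bubble_ratio(4, 5):.1%} idle")
for m in (8, 16, 64):
    print(f"p=4, m={m:>2}: {bubble_ratio(4, m):.1%} idle")
# 37.5% idle at m=5; pushing m to 64 shrinks the bubble below 5%
```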

vLLM vs TGI vs TensorRT-LLM: Framework Comparison

LLM serving framework comparison (as of early 2026):

┌─────────────────┬────────────────────────────────────────────────┐
│ Framework       │ vLLM                                           │
├─────────────────┼────────────────────────────────────────────────┤
│ Developer       │ UC Berkeley / vLLM community                   │
│ Key innovations │ PagedAttention, continuous batching            │
│ Quantization    │ AWQ, GPTQ, INT8, FP8                           │
│ Throughput      │ ★★★★☆  High                                    │
│ TTFT latency    │ ★★★☆☆  Medium                                  │
│ Ease of use     │ ★★★★★  Very easy (Python-native)               │
│ Customizability │ ★★★☆☆  Medium                                  │
│ License         │ Apache 2.0                                     │
│ Notes           │ Most active OSS community, OpenAI-compat API   │
└─────────────────┴────────────────────────────────────────────────┘

┌─────────────────┬────────────────────────────────────────────────┐
│ Framework       │ TGI (Text Generation Inference)                │
├─────────────────┼────────────────────────────────────────────────┤
│ Developer       │ Hugging Face                                   │
│ Key innovations │ Continuous batching, FlashAttention            │
│ Quantization    │ GPTQ, AWQ, bitsandbytes                        │
│ Throughput      │ ★★★☆☆  Medium                                  │
│ TTFT latency    │ ★★★☆☆  Medium                                  │
│ Ease of use     │ ★★★★☆  Easy (Docker-first)                     │
│ Customizability │ ★★★★☆  High                                    │
│ License         │ HFOIL (check commercial terms)                 │
│ Notes           │ Native HF ecosystem integration                │
└─────────────────┴────────────────────────────────────────────────┘

┌─────────────────┬────────────────────────────────────────────────┐
│ Framework       │ TensorRT-LLM                                   │
├─────────────────┼────────────────────────────────────────────────┤
│ Developer       │ NVIDIA                                         │
│ Key innovations │ TensorRT graph optimization, in-flight batching│
│ Quantization    │ INT8, INT4, FP8, SmoothQuant, AWQ              │
│ Throughput      │ ★★★★★  Highest (NVIDIA GPUs only)              │
│ TTFT latency    │ ★★★★★  Lowest                                  │
│ Ease of use     │ ★★☆☆☆  Complex (C++ heavy)                     │
│ Customizability │ ★★☆☆☆  Difficult                               │
│ License         │ Apache 2.0                                     │
│ Notes           │ Best raw performance; use via Triton Server    │
└─────────────────┴────────────────────────────────────────────────┘

┌─────────────────┬────────────────────────────────────────────────┐
│ Framework       │ llama.cpp / Ollama                             │
├─────────────────┼────────────────────────────────────────────────┤
│ Developer       │ ggerganov / Ollama Inc.                        │
│ Key innovations │ GGUF quantization, CPU+GPU hybrid              │
│ Quantization    │ Q2–Q8 (GGUF format)                            │
│ Throughput      │ ★★☆☆☆  Low (on CPU)                            │
│ TTFT latency    │ ★★☆☆☆  High                                    │
│ Ease of use     │ ★★★★★  Simplest possible                       │
│ Customizability │ ★★☆☆☆  Limited                                 │
│ License         │ MIT                                            │
│ Notes           │ Ideal for local dev, CPU inference, demos      │
└─────────────────┴────────────────────────────────────────────────┘

Decision Guide

Choose vLLM if:
  - Production serving, Python team, open source preferred
  - Need OpenAI-compatible API drop-in replacement
  - Want the best community support and newest features fastest

Choose TGI if:
  - Deep HuggingFace ecosystem integration
  - Docker-first deployment culture
  - Need robust SSE streaming out-of-the-box

Choose TensorRT-LLM if:
  - Maximum raw performance on NVIDIA hardware
  - Have a team comfortable with C++/CUDA tooling
  - Enterprise production with dedicated MLOps

Choose Ollama / llama.cpp if:
  - Local development, prototyping
  - CPU inference required
  - Simplicity over performance

Production LLM Serving Stack

Production LLM serving architecture:

Clients
   ↓
Load Balancer (Nginx / AWS ALB / Cloudflare)
   ↓
API Gateway (FastAPI / Kong)
   - Rate limiting, auth, logging, request validation
   ↓
Router (model selection, priority queue)
   ├──→ vLLM server A: 8B model   (fast/cheap requests)
   ├──→ vLLM server B: 70B model  (high-quality requests)
   └──→ vLLM server C: domain-specific fine-tune
   ↓
Observability (Prometheus + Grafana)
   Key metrics:
     - TTFT p50/p95/p99
     - TBT  p50/p95/p99
     - Throughput (tokens/s)
     - GPU utilization %
     - KV cache utilization %
     - Request queue depth

# Production vLLM server launch command:
import subprocess

cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Llama-3.1-8B-Instruct",
    "--tensor-parallel-size", "2",          # 2-GPU tensor parallelism
    "--gpu-memory-utilization", "0.9",
    "--max-model-len", "8192",
    "--max-num-seqs", "256",
    "--max-num-batched-tokens", "8192",
    "--quantization", "awq",
    "--enable-prefix-caching",
    "--block-size", "16",
    "--port", "8000",
    "--disable-log-requests",               # reduce logging overhead
]
server = subprocess.Popen(cmd)   # launch the OpenAI-compatible server

# Calling the server from a client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user",   "content": "What is the GIL in Python?"},
    ],
    temperature=0.0,
    max_tokens=500,
    stream=True,    # streaming response
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

LLM serving optimization sits at the intersection of hardware, algorithms, and systems software. PagedAttention borrowed from operating system design. FlashAttention rediscovered the principle of tiling from numerical linear algebra. Speculative decoding revived draft-and-verify ideas from branch prediction. The engineers who will build the next generation of LLM serving systems are those who understand not just the current tools but the first-principles reasoning behind them.