GPU Memory Management & LLM Inference Optimization: vLLM, PagedAttention, GPTQ, TensorRT-LLM

Introduction

Deploying Large Language Models in production presents two fundamental challenges: GPU memory management and inference efficiency. GPT-4-scale models demand hundreds of gigabytes of memory, and real-time responsiveness requires generating dozens of tokens per second.

This guide covers every critical aspect of LLM inference optimization. From understanding GPU memory hierarchy to KV cache optimization, GPTQ/AWQ quantization, PagedAttention, continuous batching, and multi-GPU inference — everything a production engineer needs to know, explained step by step.


1. GPU Memory Hierarchy

HBM (High Bandwidth Memory)

The cornerstone of modern AI GPUs is HBM. HBM stacks multiple DRAM dies vertically, providing a far wider memory bus than conventional GDDR6.

| GPU | Memory | HBM Type | Bandwidth | Bus Width |
|---|---|---|---|---|
| A100 80G | 80 GB | HBM2e | 2.0 TB/s | 5120-bit |
| H100 SXM | 80 GB | HBM3 | 3.35 TB/s | 5120-bit |
| H200 SXM | 141 GB | HBM3e | 4.8 TB/s | 6144-bit |
| B200 SXM | 192 GB | HBM3e | 8.0 TB/s | 8192-bit |
| MI300X | 192 GB | HBM3 | 5.3 TB/s | 8192-bit |

L2 Cache and SRAM

The GPU memory hierarchy has three main levels:

  1. HBM (global memory): Tens to hundreds of GB, bandwidth in TB/s, latency ~hundreds of ns
  2. L2 cache: Tens to hundreds of MB (H100: 50 MB), shared across all SMs
  3. L1 cache / SRAM (shared memory): 128–256 KB per SM, bandwidth tens of TB/s, latency ~a few ns

The SRAM inside each SM (Streaming Multiprocessor) is the second-fastest memory after register files. Optimizations like Flash Attention leverage SRAM aggressively to reduce HBM accesses.

Roofline Model: Analyzing Performance Limits

The Roofline Model is an analytical tool for determining whether a given computation is compute-bound or memory-bound.

Arithmetic Intensity (AI) = Number of FLOPs / Memory accessed (bytes)

Performance ceiling = min(Peak FLOPS, Peak Memory BW × AI)
  • Low AI (memory-bound): Memory bandwidth is the bottleneck. The LLM decode phase is the classic example.
  • High AI (compute-bound): Arithmetic speed is the bottleneck. LLM prefill phase and large-batch workloads.

For an H100:

  • Peak FP16 FLOPS: 989 TFLOPS
  • Peak HBM bandwidth: 3.35 TB/s
  • Ridge point (balance point): 989 / 3.35 ≈ 295 FLOP/byte

When generating a single token, a 70B model (FP16) has AI ≈ 1–2 FLOP/byte — extremely memory-bound.
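
The roofline arithmetic above can be checked in a few lines (H100 figures from the list; a back-of-envelope sketch, not a profiler measurement):

```python
def attainable_tflops(ai, peak_tflops=989.0, bw_tb_per_s=3.35):
    """Roofline ceiling: min(compute roof, bandwidth roof x arithmetic intensity)."""
    return min(peak_tflops, bw_tb_per_s * ai)

ridge = 989.0 / 3.35  # ~295 FLOP/byte
print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"decode  (AI~2):   {attainable_tflops(2):.1f} TFLOPS")    # on the bandwidth roof: memory-bound
print(f"prefill (AI~300): {attainable_tflops(300):.1f} TFLOPS")  # hits the compute roof
```

At AI ≈ 2, the attainable rate is ~6.7 TFLOPS — under 1% of the H100's peak — which is why the decode phase leaves compute units idle.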


2. LLM Memory Calculations

Parameter Memory

Accurately calculating LLM memory requirements is the foundation of deployment planning.

def calc_model_memory_gb(
    num_params: int,         # Number of parameters (e.g., 70e9)
    dtype_bytes: float = 2,  # FP16=2, FP32=4, INT8=1, INT4=0.5
) -> float:
    """Calculate model weight memory (decimal GB, matching the table below)"""
    return (num_params * dtype_bytes) / 1e9

# Major model memory (FP16 baseline)
models = {
    "Llama-3.1-8B":   {"params": 8e9,   "bytes": 2},
    "Llama-3.1-70B":  {"params": 70e9,  "bytes": 2},
    "Llama-3.1-405B": {"params": 405e9, "bytes": 2},
    "Mistral-7B":     {"params": 7e9,   "bytes": 2},
    "Qwen2-72B":      {"params": 72e9,  "bytes": 2},
}

for name, cfg in models.items():
    mem_gb = calc_model_memory_gb(cfg["params"], cfg["bytes"])
    print(f"{name}: {mem_gb:.1f} GB")

| Model | Parameters | FP32 | FP16/BF16 | INT8 | INT4 |
|---|---|---|---|---|---|
| Llama-3.1-8B | 8B | 32 GB | 16 GB | 8 GB | 4 GB |
| Llama-3.1-70B | 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| Llama-3.1-405B | 405B | 1620 GB | 810 GB | 405 GB | 202 GB |
| Mistral-7B | 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |

KV Cache Memory Calculation

KV cache is the most dynamically changing component of inference memory. It scales proportionally with sequence length and batch size.

def calc_kv_cache_memory_gb(
    num_layers: int,
    num_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """
    KV cache memory calculation.
    Per layer: 2 (K, V) × num_heads × head_dim × seq_len × batch_size
    """
    kv_per_layer = 2 * num_heads * head_dim * seq_len * batch_size
    total_bytes = kv_per_layer * num_layers * dtype_bytes
    return total_bytes / (1024 ** 3)

# Llama-3.1-70B example
# layers=80, heads=64 (GQA: kv_heads=8), head_dim=128
kv_mem = calc_kv_cache_memory_gb(
    num_layers=80,
    num_heads=8,      # GQA uses kv_heads
    head_dim=128,
    seq_len=4096,
    batch_size=1,
    dtype_bytes=2,
)
print(f"KV cache (seq=4096, bs=1): {kv_mem:.2f} GB")
# Output: KV cache (seq=4096, bs=1): 1.25 GB

# KV cache by batch size
for bs in [1, 4, 8, 16, 32]:
    mem = calc_kv_cache_memory_gb(80, 8, 128, 4096, bs, 2)
    print(f"  batch_size={bs:2d}: {mem:.2f} GB")

KV Cache Memory (Llama-3.1-70B, seq_len=4096, FP16)

| Batch Size | KV Cache | Model Weights | Total |
|---|---|---|---|
| 1 | 1.25 GB | 140 GB | 141.25 GB |
| 4 | 5.0 GB | 140 GB | 145.0 GB |
| 8 | 10.0 GB | 140 GB | 150.0 GB |
| 16 | 20.0 GB | 140 GB | 160.0 GB |
| 32 | 40.0 GB | 140 GB | 180.0 GB |
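
These numbers translate directly into a capacity estimate: subtract the weights from usable GPU memory and divide by the per-sequence KV cache (the calculation above gives ~1.25 GB per sequence at 4K context). The helper below is a hypothetical illustration; the 10% reserve for activations and overhead is an assumption, not a vLLM default:

```python
def max_concurrent_seqs(total_gpu_gb, weight_gb, kv_per_seq_gb, reserve_frac=0.10):
    # Memory left after weights and a safety reserve, divided by per-sequence KV cache
    usable = total_gpu_gb * (1 - reserve_frac) - weight_gb
    return max(0, int(usable // kv_per_seq_gb))

# Llama-3.1-70B FP16 (140 GB weights), 4K context (~1.25 GB KV per sequence)
print(max_concurrent_seqs(160, 140, 1.25))  # 2x A100 80GB: only a handful of sequences
print(max_concurrent_seqs(320, 140, 1.25))  # 4x A100 80GB: far more headroom
```

This is why quantizing weights or shortening max_model_len has an outsized effect on achievable batch size.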

Activation Memory

During inference, activation memory is proportional to the product of batch size, sequence length, and hidden size. Unlike training, inference does not store gradients, making activation memory relatively small.

def calc_activation_memory_gb(
    hidden_size: int,
    seq_len: int,
    batch_size: int,
    num_layers: int,
    dtype_bytes: int = 2,
) -> float:
    """Approximate activation memory during inference"""
    # Per layer: attention + FFN activations
    # Approximation: 2 × hidden_size × seq_len × batch_size per layer
    bytes_per_layer = 2 * hidden_size * seq_len * batch_size * dtype_bytes
    return (bytes_per_layer * num_layers) / (1024 ** 3)
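
Plugging Llama-3.1-70B numbers (hidden_size=8192, 80 layers) into this approximation shows why decode-time activations are negligible next to weights and KV cache, while prefill at long context is not:

```python
def act_mem_gb(hidden_size, seq_len, batch_size, num_layers, dtype_bytes=2):
    # Same approximation as above: 2 x hidden x seq x batch per layer
    per_layer = 2 * hidden_size * seq_len * batch_size * dtype_bytes
    return per_layer * num_layers / (1024 ** 3)

prefill = act_mem_gb(8192, 4096, 1, 80)  # whole 4K prompt processed at once
decode = act_mem_gb(8192, 1, 1, 80)      # one token per step
print(f"prefill: {prefill:.2f} GB, decode step: {decode:.4f} GB")
```

A single decode step needs only a few MB of activations; the 4K-token prefill needs roughly 10 GB.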

3. KV Cache Optimization: PagedAttention

Problems with Conventional KV Cache

Traditional LLM serving systems allocate KV cache as contiguous memory blocks. This causes serious issues:

  1. Internal Fragmentation: Pre-allocating to max sequence length wastes unused space
  2. External Fragmentation: Varying request sizes create unusable gaps when requests complete
  3. Poor memory utilization: In practice, 60–80% of KV cache memory ends up wasted

PagedAttention: Applying OS Paging Principles

vLLM's PagedAttention applies virtual memory paging concepts to KV cache management.

OS Virtual MemoryPagedAttention
─────────────────────────────────────
Virtual page         → Logical block
Physical frame       → Physical block
Page table           → Block table
Page fault           → Block allocation

Core ideas:

  • Split KV cache into fixed-size blocks (e.g., 16 tokens each)
  • Access sequence KV via logical blocks; allocate physical blocks on demand
  • Share physical blocks with Copy-on-Write when requests share a common prefix

Request A: [Block 0][Block 1][Block 2]
Request B: [Block 0][Block 1][Block 3]
            └── Blocks 0–1 physically shared (common prompt prefix)
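
The bookkeeping can be sketched as a toy block-table allocator. The block size, ref-counting, and the `fork` sharing below are simplified illustrations of the idea, not vLLM's actual implementation:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockManager:
    """Toy PagedAttention-style bookkeeping: logical -> physical block mapping."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.ref_count = {}     # physical block id -> number of sequences using it
        self.block_tables = {}  # seq id -> list of physical block ids

    def allocate(self, seq_id, num_tokens):
        # On-demand allocation: only ceil(num_tokens / BLOCK_SIZE) blocks, no
        # pre-reservation up to max sequence length
        n_blocks = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        table = [self.free.pop() for _ in range(n_blocks)]
        for b in table:
            self.ref_count[b] = 1
        self.block_tables[seq_id] = table

    def fork(self, parent_id, child_id):
        # Copy-on-Write sharing: the child reuses the parent's physical blocks
        table = list(self.block_tables[parent_id])
        for b in table:
            self.ref_count[b] += 1
        self.block_tables[child_id] = table

    def free_seq(self, seq_id):
        # A physical block returns to the free pool only when no sequence uses it
        for b in self.block_tables.pop(seq_id):
            self.ref_count[b] -= 1
            if self.ref_count[b] == 0:
                self.free.append(b)
```

For example, a 40-token sequence takes 3 blocks; forking a request with a shared prefix consumes no new blocks until one side diverges.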

vLLM Server Launch Example

# Install vLLM
pip install vllm

# Start server (single GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# OpenAI-compatible API call
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain GPU memory optimization"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'
# Python client calling vLLM API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a GPU optimization expert."},
        {"role": "user", "content": "Explain how PagedAttention works."},
    ],
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

4. Quantization: GPTQ, AWQ, GGUF, bitsandbytes

Quantization Method Comparison

| Method | Precision | Memory Savings | Speed | Quality Loss | Notes |
|---|---|---|---|---|---|
| FP16/BF16 | 16-bit | Baseline | Baseline | None | Default |
| GPTQ | 4-bit | ~75% | Fast | Low | PTQ, GPU only |
| AWQ | 4-bit | ~75% | Fast | Very low | Activation-aware |
| GGUF | 2–8-bit | Variable | CPU capable | Variable | llama.cpp |
| bitsandbytes NF4 | 4-bit | ~75% | Moderate | Low | QLoRA training |
| bitsandbytes INT8 | 8-bit | ~50% | Moderate | Very low | LLM.int8() |

bitsandbytes 4-bit Quantization Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # Double quantization for extra compression
)

model_id = "meta-llama/Llama-3.1-70B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",      # Automatic multi-GPU distribution
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Check memory usage
print(f"GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

GPTQ Quantization (auto-gptq)

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Group size (smaller = more accurate, more scale overhead)
    desc_act=False,      # Activation-order (act-order) quantization; False is faster
    damp_percent=0.01,   # Hessian damping coefficient
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration data (representative text samples)
# auto-gptq expects dicts containing input_ids and attention_mask
calibration_data = [
    tokenizer("The GPU accelerates machine learning by...", return_tensors="pt"),
    tokenizer("Quantization reduces model size while...", return_tensors="pt"),
    # Use a few hundred samples in practice (128–1024 is common)
]

# Load model and quantize
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")
print("GPTQ quantization complete!")

# Load quantized model
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "llama-3.1-8b-gptq-4bit",
    use_safetensors=True,
    device="cuda:0",
)

AWQ: Activation-Aware Weight Quantization

AWQ does not quantize all weights equally. Channels with large activation values (important weights) are protected at higher precision.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# AWQ quantization configuration
quant_config = {
    "zero_point": True,   # Zero-point quantization
    "q_group_size": 128,  # Group size
    "w_bit": 4,           # 4-bit
    "version": "GEMM",    # Matrix multiplication kernel
}

model = AutoAWQForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print("AWQ quantization complete!")

Quantization Benchmark (Llama-3.1-8B)

| Method | Memory | Throughput (tok/s) | Perplexity | Notes |
|---|---|---|---|---|
| FP16 | 16 GB | 100 (baseline) | 7.2 | Baseline |
| BF16 | 16 GB | 100 | 7.2 | Equivalent to FP16 |
| INT8 | 8 GB | 75 | 7.3 | Minor quality loss |
| GPTQ-4bit | 4.5 GB | 120 | 7.6 | Memory savings + speed |
| AWQ-4bit | 4.5 GB | 125 | 7.4 | Better quality than GPTQ |
| GGUF-Q4_K_M | 4.8 GB | 80 (CPU) | 7.5 | CPU inference capable |

5. Batching Strategies: Continuous Batching

Limitations of Static Batching

Traditional static batching waits for all requests in a batch to start simultaneously and complete together. This leads to severe GPU underutilization.

Static Batching (batch_size=3):

Time ──▶
[Request A: ████████████░░░░░░░░]  (12 tokens generated)
[Request B: ████░░░░░░░░░░░░░░░░]  (4 tokens generated)
[Request C: ████████░░░░░░░░░░░░]  (8 tokens generated)
             └─ B and C must wait for A to finish (GPU slots idle)

Continuous Batching (Iteration-level Scheduling)

Modern LLM serving systems like vLLM and TensorRT-LLM use continuous batching, dynamically reconstructing the batch at each inference step (iteration).

Continuous Batching:

Step 1: [A1][B1][C1]   ← 3 requests processed simultaneously
Step 2: [A2][B2][C2]
Step 3: [A3][B3][C3]   ← B completes; new request D admitted
Step 4: [A4][C4][D1]   ← empty slot filled immediately
Step 5: [A5][C5][D2]
Step 6: [A6][C6][D3]   ← C completes; new request E admitted
...

GPU utilization improves 2–5x compared to static batching.
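
The scheduling difference is easy to simulate. The toy scheduler below is a sketch of iteration-level scheduling, not vLLM's actual scheduler; its one key behavior is admitting a waiting request the moment a slot frees up:

```python
from collections import deque

def continuous_batching(requests, max_batch=3):
    """Toy iteration-level scheduler. requests: {id: tokens_to_generate}."""
    waiting = deque(requests.items())
    running = {}  # request id -> tokens still to generate
    steps = 0
    while waiting or running:
        # Fill empty slots immediately -- the key difference from static batching
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        steps += 1  # one decode iteration for every running sequence
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot reclaimed for the next iteration
    return steps

print(continuous_batching({"A": 12, "B": 4, "C": 8, "D": 3, "E": 5}))
```

With requests needing 12, 4, 8, 3, and 5 tokens and 3 slots, iteration-level scheduling finishes in 12 steps; static batching (two batches: max(12, 4, 8) + max(3, 5)) needs 17.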

Separating Prefill and Decode

LLM inference has two distinct phases:

  • Prefill: Processes the entire prompt at once. Compute-bound (behaves like batch processing)
  • Decode: Autoregressive token-by-token generation. Memory-bound

These phases have different GPU resource requirements. Disaggregated Prefill is an architecture that separates prefill-dedicated GPUs from decode-dedicated GPUs.
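
Because every decode step must stream the full weight set from HBM at least once, memory bandwidth puts a hard ceiling on single-request decode speed. A rough upper bound (ignoring KV cache reads and compute entirely):

```python
def decode_tok_per_s_bound(weight_bytes, hbm_bytes_per_s):
    # Upper bound: one full pass over the weights per generated token
    return hbm_bytes_per_s / weight_bytes

# Llama-3.1-70B FP16 (~140 GB of weights) on one H100 (3.35 TB/s)
print(f"{decode_tok_per_s_bound(140e9, 3.35e12):.1f} tok/s")
```

The same bound for 4-bit weights (~35 GB) is roughly 4x higher, which is a large part of why quantization speeds up decode despite the extra dequantization work.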


6. LLM Inference Framework Comparison

| Framework | Developer | Key Features | Best Use Case |
|---|---|---|---|
| vLLM | UC Berkeley | PagedAttention, OpenAI-compatible API | High-throughput serving |
| TensorRT-LLM | NVIDIA | Optimized CUDA kernels, FP8 support | Lowest latency |
| Ollama | Ollama Inc | Easy local execution | Development/testing |
| llama.cpp | ggml | CPU inference, GGUF format | Edge/local |
| SGLang | LMSYS | Structured generation, RadixAttention | Complex pipelines |

vLLM Tensor Parallel Inference

from vllm import LLM, SamplingParams

# Tensor parallel across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,        # Distribute across 4 GPUs
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    enforce_eager=False,           # Use CUDA graph optimization
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)

prompts = [
    "Explain the GPU memory hierarchy",
    "What are the benefits of PagedAttention?",
    "Compare quantization methods",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated: {generated[:100]}...")
    print()

7. Multi-GPU Inference: Tensor and Pipeline Parallelism

Tensor Parallelism

Tensor parallelism distributes individual matrix operations across multiple GPUs, splitting each Transformer layer horizontally.

Attention head distribution (4-way tensor parallel, 64 heads):

GPU 0: Heads 0–15
GPU 1: Heads 16–31
GPU 2: Heads 32–47
GPU 3: Heads 48–63

Each GPU computes its heads independently, then an AllReduce aggregates the results.
  • Pros: Reduced latency, enables large layers that don't fit on a single GPU
  • Cons: AllReduce communication required per layer — high-bandwidth NVLink is essential
  • Best for: Intra-node NVLink-connected GPUs, latency-sensitive applications
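
The math behind the split is easy to verify with NumPy. The sketch below shows a column-parallel linear layer: each "GPU" holds a slice of the weight's output columns and computes its partial result independently, after which the partials are concatenated (for row-parallel layers the partials would instead be summed — which is exactly what AllReduce does in a real deployment):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))  # activations: (batch, hidden)
W = rng.standard_normal((64, 128))

shards = np.split(W, 4, axis=1)        # 4-way split over output columns
partials = [x @ w for w in shards]     # each shard computed "on its own GPU"
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ W)  # identical to the unsharded computation
```

Because the result is mathematically identical, tensor parallelism changes only where the FLOPs happen — the cost is the per-layer communication to reassemble outputs.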

Pipeline Parallelism

Pipeline parallelism assigns groups of layers to different GPUs.

Llama-3.1-70B (80 layers), 4-way pipeline:

GPU 0: Layers 0-19
GPU 1: Layers 20-39
GPU 2: Layers 40-59
GPU 3: Layers 60-79

Sequential layer processing, activations forwarded between GPUs
  • Pros: Efficient even with low-bandwidth inter-node connections, minimal communication volume
  • Cons: Pipeline bubbles (downstream GPUs idle while upstream computes), increased latency
  • Best for: Multi-node distributed inference, very large models
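
The bubble cost can be quantified with the standard GPipe-style formula, (p − 1) / (m + p − 1) for p stages and m microbatches — borrowed here from the training literature as a rough intuition for pipelined inference:

```python
def bubble_fraction(num_stages, num_microbatches):
    # Fraction of time lost to pipeline bubbles: (p - 1) / (m + p - 1)
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(4, 1):.0%}")   # one batch through 4 stages: mostly idle
print(f"{bubble_fraction(4, 16):.0%}")  # 16 microbatches keep the pipeline busy
```

This is why pipeline parallelism only pays off with enough in-flight requests to keep every stage occupied.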

Memory Profiling

import torch

def profile_gpu_memory(func, *args, **kwargs):
    """Profile GPU memory usage"""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    before = torch.cuda.memory_allocated()
    result = func(*args, **kwargs)
    torch.cuda.synchronize()

    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()

    print(f"Memory increase: {(after - before) / 1e9:.3f} GB")
    print(f"Peak memory:     {peak / 1e9:.3f} GB")
    print()
    print(torch.cuda.memory_summary())
    return result

# Example memory stats output
def load_and_infer():
    from transformers import pipeline
    pipe = pipeline(
        "text-generation",
        model="microsoft/phi-2",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return pipe("GPU memory management is", max_new_tokens=50)

profile_gpu_memory(load_and_infer)

8. Practical Optimization Checklist

GPU Memory Optimization Strategies

  1. Apply quantization: INT4/INT8 saves 50–75% memory
  2. Optimize KV cache: Limit max_model_len, choose GQA models
  3. Flash Attention 2: Leverages SRAM to reduce memory from O(n²) to O(n)
  4. Model sharding: Use tensor or pipeline parallelism for multi-GPU
  5. Continuous batching: Maximize GPU utilization with dynamic request scheduling

Inference Speed Optimization

# Optimized vLLM server configuration example
vllm_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bfloat16",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.90,    # Use 90% of GPU memory
    "max_model_len": 8192,
    "max_num_batched_tokens": 8192,    # Max tokens per batch
    "max_num_seqs": 256,               # Max concurrent sequences
    "enable_chunked_prefill": True,    # Enable chunked prefill
    "block_size": 16,                  # KV cache block size (PagedAttention)
    "swap_space": 4,                   # CPU swap space in GB
    "enforce_eager": False,            # Use CUDA graphs
    "disable_log_stats": False,
}

Quiz: Check Your Understanding

Q1. Why do the prefill and decode phases of LLM inference have different compute characteristics?

Answer: Prefill is compute-bound; decode is memory-bound.

Explanation: During the prefill phase, all tokens in the prompt are processed in parallel — similar to batch processing. This yields high arithmetic intensity and keeps GPU compute units busy (compute-bound). During the decode phase, one token is generated per step by reading the entire KV cache from all previous tokens plus the full model weights. Every step requires loading the entire model weight matrix and the accumulated KV cache from memory, resulting in extremely low arithmetic intensity — making decode deeply memory-bound. With the H100's ridge point at ~295 FLOP/byte, the decode phase's AI of just 1–2 FLOP/byte means it runs far below peak compute.

Q2. Why does PagedAttention achieve higher memory efficiency than conventional KV cache management?

Answer: Non-contiguous physical block allocation and on-demand allocation eliminate fragmentation.

Explanation: Conventional systems pre-reserve contiguous memory equal to max sequence length per request, causing internal fragmentation (unused reserved space) and external fragmentation (unusable gaps left by completed requests). Studies show 60–80% of KV cache memory is wasted in practice. PagedAttention divides the KV cache into fixed-size blocks (e.g., 16 tokens), allocating physical blocks only when needed. Non-contiguous physical memory is abstracted through logical blocks, nearly eliminating fragmentation. Multiple requests sharing a common prompt prefix can also share physical KV blocks via Copy-on-Write, further reducing memory consumption.

Q3. How does AWQ preserve important weights better than GPTQ?

Answer: AWQ scales per-channel based on activation magnitude to protect salient weights.

Explanation: GPTQ minimizes quantization error using a second-order Hessian approximation but treats all weights roughly equally. AWQ (Activation-aware Weight Quantization) analyzes the activation distribution and observes that channels with large activation values (salient channels) contribute disproportionately to model performance. For these salient channels, AWQ multiplies the weights by a scale factor before quantization to inflate their values, then divides the corresponding activations by the same factor at inference to compensate. This protects the most important weights while maintaining hardware-friendly uniform quantization, resulting in lower perplexity than GPTQ at the same bit width.
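
The scale-then-compensate trick can be demonstrated in a few lines: multiplying a weight row by s and dividing the matching activation channel by s leaves the layer's output unchanged, so AWQ can inflate salient channels before quantization "for free". This is a toy identity check with made-up shapes and scales, not AWQ's search procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))  # activations: (batch, in_features)
W = rng.standard_normal((8, 4))  # weights: (in_features, out_features)

# Per-input-channel scales, >1 on hypothetical "salient" channels
s = np.array([1, 1, 4, 1, 1, 1, 2, 1], dtype=float)

y_ref = x @ W
y_scaled = (x / s) @ (W * s[:, None])  # scale weights up, compensate activations down

assert np.allclose(y_ref, y_scaled)  # output is mathematically unchanged
```

Quantization is then applied to the scaled weights (W · s), where the salient rows are larger and therefore lose less relative precision.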

Q4. How does continuous batching improve GPU utilization over static batching?

Answer: Iteration-level scheduling immediately reclaims completed sequence slots for new requests.

Explanation: Static batching stalls the GPU until every request in the batch finishes. Short requests leave their GPU slots idle while waiting for the longest sequence to complete. Continuous batching (iteration-level scheduling) reconstructs the batch at every inference step. When a sequence produces an EOS token or reaches max_tokens, its slot is immediately assigned to a waiting new request. The GPU therefore always operates at maximum batch capacity. In experiments, throughput improves 2–5x over static batching, and vLLM's paper reported up to 24x higher throughput compared to Hugging Face's static serving.

Q5. How do communication patterns differ between Tensor Parallelism and Pipeline Parallelism?

Answer: Tensor parallelism uses AllReduce every layer; pipeline parallelism uses point-to-point transfers at layer boundaries.

Explanation: Tensor Parallelism splits the weight matrices of each Transformer layer across GPUs. After each layer's computation, all GPUs must synchronize via an AllReduce collective to sum partial results. With 80 layers, that means 80 AllReduce operations, each adding communication latency — high-bandwidth NVLink interconnects are essential. Pipeline Parallelism assigns layer groups to different GPUs and only transfers activation tensors at group boundaries. This minimizes communication frequency but introduces pipeline bubbles (downstream GPUs idle while upstream GPUs process). For intra-node NVLink environments, tensor parallelism is preferred; for inter-node InfiniBand environments, pipeline parallelism scales better.


Conclusion

LLM inference optimization is the art of pushing hardware to its physical limits through software. Understanding the GPU memory hierarchy, managing KV cache efficiently, and combining the right quantization and batching strategies can dramatically improve performance on the same hardware.

Key Takeaways:

  • Memory savings: AWQ/GPTQ 4-bit quantization allows 70B models to run on a single A100 80G
  • Throughput gains: vLLM's PagedAttention + continuous batching delivers up to 24x throughput vs. static serving
  • Latency reduction: TensorRT-LLM with CUDA kernel fusion and FP8 inference
  • Scale-out: Tensor/pipeline parallelism breaks single-GPU limits across multi-GPU clusters