GPU Memory Management & LLM Inference Optimization: vLLM, PagedAttention, GPTQ, TensorRT-LLM
Author: Youngju Kim (@fjvbn20031)
Introduction
Deploying Large Language Models in production presents two fundamental challenges: GPU memory management and inference efficiency. GPT-4-scale models demand hundreds of gigabytes of memory, and real-time responsiveness requires generating dozens of tokens per second.
This guide covers every critical aspect of LLM inference optimization. From understanding GPU memory hierarchy to KV cache optimization, GPTQ/AWQ quantization, PagedAttention, continuous batching, and multi-GPU inference — everything a production engineer needs to know, explained step by step.
1. GPU Memory Hierarchy
HBM (High Bandwidth Memory)
The cornerstone of modern AI GPUs is HBM. HBM stacks multiple DRAM dies vertically, providing a far wider memory bus than conventional GDDR6.
| GPU | Memory | HBM Type | Bandwidth | Bus Width |
|---|---|---|---|---|
| A100 80G | 80 GB | HBM2e | 2.0 TB/s | 5120-bit |
| H100 SXM | 80 GB | HBM3 | 3.35 TB/s | 5120-bit |
| H200 SXM | 141 GB | HBM3e | 4.8 TB/s | 5120-bit |
| B200 SXM | 192 GB | HBM3e | 8.0 TB/s | 8192-bit |
| MI300X | 192 GB | HBM3 | 5.3 TB/s | 8192-bit |
L2 Cache and SRAM
The GPU memory hierarchy has three main levels:
- HBM (global memory): Tens to hundreds of GB, bandwidth in TB/s, latency ~hundreds of ns
- L2 cache: Tens of MB (H100: 50 MB), shared across all SMs
- L1 cache / SRAM (shared memory): 128–256 KB per SM, bandwidth tens of TB/s, latency ~a few ns
The SRAM inside each SM (Streaming Multiprocessor) is the second-fastest memory after register files. Optimizations like Flash Attention leverage SRAM aggressively to reduce HBM accesses.
Roofline Model: Analyzing Performance Limits
The Roofline Model is an analytical tool for determining whether a given computation is compute-bound or memory-bound.
Arithmetic Intensity (AI) = Number of FLOPs / Memory accessed (bytes)
Performance ceiling = min(Peak FLOPS, Peak Memory BW × AI)
- Low AI (memory-bound): Memory bandwidth is the bottleneck. The LLM decode phase is the classic example.
- High AI (compute-bound): Arithmetic speed is the bottleneck. LLM prefill phase and large-batch workloads.
For an H100:
- Peak FP16 FLOPS: 989 TFLOPS
- Peak HBM bandwidth: 3.35 TB/s
- Ridge point (balance point): 989 / 3.35 ≈ 295 FLOP/byte
When generating a single token, a 70B model (FP16) has AI ≈ 1–2 FLOP/byte — extremely memory-bound.
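The ridge-point arithmetic above can be checked in a few lines (a sketch: peak figures come from the H100 row of the table, and the 2-FLOPs-per-weight decode estimate is a standard approximation):

```python
# Roofline sketch for an H100 (illustrative peak numbers from the table above)
PEAK_FLOPS = 989e12  # FP16 tensor-core peak, FLOP/s
PEAK_BW = 3.35e12    # HBM3 bandwidth, bytes/s

def attainable_tflops(ai: float) -> float:
    """Performance ceiling in TFLOPS for a given arithmetic intensity (FLOP/byte)."""
    return min(PEAK_FLOPS, PEAK_BW * ai) / 1e12

ridge = PEAK_FLOPS / PEAK_BW  # balance point between compute- and memory-bound
print(f"Ridge point: {ridge:.0f} FLOP/byte")  # -> 295

# Decode step of a 70B FP16 model: ~2 FLOPs per parameter per token,
# while every parameter (2 bytes) must be streamed from HBM -> AI ~ 1 FLOP/byte
decode_ai = (2 * 70e9) / (70e9 * 2)
print(f"Decode ceiling: {attainable_tflops(decode_ai):.2f} TFLOPS")  # -> 3.35
```

At AI = 1 the ceiling is 3.35 TFLOPS out of a 989 TFLOPS peak, which is why decode throughput tracks memory bandwidth rather than compute.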
2. LLM Memory Calculations
Parameter Memory
Accurately calculating LLM memory requirements is the foundation of deployment planning.
```python
def calc_model_memory_gb(
    num_params: int,         # Number of parameters (e.g., 70e9)
    dtype_bytes: float = 2,  # FP16=2, FP32=4, INT8=1, INT4=0.5
) -> float:
    """Calculate model weight memory in decimal GB."""
    return (num_params * dtype_bytes) / 1e9

# Major model memory (FP16 baseline)
models = {
    "Llama-3.1-8B": {"params": 8e9, "bytes": 2},
    "Llama-3.1-70B": {"params": 70e9, "bytes": 2},
    "Llama-3.1-405B": {"params": 405e9, "bytes": 2},
    "Mistral-7B": {"params": 7e9, "bytes": 2},
    "Qwen2-72B": {"params": 72e9, "bytes": 2},
}
for name, cfg in models.items():
    mem_gb = calc_model_memory_gb(cfg["params"], cfg["bytes"])
    print(f"{name}: {mem_gb:.1f} GB")
```
| Model | Parameters | FP32 | FP16/BF16 | INT8 | INT4 |
|---|---|---|---|---|---|
| Llama-3.1-8B | 8B | 32 GB | 16 GB | 8 GB | 4 GB |
| Llama-3.1-70B | 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| Llama-3.1-405B | 405B | 1620 GB | 810 GB | 405 GB | 202 GB |
| Mistral-7B | 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
KV Cache Memory Calculation
KV cache is the most dynamically changing component of inference memory. It scales proportionally with sequence length and batch size.
```python
def calc_kv_cache_memory_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """
    KV cache memory in decimal GB.
    Per layer: 2 (K, V) × num_kv_heads × head_dim × seq_len × batch_size
    """
    kv_per_layer = 2 * num_kv_heads * head_dim * seq_len * batch_size
    total_bytes = kv_per_layer * num_layers * dtype_bytes
    return total_bytes / 1e9

# Llama-3.1-70B example
# layers=80, query heads=64, GQA kv_heads=8, head_dim=128
kv_mem = calc_kv_cache_memory_gb(
    num_layers=80,
    num_kv_heads=8,  # GQA uses kv_heads, not query heads
    head_dim=128,
    seq_len=4096,
    batch_size=1,
    dtype_bytes=2,
)
print(f"KV cache (seq=4096, bs=1): {kv_mem:.2f} GB")
# Output: KV cache (seq=4096, bs=1): 1.34 GB

# KV cache by batch size
for bs in [1, 4, 8, 16, 32]:
    mem = calc_kv_cache_memory_gb(80, 8, 128, 4096, bs, 2)
    print(f"  batch_size={bs:2d}: {mem:.2f} GB")
```
KV Cache Memory (Llama-3.1-70B, seq_len=4096, FP16)
| Batch Size | KV Cache | Model Weights | Total |
|---|---|---|---|
| 1 | 1.3 GB | 140 GB | 141.3 GB |
| 4 | 5.4 GB | 140 GB | 145.4 GB |
| 8 | 10.7 GB | 140 GB | 150.7 GB |
| 16 | 21.5 GB | 140 GB | 161.5 GB |
| 32 | 42.9 GB | 140 GB | 182.9 GB |
Activation Memory
During inference, activation memory is proportional to the product of batch size, sequence length, and hidden size. Unlike training, inference does not store gradients, making activation memory relatively small.
```python
def calc_activation_memory_gb(
    hidden_size: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,
) -> float:
    """Approximate peak activation memory during inference (decimal GB)."""
    # Unlike training, inference only needs the current layer's activations
    # (plus kernel workspace), so the layer count drops out of the estimate.
    # Approximation: 2 × hidden_size × seq_len × batch_size live elements
    peak_bytes = 2 * hidden_size * seq_len * batch_size * dtype_bytes
    return peak_bytes / 1e9
```
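Putting the three components together gives a quick single-node capacity estimate (a sketch using the same coefficients as above; activation memory here counts only the live layer's tensors):

```python
def total_inference_memory_gb(
    num_params: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    hidden_size: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,
) -> float:
    """Rough total = weights + KV cache + live activations (decimal GB)."""
    weights = num_params * dtype_bytes
    kv = 2 * num_kv_heads * head_dim * seq_len * batch_size * num_layers * dtype_bytes
    act = 2 * hidden_size * seq_len * batch_size * dtype_bytes  # live layer only
    return (weights + kv + act) / 1e9

# Llama-3.1-70B, FP16, seq_len=4096, batch_size=8
print(f"{total_inference_memory_gb(70e9, 80, 8, 128, 8192, 4096, 8):.1f} GB")
# -> 151.8 GB (exceeds a single A100/H100 80G -> quantize or shard)
```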
3. KV Cache Optimization: PagedAttention
Problems with Conventional KV Cache
Traditional LLM serving systems allocate KV cache as contiguous memory blocks. This causes serious issues:
- Internal Fragmentation: Pre-allocating to max sequence length wastes unused space
- External Fragmentation: Varying request sizes create unusable gaps when requests complete
- Low utilization: In practice, 60–80% of KV cache memory is wasted
PagedAttention: Applying OS Paging Principles
vLLM's PagedAttention applies virtual memory paging concepts to KV cache management.
OS Virtual Memory → PagedAttention
─────────────────────────────────────
Virtual page → Logical block
Physical frame → Physical block
Page table → Block table
Page fault → Block allocation
Core ideas:
- Split KV cache into fixed-size blocks (e.g., 16 tokens each)
- Access sequence KV via logical blocks; allocate physical blocks on demand
- Share physical blocks with Copy-on-Write when requests share a common prefix
Request A: [Block 0] → [Block 1] → [Block 2]
↕ Physical block sharing (common prompt)
Request B: [Block 0] → [Block 1] → [Block 3]
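The logical-to-physical mapping can be sketched as a tiny block table (a toy model of the idea, not vLLM's actual implementation; `BLOCK_SIZE` and the free-list allocator are illustrative):

```python
BLOCK_SIZE = 16  # tokens per KV block

class BlockTable:
    """Toy page-table-style mapping from logical to physical KV blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.physical = []  # physical block id for each logical block

    def append_token(self, token_idx):
        # A physical block is allocated only when a block boundary is crossed,
        # so no memory is reserved for tokens that were never generated
        if token_idx % BLOCK_SIZE == 0:
            self.physical.append(self.free_blocks.pop(0))

free_blocks = list(range(100))   # pool of physical KV blocks
table = BlockTable(free_blocks)
for t in range(40):              # 40 tokens need ceil(40/16) = 3 blocks
    table.append_token(t)
print(table.physical)            # -> [0, 1, 2]
```

A real implementation adds reference counting on physical blocks so that prefix-sharing requests can point at the same blocks until a write triggers Copy-on-Write.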
vLLM Server Launch Example
```shell
# Install vLLM
pip install vllm

# Start server (single GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# OpenAI-compatible API call
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Explain GPU memory optimization"}],
        "max_tokens": 512,
        "temperature": 0.7
    }'
```
```python
# Python client calling the vLLM OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a GPU optimization expert."},
        {"role": "user", "content": "Explain how PagedAttention works."},
    ],
    max_tokens=1024,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
4. Quantization: GPTQ, AWQ, GGUF, bitsandbytes
Quantization Method Comparison
| Method | Precision | Memory Savings | Speed | Quality Loss | Notes |
|---|---|---|---|---|---|
| FP16/BF16 | 16-bit | Baseline | Baseline | None | Default |
| GPTQ | 4-bit | ~75% | Fast | Low | PTQ, GPU only |
| AWQ | 4-bit | ~75% | Fast | Very low | Activation-aware |
| GGUF | 2–8-bit | Variable | CPU capable | Variable | llama.cpp |
| bitsandbytes NF4 | 4-bit | ~75% | Moderate | Low | QLoRA training |
| bitsandbytes INT8 | 8-bit | ~50% | Moderate | Very low | LLM.int8() |
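The memory column follows directly from bits per weight; a quick back-of-the-envelope for an 8B model (a sketch that ignores the small per-group scale/zero-point overhead of 4-bit formats):

```python
# Weight memory = num_params × bits / 8, in decimal GB
num_params = 8e9
for name, bits in {"FP16": 16, "INT8": 8, "GPTQ/AWQ 4-bit": 4}.items():
    print(f"{name}: {num_params * bits / 8 / 1e9:.0f} GB")
# FP16: 16 GB, INT8: 8 GB, 4-bit: 4 GB
# (real 4-bit checkpoints are slightly larger, e.g. ~4.5 GB, due to group scales)
```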
bitsandbytes 4-bit Quantization Loading
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # Double quantization for extra compression
)

model_id = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Automatic multi-GPU distribution
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Check memory usage
print(f"GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```
GPTQ Quantization (auto-gptq)
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Group size (smaller = more accurate but more memory)
    desc_act=False,     # Activation-order quantization (act-order); False is faster
    damp_percent=0.01,  # Hessian damping coefficient
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration data (representative text samples);
# auto-gptq expects examples with input_ids and attention_mask
texts = [
    "The GPU accelerates machine learning by...",
    "Quantization reduces model size while...",
    # Recommend 1024+ samples in practice
]
calibration_data = [tokenizer(t, return_tensors="pt") for t in texts]

# Load model and quantize
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)
model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")
print("GPTQ quantization complete!")

# Load quantized model
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "llama-3.1-8b-gptq-4bit",
    use_safetensors=True,
    device="cuda:0",
)
```
AWQ: Activation-Aware Weight Quantization
AWQ does not quantize all weights equally. Channels with large activation values (important weights) are protected at higher precision.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# AWQ quantization configuration
quant_config = {
    "zero_point": True,   # Zero-point quantization
    "q_group_size": 128,  # Group size
    "w_bit": 4,           # 4-bit
    "version": "GEMM",    # Matrix multiplication kernel
}

model = AutoAWQForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print("AWQ quantization complete!")
```
Quantization Benchmark (Llama-3.1-8B)
| Method | Memory | Throughput (tok/s) | Perplexity | Notes |
|---|---|---|---|---|
| FP16 | 16 GB | 100 (baseline) | 7.2 | Baseline |
| BF16 | 16 GB | 100 | 7.2 | Equivalent to FP16 |
| INT8 | 8 GB | 75 | 7.3 | Minor quality loss |
| GPTQ-4bit | 4.5 GB | 120 | 7.6 | Memory savings + speed |
| AWQ-4bit | 4.5 GB | 125 | 7.4 | Better quality than GPTQ |
| GGUF-Q4_K_M | 4.8 GB | 80 (CPU) | 7.5 | CPU inference capable |
5. Batching Strategies: Continuous Batching
Limitations of Static Batching
Traditional static batching waits for all requests in a batch to start simultaneously and complete together. This leads to severe GPU underutilization.
Static Batching (batch_size=3):
Time →
[Request A: ████████████░░░░░░░░] (12 tokens generated)
[Request B: ████░░░░░░░░░░░░░░░░] (4 tokens generated)
[Request C: ████████░░░░░░░░░░░░] (8 tokens generated)
└─ B and C must wait for A to finish (GPU wasted)
Continuous Batching (Iteration-level Scheduling)
Modern LLM serving systems like vLLM and TensorRT-LLM use continuous batching, dynamically reconstructing the batch at each inference step (iteration).
Continuous Batching:
Step 1: [A1][B1][C1] ← 3 processed simultaneously
Step 2: [A2][B2][C2]
Step 3: [A3][B3][C3] ← B completes, new request D added
Step 4: [A4][C4][D1] ← Empty slot immediately filled
Step 5: [A5][C5][D2]
Step 6: [A6][C6][D3] ← C completes, new request E added
...
GPU utilization improves 2–5x compared to static batching.
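The iteration-level loop above can be sketched as a toy scheduler (illustrative only; real systems like vLLM also gate admission on KV block availability):

```python
from collections import deque

def continuous_batching(requests, max_batch=3):
    """requests: list of (name, tokens_to_generate). Returns per-step batch contents."""
    waiting = deque(requests)
    running = {}  # sequence name -> tokens still to generate
    trace = []
    while waiting or running:
        # Fill any free slots immediately from the waiting queue
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running[name] = n
        trace.append(sorted(running))
        # One decode iteration: every running sequence emits one token
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]  # slot freed this very step
    return trace

print(continuous_batching([("A", 6), ("B", 3), ("C", 6), ("D", 3)]))
# -> [['A','B','C'], ['A','B','C'], ['A','B','C'],
#     ['A','C','D'], ['A','C','D'], ['A','C','D']]
```

When B finishes at step 3, D takes its slot at step 4, mirroring the diagram above; a static batcher would have left that slot empty until A completed.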
Separating Prefill and Decode
LLM inference has two distinct phases:
- Prefill: Processes the entire prompt at once. Compute-bound (behaves like batch processing)
- Decode: Autoregressive token-by-token generation. Memory-bound
These phases have different GPU resource requirements. Disaggregated Prefill is an architecture that separates prefill-dedicated GPUs from decode-dedicated GPUs.
6. LLM Inference Framework Comparison
| Framework | Developer | Key Features | Best Use Case |
|---|---|---|---|
| vLLM | UC Berkeley | PagedAttention, OpenAI-compatible API | High-throughput serving |
| TensorRT-LLM | NVIDIA | Optimized CUDA kernels, FP8 support | Lowest latency |
| Ollama | Ollama Inc | Easy local execution | Development/testing |
| llama.cpp | ggml | CPU inference, GGUF format | Edge/local |
| SGLang | LMSYS | Structured generation, RadixAttention | Complex pipelines |
vLLM Tensor Parallel Inference
```python
from vllm import LLM, SamplingParams

# Tensor parallel across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,       # Distribute across 4 GPUs
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    enforce_eager=False,          # Use CUDA graph optimization
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)

prompts = [
    "Explain the GPU memory hierarchy",
    "What are the benefits of PagedAttention?",
    "Compare quantization methods",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated: {generated[:100]}...")
    print()
```
7. Multi-GPU Inference: Tensor and Pipeline Parallelism
Tensor Parallelism
Tensor parallelism distributes individual matrix operations across multiple GPUs, splitting each Transformer layer horizontally.
Attention head distribution (4-way tensor parallel):
GPU 0: Head 0-15
GPU 1: Head 16-31
GPU 2: Head 32-47
GPU 3: Head 48-63
Each GPU computes independently, then AllReduce aggregates results
- Pros: Reduced latency, enables large layers that don't fit on a single GPU
- Cons: AllReduce communication required per layer — high-bandwidth NVLink is essential
- Best for: Intra-node NVLink-connected GPUs, latency-sensitive applications
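The split-then-AllReduce pattern can be emulated numerically on one machine (a toy NumPy sketch of a row-parallel linear layer, not actual multi-GPU code; the 4-way split and sizes are illustrative):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(8)     # activation vector (hidden size 8)
W = np.random.randn(8, 8)  # weight matrix of one linear layer

# Row-parallel split across 4 "GPUs": each holds 2 rows of W and the
# matching slice of x, producing a partial output of full width.
partials = [x[i * 2:(i + 1) * 2] @ W[i * 2:(i + 1) * 2, :] for i in range(4)]

# AllReduce (sum) combines the partial results on every GPU
y_parallel = np.sum(partials, axis=0)
assert np.allclose(y_parallel, x @ W)
print("row-parallel result matches single-device matmul")
```

This sum-of-partials is exactly what must cross the interconnect after every layer, which is why per-layer AllReduce cost dominates tensor-parallel scaling.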
Pipeline Parallelism
Pipeline parallelism assigns groups of layers to different GPUs.
Llama-3.1-70B (80 layers) → 4-way pipeline:
GPU 0: Layers 0-19
GPU 1: Layers 20-39
GPU 2: Layers 40-59
GPU 3: Layers 60-79
Sequential layer processing, activations forwarded between GPUs
- Pros: Efficient even with low-bandwidth inter-node connections, minimal communication volume
- Cons: Pipeline bubbles (downstream GPUs idle while upstream computes), increased latency
- Best for: Multi-node distributed inference, very large models
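For a quick capacity check, both schemes divide weight memory roughly evenly across GPUs (a sketch that ignores replicated embeddings and the per-GPU KV cache):

```python
# Per-GPU weight memory under N-way tensor or pipeline parallelism
# (Llama-3.1-70B, FP16)
total_gb = 70e9 * 2 / 1e9  # 140 GB of weights
for n in (2, 4, 8):
    print(f"{n}-way split: ~{total_gb / n:.1f} GB weights per GPU")
# -> 70.0 / 35.0 / 17.5 GB
```

At 4-way, the ~35 GB of weights per GPU leaves room on an 80 GB card for KV cache and activations, which is why 70B models are commonly served with tensor_parallel_size=4.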
Memory Profiling
```python
import torch

def profile_gpu_memory(func, *args, **kwargs):
    """Profile GPU memory usage of a callable."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    before = torch.cuda.memory_allocated()
    result = func(*args, **kwargs)
    torch.cuda.synchronize()
    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()
    print(f"Memory increase: {(after - before) / 1e9:.3f} GB")
    print(f"Peak memory: {peak / 1e9:.3f} GB")
    print()
    print(torch.cuda.memory_summary())
    return result

# Example: load a small model and run one generation under the profiler
def load_and_infer():
    from transformers import pipeline
    pipe = pipeline(
        "text-generation",
        model="microsoft/phi-2",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return pipe("GPU memory management is", max_new_tokens=50)

profile_gpu_memory(load_and_infer)
```
8. Practical Optimization Checklist
GPU Memory Optimization Strategies
- Apply quantization: INT4/INT8 saves 50–75% memory
- Optimize KV cache: Limit max_model_len, choose GQA models
- Flash Attention 2: Leverages SRAM to reduce memory from O(n²) to O(n)
- Model sharding: Use tensor or pipeline parallelism for multi-GPU
- Continuous batching: Maximize GPU utilization with dynamic request scheduling
Inference Speed Optimization
```python
# Optimized vLLM server configuration example
vllm_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bfloat16",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.90,   # Use 90% of GPU memory
    "max_model_len": 8192,
    "max_num_batched_tokens": 8192,   # Max tokens per batch
    "max_num_seqs": 256,              # Max concurrent sequences
    "enable_chunked_prefill": True,   # Enable chunked prefill
    "block_size": 16,                 # KV cache block size (PagedAttention)
    "swap_space": 4,                  # CPU swap space in GB
    "enforce_eager": False,           # Use CUDA graphs
    "disable_log_stats": False,
}
```
Quiz: Check Your Understanding
Q1. Why do the prefill and decode phases of LLM inference have different compute characteristics?
Answer: Prefill is compute-bound; decode is memory-bound.
Explanation: During the prefill phase, all tokens in the prompt are processed in parallel — similar to batch processing. This yields high arithmetic intensity and keeps GPU compute units busy (compute-bound). During the decode phase, one token is generated per step by reading the entire KV cache from all previous tokens plus the full model weights. Every step requires loading the entire model weight matrix and the accumulated KV cache from memory, resulting in extremely low arithmetic intensity — making decode deeply memory-bound. With the H100's ridge point at ~295 FLOP/byte, the decode phase's AI of just 1–2 FLOP/byte means it runs far below peak compute.
Q2. Why does PagedAttention achieve higher memory efficiency than conventional KV cache management?
Answer: Non-contiguous physical block allocation and on-demand allocation eliminate fragmentation.
Explanation: Conventional systems pre-reserve contiguous memory equal to max sequence length per request, causing internal fragmentation (unused reserved space) and external fragmentation (unusable gaps left by completed requests). Studies show 60–80% of KV cache memory is wasted in practice. PagedAttention divides the KV cache into fixed-size blocks (e.g., 16 tokens), allocating physical blocks only when needed. Non-contiguous physical memory is abstracted through logical blocks, nearly eliminating fragmentation. Multiple requests sharing a common prompt prefix can also share physical KV blocks via Copy-on-Write, further reducing memory consumption.
Q3. How does AWQ preserve important weights better than GPTQ?
Answer: AWQ scales per-channel based on activation magnitude to protect salient weights.
Explanation: GPTQ minimizes quantization error using a second-order Hessian approximation but treats all weights roughly equally. AWQ (Activation-aware Weight Quantization) analyzes the activation distribution and observes that channels with large activation values (salient channels) contribute disproportionately to model performance. For these salient channels, AWQ multiplies the weights by a scale factor before quantization to inflate their values, then divides the corresponding activations by the same factor at inference to compensate. This protects the most important weights while maintaining hardware-friendly uniform quantization, resulting in lower perplexity than GPTQ at the same bit width.
Q4. How does continuous batching improve GPU utilization over static batching?
Answer: Iteration-level scheduling immediately reclaims completed sequence slots for new requests.
Explanation: Static batching stalls the GPU until every request in the batch finishes. Short requests leave their GPU slots idle while waiting for the longest sequence to complete. Continuous batching (iteration-level scheduling) reconstructs the batch at every inference step. When a sequence produces an EOS token or reaches max_tokens, its slot is immediately assigned to a waiting new request. The GPU therefore always operates at maximum batch capacity. In experiments, throughput improves 2–5x over static batching, and vLLM's paper reported up to 24x higher throughput compared to Hugging Face's static serving.
Q5. How do communication patterns differ between Tensor Parallelism and Pipeline Parallelism?
Answer: Tensor parallelism uses AllReduce every layer; pipeline parallelism uses point-to-point transfers at layer boundaries.
Explanation: Tensor Parallelism splits the weight matrices of each Transformer layer across GPUs. After each layer's computation, all GPUs must synchronize via an AllReduce collective to sum partial results. With 80 layers, that means 80 AllReduce operations, each adding communication latency — high-bandwidth NVLink interconnects are essential. Pipeline Parallelism assigns layer groups to different GPUs and only transfers activation tensors at group boundaries. This minimizes communication frequency but introduces pipeline bubbles (downstream GPUs idle while upstream GPUs process). For intra-node NVLink environments, tensor parallelism is preferred; for inter-node InfiniBand environments, pipeline parallelism scales better.
Conclusion
LLM inference optimization is the art of pushing hardware to its physical limits through software. Understanding the GPU memory hierarchy, managing KV cache efficiently, and combining the right quantization and batching strategies can dramatically improve performance on the same hardware.
Key Takeaways:
- Memory savings: AWQ/GPTQ 4-bit quantization allows 70B models to run on a single A100 80G
- Throughput gains: vLLM's PagedAttention + continuous batching delivers up to 24x throughput vs. static serving
- Latency reduction: TensorRT-LLM with CUDA kernel fusion and FP8 inference
- Scale-out: Tensor/pipeline parallelism breaks single-GPU limits across multi-GPU clusters