GPU Memory Management & LLM Inference Optimization: vLLM, PagedAttention, GPTQ, TensorRT-LLM

Introduction

Deploying Large Language Models in production presents two fundamental challenges: GPU memory management and inference efficiency. GPT-4-scale models demand hundreds of gigabytes of memory, and real-time responsiveness requires generating dozens of tokens per second.

This guide covers every critical aspect of LLM inference optimization. From understanding GPU memory hierarchy to KV cache optimization, GPTQ/AWQ quantization, PagedAttention, continuous batching, and multi-GPU inference — everything a production engineer needs to know, explained step by step.


1. GPU Memory Hierarchy

HBM (High Bandwidth Memory)

The cornerstone of modern AI GPUs is HBM. HBM stacks multiple DRAM dies vertically, providing a far wider memory bus than conventional GDDR6.

| GPU | Memory | HBM Type | Bandwidth | Bus Width |
|---|---|---|---|---|
| A100 80G | 80 GB | HBM2e | 2.0 TB/s | 5120-bit |
| H100 SXM | 80 GB | HBM3 | 3.35 TB/s | 5120-bit |
| H200 SXM | 141 GB | HBM3e | 4.8 TB/s | 6144-bit |
| B200 SXM | 192 GB | HBM3e | 8.0 TB/s | 8192-bit |
| MI300X | 192 GB | HBM3 | 5.3 TB/s | 8192-bit |

L2 Cache and SRAM

The GPU memory hierarchy has three main levels:

  1. HBM (global memory): Tens to hundreds of GB, bandwidth in TB/s, latency ~hundreds of ns
  2. L2 cache: Tens to hundreds of MB (H100: 50 MB), shared across all SMs
  3. L1 cache / SRAM (shared memory): 128–256 KB per SM, bandwidth tens of TB/s, latency ~a few ns

The SRAM inside each SM (Streaming Multiprocessor) is the second-fastest memory after register files. Optimizations like Flash Attention leverage SRAM aggressively to reduce HBM accesses.

Roofline Model: Analyzing Performance Limits

The Roofline Model is an analytical tool for determining whether a given computation is compute-bound or memory-bound.

Arithmetic Intensity (AI) = Number of FLOPs / Memory accessed (bytes)

Performance ceiling = min(Peak FLOPS, Peak Memory BW × AI)
  • Low AI (memory-bound): Memory bandwidth is the bottleneck. The LLM decode phase is the classic example.
  • High AI (compute-bound): Arithmetic speed is the bottleneck. LLM prefill phase and large-batch workloads.

For an H100:

  • Peak FP16 FLOPS: 989 TFLOPS
  • Peak HBM bandwidth: 3.35 TB/s
  • Ridge point (balance point): 989 / 3.35 ≈ 295 FLOP/byte

When generating a single token, a 70B model (FP16) has AI ≈ 1–2 FLOP/byte — extremely memory-bound.
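
The roofline arithmetic above can be checked in a few lines (H100 figures from the list; a back-of-envelope sketch, not a profiler measurement):

```python
def attainable_tflops(ai, peak_tflops=989.0, bw_tb_per_s=3.35):
    """Roofline ceiling: min(compute roof, bandwidth roof x arithmetic intensity)."""
    return min(peak_tflops, bw_tb_per_s * ai)

ridge = 989.0 / 3.35  # ~295 FLOP/byte
print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"decode  (AI~2):   {attainable_tflops(2):.1f} TFLOPS")    # on the bandwidth roof: memory-bound
print(f"prefill (AI~300): {attainable_tflops(300):.1f} TFLOPS")  # hits the compute roof
```

At AI ≈ 2, the attainable rate is ~6.7 TFLOPS — under 1% of the H100's peak — which is why the decode phase leaves compute units idle.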


2. LLM Memory Calculations

Parameter Memory

Accurately calculating LLM memory requirements is the foundation of deployment planning.

def calc_model_memory_gb(
    num_params: int,         # Number of parameters (e.g., 70e9)
    dtype_bytes: float = 2,  # FP16=2, FP32=4, INT8=1, INT4=0.5
) -> float:
    """Calculate model weight memory (decimal GB, matching the table below)"""
    return (num_params * dtype_bytes) / 1e9

# Major model memory (FP16 baseline)
models = {
    "Llama-3.1-8B":   {"params": 8e9,   "bytes": 2},
    "Llama-3.1-70B":  {"params": 70e9,  "bytes": 2},
    "Llama-3.1-405B": {"params": 405e9, "bytes": 2},
    "Mistral-7B":     {"params": 7e9,   "bytes": 2},
    "Qwen2-72B":      {"params": 72e9,  "bytes": 2},
}

for name, cfg in models.items():
    mem_gb = calc_model_memory_gb(cfg["params"], cfg["bytes"])
    print(f"{name}: {mem_gb:.1f} GB")

| Model | Parameters | FP32 | FP16/BF16 | INT8 | INT4 |
|---|---|---|---|---|---|
| Llama-3.1-8B | 8B | 32 GB | 16 GB | 8 GB | 4 GB |
| Llama-3.1-70B | 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| Llama-3.1-405B | 405B | 1620 GB | 810 GB | 405 GB | 202 GB |
| Mistral-7B | 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |

KV Cache Memory Calculation

KV cache is the most dynamically changing component of inference memory. It scales proportionally with sequence length and batch size.

def calc_kv_cache_memory_gb(
    num_layers: int,
    num_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """
    KV cache memory calculation.
    Per layer: 2 (K, V) × num_heads × head_dim × seq_len × batch_size
    """
    kv_per_layer = 2 * num_heads * head_dim * seq_len * batch_size
    total_bytes = kv_per_layer * num_layers * dtype_bytes
    return total_bytes / (1024 ** 3)

# Llama-3.1-70B example
# layers=80, heads=64 (GQA: kv_heads=8), head_dim=128
kv_mem = calc_kv_cache_memory_gb(
    num_layers=80,
    num_heads=8,      # GQA uses kv_heads
    head_dim=128,
    seq_len=4096,
    batch_size=1,
    dtype_bytes=2,
)
print(f"KV cache (seq=4096, bs=1): {kv_mem:.2f} GB")
# Output: KV cache (seq=4096, bs=1): 1.25 GB

# KV cache by batch size
for bs in [1, 4, 8, 16, 32]:
    mem = calc_kv_cache_memory_gb(80, 8, 128, 4096, bs, 2)
    print(f"  batch_size={bs:2d}: {mem:.2f} GB")

KV Cache Memory (Llama-3.1-70B, seq_len=4096, FP16)

| Batch Size | KV Cache | Model Weights | Total |
|---|---|---|---|
| 1 | 1.25 GB | 140 GB | 141.25 GB |
| 4 | 5.0 GB | 140 GB | 145.0 GB |
| 8 | 10.0 GB | 140 GB | 150.0 GB |
| 16 | 20.0 GB | 140 GB | 160.0 GB |
| 32 | 40.0 GB | 140 GB | 180.0 GB |
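
These numbers translate directly into a capacity estimate: subtract the weights from usable GPU memory and divide by the per-sequence KV cache (the calculation above gives ~1.25 GB per sequence at 4K context). The helper below is a hypothetical illustration; the 10% reserve for activations and overhead is an assumption, not a vLLM default:

```python
def max_concurrent_seqs(total_gpu_gb, weight_gb, kv_per_seq_gb, reserve_frac=0.10):
    # Memory left after weights and a safety reserve, divided by per-sequence KV cache
    usable = total_gpu_gb * (1 - reserve_frac) - weight_gb
    return max(0, int(usable // kv_per_seq_gb))

# Llama-3.1-70B FP16 (140 GB weights), 4K context (~1.25 GB KV per sequence)
print(max_concurrent_seqs(160, 140, 1.25))  # 2x A100 80GB: only a handful of sequences
print(max_concurrent_seqs(320, 140, 1.25))  # 4x A100 80GB: far more headroom
```

This is why quantizing weights or shortening max_model_len has an outsized effect on achievable batch size.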

Activation Memory

During inference, activation memory is proportional to the product of batch size, sequence length, and hidden size. Unlike training, inference does not store gradients, making activation memory relatively small.

def calc_activation_memory_gb(
    hidden_size: int,
    seq_len: int,
    batch_size: int,
    num_layers: int,
    dtype_bytes: int = 2,
) -> float:
    """Approximate activation memory during inference"""
    # Per layer: attention + FFN activations
    # Approximation: 2 × hidden_size × seq_len × batch_size per layer
    bytes_per_layer = 2 * hidden_size * seq_len * batch_size * dtype_bytes
    return (bytes_per_layer * num_layers) / (1024 ** 3)
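
Plugging Llama-3.1-70B numbers (hidden_size=8192, 80 layers) into this approximation shows why decode-time activations are negligible next to weights and KV cache, while prefill at long context is not:

```python
def act_mem_gb(hidden_size, seq_len, batch_size, num_layers, dtype_bytes=2):
    # Same approximation as above: 2 x hidden x seq x batch per layer
    per_layer = 2 * hidden_size * seq_len * batch_size * dtype_bytes
    return per_layer * num_layers / (1024 ** 3)

prefill = act_mem_gb(8192, 4096, 1, 80)  # whole 4K prompt processed at once
decode = act_mem_gb(8192, 1, 1, 80)      # one token per step
print(f"prefill: {prefill:.2f} GB, decode step: {decode:.4f} GB")
```

A single decode step needs only a few MB of activations; the 4K-token prefill needs roughly 10 GB.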

3. KV Cache Optimization: PagedAttention

Problems with Conventional KV Cache

Traditional LLM serving systems allocate KV cache as contiguous memory blocks. This causes serious issues:

  1. Internal Fragmentation: Pre-allocating to max sequence length wastes unused space
  2. External Fragmentation: Varying request sizes create unusable gaps when requests complete
  3. Poor memory utilization: In practice, 60–80% of KV cache memory ends up wasted

PagedAttention: Applying OS Paging Principles

vLLM's PagedAttention applies virtual memory paging concepts to KV cache management.

OS Virtual MemoryPagedAttention
─────────────────────────────────────
Virtual page         → Logical block
Physical frame       → Physical block
Page table           → Block table
Page fault           → Block allocation

Core ideas:

  • Split KV cache into fixed-size blocks (e.g., 16 tokens each)
  • Access sequence KV via logical blocks; allocate physical blocks on demand
  • Share physical blocks with Copy-on-Write when requests share a common prefix

Request A: [Block 0][Block 1][Block 2]
Request B: [Block 0][Block 1][Block 3]
            └── Blocks 0–1 physically shared (common prompt prefix)
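
The bookkeeping can be sketched as a toy block-table allocator. The block size, ref-counting, and the `fork` sharing below are simplified illustrations of the idea, not vLLM's actual implementation:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockManager:
    """Toy PagedAttention-style bookkeeping: logical -> physical block mapping."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.ref_count = {}     # physical block id -> number of sequences using it
        self.block_tables = {}  # seq id -> list of physical block ids

    def allocate(self, seq_id, num_tokens):
        # On-demand allocation: only ceil(num_tokens / BLOCK_SIZE) blocks, no
        # pre-reservation up to max sequence length
        n_blocks = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        table = [self.free.pop() for _ in range(n_blocks)]
        for b in table:
            self.ref_count[b] = 1
        self.block_tables[seq_id] = table

    def fork(self, parent_id, child_id):
        # Copy-on-Write sharing: the child reuses the parent's physical blocks
        table = list(self.block_tables[parent_id])
        for b in table:
            self.ref_count[b] += 1
        self.block_tables[child_id] = table

    def free_seq(self, seq_id):
        # A physical block returns to the free pool only when no sequence uses it
        for b in self.block_tables.pop(seq_id):
            self.ref_count[b] -= 1
            if self.ref_count[b] == 0:
                self.free.append(b)
```

For example, a 40-token sequence takes 3 blocks; forking a request with a shared prefix consumes no new blocks until one side diverges.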

vLLM Server Launch Example

# Install vLLM
pip install vllm

# Start server (single GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# OpenAI-compatible API call
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain GPU memory optimization"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'
# Python client calling vLLM API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a GPU optimization expert."},
        {"role": "user", "content": "Explain how PagedAttention works."},
    ],
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

4. Quantization: GPTQ, AWQ, GGUF, bitsandbytes

Quantization Method Comparison

| Method | Precision | Memory Savings | Speed | Quality Loss | Notes |
|---|---|---|---|---|---|
| FP16/BF16 | 16-bit | Baseline | Baseline | None | Default |
| GPTQ | 4-bit | ~75% | Fast | Low | PTQ, GPU only |
| AWQ | 4-bit | ~75% | Fast | Very low | Activation-aware |
| GGUF | 2–8-bit | Variable | CPU capable | Variable | llama.cpp |
| bitsandbytes NF4 | 4-bit | ~75% | Moderate | Low | QLoRA training |
| bitsandbytes INT8 | 8-bit | ~50% | Moderate | Very low | LLM.int8() |

bitsandbytes 4-bit Quantization Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # Double quantization for extra compression
)

model_id = "meta-llama/Llama-3.1-70B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",      # Automatic multi-GPU distribution
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Check memory usage
print(f"GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

GPTQ Quantization (auto-gptq)

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Group size (smaller = more accurate, more scale overhead)
    desc_act=False,      # Activation-order (act-order) quantization; False is faster
    damp_percent=0.01,   # Hessian damping coefficient
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration data (representative text samples)
# auto-gptq expects dicts containing input_ids and attention_mask
calibration_data = [
    tokenizer("The GPU accelerates machine learning by...", return_tensors="pt"),
    tokenizer("Quantization reduces model size while...", return_tensors="pt"),
    # Use a few hundred samples in practice (128–1024 is common)
]

# Load model and quantize
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")
print("GPTQ quantization complete!")

# Load quantized model
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "llama-3.1-8b-gptq-4bit",
    use_safetensors=True,
    device="cuda:0",
)

AWQ: Activation-Aware Weight Quantization

AWQ does not quantize all weights equally. Channels with large activation values (important weights) are protected at higher precision.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# AWQ quantization configuration
quant_config = {
    "zero_point": True,   # Zero-point quantization
    "q_group_size": 128,  # Group size
    "w_bit": 4,           # 4-bit
    "version": "GEMM",    # Matrix multiplication kernel
}

model = AutoAWQForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print("AWQ quantization complete!")

Quantization Benchmark (Llama-3.1-8B)

| Method | Memory | Throughput (tok/s) | Perplexity | Notes |
|---|---|---|---|---|
| FP16 | 16 GB | 100 (baseline) | 7.2 | Baseline |
| BF16 | 16 GB | 100 | 7.2 | Equivalent to FP16 |
| INT8 | 8 GB | 75 | 7.3 | Minor quality loss |
| GPTQ-4bit | 4.5 GB | 120 | 7.6 | Memory savings + speed |
| AWQ-4bit | 4.5 GB | 125 | 7.4 | Better quality than GPTQ |
| GGUF-Q4_K_M | 4.8 GB | 80 (CPU) | 7.5 | CPU inference capable |

5. Batching Strategies: Continuous Batching

Limitations of Static Batching

Traditional static batching waits for all requests in a batch to start simultaneously and complete together. This leads to severe GPU underutilization.

Static Batching (batch_size=3):

Time ──▶
[Request A: ████████████░░░░░░░░]  (12 tokens generated)
[Request B: ████░░░░░░░░░░░░░░░░]  (4 tokens generated)
[Request C: ████████░░░░░░░░░░░░]  (8 tokens generated)
             └─ B and C must wait for A to finish (GPU slots idle)

Continuous Batching (Iteration-level Scheduling)

Modern LLM serving systems like vLLM and TensorRT-LLM use continuous batching, dynamically reconstructing the batch at each inference step (iteration).

Continuous Batching:

Step 1: [A1][B1][C1]   ← 3 requests processed simultaneously
Step 2: [A2][B2][C2]
Step 3: [A3][B3][C3]   ← B completes; new request D admitted
Step 4: [A4][C4][D1]   ← empty slot filled immediately
Step 5: [A5][C5][D2]
Step 6: [A6][C6][D3]   ← C completes; new request E admitted
...

GPU utilization improves 2–5x compared to static batching.
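
The scheduling difference is easy to simulate. The toy scheduler below is a sketch of iteration-level scheduling, not vLLM's actual scheduler; its one key behavior is admitting a waiting request the moment a slot frees up:

```python
from collections import deque

def continuous_batching(requests, max_batch=3):
    """Toy iteration-level scheduler. requests: {id: tokens_to_generate}."""
    waiting = deque(requests.items())
    running = {}  # request id -> tokens still to generate
    steps = 0
    while waiting or running:
        # Fill empty slots immediately -- the key difference from static batching
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        steps += 1  # one decode iteration for every running sequence
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot reclaimed for the next iteration
    return steps

print(continuous_batching({"A": 12, "B": 4, "C": 8, "D": 3, "E": 5}))
```

With requests needing 12, 4, 8, 3, and 5 tokens and 3 slots, iteration-level scheduling finishes in 12 steps; static batching (two batches: max(12, 4, 8) + max(3, 5)) needs 17.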

Separating Prefill and Decode

LLM inference has two distinct phases:

  • Prefill: Processes the entire prompt at once. Compute-bound (behaves like batch processing)
  • Decode: Autoregressive token-by-token generation. Memory-bound

These phases have different GPU resource requirements. Disaggregated Prefill is an architecture that separates prefill-dedicated GPUs from decode-dedicated GPUs.
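
Because every decode step must stream the full weight set from HBM at least once, memory bandwidth puts a hard ceiling on single-request decode speed. A rough upper bound (ignoring KV cache reads and compute entirely):

```python
def decode_tok_per_s_bound(weight_bytes, hbm_bytes_per_s):
    # Upper bound: one full pass over the weights per generated token
    return hbm_bytes_per_s / weight_bytes

# Llama-3.1-70B FP16 (~140 GB of weights) on one H100 (3.35 TB/s)
print(f"{decode_tok_per_s_bound(140e9, 3.35e12):.1f} tok/s")
```

The same bound for 4-bit weights (~35 GB) is roughly 4x higher, which is a large part of why quantization speeds up decode despite the extra dequantization work.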


6. LLM Inference Framework Comparison

| Framework | Developer | Key Features | Best Use Case |
|---|---|---|---|
| vLLM | UC Berkeley | PagedAttention, OpenAI-compatible API | High-throughput serving |
| TensorRT-LLM | NVIDIA | Optimized CUDA kernels, FP8 support | Lowest latency |
| Ollama | Ollama Inc | Easy local execution | Development/testing |
| llama.cpp | ggml | CPU inference, GGUF format | Edge/local |
| SGLang | LMSYS | Structured generation, RadixAttention | Complex pipelines |

vLLM Tensor Parallel Inference

from vllm import LLM, SamplingParams

# Tensor parallel across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,        # Distribute across 4 GPUs
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    enforce_eager=False,           # Use CUDA graph optimization
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)

prompts = [
    "Explain the GPU memory hierarchy",
    "What are the benefits of PagedAttention?",
    "Compare quantization methods",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated: {generated[:100]}...")
    print()

7. Multi-GPU Inference: Tensor and Pipeline Parallelism

Tensor Parallelism

Tensor parallelism distributes individual matrix operations across multiple GPUs, splitting each Transformer layer horizontally.

Attention head distribution (4-way tensor parallel, 64 heads):

GPU 0: Heads 0–15
GPU 1: Heads 16–31
GPU 2: Heads 32–47
GPU 3: Heads 48–63

Each GPU computes its heads independently, then an AllReduce aggregates the results.
  • Pros: Reduced latency, enables large layers that don't fit on a single GPU
  • Cons: AllReduce communication required per layer — high-bandwidth NVLink is essential
  • Best for: Intra-node NVLink-connected GPUs, latency-sensitive applications
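
The math behind the split is easy to verify with NumPy. The sketch below shows a column-parallel linear layer: each "GPU" holds a slice of the weight's output columns and computes its partial result independently, after which the partials are concatenated (for row-parallel layers the partials would instead be summed — which is exactly what AllReduce does in a real deployment):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))  # activations: (batch, hidden)
W = rng.standard_normal((64, 128))

shards = np.split(W, 4, axis=1)        # 4-way split over output columns
partials = [x @ w for w in shards]     # each shard computed "on its own GPU"
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ W)  # identical to the unsharded computation
```

Because the result is mathematically identical, tensor parallelism changes only where the FLOPs happen — the cost is the per-layer communication to reassemble outputs.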

Pipeline Parallelism

Pipeline parallelism assigns groups of layers to different GPUs.

Llama-3.1-70B (80 layers), 4-way pipeline:

GPU 0: Layers 0-19
GPU 1: Layers 20-39
GPU 2: Layers 40-59
GPU 3: Layers 60-79

Sequential layer processing, activations forwarded between GPUs
  • Pros: Efficient even with low-bandwidth inter-node connections, minimal communication volume
  • Cons: Pipeline bubbles (downstream GPUs idle while upstream computes), increased latency
  • Best for: Multi-node distributed inference, very large models
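
The bubble cost can be quantified with the standard GPipe-style formula, (p − 1) / (m + p − 1) for p stages and m microbatches — borrowed here from the training literature as a rough intuition for pipelined inference:

```python
def bubble_fraction(num_stages, num_microbatches):
    # Fraction of time lost to pipeline bubbles: (p - 1) / (m + p - 1)
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(4, 1):.0%}")   # one batch through 4 stages: mostly idle
print(f"{bubble_fraction(4, 16):.0%}")  # 16 microbatches keep the pipeline busy
```

This is why pipeline parallelism only pays off with enough in-flight requests to keep every stage occupied.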

Memory Profiling

import torch

def profile_gpu_memory(func, *args, **kwargs):
    """Profile GPU memory usage"""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    before = torch.cuda.memory_allocated()
    result = func(*args, **kwargs)
    torch.cuda.synchronize()

    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()

    print(f"Memory increase: {(after - before) / 1e9:.3f} GB")
    print(f"Peak memory:     {peak / 1e9:.3f} GB")
    print()
    print(torch.cuda.memory_summary())
    return result

# Example memory stats output
def load_and_infer():
    from transformers import pipeline
    pipe = pipeline(
        "text-generation",
        model="microsoft/phi-2",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return pipe("GPU memory management is", max_new_tokens=50)

profile_gpu_memory(load_and_infer)

8. Practical Optimization Checklist

GPU Memory Optimization Strategies

  1. Apply quantization: INT4/INT8 saves 50–75% memory
  2. Optimize KV cache: Limit max_model_len, choose GQA models
  3. Flash Attention 2: Leverages SRAM to reduce memory from O(n²) to O(n)
  4. Model sharding: Use tensor or pipeline parallelism for multi-GPU
  5. Continuous batching: Maximize GPU utilization with dynamic request scheduling

Inference Speed Optimization

# Optimized vLLM server configuration example
vllm_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bfloat16",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.90,    # Use 90% of GPU memory
    "max_model_len": 8192,
    "max_num_batched_tokens": 8192,    # Max tokens per batch
    "max_num_seqs": 256,               # Max concurrent sequences
    "enable_chunked_prefill": True,    # Enable chunked prefill
    "block_size": 16,                  # KV cache block size (PagedAttention)
    "swap_space": 4,                   # CPU swap space in GB
    "enforce_eager": False,            # Use CUDA graphs
    "disable_log_stats": False,
}

Quiz: Check Your Understanding

Q1. Why do the prefill and decode phases of LLM inference have different compute characteristics?

Answer: Prefill is compute-bound; decode is memory-bound.

Explanation: During the prefill phase, all tokens in the prompt are processed in parallel — similar to batch processing. This yields high arithmetic intensity and keeps GPU compute units busy (compute-bound). During the decode phase, one token is generated per step by reading the entire KV cache from all previous tokens plus the full model weights. Every step requires loading the entire model weight matrix and the accumulated KV cache from memory, resulting in extremely low arithmetic intensity — making decode deeply memory-bound. With the H100's ridge point at ~295 FLOP/byte, the decode phase's AI of just 1–2 FLOP/byte means it runs far below peak compute.

Q2. Why does PagedAttention achieve higher memory efficiency than conventional KV cache management?

Answer: Non-contiguous physical block allocation and on-demand allocation eliminate fragmentation.

Explanation: Conventional systems pre-reserve contiguous memory equal to max sequence length per request, causing internal fragmentation (unused reserved space) and external fragmentation (unusable gaps left by completed requests). Studies show 60–80% of KV cache memory is wasted in practice. PagedAttention divides the KV cache into fixed-size blocks (e.g., 16 tokens), allocating physical blocks only when needed. Non-contiguous physical memory is abstracted through logical blocks, nearly eliminating fragmentation. Multiple requests sharing a common prompt prefix can also share physical KV blocks via Copy-on-Write, further reducing memory consumption.

Q3. How does AWQ preserve important weights better than GPTQ?

Answer: AWQ scales per-channel based on activation magnitude to protect salient weights.

Explanation: GPTQ minimizes quantization error using a second-order Hessian approximation but treats all weights roughly equally. AWQ (Activation-aware Weight Quantization) analyzes the activation distribution and observes that channels with large activation values (salient channels) contribute disproportionately to model performance. For these salient channels, AWQ multiplies the weights by a scale factor before quantization to inflate their values, then divides the corresponding activations by the same factor at inference to compensate. This protects the most important weights while maintaining hardware-friendly uniform quantization, resulting in lower perplexity than GPTQ at the same bit width.
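
The scale-then-compensate trick can be demonstrated in a few lines: multiplying a weight row by s and dividing the matching activation channel by s leaves the layer's output unchanged, so AWQ can inflate salient channels before quantization "for free". This is a toy identity check with made-up shapes and scales, not AWQ's search procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))  # activations: (batch, in_features)
W = rng.standard_normal((8, 4))  # weights: (in_features, out_features)

# Per-input-channel scales, >1 on hypothetical "salient" channels
s = np.array([1, 1, 4, 1, 1, 1, 2, 1], dtype=float)

y_ref = x @ W
y_scaled = (x / s) @ (W * s[:, None])  # scale weights up, compensate activations down

assert np.allclose(y_ref, y_scaled)  # output is mathematically unchanged
```

Quantization is then applied to the scaled weights (W · s), where the salient rows are larger and therefore lose less relative precision.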

Q4. How does continuous batching improve GPU utilization over static batching?

Answer: Iteration-level scheduling immediately reclaims completed sequence slots for new requests.

Explanation: Static batching stalls the GPU until every request in the batch finishes. Short requests leave their GPU slots idle while waiting for the longest sequence to complete. Continuous batching (iteration-level scheduling) reconstructs the batch at every inference step. When a sequence produces an EOS token or reaches max_tokens, its slot is immediately assigned to a waiting new request. The GPU therefore always operates at maximum batch capacity. In experiments, throughput improves 2–5x over static batching, and vLLM's paper reported up to 24x higher throughput compared to Hugging Face's static serving.

Q5. How do communication patterns differ between Tensor Parallelism and Pipeline Parallelism?

Answer: Tensor parallelism uses AllReduce every layer; pipeline parallelism uses point-to-point transfers at layer boundaries.

Explanation: Tensor Parallelism splits the weight matrices of each Transformer layer across GPUs. After each layer's computation, all GPUs must synchronize via an AllReduce collective to sum partial results. With 80 layers, that means 80 AllReduce operations, each adding communication latency — high-bandwidth NVLink interconnects are essential. Pipeline Parallelism assigns layer groups to different GPUs and only transfers activation tensors at group boundaries. This minimizes communication frequency but introduces pipeline bubbles (downstream GPUs idle while upstream GPUs process). For intra-node NVLink environments, tensor parallelism is preferred; for inter-node InfiniBand environments, pipeline parallelism scales better.


Conclusion

LLM inference optimization is the art of pushing hardware to its physical limits through software. Understanding the GPU memory hierarchy, managing KV cache efficiently, and combining the right quantization and batching strategies can dramatically improve performance on the same hardware.

Key Takeaways:

  • Memory savings: AWQ/GPTQ 4-bit quantization allows 70B models to run on a single A100 80G
  • Throughput gains: vLLM's PagedAttention + continuous batching delivers up to 24x throughput vs. static serving
  • Latency reduction: TensorRT-LLM with CUDA kernel fusion and FP8 inference
  • Scale-out: Tensor/pipeline parallelism breaks single-GPU limits across multi-GPU clusters