GPU 메모리 관리 & LLM 추론 최적화: vLLM, PagedAttention, GPTQ, TensorRT-LLM까지

시작하며

LLM(Large Language Model)을 실제 서비스에 배포할 때 가장 큰 도전은 GPU 메모리 관리와 추론 효율화입니다. GPT-4급 모델은 수백 GB의 메모리를 요구하고, 실시간 응답을 위해서는 초당 수십 토큰의 생성 속도가 필요합니다.

이 가이드는 LLM 추론 최적화의 모든 핵심 요소를 다룹니다. GPU 메모리 계층 구조의 이해부터 KV 캐시 최적화, GPTQ/AWQ 양자화, PagedAttention, continuous batching, 멀티-GPU 추론까지 실전 엔지니어가 반드시 알아야 할 내용을 단계별로 설명합니다.

1. GPU 메모리 계층 구조

HBM (High Bandwidth Memory)

현대 AI GPU의 핵심은 HBM입니다. HBM은 여러 개의 DRAM 다이를 수직으로 쌓아 만든 메모리로, 일반 GDDR6보다 훨씬 넓은 메모리 버스를 제공합니다.

GPU	메모리	HBM 타입	대역폭	버스 폭
A100 80G	80 GB	HBM2e	2.0 TB/s	5120-bit
H100 SXM	80 GB	HBM3	3.35 TB/s	5120-bit
H200 SXM	141 GB	HBM3e	4.8 TB/s	6144-bit
B200 SXM	192 GB	HBM3e	8.0 TB/s	8192-bit
MI300X	192 GB	HBM3	5.3 TB/s	8192-bit

L2 캐시와 SRAM

GPU 메모리 계층은 크게 세 단계로 구성됩니다:

HBM (전역 메모리): 수십~수백 GB, 대역폭 수 TB/s, 레이턴시 ~수백 ns
L2 캐시: 수십~수백 MB (H100: 50 MB), GPU 내 모든 SM이 공유
L1 캐시 / SRAM (공유 메모리): SM당 128~256 KB, 대역폭 수십 TB/s, 레이턴시 ~수 ns

각 SM(Streaming Multiprocessor) 내부의 SRAM은 레지스터 파일 다음으로 빠른 메모리입니다. Flash Attention 같은 최적화 알고리즘은 이 SRAM을 적극 활용하여 HBM 접근 횟수를 줄입니다.

Roofline Model: 성능 한계 분석

Roofline Model은 주어진 연산이 compute-bound인지 memory-bound인지 판단하는 분석 도구입니다.

Arithmetic Intensity (AI) = FLOP 수 / 메모리 접근량 (bytes)

성능 상한 = min(Peak FLOPS, Peak Memory BW × AI)

AI가 낮을 때 (memory-bound): 메모리 대역폭이 병목. LLM decode 단계가 대표적
AI가 높을 때 (compute-bound): 연산 속도가 병목. LLM prefill 단계, batch가 클 때

H100의 경우:

Peak FP16 FLOPS: 989 TFLOPS
Peak HBM 대역폭: 3.35 TB/s
Ridge point (균형점): 989 / 3.35 ≈ 295 FLOP/byte

토큰 하나를 생성할 때 70B 모델(FP16)은 AI ≈ 1~2 FLOP/byte로 극도로 memory-bound입니다.

2. LLM 메모리 계산

파라미터 메모리

LLM의 메모리 사용량을 정확히 계산하는 것은 배포 계획의 핵심입니다.

def calc_model_memory_gb(
    num_params: float,         # Number of parameters (e.g., 70e9)
    dtype_bytes: float = 2.0,  # FP32=4, FP16/BF16=2, INT8=1, INT4=0.5
) -> float:
    """Weight memory in decimal GB (1 GB = 1e9 bytes), matching the table below."""
    return (num_params * dtype_bytes) / 1e9

# 주요 모델 메모리 (FP16 기준)
models = {
    "Llama-3.1-8B":   {"params": 8e9,   "bytes": 2},
    "Llama-3.1-70B":  {"params": 70e9,  "bytes": 2},
    "Llama-3.1-405B": {"params": 405e9, "bytes": 2},
    "Mistral-7B":     {"params": 7e9,   "bytes": 2},
    "Qwen2-72B":      {"params": 72e9,  "bytes": 2},
}

for name, cfg in models.items():
    mem_gb = calc_model_memory_gb(cfg["params"], cfg["bytes"])
    print(f"{name}: {mem_gb:.1f} GB")

모델	파라미터	FP32	FP16/BF16	INT8	INT4
Llama-3.1-8B	8B	32 GB	16 GB	8 GB	4 GB
Llama-3.1-70B	70B	280 GB	140 GB	70 GB	35 GB
Llama-3.1-405B	405B	1620 GB	810 GB	405 GB	202 GB
Mistral-7B	7B	28 GB	14 GB	7 GB	3.5 GB

KV 캐시 메모리 계산

KV 캐시는 추론 시 가장 동적으로 변하는 메모리 사용량입니다. 시퀀스 길이와 배치 크기에 비례합니다.

def calc_kv_cache_memory_gb(
    num_layers: int,
    num_kv_heads: int,     # KV heads (equals attention heads unless the model uses GQA/MQA)
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """
    KV cache memory in decimal GB (1 GB = 1e9 bytes).
    Per layer: 2 (K, V) × num_kv_heads × head_dim × seq_len × batch_size elements
    """
    kv_per_layer = 2 * num_kv_heads * head_dim * seq_len * batch_size
    total_bytes = kv_per_layer * num_layers * dtype_bytes
    return total_bytes / 1e9

# Llama-3.1-70B example
# layers=80, attention heads=64, kv_heads=8 (GQA), head_dim=128
kv_mem = calc_kv_cache_memory_gb(
    num_layers=80,
    num_kv_heads=8,   # GQA: pass kv_heads, not attention heads
    head_dim=128,
    seq_len=4096,
    batch_size=1,
    dtype_bytes=2,
)
print(f"KV cache (seq=4096, bs=1): {kv_mem:.2f} GB")
# Output: KV cache (seq=4096, bs=1): 1.34 GB

# KV cache by batch size
for bs in [1, 4, 8, 16, 32]:
    mem = calc_kv_cache_memory_gb(80, 8, 128, 4096, bs, 2)
    print(f"  batch_size={bs:2d}: {mem:.2f} GB")

KV Cache Memory (Llama-3.1-70B, seq_len=4096, FP16)

Batch Size	KV Cache	Model Weights	Total
1	1.34 GB	140 GB	141.3 GB
4	5.37 GB	140 GB	145.4 GB
8	10.74 GB	140 GB	150.7 GB
16	21.47 GB	140 GB	161.5 GB
32	42.95 GB	140 GB	182.9 GB

활성화 메모리

추론 시 활성화 메모리는 배치 크기, 시퀀스 길이, 히든 사이즈의 곱에 비례합니다. 학습과 달리 추론에서는 그래디언트를 저장하지 않으므로 상대적으로 작습니다.

def calc_activation_memory_gb(
    hidden_size: int,
    seq_len: int,
    batch_size: int,
    num_layers: int,
    dtype_bytes: int = 2,
) -> float:
    """추론 시 활성화 메모리 근사 계산"""
    # 각 레이어: attention + FFN 활성화
    # 근사치: 2 × hidden_size × seq_len × batch_size per layer
    bytes_per_layer = 2 * hidden_size * seq_len * batch_size * dtype_bytes
    return (bytes_per_layer * num_layers) / (1024 ** 3)

3. KV 캐시 최적화: PagedAttention

기존 KV 캐시의 문제점

전통적인 LLM 서빙 시스템은 KV 캐시를 연속된 메모리 블록으로 할당합니다. 이는 심각한 문제를 야기합니다:

내부 단편화 (Internal Fragmentation): 최대 시퀀스 길이에 맞춰 미리 할당하면 실제 사용되지 않는 공간이 낭비됩니다
외부 단편화 (External Fragmentation): 요청이 끝날 때마다 크기가 다른 빈 공간들이 생겨 새 요청을 할당하기 어렵습니다
메모리 효율: 실제 시스템에서 KV 캐시의 60~80%가 낭비됩니다

PagedAttention: OS 페이징 원리의 적용

vLLM의 PagedAttention은 운영체제의 가상 메모리 페이징 개념을 KV 캐시에 적용합니다.

OS 가상 메모리 → PagedAttention
────────────────────────────────
가상 페이지     → 논리 블록 (logical block)
물리 프레임     → 물리 블록 (physical block)
페이지 테이블   → 블록 테이블 (block table)
페이지 폴트     → 블록 할당

핵심 아이디어:

KV 캐시를 고정 크기 블록(예: 16 토큰)으로 분할
시퀀스의 KV는 논리 블록으로 접근하고, 실제 물리 블록은 필요할 때 할당
서로 다른 시퀀스가 공통 프롬프트를 공유할 때 물리 블록을 Copy-on-Write로 공유

요청 A: [Block 0] → [Block 1] → [Block 2]
                                      ↕ 물리 블록 공유 (공통 프롬프트)
요청 B: [Block 0] → [Block 1] → [Block 3]

vLLM 서버 실행 예시

# vLLM 설치
pip install vllm

# 서버 시작 (단일 GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# OpenAI 호환 API 호출
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "GPU 메모리 최적화를 설명해줘"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'

# Python 클라이언트로 vLLM API 호출
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "당신은 GPU 최적화 전문가입니다."},
        {"role": "user", "content": "PagedAttention의 작동 원리를 설명해주세요."},
    ],
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

4. 양자화: GPTQ, AWQ, GGUF, bitsandbytes

양자화 기법 비교

기법	정밀도	메모리 절약	속도	품질 손실	특징
FP16/BF16	16-bit	기준	기준	없음	기본값
GPTQ	4-bit	~75%	빠름	낮음	PTQ, GPU 전용
AWQ	4-bit	~75%	빠름	매우 낮음	활성화 인식
GGUF	2~8-bit	가변	CPU 가능	가변	llama.cpp
bitsandbytes NF4	4-bit	~75%	보통	낮음	QLoRA 학습
bitsandbytes INT8	8-bit	~50%	보통	매우 낮음	LLM.int8()

bitsandbytes 4-bit 양자화 로딩

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 양자화 설정
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # 계산은 BF16으로
    bnb_4bit_quant_type="nf4",              # NormalFloat4 양자화
    bnb_4bit_use_double_quant=True,         # 이중 양자화로 추가 압축
)

model_id = "meta-llama/Llama-3.1-70B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",      # 자동 멀티-GPU 분산
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 메모리 사용량 확인
print(f"GPU 메모리: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

GPTQ 양자화 (auto-gptq)

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

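quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Group size (smaller = more accurate but more memory)
    desc_act=False,      # Act-order: quantize columns by decreasing activation size (slower, often more accurate)
    damp_percent=0.01,   # Hessian damping coefficient
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration data (representative text samples).
# model.quantize() expects dicts with "input_ids" and "attention_mask",
# so keep the full tokenizer output rather than only .input_ids.
calibration_data = [
    tokenizer("The GPU accelerates machine learning by...", return_tensors="pt"),
    tokenizer("Quantization reduces model size while...", return_tensors="pt"),
    # Use far more samples in practice (the GPTQ paper calibrated on 128 sequences of 2048 tokens)
]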

# 모델 로드 및 양자화
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")
print("GPTQ 양자화 완료!")

# 양자화된 모델 로드
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "llama-3.1-8b-gptq-4bit",
    use_safetensors=True,
    device="cuda:0",
)

AWQ: 활성화 인식 가중치 양자화

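AWQ does not treat all weights equally. It identifies the small fraction of weight channels that see large activations (salient channels) and, instead of keeping them in higher precision, scales them up before quantization and applies the inverse scale to the matching activations, so every weight is still stored as a uniform low-bit integer.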

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# AWQ 양자화 설정
quant_config = {
    "zero_point": True,   # 제로 포인트 양자화
    "q_group_size": 128,  # 그룹 크기
    "w_bit": 4,           # 4-bit
    "version": "GEMM",    # 행렬 곱셈 커널
}

model = AutoAWQForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print("AWQ 양자화 완료!")

양자화별 성능 벤치마크 (Llama-3.1-8B)

방식	메모리	처리량 (tok/s)	Perplexity	특이사항
FP16	16 GB	100 (기준)	7.2	기준값
BF16	16 GB	100	7.2	FP16과 동급
INT8	8 GB	75	7.3	약간의 품질 손실
GPTQ-4bit	4.5 GB	120	7.6	메모리 절약, 속도 향상
AWQ-4bit	4.5 GB	125	7.4	GPTQ보다 우수한 품질
GGUF-Q4_K_M	4.8 GB	80 (CPU)	7.5	CPU 추론 가능

5. 배치 전략: Continuous Batching

Static Batching의 한계

전통적인 static batching은 모든 요청이 동시에 시작하고 끝날 때까지 기다립니다. 이는 GPU 활용률이 낮아지는 심각한 비효율을 초래합니다.

Static Batching 예시 (batch_size=3):

시간 →
[요청 A: ████████████░░░░░░░░]  (토큰 12개 생성)
[요청 B: ████░░░░░░░░░░░░░░░░]  (토큰 4개 생성)
[요청 C: ████████░░░░░░░░░░░░]  (토큰 8개 생성)
                └─ B, C가 끝나도 A를 기다려야 함 (GPU 낭비)

Continuous Batching (Iteration-level Scheduling)

vLLM, TensorRT-LLM 등 현대 LLM 서빙 시스템은 continuous batching을 사용합니다. 각 추론 스텝(이터레이션)마다 배치를 동적으로 재구성합니다.

Continuous Batching:

Step 1: [A1][B1][C1]  ← 3개 동시 처리
Step 2: [A2][B2][C2]
Step 3: [A3][B3][C3]  ← B 완료, 새 요청 D 추가
Step 4: [A4][C4][D1]  ← 즉시 빈 슬롯 채움
Step 5: [A5][C5][D2]
Step 6: [A6][C6][D3]  ← C 완료, 새 요청 E 추가
...

GPU 활용률이 static batching 대비 2~5배 향상됩니다.

Prefill vs Decode 분리

LLM 추론은 두 단계로 나뉩니다:

Prefill: 프롬프트 전체를 한 번에 처리. compute-bound (배치처럼 동작)
Decode: 토큰 하나씩 자기회귀적 생성. memory-bound

이 두 단계는 서로 다른 GPU 특성을 필요로 합니다. Disaggregated Prefill은 prefill 전용 GPU와 decode 전용 GPU를 분리하는 아키텍처입니다.

6. LLM 추론 프레임워크 비교

프레임워크	개발사	특징	최적 용도
vLLM	UC Berkeley	PagedAttention, OpenAI 호환 API	고처리량 서빙
TensorRT-LLM	NVIDIA	최적화 CUDA 커널, FP8 지원	최저 레이턴시
Ollama	Ollama Inc	간편한 로컬 실행	개발/테스트
llama.cpp	ggml	CPU 추론, GGUF 형식	엣지/로컬
SGLang	LM-Sys	구조화 생성, RadixAttention	복잡한 파이프라인

vLLM 텐서 병렬 추론

from vllm import LLM, SamplingParams

# 텐서 병렬로 4 GPU에 분산
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,       # GPU 4개에 텐서 병렬 분산
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    enforce_eager=False,           # CUDA 그래프 최적화 사용
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)

prompts = [
    "GPU 메모리 계층을 설명해줘",
    "PagedAttention의 장점은?",
    "양자화 기법을 비교해줘",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"프롬프트: {prompt[:50]}...")
    print(f"생성: {generated[:100]}...")
    print()

7. 멀티-GPU 추론: Tensor/Pipeline Parallelism

Tensor Parallelism (텐서 병렬)

텐서 병렬은 개별 행렬 연산을 여러 GPU에 분산합니다. 각 Transformer 레이어 내부를 수평으로 분할합니다.

Attention 헤드 분산 (4-way tensor parallel):

GPU 0: Head 0~15
GPU 1: Head 16~31
GPU 2: Head 32~47
GPU 3: Head 48~63

각 GPU가 독립적으로 계산 후 AllReduce로 결과 합산

장점: 레이턴시 감소, 대형 레이어 처리 가능
단점: 레이어마다 AllReduce 통신 필요 → NVLink 고속 연결 필수
적합: 단일 노드 내 NVLink 연결된 GPU, 레이턴시 민감 응용

Pipeline Parallelism (파이프라인 병렬)

파이프라인 병렬은 레이어를 그룹으로 나눠 각 GPU에 할당합니다.

Llama-3.1-70B (80 레이어) → 4-way pipeline:

GPU 0: Layer 0~19
GPU 1: Layer 20~39
GPU 2: Layer 40~59
GPU 3: Layer 60~79

레이어 순서대로 처리, GPU 간 activation 전달

장점: 노드 간 저속 연결에서도 효율적, 통신량 적음
단점: 파이프라인 버블 발생, 레이턴시 증가
적합: 다중 노드 분산 추론, 초대형 모델

메모리 프로파일링

import torch

def profile_gpu_memory(func, *args, **kwargs):
    """GPU 메모리 사용량 프로파일링"""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    before = torch.cuda.memory_allocated()
    result = func(*args, **kwargs)
    torch.cuda.synchronize()

    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()

    print(f"메모리 증가: {(after - before) / 1e9:.3f} GB")
    print(f"피크 메모리: {peak / 1e9:.3f} GB")
    print()
    print(torch.cuda.memory_summary())
    return result

# 메모리 통계 출력 예시
def load_and_infer():
    from transformers import pipeline
    pipe = pipeline(
        "text-generation",
        model="microsoft/phi-2",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return pipe("GPU memory management is", max_new_tokens=50)

profile_gpu_memory(load_and_infer)

8. 실전 최적화 체크리스트

GPU 메모리 최적화 전략

양자화 적용: INT4/INT8 양자화로 메모리 50~75% 절약
KV 캐시 최적화: max_model_len 제한, GQA(Grouped Query Attention) 모델 선택
Flash Attention 2: SRAM 활용 최적화, 메모리 O(n²) → O(n) 감소
모델 샤딩: 텐서 병렬 또는 파이프라인 병렬로 멀티-GPU 활용
연속 배치: continuous batching으로 GPU 활용률 극대화

추론 속도 최적화

# 최적화된 vLLM 서버 설정 예시
vllm_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bfloat16",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.90,   # 90% GPU 메모리 사용
    "max_model_len": 8192,
    "max_num_batched_tokens": 8192,   # 배치당 최대 토큰 수
    "max_num_seqs": 256,              # 동시 처리 시퀀스 수
    "enable_chunked_prefill": True,   # Chunked prefill 활성화
    "block_size": 16,                 # KV 캐시 블록 크기 (PagedAttention)
    "swap_space": 4,                  # CPU swap 공간 (GB)
    "enforce_eager": False,           # CUDA 그래프 사용
    "disable_log_stats": False,
}

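Quiz: Check Your Understanding

Q1. Why do the prefill and decode phases of LLM inference have different compute characteristics?

Answer: Prefill is compute-bound; decode is memory-bound.

Explanation: The prefill phase processes every prompt token in parallel. That resembles batch processing, so arithmetic intensity is high and the GPU's compute units stay busy (compute-bound). The decode phase generates one token per step while reading the KV cache of all previous tokens. Each step must read the full model weights and the KV cache from memory, so arithmetic intensity is extremely low and the phase is memory-bound. The H100's ridge point is ~295 FLOP/byte, while decode runs at only 1–2 FLOP/byte.

Q2. Why is PagedAttention more memory-efficient than conventional KV cache management?

Answer: Non-contiguous physical blocks allocated on demand remove most fragmentation.

Explanation: Conventional systems reserve contiguous memory for the maximum sequence length of every request. Requests that finish early still hold the full reservation (internal fragmentation), and requests of different sizes leave unusable gaps as they complete (external fragmentation). PagedAttention, like OS paging, splits the KV cache into fixed-size blocks and allocates them only when needed. Because logical blocks abstract over non-contiguous physical memory, fragmentation is almost eliminated, and requests that share a common prompt can share its KV blocks with Copy-on-Write.

Q3. How does AWQ preserve important weights better than GPTQ?

Answer: It scales channels according to activation magnitude to protect the important weights.

Explanation: GPTQ minimizes quantization error with a second-order (Hessian) approximation but treats all weights similarly. AWQ (Activation-aware Weight Quantization) builds on the observation that channels associated with large activations (salient channels) matter more to overall quality. It multiplies the weights of those channels by a scale factor before quantization and multiplies the corresponding activations by the inverse at inference time. This protects important weights while keeping hardware-friendly uniform quantization, and typically yields lower perplexity than GPTQ.

Q4. How does continuous batching raise GPU utilization compared with static batching?

Answer: Iteration-level scheduling immediately reuses the slots of finished sequences.

Explanation: With static batching the GPU waits until every request in the batch has finished, so the slots of requests that ended early sit idle until the longest sequence completes. Continuous batching (iteration-level scheduling) rebuilds the batch at every inference step. As soon as a sequence emits an EOS token or reaches max_tokens, a waiting request takes its slot. The batch therefore stays close to full whenever requests are queued, and experiments report 2–5x higher throughput than static batching.

Q5. How do the communication patterns of Tensor Parallelism and Pipeline Parallelism differ?

Answer: Tensor parallelism needs an AllReduce in every layer; pipeline parallelism sends activations point-to-point at stage boundaries.

Explanation: Tensor Parallelism splits the weight matrices of each Transformer layer across GPUs, so the GPUs must combine partial results with AllReduce inside every layer. With 80 layers that means at least 80 AllReduces per forward pass (typically two per layer in Megatron-style layouts), and the communication latency accumulates, which is why a high-bandwidth interconnect such as NVLink is needed. Pipeline Parallelism only passes activations to the next GPU at the boundary between layer groups. It communicates far less often, but it suffers from pipeline bubbles (downstream GPUs wait while upstream GPUs compute). Tensor parallelism suits NVLink-connected GPUs within a node; pipeline parallelism suits InfiniBand links between nodes.

Conclusion

LLM inference optimization is about overcoming the physical limits of hardware with software. Understanding the GPU memory hierarchy, managing the KV cache efficiently, and combining suitable quantization and batching strategies yields far better performance on the same hardware.

Key takeaways:

Memory savings: AWQ/GPTQ 4-bit quantization lets a 70B model run on a single A100 80G
Throughput: vLLM's PagedAttention plus continuous batching, with up to 24x the throughput of a static-batching baseline (HuggingFace Transformers, per the vLLM announcement)
Latency: TensorRT-LLM with optimized CUDA kernels and FP8
Scale-out: Tensor and pipeline parallelism to move beyond a single GPU to multi-GPU clusters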

GPU Memory Management & LLM Inference Optimization: vLLM, PagedAttention, GPTQ, TensorRT-LLM

Introduction

Deploying Large Language Models in production presents two fundamental challenges: GPU memory management and inference efficiency. The largest open models need hundreds of gigabytes for weights alone (Llama-3.1-405B takes about 810 GB in FP16), and interactive use requires generating tens of tokens per second for each request.

This guide covers every critical aspect of LLM inference optimization. From understanding GPU memory hierarchy to KV cache optimization, GPTQ/AWQ quantization, PagedAttention, continuous batching, and multi-GPU inference — everything a production engineer needs to know, explained step by step.

1. GPU Memory Hierarchy

HBM (High Bandwidth Memory)

The cornerstone of modern AI GPUs is HBM. HBM stacks multiple DRAM dies vertically, providing a far wider memory bus than conventional GDDR6.

GPU	Memory	HBM Type	Bandwidth	Bus Width
A100 80G	80 GB	HBM2e	2.0 TB/s	5120-bit
H100 SXM	80 GB	HBM3	3.35 TB/s	5120-bit
H200 SXM	141 GB	HBM3e	4.8 TB/s	6144-bit
B200 SXM	192 GB	HBM3e	8.0 TB/s	8192-bit
MI300X	192 GB	HBM3	5.3 TB/s	8192-bit

L2 Cache and SRAM

The GPU memory hierarchy has three main levels:

HBM (global memory): Tens to hundreds of GB, bandwidth in TB/s, latency ~hundreds of ns
L2 cache: Tens to hundreds of MB (H100: 50 MB), shared across all SMs
L1 cache / SRAM (shared memory): 128–256 KB per SM, aggregate bandwidth in the tens of TB/s, latency of a few tens of cycles (roughly 10–20 ns)

The SRAM inside each SM (Streaming Multiprocessor) is the second-fastest memory after register files. Optimizations like Flash Attention leverage SRAM aggressively to reduce HBM accesses.

Roofline Model: Analyzing Performance Limits

The Roofline Model is an analytical tool for determining whether a given computation is compute-bound or memory-bound.

Arithmetic Intensity (AI) = Number of FLOPs / Memory accessed (bytes)

Performance ceiling = min(Peak FLOPS, Peak Memory BW × AI)

Low AI (memory-bound): Memory bandwidth is the bottleneck. The LLM decode phase is the classic example.
High AI (compute-bound): Arithmetic speed is the bottleneck. LLM prefill phase and large-batch workloads.

For an H100:

Peak FP16 FLOPS: 989 TFLOPS
Peak HBM bandwidth: 3.35 TB/s
Ridge point (balance point): 989 / 3.35 ≈ 295 FLOP/byte

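At batch size 1, generating one token with a 70B FP16 model reads each 2-byte weight once and uses it for a single multiply-add (2 FLOPs). That gives AI ≈ 1–2 FLOP/byte, more than two orders of magnitude below the ridge point, so decode is extremely memory-bound.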
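The roofline arithmetic above can be sketched in a few lines. This is a rough model, not a profiler: `decode_ai` is a hypothetical helper that counts only weight traffic (KV cache reads are ignored), and the H100 peak numbers are the ones quoted above.

```python
def attainable_tflops(ai_flop_per_byte: float,
                      peak_tflops: float = 989.0,
                      peak_bw_tb_s: float = 3.35) -> float:
    """Roofline ceiling: min(peak compute, bandwidth x arithmetic intensity)."""
    # TB/s x FLOP/byte = TFLOP/s, so the units line up directly.
    return min(peak_tflops, peak_bw_tb_s * ai_flop_per_byte)


def decode_ai(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Rough arithmetic intensity of one decode step, weights only.

    Each parameter is read once per step and used for one multiply-add
    (2 FLOPs) per sequence in the batch. KV cache traffic is ignored.
    """
    return 2.0 * batch_size / bytes_per_param


ridge = 989.0 / 3.35  # FLOP/byte where the two ceilings meet
for bs in (1, 8, 64, 512):
    ai = decode_ai(bs)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"batch={bs:3d}  AI={ai:6.1f}  ceiling={attainable_tflops(ai):7.1f} TFLOPS  {bound}")
```

Under these assumptions, larger decode batches raise arithmetic intensity roughly linearly, which is why batching matters so much for serving throughput.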

2. LLM Memory Calculations

Parameter Memory

Accurately calculating LLM memory requirements is the foundation of deployment planning.

def calc_model_memory_gb(
    num_params: float,         # Number of parameters (e.g., 70e9)
    dtype_bytes: float = 2.0,  # FP32=4, FP16/BF16=2, INT8=1, INT4=0.5
) -> float:
    """Weight memory in decimal GB (1 GB = 1e9 bytes), matching the table below."""
    return (num_params * dtype_bytes) / 1e9

# Major model memory (FP16 baseline)
models = {
    "Llama-3.1-8B":   {"params": 8e9,   "bytes": 2},
    "Llama-3.1-70B":  {"params": 70e9,  "bytes": 2},
    "Llama-3.1-405B": {"params": 405e9, "bytes": 2},
    "Mistral-7B":     {"params": 7e9,   "bytes": 2},
    "Qwen2-72B":      {"params": 72e9,  "bytes": 2},
}

for name, cfg in models.items():
    mem_gb = calc_model_memory_gb(cfg["params"], cfg["bytes"])
    print(f"{name}: {mem_gb:.1f} GB")

Model	Parameters	FP32	FP16/BF16	INT8	INT4
Llama-3.1-8B	8B	32 GB	16 GB	8 GB	4 GB
Llama-3.1-70B	70B	280 GB	140 GB	70 GB	35 GB
Llama-3.1-405B	405B	1620 GB	810 GB	405 GB	202 GB
Mistral-7B	7B	28 GB	14 GB	7 GB	3.5 GB

KV Cache Memory Calculation

KV cache is the most dynamically changing component of inference memory. It scales proportionally with sequence length and batch size.

def calc_kv_cache_memory_gb(
    num_layers: int,
    num_kv_heads: int,     # KV heads (equals attention heads unless the model uses GQA/MQA)
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """
    KV cache memory in decimal GB (1 GB = 1e9 bytes).
    Per layer: 2 (K, V) × num_kv_heads × head_dim × seq_len × batch_size elements
    """
    kv_per_layer = 2 * num_kv_heads * head_dim * seq_len * batch_size
    total_bytes = kv_per_layer * num_layers * dtype_bytes
    return total_bytes / 1e9

# Llama-3.1-70B example
# layers=80, attention heads=64, kv_heads=8 (GQA), head_dim=128
kv_mem = calc_kv_cache_memory_gb(
    num_layers=80,
    num_kv_heads=8,   # GQA: pass kv_heads, not attention heads
    head_dim=128,
    seq_len=4096,
    batch_size=1,
    dtype_bytes=2,
)
print(f"KV cache (seq=4096, bs=1): {kv_mem:.2f} GB")
# Output: KV cache (seq=4096, bs=1): 1.34 GB

# KV cache by batch size
for bs in [1, 4, 8, 16, 32]:
    mem = calc_kv_cache_memory_gb(80, 8, 128, 4096, bs, 2)
    print(f"  batch_size={bs:2d}: {mem:.2f} GB")

KV Cache Memory (Llama-3.1-70B, seq_len=4096, FP16)

Batch Size	KV Cache	Model Weights	Total
1	1.34 GB	140 GB	141.3 GB
4	5.37 GB	140 GB	145.4 GB
8	10.74 GB	140 GB	150.7 GB
16	21.47 GB	140 GB	161.5 GB
32	42.95 GB	140 GB	182.9 GB
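A common follow-up question is how many tokens of KV cache fit once the weights are loaded. The sketch below is a back-of-the-envelope estimate only: the 4 × 80 GB node is a hypothetical deployment, the 0.90 cap mirrors a typical memory-utilization setting, and activation and framework overhead are ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of K and V stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(total_gpu_gb: float, weight_gb: float,
                      per_token_bytes: int, utilization: float = 0.90) -> int:
    """Tokens of KV cache that fit after weights, within a utilization cap."""
    budget_bytes = (total_gpu_gb * utilization - weight_gb) * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


# Llama-3.1-70B in FP16 on a hypothetical 4 x 80 GB node (decimal GB).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
tokens = max_cached_tokens(total_gpu_gb=4 * 80, weight_gb=140, per_token_bytes=per_token)
print(f"KV bytes per token: {per_token}")
print(f"Cacheable tokens:   {tokens}")
print(f"Concurrent 4096-token sequences: {tokens // 4096}")
```

Thinking in bytes per token rather than per sequence makes it easy to trade context length against concurrency for a fixed memory budget.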

Activation Memory

During inference, activation memory scales with batch size × sequence length × hidden size. Unlike training, inference does not keep every layer's activations for a backward pass, so buffers can be reused layer by layer and activation memory stays small next to the weights and KV cache.

def calc_activation_memory_gb(
    hidden_size: int,
    seq_len: int,
    batch_size: int,
    num_layers: int,
    dtype_bytes: int = 2,
) -> float:
    """Conservative upper bound on activation memory during inference"""
    # Per layer: attention + FFN activations
    # Approximation: 2 × hidden_size × seq_len × batch_size per layer
    # Summing over num_layers assumes nothing is freed between layers;
    # engines that reuse buffers need roughly one layer's worth.
    bytes_per_layer = 2 * hidden_size * seq_len * batch_size * dtype_bytes
    return (bytes_per_layer * num_layers) / (1024 ** 3)

3. KV Cache Optimization: PagedAttention

Problems with Conventional KV Cache

Traditional LLM serving systems allocate KV cache as contiguous memory blocks. This causes serious issues:

Internal Fragmentation: Pre-allocating to max sequence length wastes unused space
External Fragmentation: Varying request sizes create unusable gaps when requests complete
Memory Efficiency: The vLLM paper reports that existing systems used only about 20–38% of their KV cache memory for actual token states, so roughly 60–80% was wasted

PagedAttention: Applying OS Paging Principles

vLLM's PagedAttention applies virtual memory paging concepts to KV cache management.

OS Virtual Memory    → PagedAttention
─────────────────────────────────────
Virtual page         → Logical block
Physical frame       → Physical block
Page table           → Block table
Page fault           → Block allocation

Core ideas:

Split KV cache into fixed-size blocks (e.g., 16 tokens each)
Access sequence KV via logical blocks; allocate physical blocks on demand
Share physical blocks with Copy-on-Write when requests share a common prefix

Request A: [Block 0] → [Block 1] → [Block 2]
               ↕           ↕        (private)
Request B: [Block 0] → [Block 1] → [Block 3]

Blocks 0 and 1 hold the common prompt and map to the same physical blocks; Blocks 2 and 3 are private to each request.
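The block-table idea can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, only block ids are tracked (no KV tensors), and the 16-token block size follows the example above.

```python
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per block


class BlockAllocator:
    """Toy allocator: hands out physical block ids and tracks reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.ref_count: Dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


class Sequence:
    """Holds a block table: logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.block_table: List[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # All current blocks are full (or none exist): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: the last block is shared, so take a private
                # block before writing. (A real engine would also copy the KV data.)
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


allocator = BlockAllocator(num_blocks=8)
seq_a = Sequence(allocator)
for _ in range(40):          # 40-token prompt -> 3 blocks, the last one partly filled
    seq_a.append_token()
seq_b = seq_a.fork()         # shares all 3 physical blocks
seq_b.append_token()         # triggers copy-on-write on the partial block
print("A:", seq_a.block_table)
print("B:", seq_b.block_table)
```

After the fork, both sequences still point at the same two full prompt blocks and only the partially filled block is duplicated, which mirrors the Request A / Request B picture.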

vLLM Server Launch Example

# Install vLLM
pip install vllm

# Start server (single GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# OpenAI-compatible API call
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain GPU memory optimization"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'

# Python client calling vLLM API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a GPU optimization expert."},
        {"role": "user", "content": "Explain how PagedAttention works."},
    ],
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

4. Quantization: GPTQ, AWQ, GGUF, bitsandbytes

Quantization Method Comparison

Method	Precision	Memory Savings	Speed	Quality Loss	Notes
FP16/BF16	16-bit	Baseline	Baseline	None	Default
GPTQ	4-bit	~75%	Fast	Low	PTQ, GPU only
AWQ	4-bit	~75%	Fast	Very low	Activation-aware
GGUF	2–8-bit	Variable	CPU capable	Variable	llama.cpp
bitsandbytes NF4	4-bit	~75%	Moderate	Low	QLoRA training
bitsandbytes INT8	8-bit	~50%	Moderate	Very low	LLM.int8()

bitsandbytes 4-bit Quantization Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # Double quantization for extra compression
)

model_id = "meta-llama/Llama-3.1-70B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",      # Automatic multi-GPU distribution
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

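# Check memory usage (summed over all visible GPUs, since device_map="auto" may shard the model)
total_bytes = sum(torch.cuda.memory_allocated(i) for i in range(torch.cuda.device_count()))
print(f"GPU memory: {total_bytes / 1e9:.2f} GB")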

GPTQ Quantization (auto-gptq)

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Group size (smaller = more accurate but more memory)
    desc_act=False,      # Act-order: quantize columns by decreasing activation size (slower, often more accurate)
    damp_percent=0.01,   # Hessian damping coefficient
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration data (representative text samples).
# model.quantize() expects dicts with "input_ids" and "attention_mask",
# so keep the full tokenizer output rather than only .input_ids.
calibration_data = [
    tokenizer("The GPU accelerates machine learning by...", return_tensors="pt"),
    tokenizer("Quantization reduces model size while...", return_tensors="pt"),
    # Use far more samples in practice (the GPTQ paper calibrated on 128 sequences of 2048 tokens)
]

# Load model and quantize
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")
print("GPTQ quantization complete!")

# Load quantized model
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "llama-3.1-8b-gptq-4bit",
    use_safetensors=True,
    device="cuda:0",
)

AWQ: Activation-Aware Weight Quantization

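AWQ does not treat all weights equally. It identifies the small fraction of weight channels that see large activations (salient channels) and, instead of keeping them in higher precision, scales them up before quantization and applies the inverse scale to the matching activations, so every weight is still stored as a uniform low-bit integer.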

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# AWQ quantization configuration
quant_config = {
    "zero_point": True,   # Zero-point quantization
    "q_group_size": 128,  # Group size
    "w_bit": 4,           # 4-bit
    "version": "GEMM",    # Matrix multiplication kernel
}

model = AutoAWQForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print("AWQ quantization complete!")
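To make the `w_bit`, `q_group_size`, and `zero_point` settings concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization. It is deliberately not AWQ or GPTQ: both add their own steps on top (activation-aware scaling and Hessian-based error compensation, respectively). The helper names are hypothetical, and the zero point is left unclamped for simplicity.

```python
import math
from typing import List, Tuple


def quantize_group(weights: List[float], w_bit: int = 4) -> Tuple[List[int], float, int]:
    """Asymmetric (zero-point) quantization of one group to w_bit integers."""
    q_max = (1 << w_bit) - 1                             # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = max((w_max - w_min) / q_max, 1e-8)           # guard against constant groups
    zero_point = round(-w_min / scale)                   # real kernels store this in w_bit bits
    q = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q: List[int], scale: float, zero_point: int) -> List[float]:
    return [(v - zero_point) * scale for v in q]


def quantize_row(row: List[float], group_size: int = 128, w_bit: int = 4) -> List[float]:
    """Quantize then dequantize a weight row group by group; returns the reconstruction."""
    out: List[float] = []
    for start in range(0, len(row), group_size):
        q, scale, zp = quantize_group(row[start:start + group_size], w_bit)
        out.extend(dequantize_group(q, scale, zp))
    return out


def mean_abs_error(row: List[float], group_size: int) -> float:
    recon = quantize_row(row, group_size=group_size)
    return sum(abs(a - b) for a, b in zip(row, recon)) / len(row)


row = [0.01 * math.sin(i) for i in range(512)]
row[5] = 1.0   # a single outlier stretches the range of whichever group contains it
for g in (512, 128, 32):
    print(f"group_size={g:3d}  mean abs error={mean_abs_error(row, g):.5f}")
```

Smaller groups confine the damage from an outlier to fewer weights, at the cost of storing more scales and zero points, which is the trade-off behind the group-size setting.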

Quantization Benchmark (Llama-3.1-8B)

Method	Memory	Throughput (relative, FP16 = 100)	Perplexity	Notes
FP16	16 GB	100 (baseline)	7.2	Baseline
BF16	16 GB	100	7.2	Equivalent to FP16
INT8	8 GB	75	7.3	Minor quality loss
GPTQ-4bit	4.5 GB	120	7.6	Memory savings + speed
AWQ-4bit	4.5 GB	125	7.4	Better quality than GPTQ
GGUF-Q4_K_M	4.8 GB	80 (CPU)	7.5	CPU inference capable

5. Batching Strategies: Continuous Batching

Limitations of Static Batching

Traditional static batching waits for all requests in a batch to start simultaneously and complete together. This leads to severe GPU underutilization.

Static Batching (batch_size=3):

Time →
[Request A: ████████████░░░░░░░░]  (12 tokens generated)
[Request B: ████░░░░░░░░░░░░░░░░]  (4 tokens generated)
[Request C: ████████░░░░░░░░░░░░]  (8 tokens generated)
                └─ B and C must wait for A to finish (GPU wasted)

Continuous Batching (Iteration-level Scheduling)

Modern LLM serving systems like vLLM and TensorRT-LLM use continuous batching, dynamically reconstructing the batch at each inference step (iteration).

Continuous Batching:

Step 1: [A1][B1][C1]  ← 3 processed simultaneously
Step 2: [A2][B2][C2]
Step 3: [A3][B3][C3]  ← B completes, new request D added
Step 4: [A4][C4][D1]  ← Empty slot immediately filled
Step 5: [A5][C5][D2]
Step 6: [A6][C6][D3]  ← C completes, new request E added
...

Because freed slots are refilled immediately, throughput typically improves 2–5x over static batching, and the gain grows with the spread in output lengths.
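The difference between the two schedulers can be shown with a small step-counting simulation. This is a simplified sketch: it models decode steps only (no prefill cost), every request is assumed to be queued at time zero, and the three extra request lengths are made up for illustration.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Freed slots are refilled from the queue at every decode step."""
    waiting = deque(lengths)
    running: List[int] = []       # remaining tokens per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Requests with one token left finish on this step and free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths in tokens: requests A, B, C (12, 4, 8) plus three hypothetical ones.
lengths = [12, 4, 8, 6, 10, 3]
static_steps = static_batching_steps(lengths, batch_size=3)
continuous_steps = continuous_batching_steps(lengths, batch_size=3)
total_tokens = sum(lengths)
print(f"static:     {static_steps} steps, slot utilization {total_tokens / (static_steps * 3):.0%}")
print(f"continuous: {continuous_steps} steps, slot utilization {total_tokens / (continuous_steps * 3):.0%}")
```

Even in this tiny example the continuous scheduler finishes the same work in fewer steps; with longer queues and more varied lengths the gap widens.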

Separating Prefill and Decode

LLM inference has two distinct phases:

Prefill: Processes the entire prompt at once. Compute-bound (behaves like batch processing)
Decode: Autoregressive token-by-token generation. Memory-bound

These phases have different GPU resource requirements. Disaggregated Prefill is an architecture that separates prefill-dedicated GPUs from decode-dedicated GPUs.

6. LLM Inference Framework Comparison

Framework	Developer	Key Features	Best Use Case
vLLM	UC Berkeley	PagedAttention, OpenAI-compatible API	High-throughput serving
TensorRT-LLM	NVIDIA	Optimized CUDA kernels, FP8 support	Lowest latency
Ollama	Ollama Inc	Easy local execution	Development/testing
llama.cpp	ggml-org (Georgi Gerganov)	CPU inference, GGUF format	Edge/local
SGLang	LMSYS	Structured generation, RadixAttention	Complex pipelines

vLLM Tensor Parallel Inference

from vllm import LLM, SamplingParams

# Tensor parallel across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,        # Distribute across 4 GPUs
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    enforce_eager=False,           # Use CUDA graph optimization
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)

prompts = [
    "Explain the GPU memory hierarchy",
    "What are the benefits of PagedAttention?",
    "Compare quantization methods",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated: {generated[:100]}...")
    print()

7. Multi-GPU Inference: Tensor and Pipeline Parallelism

Tensor Parallelism

Tensor parallelism distributes individual matrix operations across multiple GPUs, splitting each Transformer layer horizontally.

Attention head distribution (4-way tensor parallel):

GPU 0: Head 0-15
GPU 1: Head 16-31
GPU 2: Head 32-47
GPU 3: Head 48-63

Each GPU computes its shard independently; an AllReduce then sums the partial results

Pros: Lower per-token latency, and layers too large for one GPU become feasible
Cons: An AllReduce follows every attention and MLP block (two per layer in the common Megatron-style layout), so a high-bandwidth interconnect such as NVLink is effectively required
Best for: NVLink-connected GPUs within one node, latency-sensitive applications
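The split-then-AllReduce pattern can be checked on a tiny example. The sketch below assumes a Megatron-style layout (first matrix split by columns, second by rows), omits the activation function, and uses plain Python lists with made-up weights; `all_reduce_sum` is only a stand-in for a real collective.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m: Matrix, parts: int) -> List[Matrix]:
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m: Matrix, parts: int) -> List[Matrix]:
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials: List[Matrix]) -> Matrix:
    """Stand-in for an AllReduce: element-wise sum of every rank's partial result."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, 2.0], [3.0, 4.0]]                               # (tokens=2, hidden=2)
w1 = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]          # (hidden=2, ffn=4)
w2 = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [2.0, 1.0]]      # (ffn=4, hidden=2)

reference = matmul(matmul(x, w1), w2)                      # single-device result

tp = 2
partials = [
    matmul(matmul(x, w1_shard), w2_shard)                  # runs independently on each "GPU"
    for w1_shard, w2_shard in zip(split_columns(w1, tp), split_rows(w2, tp))
]
result = all_reduce_sum(partials)
print("reference:", reference)
print("tensor-parallel:", result)
```

Because the column split of the first matrix lines up with the row split of the second, each shard needs no communication until the final sum, which is why one AllReduce per block is enough.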

Pipeline Parallelism

Pipeline parallelism assigns groups of layers to different GPUs.

Llama-3.1-70B (80 layers) → 4-way pipeline:

GPU 0: Layers 0-19
GPU 1: Layers 20-39
GPU 2: Layers 40-59
GPU 3: Layers 60-79

Sequential layer processing, activations forwarded between GPUs

Pros: Efficient even with low-bandwidth inter-node connections, minimal communication volume
Cons: Pipeline bubbles (downstream GPUs idle while upstream computes), increased latency
Best for: Multi-node distributed inference, very large models
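The layer assignment and the cost of pipeline bubbles can be sketched as follows. Both helpers are hypothetical: `partition_layers` assumes contiguous, near-equal stages, and `bubble_fraction` uses the standard estimate for a simple fill-and-drain schedule with equal-cost stages.

```python
from typing import List, Tuple


def partition_layers(num_layers: int, num_stages: int) -> List[Tuple[int, int]]:
    """Split layers into contiguous stages; returns inclusive (first, last) per stage."""
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)   # spread any remainder over early stages
        stages.append((start, start + size - 1))
        start += size
    return stages


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a fill-and-drain schedule: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for gpu, (first, last) in enumerate(partition_layers(80, 4)):
    print(f"GPU {gpu}: Layers {first}-{last}")

for m in (1, 4, 16):
    print(f"micro-batches={m:2d}  bubble={bubble_fraction(4, m):.0%}")
```

With a single micro-batch most GPUs sit idle most of the time; feeding more micro-batches through the pipeline shrinks the bubble.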

Memory Profiling

import torch

def profile_gpu_memory(func, *args, **kwargs):
    """Profile GPU memory usage"""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    before = torch.cuda.memory_allocated()
    result = func(*args, **kwargs)
    torch.cuda.synchronize()

    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()

    print(f"Memory increase: {(after - before) / 1e9:.3f} GB")
    print(f"Peak memory:     {peak / 1e9:.3f} GB")
    print()
    print(torch.cuda.memory_summary())
    return result

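# Example usage: load a small model and run one generation.
# The pipeline is released when this function returns, so the peak value is the number to watch.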
def load_and_infer():
    from transformers import pipeline
    pipe = pipeline(
        "text-generation",
        model="microsoft/phi-2",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return pipe("GPU memory management is", max_new_tokens=50)

profile_gpu_memory(load_and_infer)

8. Practical Optimization Checklist

GPU Memory Optimization Strategies

Apply quantization: Relative to FP16, INT8 cuts weight memory by about 50% and INT4 by about 75%
Optimize KV cache: Limit max_model_len, choose GQA models
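Flash Attention 2: Tiles the computation in SRAM so the full n×n attention matrix is never written to HBM, reducing attention memory from O(n²) to O(n) in sequence length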
Model sharding: Use tensor or pipeline parallelism for multi-GPU
Continuous batching: Maximize GPU utilization with dynamic request scheduling
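To see why limiting max_model_len and choosing a GQA model matter, the KV cache footprint of a single sequence can be worked out directly. This is a minimal sketch: `kv_cache_bytes` is a hypothetical helper, the shape values are those of Llama-3.1-8B (32 layers, 8 KV heads, head dimension 128), and the 32-KV-head variant is a hypothetical multi-head baseline for comparison.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size for one sequence.
    The factor 2 accounts for one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


gqa_8k = kv_cache_bytes(32, 8, 128, seq_len=8192)    # GQA, 8 KV heads
mha_8k = kv_cache_bytes(32, 32, 128, seq_len=8192)   # hypothetical MHA baseline
gqa_2k = kv_cache_bytes(32, 8, 128, seq_len=2048)    # shorter context limit

print(f"GQA, 8192 tokens: {gqa_8k / GIB:.2f} GiB")   # 1.00 GiB
print(f"MHA, 8192 tokens: {mha_8k / GIB:.2f} GiB")   # 4.00 GiB
print(f"GQA, 2048 tokens: {gqa_2k / GIB:.2f} GiB")   # 0.25 GiB
```

The footprint scales linearly with both the number of KV heads and the sequence length, so either lever frees memory that can hold more concurrent sequences.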

Inference Speed Optimization

# Optimized vLLM server configuration example
vllm_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bfloat16",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.90,    # Use 90% of GPU memory
    "max_model_len": 8192,
    "max_num_batched_tokens": 8192,    # Max tokens per batch
    "max_num_seqs": 256,               # Max concurrent sequences
    "enable_chunked_prefill": True,    # Enable chunked prefill
    "block_size": 16,                  # KV cache block size (PagedAttention)
    "swap_space": 4,                   # CPU swap space in GB
    "enforce_eager": False,            # Use CUDA graphs
    "disable_log_stats": False,
}
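A dictionary like the one above can be turned into a server command line. The helper below is a sketch, not part of vLLM: `to_cli_args` is a hypothetical name, and it assumes that each setting maps to a `--dashed-name` flag of `vllm serve` and that boolean options are plain switches. Flag availability varies between vLLM versions, so check `vllm serve --help` before relying on the output.

```python
def to_cli_args(config):
    """Render a settings dict as an argument list for `vllm serve`.
    Assumes underscores map to dashes and booleans are on/off switches."""
    args = ["vllm", "serve", str(config["model"])]
    for key, value in config.items():
        if key == "model":
            continue  # the model is passed positionally
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:  # False booleans are left out, keeping the default
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


example_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bfloat16",
    "max_model_len": 8192,
    "enable_chunked_prefill": True,
    "enforce_eager": False,
}
cli = to_cli_args(example_config)
print(" ".join(cli))
```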

Quiz: Check Your Understanding

Q1. Why do the prefill and decode phases of LLM inference have different compute characteristics?

Answer: Prefill is compute-bound; decode is memory-bound.

Explanation: During the prefill phase, all prompt tokens are processed in parallel, much like a large batch. Each weight loaded from HBM is reused across many tokens, so arithmetic intensity is high and the GPU's compute units stay busy (compute-bound). During the decode phase, each step generates only one token per sequence, yet it still has to read the full model weights and the accumulated KV cache of all previous tokens from memory. That is very little computation per byte loaded, so arithmetic intensity is extremely low and decode is deeply memory-bound. With the H100's ridge point at ~295 FLOP/byte, a decode-phase AI of just 1–2 FLOP/byte means the GPU runs far below its peak compute.
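The roofline argument can be checked with quick arithmetic. The sketch below uses the H100 figures quoted earlier in this guide (989 TFLOPS, 3.35 TB/s). Its simplifying assumptions are that only the weight reads count (KV cache traffic is ignored) and that each weight contributes 2 FLOPs (one multiply, one add) per sequence in the batch. The function names are illustrative.

```python
PEAK_FLOPS = 989e12   # H100 peak FP16 FLOP/s
PEAK_BW = 3.35e12     # H100 HBM bandwidth in bytes/s


def decode_arithmetic_intensity(batch_size, dtype_bytes=2):
    """FLOPs per byte of weights read during one decode step.
    Each weight is read once and used for 2 FLOPs per sequence."""
    return 2 * batch_size / dtype_bytes


def attainable_flops(ai):
    """Roofline bound: min(peak compute, bandwidth x intensity)."""
    return min(PEAK_FLOPS, PEAK_BW * ai)


ridge = PEAK_FLOPS / PEAK_BW
print(f"ridge point: {ridge:.0f} FLOP/byte")
for batch in (1, 32, 512):
    ai = decode_arithmetic_intensity(batch)
    share = attainable_flops(ai) / PEAK_FLOPS
    print(f"batch={batch:4d}  AI={ai:6.1f}  attainable={share:.1%} of peak")
```

Under these assumptions, a batch of one reaches well under 1% of peak compute. Only a batch of roughly the ridge-point size moves decode onto the compute-bound side of the roofline, which is why batching matters so much for decode throughput.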

Q2. Why does PagedAttention achieve higher memory efficiency than conventional KV cache management?

Answer: Non-contiguous physical block allocation and on-demand allocation eliminate fragmentation.

Explanation: Conventional systems pre-reserve contiguous memory sized for the maximum sequence length of each request, causing internal fragmentation (reserved but unused space) and external fragmentation (unusable gaps left by completed requests). The vLLM paper reports that existing systems waste roughly 60–80% of KV cache memory this way. PagedAttention divides the KV cache into fixed-size blocks (e.g., 16 tokens) and allocates physical blocks only when they are needed. A per-request block table maps logical blocks to non-contiguous physical blocks, much like virtual memory paging, so external fragmentation disappears and internal waste is limited to the last, partially filled block. Multiple requests sharing a common prompt prefix can also share physical KV blocks via Copy-on-Write, further reducing memory consumption.
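On-demand block allocation can be illustrated with a toy allocator. This is a simplified sketch and not vLLM's actual implementation: `BlockAllocator` and its methods are hypothetical, it tracks block ids only (no tensors), and prefix sharing with Copy-on-Write is left out.

```python
class BlockAllocator:
    """Toy paged KV cache: physical blocks are handed out one at a time."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        # Allocate a new block only when the last one is full (or none exists).
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def wasted_slots(self, request_id):
        reserved = len(self.block_tables[request_id]) * self.block_size
        return reserved - self.num_tokens[request_id]

    def free(self, request_id):
        # Return every block of a finished request to the free pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = BlockAllocator(num_blocks=8)
for _ in range(20):
    allocator.append_token("req-a")
blocks_used = len(allocator.block_tables["req-a"])
paged_waste = allocator.wasted_slots("req-a")
prealloc_waste = 2048 - 20  # contiguous reservation for a 2048-token limit
print(blocks_used, paged_waste, prealloc_waste)
allocator.free("req-a")
```

A 20-token sequence occupies two blocks and leaves 12 slots unused, compared with 2,028 unused slots when a contiguous 2,048-token region is reserved up front.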

Q3. How does AWQ preserve important weights better than GPTQ?

Answer: AWQ scales per-channel based on activation magnitude to protect salient weights.

Explanation: GPTQ minimizes layer-wise quantization error using approximate second-order (Hessian) information and compensates by adjusting the remaining weights, but it does not explicitly protect specific channels. AWQ (Activation-aware Weight Quantization) analyzes the activation distribution and observes that channels with large activation values (salient channels) contribute disproportionately to model performance. For these salient channels, AWQ multiplies the weights by a scale factor before quantization, which shrinks their relative rounding error, and divides the corresponding activations by the same factor at inference so the output is mathematically unchanged. This protects the most important weights while keeping hardware-friendly uniform quantization, and it often yields lower perplexity than GPTQ at the same bit width, although the gap depends on the model and calibration data.
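The effect of scaling a salient channel can be reproduced on a two-weight toy example. This is only an illustration of the idea: the weights, activations, and scale factors are hand-picked, the quantizer is a simple symmetric round-to-nearest with one shared step size, and real AWQ searches for the scales using calibration activations rather than fixing them by hand.

```python
def quantize(weights, qmax=7):
    """Symmetric round-to-nearest quantization with one shared step size."""
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.10, 0.30]
activations = [10.0, 0.1]  # channel 0 carries large activations (salient)
scales = [2.0, 1.0]        # hand-picked: scale up only the salient channel

reference = dot(weights, activations)

# Plain quantization: the salient weight is rounded coarsely.
plain = dot(quantize(weights), activations)

# AWQ-style: scale weights before quantizing, divide activations to compensate.
scaled_weights = quantize([w * s for w, s in zip(weights, scales)])
scaled_activations = [x / s for x, s in zip(activations, scales)]
awq_style = dot(scaled_weights, scaled_activations)

err_plain = abs(plain - reference)
err_scaled = abs(awq_style - reference)
print(f"plain error:  {err_plain:.4f}")
print(f"scaled error: {err_scaled:.4f}")
```

The step size is the same in both cases because the largest weight does not change. The rounding error on the salient weight is therefore divided by its scale factor when the activation is scaled back, which halves the output error here.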

Q4. How does continuous batching improve GPU utilization over static batching?

Answer: Iteration-level scheduling immediately reclaims completed sequence slots for new requests.

Explanation: Static batching holds every slot in the batch until the longest request finishes, so short requests leave their GPU slots idle while they wait. Continuous batching (iteration-level scheduling) rebuilds the batch at every inference step. When a sequence produces an EOS token or reaches max_tokens, its slot is immediately assigned to a waiting request, so the GPU stays close to full batch capacity as long as requests are queued. Throughput gains of 2–5x over static batching are commonly reported, and the vLLM project reported up to 24x higher throughput than serving with Hugging Face Transformers.
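The difference between the two schedulers can be simulated by counting decode iterations. This sketch makes strong simplifications: each request is reduced to its output length, every iteration costs the same, and prefill time is ignored. The function names are illustrative and not taken from any serving framework.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every active sequence emits one token.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Finished sequences release their slots immediately.
        running = [remaining for remaining in running if remaining > 0]
    return steps


output_lengths = [2, 8, 2, 8, 2, 2]
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)
print(static_steps, continuous_steps)  # 18 12
```

With two slots and 24 tokens of total work, 12 iterations is the minimum possible, so the continuous scheduler keeps both slots busy on every step in this example.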

Q5. How do communication patterns differ between Tensor Parallelism and Pipeline Parallelism?

Answer: Tensor parallelism uses AllReduce every layer; pipeline parallelism uses point-to-point transfers at layer boundaries.

Explanation: Tensor Parallelism splits the weight matrices of each Transformer layer across GPUs, so the GPUs must synchronize through an AllReduce collective to sum their partial results. In the common Megatron-style scheme this takes two AllReduces per layer (one after attention, one after the MLP), so an 80-layer model issues roughly 160 of them per forward pass. Each one adds communication latency, which is why high-bandwidth NVLink interconnects are essential. Pipeline Parallelism assigns layer groups to different GPUs and only transfers activation tensors at group boundaries. This minimizes communication frequency but introduces pipeline bubbles (downstream GPUs idle while upstream GPUs process). For intra-node NVLink environments, tensor parallelism is preferred; for inter-node InfiniBand environments, pipeline parallelism scales better.
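A back-of-the-envelope count makes the contrast concrete. The sketch below assumes Megatron-style sharding with two AllReduces per layer (one after attention, one after the MLP); with a different scheme the per-layer count changes, so it is a parameter. The helper names are hypothetical, and the hidden size of 8192 is that of Llama-3.1-70B.

```python
def tensor_parallel_allreduces(num_layers, allreduces_per_layer=2):
    """Collectives per forward pass under tensor parallelism."""
    return num_layers * allreduces_per_layer


def pipeline_transfers(num_stages):
    """Point-to-point sends per forward pass: one per stage boundary."""
    return num_stages - 1


def activation_bytes(num_tokens, hidden_size, dtype_bytes=2):
    """Size of the hidden-state tensor exchanged in one message."""
    return num_tokens * hidden_size * dtype_bytes


tp_ops = tensor_parallel_allreduces(num_layers=80)
pp_ops = pipeline_transfers(num_stages=4)
per_message = activation_bytes(num_tokens=1, hidden_size=8192)
print(f"tensor parallel: {tp_ops} AllReduces per forward pass")
print(f"pipeline parallel: {pp_ops} transfers per forward pass")
print(f"payload per decode-step message: {per_message} bytes")
```

During decode each message is small, so the cost of tensor parallelism comes mostly from the number of synchronization points and their latency, not from the data volume.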

Conclusion

LLM inference optimization is the art of pushing hardware to its physical limits through software. Understanding the GPU memory hierarchy, managing KV cache efficiently, and combining the right quantization and batching strategies can dramatically improve performance on the same hardware.

Key Takeaways:

Memory savings: AWQ/GPTQ 4-bit quantization shrinks a 70B model's weights to roughly 35 GB, allowing it to run on a single A100 80G
Throughput gains: vLLM's PagedAttention + continuous batching delivers up to 24x throughput vs. static serving
Latency reduction: TensorRT-LLM lowers latency through CUDA kernel fusion and FP8 inference
Scale-out: Tensor/pipeline parallelism breaks single-GPU limits across multi-GPU clusters