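GPU Memory Management & LLM Inference Optimization: From vLLM and PagedAttention to GPTQ and TensorRT-LLM

Introduction

When you deploy an LLM (Large Language Model) in a production service, the biggest challenges are GPU memory management and inference efficiency. GPT-4-class models require hundreds of GB of memory, and real-time responses call for generation speeds of tens of tokens per second.

This guide covers the core elements of LLM inference optimization. Starting from the GPU memory hierarchy, it walks step by step through KV cache optimization, GPTQ/AWQ quantization, PagedAttention, continuous batching, and multi-GPU inference, the topics a practicing engineer needs to know.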

1. GPU Memory Hierarchy

HBM (High Bandwidth Memory)

HBM is at the heart of modern AI GPUs. It is built by stacking multiple DRAM dies vertically, and it provides a much wider memory bus than conventional GDDR6.

GPU	Memory	HBM type	Bandwidth	Bus width
A100 80G	80 GB	HBM2e	2.0 TB/s	5120-bit
H100 SXM	80 GB	HBM3	3.35 TB/s	5120-bit
H200 SXM	141 GB	HBM3e	4.8 TB/s	6144-bit
B200 SXM	192 GB	HBM3e	8.0 TB/s	8192-bit
MI300X	192 GB	HBM3	5.3 TB/s	8192-bit

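L2 Cache and SRAM

The GPU memory hierarchy has three main levels:

HBM (global memory): tens to hundreds of GB, bandwidth of several TB/s, latency of a few hundred ns
L2 cache: tens to hundreds of MB (H100: 50 MB), shared by all SMs on the GPU
L1 cache / SRAM (shared memory): 128~256 KB per SM, aggregate bandwidth of tens of TB/s, latency of roughly tens of ns

The SRAM inside each SM (Streaming Multiprocessor) is the fastest memory after the register file. Optimizations such as Flash Attention use this SRAM aggressively to reduce the number of HBM accesses.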

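Roofline Model: Analyzing Performance Limits

The Roofline Model is an analysis tool for deciding whether a given operation is compute-bound or memory-bound.

Arithmetic Intensity (AI) = number of FLOPs / bytes of memory traffic

Performance ceiling = min(Peak FLOPS, Peak Memory BW × AI)

Low AI (memory-bound): memory bandwidth is the bottleneck. The LLM decode phase is the typical example
High AI (compute-bound): compute throughput is the bottleneck. The LLM prefill phase, or decode with a large batch

For the H100:

Peak FP16 throughput (dense Tensor Core): 989 TFLOPS
Peak HBM bandwidth: 3.35 TB/s
Ridge point (balance point): 989 / 3.35 ≈ 295 FLOP/byte

When generating one token at a time for a single sequence, a 70B model (FP16) has AI ≈ 1~2 FLOP/byte, far below the ridge point, so it is extremely memory-bound.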
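
The roofline arithmetic above can be sketched in a few lines of Python. This is only an illustration: the peak values are the H100 figures quoted in this section, and `decode_intensity` is a hypothetical helper that assumes each weight is read once per step and contributes about 2 FLOPs per sequence, ignoring KV cache and activation traffic.

```python
# Roofline sketch using the H100 figures quoted above (assumed peak values).
PEAK_TFLOPS = 989.0   # dense FP16 Tensor Core peak, TFLOP/s
PEAK_BW_TBS = 3.35    # HBM bandwidth, TB/s


def attainable_tflops(ai: float) -> float:
    """Upper bound on throughput for arithmetic intensity `ai` (FLOP/byte)."""
    return min(PEAK_TFLOPS, PEAK_BW_TBS * ai)


def decode_intensity(batch_size: int, dtype_bytes: float = 2.0) -> float:
    """Rough AI of one decode step: ~2 FLOPs per weight per sequence,
    with every weight read from HBM once per step (simplifying assumption)."""
    return 2.0 * batch_size / dtype_bytes


ridge_point = PEAK_TFLOPS / PEAK_BW_TBS  # FLOP/byte where the two limits meet

for bs in (1, 8, 64, 512):
    ai = decode_intensity(bs)
    regime = "memory-bound" if ai < ridge_point else "compute-bound"
    print(f"batch={bs:3d}  AI={ai:6.1f} FLOP/byte  "
          f"ceiling={attainable_tflops(ai):6.1f} TFLOPS  ({regime})")
```

Under these assumptions the decode step only crosses the ridge point at batch sizes in the hundreds, which is why batching matters so much for decode throughput.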

2. LLM Memory Calculation

Parameter Memory

Calculating an LLM's memory footprint accurately is central to deployment planning.

def calc_model_memory_gb(
    num_params: float,       # number of parameters (e.g. 70e9)
    dtype_bytes: float = 2,  # FP16=2, FP32=4, INT8=1, INT4=0.5
) -> float:
    """Compute model weight memory in GB (10^9 bytes)."""
    return (num_params * dtype_bytes) / 1e9

# Memory for common models (FP16)
models = {
    "Llama-3.1-8B":   {"params": 8e9,   "bytes": 2},
    "Llama-3.1-70B":  {"params": 70e9,  "bytes": 2},
    "Llama-3.1-405B": {"params": 405e9, "bytes": 2},
    "Mistral-7B":     {"params": 7e9,   "bytes": 2},
    "Qwen2-72B":      {"params": 72e9,  "bytes": 2},
}

for name, cfg in models.items():
    mem_gb = calc_model_memory_gb(cfg["params"], cfg["bytes"])
    print(f"{name}: {mem_gb:.1f} GB")

Model	Parameters	FP32	FP16/BF16	INT8	INT4
Llama-3.1-8B	8B	32 GB	16 GB	8 GB	4 GB
Llama-3.1-70B	70B	280 GB	140 GB	70 GB	35 GB
Llama-3.1-405B	405B	1620 GB	810 GB	405 GB	202.5 GB
Mistral-7B	7B	28 GB	14 GB	7 GB	3.5 GB

KV Cache Memory Calculation

The KV cache is the most dynamic part of memory usage during inference. It grows linearly with sequence length and batch size.

def calc_kv_cache_memory_gb(
    num_layers: int,
    num_heads: int,          # number of KV heads (for GQA models, pass kv_heads)
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: float = 2,  # FP16
) -> float:
    """
    Compute KV cache memory in GB (10^9 bytes).
    Per layer: 2 (K, V) × num_heads × head_dim × seq_len × batch_size elements
    """
    kv_per_layer = 2 * num_heads * head_dim * seq_len * batch_size
    total_bytes = kv_per_layer * num_layers * dtype_bytes
    return total_bytes / 1e9

# Llama-3.1-70B example
# layers=80, heads=64 (GQA: kv_heads=8), head_dim=128
kv_mem = calc_kv_cache_memory_gb(
    num_layers=80,
    num_heads=8,      # with GQA, use kv_heads
    head_dim=128,
    seq_len=4096,
    batch_size=1,
    dtype_bytes=2,
)
print(f"KV cache (seq=4096, bs=1): {kv_mem:.2f} GB")
# Output: KV cache (seq=4096, bs=1): 1.34 GB

# KV cache by batch size
for bs in [1, 4, 8, 16, 32]:
    mem = calc_kv_cache_memory_gb(80, 8, 128, 4096, bs, 2)
    print(f"  batch_size={bs:2d}: {mem:.2f} GB")

KV Cache Memory (Llama-3.1-70B, seq_len=4096, FP16)

Batch size	KV cache	Model weights	Total
1	1.34 GB	140 GB	141.3 GB
4	5.37 GB	140 GB	145.4 GB
8	10.74 GB	140 GB	150.7 GB
16	21.47 GB	140 GB	161.5 GB
32	42.95 GB	140 GB	182.9 GB
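
Turning the calculation around gives a rough capacity estimate: how many full-length sequences fit in the memory left after the weights are loaded. The sketch below uses the Llama-3.1-70B shape from this section. The 4 × 80 GB cluster, the 0.90 utilization factor, and both helper functions are illustrative assumptions, and activations, CUDA graphs, and allocator overhead are ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of K and V stored for a single token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(total_gpu_bytes: float, weight_bytes: float,
                             seq_len: int, per_token_bytes: int,
                             utilization: float = 0.90) -> int:
    """Full-length sequences that fit in the memory left for the KV cache."""
    budget = total_gpu_bytes * utilization - weight_bytes
    if budget <= 0:
        return 0
    return int(budget // (per_token_bytes * seq_len))


per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
capacity = max_concurrent_sequences(
    total_gpu_bytes=4 * 80e9,   # assumed: four 80 GB GPUs
    weight_bytes=140e9,         # 70B parameters in FP16
    seq_len=4096,
    per_token_bytes=per_token,
)
print(f"KV bytes per token: {per_token}")
print(f"Max concurrent 4096-token sequences: {capacity}")
```

An estimate like this is a useful sanity check when choosing values such as `max_model_len` or the maximum number of concurrent sequences for a serving engine.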

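Activation Memory

At inference time, activation memory scales with the product of batch size, sequence length, and hidden size. Unlike training, inference stores no gradients and does not keep activations for a backward pass, so it is comparatively small.

def calc_activation_memory_gb(
    hidden_size: int,
    seq_len: int,
    batch_size: int,
    num_layers: int,
    dtype_bytes: float = 2,
) -> float:
    """Approximate inference activation memory in GB (10^9 bytes)."""
    # Per layer: attention + FFN activations
    # Approximation: 2 × hidden_size × seq_len × batch_size per layer
    bytes_per_layer = 2 * hidden_size * seq_len * batch_size * dtype_bytes
    # Summing over all layers is a conservative upper bound: inference
    # runtimes usually reuse buffers, so the real peak is closer to one layer.
    return (bytes_per_layer * num_layers) / 1e9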

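3. KV Cache Optimization: PagedAttention

Problems with the Conventional KV Cache

Traditional LLM serving systems allocate each request's KV cache as one contiguous block of memory. This causes serious problems:

Internal fragmentation: pre-allocating for the maximum sequence length wastes the space that is never actually used
External fragmentation: each time a request finishes it leaves a free gap of a different size, which makes it hard to place new requests
Memory efficiency: the vLLM authors report that existing systems waste 60~80% of KV cache memory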
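
To see where that waste comes from, the following sketch compares the reserved-but-unused fraction under contiguous pre-allocation with the fraction under fixed-size blocks. The request lengths, the 2048-token reservation, and the helper names are made-up illustrative values, not measurements.

```python
import math


def contiguous_waste(actual_lens, max_len):
    """Fraction of reserved KV slots left unused when every request
    reserves `max_len` tokens up front."""
    reserved = len(actual_lens) * max_len
    return 1 - sum(actual_lens) / reserved


def paged_waste(actual_lens, block_size=16):
    """Fraction unused when memory is handed out in fixed-size blocks:
    only the tail of each request's last block is wasted."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in actual_lens)
    return 1 - sum(actual_lens) / reserved


lens = [100, 500, 1200]  # hypothetical final lengths (prompt + output tokens)
print(f"contiguous, max_len=2048: {contiguous_waste(lens, 2048):.1%} wasted")  # 70.7%
print(f"paged, block_size=16:     {paged_waste(lens):.1%} wasted")             # 1.3%
```

With block-based allocation the waste is bounded by one partially filled block per sequence, regardless of how long the sequence was allowed to grow.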

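PagedAttention: Applying OS Paging

vLLM's PagedAttention applies the operating system's concept of virtual memory paging to the KV cache.

OS virtual memory → PagedAttention
────────────────────────────────
Virtual page      → Logical block
Physical frame    → Physical block
Page table        → Block table
Page fault        → Block allocation

Key ideas:

Split the KV cache into fixed-size blocks (e.g. 16 tokens)
A sequence addresses its KV through logical blocks, and physical blocks are allocated only when needed
When different sequences share a common prompt, they share the same physical blocks with Copy-on-Write

Request A: [Block 0] → [Block 1] → [Block 2]
               ↕           ↕      Blocks 0 and 1 are shared physical blocks (common prompt)
Request B: [Block 0] → [Block 1] → [Block 3]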
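
The block-table bookkeeping can be sketched as a toy allocator with reference counts. This is not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, only block IDs are tracked, and a real engine would also copy the KV tensor data when a copy-on-write is triggered.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks and counts their users."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_count[block] += 1

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


class Sequence:
    """Holds a block table: logical block index -> physical block ID."""

    def __init__(self, allocator: BlockAllocator, block_size: int = 16):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % self.block_size == 0:
            # Last block is full (or none exists yet): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: take a private block before writing.
                # (A real engine would copy the KV data here as well.)
                self.block_table[-1] = self.allocator.allocate()
                self.allocator.release(last)
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        """Create a sequence that shares every existing block with this one."""
        child = Sequence(self.allocator, self.block_size)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.allocator.share(block)
        return child


allocator = BlockAllocator(num_blocks=8)
a = Sequence(allocator, block_size=4)
for _ in range(6):          # 6 tokens -> blocks 0 (full) and 1 (half full)
    a.append_token()
b = a.fork()                # b shares blocks 0 and 1 with a
b.append_token()            # writing into shared block 1 triggers copy-on-write
print(a.block_table, b.block_table)  # [0, 1] [0, 2]
```

The full block stays shared between both sequences, and only the partially filled block that is about to be written gets a private copy.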

vLLM Server Example

# Install vLLM
pip install vllm

# Start the server (single GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# Call the OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain GPU memory optimization"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'

# Call the vLLM API with the Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a GPU optimization expert."},
        {"role": "user", "content": "Explain how PagedAttention works."},
    ],
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

4. Quantization: GPTQ, AWQ, GGUF, bitsandbytes

Comparison of Quantization Methods

Method	Precision	Memory savings	Speed	Quality loss	Notes
FP16/BF16	16-bit	Baseline	Baseline	None	Default
GPTQ	4-bit	~75%	Fast	Low	PTQ, GPU only
AWQ	4-bit	~75%	Fast	Very low	Activation-aware
GGUF	2~8-bit	Varies	Runs on CPU	Varies	llama.cpp
bitsandbytes NF4	4-bit	~75%	Moderate	Low	Used for QLoRA training
bitsandbytes INT8	8-bit	~50%	Moderate	Very low	LLM.int8()
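
All of the 4-bit methods in this table build on the same basic operation: map each small group of weights to a few integer levels plus per-group metadata. The NumPy sketch below shows plain round-to-nearest group-wise quantization with a per-group scale and minimum. It is a generic illustration, not the GPTQ or AWQ algorithm, and the function names are hypothetical.

```python
import numpy as np


def quantize_groupwise(w, n_bits=4, group_size=128):
    """Round-to-nearest quantization with one (scale, min) pair per group."""
    levels = 2 ** n_bits - 1                   # 15 steps for 4-bit
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard against constant groups
    q = np.round((groups - w_min) / scale).astype(np.uint8)
    return q, scale, w_min


def dequantize_groupwise(q, scale, w_min, shape):
    """Map integer codes back to approximate floating-point weights."""
    return (q * scale + w_min).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(256, 512)).astype(np.float32)

q, scale, w_min = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, w_min, w.shape)

# Storage cost: 4 bits per weight plus an FP16 scale and min per 128 weights.
bits_per_weight = 4 + (16 + 16) / 128
print(f"max abs error:   {np.abs(w - w_hat).max():.6f}")
print(f"bits per weight: {bits_per_weight:.2f}")
```

A smaller group size lowers the rounding error because each group spans a narrower range, at the cost of more metadata per weight, which is the trade-off behind the `group_size` setting in the GPTQ and AWQ configurations.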

Loading with bitsandbytes 4-bit Quantization

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # double quantization for extra compression
)

model_id = "meta-llama/Llama-3.1-70B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",      # automatically shard across available GPUs
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Check memory usage, summed over all GPUs because device_map="auto" may shard the model
total_bytes = sum(
    torch.cuda.memory_allocated(i) for i in range(torch.cuda.device_count())
)
print(f"GPU memory: {total_bytes / 1e9:.2f} GB")

GPTQ Quantization (auto-gptq)

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # group size (smaller is more accurate but uses more memory)
    desc_act=False,      # act-order (quantize in order of activation size); False favors inference speed
    damp_percent=0.01,   # Hessian dampening factor
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration data (representative text samples).
# quantize() expects a list of dicts holding input_ids and attention_mask,
# so pass the tokenizer output itself rather than only input_ids.
calibration_data = [
    tokenizer("The GPU accelerates machine learning by..."),
    tokenizer("Quantization reduces model size while..."),
    # In practice, use far more samples (the GPTQ paper calibrates on 128 sequences)
]

# Load the model and quantize it
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")
print("GPTQ quantization complete!")

# Load the quantized model
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "llama-3.1-8b-gptq-4bit",
    use_safetensors=True,
    device="cuda:0",
)

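AWQ: Activation-aware Weight Quantization

AWQ does not treat all weights equally. Channels that see large activations (the important weights) are protected by scaling them up before quantization, so they lose less precision, while every weight is still stored in the same low-bit format.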
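
The scaling trick can be illustrated with a small NumPy experiment. This is a simplified sketch rather than the AutoAWQ implementation: the scale is fixed as `mean(|x|) ** alpha` with an assumed `alpha = 0.5` (AWQ searches for the best exponent per layer), and quantization is simulated with a plain round-to-nearest helper. Whether the error actually drops depends on the data and on the chosen scale.

```python
import numpy as np


def quantize_rtn(w, n_bits=4):
    """Simulated symmetric round-to-nearest quantization, one scale per output column."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    return np.round(w / scale) * scale


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))               # activations: (tokens, in_features)
x[:, :4] *= 20.0                             # a few channels with large activations
w = rng.normal(scale=0.02, size=(256, 128))  # weights: (in_features, out_features)

alpha = 0.5                                  # assumed exponent; AWQ searches for it
s = np.abs(x).mean(axis=0) ** alpha          # one scale per input channel

w_scaled = w * s[:, None]                    # scale salient weight rows up ...
x_scaled = x / s                             # ... and the matching activations down

y_ref = x @ w
# Before quantization the rescaling is mathematically a no-op.
assert np.allclose(x_scaled @ w_scaled, y_ref)

err_plain = np.abs(x @ quantize_rtn(w) - y_ref).mean()
err_scaled = np.abs(x_scaled @ quantize_rtn(w_scaled) - y_ref).mean()
print(f"mean output error, plain RTN:        {err_plain:.5f}")
print(f"mean output error, activation-aware: {err_scaled:.5f}")
```

In a real model the division by `s` is folded into the preceding operation (for example a normalization layer), so no extra work is done at inference time.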

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-awq"

# AWQ quantization settings
quant_config = {
    "zero_point": True,   # zero-point (asymmetric) quantization
    "q_group_size": 128,  # group size
    "w_bit": 4,           # 4-bit
    "version": "GEMM",    # matrix-multiplication kernel
}

model = AutoAWQForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print("AWQ quantization complete!")

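Performance Benchmark by Quantization Method (Llama-3.1-8B, throughput normalized to FP16 = 100)

Method	Memory	Throughput (relative)	Perplexity	Notes
FP16	16 GB	100 (baseline)	7.2	Reference
BF16	16 GB	100	7.2	On par with FP16
INT8	8 GB	75	7.3	Slight quality loss
GPTQ-4bit	4.5 GB	120	7.6	Saves memory, improves speed
AWQ-4bit	4.5 GB	125	7.4	Better quality than GPTQ
GGUF-Q4_K_M	4.8 GB	80 (CPU)	7.5	CPU inference possible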

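5. Batching Strategy: Continuous Batching

Limits of Static Batching

Traditional static batching starts all requests in a batch together and holds the batch until every request has finished. Slots that finish early sit idle, which lowers GPU utilization significantly.

Static batching example (batch_size=3):

Time →
[Request A: ████████████░░░░░░░░]  (generates 12 tokens)
[Request B: ████░░░░░░░░░░░░░░░░]  (generates 4 tokens)
[Request C: ████████░░░░░░░░░░░░]  (generates 8 tokens)
                └─ B and C have finished but must wait for A (wasted GPU)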

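Continuous Batching (Iteration-level Scheduling)

Modern LLM serving systems such as vLLM and TensorRT-LLM use continuous batching. The batch is rebuilt dynamically at every inference step (iteration).

Continuous batching:

Step 1: [A1][B1][C1]  ← 3 requests processed together
Step 2: [A2][B2][C2]
Step 3: [A3][B3][C3]  ← B finishes, new request D is admitted
Step 4: [A4][C4][D1]  ← the free slot is filled immediately
Step 5: [A5][C5][D2]
Step 6: [A6][C6][D3]  ← C finishes, new request E is admitted
...

Compared with static batching, throughput typically improves by about 2~5x, depending on how much the output lengths vary.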
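
The difference between the two schedulers can be reproduced with a tiny simulation that counts decode steps. It is a deliberately simplified model: every request needs a fixed number of decode steps, prefill cost is ignored, and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active = []   # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


lengths = [12, 4, 8, 6, 10, 2]   # hypothetical output lengths in tokens
print("static:    ", static_batching_steps(lengths, batch_size=3))      # 22
print("continuous:", continuous_batching_steps(lengths, batch_size=3))  # 18
```

The gap grows as output lengths become more uneven, because static batching pays for the longest request in every batch.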

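Separating Prefill and Decode

LLM inference consists of two phases:

Prefill: processes the whole prompt at once. Compute-bound (it behaves like a batch)
Decode: generates tokens one at a time, autoregressively. Memory-bound

The two phases stress different GPU resources. Disaggregated Prefill is an architecture that runs prefill and decode on separate, dedicated GPUs.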
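
A back-of-the-envelope estimate makes the contrast concrete. The sketch below bounds decode time by the time needed to read all weights once, and prefill time by the FLOPs needed to process the prompt. The H100 peak figures, the 8B FP16 model, and the helper names are assumptions for illustration, and real systems fall short of both bounds.

```python
H100_BW = 3.35e12      # assumed HBM bandwidth, bytes/s
H100_FLOPS = 989e12    # assumed dense FP16 peak, FLOP/s


def decode_ms_per_token(num_params, dtype_bytes, mem_bw):
    """Lower bound: every weight is read from HBM once per generated token."""
    return num_params * dtype_bytes / mem_bw * 1e3


def prefill_ms(num_params, prompt_tokens, peak_flops):
    """Lower bound: about 2 FLOPs per parameter per prompt token."""
    return 2 * num_params * prompt_tokens / peak_flops * 1e3


decode = decode_ms_per_token(8e9, 2, H100_BW)
prefill = prefill_ms(8e9, 2048, H100_FLOPS)
print(f"decode:  >= {decode:.2f} ms/token (<= {1e3 / decode:.0f} tok/s per sequence)")
print(f"prefill: >= {prefill:.2f} ms for a 2048-token prompt")
```

Decode time is set by bandwidth and barely changes with batch size until the batch becomes large, while prefill time scales with prompt length and raw compute, which is the motivation for placing the two phases on differently provisioned GPUs.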

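6. Comparison of LLM Inference Frameworks

Framework	Developer	Features	Best for
vLLM	UC Berkeley	PagedAttention, OpenAI-compatible API	High-throughput serving
TensorRT-LLM	NVIDIA	Optimized CUDA kernels, FP8 support	Lowest latency
Ollama	Ollama Inc	Easy local execution	Development/testing
llama.cpp	ggml	CPU inference, GGUF format	Edge/local
SGLang	LMSYS	Structured generation, RadixAttention	Complex pipelines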

vLLM Tensor-Parallel Inference

from vllm import LLM, SamplingParams

# Shard the model across 4 GPUs with tensor parallelism
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,       # tensor-parallel across 4 GPUs
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    enforce_eager=False,           # keep CUDA graph optimization enabled
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["<|eot_id|>"],  # Llama 3.x end-of-turn marker ("</s>" and "[INST]" are Llama 2 tokens)
)

prompts = [
    "Explain the GPU memory hierarchy",
    "What are the advantages of PagedAttention?",
    "Compare quantization methods",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated: {generated[:100]}...")
    print()

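7. Multi-GPU Inference: Tensor/Pipeline Parallelism

Tensor Parallelism

Tensor parallelism distributes individual matrix operations across several GPUs. It splits the inside of each Transformer layer horizontally.

Attention head distribution (4-way tensor parallel):

GPU 0: Head 0~15
GPU 1: Head 16~31
GPU 2: Head 32~47
GPU 3: Head 48~63

Each GPU computes independently, then the results are summed with AllReduce

Pros: lower latency, can handle very large layers
Cons: AllReduce communication in every layer → a fast interconnect such as NVLink is essential
Best for: NVLink-connected GPUs within a single node, latency-sensitive applications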
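
The same split-then-sum pattern applies to the MLP block. The NumPy sketch below simulates 4-way tensor parallelism on one machine: the first weight matrix is split by columns, the second by rows, and a plain sum stands in for the AllReduce. The shapes and the ReLU activation are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 4                                   # simulated tensor-parallel degree

x = rng.normal(size=(2, 64))             # (tokens, hidden)
w1 = rng.normal(size=(64, 256))          # up projection
w2 = rng.normal(size=(256, 64))          # down projection

# Single-device reference: a two-layer MLP with ReLU.
y_ref = np.maximum(x @ w1, 0) @ w2

# Column-parallel first matmul, row-parallel second matmul.
w1_shards = np.split(w1, tp, axis=1)     # each "GPU" owns 64 of the 256 columns
w2_shards = np.split(w2, tp, axis=0)     # ... and the matching 64 rows

# Each "GPU" computes a partial output without talking to the others;
# this works because ReLU is elementwise and needs no cross-shard data.
partials = [np.maximum(x @ a, 0) @ b for a, b in zip(w1_shards, w2_shards)]

# Summing the partial outputs plays the role of AllReduce(sum).
y_tp = np.sum(partials, axis=0)

print("max difference vs. reference:", np.abs(y_tp - y_ref).max())
```

Only one reduction is needed per block because the intermediate activation never has to be gathered, which is why this particular column-then-row split is the usual choice.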

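Pipeline Parallelism

Pipeline parallelism divides the layers into groups and assigns each group to a GPU.

Llama-3.1-70B (80 layers) → 4-way pipeline:

GPU 0: Layer 0~19
GPU 1: Layer 20~39
GPU 2: Layer 40~59
GPU 3: Layer 60~79

Layers are processed in order, and activations are passed from GPU to GPU

Pros: efficient even over slower inter-node links, low communication volume
Cons: pipeline bubbles, higher latency
Best for: multi-node distributed inference, very large models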
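
The layer assignment and the cost of pipeline bubbles can be sketched as follows. `partition_layers` is a hypothetical helper that splits layers as evenly as possible, and `bubble_fraction` uses the standard idle-time estimate for a simple fill-and-drain schedule with equal-cost stages, (p - 1) / (m + p - 1) for p stages and m micro-batches.

```python
def partition_layers(num_layers: int, num_stages: int):
    """Return inclusive (first, last) layer index ranges, one per pipeline stage."""
    base, extra = divmod(num_layers, num_stages)
    ranges, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Share of time a stage sits idle in a fill-and-drain schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for gpu, (first, last) in enumerate(partition_layers(80, 4)):
    print(f"GPU {gpu}: Layer {first}~{last}")

for m in (1, 4, 13):
    print(f"micro-batches={m:2d}: bubble = {bubble_fraction(4, m):.1%}")
```

With a single micro-batch, three of the four stages are idle at any moment, so keeping several requests or micro-batches in flight is what makes pipeline parallelism worthwhile.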

Memory Profiling

import torch

def profile_gpu_memory(func, *args, **kwargs):
    """Profile the GPU memory used by one call (current CUDA device)."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    before = torch.cuda.memory_allocated()
    result = func(*args, **kwargs)
    torch.cuda.synchronize()

    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()

    print(f"Memory increase: {(after - before) / 1e9:.3f} GB")
    print(f"Peak memory: {peak / 1e9:.3f} GB")
    print()
    print(torch.cuda.memory_summary())
    return result

# Example: print memory statistics
def load_and_infer():
    from transformers import pipeline
    pipe = pipeline(
        "text-generation",
        model="microsoft/phi-2",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return pipe("GPU memory management is", max_new_tokens=50)

profile_gpu_memory(load_and_infer)

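8. Practical Optimization Checklist

GPU Memory Optimization Strategies

Apply quantization: INT4/INT8 quantization saves 50~75% of weight memory
Optimize the KV cache: limit max_model_len, choose models with GQA (Grouped Query Attention)
Flash Attention 2: makes better use of SRAM and reduces attention memory from O(n²) to O(n) in sequence length
Model sharding: use multiple GPUs with tensor or pipeline parallelism
Continuous batching: maximize GPU utilization with continuous batching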

Inference Speed Optimization

# Example of a tuned vLLM server configuration
vllm_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bfloat16",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.90,   # use 90% of GPU memory
    "max_model_len": 8192,
    "max_num_batched_tokens": 8192,   # max tokens per batch (per scheduler step)
    "max_num_seqs": 256,              # max concurrent sequences
    "enable_chunked_prefill": True,   # enable chunked prefill
    "block_size": 16,                 # KV cache block size (PagedAttention)
    "swap_space": 4,                  # CPU swap space (GB)
    "enforce_eager": False,           # use CUDA graphs
    "disable_log_stats": False,
}

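Quiz: Check Your Understanding

Q1. Why do the prefill and decode phases of LLM inference have different compute characteristics?

Answer: Prefill is compute-bound, decode is memory-bound

Explanation: The prefill phase processes every token in the prompt in parallel. It behaves like batch processing, so arithmetic intensity is high and the GPU's compute units are fully used (compute-bound). The decode phase generates a single token while reading the KV cache of all previously generated tokens. Every step has to read the full model weights and the KV cache from memory, so arithmetic intensity is extremely low and the phase becomes memory-bound. The H100's ridge point is ~295 FLOP/byte, while the AI of the decode phase is only 1~2 FLOP/byte.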

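Q2. Why is PagedAttention more memory-efficient than conventional KV cache management?

Answer: Non-contiguous physical block allocation and on-demand allocation remove fragmentation

Explanation: The conventional approach reserves contiguous memory for the maximum sequence length for every request. This causes severe internal fragmentation, because requests that end early still hold the long reservation, and external fragmentation, because requests of different sizes leave gaps when they finish. PagedAttention splits the KV cache into fixed-size blocks, as OS paging does, and allocates them only when needed. Because non-contiguous physical memory is abstracted behind logical blocks, fragmentation almost disappears, and multiple requests can share the KV blocks of a common prompt with Copy-on-Write, which greatly improves memory efficiency.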

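Q3. How does AWQ preserve important weights better than GPTQ?

Answer: It scales each channel according to activation magnitude to protect the important weights

Explanation: GPTQ minimizes quantization error using a second-order approximation (Hessian), but it treats all weights in a similar way. AWQ (Activation-aware Weight Quantization) analyzes the distribution of activations and builds on the observation that channels associated with large activations (salient channels) matter more for overall quality. The weights of these channels are multiplied by a scale factor before quantization to enlarge them, and at inference time the corresponding activations are multiplied by the reciprocal to compensate. This protects the important weights while keeping hardware-friendly uniform quantization, which gives lower perplexity than GPTQ.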

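Q4. How does continuous batching raise GPU utilization compared with static batching?

Answer: Iteration-level scheduling immediately reuses the slots of finished sequences

Explanation: With static batching, the GPU waits until every request in the batch has finished. The slots of requests that end early are wasted until the longest sequence completes. Continuous batching (also called iteration-level scheduling) rebuilds the batch at every inference step. As soon as a sequence emits the EOS token or reaches max_tokens, a waiting request is admitted into that slot. As long as requests are queued, the GPU keeps running close to its maximum batch size, and throughput improves by about 2~5x over static batching.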

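Q5. How do the communication patterns of Tensor Parallelism and Pipeline Parallelism differ?

Answer: Tensor parallelism performs AllReduce in every layer, pipeline parallelism performs P2P transfers at stage boundaries

Explanation: Tensor Parallelism splits the weight matrices of each Transformer layer across several GPUs. After the sharded computation, all GPUs must sum their results with AllReduce. With Megatron-style sharding this happens once after the attention block and once after the MLP block, so an 80-layer model needs on the order of 160 AllReduce operations per forward pass, and the communication latency adds up. A high-bandwidth interconnect such as NVLink is essential. Pipeline Parallelism, by contrast, passes activations to the next GPU only at the boundaries between layer groups. There are far fewer transfers, but pipeline bubbles appear (later GPUs wait while earlier GPUs compute). Tensor parallelism fits NVLink environments within a single node, and pipeline parallelism fits inter-node InfiniBand environments.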

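Conclusion

LLM inference optimization is the challenge of overcoming the physical limits of hardware with software. If you understand the GPU memory hierarchy, manage the KV cache efficiently, and combine suitable quantization and batching strategies, you can get far better performance out of the same hardware.

Key takeaways:

Memory savings: AWQ/GPTQ 4-bit quantization lets a 70B model run on a single A100 80G
Higher throughput: vLLM's PagedAttention plus continuous batching delivers up to 24x the throughput of HuggingFace Transformers, as reported by the vLLM team
Lower latency: TensorRT-LLM provides optimized CUDA kernels and FP8
Scale-out: tensor/pipeline parallelism goes beyond the limits of a single GPU to use multi-GPU clusters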