Comparing LLM Inference Serving Frameworks: Production Deployment Strategies for TensorRT-LLM, vLLM, and SGLang

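Introduction
Core Challenges of LLM Inference Serving
- The KV Cache Memory Problem
- The Evolution of Batching Strategies
TensorRT-LLM Deep Dive
vLLM Deep Dive
SGLang Deep Dive
Benchmark Comparison of the Three Frameworks
Production Deployment Architecture
- GPU Node Scheduling Strategy
- Autoscaling and Monitoring
Failure Cases and Troubleshooting
Operational Considerations and Selection Guide
Conclusion
References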

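Introduction

Training an LLM and serving it in production are entirely different engineering problems. Training optimizes above all for throughput, whereas serving must balance three competing goals at once: throughput, latency, and memory efficiency. In user-facing services such as real-time chatbots and code assistants, the user experience degrades sharply once TTFT (Time To First Token) exceeds a few hundred milliseconds.

Between 2024 and 2026, three frameworks matured to production grade in response to this problem. TensorRT-LLM (NVIDIA) stands out for the depth of its hardware optimization, vLLM (UC Berkeley) for memory efficiency and ecosystem breadth, and SGLang (LMSYS) for KV cache reuse and structured generation performance.

This article analyzes the internal architecture of each framework, compares them using benchmark data measured on H100 GPUs, and then covers production deployment code and operational strategy.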

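Core Challenges of LLM Inference Serving

LLM inference splits into two phases: prefill, which processes the whole prompt in a single pass, and decode, which generates tokens one at a time autoregressively. Prefill is compute-bound and decode is memory-bound, so the two phases call for fundamentally different optimization strategies.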
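
The compute-bound versus memory-bound split can be made concrete with a rough roofline estimate. The sketch below is a simplification under stated assumptions: it counts only the weight matrices (attention and KV cache traffic are ignored), assumes FP16 weights, and uses approximate H100 SXM peak figures that are not taken from this article.

```python
def arithmetic_intensity(tokens_per_step, bytes_per_param=2):
    """FLOPs per byte of weights read: 2 FLOPs (multiply + add) per parameter per token."""
    return 2 * tokens_per_step / bytes_per_param

# Assumed H100 SXM peaks: ~989 TFLOP/s dense FP16 and ~3.35 TB/s HBM bandwidth
RIDGE = 989e12 / 3.35e12  # FLOPs per byte at which compute and bandwidth balance

cases = [
    ("prefill, 2048-token prompt", 2048),
    ("decode, batch of 1", 1),
    ("decode, batch of 64", 64),
]
for name, tokens in cases:
    intensity = arithmetic_intensity(tokens)
    bound = "compute-bound" if intensity >= RIDGE else "memory-bound"
    print(f"{name}: {intensity:.0f} FLOP/byte -> {bound}")
```

Under these assumptions a decode step stays memory-bound until a few hundred sequences share each weight read, which is why batching matters so much for decode throughput.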

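The KV Cache Memory Problem

Each layer of a Transformer decoder stores the Key/Value vectors of earlier tokens in a cache. When Llama-3-70B is served in FP16, the KV cache of a request grows linearly with sequence length. With grouped-query attention (80 layers, 8 KV heads, head dimension 128) each token takes about 320 KiB, so a 4096-token request consumes roughly 1.25 GiB. Handling 32 such requests concurrently requires about 40 GiB for the KV cache alone.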
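
A quick back-of-the-envelope calculation shows where KV cache figures of this kind come from. This is a minimal sketch: the layer count, number of KV heads, and head dimension are assumptions based on the published Llama-3-70B configuration, and the result changes considerably for models that do not use grouped-query attention.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: K and V for every layer and KV head."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len

GIB = 1024 ** 3

# Assumed Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"per request: {per_request / GIB:.2f} GiB")
print(f"32 requests: {32 * per_request / GIB:.1f} GiB")
```

Changing `num_kv_heads` to 64 (plain multi-head attention) multiplies the result by eight, which illustrates how strongly the estimate depends on the attention layout.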

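The Evolution of Batching Strategies

With traditional static batching, the GPU holds the whole batch until its longest request completes, so the slots of requests that finished early sit idle and GPU time is wasted. Continuous batching (vLLM, SGLang) and in-flight batching (TensorRT-LLM) were introduced to solve this. The core idea is the same in both: each request leaves the batch as soon as it finishes, and a new request joins immediately.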

# Simulating the throughput gap between static and continuous batching
from collections import deque

import numpy as np

def simulate_static_batching(requests, batch_size=8):
    """Static batching: wait until the longest request in the batch finishes."""
    total_time = 0
    for i in range(0, len(requests), batch_size):
        batch = requests[i:i + batch_size]
        max_tokens = max(r["output_tokens"] for r in batch)
        total_time += max_tokens * 0.01  # assume 10 ms per token
    return total_time

def simulate_continuous_batching(requests, batch_size=8):
    """Continuous batching: refill a freed slot with a new request immediately."""
    total_time = 0
    active_slots = []
    queue = deque(requests)

    while queue or active_slots:
        # Fill empty slots with new requests
        while len(active_slots) < batch_size and queue:
            active_slots.append(queue.popleft())

        # Advance one decode step
        total_time += 0.01
        for slot in active_slots:
            slot["remaining"] = slot.get("remaining", slot["output_tokens"]) - 1

        # Drop finished requests
        active_slots = [s for s in active_slots if s["remaining"] > 0]

    return total_time

# 100 requests with random output lengths of 10 to 499 tokens
requests = [{"output_tokens": np.random.randint(10, 500)} for _ in range(100)]
static_time = simulate_static_batching(requests)
continuous_time = simulate_continuous_batching(
    [dict(r) for r in requests]
)

print(f"Static batching total time: {static_time:.1f}s")
print(f"Continuous batching total time: {continuous_time:.1f}s")
print(f"Throughput gain: {static_time / continuous_time:.1f}x")
# Sample output (illustrative; varies from run to run because the lengths are random):
# Static batching total time: 62.5s
# Continuous batching total time: 27.3s
# Throughput gain: 2.3x

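TensorRT-LLM Deep Dive

NVIDIA-Native Hardware Optimization

TensorRT-LLM is the LLM inference engine that NVIDIA develops for its own GPUs. It delivers up to 8x higher throughput than plain PyTorch inference and takes full advantage of FP8 Tensor Cores on H100, H200, and B200 GPUs.

Key optimization techniques:

- Kernel fusion: fuses operations such as multi-head attention, LayerNorm, and GELU into single CUDA kernels, cutting memory access overhead
- FP8/FP4 quantization: uses the FP8 Tensor Cores of the H100 to reach up to about 2x the FP16 throughput with minimal accuracy loss (FP4 requires Blackwell-generation GPUs such as the B200)
- In-flight batching: mixes prefill and decode work within one batch to maximize GPU utilization
- Paged KV cache: inspired by vLLM, manages the KV cache in non-contiguous memory blocks
- Tensor parallelism / pipeline parallelism: partitions the model across GPUs in multi-GPU environments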

Building and Serving a Model with TensorRT-LLM

TensorRT-LLM follows a two-step process: the model is first converted (built) into a TRT engine and then served.

# 1. Install TensorRT-LLM (Docker recommended)
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

# 2. Convert the Hugging Face model into a TRT engine
# Llama-3-70B, FP16 checkpoint, 4-way tensor parallelism
# (convert_checkpoint.py ships with the TensorRT-LLM repository under examples/llama)
python convert_checkpoint.py \
    --model_dir /models/Llama-3-70B \
    --output_dir /engines/llama-70b-ckpt \
    --dtype float16 \
    --tp_size 4

# Note: --use_fp8_context_fmha only applies to an FP8-quantized checkpoint.
# The checkpoint converted above is FP16, so quantize it to FP8 first or drop that flag.
trtllm-build \
    --checkpoint_dir /engines/llama-70b-ckpt \
    --output_dir /engines/llama-70b-engine \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable \
    --workers 4

# 3. Serve with Triton Inference Server
tritonserver \
    --model-repository=/engines/triton-repo \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002

# Direct inference with the TensorRT-LLM Python API
import time

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Load the built engine
llm = LLM(
    model="/engines/llama-70b-engine",
    tensor_parallel_size=4,
    dtype="float16",
    kv_cache_config=KvCacheConfig(
        enable_block_reuse=True,
        free_gpu_memory_fraction=0.9,
    ),
)

# Batch inference
prompts = [
    "Explain the concept of attention mechanism in transformers",
    "Write a Python function to implement binary search",
    "What are the key differences between TCP and UDP?",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params=sampling_params)
elapsed = time.perf_counter() - start

total_tokens = 0
for output in outputs:
    completion = output.outputs[0]
    total_tokens += len(completion.token_ids)
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {completion.text[:100]}...")
    print()

# Aggregate throughput, measured with a wall-clock timer around generate()
print(f"Tokens/sec (whole batch): {total_tokens / elapsed:.1f}")

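Speculative Decoding in TensorRT-LLM

TensorRT-LLM supports speculative decoding natively. A draft model quickly proposes several tokens and the target model verifies them in a single pass, which speeds up decoding by 1.5x to 2.5x while preserving output quality.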

# TensorRT-LLM speculative decoding configuration
# The keyword names below are illustrative: the speculative decoding options of the
# LLM API differ between TensorRT-LLM releases, so check the API reference for your version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="/engines/llama-70b-engine",
    speculative_model="/engines/llama-8b-draft-engine",
    speculative_config={
        "num_draft_tokens": 5,
        "acceptance_method": "typical_acceptance",
    },
    tensor_parallel_size=4,
)

# Speculative decoding is applied transparently
params = SamplingParams(temperature=0.0, max_tokens=1024)
output = llm.generate(["Explain quantum computing"], params)
# Internally the draft model proposes 5 tokens at a time and the target model verifies them
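
How much speculative decoding helps depends mainly on how often the target model accepts the draft tokens. The sketch below uses the standard analysis of speculative sampling, under the simplifying assumption that each draft token is accepted independently with probability `alpha`; `draft_cost_ratio` (the cost of one draft step relative to one target step) is a hypothetical parameter introduced here, not a TensorRT-LLM setting.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model pass with gamma draft tokens."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def estimated_speedup(alpha, gamma, draft_cost_ratio):
    """Wall-clock speedup estimate: tokens gained divided by the relative cost of one round."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost_ratio + 1)

for alpha in (0.6, 0.8, 0.9):
    tokens = expected_tokens_per_step(alpha, gamma=5)
    speedup = estimated_speedup(alpha, gamma=5, draft_cost_ratio=0.1)
    print(f"acceptance={alpha}: {tokens:.2f} tokens/step, ~{speedup:.2f}x speedup")
```

With five draft tokens and a draft model that costs a tenth of the target, acceptance rates around 0.6 to 0.8 land in the 1.5x to 2.5x range, which matches the order of magnitude usually quoted for this technique.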

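vLLM Deep Dive

PagedAttention Memory Management Architecture

vLLM is an inference engine released in 2023 by a research team at UC Berkeley. It introduced PagedAttention, a KV cache management technique modeled on the virtual memory paging of operating systems.

Earlier systems preallocate contiguous memory sized for the maximum sequence length of every request. When the actual generation is shorter, the rest of that reservation is wasted, and on average 60-80% of KV cache memory goes unused.

PagedAttention splits the KV cache into fixed-size blocks (16 tokens by default) and allocates a new block only when one is needed. A block table maps logical block numbers to physical block addresses, so no contiguous memory is required. This raises KV cache memory utilization above 95%.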
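
The block table idea can be illustrated with a toy allocator. This is a minimal sketch of the bookkeeping only, not vLLM's implementation: the class and method names are hypothetical, and real engines also handle preemption, copy-on-write sharing, and the GPU-side attention kernel.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence's logical blocks to physical blocks."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id, token_idx):
        """Translate a logical token position into (physical block, offset)."""
        block = self.block_tables[seq_id][token_idx // self.block_size]
        return block, token_idx % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_physical_blocks=8)
for _ in range(20):
    cache.append_token("req-a")  # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("req-b")  # 5 tokens need 1 block
print(cache.block_tables)
print(cache.physical_slot("req-a", 17))
```

A 20-token sequence occupies two 16-token blocks, so at most one partially filled block per sequence is wasted, instead of a whole maximum-length reservation.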

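The vLLM V1 Engine and Production Stack

vLLM introduced the V1 engine in 2025 (first as an alpha in January, later as the default engine), substantially reworking its architecture. Key changes:

- torch.compile integration: the model forward pass is optimized with the PyTorch 2 compiler
- Multi-process GPU execution: each GPU worker runs in its own process, removing the GIL bottleneck
- Simplified scheduler: a single code path that unifies prefix caching, chunked prefill, and speculative decoding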

# Running the vLLM server and calling the OpenAI-compatible API
# 1. Start the server
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
#     --tensor-parallel-size 4 \
#     --max-model-len 8192 \
#     --gpu-memory-utilization 0.92 \
#     --enable-prefix-caching \
#     --enable-chunked-prefill \
#     --max-num-seqs 256 \
#     --port 8000

# 2. Call the OpenAI-compatible API from a Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

# Chat Completions API
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how PagedAttention works in vLLM."},
    ],
    temperature=0.7,
    max_tokens=512,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# 3. Offline batch inference (run separately from the server so the two do not compete for the same GPUs)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192,
    gpu_memory_utilization=0.92,
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)
prompts = [f"Question {i}: What is {topic}?"
           for i, topic in enumerate(["ML", "DL", "NLP", "CV", "RL"])]

outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(f"[{out.request_id}] {out.outputs[0].text[:80]}...")

Deploying vLLM to Production on Kubernetes

The vLLM project provides an official Production Stack that supports multi-model serving, autoscaling, and load balancing on Kubernetes.

# vllm-production-stack-values.yaml
# vLLM Production Stack Helm chart configuration
servingEngineSpec:
  runtimeClassName: nvidia
  modelSpec:
    - name: 'llama-70b'
      repository: 'vllm/vllm-openai'
      tag: 'latest'
      modelURL: 'meta-llama/Llama-3.1-70B-Instruct'
      replicaCount: 2
      requestCPU: 8
      requestMemory: '64Gi'
      requestGPU: 4
      gpuType: 'nvidia.com/gpu'
      tensorParallelSize: 4
      maxModelLen: 8192
      extraArgs:
        - '--enable-prefix-caching'
        - '--enable-chunked-prefill'
        - '--gpu-memory-utilization=0.92'
        - '--max-num-seqs=256'

      hpa:
        enabled: true
        minReplicas: 2
        maxReplicas: 8
        targetValue: '70' # target: 70% GPU utilization

routerSpec:
  repository: 'vllm/production-stack-router'
  tag: 'latest'
  replicaCount: 2
  requestCPU: 4
  requestMemory: '8Gi'
  routingStrategy: 'prefix-aware' # prefix-cache-friendly routing

# Prometheus metrics collection settings
monitoring:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
      interval: '15s'
  grafana:
    enabled: true
    dashboards:
      - name: 'vllm-serving'
        url: 'https://grafana.com/grafana/dashboards/vllm'

# Deploy the vLLM Production Stack with Helm
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

helm install vllm-serving vllm/vllm-stack \
    -f vllm-production-stack-values.yaml \
    --namespace llm-serving \
    --create-namespace

# Check the deployment status
kubectl get pods -n llm-serving
kubectl logs -f deploy/vllm-serving-llama-70b -n llm-serving

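SGLang Deep Dive

RadixAttention: An Innovation in KV Cache Reuse

SGLang is an inference engine developed by the LMSYS team (UC Berkeley). Its key differentiator is RadixAttention, a mechanism that reuses the KV cache automatically. It was presented at NeurIPS 2024 and achieved up to 5x higher throughput on certain workloads.

The core ideas behind RadixAttention:

- Radix-tree-based KV cache management: the KV cache of every request is stored in a single radix tree. Requests that share a common prefix reuse the cached entries automatically.
- LRU cache policy: the KV cache of frequently used prefixes stays in memory, and older entries are evicted automatically.
- Automatic prefix detection: the system detects common prefixes and reuses their KV cache even when the user does not declare a prefix explicitly.

This mechanism is especially effective for workloads such as:

- Few-shot prompting: the same examples (system prompt plus few-shot examples) followed by varying questions
- Multi-turn conversation: follow-up requests that share the earlier conversation history
- Tree of Thought: exploring several reasoning branches from the same prompt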
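
The prefix-sharing behavior described above can be sketched with a small prefix tree. This is a simplified illustration rather than SGLang's data structure: it stores one node per token (a real radix tree merges single-child chains), tracks no actual KV tensors, and every name in it is hypothetical.

```python
import itertools

class PrefixCache:
    """Toy token-level prefix tree with least-recently-used leaf eviction."""

    def __init__(self):
        self.root = {"children": {}, "last_used": 0}
        self.clock = itertools.count(1)

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached entries."""
        node, matched, now = self.root, 0, next(self.clock)
        for tok in tokens:
            child = node["children"].get(tok)
            if child is None:
                break
            child["last_used"] = now
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record a processed sequence so later requests can reuse its prefix."""
        node, now = self.root, next(self.clock)
        for tok in tokens:
            node = node["children"].setdefault(tok, {"children": {}, "last_used": now})
            node["last_used"] = now

    def evict_lru_leaf(self):
        """Drop the least recently used leaf, mirroring an LRU eviction policy."""
        best = None  # (last_used, parent node, token)
        stack = [self.root]
        while stack:
            node = stack.pop()
            for tok, child in node["children"].items():
                if child["children"]:
                    stack.append(child)
                elif best is None or child["last_used"] < best[0]:
                    best = (child["last_used"], node, tok)
        if best is not None:
            del best[1]["children"][best[2]]

cache = PrefixCache()
system = ["<sys>", "You", "are", "helpful"]
cache.insert(system + ["What", "is", "CQRS?"])
hit = cache.match_prefix(system + ["What", "is", "CRUD?"])
print(f"reused {hit} of 7 tokens")
```

Only the final token of the second request needs a fresh prefill here; no prefix had to be declared in advance, which is the "automatic" part of the mechanism.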

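Compressed Finite State Machine (Structured Generation)

Another core SGLang feature is efficient structured output generation. Constraints such as a JSON Schema or a regular expression are compiled into a compressed finite state machine, which minimizes the overhead of enforcing them. The compression merges chains of transitions that allow only one continuation, so the decoder can emit those forced tokens in one step instead of spending a model call on each.

Existing structured output engines (Outlines, Guidance, and others) apply a mask over the entire vocabulary at every generated token, which adds considerable overhead. By compressing the state machine ahead of time, SGLang speeds up the masking step during decoding by up to 300x.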
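
The effect of compressing forced transitions can be shown on a tiny character-level state machine. This is a toy sketch under stated assumptions: SGLang operates on tokens and compiles real schemas, whereas the template, function names, and state numbering below are hypothetical and chosen only to show the jump-forward idea.

```python
def build_template_fsm():
    """Character-level FSM accepting {"score": D} where D is a single digit 1-9."""
    fsm, state = {}, 0
    for ch in '{"score": ':
        fsm[state] = {ch: state + 1}  # exactly one legal character: a forced step
        state += 1
    fsm[state] = {d: state + 1 for d in "123456789"}  # the only real choice
    fsm[state + 1] = {"}": state + 2}
    fsm[state + 2] = {}  # accepting state
    return fsm

def jump_forward(fsm, state):
    """Follow transitions while exactly one character is allowed; no model call needed."""
    forced = []
    while len(fsm[state]) == 1:
        ch, state = next(iter(fsm[state].items()))
        forced.append(ch)
    return "".join(forced), state

fsm = build_template_fsm()
prefix, state = jump_forward(fsm, 0)
print(repr(prefix), "then choose one of", sorted(fsm[state]))

# Pretend the model picked "7"; the closing brace is forced again
state = fsm[state]["7"]
suffix, state = jump_forward(fsm, state)
print(repr(prefix + "7" + suffix))
```

In this example the model is consulted once, for the digit, while the eleven other characters are emitted without sampling.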

# Running the SGLang server and using the frontend
# 1. Start the server
# python -m sglang.launch_server \
#     --model-path meta-llama/Llama-3.1-70B-Instruct \
#     --tp 4 \
#     --mem-fraction-static 0.88 \
#     --chunked-prefill-size 8192 \
#     --enable-torch-compile \
#     --port 30000

# 2. SGLang frontend (Python DSL)
import sglang as sgl

@sgl.function
def multi_turn_qa(s, system_prompt, questions):
    s += sgl.system(system_prompt)
    for i, q in enumerate(questions):
        s += sgl.user(q)
        # Use a distinct variable name per turn so every answer can be read back later
        s += sgl.assistant(sgl.gen(f"answer_{i}", max_tokens=256, temperature=0.7))

# RadixAttention automatically reuses the KV cache of the system prompt
runtime = sgl.Runtime(
    model_path="meta-llama/Llama-3.1-70B-Instruct",
    tp_size=4,
)
sgl.set_default_backend(runtime)

system = "You are an expert system architect. Provide concise technical answers."
questions_batch = [
    ["What is CQRS?", "How does event sourcing work?"],
    ["What is CQRS?", "When should I avoid CQRS?"],
    ["What is CQRS?", "Compare CQRS with traditional CRUD"],
]

# All three requests share the KV cache for the system prompt and "What is CQRS?"
# run_batch submits the requests concurrently and returns one program state per request
states = multi_turn_qa.run_batch(
    [{"system_prompt": system, "questions": qs} for qs in questions_batch],
    num_threads=8,
)

for i, (state, qs) in enumerate(zip(states, questions_batch)):
    answers = [state[f"answer_{j}"] for j in range(len(qs))]
    print(f"Batch {i}: {len(answers)} answers generated")

runtime.shutdown()

SGLang Structured Output Generation

# Structured output in SGLang (JSON Schema based)
import json
from typing import List

import sglang as sgl
from pydantic import BaseModel

class CodeReview(BaseModel):
    file_name: str
    severity: str  # "critical", "warning", "info"
    line_number: int
    issue: str
    suggestion: str

class ReviewResult(BaseModel):
    reviews: List[CodeReview]
    overall_score: int  # 1-10
    summary: str

# Assumes the SGLang server from the previous example is listening on port 30000
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def structured_code_review(s, code_snippet):
    s += sgl.system(
        "You are a senior code reviewer. Analyze the given code and provide "
        "structured feedback in JSON format."
    )
    s += sgl.user(f"Review this code:\n```python\n{code_snippet}\n```")
    s += sgl.assistant(
        sgl.gen(
            "review",
            max_tokens=1024,
            temperature=0.0,
            # The JSON Schema constraint is passed as a JSON string (recent SGLang releases);
            # the regex argument only accepts a regular expression, not a schema dict
            json_schema=json.dumps(ReviewResult.model_json_schema()),
        )
    )

# The compressed FSM compiles the JSON Schema into a state machine
# so that decoding can only produce JSON that matches the schema
result = structured_code_review(
    code_snippet="""
def process_data(data):
    result = []
    for i in range(len(data)):
        if data[i] > 0:
            result.append(data[i] * 2)
    return result
"""
)

review = json.loads(result["review"])
print(f"Overall Score: {review['overall_score']}/10")
print(f"Issues Found: {len(review['reviews'])}")
for r in review["reviews"]:
    print(f"  [{r['severity']}] Line {r['line_number']}: {r['issue']}")

Benchmark Comparison of the Three Frameworks

Core Feature Comparison

Item	TensorRT-LLM	vLLM	SGLang
Developer	NVIDIA	UC Berkeley / vLLM Inc.	LMSYS (UC Berkeley)
License	Apache 2.0	Apache 2.0	Apache 2.0
Batching	In-flight Batching	Continuous Batching	Continuous Batching
KV cache management	Paged KV Cache	PagedAttention	RadixAttention
Quantization support	FP8, FP4, INT4 AWQ, INT8 SQ	AWQ, GPTQ, FP8, bitsandbytes	AWQ, GPTQ, FP8, FP16
Speculative decoding	Native support	Supported (draft model, Eagle)	Supported (Eagle, draft model)
Structured output	Requires external integration	Outlines integration	Compressed FSM (native)
Prefix caching	Paged KV Cache Reuse	Prefix Caching	RadixAttention (automatic)
API compatibility	Triton / OpenAI-compatible	OpenAI-compatible (native)	OpenAI-compatible (native)
Multi-GPU	TP and PP	TP and PP	TP; limited PP
Hardware support	NVIDIA only	NVIDIA, AMD (ROCm), TPU, AWS Neuron	NVIDIA, AMD (ROCm)

Throughput Benchmark on H100 (Llama-3.1-70B, TP=4)

Metric	TensorRT-LLM	vLLM	SGLang
Throughput (req/s, 64 concurrent)	42.3	38.7	41.5
Throughput (req/s, 128 concurrent)	68.1	62.4	66.8
TTFT p50 (ms)	89	112	95
TTFT p99 (ms)	245	310	268
ITL p50 (ms/token)	12.1	14.8	13.2
ITL p99 (ms/token)	28.3	35.2	30.1
GPU memory utilization	91%	89%	87%
TTFT reduction on prefix cache hit	35%	42%	65%
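
For readers reproducing such measurements, TTFT and ITL (inter-token latency) can be derived from per-token arrival timestamps recorded by a streaming client. The sketch below uses hypothetical timestamps and a nearest-rank percentile; it is one reasonable way to compute these metrics, not the method used to produce the table.

```python
import math

def latency_metrics(request_start, token_times):
    """Return (TTFT, list of inter-token latencies) in seconds for one request."""
    ttft = token_times[0] - request_start
    itls = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itls

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical log: (request start, arrival time of each streamed token), in seconds
requests_log = [
    (0.000, [0.090, 0.102, 0.115, 0.127]),
    (0.010, [0.120, 0.133, 0.147, 0.190]),
    (0.020, [0.260, 0.272, 0.285, 0.297]),
]

ttfts, all_itls = [], []
for start, times in requests_log:
    ttft, itls = latency_metrics(start, times)
    ttfts.append(ttft)
    all_itls.extend(itls)

print(f"TTFT p50: {percentile(ttfts, 50) * 1000:.0f} ms")
print(f"ITL p99: {percentile(all_itls, 99) * 1000:.0f} ms/token")
```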

Throughput by Model Size (H100 80GB, 64 concurrent requests, FP16)

Model size	Metric	TensorRT-LLM	vLLM	SGLang
7B (TP=1)	Throughput (req/s)	185.2	168.4	178.9
7B (TP=1)	TTFT p50 (ms)	32	41	35
13B (TP=1)	Throughput (req/s)	112.8	101.5	108.3
13B (TP=1)	TTFT p50 (ms)	48	58	52
70B (TP=4)	Throughput (req/s)	42.3	38.7	41.5
70B (TP=4)	TTFT p50 (ms)	89	112	95

Performance by Concurrency Level (Llama-3.1-70B, TP=4)

Concurrent requests	TensorRT-LLM (req/s)	vLLM (req/s)	SGLang (req/s)
16	18.5	17.2	17.8
32	32.1	29.8	31.4
64	42.3	38.7	41.5
128	68.1	62.4	66.8
256	82.7	76.3	80.1
512	89.2	83.1	86.5

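Key takeaways from the benchmark results:

- TensorRT-LLM leads in absolute throughput and latency, thanks to the depth of its hardware optimization. It is limited to NVIDIA GPUs, however, and its build process is complex.
- vLLM trails slightly on throughput, but its 42% TTFT reduction on prefix cache hits is strong. Its main advantages are the broadest hardware support and a mature production stack.
- SGLang dominates on prefix cache hits with a 65% TTFT reduction. RadixAttention gives it the highest efficiency on repetitive prompt patterns (few-shot, multi-turn).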
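
The concurrency table can also be read in terms of latency. Assuming the benchmark was a closed-loop test in which the stated number of requests was always in flight (an assumption, since the methodology is not described), Little's law gives the mean end-to-end latency as concurrency divided by throughput. The sketch applies it to three TensorRT-LLM rows from the table above.

```python
def mean_latency_s(concurrency, throughput_rps):
    """Little's law: mean time in system = requests in flight / completion rate."""
    return concurrency / throughput_rps

# TensorRT-LLM rows from the concurrency table (concurrent requests -> req/s)
trtllm_rows = {16: 18.5, 64: 42.3, 512: 89.2}

for concurrency, rps in trtllm_rows.items():
    latency = mean_latency_s(concurrency, rps)
    print(f"{concurrency:>3} concurrent: {rps} req/s, ~{latency:.2f} s per request")
```

Read this way, going from 64 to 512 concurrent requests roughly doubles throughput while the implied per-request latency grows several times over, so the right operating point depends on the latency budget of the service.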

Production Deployment Architecture

GPU Node Scheduling Strategy

Efficient allocation of GPU resources is the key concern when managing LLM serving nodes in production.

# Kubernetes GPU node scheduling: node selector, tolerations, and topology settings
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving-70b
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-serving
      model: llama-70b
  template:
    metadata:
      labels:
        app: llm-serving
        model: llama-70b
    spec:
      # Schedule only onto GPU nodes
      nodeSelector:
        nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Exists'
          effect: 'NoSchedule'
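      # TP=4 runs inside one pod, so all four GPUs come from the same node
      # The constraint below spreads replicas across nodes; it does not by itself guarantee NVLink-connected GPUs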
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: llm-serving
      containers:
        - name: vllm-engine
          image: vllm/vllm-openai:latest
          args:
            - '--model=meta-llama/Llama-3.1-70B-Instruct'
            - '--tensor-parallel-size=4'
            - '--max-model-len=8192'
            - '--gpu-memory-utilization=0.92'
            - '--enable-prefix-caching'
            - '--port=8000'
          resources:
            limits:
              nvidia.com/gpu: 4
              memory: '128Gi'
              cpu: '16'
            requests:
              nvidia.com/gpu: 4
              memory: '96Gi'
              cpu: '12'
          ports:
            # vLLM serves Prometheus metrics at /metrics on the same port as the API
            - containerPort: 8000
              name: http
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
            failureThreshold: 3

Autoscaling and Monitoring

# KEDA ScaledObject: autoscaling driven by GPU and queue metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: llm-serving-70b
  minReplicaCount: 2
  maxReplicaCount: 8
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    # Scale on GPU utilization
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: gpu_utilization
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{
            namespace="llm-serving",
            pod=~"llm-serving-70b.*"
          })
        threshold: '75'
    # Scale on queue depth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: pending_requests
        query: |
          sum(vllm:num_requests_waiting{
            namespace="llm-serving"
          })
        threshold: '50'
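
To reason about what a ScaledObject like this will request, it helps to model the replica calculation. The sketch assumes KEDA's default behavior for Prometheus triggers, where the threshold is treated as a per-replica average target and the trigger proposing the most replicas wins; the function name and the metric readings are hypothetical.

```python
import math

def desired_replicas(triggers, min_replicas, max_replicas):
    """triggers: (metric_value, threshold) pairs; the largest proposal wins."""
    proposals = [math.ceil(value / threshold) for value, threshold in triggers]
    return max(min_replicas, min(max_replicas, max(proposals)))

# Hypothetical readings: 82% average GPU utilization, 230 requests waiting
gpu_trigger = (82, 75)
queue_trigger = (230, 50)
print(desired_replicas([gpu_trigger, queue_trigger], min_replicas=2, max_replicas=8))
```

Under this model the utilization trigger can never ask for more than ceil(100 / 75) = 2 replicas, because its query already averages across pods, so the queue-depth trigger is what actually drives scale-out.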

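# Custom Prometheus metrics exporter
import time

import requests
from prometheus_client import Gauge, start_http_server
from prometheus_client.parser import text_string_to_metric_families

# Gauges derived from the vLLM / SGLang metrics endpoint
TTFT_P50 = Gauge("llm_ttft_p50_ms", "Time to First Token p50 in ms")
TTFT_P99 = Gauge("llm_ttft_p99_ms", "Time to First Token p99 in ms")
GPU_KV_CACHE_USAGE = Gauge("llm_gpu_kv_cache_usage", "KV Cache usage ratio")
ACTIVE_REQUESTS = Gauge("llm_active_requests", "Number of active requests")
PENDING_REQUESTS = Gauge("llm_pending_requests", "Number of pending requests")

# vLLM gauges that can be copied through unchanged
PASSTHROUGH = {
    "vllm:num_requests_running": ACTIVE_REQUESTS,
    "vllm:num_requests_waiting": PENDING_REQUESTS,
    "vllm:gpu_cache_usage_perc": GPU_KV_CACHE_USAGE,
}

def histogram_quantile(buckets, q):
    """Estimate quantile q from cumulative (upper_bound, count) histogram buckets."""
    buckets = sorted(buckets)
    if not buckets or buckets[-1][1] == 0:
        return None
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

def collect_vllm_metrics(base_url="http://localhost:8000"):
    """Scrape the vLLM /metrics endpoint and update the gauges."""
    try:
        resp = requests.get(f"{base_url}/metrics", timeout=5)
        resp.raise_for_status()
        ttft_buckets = []
        for family in text_string_to_metric_families(resp.text):
            for sample in family.samples:
                if sample.name in PASSTHROUGH:
                    PASSTHROUGH[sample.name].set(sample.value)
                elif sample.name == "vllm:time_to_first_token_seconds_bucket":
                    # TTFT is exported as a histogram, so percentiles come from its buckets
                    ttft_buckets.append((float(sample.labels["le"]), sample.value))
        for gauge, q in ((TTFT_P50, 0.50), (TTFT_P99, 0.99)):
            value = histogram_quantile(ttft_buckets, q)
            if value is not None:
                gauge.set(value * 1000)  # seconds -> milliseconds
    except Exception as e:
        print(f"Metric collection failed: {e}")

if __name__ == "__main__":
    start_http_server(9090)
    while True:
        collect_vllm_metrics()
        time.sleep(15)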

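Failure Cases and Troubleshooting

Case 1: Serving Outage Caused by OOM, and Memory Management

Situation: While serving Llama-3.1-70B with vLLM on 4xH100, a CUDA OOM occurred once concurrent requests exceeded 200, and the entire serving process terminated abnormally.

Root cause analysis:

- gpu-memory-utilization was set to 0.95, leaving almost no spare memory
- Some requests produced much longer outputs than expected, so KV cache usage grew sharply
- Prefix caching was disabled, so the KV cache for the identical system prompt was allocated once per request

Resolution steps:

- Lower gpu-memory-utilization to 0.90 to restore memory headroom
- Reduce max-num-seqs from 256 to 128 to cap the number of concurrent sequences
- Enable enable-prefix-caching so that the system prompt's KV cache is shared
- Cap max-tokens at 2048 to bound the KV cache of each individual request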
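
The interaction between these settings can be estimated before restarting. The sketch below is a rough capacity model under explicit assumptions: 80 GiB per GPU, 70e9 FP16 parameters, 320 KiB of KV cache per token (80 layers, 8 KV heads, head dimension 128), and no allowance for activations, CUDA graphs, or communication buffers, all of which reduce the real budget.

```python
GIB = 1024 ** 3

def kv_token_budget(num_gpus, gpu_mem_gib, utilization, weights_gib, kv_bytes_per_token):
    """Tokens of KV cache that fit after the model weights, ignoring other overhead."""
    usable_gib = num_gpus * gpu_mem_gib * utilization - weights_gib
    return int(usable_gib * GIB // kv_bytes_per_token)

# Assumptions: 4 x 80 GiB GPUs, 70e9 FP16 parameters, 320 KiB of KV cache per token
WEIGHTS_GIB = 70e9 * 2 / GIB
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2

budget = kv_token_budget(4, 80, 0.90, WEIGHTS_GIB, KV_BYTES_PER_TOKEN)
print(f"KV budget at 0.90 utilization: ~{budget} tokens")
for max_num_seqs in (256, 128):
    print(f"max-num-seqs={max_num_seqs}: ~{budget // max_num_seqs} tokens per sequence")
```

With these numbers, 256 concurrent sequences leave only about 2,000 tokens each, while 128 sequences leave about 4,000, which is why lowering max-num-seqs and capping output length work together.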

# Recovery checklist after an OOM
# 1. Check the current GPU memory state
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
    --format=csv,noheader,nounits

# 2. Check vLLM request and cache metrics
curl -s http://localhost:8000/metrics | \
    grep -E "vllm:(num_requests|gpu_cache|cpu_cache)"

# 3. Restart with safer settings
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 128 \
    --max-model-len 4096 \
    --enable-prefix-caching \
    --enable-chunked-prefill

# 4. On Kubernetes, update the container argument and wait for the rollout
#    vLLM reads this value from --gpu-memory-utilization rather than an environment variable;
#    index 3 matches the position of that argument in the Deployment shown earlier
kubectl patch deploy/llm-serving-70b -n llm-serving --type=json \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/3", "value": "--gpu-memory-utilization=0.90"}]'
kubectl rollout status deploy/llm-serving-70b -n llm-serving

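Case 2: Debugging Accuracy Loss in a Quantized Model

Situation: Llama-3.1-70B was quantized to INT4 AWQ and deployed with TensorRT-LLM, but response quality in certain domains (medical, legal) dropped noticeably compared with FP16.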