Complete Guide to AI Model Deployment & Serving: Triton, vLLM, BentoML, and Kubernetes
Introduction
Training an AI model in a research environment and serving it reliably in production are entirely different challenges. Alongside model accuracy, low latency, high throughput, and stable scaling are equally critical. This guide walks through the full lifecycle of AI model deployment with a production-first mindset.
1. Serving Architecture Patterns
1.1 Online Serving vs Batch Serving
Online Serving handles user requests in real time with strict latency targets.
- Latency target: P99 under 200ms
- Use cases: recommendation systems, chatbots, real-time image classification
- Infrastructure: REST API / gRPC endpoints, autoscaling replicas
Batch Serving processes large volumes of data in scheduled jobs.
- Latency target: minutes to hours
- Use cases: nightly scoring pipelines, offline recommendation generation
- Infrastructure: Spark jobs, Airflow DAGs, large-scale GPU batch jobs
1.2 Synchronous vs Asynchronous Serving
| Mode | Characteristics | Best Scenario |
|---|---|---|
| Sync | Request waits for response | Latency-sensitive APIs |
| Async | Work queue + result polling | Long inference jobs, LLMs |
| Streaming | Incremental token-by-token response | LLM chat, code generation |
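The async row of the table can be sketched in a few lines: a submit call returns a job id immediately, a background worker drains a queue, and the client polls for the result. This is a minimal in-process sketch; the `jobs` dict and `queue` stand in for a real job store and broker such as Redis or RabbitMQ.

```python
import asyncio
import uuid

# In-memory stand-ins for a real job store and message broker
jobs: dict[str, dict] = {}

async def submit(queue: asyncio.Queue, payload: str) -> str:
    # Register the job and enqueue it; return the id without waiting
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    await queue.put((job_id, payload))
    return job_id

async def worker(queue: asyncio.Queue):
    while True:
        job_id, payload = await queue.get()
        await asyncio.sleep(0.01)  # stand-in for a long-running model call
        jobs[job_id] = {"status": "done", "result": payload.upper()}

async def main() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    w = asyncio.create_task(worker(queue))
    job_id = await submit(queue, "hello")
    while jobs[job_id]["status"] != "done":  # client-side polling loop
        await asyncio.sleep(0.005)
    w.cancel()
    return jobs[job_id]["result"]

print(asyncio.run(main()))  # → HELLO
```

In a real service, `submit` and the polling loop would be two HTTP endpoints (`POST /jobs`, `GET /jobs/{id}`) so the client never holds a connection open for the full inference time.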
1.3 Streaming Responses with Server-Sent Events
Streaming dramatically improves perceived responsiveness for LLM serving. Time To First Token (TTFT) is the key user-facing metric.
import json

import httpx

async def stream_llm_response(prompt: str):
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/v1/completions",
            json={
                "model": "llama-3-8b",
                "prompt": prompt,
                "max_tokens": 512,
                "stream": True
            },
            timeout=60.0
        ) as response:
            async for line in response.aiter_lines():
                # SSE frames look like: data: {...json payload...}
                if line.startswith("data: "):
                    data = line[6:]
                    if data != "[DONE]":
                        chunk = json.loads(data)
                        token = chunk["choices"][0]["text"]
                        print(token, end="", flush=True)
2. Containerization: Docker for GPU Serving
2.1 Multi-Stage Builds for Optimized Images
Production images should exclude build-time dependencies and include only the minimal runtime footprint.
# ---- Stage 1: Builder ----
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder

WORKDIR /build

RUN apt-get update && apt-get install -y \
        python3.11 \
        python3.11-dev \
        python3-pip \
        git \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ---- Stage 2: Runtime ----
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

ENV PYTHONUNBUFFERED=1
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

WORKDIR /app

RUN apt-get update && apt-get install -y \
        python3.11 \
        python3-pip \
        libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy only the packages installed in the builder stage
# (--from=builder is required; without it COPY reads from the build context)
COPY --from=builder /install /usr/local
COPY . .

# Run as a non-root user for security
RUN useradd -m -u 1000 mluser
USER mluser

EXPOSE 8080
CMD ["python3.11", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
2.2 NVIDIA Container Toolkit Setup
GPU containers require the NVIDIA Container Toolkit to be installed on the host machine.
# Install NVIDIA Container Toolkit on Ubuntu
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor \
-o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify GPU container access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
3. Kubernetes ML Deployment
3.1 GPU Deployment + HPA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-a10g
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: model-server
          image: myregistry/model-server:v1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: '2'
              memory: '8Gi'
              nvidia.com/gpu: '1'
            limits:
              cpu: '4'
              memory: '16Gi'
              nvidia.com/gpu: '1'
          env:
            - name: MODEL_PATH
              value: '/models/llama-3-8b'
            - name: MAX_BATCH_SIZE
              value: '32'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: model_requests_per_second
        target:
          type: AverageValue
          averageValue: '50'
3.2 Karpenter Node Autoscaling
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    metadata:
      labels:
        accelerator: nvidia-tesla-a10g
    spec:
      nodeClassRef:
        name: gpu-nodeclass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['g5.xlarge', 'g5.2xlarge', 'g5.4xlarge']
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    nvidia.com/gpu: '20'
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
4. AI Serving Frameworks Compared
4.1 NVIDIA Triton Inference Server
Triton supports multiple model formats (TensorRT, ONNX, PyTorch, TensorFlow) in a single server. Its Dynamic Batching feature automatically groups incoming requests to maximize GPU utilization.
# config.pbtxt
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 2 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
4.2 BentoML Service Definition
BentoML is a Python-native serving framework that supports both rapid prototyping and production deployment.
import bentoml
import numpy as np
from bentoml.io import JSON
from pydantic import BaseModel
from typing import List

class InferenceRequest(BaseModel):
    texts: List[str]
    top_k: int = 5

class InferenceResponse(BaseModel):
    labels: List[str]
    scores: List[float]

# Create the model runner
classifier_runner = bentoml.pytorch.get("text_classifier:latest").to_runner()
svc = bentoml.Service("text_classification_svc", runners=[classifier_runner])

@svc.api(input=JSON(pydantic_model=InferenceRequest),
         output=JSON(pydantic_model=InferenceResponse))
async def classify(request: InferenceRequest) -> InferenceResponse:
    # Run batched inference through the runner
    batch_results = await classifier_runner.async_run(request.texts)
    labels = []
    scores = []
    for result in batch_results:
        # Take the top_k class indices by score, highest first
        top_idx = np.argsort(result)[-request.top_k:][::-1]
        labels.extend([f"label_{i}" for i in top_idx])
        scores.extend(result[top_idx].tolist())
    return InferenceResponse(labels=labels, scores=scores)
4.3 Framework Comparison Summary
| Framework | Strengths | Weaknesses | Best Scenario |
|---|---|---|---|
| Triton | Top performance, multi-format | Complex configuration | High-throughput GPU serving |
| BentoML | Easy to use, packaging | Lower perf vs Triton | Fast MVP, small teams |
| Ray Serve | Distributed, pipeline support | Steep learning curve | Complex ML pipelines |
| TorchServe | Native PyTorch integration | Single-framework only | PyTorch-only deployments |
5. LLM Serving: vLLM and TGI
5.1 vLLM — High-Performance LLM Serving with PagedAttention
vLLM uses the PagedAttention algorithm to manage KV cache memory like virtual memory in an OS, minimizing GPU memory waste.
import json
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI(title="vLLM OpenAI-Compatible API")

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # 2-GPU tensor parallelism
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    enable_chunked_prefill=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    messages = request.get("messages", [])
    stream = request.get("stream", False)
    prompt = format_messages(messages)
    sampling_params = SamplingParams(
        temperature=request.get("temperature", 0.7),
        max_tokens=request.get("max_tokens", 512),
        top_p=request.get("top_p", 0.95),
    )
    request_id = str(uuid.uuid4())

    if stream:
        async def generate_stream():
            # vLLM yields the cumulative text generated so far, so track
            # how much has been sent and emit only the new delta each step.
            sent = 0
            async for output in engine.generate(prompt, sampling_params, request_id):
                if output.outputs:
                    text = output.outputs[0].text
                    delta = text[sent:]
                    sent = len(text)
                    if delta:
                        chunk = {
                            "id": request_id,
                            "object": "chat.completion.chunk",
                            "choices": [{"delta": {"content": delta}, "index": 0}]
                        }
                        yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(generate_stream(), media_type="text/event-stream")

    # Non-streaming: drain the generator and return the final output
    final_output = None
    async for output in engine.generate(prompt, sampling_params, request_id):
        final_output = output
    return {
        "choices": [{"message": {"content": final_output.outputs[0].text}}]
    }

def format_messages(messages):
    # Minimal chat template for illustration; a production server should
    # apply the model's own chat template via its tokenizer.
    result = ""
    for msg in messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        result += f"<|{role}|>\n{content}\n"
    return result + "<|assistant|>\n"
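The memory effect of paging is easy to see with a back-of-the-envelope sketch. The numbers below are illustrative assumptions (a 16-token block, an 8192-token context, four hypothetical in-flight sequences), not measurements of vLLM itself:

```python
# Compare KV-cache token slots reserved per sequence: naive contiguous
# pre-allocation at max length vs. fixed-size pages allocated on demand.
BLOCK = 16      # tokens per KV-cache page (vLLM-style block)
MAX_LEN = 8192  # context length a naive allocator reserves up front

def paged_alloc(actual_tokens: int) -> int:
    pages = -(-actual_tokens // BLOCK)  # ceiling division
    return pages * BLOCK

seq_lens = [120, 900, 3000, 47]  # hypothetical in-flight sequences
naive = MAX_LEN * len(seq_lens)
paged = sum(paged_alloc(n) for n in seq_lens)
print(naive, paged)  # → 32768 4096
```

With paging, waste is bounded by at most one partially filled block per sequence (here 29 of 4096 slots), which is where small single-digit waste percentages become possible.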
5.2 TGI (Text Generation Inference) Deployment
Hugging Face TGI can be deployed quickly via Docker.
# Start LLaMA-3 serving with TGI
docker run --gpus all \
-p 8080:80 \
-v /data/models:/data \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
ghcr.io/huggingface/text-generation-inference:2.0.4 \
--model-id meta-llama/Llama-3-8B-Instruct \
--num-shard 2 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--max-batch-prefill-tokens 16384 \
--dtype bfloat16
# Test streaming response
curl http://localhost:8080/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is the capital of France?", "parameters": {"max_new_tokens": 100}}'
6. Performance Optimization Strategies
6.1 Model Warmup
Prevent cold start latency by sending dummy requests when the server starts.
import asyncio
import logging

import httpx

logger = logging.getLogger(__name__)

async def warmup_model(base_url: str, num_warmup_requests: int = 5):
    """Execute model warmup on server startup."""
    dummy_request = {
        "inputs": "warmup",
        "parameters": {"max_new_tokens": 10}
    }
    async with httpx.AsyncClient(timeout=120.0) as client:
        logger.info("Starting model warmup...")
        # Wait for the health endpoint, backing off between every attempt
        # (not only on connection errors, to avoid a busy loop on non-200s)
        for _ in range(30):
            try:
                resp = await client.get(f"{base_url}/health")
                if resp.status_code == 200:
                    break
            except httpx.HTTPError:
                pass
            await asyncio.sleep(2)
        # Send warmup requests concurrently
        tasks = [
            client.post(f"{base_url}/generate", json=dummy_request)
            for _ in range(num_warmup_requests)
        ]
        await asyncio.gather(*tasks, return_exceptions=True)
        logger.info(f"Warmup complete ({num_warmup_requests} requests sent)")
6.2 Dynamic Request Batching
import asyncio
from collections import deque
from dataclasses import dataclass, field
from typing import Any

@dataclass
class BatchRequest:
    request_id: str
    payload: dict
    future: asyncio.Future = field(default_factory=asyncio.Future)

class DynamicBatcher:
    def __init__(self, max_batch_size: int = 32, max_wait_ms: float = 10.0):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: deque = deque()
        self._lock = asyncio.Lock()

    async def add_request(self, request_id: str, payload: dict) -> Any:
        req = BatchRequest(request_id=request_id, payload=payload)
        async with self._lock:
            self.queue.append(req)
        return await req.future

    async def process_batches(self, model_fn):
        while True:
            await asyncio.sleep(self.max_wait_ms / 1000)
            async with self._lock:
                if not self.queue:
                    continue
                batch = []
                while self.queue and len(batch) < self.max_batch_size:
                    batch.append(self.queue.popleft())
            # Run the model outside the lock so new requests can keep queueing
            if batch:
                try:
                    inputs = [r.payload for r in batch]
                    results = await model_fn(inputs)
                    for req, result in zip(batch, results):
                        req.future.set_result(result)
                except Exception as e:
                    for req in batch:
                        req.future.set_exception(e)
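A quick way to sanity-check the batcher is to drive it with a dummy `model_fn`. The snippet below repeats compact versions of the classes (without the error-handling path) so it runs standalone; `double_model` is a stand-in for real batched inference:

```python
import asyncio
from collections import deque
from dataclasses import dataclass, field
from typing import Any

@dataclass
class BatchRequest:
    request_id: str
    payload: dict
    future: asyncio.Future = field(default_factory=asyncio.Future)

class DynamicBatcher:
    def __init__(self, max_batch_size: int = 32, max_wait_ms: float = 10.0):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: deque = deque()
        self._lock = asyncio.Lock()

    async def add_request(self, request_id: str, payload: dict) -> Any:
        req = BatchRequest(request_id=request_id, payload=payload)
        async with self._lock:
            self.queue.append(req)
        return await req.future

    async def process_batches(self, model_fn):
        while True:
            await asyncio.sleep(self.max_wait_ms / 1000)
            async with self._lock:
                n = min(len(self.queue), self.max_batch_size)
                batch = [self.queue.popleft() for _ in range(n)]
            if batch:
                results = await model_fn([r.payload for r in batch])
                for req, result in zip(batch, results):
                    req.future.set_result(result)

async def demo() -> list:
    batcher = DynamicBatcher(max_batch_size=4, max_wait_ms=5.0)

    async def double_model(inputs):
        # Dummy "model": doubles the x value of every request in the batch
        return [{"y": p["x"] * 2} for p in inputs]

    worker = asyncio.create_task(batcher.process_batches(double_model))
    results = await asyncio.gather(
        *(batcher.add_request(str(i), {"x": i}) for i in range(8))
    )
    worker.cancel()
    return [r["y"] for r in results]

print(asyncio.run(demo()))  # → [0, 2, 4, 6, 8, 10, 12, 14]
```

With `max_batch_size=4`, the eight concurrent requests are served in two batches of four, yet every caller simply awaits its own result.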
7. Monitoring: Prometheus Metrics
import time

from fastapi import FastAPI, Request, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

REQUEST_LATENCY = Histogram(
    "model_request_latency_seconds",
    "Model inference latency in seconds",
    ["model_name", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
REQUEST_COUNT = Counter(
    "model_requests_total",
    "Total number of inference requests",
    ["model_name", "endpoint", "status"]
)
ACTIVE_REQUESTS = Gauge(
    "model_active_requests",
    "Number of currently in-flight requests",
    ["model_name"]
)
TOKEN_THROUGHPUT = Counter(
    "model_tokens_generated_total",
    "Total tokens generated",
    ["model_name"]
)
GPU_MEMORY_USED = Gauge(
    "gpu_memory_used_bytes",
    "GPU memory usage in bytes",
    ["gpu_index"]
)

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    model_name = "llama-3-8b"
    endpoint = request.url.path
    ACTIVE_REQUESTS.labels(model_name=model_name).inc()
    start = time.perf_counter()
    try:
        response = await call_next(request)
        status = str(response.status_code)
        REQUEST_COUNT.labels(
            model_name=model_name, endpoint=endpoint, status=status
        ).inc()
        return response
    except Exception:
        REQUEST_COUNT.labels(
            model_name=model_name, endpoint=endpoint, status="500"
        ).inc()
        raise
    finally:
        latency = time.perf_counter() - start
        REQUEST_LATENCY.labels(
            model_name=model_name, endpoint=endpoint
        ).observe(latency)
        ACTIVE_REQUESTS.labels(model_name=model_name).dec()

@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
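Once these metrics are scraped, dashboards and alerts can be built directly on them. For example, per-model P99 latency and token throughput in PromQL (metric names matching the definitions above):

```promql
histogram_quantile(
  0.99,
  sum(rate(model_request_latency_seconds_bucket[5m])) by (le, model_name)
)

sum(rate(model_tokens_generated_total[1m])) by (model_name)
```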
Quiz: AI Model Serving Deep Dive
Q1. How does NVIDIA Triton's dynamic batching improve GPU utilization over simple request batching?
Answer: Dynamic batching automatically groups queued requests on the server side, using preferred_batch_size and max_queue_delay settings to ensure the GPU always executes at optimal batch sizes rather than processing one request at a time.
Explanation: With simple request batching, the client must manually construct batches before sending them. Triton's dynamic batching allows a short queuing window (e.g., up to 5ms) on the server and automatically groups arriving requests. This increases GPU Streaming Multiprocessor (SM) utilization and can multiply throughput several times compared to per-request processing. Combining multiple model instances via instance_group further increases batching opportunities.
Q2. How does vLLM's PagedAttention solve memory fragmentation in LLM serving?
Answer: PagedAttention splits the KV cache into fixed-size "pages" and maps them to non-contiguous physical memory using a block table, similar to OS virtual memory. This eliminates most external and internal fragmentation caused by variable sequence lengths.
Explanation: Traditional LLM serving pre-allocates contiguous memory equal to the maximum sequence length for each request, wasting memory when actual output is shorter. vLLM's PagedAttention divides KV cache into blocks (e.g., 16 tokens per block) and uses a logical-to-physical block table. Memory waste drops below 4%, and the same GPU can handle up to 24x more concurrent requests compared to naive implementations.
Q3. What are the architectural differences between BentoML and Ray Serve, and which deployment scenarios suit each?
Answer: BentoML focuses on single-service packaging and container builds for straightforward deployments, while Ray Serve uses a distributed actor model suited for complex ML pipelines and ensemble inference.
Explanation: BentoML bundles model, dependencies, and API into a single Bento artifact, simplifying Docker image builds and cloud deployments. It fits small teams or single-model APIs well. Ray Serve runs on a Ray cluster and excels at chaining multiple models into pipelines, A/B testing, and complex routing logic. It is the better choice for enterprise-scale distributed inference or ensemble workloads requiring fine-grained resource control.
Q4. Why is GPU metric-based HPA scaling in Kubernetes harder than CPU-based scaling?
Answer: GPU metrics are not supported by Kubernetes' default metrics-server, requiring a separate stack (DCGM Exporter + Prometheus Adapter), and GPUs are integer resources that make fine-grained utilization control difficult.
Explanation: CPU and memory are collected by kubelet natively, but GPU utilization (DCGM_FI_DEV_GPU_UTIL) must be collected by NVIDIA DCGM Exporter, stored in Prometheus, and exposed via the Custom Metrics API by Prometheus Adapter before HPA can consume them. GPU memory held by a process is slow to release, risking OOM during scale-down. GPU node provisioning takes 5 to 10 minutes, far longer than CPU nodes, making proactive scale-out critical.
Q5. Why is P99 latency a more important service quality metric than average latency?
Answer: P99 represents the worst-case response time experienced by 99% of users, while the average hides the poor experience of a tail of users. P99 directly reflects real user dissatisfaction that averages obscure.
Explanation: Even if average latency is 50ms, if 1% of requests take 2000ms, thousands of users per minute receive slow responses at scale. Setting a P99 SLO (e.g., under 200ms) allows teams to detect tail latency issues early. In microservice architectures where services call each other serially, individual P99 values compound, causing "tail latency amplification" that makes monitoring P99 at every layer essential.
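The gap between the two statistics is easy to demonstrate numerically. A toy latency sample with a 2% slow tail (illustrative numbers):

```python
# 980 fast requests at 50ms plus a 2% tail at 2000ms
latencies = [50.0] * 980 + [2000.0] * 20

mean = sum(latencies) / len(latencies)
# Nearest-rank P99: for n = 1000 samples, take index ceil(0.99 * n) - 1
p99 = sorted(latencies)[int(len(latencies) * 0.99) - 1]

print(mean, p99)  # → 89.0 2000.0
```

The average looks healthy at 89ms while one in fifty requests takes two seconds — exactly the failure mode a P99 SLO is designed to catch.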
Conclusion
AI model serving is far more than wrapping a model in an API endpoint. GPU resource management, request batching, streaming responses, Kubernetes autoscaling, and comprehensive monitoring must all work together for reliable production serving. Apply the patterns in this guide incrementally and measure the impact of each improvement with real traffic.