
Complete Guide to AI Model Deployment & Serving: Triton, vLLM, BentoML, and Kubernetes

Introduction

Training an AI model in a research environment and serving it reliably in production are entirely different challenges. Beyond model accuracy, production systems must also deliver low latency, high throughput, and stable scaling. This guide walks through the full lifecycle of AI model deployment with a production-first mindset.


1. Serving Architecture Patterns

1.1 Online Serving vs Batch Serving

Online Serving handles user requests in real time with strict latency targets.

  • Latency target: P99 under 200ms
  • Use cases: recommendation systems, chatbots, real-time image classification
  • Infrastructure: REST API / gRPC endpoints, autoscaling replicas

Batch Serving processes large volumes of data in scheduled jobs.

  • Latency target: minutes to hours
  • Use cases: nightly scoring pipelines, offline recommendation generation
  • Infrastructure: Spark jobs, Airflow DAGs, large-scale GPU batch jobs

1.2 Synchronous vs Asynchronous Serving

| Mode | Characteristics | Best Scenario |
| --- | --- | --- |
| Sync | Request waits for the response | Latency-sensitive APIs |
| Async | Work queue + result polling | Long inference jobs, LLMs |
| Streaming | Incremental token-by-token response | LLM chat, code generation |
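The async row of the table can be sketched as a minimal in-process job queue: `submit` returns a job ID immediately and the client polls for the result instead of blocking. This is a toy illustration only — names like `AsyncJobQueue` are invented here, and a real system would back this with Redis, Celery, or a message broker.

```python
import asyncio
import uuid

class AsyncJobQueue:
    """Toy async-serving pattern: submit() returns a job id at once;
    the client polls for the result instead of holding the request open."""

    def __init__(self):
        self.results: dict[str, dict] = {}

    async def submit(self, payload: str) -> str:
        job_id = str(uuid.uuid4())
        asyncio.create_task(self._run(job_id, payload))
        return job_id

    async def _run(self, job_id: str, payload: str) -> None:
        await asyncio.sleep(0.05)  # stand-in for a long inference call
        self.results[job_id] = {"output": payload.upper()}

    def poll(self, job_id: str):
        # Returns None until the job has finished
        return self.results.get(job_id)

async def demo():
    queue = AsyncJobQueue()
    job_id = await queue.submit("hello")
    assert queue.poll(job_id) is None  # not done yet
    await asyncio.sleep(0.1)
    return queue.poll(job_id)

print(asyncio.run(demo()))  # {'output': 'HELLO'}
```

The key property is that the request path never waits on inference, which is why this pattern suits long-running LLM jobs.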

1.3 Streaming Responses with Server-Sent Events

Streaming dramatically improves perceived responsiveness for LLM serving. Time To First Token (TTFT) is the key user-facing metric.

import httpx
import asyncio
import json

async def stream_llm_response(prompt: str):
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/v1/completions",
            json={
                "model": "llama-3-8b",
                "prompt": prompt,
                "max_tokens": 512,
                "stream": True
            },
            timeout=60.0
        ) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data = line[len("data: "):]
                    if data.strip() == "[DONE]":
                        break
                    chunk = json.loads(data)
                    token = chunk["choices"][0]["text"]
                    print(token, end="", flush=True)

2. Containerization: Docker for GPU Serving

2.1 Multi-Stage Builds for Optimized Images

Production images should exclude build-time dependencies and include only the minimal runtime footprint.

# ---- Stage 1: Builder ----
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder

WORKDIR /build

RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN python3.11 -m pip install --no-cache-dir --prefix=/install -r requirements.txt

# ---- Stage 2: Runtime ----
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

ENV PYTHONUNBUFFERED=1
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

WORKDIR /app

RUN apt-get update && apt-get install -y \
    python3.11 \
    python3-pip \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy only installed packages from the builder stage
COPY --from=builder /install /usr/local
COPY . .

# Run as non-root user for security
RUN useradd -m -u 1000 mluser
USER mluser

EXPOSE 8080
CMD ["python3.11", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]

2.2 NVIDIA Container Toolkit Setup

GPU containers require the NVIDIA Container Toolkit to be installed on the host machine.

# Install NVIDIA Container Toolkit on Ubuntu
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor \
  -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU container access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

3. Kubernetes ML Deployment

3.1 GPU Deployment + HPA

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-a10g
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: model-server
          image: myregistry/model-server:v1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: '2'
              memory: '8Gi'
              nvidia.com/gpu: '1'
            limits:
              cpu: '4'
              memory: '16Gi'
              nvidia.com/gpu: '1'
          env:
            - name: MODEL_PATH
              value: '/models/llama-3-8b'
            - name: MAX_BATCH_SIZE
              value: '32'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: model_requests_per_second
        target:
          type: AverageValue
          averageValue: '50'

3.2 Karpenter Node Autoscaling

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    metadata:
      labels:
        accelerator: nvidia-tesla-a10g
    spec:
      nodeClassRef:
        name: gpu-nodeclass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['g5.xlarge', 'g5.2xlarge', 'g5.4xlarge']
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    nvidia.com/gpu: '20'
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s

4. AI Serving Frameworks Compared

4.1 NVIDIA Triton Inference Server

Triton supports multiple model formats (TensorRT, ONNX, PyTorch, TensorFlow) in a single server. Its Dynamic Batching feature automatically groups incoming requests to maximize GPU utilization.

# config.pbtxt
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 2 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

4.2 BentoML Service Definition

BentoML is a Python-native serving framework that supports both rapid prototyping and production deployment.

import bentoml
from bentoml.io import JSON
from pydantic import BaseModel
import numpy as np
from typing import List

class InferenceRequest(BaseModel):
    texts: List[str]
    top_k: int = 5

class InferenceResponse(BaseModel):
    labels: List[str]
    scores: List[float]

# Create model runner
classifier_runner = bentoml.pytorch.get("text_classifier:latest").to_runner()

svc = bentoml.Service("text_classification_svc", runners=[classifier_runner])

@svc.api(input=JSON(pydantic_model=InferenceRequest),
         output=JSON(pydantic_model=InferenceResponse))
async def classify(request: InferenceRequest) -> InferenceResponse:
    batch_results = await classifier_runner.async_run(request.texts)

    labels = []
    scores = []
    for result in batch_results:
        top_idx = np.argsort(result)[-request.top_k:][::-1]
        labels.extend([f"label_{i}" for i in top_idx])
        scores.extend(result[top_idx].tolist())

    return InferenceResponse(labels=labels, scores=scores)

4.3 Framework Comparison Summary

| Framework | Strengths | Weaknesses | Best Scenario |
| --- | --- | --- | --- |
| Triton | Top performance, multi-format | Complex configuration | High-throughput GPU serving |
| BentoML | Easy to use, packaging | Lower perf vs Triton | Fast MVP, small teams |
| Ray Serve | Distributed, pipeline support | Steep learning curve | Complex ML pipelines |
| TorchServe | Native PyTorch integration | Single-framework only | PyTorch-only deployments |

5. LLM Serving: vLLM and TGI

5.1 vLLM — High-Performance LLM Serving with PagedAttention

vLLM uses the PagedAttention algorithm to manage KV cache memory like virtual memory in an OS, minimizing GPU memory waste.

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import uvicorn
import json
import uuid

app = FastAPI(title="vLLM OpenAI-Compatible API")

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,       # 2-GPU tensor parallelism
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    enable_chunked_prefill=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    messages = request.get("messages", [])
    stream = request.get("stream", False)

    prompt = format_messages(messages)
    sampling_params = SamplingParams(
        temperature=request.get("temperature", 0.7),
        max_tokens=request.get("max_tokens", 512),
        top_p=request.get("top_p", 0.95),
    )

    request_id = str(uuid.uuid4())

    if stream:
        async def generate_stream():
            sent = 0  # engine.generate yields cumulative text; emit only the new delta
            async for output in engine.generate(prompt, sampling_params, request_id):
                if output.outputs:
                    text = output.outputs[0].text
                    delta = text[sent:]
                    sent = len(text)
                    if delta:
                        chunk = {
                            "id": request_id,
                            "object": "chat.completion.chunk",
                            "choices": [{"delta": {"content": delta}, "index": 0}]
                        }
                        yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(generate_stream(), media_type="text/event-stream")

    final_output = None
    async for output in engine.generate(prompt, sampling_params, request_id):
        final_output = output

    return {
        "choices": [{"message": {"content": final_output.outputs[0].text}}]
    }

def format_messages(messages):
    result = ""
    for msg in messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        result += f"<|{role}|>\n{content}\n"
    return result + "<|assistant|>\n"

5.2 TGI (Text Generation Inference) Deployment

Hugging Face TGI can be deployed quickly via Docker.

# Start LLaMA-3 serving with TGI
docker run --gpus all \
  -p 8080:80 \
  -v /data/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id meta-llama/Llama-3-8B-Instruct \
  --num-shard 2 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 16384 \
  --dtype bfloat16

# Test streaming response
curl http://localhost:8080/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is the capital of France?", "parameters": {"max_new_tokens": 100}}'

6. Performance Optimization Strategies

6.1 Model Warmup

Prevent cold start latency by sending dummy requests when the server starts.

import asyncio
import httpx
import logging

logger = logging.getLogger(__name__)

async def warmup_model(base_url: str, num_warmup_requests: int = 5):
    """Execute model warmup on server startup."""
    dummy_request = {
        "inputs": "warmup",
        "parameters": {"max_new_tokens": 10}
    }

    async with httpx.AsyncClient(timeout=120.0) as client:
        logger.info("Starting model warmup...")

        # Wait for health check
        for _ in range(30):
            try:
                resp = await client.get(f"{base_url}/health")
                if resp.status_code == 200:
                    break
            except Exception:
                pass
            await asyncio.sleep(2)

        # Send warmup requests
        tasks = [
            client.post(f"{base_url}/generate", json=dummy_request)
            for _ in range(num_warmup_requests)
        ]
        await asyncio.gather(*tasks, return_exceptions=True)
        logger.info(f"Warmup complete ({num_warmup_requests} requests sent)")

6.2 Dynamic Request Batching

import asyncio
from collections import deque
from dataclasses import dataclass, field
from typing import Any

@dataclass
class BatchRequest:
    request_id: str
    payload: dict
    future: asyncio.Future = field(default_factory=asyncio.Future)

class DynamicBatcher:
    def __init__(self, max_batch_size: int = 32, max_wait_ms: float = 10.0):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: deque = deque()
        self._lock = asyncio.Lock()

    async def add_request(self, request_id: str, payload: dict) -> Any:
        req = BatchRequest(request_id=request_id, payload=payload)
        async with self._lock:
            self.queue.append(req)
        return await req.future

    async def process_batches(self, model_fn):
        while True:
            await asyncio.sleep(self.max_wait_ms / 1000)
            async with self._lock:
                if not self.queue:
                    continue
                batch = []
                while self.queue and len(batch) < self.max_batch_size:
                    batch.append(self.queue.popleft())

            if batch:
                try:
                    inputs = [r.payload for r in batch]
                    results = await model_fn(inputs)
                    for req, result in zip(batch, results):
                        req.future.set_result(result)
                except Exception as e:
                    for req in batch:
                        req.future.set_exception(e)

7. Monitoring: Prometheus Metrics

from prometheus_client import Histogram, Counter, Gauge, generate_latest
from fastapi import FastAPI, Request, Response
import time

REQUEST_LATENCY = Histogram(
    "model_request_latency_seconds",
    "Model inference latency in seconds",
    ["model_name", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

REQUEST_COUNT = Counter(
    "model_requests_total",
    "Total number of inference requests",
    ["model_name", "endpoint", "status"]
)

ACTIVE_REQUESTS = Gauge(
    "model_active_requests",
    "Number of currently in-flight requests",
    ["model_name"]
)

TOKEN_THROUGHPUT = Counter(
    "model_tokens_generated_total",
    "Total tokens generated",
    ["model_name"]
)

GPU_MEMORY_USED = Gauge(
    "gpu_memory_used_bytes",
    "GPU memory usage in bytes",
    ["gpu_index"]
)

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    model_name = "llama-3-8b"
    endpoint = request.url.path

    ACTIVE_REQUESTS.labels(model_name=model_name).inc()
    start = time.perf_counter()

    try:
        response = await call_next(request)
        status = str(response.status_code)
        REQUEST_COUNT.labels(
            model_name=model_name, endpoint=endpoint, status=status
        ).inc()
        return response
    except Exception:
        REQUEST_COUNT.labels(
            model_name=model_name, endpoint=endpoint, status="500"
        ).inc()
        raise
    finally:
        latency = time.perf_counter() - start
        REQUEST_LATENCY.labels(
            model_name=model_name, endpoint=endpoint
        ).observe(latency)
        ACTIVE_REQUESTS.labels(model_name=model_name).dec()

@app.get("/metrics")
async def metrics():
    # Prometheus exposition content type
    return Response(generate_latest(), media_type="text/plain; version=0.0.4; charset=utf-8")

Quiz: AI Model Serving Deep Dive

Q1. How does NVIDIA Triton's dynamic batching improve GPU utilization over simple request batching?

Answer: Dynamic batching automatically groups queued requests on the server side, using preferred_batch_size and max_queue_delay settings to ensure the GPU always executes at optimal batch sizes rather than processing one request at a time.

Explanation: With simple request batching, the client must manually construct batches before sending them. Triton's dynamic batching allows a short queuing window (e.g., up to 5ms) on the server and automatically groups arriving requests. This increases GPU Streaming Multiprocessor (SM) utilization and can multiply throughput several times compared to per-request processing. Combining multiple model instances via instance_group further increases batching opportunities.
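The effect can be illustrated with simple arithmetic. Assume a fixed per-batch launch overhead and a small marginal cost per request in the batch — the numbers below are illustrative, not measured:

```python
def throughput_rps(batch_size: int, overhead_ms: float = 5.0, per_item_ms: float = 1.0) -> float:
    """Requests/sec under a toy cost model: each batch pays a fixed launch
    overhead plus a small per-request cost (illustrative numbers only)."""
    batch_latency_ms = overhead_ms + batch_size * per_item_ms
    return batch_size / batch_latency_ms * 1000

print(round(throughput_rps(1)))   # ≈ 167 req/s, one request at a time
print(round(throughput_rps(32)))  # ≈ 865 req/s with server-side batching
```

The fixed overhead is amortized across the batch, which is exactly what Triton's short queuing window buys.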

Q2. How does vLLM's PagedAttention solve memory fragmentation in LLM serving?

Answer: PagedAttention splits the KV cache into fixed-size "pages" and maps them to non-contiguous physical memory using a block table, similar to OS virtual memory. This eliminates most external and internal fragmentation caused by variable sequence lengths.

Explanation: Traditional LLM serving pre-allocates contiguous memory equal to the maximum sequence length for each request, wasting memory when actual output is shorter. vLLM's PagedAttention divides the KV cache into blocks (e.g., 16 tokens per block) and uses a logical-to-physical block table. The vLLM paper reports memory waste under 4% and up to 24x higher throughput than naive Hugging Face Transformers serving.
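The block-table idea can be sketched in a few lines of plain Python. This is a conceptual model only — real vLLM manages GPU memory blocks in CUDA, not Python lists, and the class names here are invented:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block, matching vLLM's default

class BlockTable:
    """Maps a sequence's logical blocks to physical blocks on demand, so
    memory is claimed as tokens are generated instead of reserved up front."""

    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks            # physical pool shared by all sequences
        self.logical_to_physical: list[int] = []  # this sequence's block table
        self.num_tokens = 0

    def append_token(self) -> None:
        # Claim a new physical block only when the current one is full
        if self.num_tokens % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free_blocks.pop())
        self.num_tokens += 1

pool = list(range(100))  # 100 physical blocks shared across sequences
seq = BlockTable(pool)
for _ in range(40):      # generate 40 tokens
    seq.append_token()

print(len(seq.logical_to_physical))  # 3 blocks for 40 tokens (ceil(40/16))
```

A max-length pre-allocation would have reserved blocks for all 8192 positions up front; here only three blocks are in use, and the rest of the pool stays available to other sequences.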

Q3. What are the architectural differences between BentoML and Ray Serve, and which deployment scenarios suit each?

Answer: BentoML focuses on single-service packaging and container builds for straightforward deployments, while Ray Serve uses a distributed actor model suited for complex ML pipelines and ensemble inference.

Explanation: BentoML bundles model, dependencies, and API into a single Bento artifact, simplifying Docker image builds and cloud deployments. It fits small teams or single-model APIs well. Ray Serve runs on a Ray cluster and excels at chaining multiple models into pipelines, A/B testing, and complex routing logic. It is the better choice for enterprise-scale distributed inference or ensemble workloads requiring fine-grained resource control.

Q4. Why is GPU metric-based HPA scaling in Kubernetes harder than CPU-based scaling?

Answer: GPU metrics are not supported by Kubernetes' default metrics-server, requiring a separate stack (DCGM Exporter + Prometheus Adapter), and GPUs are integer resources that make fine-grained utilization control difficult.

Explanation: CPU and memory are collected by kubelet natively, but GPU utilization (DCGM_FI_DEV_GPU_UTIL) must be collected by NVIDIA DCGM Exporter, stored in Prometheus, and exposed via the Custom Metrics API by Prometheus Adapter before HPA can consume them. GPU memory held by a process is slow to release, risking OOM during scale-down. GPU node provisioning takes 5 to 10 minutes, far longer than CPU nodes, making proactive scale-out critical.
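As a small illustration of that plumbing, the DCGM exporter publishes per-GPU utilization in Prometheus exposition format; a minimal parser for that text might look like the sketch below. The sample text is fabricated for illustration — a real pipeline would let Prometheus scrape the exporter rather than parse by hand:

```python
def parse_gpu_util(metrics_text: str) -> dict[str, float]:
    """Extract per-GPU DCGM_FI_DEV_GPU_UTIL values from Prometheus
    exposition-format text (a sketch; real scrapers use a client library)."""
    utils = {}
    for line in metrics_text.splitlines():
        if not line.startswith("DCGM_FI_DEV_GPU_UTIL{"):
            continue  # skip HELP/TYPE comments and other metrics
        labels, value = line.rsplit(" ", 1)
        gpu = labels.split('gpu="', 1)[1].split('"', 1)[0]
        utils[gpu] = float(value)
    return utils

sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 12
"""
print(parse_gpu_util(sample))  # {'0': 87.0, '1': 12.0}
```

In the full stack, Prometheus Adapter translates exactly this kind of series into the Custom Metrics API objects that HPA consumes.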

Q5. Why is P99 latency a more important service quality metric than average latency?

Answer: P99 is the latency below which 99% of requests complete, so it captures the slow tail that the average hides. An average can look healthy while a meaningful fraction of users get a poor experience; P99 surfaces that dissatisfaction directly.

Explanation: Even if average latency is 50ms, if 1% of requests take 2000ms, thousands of users per minute receive slow responses at scale. Setting a P99 SLO (e.g., under 200ms) allows teams to detect tail latency issues early. In microservice architectures where services call each other serially, individual P99 values compound, causing "tail latency amplification" that makes monitoring P99 at every layer essential.
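A few lines of synthetic data make the point concrete (numbers invented for illustration; here 2% of requests are slow):

```python
# 98% of requests take 50 ms; 2% hit a 2000 ms tail
latencies = [50.0] * 980 + [2000.0] * 20

mean = sum(latencies) / len(latencies)
# nearest-rank P99: the value at or below which 99% of requests fall
p99 = sorted(latencies)[int(len(latencies) * 0.99) - 1]

print(f"mean = {mean:.1f} ms")  # 89.0 ms — looks healthy
print(f"p99  = {p99:.1f} ms")   # 2000.0 ms — exposes the slow tail
```

An SLO dashboard watching only the mean would never fire here, while a P99 alert triggers immediately.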


Conclusion

AI model serving is far more than wrapping a model in an API endpoint. GPU resource management, request batching, streaming responses, Kubernetes autoscaling, and comprehensive monitoring must all work together for reliable production serving. Apply the patterns in this guide incrementally and measure the impact of each improvement with real traffic.