- Published on
Complete Guide to AI Model Deployment & Serving: Triton, vLLM, BentoML, and Kubernetes
- Authors
  - Youngju Kim (@fjvbn20031)
Introduction
Training an AI model in a research environment and serving it reliably in production are entirely different challenges. Alongside model accuracy, low latency, high throughput, and stable scaling are equally critical. This guide walks through the full lifecycle of AI model deployment with a production-first mindset.
1. Serving Architecture Patterns
1.1 Online Serving vs Batch Serving
Online Serving handles user requests in real time with strict latency targets.
- Latency target: P99 under 200ms
- Use cases: recommendation systems, chatbots, real-time image classification
- Infrastructure: REST API / gRPC endpoints, autoscaling replicas
Batch Serving processes large volumes of data in scheduled jobs.
- Latency target: minutes to hours
- Use cases: nightly scoring pipelines, offline recommendation generation
- Infrastructure: Spark jobs, Airflow DAGs, large-scale GPU batch jobs
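At its core, the batch pattern is chunked scoring over a dataset. A minimal sketch (the scoring function here is a stand-in for a real model call):

```python
from typing import Callable, List

def batch_score(items: List[str],
                score_fn: Callable[[List[str]], List[float]],
                chunk_size: int = 4) -> List[float]:
    """Score a large list in fixed-size chunks, as a nightly batch job would."""
    results: List[float] = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        results.extend(score_fn(chunk))  # one model call per chunk
    return results

# Stand-in for a real model: score each text by its length.
scores = batch_score([f"doc-{i}" for i in range(10)],
                     lambda xs: [float(len(x)) for x in xs])
print(len(scores))  # 10
```

In a real pipeline the chunking is typically handled by Spark partitions or an Airflow task fan-out, but the shape of the computation is the same.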
1.2 Synchronous vs Asynchronous Serving
| Mode | Characteristics | Best Scenario |
|---|---|---|
| Sync | Request waits for response | Latency-sensitive APIs |
| Async | Work queue + result polling | Long inference jobs, LLMs |
| Streaming | Incremental token-by-token response | LLM chat, code generation |
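The async row in the table — enqueue work, return a job id immediately, let the client poll for the result — can be sketched with plain asyncio. An in-memory dict stands in for the job store (in production this would be Redis or a database):

```python
import asyncio
import uuid

# In-memory job store standing in for Redis or a database (sketch only).
jobs: dict[str, dict] = {}

async def submit(payload: str) -> str:
    """Enqueue work and return a job id immediately (the async pattern)."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    asyncio.create_task(_run(job_id, payload))
    return job_id

async def _run(job_id: str, payload: str) -> None:
    await asyncio.sleep(0.01)  # stand-in for a slow inference call
    jobs[job_id] = {"status": "done", "result": payload.upper()}

async def poll(job_id: str) -> dict:
    return jobs[job_id]

async def main() -> str:
    job_id = await submit("hello")
    while (job := await poll(job_id))["status"] != "done":
        await asyncio.sleep(0.005)  # client-side polling interval
    return job["result"]

print(asyncio.run(main()))  # HELLO
```

The same contract scales up naturally: the submit endpoint writes to a durable queue, workers consume it, and the poll endpoint reads job state.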
1.3 Streaming Responses with Server-Sent Events
Streaming dramatically improves perceived responsiveness for LLM serving. Time To First Token (TTFT) is the key user-facing metric.
```python
import asyncio
import json

import httpx

async def stream_llm_response(prompt: str):
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/v1/completions",
            json={
                "model": "llama-3-8b",
                "prompt": prompt,
                "max_tokens": 512,
                "stream": True,
            },
            timeout=60.0,
        ) as response:
            async for line in response.aiter_lines():
                # SSE frames arrive as "data: {...}" lines
                if line.startswith("data: "):
                    data = line[len("data: "):]
                    if data != "[DONE]":
                        chunk = json.loads(data)
                        token = chunk["choices"][0]["text"]
                        print(token, end="", flush=True)
```
2. Containerization: Docker for GPU Serving
2.1 Multi-Stage Builds for Optimized Images
Production images should exclude build-time dependencies and include only the minimal runtime footprint.
```dockerfile
# ---- Stage 1: Builder ----
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder

WORKDIR /build

RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ---- Stage 2: Runtime ----
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

ENV PYTHONUNBUFFERED=1
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

WORKDIR /app

RUN apt-get update && apt-get install -y \
    python3.11 \
    python3-pip \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy only installed packages from the builder stage
COPY --from=builder /install /usr/local
COPY . .

# Run as a non-root user for security
RUN useradd -m -u 1000 mluser
USER mluser

EXPOSE 8080
CMD ["python3.11", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
```
2.2 NVIDIA Container Toolkit Setup
GPU containers require the NVIDIA Container Toolkit to be installed on the host machine.
```bash
# Install NVIDIA Container Toolkit on Ubuntu
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor \
  -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU container access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
3. Kubernetes ML Deployment
3.1 GPU Deployment + HPA
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/metrics'
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-a10g
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: model-server
          image: myregistry/model-server:v1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: '2'
              memory: '8Gi'
              nvidia.com/gpu: '1'
            limits:
              cpu: '4'
              memory: '16Gi'
              nvidia.com/gpu: '1'
          env:
            - name: MODEL_PATH
              value: '/models/llama-3-8b'
            - name: MAX_BATCH_SIZE
              value: '32'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: model_requests_per_second
        target:
          type: AverageValue
          averageValue: '50'
```
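For intuition, the core of the HPA algorithm is a one-line formula: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min/max bounds. When several metrics are configured, as above, the controller computes a desired count per metric and takes the maximum. A sketch (the real controller adds stabilization windows and a tolerance band):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float, min_r: int = 2, max_r: int = 10) -> int:
    """Kubernetes HPA core rule: ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_r, min(max_r, desired))

# 4 pods each seeing 80 req/s against a 50 req/s target -> scale out to 7
print(hpa_desired_replicas(4, 80, 50))  # 7
# Load drops to 20 req/s -> scale in, but never below minReplicas
print(hpa_desired_replicas(4, 20, 50))  # 2
```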
3.2 Karpenter Node Autoscaling
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    metadata:
      labels:
        accelerator: nvidia-tesla-a10g
    spec:
      nodeClassRef:
        name: gpu-nodeclass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['g5.xlarge', 'g5.2xlarge', 'g5.4xlarge']
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    nvidia.com/gpu: '20'
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
```
4. AI Serving Frameworks Compared
4.1 NVIDIA Triton Inference Server
Triton supports multiple model formats (TensorRT, ONNX, PyTorch, TensorFlow) in a single server. Its Dynamic Batching feature automatically groups incoming requests to maximize GPU utilization.
```protobuf
# config.pbtxt
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 2 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```
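The effect of the queue-delay window is easy to build intuition for with a toy simulation. This is a deliberate simplification of Triton's scheduler (the real one also weighs preferred batch sizes when closing a window), but it shows how a few milliseconds of queuing turns per-request traffic into large batches:

```python
def simulate_dynamic_batching(arrival_ms, max_queue_delay_ms=5.0, preferred=(8, 16, 32)):
    """Group request arrival times into batches: a batch closes when it reaches the
    largest preferred size, or when the oldest queued request has waited out the
    queue-delay budget. Returns the batch sizes produced."""
    batches, queue = [], []
    for t in sorted(arrival_ms):
        # Flush if the oldest queued request would exceed its delay budget
        if queue and t - queue[0] > max_queue_delay_ms:
            batches.append(queue)
            queue = []
        queue.append(t)
        if len(queue) >= max(preferred):
            batches.append(queue)
            queue = []
    if queue:
        batches.append(queue)
    return [len(b) for b in batches]

# 40 requests arriving 0.5 ms apart: a 5 ms window yields batches of ~11
print(simulate_dynamic_batching([i * 0.5 for i in range(40)]))  # [11, 11, 11, 7]
```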
4.2 BentoML Service Definition
BentoML is a Python-native serving framework that supports both rapid prototyping and production deployment.
```python
import bentoml
import numpy as np
from bentoml.io import JSON
from pydantic import BaseModel
from typing import List

class InferenceRequest(BaseModel):
    texts: List[str]
    top_k: int = 5

class InferenceResponse(BaseModel):
    labels: List[str]
    scores: List[float]

# Create a runner from the saved model
classifier_runner = bentoml.pytorch.get("text_classifier:latest").to_runner()

svc = bentoml.Service("text_classification_svc", runners=[classifier_runner])

@svc.api(
    input=JSON(pydantic_model=InferenceRequest),
    output=JSON(pydantic_model=InferenceResponse),
)
async def classify(request: InferenceRequest) -> InferenceResponse:
    batch_results = await classifier_runner.async_run(request.texts)
    labels: List[str] = []
    scores: List[float] = []
    for result in batch_results:
        # Take the top-k class indices by score, highest first
        top_idx = np.argsort(result)[-request.top_k:][::-1]
        labels.extend([f"label_{i}" for i in top_idx])
        scores.extend(result[top_idx].tolist())
    return InferenceResponse(labels=labels, scores=scores)
```
4.3 Framework Comparison Summary
| Framework | Strengths | Weaknesses | Best Scenario |
|---|---|---|---|
| Triton | Top performance, multi-format | Complex configuration | High-throughput GPU serving |
| BentoML | Easy to use, packaging | Lower perf vs Triton | Fast MVP, small teams |
| Ray Serve | Distributed, pipeline support | Steep learning curve | Complex ML pipelines |
| TorchServe | Native PyTorch integration | Single-framework only | PyTorch-only deployments |
5. LLM Serving: vLLM and TGI
5.1 vLLM — High-Performance LLM Serving with PagedAttention
vLLM uses the PagedAttention algorithm to manage KV cache memory like virtual memory in an OS, minimizing GPU memory waste.
```python
import json
import uuid

import uvicorn
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI(title="vLLM OpenAI-Compatible API")

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,          # 2-GPU tensor parallelism
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    enable_chunked_prefill=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

def format_messages(messages):
    """Flatten chat messages into a single prompt string (template is illustrative)."""
    result = ""
    for msg in messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        result += f"<|{role}|>\n{content}\n"
    return result + "<|assistant|>\n"

@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    messages = request.get("messages", [])
    stream = request.get("stream", False)
    prompt = format_messages(messages)

    sampling_params = SamplingParams(
        temperature=request.get("temperature", 0.7),
        max_tokens=request.get("max_tokens", 512),
        top_p=request.get("top_p", 0.95),
    )
    request_id = str(uuid.uuid4())

    if stream:
        async def generate_stream():
            sent = 0  # vLLM yields cumulative text, so emit only the new suffix
            async for output in engine.generate(prompt, sampling_params, request_id):
                if output.outputs:
                    text = output.outputs[0].text
                    delta, sent = text[sent:], len(text)
                    chunk = {
                        "id": request_id,
                        "object": "chat.completion.chunk",
                        "choices": [{"delta": {"content": delta}, "index": 0}],
                    }
                    yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(generate_stream(), media_type="text/event-stream")

    final_output = None
    async for output in engine.generate(prompt, sampling_params, request_id):
        final_output = output
    return {"choices": [{"message": {"content": final_output.outputs[0].text}}]}
```
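The memory model behind PagedAttention can be sketched in a few lines: a toy block allocator (block size and names are illustrative, not vLLM internals) in which each sequence's KV cache grows block by block through a logical-to-physical table, so no contiguous reservation is ever needed:

```python
BLOCK_SIZE = 16  # tokens per KV block, similar in spirit to vLLM's small blocks

class PagedKVCache:
    """Toy allocator: a sequence's KV cache grows block-by-block, and its block
    table maps logical blocks to arbitrary free physical blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.tables: dict[str, list[int]] = {}   # seq -> physical block ids
        self.lengths: dict[str, int] = {}        # seq -> tokens stored

    def append(self, seq: str, n_tokens: int = 1) -> None:
        self.lengths[seq] = self.lengths.get(seq, 0) + n_tokens
        table = self.tables.setdefault(seq, [])
        while len(table) * BLOCK_SIZE < self.lengths[seq]:
            table.append(self.free.pop())        # any free block; no contiguity needed

    def waste_tokens(self, seq: str) -> int:
        """Internal fragmentation is bounded by one partially filled block."""
        return len(self.tables[seq]) * BLOCK_SIZE - self.lengths[seq]

cache = PagedKVCache(num_blocks=64)
cache.append("req-A", 100)   # 100 tokens -> 7 blocks, 12 token slots wasted
cache.append("req-B", 33)    # 33 tokens  -> 3 blocks, 15 token slots wasted
print(cache.waste_tokens("req-A"), cache.waste_tokens("req-B"))  # 12 15
```

Contrast this with contiguous preallocation, where every request would reserve `max_model_len` slots up front regardless of actual output length.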
5.2 TGI (Text Generation Inference) Deployment
Hugging Face TGI can be deployed quickly via Docker.
```bash
# Start Llama-3 serving with TGI
docker run --gpus all \
  -p 8080:80 \
  -v /data/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id meta-llama/Llama-3-8B-Instruct \
  --num-shard 2 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 16384 \
  --dtype bfloat16

# Test the streaming endpoint (it streams by design; no extra flag needed)
curl http://localhost:8080/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is the capital of France?", "parameters": {"max_new_tokens": 100}}'
```
6. Performance Optimization Strategies
6.1 Model Warmup
Prevent cold start latency by sending dummy requests when the server starts.
```python
import asyncio
import logging

import httpx

logger = logging.getLogger(__name__)

async def warmup_model(base_url: str, num_warmup_requests: int = 5):
    """Execute model warmup on server startup."""
    dummy_request = {
        "inputs": "warmup",
        "parameters": {"max_new_tokens": 10},
    }
    async with httpx.AsyncClient(timeout=120.0) as client:
        logger.info("Starting model warmup...")

        # Wait until the health check passes (up to ~60 s)
        for _ in range(30):
            try:
                resp = await client.get(f"{base_url}/health")
                if resp.status_code == 200:
                    break
            except httpx.HTTPError:
                pass
            await asyncio.sleep(2)

        # Send warmup requests concurrently
        tasks = [
            client.post(f"{base_url}/generate", json=dummy_request)
            for _ in range(num_warmup_requests)
        ]
        await asyncio.gather(*tasks, return_exceptions=True)
        logger.info(f"Warmup complete ({num_warmup_requests} requests sent)")
```
6.2 Dynamic Request Batching
```python
import asyncio
from collections import deque
from dataclasses import dataclass, field
from typing import Any

@dataclass
class BatchRequest:
    request_id: str
    payload: dict
    future: asyncio.Future = field(default_factory=asyncio.Future)

class DynamicBatcher:
    def __init__(self, max_batch_size: int = 32, max_wait_ms: float = 10.0):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: deque = deque()
        self._lock = asyncio.Lock()

    async def add_request(self, request_id: str, payload: dict) -> Any:
        req = BatchRequest(request_id=request_id, payload=payload)
        async with self._lock:
            self.queue.append(req)
        return await req.future

    async def process_batches(self, model_fn):
        while True:
            await asyncio.sleep(self.max_wait_ms / 1000)
            async with self._lock:
                if not self.queue:
                    continue
                batch = []
                while self.queue and len(batch) < self.max_batch_size:
                    batch.append(self.queue.popleft())
            # Run the model outside the lock so new requests keep queuing
            if batch:
                try:
                    inputs = [r.payload for r in batch]
                    results = await model_fn(inputs)
                    for req, result in zip(batch, results):
                        req.future.set_result(result)
                except Exception as e:
                    for req in batch:
                        req.future.set_exception(e)
```
7. Monitoring: Prometheus Metrics
```python
import time

from fastapi import FastAPI, Request, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

REQUEST_LATENCY = Histogram(
    "model_request_latency_seconds",
    "Model inference latency in seconds",
    ["model_name", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)
REQUEST_COUNT = Counter(
    "model_requests_total",
    "Total number of inference requests",
    ["model_name", "endpoint", "status"],
)
ACTIVE_REQUESTS = Gauge(
    "model_active_requests",
    "Number of currently in-flight requests",
    ["model_name"],
)
TOKEN_THROUGHPUT = Counter(
    "model_tokens_generated_total",
    "Total tokens generated",
    ["model_name"],
)
GPU_MEMORY_USED = Gauge(
    "gpu_memory_used_bytes",
    "GPU memory usage in bytes",
    ["gpu_index"],
)

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    model_name = "llama-3-8b"
    endpoint = request.url.path
    ACTIVE_REQUESTS.labels(model_name=model_name).inc()
    start = time.perf_counter()
    try:
        response = await call_next(request)
        REQUEST_COUNT.labels(
            model_name=model_name, endpoint=endpoint, status=str(response.status_code)
        ).inc()
        return response
    except Exception:
        REQUEST_COUNT.labels(
            model_name=model_name, endpoint=endpoint, status="500"
        ).inc()
        raise
    finally:
        latency = time.perf_counter() - start
        REQUEST_LATENCY.labels(model_name=model_name, endpoint=endpoint).observe(latency)
        ACTIVE_REQUESTS.labels(model_name=model_name).dec()

@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
Quiz: AI Model Serving Deep Dive
Q1. How does NVIDIA Triton's dynamic batching improve GPU utilization over simple request batching?
Answer: Dynamic batching automatically groups queued requests on the server side, using preferred_batch_size and max_queue_delay settings to ensure the GPU always executes at optimal batch sizes rather than processing one request at a time.
Explanation: With simple request batching, the client must manually construct batches before sending them. Triton's dynamic batching allows a short queuing window (e.g., up to 5ms) on the server and automatically groups arriving requests. This increases GPU Streaming Multiprocessor (SM) utilization and can multiply throughput several times compared to per-request processing. Combining multiple model instances via instance_group further increases batching opportunities.
Q2. How does vLLM's PagedAttention solve memory fragmentation in LLM serving?
Answer: PagedAttention splits the KV cache into fixed-size "pages" and maps them to non-contiguous physical memory using a block table, similar to OS virtual memory. This eliminates most external and internal fragmentation caused by variable sequence lengths.
Explanation: Traditional LLM serving pre-allocates contiguous memory equal to the maximum sequence length for each request, wasting memory when the actual output is shorter. vLLM's PagedAttention divides the KV cache into blocks (e.g., 16 tokens per block) and uses a logical-to-physical block table. The vLLM paper reports memory waste below 4% and up to 24x higher throughput than naive Transformers-based serving.
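The waste gap is simple arithmetic. A worked example (the numbers — 2048-token max length, 16-token blocks, a 200-token output — are illustrative):

```python
MAX_SEQ_LEN, BLOCK, actual = 2048, 16, 200

# Naive serving: reserve KV memory for the full max sequence length up front
naive_reserved = MAX_SEQ_LEN
naive_waste = naive_reserved - actual                      # 1848 token slots sit idle

# Paged serving: allocate ceil(actual / BLOCK) blocks on demand
paged_reserved = -(-actual // BLOCK) * BLOCK               # ceil division -> 208 slots
paged_waste = paged_reserved - actual                      # at most BLOCK - 1 slots

print(f"naive waste: {naive_waste / naive_reserved:.0%}")  # naive waste: 90%
print(f"paged waste: {paged_waste / paged_reserved:.1%}")  # paged waste: 3.8%
```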
Q3. What are the architectural differences between BentoML and Ray Serve, and which deployment scenarios suit each?
Answer: BentoML focuses on single-service packaging and container builds for straightforward deployments, while Ray Serve uses a distributed actor model suited for complex ML pipelines and ensemble inference.
Explanation: BentoML bundles model, dependencies, and API into a single Bento artifact, simplifying Docker image builds and cloud deployments. It fits small teams or single-model APIs well. Ray Serve runs on a Ray cluster and excels at chaining multiple models into pipelines, A/B testing, and complex routing logic. It is the better choice for enterprise-scale distributed inference or ensemble workloads requiring fine-grained resource control.
Q4. Why is GPU metric-based HPA scaling in Kubernetes harder than CPU-based scaling?
Answer: GPU metrics are not supported by Kubernetes' default metrics-server, requiring a separate stack (DCGM Exporter + Prometheus Adapter), and GPUs are integer resources that make fine-grained utilization control difficult.
Explanation: CPU and memory are collected by kubelet natively, but GPU utilization (DCGM_FI_DEV_GPU_UTIL) must be collected by NVIDIA DCGM Exporter, stored in Prometheus, and exposed via the Custom Metrics API by Prometheus Adapter before HPA can consume them. GPU memory held by a process is slow to release, risking OOM during scale-down. GPU node provisioning takes 5 to 10 minutes, far longer than CPU nodes, making proactive scale-out critical.
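One way to bridge the gap is a Prometheus Adapter rule that re-exposes the DCGM series through the Custom Metrics API. A sketch of such a rule (label names depend on how your dcgm-exporter is deployed, so treat the matchers as assumptions):

```yaml
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "gpu_utilization"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

With this in place, an HPA can target `gpu_utilization` as a Pods metric, just like `model_requests_per_second` in the earlier manifest.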
Q5. Why is P99 latency a more important service quality metric than average latency?
Answer: P99 represents the worst-case response time experienced by 99% of users, while the average hides the poor experience of a tail of users. P99 directly reflects real user dissatisfaction that averages obscure.
Explanation: Even if average latency is 50ms, if 1% of requests take 2000ms, thousands of users per minute receive slow responses at scale. Setting a P99 SLO (e.g., under 200ms) allows teams to detect tail latency issues early. In microservice architectures where services call each other serially, individual P99 values compound, causing "tail latency amplification" that makes monitoring P99 at every layer essential.
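The gap Q5 describes is easy to reproduce with synthetic latencies (the percentile below is a simple nearest-rank estimate, not an interpolated one):

```python
# 98.5% of requests at 50 ms, 1.5% hitting a 2000 ms slow path
latencies = [50.0] * 985 + [2000.0] * 15

avg = sum(latencies) / len(latencies)
p99 = sorted(latencies)[int(0.99 * len(latencies))]  # nearest-rank P99

print(f"avg: {avg:.2f} ms")  # avg: 79.25 ms -- looks healthy
print(f"p99: {p99:.0f} ms")  # p99: 2000 ms -- what tail users actually experience
```

An SLO written against the average here would never fire, while a P99 SLO of 200 ms flags the slow path immediately.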
Conclusion
AI model serving is far more than wrapping a model in an API endpoint. GPU resource management, request batching, streaming responses, Kubernetes autoscaling, and comprehensive monitoring must all work together for reliable production serving. Apply the patterns in this guide incrementally and measure the impact of each improvement with real traffic.