- Introduction
- BentoML vs Building From Scratch
- Installation and Basic Usage
- LLM Serving — OpenLLM Integration
- Multi-Model Pipeline
- Bento Build and Docker
- Kubernetes Deployment
- Adaptive Batching
- Monitoring
- Summary
Introduction
Training an ML model and serving it reliably in production are very different problems. BentoML bridges this gap by packaging models as API services that can be deployed anywhere, and it offers a far more structured approach than hand-building APIs with Flask or FastAPI.
BentoML vs Building From Scratch
| Aspect | Flask/FastAPI Manual Build | BentoML |
|---|---|---|
| API Implementation | Manual (routing, serialization) | Decorator-based automation |
| Model Versioning | Must implement manually | Built-in Model Store |
| Batch Processing | Must implement manually | Built-in Adaptive Batching |
| Docker Build | Manual Dockerfile | Auto-generated |
| GPU Support | Manual configuration | Declarative configuration |
Installation and Basic Usage
```bash
pip install bentoml
```
Saving a Model
```python
# save_model.py
import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train the model
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

# Save to the BentoML Model Store
saved_model = bentoml.sklearn.save_model(
    "iris_classifier",
    model,
    signatures={"predict": {"batchable": True}},
    labels={"owner": "ml-team", "stage": "production"},
    metadata={"accuracy": 0.96, "dataset": "iris"},
)
print(f"Model saved: {saved_model}")
# Model saved: Model(tag="iris_classifier:abc123")
```

```bash
# Check saved models
bentoml models list
# Tag                     Module   Size   Creation Time
# iris_classifier:abc123  sklearn  1.2MB  2026-03-03 05:00:00
```
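Every saved model is addressed by a `name:version` tag, and referencing a bare name resolves to the most recent version. A tiny (hypothetical) helper illustrates the scheme:

```python
def parse_model_tag(tag: str) -> tuple:
    """Split a BentoML-style 'name:version' model tag into its parts.

    When no version is given, BentoML resolves to the latest version,
    modeled here by substituting the string "latest".
    """
    name, _, version = tag.partition(":")
    return name, version or "latest"

print(parse_model_tag("iris_classifier:abc123"))  # ('iris_classifier', 'abc123')
print(parse_model_tag("iris_classifier"))         # ('iris_classifier', 'latest')
```

This is the same convention used by `bentoml.models.get("iris_classifier:latest")` in the service below.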
Defining a Service
```python
# service.py
from typing import Annotated

import bentoml
import numpy as np


@bentoml.service(
    resources={"cpu": "2", "memory": "1Gi"},
    traffic={"timeout": 30, "concurrency": 32},
)
class IrisClassifier:
    model = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.clf = bentoml.sklearn.load_model(self.model)
        self.target_names = ["setosa", "versicolor", "virginica"]

    @bentoml.api
    def predict(
        self,
        features: Annotated[np.ndarray, bentoml.validators.Shape((4,))],
    ) -> dict:
        prediction = self.clf.predict([features])[0]
        probabilities = self.clf.predict_proba([features])[0]
        return {
            "class": self.target_names[prediction],
            "probability": float(max(probabilities)),
            "all_probabilities": {
                name: float(prob)
                for name, prob in zip(self.target_names, probabilities)
            },
        }

    @bentoml.api
    def predict_batch(
        self,
        features: Annotated[np.ndarray, bentoml.validators.Shape((-1, 4))],
    ) -> list[dict]:
        predictions = self.clf.predict(features)
        probabilities = self.clf.predict_proba(features)
        return [
            {
                "class": self.target_names[pred],
                "probability": float(max(probs)),
            }
            for pred, probs in zip(predictions, probabilities)
        ]
```
```bash
# Local serving
bentoml serve service:IrisClassifier

# Test
curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
# {"class": "setosa", "probability": 0.98, ...}
```
LLM Serving — OpenLLM Integration
```python
# llm_service.py
import bentoml
from vllm import LLM, SamplingParams


@bentoml.service(
    resources={"gpu": 1, "gpu_type": "nvidia-a100"},
    traffic={"timeout": 120, "concurrency": 16},
)
class LLMService:
    def __init__(self):
        self.llm = LLM(
            model="meta-llama/Llama-3.1-8B-Instruct",
            tensor_parallel_size=1,
            max_model_len=8192,
            gpu_memory_utilization=0.9,
        )

    @bentoml.api
    async def generate(self, prompt: str, max_tokens: int = 512) -> str:
        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=max_tokens,
        )
        # Note: LLM.generate is synchronous and blocks the event loop;
        # for high-throughput serving, consider vLLM's async engine.
        outputs = self.llm.generate([prompt], sampling_params)
        return outputs[0].outputs[0].text

    @bentoml.api
    async def chat(self, messages: list[dict]) -> str:
        prompt = self._format_chat(messages)
        return await self.generate(prompt)

    def _format_chat(self, messages: list[dict]) -> str:
        formatted = ""
        for msg in messages:
            formatted += f"<|{msg['role']}|>\n{msg['content']}\n"
        formatted += "<|assistant|>\n"
        return formatted
```
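The chat-template helper can be exercised on its own; a standalone sketch of the same logic shows the prompt string the service hands to the model:

```python
def format_chat(messages: list) -> str:
    """Mirror of the service's chat template: one <|role|> block per message."""
    formatted = ""
    for msg in messages:
        formatted += f"<|{msg['role']}|>\n{msg['content']}\n"
    formatted += "<|assistant|>\n"  # leave the assistant turn open for generation
    return formatted

prompt = format_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
# <|system|>
# You are a helpful assistant.
# <|user|>
# Hello!
# <|assistant|>
```

Note this is a generic template for illustration; in practice the prompt format must match the tokens the specific model was trained with (e.g. the Llama 3.1 chat template).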
Multi-Model Pipeline
```python
# pipeline_service.py
import bentoml
import numpy as np
from PIL import Image


@bentoml.service(resources={"cpu": "1"})
class ImagePreprocessor:
    @bentoml.api
    async def process(self, image: Image.Image) -> np.ndarray:
        img = image.resize((224, 224))
        arr = np.array(img) / 255.0
        return arr.transpose(2, 0, 1)  # HWC -> CHW


@bentoml.service(resources={"gpu": 1})
class ImageClassifier:
    model = bentoml.models.get("resnet50:latest")

    def __init__(self):
        import torch

        self.device = torch.device("cuda")
        self.net = bentoml.pytorch.load_model(self.model)
        self.net.eval()
        self.net.to(self.device)

    @bentoml.api
    async def predict(self, features: np.ndarray) -> np.ndarray:
        import torch

        tensor = torch.tensor(features).unsqueeze(0).float().to(self.device)
        with torch.no_grad():
            output = self.net(tensor)
        return output.cpu().numpy()


# Minimal postprocessor so the pipeline is complete
@bentoml.service(resources={"cpu": "1"})
class ResultPostprocessor:
    @bentoml.api
    async def format(self, raw_result: np.ndarray) -> dict:
        return {
            "class_id": int(raw_result.argmax()),
            "score": float(raw_result.max()),
        }


# The composing service comes last: bentoml.depends() needs the
# dependency classes to already be defined.
@bentoml.service(resources={"cpu": "4", "memory": "4Gi"})
class ImageClassificationPipeline:
    # Compose multiple services
    preprocessor = bentoml.depends(ImagePreprocessor)
    classifier = bentoml.depends(ImageClassifier)
    postprocessor = bentoml.depends(ResultPostprocessor)

    @bentoml.api
    async def classify(self, image: Image.Image) -> dict:
        features = await self.preprocessor.process(image)     # 1. Preprocessing
        raw_result = await self.classifier.predict(features)  # 2. Classification
        return await self.postprocessor.format(raw_result)    # 3. Postprocessing
```
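The preprocessor's `transpose(2, 0, 1)` step (HWC to CHW, the layout PyTorch vision models expect) can be illustrated without NumPy; a pure-Python sketch of the same reordering:

```python
def hwc_to_chw(image):
    """Reorder a nested-list image from (height, width, channel) to
    (channel, height, width) - equivalent to ndarray.transpose(2, 0, 1)."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][ch] for x in range(w)] for y in range(h)]
        for ch in range(c)
    ]

# A 1x2 "image" with 3 channel values per pixel
img = [[[1, 2, 3], [4, 5, 6]]]
print(hwc_to_chw(img))
# [[[1, 4]], [[2, 5]], [[3, 6]]]
```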
Bento Build and Docker
bentofile.yaml
```yaml
# bentofile.yaml
service: 'service:IrisClassifier'
labels:
  owner: ml-team
  project: iris-classifier
include:
  - '*.py'
python:
  packages:
    - scikit-learn==1.5.0
    - numpy
docker:
  python_version: '3.11'
  system_packages:
    - libgomp1
  env:
    BENTOML_PORT: '3000'
```
```bash
# Build the Bento
bentoml build

# Check built Bentos
bentoml list
# Tag                             Size  Creation Time
# iris_classifier_service:xyz789  45MB  2026-03-03

# Generate a Docker image
bentoml containerize iris_classifier_service:latest

# Run with Docker
docker run -p 3000:3000 iris_classifier_service:latest
```
Kubernetes Deployment
```yaml
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-classifier
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: iris-classifier
  template:
    metadata:
      labels:
        app: iris-classifier
    spec:
      containers:
        - name: bento
          image: registry.example.com/iris_classifier_service:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: '1'
              memory: '1Gi'
            limits:
              cpu: '2'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: iris-classifier
  namespace: ml-serving
spec:
  selector:
    app: iris-classifier
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-classifier
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iris-classifier
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
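The HPA's scaling decision is a simple proportional rule: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the min/max bounds. A quick sketch of the arithmetic for the configuration above:

```python
import math

def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, min_r: int = 2, max_r: int = 10) -> int:
    """Kubernetes HPA proportional scaling rule, clamped to min/max replicas."""
    desired = math.ceil(current_replicas * current_util / target_util)
    return max(min_r, min(max_r, desired))

# 3 replicas running at 140% CPU vs. a 70% target -> scale out to 6
print(desired_replicas(3, 140, 70))  # 6
# 4 replicas at 20% CPU -> scale in, clamped at minReplicas=2
print(desired_replicas(4, 20, 70))   # 2
```

The real controller adds tolerances and stabilization windows on top of this formula, so it scales less eagerly than the raw arithmetic suggests.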
Adaptive Batching
Adaptive batching is a core BentoML feature that automatically groups concurrent requests into a single batch to maximize GPU utilization.
```python
import bentoml
import numpy as np


@bentoml.service(
    traffic={"timeout": 30},
)
class EmbeddingService:
    model = bentoml.models.get("sentence-transformer:latest")

    def __init__(self):
        from sentence_transformers import SentenceTransformer

        self.encoder = SentenceTransformer(self.model.path)

    @bentoml.api(
        batchable=True,
        batch_dim=0,
        max_batch_size=64,
        max_latency_ms=100,
    )
    async def encode(self, texts: list[str]) -> np.ndarray:
        # Individual requests are automatically batched together
        return self.encoder.encode(texts)
```
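To see how the two thresholds interact, here is a hypothetical, framework-free simulation of the flush rule: a batch dispatches as soon as it reaches `max_batch_size`, or as soon as the oldest queued request has waited `max_latency_ms`. (BentoML's actual batcher also adapts batch sizes from observed latency; this sketch only models the thresholds.)

```python
class AdaptiveBatcher:
    """Toy model of adaptive batching: flush on size or age threshold."""

    def __init__(self, max_batch_size: int = 64, max_latency_ms: float = 100):
        self.max_batch_size = max_batch_size
        self.max_latency_ms = max_latency_ms
        self.queue = []  # list of (arrival_ms, payload)

    def submit(self, now_ms: float, payload: str):
        """Enqueue one request; return a batch if a threshold was hit."""
        self.queue.append((now_ms, payload))
        return self.maybe_flush(now_ms)

    def maybe_flush(self, now_ms: float):
        """Flush when the batch is full or the oldest request is too old."""
        if not self.queue:
            return None
        oldest_ms, _ = self.queue[0]
        if (len(self.queue) >= self.max_batch_size
                or now_ms - oldest_ms >= self.max_latency_ms):
            batch = [payload for _, payload in self.queue]
            self.queue.clear()
            return batch
        return None

batcher = AdaptiveBatcher(max_batch_size=3, max_latency_ms=100)
print(batcher.submit(0, "a"))    # None - waiting for more requests
print(batcher.submit(10, "b"))   # None
print(batcher.submit(20, "c"))   # ['a', 'b', 'c'] - size threshold hit
print(batcher.submit(30, "d"))   # None
print(batcher.maybe_flush(150))  # ['d'] - latency threshold hit
```

The trade-off is visible directly: a larger `max_batch_size` improves GPU throughput, while `max_latency_ms` caps how long any single request can sit in the queue.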
Monitoring
```python
# Adding custom metrics
import bentoml
import numpy as np
from prometheus_client import Counter, Histogram

prediction_counter = Counter(
    "predictions_total", "Total predictions", ["model", "predicted_class"]
)
latency_histogram = Histogram(
    "prediction_latency_seconds", "Prediction latency"
)


@bentoml.service
class MonitoredClassifier:
    model = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.clf = bentoml.sklearn.load_model(self.model)

    @bentoml.api
    def predict(self, features: np.ndarray) -> dict:
        with latency_histogram.time():
            result = self.clf.predict([features])[0]
        # Label keyword names must match the names declared on the Counter
        prediction_counter.labels(
            model="iris_v1", predicted_class=str(result)
        ).inc()
        return {"class": str(result)}
```
```bash
# Prometheus metrics endpoint
curl http://localhost:3000/metrics
```
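The label mechanics behind `Counter` are worth understanding: each distinct combination of label values becomes its own time series, and the keyword names passed to `labels()` must exactly match the names declared on the metric. A hypothetical minimal labeled counter makes this concrete:

```python
class LabeledCounter:
    """Toy stand-in for prometheus_client.Counter: one series per label set."""

    def __init__(self, name: str, label_names: list):
        self.name = name
        self.label_names = label_names
        self.series = {}  # label-value tuple -> running total

    def labels(self, **kwargs):
        # Raises KeyError if a declared label name is missing - the same
        # class of mistake a mismatched keyword causes with the real client.
        key = tuple(kwargs[n] for n in self.label_names)
        self.series.setdefault(key, 0.0)
        outer = self

        class _Child:
            def inc(self, amount: float = 1.0):
                outer.series[key] += amount

        return _Child()

counter = LabeledCounter("predictions_total", ["model", "predicted_class"])
counter.labels(model="iris_v1", predicted_class="setosa").inc()
counter.labels(model="iris_v1", predicted_class="setosa").inc()
counter.labels(model="iris_v1", predicted_class="virginica").inc()
print(counter.series)
# {('iris_v1', 'setosa'): 2.0, ('iris_v1', 'virginica'): 1.0}
```

Because every label combination is a separate series, high-cardinality labels (user IDs, raw inputs) should never be used as metric labels.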
Summary
BentoML significantly reduces the complexity of ML model serving:
- Simple API Implementation: Create REST APIs in just a few lines using decorators
- Model Version Management: Systematic management with the built-in Model Store
- Adaptive Batching: Maximizes GPU utilization
- Docker Automation: Reproducible builds with bentofile.yaml
- Kubernetes Native: Auto-scaling with HPA
Quiz: BentoML Comprehension Check (7 Questions)
Q1. What is BentoML's Model Store?
A repository that stores trained models locally with version management and metadata. Models are saved using functions like bentoml.sklearn.save_model().
Q2. How does Adaptive Batching work?
It automatically collects individual requests and processes them in a single batch once max_batch_size or max_latency_ms is reached, maximizing GPU efficiency.
Q3. What is the role of bentoml.depends()?
In multi-model pipelines, it injects other BentoML services as dependencies, automatically managing inter-service communication.
Q4. What is defined in bentofile.yaml?
The service entrypoint, Python package dependencies, Docker configuration, and files to include are declared.
Q5. What is BentoML's /healthz endpoint for?
It is used for Kubernetes readiness/liveness probes to check whether the service is ready and alive.
Q6. How do you specify GPU resources?
Using the @bentoml.service(resources={"gpu": 1, "gpu_type": "nvidia-a100"}) decorator.
Q7. What advantages does BentoML have over building with Flask/FastAPI directly?
It comes with built-in model version management, Adaptive Batching, automatic Docker builds, and declarative resource management, making it faster to get production-ready.