Building an ML Model Serving Pipeline with BentoML: From Packaging to Kubernetes Deployment


Introduction

Training an ML model and serving it in production are entirely different problems. BentoML is a framework that bridges this gap, packaging models as API services that can be deployed anywhere. It provides a far more structured approach than building APIs manually with Flask/FastAPI.

BentoML vs Building From Scratch

Aspect              | Flask/FastAPI Manual Build       | BentoML
--------------------|----------------------------------|----------------------------
API Implementation  | Manual (routing, serialization)  | Decorator-based automation
Model Versioning    | Must implement manually          | Built-in Model Store
Batch Processing    | Must implement manually          | Built-in Adaptive Batching
Docker Build        | Manual Dockerfile                | Auto-generated
GPU Support         | Manual configuration             | Declarative configuration

Installation and Basic Usage

pip install bentoml

Saving a Model

# save_model.py
import bentoml
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Train the model
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

# Save to BentoML Model Store
saved_model = bentoml.sklearn.save_model(
    "iris_classifier",
    model,
    signatures={"predict": {"batchable": True}},
    labels={"owner": "ml-team", "stage": "production"},
    metadata={"accuracy": 0.96, "dataset": "iris"},
)

print(f"Model saved: {saved_model}")
# Model saved: Model(tag="iris_classifier:abc123")

# Check saved models
bentoml models list
# Tag                          Module    Size    Creation Time
# iris_classifier:abc123       sklearn   1.2MB   2026-03-03 05:00:00

Defining a Service

# service.py
import bentoml
import numpy as np
from typing import Annotated

@bentoml.service(
    resources={"cpu": "2", "memory": "1Gi"},
    traffic={"timeout": 30, "concurrency": 32},
)
class IrisClassifier:
    model = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.clf = bentoml.sklearn.load_model(self.model)
        self.target_names = ["setosa", "versicolor", "virginica"]

    @bentoml.api
    def predict(
        self,
        features: Annotated[np.ndarray, bentoml.validators.Shape((4,))],
    ) -> dict:
        prediction = self.clf.predict([features])[0]
        probabilities = self.clf.predict_proba([features])[0]
        return {
            "class": self.target_names[prediction],
            "probability": float(max(probabilities)),
            "all_probabilities": {
                name: float(prob)
                for name, prob in zip(self.target_names, probabilities)
            },
        }

    @bentoml.api
    def predict_batch(
        self,
        features: Annotated[np.ndarray, bentoml.validators.Shape((-1, 4))],
    ) -> list[dict]:
        predictions = self.clf.predict(features)
        probabilities = self.clf.predict_proba(features)
        return [
            {
                "class": self.target_names[pred],
                "probability": float(max(probs)),
            }
            for pred, probs in zip(predictions, probabilities)
        ]

# Local serving
bentoml serve service:IrisClassifier

# Test
curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
# {"class": "setosa", "probability": 0.98, ...}
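The response construction inside predict can also be exercised without a running server. A minimal sketch using only scikit-learn (no BentoML required), mirroring the service code above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Recreate the model and the service's response construction locally
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
target_names = ["setosa", "versicolor", "virginica"]

features = np.array([5.1, 3.5, 1.4, 0.2])
prediction = clf.predict([features])[0]
probabilities = clf.predict_proba([features])[0]
response = {
    "class": target_names[prediction],
    "probability": float(max(probabilities)),
}
print(response)
```

Since [5.1, 3.5, 1.4, 0.2] is a setosa sample from the training set, the predicted class comes back as "setosa" with high probability.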

LLM Serving with vLLM

# llm_service.py
import bentoml
from vllm import LLM, SamplingParams

@bentoml.service(
    resources={"gpu": 1, "gpu_type": "nvidia-a100"},
    traffic={"timeout": 120, "concurrency": 16},
)
class LLMService:
    def __init__(self):
        self.llm = LLM(
            model="meta-llama/Llama-3.1-8B-Instruct",
            tensor_parallel_size=1,
            max_model_len=8192,
            gpu_memory_utilization=0.9,
        )

    @bentoml.api
    async def generate(self, prompt: str, max_tokens: int = 512) -> str:
        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=max_tokens,
        )
        outputs = self.llm.generate([prompt], sampling_params)
        return outputs[0].outputs[0].text

    @bentoml.api
    async def chat(self, messages: list[dict]) -> str:
        prompt = self._format_chat(messages)
        return await self.generate(prompt)

    def _format_chat(self, messages):
        formatted = ""
        for msg in messages:
            role = msg["role"]
            content = msg["content"]
            formatted += f"<|{role}|>\n{content}\n"
        formatted += "<|assistant|>\n"
        return formatted
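The simple role-tag template above can be checked standalone. Note that this generic `<|role|>` layout is only illustrative; for Llama 3.1 you would normally apply the model's own chat template (e.g. via the tokenizer) instead:

```python
# Standalone version of the chat-formatting helper for inspection.
def format_chat(messages: list[dict]) -> str:
    formatted = ""
    for msg in messages:
        formatted += f"<|{msg['role']}|>\n{msg['content']}\n"
    return formatted + "<|assistant|>\n"

prompt = format_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
])
print(prompt)
```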

Multi-Model Pipeline

# pipeline_service.py
import bentoml
import numpy as np
from PIL import Image

# Dependent services must be defined before the pipeline class that
# composes them, since bentoml.depends() references them at class-body time.
@bentoml.service(resources={"cpu": "1"})
class ImagePreprocessor:
    @bentoml.api
    async def process(self, image: Image.Image) -> np.ndarray:
        img = image.resize((224, 224))
        arr = np.array(img) / 255.0
        return arr.transpose(2, 0, 1)  # HWC -> CHW

@bentoml.service(resources={"gpu": 1})
class ImageClassifier:
    model_ref = bentoml.models.get("resnet50:latest")

    def __init__(self):
        import torch
        self.model = bentoml.pytorch.load_model(self.model_ref)
        self.model.eval()
        self.device = torch.device("cuda")
        self.model.to(self.device)

    @bentoml.api
    async def predict(self, features: np.ndarray) -> np.ndarray:
        import torch
        tensor = torch.tensor(features).unsqueeze(0).float().to(self.device)
        with torch.no_grad():
            output = self.model(tensor)
        return output.cpu().numpy()

@bentoml.service(resources={"cpu": "1"})
class ResultPostprocessor:
    @bentoml.api
    async def format(self, raw_result: np.ndarray) -> dict:
        return {
            "class_index": int(raw_result.argmax()),
            "score": float(raw_result.max()),
        }

@bentoml.service(resources={"cpu": "4", "memory": "4Gi"})
class ImageClassificationPipeline:
    # Compose multiple models
    preprocessor = bentoml.depends(ImagePreprocessor)
    classifier = bentoml.depends(ImageClassifier)
    postprocessor = bentoml.depends(ResultPostprocessor)

    @bentoml.api
    async def classify(self, image: Image.Image) -> dict:
        # 1. Preprocessing
        features = await self.preprocessor.process(image)

        # 2. Classification
        raw_result = await self.classifier.predict(features)

        # 3. Postprocessing
        return await self.postprocessor.format(raw_result)
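The preprocessor's scaling and HWC-to-CHW transpose can be verified in isolation, with a random array standing in for a real PIL image:

```python
import numpy as np

# A fake 224x224 RGB image in HWC layout, as np.array(pil_image) would give
img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

arr = img / 255.0                 # scale pixel values to [0, 1]
chw = arr.transpose(2, 0, 1)      # HWC -> CHW, as most PyTorch models expect
print(chw.shape)  # (3, 224, 224)
```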

Bento Build and Docker

bentofile.yaml

# bentofile.yaml
service: 'service:IrisClassifier'
labels:
  owner: ml-team
  project: iris-classifier
include:
  - '*.py'
python:
  packages:
    - scikit-learn==1.5.0
    - numpy
docker:
  python_version: '3.11'
  system_packages:
    - libgomp1
  env:
    BENTOML_PORT: '3000'
# Build Bento
bentoml build

# Check built Bentos
bentoml list
# Tag                              Size     Creation Time
# iris_classifier_service:xyz789   45MB     2026-03-03

# Generate Docker image
bentoml containerize iris_classifier_service:latest

# Run with Docker
docker run -p 3000:3000 iris_classifier_service:latest

Kubernetes Deployment

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-classifier
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: iris-classifier
  template:
    metadata:
      labels:
        app: iris-classifier
    spec:
      containers:
        - name: bento
          image: registry.example.com/iris_classifier_service:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: '1'
              memory: '1Gi'
            limits:
              cpu: '2'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: iris-classifier
  namespace: ml-serving
spec:
  selector:
    app: iris-classifier
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-classifier
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iris-classifier
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
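Assuming the image has been pushed to the registry referenced in the manifest and kubectl points at the target cluster, the rollout can be applied and verified along these lines (a sketch; resource names follow the manifests above):

```shell
# Create the namespace if it does not exist yet, then apply the manifests
kubectl create namespace ml-serving --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f k8s-deployment.yaml

# Wait for the rollout to complete and inspect the resulting objects
kubectl -n ml-serving rollout status deployment/iris-classifier
kubectl -n ml-serving get pods,svc,hpa
```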

Adaptive Batching

Adaptive batching is a core BentoML feature that automatically groups concurrent requests into a single batch to maximize GPU utilization.
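The mechanics can be sketched with a toy in-process batcher (illustrative only, not BentoML's actual implementation): requests queue up until either the batch is full or a latency deadline expires, then run as one call.

```python
import asyncio

MAX_BATCH_SIZE = 4
MAX_LATENCY_MS = 50

class ToyBatcher:
    """Collects submitted items and runs batch_fn on them as one batch."""

    def __init__(self, batch_fn):
        self.batch_fn = batch_fn  # processes a list of inputs in one call
        self.pending = []         # (item, future) pairs awaiting a batch

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self.pending.append((item, fut))
        if len(self.pending) >= MAX_BATCH_SIZE:
            self._flush()  # batch full: run immediately
        elif len(self.pending) == 1:
            # first item of a new batch starts the latency deadline
            loop.call_later(MAX_LATENCY_MS / 1000, self._flush)
        return await fut

    def _flush(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        outputs = self.batch_fn([item for item, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

async def main():
    batcher = ToyBatcher(lambda xs: [x * 2 for x in xs])
    # 6 concurrent "requests": 4 flush on batch size, 2 on the deadline
    return await asyncio.gather(*(batcher.submit(i) for i in range(6)))

results = asyncio.run(main())
print(results)  # [0, 2, 4, 6, 8, 10]
```

Each caller awaits only its own future, but the expensive model call runs once per batch rather than once per request.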

import bentoml
import numpy as np

@bentoml.service(
    traffic={
        "timeout": 30,
    },
)
class EmbeddingService:
    model = bentoml.models.get("sentence-transformer:latest")

    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(self.model.path)

    @bentoml.api(
        batchable=True,
        batch_dim=0,
        max_batch_size=64,
        max_latency_ms=100,
    )
    async def encode(self, texts: list[str]) -> np.ndarray:
        # Individual requests are automatically batched together
        embeddings = self.model.encode(texts)
        return embeddings

Monitoring

# monitored_service.py — adding custom metrics
import bentoml
import numpy as np
from prometheus_client import Counter, Histogram

# "class" is a Python keyword, so use "predicted_class" as the label name
prediction_counter = Counter(
    "predictions_total", "Total predictions", ["model", "predicted_class"]
)
latency_histogram = Histogram(
    "prediction_latency_seconds", "Prediction latency"
)

@bentoml.service()
class MonitoredClassifier:
    model_ref = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.clf = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api
    def predict(self, features: np.ndarray) -> dict:
        with latency_histogram.time():
            result = int(self.clf.predict([features])[0])
            prediction_counter.labels(
                model="iris_v1", predicted_class=str(result)
            ).inc()
            return {"class": result}

# Prometheus metrics endpoint
curl http://localhost:3000/metrics

Summary

BentoML significantly reduces the complexity of ML model serving:

  • Simple API Implementation: Create REST APIs in just a few lines using decorators
  • Model Version Management: Systematic management with the built-in Model Store
  • Adaptive Batching: Maximizes GPU utilization
  • Docker Automation: Reproducible builds with bentofile.yaml
  • Kubernetes Native: Auto-scaling with HPA

Quiz: BentoML Comprehension Check (7 Questions)

Q1. What is BentoML's Model Store?

A repository that stores trained models locally with version management and metadata. Models are saved using functions like bentoml.sklearn.save_model().

Q2. How does Adaptive Batching work?

It automatically collects individual requests and processes them as a single batch once the batch reaches max_batch_size or the max_latency_ms deadline expires, maximizing GPU efficiency.

Q3. What is the role of bentoml.depends()?

In multi-model pipelines, it injects other BentoML services as dependencies, automatically managing inter-service communication.

Q4. What is defined in bentofile.yaml?

The service entrypoint, Python package dependencies, Docker configuration, and files to include are declared.

Q5. What is BentoML's /healthz endpoint for?

It is used for Kubernetes readiness/liveness probes to check whether the service is ready and alive.

Q6. How do you specify GPU resources?

Using the @bentoml.service(resources={"gpu": 1, "gpu_type": "nvidia-a100"}) decorator.

Q7. What advantages does BentoML have over building with Flask/FastAPI directly?

It comes with built-in model version management, Adaptive Batching, automatic Docker builds, and declarative resource management, making it faster to get production-ready.