Introduction

Training an ML model and serving it in production are entirely different problems. **BentoML** is a framework that bridges this gap by packaging models as API services that can be deployed anywhere. Compared with hand-building an API in Flask or FastAPI, it adds built-in model versioning, request batching, and container image generation.

BentoML vs Building From Scratch

| Aspect             | Flask/FastAPI Manual Build      | BentoML                    |
| ------------------ | ------------------------------- | -------------------------- |
| API Implementation | Manual (routing, serialization) | Decorator-based automation |
| Model Versioning   | Must implement manually         | Built-in Model Store       |
| Batch Processing   | Must implement manually         | Built-in Adaptive Batching |
| Docker Build       | Manual Dockerfile               | Auto-generated             |
| GPU Support        | Manual configuration            | Declarative configuration  |

Installation and Basic Usage

    pip install bentoml

Saving a Model

save_model.py

    import bentoml
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    # Train the model
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    # Save to the BentoML Model Store
    saved_model = bentoml.sklearn.save_model(
        "iris_classifier",
        model,
        signatures={"predict": {"batchable": True}},
        labels={"owner": "ml-team", "stage": "production"},
        metadata={"accuracy": 0.96, "dataset": "iris"},
    )
    print(f"Model saved: {saved_model}")
    # Model saved: Model(tag="iris_classifier:abc123")

Check saved models:

    bentoml models list
    # Tag                     Module   Size   Creation Time
    # iris_classifier:abc123  sklearn  1.2MB  2026-03-03 05:00:00
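
The tag shown above has the form `name:version`, and the service code in the next section refers to `iris_classifier:latest`. As a rough mental model only (this is not how BentoML is implemented), a versioned store can be pictured as a mapping from a model name to its versions in creation order, where `latest` resolves to the newest one. `ToyModelStore`, its methods, and the second version `def456` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModelStore:
    """Hypothetical in-memory stand-in for a versioned model store."""

    # name -> list of (version, metadata), oldest first
    _entries: dict = field(default_factory=dict)

    def save(self, name: str, version: str, metadata: dict) -> str:
        """Record a new version and return its full tag."""
        self._entries.setdefault(name, []).append((version, metadata))
        return f"{name}:{version}"

    def resolve(self, tag: str) -> str:
        """Turn 'name', 'name:latest' or 'name:version' into a concrete tag."""
        name, _, version = tag.partition(":")
        versions = self._entries.get(name)
        if not versions:
            raise KeyError(f"no model named {name!r}")
        if version in ("", "latest"):
            return f"{name}:{versions[-1][0]}"
        if any(v == version for v, _ in versions):
            return f"{name}:{version}"
        raise KeyError(f"no version {version!r} for model {name!r}")


store = ToyModelStore()
store.save("iris_classifier", "abc123", {"accuracy": 0.96})
store.save("iris_classifier", "def456", {"accuracy": 0.97})  # made-up second version
print(store.resolve("iris_classifier:latest"))
print(store.resolve("iris_classifier:abc123"))
```

Pinning an explicit version in a service keeps deployments reproducible, while `latest` is convenient during development.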

Defining a Service

service.py

    from typing import Annotated

    import bentoml
    import numpy as np


    @bentoml.service(
        resources={"cpu": "2", "memory": "1Gi"},
        traffic={"timeout": 30, "concurrency": 32},
    )
    class IrisClassifier:
        # Reference to the model saved in the Model Store
        model = bentoml.models.get("iris_classifier:latest")

        def __init__(self):
            self.clf = bentoml.sklearn.load_model(self.model)
            self.target_names = ["setosa", "versicolor", "virginica"]

        @bentoml.api
        def predict(
            self,
            features: Annotated[np.ndarray, bentoml.validators.Shape((4,))],
        ) -> dict:
            prediction = self.clf.predict([features])[0]
            probabilities = self.clf.predict_proba([features])[0]
            return {
                "class": self.target_names[prediction],
                "probability": float(max(probabilities)),
                "all_probabilities": {
                    name: float(prob)
                    for name, prob in zip(self.target_names, probabilities)
                },
            }

        @bentoml.api
        def predict_batch(
            self,
            features: Annotated[np.ndarray, bentoml.validators.Shape((-1, 4))],
        ) -> list[dict]:
            predictions = self.clf.predict(features)
            probabilities = self.clf.predict_proba(features)
            return [
                {
                    "class": self.target_names[pred],
                    "probability": float(max(probs)),
                }
                for pred, probs in zip(predictions, probabilities)
            ]

Local serving:

    bentoml serve service:IrisClassifier

Test:

    curl -X POST http://localhost:3000/predict \
      -H "Content-Type: application/json" \
      -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
    # {"class": "setosa", "probability": 0.98, ...}
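
The same request can be sent from Python. The sketch below uses only the standard library and assumes the service is running locally on port 3000, as in the `curl` example. `build_predict_request` and `call_predict` are illustrative helper names, not part of BentoML.

```python
import json
import urllib.request


def build_predict_request(features, base_url="http://localhost:3000"):
    """Build the POST request that mirrors the curl command."""
    body = json.dumps({"features": list(features)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(features, base_url="http://localhost:3000", timeout=5.0):
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network, so it is safe to inspect.
req = build_predict_request([5.1, 3.5, 1.4, 0.2])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With the server from `bentoml serve` running, `call_predict([5.1, 3.5, 1.4, 0.2])` would return the decoded response dictionary.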

LLM Serving — OpenLLM Integration

llm_service.py

    import bentoml
    from vllm import LLM, SamplingParams


    @bentoml.service(
        resources={"gpu": 1, "gpu_type": "nvidia-a100"},
        traffic={"timeout": 120, "concurrency": 16},
    )
    class LLMService:
        def __init__(self):
            self.llm = LLM(
                model="meta-llama/Llama-3.1-8B-Instruct",
                tensor_parallel_size=1,
                max_model_len=8192,
                gpu_memory_utilization=0.9,
            )

        @bentoml.api
        async def generate(self, prompt: str, max_tokens: int = 512) -> str:
            sampling_params = SamplingParams(
                temperature=0.7,
                top_p=0.9,
                max_tokens=max_tokens,
            )
            # LLM.generate is synchronous and blocks while the model runs
            outputs = self.llm.generate([prompt], sampling_params)
            return outputs[0].outputs[0].text

        @bentoml.api
        async def chat(self, messages: list[dict]) -> str:
            prompt = self._format_chat(messages)
            return await self.generate(prompt)

        def _format_chat(self, messages: list[dict]) -> str:
            # Simplified role-tagged prompt format
            formatted = ""
            for msg in messages:
                role = msg["role"]
                content = msg["content"]
                formatted += f"<|{role}|>\n{content}\n"
            formatted += "<|assistant|>\n"
            return formatted
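
One thing to watch in the service above: `generate` is declared `async`, but the call to `self.llm.generate(...)` is synchronous, so it occupies the event loop while the model runs. A common way to keep an async handler responsive is to move blocking work onto a worker thread. The sketch below shows that pattern with a stand-in function; `slow_generate` is hypothetical and merely sleeps, and whether this pattern suits a particular vLLM setup should be verified separately.

```python
import asyncio
import time


def slow_generate(prompt: str) -> str:
    """Hypothetical stand-in for a blocking model call."""
    time.sleep(0.2)
    return f"echo: {prompt}"


async def generate(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(slow_generate, prompt)


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # The three calls overlap instead of running back to back.
    results = await asyncio.gather(*(generate(p) for p in ["a", "b", "c"]))
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the three calls would take at least 0.6 seconds; with the thread offload they overlap, which is what leaves the event loop free to accept other requests.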

Multi-Model Pipeline

pipeline_service.py

    import bentoml
    import numpy as np
    import torch
    from PIL import Image


    # Dependent services are defined first so bentoml.depends() can reference them.
    @bentoml.service(resources={"cpu": "1"})
    class ImagePreprocessor:
        @bentoml.api
        async def process(self, image: Image.Image) -> np.ndarray:
            img = image.resize((224, 224))
            arr = np.array(img) / 255.0
            # HWC -> CHW, the layout PyTorch vision models expect
            return arr.transpose(2, 0, 1)


    @bentoml.service(resources={"gpu": 1})
    class ImageClassifier:
        model = bentoml.models.get("resnet50:latest")

        def __init__(self):
            self.model = bentoml.pytorch.load_model(self.model)
            self.model.eval()
            self.device = torch.device("cuda")
            self.model.to(self.device)

        @bentoml.api
        async def predict(self, features: np.ndarray) -> np.ndarray:
            tensor = torch.tensor(features).unsqueeze(0).float().to(self.device)
            with torch.no_grad():
                output = self.model(tensor)
            return output.cpu().numpy()


    # ResultPostprocessor is not shown in this article; it must be defined as
    # another @bentoml.service above this class for the pipeline to load.
    @bentoml.service(resources={"cpu": "4", "memory": "4Gi"})
    class ImageClassificationPipeline:
        # Compose multiple models
        preprocessor = bentoml.depends(ImagePreprocessor)
        classifier = bentoml.depends(ImageClassifier)
        postprocessor = bentoml.depends(ResultPostprocessor)

        @bentoml.api
        async def classify(self, image: Image.Image) -> dict:
            # 1. Preprocessing
            features = await self.preprocessor.process(image)
            # 2. Classification
            raw_result = await self.classifier.predict(features)
            # 3. Postprocessing
            result = await self.postprocessor.format(raw_result)
            return result
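
`ResultPostprocessor` is referenced by the pipeline but never shown. As an assumption about what such a step might do, the sketch below turns raw classifier scores (logits) into the top-k labels with softmax probabilities. The function name `format_top_k`, the label list, and the choice of softmax are all illustrative, and the code uses plain Python lists rather than NumPy so that it runs without extra dependencies.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def format_top_k(logits: list[float], labels: list[str], k: int = 3) -> dict:
    """Return the k most probable labels, highest first."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return {
        "top_class": ranked[0][0],
        "predictions": [
            {"label": label, "probability": round(prob, 4)}
            for label, prob in ranked[:k]
        ],
    }


# Hypothetical scores for a three-class model
result = format_top_k([2.0, 0.5, 1.0], ["cat", "dog", "bird"], k=2)
print(result)
```

Inside a real service, the same logic would sit in an `@bentoml.api` method named `format`, matching the call made by the pipeline.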

Bento Build and Docker

bentofile.yaml

    service: 'service:IrisClassifier'
    labels:
      owner: ml-team
      project: iris-classifier
    include:
      - '*.py'
    python:
      packages:
        - scikit-learn==1.5.0
        - numpy
    docker:
      python_version: '3.11'
      system_packages:
        - libgomp1
      env:
        BENTOML_PORT: '3000'

Build the Bento:

    bentoml build

Check built Bentos:

    bentoml list
    # Tag                             Size  Creation Time
    # iris_classifier_service:xyz789  45MB  2026-03-03

Generate the Docker image:

    bentoml containerize iris_classifier_service:latest

Run with Docker:

    docker run -p 3000:3000 iris_classifier_service:latest
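
After `docker run`, the container needs a moment before it can answer requests. A script that starts the container and then sends traffic can poll until the service reports healthy. The sketch below is one way to do that; it assumes the `/healthz` endpoint used by the Kubernetes probes later in this article, and `wait_until_ready` and `http_ok` are illustrative helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Call check() repeatedly until it returns True or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a fake check that succeeds on the third attempt.
attempts = {"count": 0}


def fake_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until_ready(fake_check, timeout=5.0, interval=0.01)
print(ready, attempts["count"])
```

Against a real container, the call would be `wait_until_ready(lambda: http_ok("http://localhost:3000/healthz"))`.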

Kubernetes Deployment

k8s-deployment.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: iris-classifier  # assumed; the name is missing from the source
      namespace: ml-serving
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: iris-classifier
      template:
        metadata:
          labels:
            app: iris-classifier
        spec:
          containers:
            - name: bento
              image: registry.example.com/iris_classifier_service:latest
              ports:
                - containerPort: 3000
              resources:
                requests:
                  cpu: '1'
                  memory: '1Gi'
                limits:
                  cpu: '2'
                  memory: '2Gi'
              readinessProbe:
                httpGet:
                  path: /healthz
                  port: 3000
                initialDelaySeconds: 10
              livenessProbe:
                httpGet:
                  path: /healthz
                  port: 3000
                initialDelaySeconds: 30
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: iris-classifier  # assumed; the name is missing from the source
      namespace: ml-serving
    spec:
      selector:
        app: iris-classifier
      ports:
        - port: 80
          targetPort: 3000
      type: ClusterIP
    ---
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: iris-classifier  # assumed; the name is missing from the source
      namespace: ml-serving
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: iris-classifier  # assumed; must match the Deployment name
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu  # assumed; the metric name is missing from the source
            target:
              type: Utilization
              averageUtilization: 70
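
To make the autoscaler settings concrete: the Kubernetes documentation gives the HPA's core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, bounded by `minReplicas` and `maxReplicas`. The sketch below applies that formula with the values from the manifest above. It deliberately ignores the tolerance band, stabilization windows, and pod-readiness handling that the real controller applies, and `desired_replicas` is an illustrative name.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 70.0,
    min_replicas: int = 2,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA calculation, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(3, 90.0))   # 3 * 90 / 70 = 3.86, rounded up to 4
print(desired_replicas(3, 20.0))   # 0.86 rounds up to 1, raised to the minimum of 2
print(desired_replicas(8, 140.0))  # 16 exceeds the maximum, capped at 10
```

The second case shows why `minReplicas: 2` matters: even at low load the service keeps two pods, so a single pod failure does not take it offline.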

Adaptive Batching

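Adaptive batching is a core BentoML feature: the server automatically groups concurrent requests into a single batch so that the model, typically running on a GPU, processes them in one call and utilization stays high.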

    import bentoml
    import numpy as np


    @bentoml.service(
        traffic={
            "timeout": 30,
        },
    )
    class EmbeddingService:
        model = bentoml.models.get("sentence-transformer:latest")

        def __init__(self):
            from sentence_transformers import SentenceTransformer

            self.model = SentenceTransformer(self.model.path)

        @bentoml.api(
            batchable=True,
            batch_dim=0,
            max_batch_size=64,
            max_latency_ms=100,
        )
        async def encode(self, texts: list[str]) -> np.ndarray:
            # Individual requests are automatically batched together
            embeddings = self.model.encode(texts)
            return embeddings
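
As a mental model for what happens behind `batchable=True`, the sketch below implements a much simplified batcher in plain `asyncio`. Callers submit single items, and a background worker flushes a batch when either `max_batch_size` items have arrived or `max_latency_ms` has passed since the first item of the batch. This is not BentoML's implementation, which is more sophisticated; `MicroBatcher` and its methods are hypothetical names.

```python
import asyncio


class MicroBatcher:
    """Hypothetical, simplified request batcher (not BentoML's dispatcher)."""

    def __init__(self, batch_fn, max_batch_size=64, max_latency_ms=100):
        self.batch_fn = batch_fn  # takes a list of items, returns a list of results
        self.max_batch_size = max_batch_size
        self.max_latency = max_latency_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        """Queue one item and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: gather items into batches and dispatch them."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until a first item arrives
            deadline = loop.time() + self.max_latency
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    entry = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(entry)
            results = self.batch_fn([item for item, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def main():
    batch_sizes = []

    def fake_encode(texts):
        batch_sizes.append(len(texts))
        return [len(t) for t in texts]  # stand-in for real embeddings

    batcher = MicroBatcher(fake_encode, max_batch_size=4, max_latency_ms=50)
    worker = asyncio.create_task(batcher.run())
    texts = ["a", "bb", "ccc", "dddd", "eeeee", "ffffff"]
    results = await asyncio.gather(*(batcher.submit(t) for t in texts))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(main())
print(results)      # one result per caller, in submission order
print(batch_sizes)  # sizes of the batches the worker actually processed
```

The two limits trade throughput against latency: a larger `max_batch_size` gives the model more work per call, while `max_latency_ms` caps how long the first request in a batch waits for company.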

Monitoring

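Adding custom metrics:

    import bentoml
    import numpy as np
    from prometheus_client import Counter, Histogram

    # "class" is a Python keyword, so the label is named "class_name"
    prediction_counter = Counter(
        "predictions_total", "Total predictions", ["model", "class_name"]
    )
    latency_histogram = Histogram(
        "prediction_latency_seconds", "Prediction latency"
    )


    @bentoml.service
    class MonitoredClassifier:
        def __init__(self):
            # Assumes the iris model saved earlier in this article
            self.clf = bentoml.sklearn.load_model("iris_classifier:latest")

        @bentoml.api
        def predict(self, features: np.ndarray) -> dict:
            with latency_histogram.time():
                # Convert the NumPy integer to a plain int for JSON output
                result = int(self.clf.predict([features])[0])
            prediction_counter.labels(
                model="iris_v1", class_name=str(result)
            ).inc()
            return {"class": result}

Prometheus metrics endpoint:

    curl http://localhost:3000/metrics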
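
The `/metrics` endpoint returns the Prometheus text exposition format: `# HELP` and `# TYPE` comment lines followed by samples of the form `name{labels} value`. For a quick check in a script or test, the samples can be parsed with a few lines of Python. The sketch assumes samples carry no trailing timestamp, and the sample text is hand-written for illustration rather than captured from a real service.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample (metric name plus label set) to its value.

    Assumes no trailing timestamps on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


# Hand-written example in the exposition format (not real service output)
example = """\
# HELP predictions_total Total predictions
# TYPE predictions_total counter
predictions_total{model="iris_v1",class_name="0"} 12.0
predictions_total{model="iris_v1",class_name="2"} 3.0
# HELP prediction_latency_seconds Prediction latency
# TYPE prediction_latency_seconds histogram
prediction_latency_seconds_count 15.0
prediction_latency_seconds_sum 0.42
"""

metrics = parse_metrics(example)
total = sum(v for k, v in metrics.items() if k.startswith("predictions_total"))
print(total)
```

For anything beyond a smoke test, a Prometheus server scraping the endpoint is the usual approach; this parser only illustrates what the raw output contains.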

Summary

BentoML significantly reduces the complexity of ML model serving:

- **Simple API Implementation**: Create REST APIs in just a few lines using decorators

- **Model Version Management**: Systematic management with the built-in Model Store

- **Adaptive Batching**: Maximizes GPU utilization

- **Docker Automation**: Reproducible builds with bentofile.yaml

- **Kubernetes Native**: Auto-scaling with HPA

**Q1. What is BentoML's Model Store?**

A repository that stores trained models locally with version management and metadata. Models are saved using functions like `bentoml.sklearn.save_model()`.

**Q2. How does Adaptive Batching work?**

It automatically collects individual requests and processes them in a single batch once max_batch_size or max_latency_ms is reached, maximizing GPU efficiency.

**Q3. What is the role of bentoml.depends()?**

In multi-model pipelines, it injects other BentoML services as dependencies, automatically managing inter-service communication.

**Q4. What is defined in bentofile.yaml?**

It declares the service entrypoint, the Python package dependencies, the Docker configuration, and the files to include in the build.

**Q5. What is BentoML's /healthz endpoint for?**

It is used for Kubernetes readiness/liveness probes to check whether the service is ready and alive.

**Q6. How do you specify GPU resources?**

Using the `@bentoml.service(resources={"gpu": 1, "gpu_type": "nvidia-a100"})` decorator.

**Q7. What advantages does BentoML have over building with Flask/FastAPI directly?**

It comes with built-in model version management, Adaptive Batching, automatic Docker builds, and declarative resource management, making it faster to get production-ready.

Quiz

Q1: What is the main topic covered in "Building an ML Model Serving Pipeline with BentoML: From Packaging to Kubernetes Deployment"?

A hands-on guide to ML model serving with BentoML. Covers model packaging, API implementation, multi-model pipelines, Docker builds, and Kubernetes deployment.
