Open Source ML Platforms & MLOps 2026 Deep Dive - Kubeflow, Metaflow, Flyte, ZenML, MLflow, BentoML, ClearML, DVC, Weights & Biases

Introduction — by May 2026, the MLOps stack has standardized

Up through 2023, the MLOps stack was a "pick whatever works for your company" wilderness. By May 2026, much of that ambiguity has cleared. MLflow and Weights & Biases split experiment tracking. Kubeflow Pipelines and Flyte own Kubernetes-native orchestration. Metaflow and ZenML own Python-first workflows. BentoML 1.4, KServe, and Triton dominate serving. DVC and lakeFS own data versioning.

This article is not a marketing matrix. It is a candid, layer-by-layer comparison of what goes where in production in 2026, covering MLflow 3.0's changes, the Kubeflow 1.10 lineup, Flyte's commercial trajectory via Union.ai, ZenML Cloud, and BentoML 1.4's LLM serving mode, with real API shapes.

The 2026 MLOps stack — seven layers

Let us start with the big picture. The standard 2026 MLOps stack splits into seven layers.

  1. Experiment tracking — runs, parameters, metrics, artifacts
  2. Pipeline orchestration — DAG execution, distribution, caching
  3. Model registry — model versions, stages, metadata
  4. Model serving — online, batch, streaming inference
  5. Feature store — train/inference consistency for features
  6. Data + experiment versioning — datasets and code in sync
  7. ML monitoring — drift, performance decay, data quality

The era when one or two tools per layer sufficed is over. Today, even within a layer there is a split between LLM-specific and classical-ML tracks. We go layer by layer below.

Experiment tracking — MLflow 3.0 vs Weights & Biases

Two tools account for the vast majority of experiment tracking.

  • MLflow 3.0: OSS by Databricks, now under the Linux Foundation. 3.0 went GA in mid-2025. GenAI tracing, evaluation, and prompt registry are now first-class citizens.
  • Weights & Biases: SaaS-first but with an open SDK. Dominates on UX and visualization. W&B Models, W&B Weave (LLM tracing), and W&B Launch ship as one bundle.

The alternatives still matter.

  • Comet ML: a SaaS that integrates ML + LLM experiments + production monitoring.
  • Neptune.ai: repositioned as a metadata store for foundation model training.
  • Aim: Apache 2.0 OSS, self-hosted, with a snappy UI.
  • TensorBoard, PyTorch Lightning Logger: limited as standalone trackers; usually combined with the above.

A typical MLflow 3.0 flow looks like this.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("iris-rf-2026")

X, y = load_iris(return_X_y=True)

with mlflow.start_run() as run:
    mlflow.log_param("n_estimators", 200)
    model = RandomForestClassifier(n_estimators=200)
    model.fit(X, y)
    mlflow.log_metric("train_acc", model.score(X, y))
    # MLflow 3.x takes name=; artifact_path= is the deprecated 2.x spelling.
    mlflow.sklearn.log_model(model, name="model", registered_model_name="iris-rf")
    print(run.info.run_id)

The same flow in W&B is just as terse.

import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

wandb.init(project="iris-rf-2026", config={"n_estimators": 200})
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=200).fit(X, y)
wandb.log({"train_acc": model.score(X, y)})
wandb.finish()

Both tools have auto-instrumentation for scikit-learn, PyTorch, XGBoost, and others, so you get baseline metrics even without explicit logging. The difference is storage model and governance. MLflow is Apache 2.0 licensed and self-hosted by default; W&B is SaaS-first but has by far the smoother collaboration UX.

Pipeline orchestration — K8s-native vs Python-first

ML pipelines differ from generic ETL in that they must handle GPU scheduling, caching, and reproducibility together. As of May 2026, four tools split the market.

  • Kubeflow Pipelines + Kubeflow 1.10: CNCF incubation project. The canonical K8s-native ML platform. Components are containers.
  • Metaflow: built by Netflix, commercialized by Outerbounds. Python decorator-first. Deeply integrated with AWS Batch and Step Functions.
  • Flyte: built by Lyft, commercialized by Union.ai. Under LF AI & Data. K8s-native + type-safe.
  • ZenML: framework-agnostic, positioned as an "abstraction layer". MLflow, W&B, Kubeflow, Airflow are pluggable backends.

Adjacent tools — Prefect, Dagster, Airflow — also see ML use, but they are general data orchestrators rather than ML-specific, so we cover them separately in iter72/iter53.

DSL ergonomics rank roughly as ZenML > Metaflow > Flyte > Kubeflow Pipelines. The benchmark is whether a single Python file is enough.

Kubeflow 1.10 — a full ML platform on Kubernetes

Kubeflow is not a single tool but a full ML platform on top of Kubernetes. As of May 2026, the 1.10 lineup has these core components.

  • Kubeflow Pipelines (KFP): DAG pipeline SDK and UI.
  • Katib: distributed hyperparameter tuning.
  • Training Operator: PyTorchJob, TFJob, MPIJob, PaddleJob CRDs for distributed training.
  • KServe: model serving (formerly KFServing). Split into a sibling project but integrated.
  • Notebook Controller: JupyterHub-style notebook instances.
  • Spark Operator, Volcano: batch scheduling.

A KFP 2.x DSL example.

from kfp import dsl, compiler
from kfp.dsl import Dataset, Input, Output

@dsl.component(base_image="python:3.12", packages_to_install=["pandas"])
def preprocess(input_path: str, clean_data: Output[Dataset]):
    import pandas as pd
    df = pd.read_csv(input_path)
    # Write to the artifact path KFP provides so the file reaches the next step.
    df.dropna().to_csv(clean_data.path, index=False)

@dsl.component(base_image="python:3.12", packages_to_install=["pandas", "scikit-learn"])
def train(clean_data: Input[Dataset]) -> float:
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    df = pd.read_csv(clean_data.path)
    X, y = df.drop("label", axis=1), df["label"]
    return float(RandomForestClassifier().fit(X, y).score(X, y))

@dsl.pipeline(name="iris-pipeline")
def pipeline(input_path: str = "/data/iris.csv"):
    pre = preprocess(input_path=input_path)
    # An Output[Dataset] of one component feeds an Input[Dataset] of the next.
    train(clean_data=pre.outputs["clean_data"])

compiler.Compiler().compile(pipeline, "iris.yaml")

Kubeflow rewards teams who can run Kubernetes, but its onboarding cost is the highest of the four. That is why many teams reach for a managed service instead: Vertex AI Pipelines runs KFP-compiled pipelines directly, while SageMaker and Azure ML offer their own pipeline equivalents.

Metaflow — workflows that start with one Python decorator

Metaflow is the workflow library Netflix designed with data scientists in mind. Outerbounds offers commercial hosting; the core is Apache 2.0.

The abstractions are minimal.

  • FlowSpec: the workflow class.
  • @step: the step decorator.
  • self.next(...): explicit branching.
  • @batch, @kubernetes, @resources: per-step execution-environment decorators (GPUs are requested through arguments such as @resources(gpu=1)).
  • @retry, @timeout, @catch: reliability decorators.

A typical Metaflow flow.

from metaflow import FlowSpec, step, batch, retry

class IrisFlow(FlowSpec):
    @step
    def start(self):
        from sklearn.datasets import load_iris
        self.X, self.y = load_iris(return_X_y=True)
        self.next(self.train)

    @batch(cpu=4, memory=16000)
    @retry(times=2)
    @step
    def train(self):
        from sklearn.ensemble import RandomForestClassifier
        self.model = RandomForestClassifier(n_estimators=200).fit(self.X, self.y)
        self.acc = self.model.score(self.X, self.y)
        self.next(self.end)

    @step
    def end(self):
        print(f"acc={self.acc:.3f}")

if __name__ == "__main__":
    IrisFlow()

Metaflow's strength is that it is local-first. A flow runs on your laptop or in a notebook, and adding --with batch to the run command is all you need to move its steps to AWS Batch. Automatic artifact storage (S3) plus automatic tracking are built in, which is why many teams skip a separate MLflow.
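To make the reliability decorators listed above less abstract, here is a minimal sketch of what a retry wrapper does. It is not Metaflow's implementation: the `retry` function, the `flaky_step` example, and the `calls` counter are all hypothetical and use only the standard library. The one behaviour it mirrors is that `times=2` means two retries after the first attempt.

```python
import functools
import time


def retry(times: int = 3, delay_seconds: float = 0.0):
    """Re-run the wrapped function up to `times` extra times after a failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(times + 1):  # first try plus `times` retries
                try:
                    return func(*args, **kwargs)
                except Exception as err:  # a real runner would filter error types
                    last_error = err
                    if attempt < times:
                        time.sleep(delay_seconds)
            raise last_error
        return wrapper
    return decorator


calls = {"count": 0}


@retry(times=2)
def flaky_step() -> str:
    """Hypothetical step that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = flaky_step()
```

In a real orchestrator the retry happens at the scheduler level (a fresh container per attempt), which is why step code should be idempotent.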

Flyte — Kubernetes-native, type-safe workflows

Flyte is the K8s-native workflow tool built by Lyft and commercialized by Union.ai. It is a graduated project under LF AI & Data. Its biggest selling points are type safety and caching.

  • Python type annotations double as input/output schemas.
  • Automatic caching: identical inputs hit cache. Significant cost savings.
  • K8s-native: Pod, Deployment, and GPU/TPU scheduling are first-class.
  • Multi-language: Python is primary; Java/Scala SDKs exist.

A Flyte example.

from flytekit import task, workflow, Resources
from flytekit.types.file import FlyteFile

@task(cache=True, cache_version="1.0", requests=Resources(cpu="2", mem="4Gi"))
def preprocess(input_path: str) -> FlyteFile:
    import pandas as pd
    df = pd.read_csv(input_path).dropna()
    out = "/tmp/clean.csv"
    df.to_csv(out, index=False)
    # Returning a FlyteFile uploads the file to blob storage, so the next
    # task can read it even though it runs in a different pod.
    return FlyteFile(out)

@task(requests=Resources(cpu="4", mem="16Gi", gpu="1"))
def train(data: FlyteFile) -> float:
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    df = pd.read_csv(data.download())  # fetch a local copy first
    X, y = df.drop("label", axis=1), df["label"]
    return float(RandomForestClassifier().fit(X, y).score(X, y))

@workflow
def iris_wf(input_path: str = "/data/iris.csv") -> float:
    clean = preprocess(input_path=input_path)
    return train(data=clean)

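Flyte's edge is being K8s-friendly with caching that actually works. Re-running a cached task with the same inputs and the same cache_version returns the stored output instead of executing again, saving time and money. Bump cache_version whenever the task logic changes.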
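The caching behaviour can be pictured with a small sketch. This is not Flyte's implementation; `cache_key` and `CachingRunner` are hypothetical names, and the key here is simply a hash of task name, cache version, and inputs. Note that for a plain string input the key depends on the string itself, not on the contents of the file it points to.

```python
import hashlib
import json


def cache_key(task_name: str, cache_version: str, inputs: dict) -> str:
    """Derive a deterministic key from task identity, version, and inputs."""
    payload = json.dumps(
        {"task": task_name, "version": cache_version, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachingRunner:
    """Runs a function only when its cache key has not been seen before."""

    def __init__(self) -> None:
        self.store: dict[str, object] = {}
        self.executions = 0

    def run(self, func, cache_version: str, **inputs):
        key = cache_key(func.__name__, cache_version, inputs)
        if key in self.store:
            return self.store[key]  # cache hit: skip execution
        self.executions += 1
        result = func(**inputs)
        self.store[key] = result
        return result


def preprocess(input_path: str) -> str:
    """Hypothetical task: pretend to clean a file and return the new path."""
    return input_path.replace(".csv", ".clean.csv")


runner = CachingRunner()
first = runner.run(preprocess, "1.0", input_path="/data/iris.csv")
second = runner.run(preprocess, "1.0", input_path="/data/iris.csv")  # cache hit
third = runner.run(preprocess, "1.1", input_path="/data/iris.csv")   # version bump
```

The third call executes again because the version changed, which is the escape hatch when task logic changes but inputs do not.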

ZenML — a framework-agnostic abstraction layer

ZenML does not compete with the tools above; it sits on top of them as an abstraction layer. ZenML Pipelines can use Kubeflow, Airflow, Tekton, Vertex, SageMaker, or AWS Step Functions as the backend.

Core abstractions.

  • @step, @pipeline: Python decorators for steps and pipelines.
  • Stack: an environment composed of orchestrator + artifact_store + container_registry + experiment_tracker.
  • Component: a swappable implementation per layer — MLflow, W&B, Neptune.ai, Kubeflow, Vertex, etc.
  • Materializer: a serializer for user-defined types.

ZenML code.

from zenml import pipeline, step
from typing import Tuple
import pandas as pd

@step
def load() -> pd.DataFrame:
    import pandas as pd
    return pd.read_csv("/data/iris.csv")

@step
def split(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    return df.iloc[:120], df.iloc[120:]

@step
def train(train_df: pd.DataFrame) -> float:
    from sklearn.ensemble import RandomForestClassifier
    X, y = train_df.drop("label", axis=1), train_df["label"]
    return RandomForestClassifier().fit(X, y).score(X, y)

@pipeline
def iris_pipeline():
    df = load()
    tr, _ = split(df)
    train(tr)

iris_pipeline()

ZenML's value is deferring vendor lock-in. Moving a pipeline from a local run to SageMaker to Kubeflow requires almost no code change. The cost is the abstraction tax: fully exploiting a backend's unique features eventually means calling that backend's SDK directly.
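The stack idea can be illustrated with a toy version of the pattern. This is a sketch of the design, not ZenML's API: `Orchestrator`, `LocalOrchestrator`, `FakeRemoteOrchestrator`, `Stack`, and the bucket URI are all hypothetical. The point is that the pipeline function never changes; only the stack passed to it does.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Orchestrator(Protocol):
    def run(self, steps: list[Callable[[], str]]) -> list[str]: ...


class LocalOrchestrator:
    """Runs every step in the current process."""

    def run(self, steps):
        return [f"local:{step()}" for step in steps]


class FakeRemoteOrchestrator:
    """Stand-in for a remote backend; a real one would submit containers."""

    def run(self, steps):
        return [f"remote:{step()}" for step in steps]


@dataclass
class Stack:
    orchestrator: Orchestrator
    artifact_store: str


def load() -> str:
    return "load"


def train() -> str:
    return "train"


def run_pipeline(stack: Stack) -> list[str]:
    # Pipeline code is identical regardless of which stack is active.
    return stack.orchestrator.run([load, train])


local_result = run_pipeline(Stack(LocalOrchestrator(), "file:///tmp/artifacts"))
remote_result = run_pipeline(
    Stack(FakeRemoteOrchestrator(), "s3://hypothetical-bucket/artifacts")
)
```

The abstraction tax shows up as soon as a step needs something only one backend offers: the `Orchestrator` interface has no place to express it.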

Model registry — MLflow Registry vs BentoML Model Store vs Hugging Face Hub

A trained model has to live somewhere with versions, stages (Staging/Production), and metadata. Three candidates show up most often as of May 2026.

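  • MLflow Model Registry: bundled with MLflow and closest to a standard. URI pattern models:/name/Production for stages, or models:/name@alias for aliases, which replace the stages deprecated since MLflow 2.9.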
  • BentoML Model Store: paired with BentoML serving. Flow is bentoml.transformers.save_model().
  • Hugging Face Hub: public/private repos for model sharing. The de facto standard for the transformers/diffusion ecosystem.

Enterprises usually combine MLflow Model Registry + their own S3, open-model and LLM teams use a private Hugging Face Hub repo, and BentoML-centric ops teams use the BentoML Model Store as-is.

Loading from MLflow Registry.

import mlflow.pyfunc

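# Alias-based URI ("champion" is an example alias); the stage form
# "models:/iris-rf/Production" is deprecated since MLflow 2.9.
model_uri = "models:/iris-rf@champion"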
model = mlflow.pyfunc.load_model(model_uri)
predictions = model.predict([[5.1, 3.5, 1.4, 0.2]])
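Whether a registry uses stages or aliases, the mechanism is a named pointer to a numbered version. The sketch below shows that idea with an in-memory registry. It is not MLflow's implementation: `ModelRegistry`, the alias name, and the artifact URIs are hypothetical, and only the `models:/name/<version>` and `models:/name@alias` URI shapes are borrowed.

```python
class ModelRegistry:
    """Toy registry: numbered versions plus movable aliases."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}       # name -> artifact URIs
        self._aliases: dict[tuple[str, str], int] = {}  # (name, alias) -> version

    def register(self, name: str, artifact_uri: str) -> int:
        versions = self._versions.setdefault(name, [])
        versions.append(artifact_uri)
        return len(versions)  # versions are 1-based

    def set_alias(self, name: str, alias: str, version: int) -> None:
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._aliases[(name, alias)] = version

    def resolve(self, uri: str) -> str:
        """Resolve 'models:/name/3' or 'models:/name@alias' to an artifact URI."""
        if not uri.startswith("models:/"):
            raise ValueError(f"unsupported URI: {uri}")
        ref = uri[len("models:/"):]
        if "@" in ref:
            name, alias = ref.split("@", 1)
            version = self._aliases[(name, alias)]
        else:
            name, version_text = ref.rsplit("/", 1)
            version = int(version_text)
        return self._versions[name][version - 1]


registry = ModelRegistry()
v1 = registry.register("iris-rf", "s3://hypothetical-bucket/iris-rf/1")
v2 = registry.register("iris-rf", "s3://hypothetical-bucket/iris-rf/2")

registry.set_alias("iris-rf", "champion", v1)
before = registry.resolve("models:/iris-rf@champion")

registry.set_alias("iris-rf", "champion", v2)  # promotion is just a pointer move
after = registry.resolve("models:/iris-rf@champion")
```

Because serving code asks for the alias rather than a version number, promotion and rollback need no redeploy of the caller.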

Model serving — BentoML 1.4, KServe, Triton, Ray Serve split the work

No single serving tool covers every case. The split depends on scenario.

  • BentoML 1.4 + Yatai: Python-friendly serving framework. Packages model + business logic as a single "Bento". 1.4 in Q1 2026 made LLM serving mode (vLLM, TGI backends) GA.
  • Seldon Core 2: K8s-native serving. Mesh-based multi-model routing.
  • KServe: formerly KFServing. Often paired with Kubeflow but standalone-capable. Serverless/autoscale + standard inference protocol.
  • TorchServe: built by Meta + AWS. The PyTorch serving standard.
  • TensorFlow Serving: serves TF SavedModels. C++ core, very stable.
  • NVIDIA Triton Inference Server: GPU-optimal serving. Integrates TensorRT, ONNX, PyTorch, TF, vLLM backends.
  • Ray Serve: Python serving atop Ray clusters. Managed by Anyscale.

A BentoML 1.4 LLM serving example.

import bentoml

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"

# Resource values are illustrative; a 70B model needs far more GPU memory.
@bentoml.service(resources={"gpu": 1, "memory": "16Gi"})
class LlamaService:
    def __init__(self) -> None:
        from vllm import LLM
        # vLLM loads the matching tokenizer itself, so no separate
        # AutoTokenizer is needed for plain text generation.
        self.llm = LLM(model=MODEL_ID)

    @bentoml.api
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        from vllm import SamplingParams
        # generate() expects a SamplingParams object, not a dict.
        params = SamplingParams(max_tokens=max_tokens)
        outputs = self.llm.generate([prompt], sampling_params=params)
        return outputs[0].outputs[0].text

For classical (non-LLM) models, TF Serving, TorchServe, and Triton are faster and more stable. For LLM inference, the standard pattern is to plug an engine (vLLM, SGLang, TGI, TensorRT-LLM; see the iter69 inference-engine deep dive) into BentoML or Triton.

Feature stores — Feast 0.40, Hopsworks, Featureform, Tecton

The feature store layer keeps the features used in training identical to those served at inference.

  • Feast 0.40: Apache 2.0 OSS. K8s or local deployment. Online stores: Redis, DynamoDB, Bigtable.
  • Hopsworks: open core. Both self-hosted and SaaS. Feature store + notebooks + serving as one.
  • Featureform: a virtualization layer that abstracts existing data warehouses into feature stores.
  • Tecton: commercial SaaS. Founded by the team behind Uber's Michelangelo platform and long a major contributor to Feast. Common in enterprise.

A Feast 0.40 definition example.

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

user = Entity(name="user", join_keys=["user_id"])

user_stats_source = FileSource(
    path="s3://feast-data/user_stats.parquet",
    timestamp_field="event_ts",
)

user_stats_view = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=7),
    schema=[
        Field(name="purchase_count_7d", dtype=Int64),
        Field(name="purchase_value_7d", dtype=Float32),
    ],
    source=user_stats_source,
)

It is tempting to put feature stores off forever. But the moment a "train-serve skew" incident happens, they become mandatory.
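Train-serve skew usually comes from joining a training label with a feature value that did not exist yet at label time. A feature store prevents this with a point-in-time lookup bounded by a TTL. The sketch below shows that rule in plain Python; `point_in_time_value` and the sample history are hypothetical, and this is not Feast's code.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Optional


def point_in_time_value(
    history: list[tuple[datetime, float]],
    as_of: datetime,
    ttl: timedelta,
) -> Optional[float]:
    """Return the latest value with event time <= as_of and within ttl.

    `history` must be sorted by event time.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of) - 1
    if idx < 0:
        return None  # no feature value existed yet
    ts, value = history[idx]
    if as_of - ts > ttl:
        return None  # the newest eligible value is too stale
    return value


# Hypothetical purchase_count_7d history for one user.
history = [
    (datetime(2026, 5, 1), 2.0),
    (datetime(2026, 5, 4), 5.0),
    (datetime(2026, 5, 20), 9.0),
]
ttl = timedelta(days=7)

# A label observed on May 6 must see the May 4 value, never the May 20 one.
train_value = point_in_time_value(history, datetime(2026, 5, 6), ttl)
# On May 15 the newest eligible value is 11 days old, beyond the 7-day TTL.
stale_value = point_in_time_value(history, datetime(2026, 5, 15), ttl)
```

Using the same lookup rule for building training sets and for online serving is what keeps the two consistent.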

Data + experiment versioning — DVC, lakeFS, Pachyderm, Quilt

Git handles code; data is another story. Four tools split the data-versioning market.

  • DVC (Data Version Control): data and model versioning on top of Git. .dvc metafiles track large files while real data lives on S3/GCS/Azure.
  • lakeFS: a Git model for the data lake. Branches, merges, and commits work at petabyte scale on S3.
  • Pachyderm: data versioning + data lineage + automatic pipeline triggers.
  • Quilt Data: a data-package catalog.

A DVC flow.

git init
dvc init
dvc remote add -d storage s3://my-bucket/dvc

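dvc add data/raw/train.csv
# dvc add writes the .gitignore next to the data file; .dvc/config holds the remote
git add data/raw/train.csv.dvc data/raw/.gitignore .dvc/config
git commit -m "track training data"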

dvc push   # data to S3
git push   # metadata to Git

DVC's other strength is DVC Pipelines. Define stages and dependencies in dvc.yaml and it re-runs only the changed stages. With little infrastructure you get a reproducible ML workflow.
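The "re-run only changed stages" behaviour comes down to comparing dependency hashes against the ones recorded after the last run. Here is a minimal sketch of that check. The function names and the lock structure are hypothetical and are not DVC's file format; SHA-256 is used here for portability, whereas DVC's metafiles record MD5 hashes.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash of a dependency file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_lock(deps: list[Path]) -> dict[str, str]:
    """Snapshot dependency hashes after a successful stage run."""
    return {str(dep): file_hash(dep) for dep in deps}


def stage_is_stale(deps: list[Path], lock: dict[str, str]) -> bool:
    """A stage is stale when any dependency hash differs from the lock."""
    return any(lock.get(str(dep)) != file_hash(dep) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("a,b\n1,2\n")

    lock = record_lock([data])                # recorded after the first run
    unchanged = stage_is_stale([data], lock)  # nothing changed: skip the stage

    data.write_text("a,b\n1,2\n3,4\n")        # the data changes
    changed = stage_is_stale([data], lock)    # hash differs: re-run the stage
```

Because the check is content-based rather than timestamp-based, touching a file without changing it does not trigger a re-run.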

ML monitoring — Evidently, Arize, WhyLabs, Fiddler

Once deployed, models degrade as data shifts (data/concept drift). Monitoring catches that.

  • Evidently AI: Apache 2.0 OSS. Reports and dashboards generated from Python. K8s self-hostable.
  • Arize AI: commercial SaaS. Unified ML + LLM observability. Phoenix is the OSS spin-off.
  • WhyLabs: data quality + drift + anomaly detection. Has a free tier.
  • Fiddler AI: explainability-first. Enterprise compliance track.

An Evidently example.

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

ref = pd.read_csv("reference.csv")
cur = pd.read_csv("current.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref, current_data=cur)
report.save_html("drift.html")

Monitoring is "easy to set up, often turned off in production due to noise". Threshold tuning and dashboard customization end up being the harder problems.
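To see why threshold tuning is the hard part, it helps to look at a drift score directly. The sketch below computes the Population Stability Index (PSI) for one numeric feature using only the standard library. It is an illustration, not what Evidently computes by default; the `psi` function, the bin count, and the sample data are assumptions.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 5) -> float:
    """Population Stability Index over equal-width bins of the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [float(i % 10) for i in range(1000)]
same = psi(reference, reference)                        # identical data
shifted = psi(reference, [v + 4.0 for v in reference])  # every value moved by +4
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift, but the bin count and the empty-bin floor both move the score, so an alert threshold copied from a blog post tends to be either noisy or silent on real data.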

Notebooks & IDEs — Jupyter, Marimo, Hex, Deepnote, Quarto

ML workflows still start in notebooks.

  • Jupyter / JupyterLab 4 / JupyterHub: the de facto standard. Multi-user means JupyterHub. Covered in depth in the iter27 notebooks deep dive.
  • Marimo: reactive notebooks. Cell execution order is determined by the dependency graph. Strong reproducibility.
  • Hex, Deepnote: managed collaborative notebooks.
  • Quarto: turns notebooks + Markdown into reports, books, and sites.

Marimo's distinguishing feature is clear: inter-cell dependencies are extracted from code analysis, eliminating an entire class of order-dependent bugs. Think of it as the notebook a data scientist hands off without needing a README.
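The dependency extraction can be sketched with the standard library's `ast` and `graphlib` modules. This is a simplified illustration of the idea, not Marimo's implementation: it only looks at plain variable names, and the cell IDs and cell sources are hypothetical.

```python
import ast
from graphlib import TopologicalSorter


def names_defined_and_used(source: str) -> tuple[set[str], set[str]]:
    """Return (names assigned in the cell, names read from outside the cell)."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return defined, used - defined


def execution_order(cells: dict[str, str]) -> list[str]:
    """Order cells so that every cell runs after the cells it reads from."""
    info = {cell_id: names_defined_and_used(src) for cell_id, src in cells.items()}
    owner = {name: cell_id for cell_id, (defined, _) in info.items() for name in defined}
    graph = {
        cell_id: {owner[name] for name in used if name in owner}
        for cell_id, (_, used) in info.items()
    }
    return list(TopologicalSorter(graph).static_order())


# Cells deliberately listed out of order, as they might appear on the page.
cells = {
    "plot": "summary = total / count",
    "load": "values = [1, 2, 3]",
    "stats": "total = sum(values)\ncount = len(values)",
}
order = execution_order(cells)
```

The order comes from the code, not from the position of the cells on the page, which is what removes the "run cell 7 before cell 3" class of bugs.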

Vector DB integration — Pinecone, Weaviate, Milvus, Qdrant

RAG and embedding workflows demand a vector DB. iter62 covers vector DBs in depth; from the MLOps angle, the key is integration with model serving. BentoML, Ray Serve, and KServe all standardize the pattern of packaging vector DB clients alongside models.

Combined with LangChain/LlamaIndex-style LLM frameworks, vector DBs stop being "separate infrastructure" and become "first-class components of the ML service".

End-to-end "MLflow-style" managed platforms

If self-hosting is too much, you go managed.

  • Databricks Lakehouse Platform: by MLflow's authors. Notebooks + Spark + MLflow + Unity Catalog in one.
  • Vertex AI: Google. AutoML, Pipelines, Endpoints, Feature Store unified.
  • SageMaker: AWS. Studio, Pipelines, Model Registry, Endpoints unified.
  • Azure ML: Microsoft. Workspace + Designer + Endpoints + Responsible AI.

Each managed platform pushes its own SDK while still offering standard entry points compatible with MLflow, Kubeflow Pipelines, PyTorch, and TensorFlow. The trade-off is zero infra burden vs vendor lock-in.

LLM-specific MLOps (LLMOps) — LangSmith, Langfuse, Arize Phoenix, Helicone

A subcategory called "LLMOps" formed across 2024-2025. It overlaps with classical MLOps but differs on several axes.

  • Unit of trace: instead of metrics, the primary data is the prompt/response trace.
  • Evaluation: ground truth is fuzzy, so LLM-as-judge, rules, and user feedback combine.
  • Prompt registry: prompt versions are managed separately from model versions.
  • Cost tracking: per-token cost becomes the dominant cost metric.
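A small sketch shows how the trace, prompt-version, and cost axes above fit together. Everything here is hypothetical: the `Trace` and `Price` types, the model name, and the prices are made up for illustration and do not reflect any vendor's rates or any tool's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_million: float   # USD per 1M prompt tokens
    output_per_million: float  # USD per 1M completion tokens


@dataclass(frozen=True)
class Trace:
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices, for illustration only.
PRICES = {"example-model": Price(input_per_million=2.0, output_per_million=8.0)}


def trace_cost(trace: Trace) -> float:
    """Cost of one request in USD."""
    price = PRICES[trace.model]
    return (
        trace.input_tokens * price.input_per_million
        + trace.output_tokens * price.output_per_million
    ) / 1_000_000


def cost_by_prompt_version(traces: list[Trace]) -> dict[str, float]:
    """Aggregate cost per prompt version, the unit LLMOps tools report on."""
    totals: dict[str, float] = {}
    for t in traces:
        totals[t.prompt_version] = totals.get(t.prompt_version, 0.0) + trace_cost(t)
    return totals


traces = [
    Trace("example-model", "v1", input_tokens=1_000, output_tokens=500),
    Trace("example-model", "v2", input_tokens=4_000, output_tokens=500),
]
totals = cost_by_prompt_version(traces)
```

Here the longer v2 prompt doubles the per-request cost without any model change, which is why prompt versions are tracked separately from model versions.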

Key tools.

  • LangSmith: standard within the LangChain world. SaaS + self-host.
  • Langfuse: open-source self-host track. Tracing, evaluation, prompt management.
  • Arize Phoenix: Arize's OSS spin-off. LLM tracing and evaluation.
  • Helicone: OpenAI/Anthropic API gateway + traces.

Covered more deeply in iter77 (LLM observability) and iter83 (fine-tuning).

Korean MLOps — Naver Cloud, Kakao Enterprise, NCSOFT, Hyundai Motor Group

The Korean MLOps ecosystem clusters along these axes.

  • Naver Cloud ML platform: HyperCLOVA X-based fine-tuning + an in-house MLOps toolset.
  • Kakao Enterprise: the ML track on Kakao i Cloud. Mixes Kubeflow with in-house tools.
  • NCSOFT AI Center: game-AI and character-AI pipelines built in-house.
  • Hyundai Motor Group AI Lab: autonomous-driving data + model automation, mixing Flyte and Kubeflow.
  • Korean ML engineer communities: MLOps Korea, Pseudo Lab, Modulabs tracks run regular meetups.

Big companies typically build internal platforms on top of MLflow/Kubeflow; startups often go straight to SageMaker or Vertex AI.

Japanese MLOps — PFN, Mercari, Cybozu, Recruit

Japan's MLOps ecosystem has these standout patterns.

  • PFN (Preferred Networks): authors of Optuna (HPO), which they pair with their in-house workflow tool Allgo.
  • Cybozu: ML pipelines on kintone data. Argo Workflows plus in-house tools.
  • Recruit AI Lab: ML pipelines in ads and recruiting. Vertex AI + MLflow.
  • Mercari ML platform: runs an in-house ML platform called Merlin. KServe + Argo + Feast.
  • NTT Communications - PFN partnership: large-scale training on telecom data.

In Japan, Argo Workflows (classical) and Kubeflow + KServe see high adoption. Qiita and Zenn have seen a rapid rise in Flyte/Metaflow tutorials in Japanese throughout 2026.

CNCF AI & Data subgroup — the standardization current

In late 2024 the CNCF officially launched the AI & Data subgroup. As of May 2026, these projects sit under it.

  • Kubeflow Pipelines, Kubeflow Training Operator: ML pipelines + distributed training.
  • Flyte: workflow (graduated under LF AI & Data, the sibling Linux Foundation umbrella, rather than a CNCF project).
  • KServe: serving (incubation).
  • Volcano: batch scheduler.
  • Spark Operator: Spark on K8s.
  • Argo Workflows: classical but frequently used for ML workflows.

As CNCF becomes the standardization hub, the ML-on-K8s interface is converging. Managed platforms are increasingly adopting the KServe v1/v2 inference protocols.
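For reference, a v2 (Open Inference Protocol) REST request is a JSON body with named, typed, shaped tensors posted to /v2/models/<name>/infer. The sketch below builds such a body without sending it. The helper name, the tensor name "input-0", and the model name are hypothetical; the actual tensor name and datatype depend on the deployed model.

```python
import json


def build_v2_infer_request(name: str, rows: list[list[float]]) -> dict:
    """Build a v2 inference request body for a 2-D FP32 input tensor."""
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    return {
        "inputs": [
            {
                "name": name,
                "shape": [len(rows), n_cols],
                "datatype": "FP32",
                "data": [value for row in rows for value in row],  # row-major
            }
        ]
    }


model_name = "iris-rf"  # hypothetical deployed model name
url_path = f"/v2/models/{model_name}/infer"
body = build_v2_infer_request("input-0", [[5.1, 3.5, 1.4, 0.2]])
payload = json.dumps(body)  # what an HTTP client would POST to url_path
```

Because the body carries no framework-specific fields, the same client code can target any server that implements the protocol.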

Composition patterns — how real production stacks are wired

Real companies rarely use one tool end to end. Common composition patterns:

  • Startup minimum: MLflow (tracking + registry) + Metaflow (workflow) + BentoML (serving) + Evidently (monitoring).
  • K8s enterprise: Kubeflow (pipelines) + MLflow (tracking) + KServe (serving) + Feast (features) + DVC (data).
  • AWS-managed first: SageMaker Pipelines + SageMaker Model Registry + SageMaker Endpoints + W&B (SaaS tracking) + Evidently (monitor).
  • GCP-managed first: Vertex AI Pipelines + Vertex Endpoints + W&B + Featureform.
  • LLM-first startup: Hugging Face Hub (models) + Langfuse (tracing) + BentoML 1.4 (serving) + ZenML (workflow abstraction).

The key is "what is first-class and what is secondary". If K8s is first-class, you go Kubeflow + KServe; if Python workflows are, Metaflow + BentoML; if LLMs are, HF Hub + Langfuse + vLLM.

Adoption roadmap — zero to production

If you are introducing MLOps from scratch, the safest order is:

  1. Experiment tracking first. MLflow or W&B. Done within a week.
  2. Model registry in the same tracking tool. A separate tool is not needed yet.
  3. Model serving with at least BentoML or FastAPI + ONNX/TorchScript. Managed (SageMaker Endpoints, Vertex Endpoints) is fine too.
  4. Pipeline orchestration is cron + Makefile when you have 1-2 data scientists. Add Metaflow/ZenML when you have three or more or you need daily training.
  5. Data versioning kicks in once datasets exceed gigabytes and change often: DVC or lakeFS.
  6. Feature store comes after the first train-serve consistency incident. Do not install it on day one.
  7. Monitoring starts once there is at least one production model with real users.

Installing all seven on day one fails 90% of the time. Adopting one layer at a time is the established wisdom.

Closing — May 2026, "MLOps remains a mosaic"

We opened by saying the stack has standardized, but it is also true that MLOps remains a mosaic. No company runs all seven layers under a single vendor. OSS combos like MLflow + Metaflow + BentoML + DVC + Evidently are still the most common.

The biggest shift is the split between LLMOps and classical MLOps. Even within one company, the two tracks now run separate tool stacks. LLMOps coalesces around LangSmith/Langfuse/Phoenix; classical solidifies around MLflow/Kubeflow.

Do not over-optimize the tool selection. Any combination that gives you "four-way versioning of code, data, models, and results" solves 90% of the problem. The rest is priorities.

References