The Complete MLflow Guide: From Experiment Tracking to Model Registry and Production Deployment

What is MLflow?

MLflow is an open-source platform for managing the ML lifecycle. It consists of four core components:

  • MLflow Tracking: Records experiment parameters, metrics, and artifacts
  • MLflow Projects: Packages ML code for reproducibility
  • MLflow Models: Packages models from various frameworks in a unified format
  • MLflow Model Registry: Model version management and deployment workflows

Installation and Server Setup

Basic Installation

# pip installation
pip install mlflow

# Additional framework support
pip install mlflow[extras]  # sklearn, tensorflow, pytorch, etc.

# Start server (local)
mlflow server --host 0.0.0.0 --port 5000

# Production server with PostgreSQL + S3 backend
mlflow server \
  --backend-store-uri postgresql://mlflow:password@localhost:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 --port 5000
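
For quick local experimentation you can skip PostgreSQL and S3 entirely; a SQLite-backed server is enough. A minimal config sketch (the database and artifact paths are placeholders):

```shell
# Lightweight local server: SQLite metadata store + local artifact directory
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlartifacts \
  --host 127.0.0.1 --port 5000
```

This keeps everything on one machine, which is fine for solo work but not for a shared team server.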

Deployment with Docker Compose

# docker-compose.yml
services:
  mlflow:
    # Note: the stock image ships only mlflow itself; a PostgreSQL/S3 backend
    # typically needs a derived image with psycopg2-binary and boto3 installed
    image: ghcr.io/mlflow/mlflow:v2.18.0
    ports:
      - '5000:5000'
    environment:
      - MLFLOW_BACKEND_STORE_URI=postgresql://mlflow:password@postgres:5432/mlflow
      - MLFLOW_DEFAULT_ARTIFACT_ROOT=s3://mlflow-artifacts/
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:password@postgres:5432/mlflow
      --default-artifact-root s3://mlflow-artifacts/
      --host 0.0.0.0 --port 5000
    depends_on:
      - postgres

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: password
      POSTGRES_DB: mlflow
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Experiment Tracking

Basic Usage

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Configure tracking server
mlflow.set_tracking_uri("http://localhost:5000")

# Create/set experiment
mlflow.set_experiment("iris-classification")

# Prepare data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Run experiment
with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {
        "n_estimators": 100,
        "max_depth": 5,
        "min_samples_split": 2,
        "random_state": 42
    }
    mlflow.log_params(params)

    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Predictions and metrics
    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
        "precision_macro": precision_score(y_test, y_pred, average="macro")
    }
    mlflow.log_metrics(metrics)

    # Tags
    mlflow.set_tag("model_type", "random_forest")
    mlflow.set_tag("dataset", "iris")

    # Save model
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="iris-classifier"
    )

    # Custom artifacts (plots, reports, etc.)
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots()
    ConfusionMatrixDisplay(cm).plot(ax=ax)
    fig.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
    print(f"Metrics: {metrics}")

Hyperparameter Tuning Tracking

import optuna
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Reuses the X_train/X_test/y_train/y_test split from the basic example above

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 20),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
    }

    with mlflow.start_run(nested=True, run_name=f"trial-{trial.number}"):
        mlflow.log_params(params)

        model = RandomForestClassifier(**params, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        accuracy = accuracy_score(y_test, y_pred)
        mlflow.log_metric("accuracy", accuracy)

        return accuracy

# Run Optuna study
with mlflow.start_run(run_name="hyperparameter-tuning"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)

    # Log best results
    mlflow.log_params(study.best_params)
    mlflow.log_metric("best_accuracy", study.best_value)
    mlflow.set_tag("best_trial", study.best_trial.number)

PyTorch Model Tracking

import torch
import torch.nn as nn
import mlflow.pytorch

class SimpleNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

with mlflow.start_run(run_name="pytorch-model"):
    model = SimpleNet(4, 32, 3)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()

    mlflow.log_params({
        "hidden_dim": 32,
        "learning_rate": 0.001,
        "optimizer": "Adam",
        "epochs": 100
    })

    # Convert the iris splits from the earlier example into tensors
    X_tensor = torch.tensor(X_train, dtype=torch.float32)
    y_tensor = torch.tensor(y_train, dtype=torch.long)

    for epoch in range(100):
        # Full-batch forward pass and loss
        loss = criterion(model(X_tensor), y_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Log per-epoch metrics
        mlflow.log_metric("train_loss", loss.item(), step=epoch)

    # Save PyTorch model
    mlflow.pytorch.log_model(model, "model")

Model Registry

Model Registration and Version Management

from mlflow import MlflowClient

client = MlflowClient()

# Register model (auto-registered when using registered_model_name in log_model)
# Or manually register:
result = client.create_registered_model(
    name="iris-classifier",
    description="Iris flower classification model"
)

# Register a specific run's model as a version (run_id from an earlier tracking run)
model_version = client.create_model_version(
    name="iris-classifier",
    source=f"runs:/{run_id}/model",
    run_id=run_id,
    description="RandomForest baseline v1"
)

print(f"Model Version: {model_version.version}")

Deployment Management with Aliases

# MLflow 2.x uses Aliases (Stage is deprecated)
client = MlflowClient()

# Set production alias
client.set_registered_model_alias(
    name="iris-classifier",
    alias="champion",
    version=3
)

# Set challenger model
client.set_registered_model_alias(
    name="iris-classifier",
    alias="challenger",
    version=5
)

# Load model by alias
champion_model = mlflow.pyfunc.load_model("models:/iris-classifier@champion")
challenger_model = mlflow.pyfunc.load_model("models:/iris-classifier@challenger")

# A/B testing
champion_pred = champion_model.predict(X_test)
challenger_pred = challenger_model.predict(X_test)

print(f"Champion accuracy: {accuracy_score(y_test, champion_pred)}")
print(f"Challenger accuracy: {accuracy_score(y_test, challenger_pred)}")
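
In a live service you rarely score every request with both models; a common pattern is deterministic traffic splitting between the two aliases. A hedged sketch of such a router (the function name and split logic are illustrative, not an MLflow API):

```python
import hashlib

def route_alias(user_id: str, challenger_fraction: float = 0.1) -> str:
    """Deterministically assign a user to the champion or challenger alias."""
    # Hash the user ID into one of 100 stable buckets
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_fraction * 100 else "champion"

# The same user always hits the same model version
print(route_alias("user-42"), route_alias("user-42"))
```

Because routing is keyed on a stable ID rather than random per request, each user sees consistent predictions during the experiment.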

Using Model Tags

# Add tags to model version
client.set_model_version_tag(
    name="iris-classifier",
    version=3,
    key="validation_status",
    value="approved"
)

client.set_model_version_tag(
    name="iris-classifier",
    version=3,
    key="approved_by",
    value="data-science-lead"
)

# Search models by tag
from mlflow import search_model_versions

approved_versions = search_model_versions(
    filter_string="name = 'iris-classifier' AND tags.validation_status = 'approved'"
)

Model Serving

Built-in MLflow Serving

# Local REST API serving (--env-manager local serves from the current environment;
# the older --no-conda flag is deprecated)
mlflow models serve \
  -m "models:/iris-classifier@champion" \
  --port 8080 \
  --env-manager local

# Test request
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'

Custom Serving with FastAPI

from fastapi import FastAPI
import mlflow.pyfunc
import numpy as np

app = FastAPI()

# Load model (once at server startup)
model = mlflow.pyfunc.load_model("models:/iris-classifier@champion")

@app.post("/predict")
async def predict(features: list[list[float]]):
    predictions = model.predict(np.array(features))
    return {
        "predictions": predictions.tolist(),
        "model_version": "champion"
    }

@app.get("/health")
async def health():
    return {"status": "healthy", "model": "iris-classifier@champion"}
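
Raw list[list[float]] bodies work, but explicit request and response schemas make the contract visible in the generated OpenAPI docs. A minimal sketch (the schema names are illustrative):

```python
from pydantic import BaseModel

class PredictRequest(BaseModel):
    features: list[list[float]]

class PredictResponse(BaseModel):
    predictions: list[int]
    alias: str

# The endpoint would then accept a PredictRequest and return a PredictResponse
req = PredictRequest(features=[[5.1, 3.5, 1.4, 0.2]])
print(req.features[0])
```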

Experiment Comparison and Analysis

Comparing Runs in the UI and CLI

# List runs for an experiment (CLI)
mlflow runs list --experiment-id 1

The runs list command only enumerates runs; metric-based filtering and ordering are done in the UI's run comparison view or with the Python API shown next.

Analysis with Python API

import mlflow
import pandas as pd

# Query all runs in an experiment
runs = mlflow.search_runs(
    experiment_ids=["1"],
    filter_string="metrics.accuracy > 0.9",
    order_by=["metrics.accuracy DESC"],
    max_results=10
)

# Analyze as DataFrame
print(runs[["run_id", "params.n_estimators", "params.max_depth", "metrics.accuracy"]])

# Find the best run
best_run = runs.iloc[0]
print(f"Best run: {best_run.run_id}, Accuracy: {best_run['metrics.accuracy']}")
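
Note that search_runs returns parameters as strings (that is how they are logged), so numeric comparisons on params need a cast. A self-contained sketch using a stand-in DataFrame shaped like search_runs output (values are illustrative):

```python
import pandas as pd

# Stand-in for the DataFrame mlflow.search_runs returns
runs = pd.DataFrame({
    "run_id": ["aaa", "bbb", "ccc"],
    "params.n_estimators": ["100", "300", "200"],  # params come back as strings
    "metrics.accuracy": [0.93, 0.97, 0.95],
})

# Cast before numeric filtering on a parameter
big_forests = runs[runs["params.n_estimators"].astype(int) >= 200]
best = runs.sort_values("metrics.accuracy", ascending=False).iloc[0]
print(best["run_id"], len(big_forests))
```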

Production Checklist

□ Set backend store to PostgreSQL/MySQL
□ Set artifact store to S3/GCS/MinIO
□ Configure authentication/authorization (OIDC, Basic Auth)
□ Set up automatic experiment logging (autolog)
□ Establish Model Registry alias conventions
□ Automate model validation in CI/CD
□ Configure model serving health checks
□ Define experiment cleanup policies (archive old runs)
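
The CI/CD validation item above is often a small gate that compares the candidate's metrics against the current champion before the alias is moved. A hedged sketch of such a gate (the thresholds and function name are illustrative):

```python
def passes_promotion_gate(champion_acc: float, candidate_acc: float,
                          min_gain: float = 0.005, floor: float = 0.90) -> bool:
    """Promote only if the candidate clears an absolute floor and beats the champion."""
    return candidate_acc >= floor and candidate_acc >= champion_acc + min_gain

# Example: 0.97 vs champion 0.95 clears both checks
print(passes_promotion_gate(0.95, 0.97))
```

If the gate passes, the pipeline would call set_registered_model_alias to move the champion alias to the new version.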

Review Quiz (6 Questions)

Q1. What are the four core components of MLflow?

Tracking, Projects, Models, Model Registry

Q2. What is the difference between mlflow.log_params and mlflow.log_metrics?

log_params records training hyperparameters (strings), while log_metrics records performance metrics (numbers). Metrics support per-epoch tracking with the step parameter.

Q3. What concept is used for model deployment management in MLflow 2.x?

Aliases (e.g., @champion, @challenger). Stage has been deprecated.

Q4. When is the nested=True parameter used?

It is used when recording multiple child runs inside a parent run, such as during hyperparameter tuning.

Q5. Why use S3 as the artifact store?

It stores large artifacts like model files and plots in scalable object storage, making it easy to share across teams and manage versions.

Q6. What are the pros and cons of mlflow.autolog()?

Pros: Automatically records parameters/metrics/models without code changes. Cons: May record unnecessary information, and custom metrics still need to be logged separately.