- Introduction
- Experiment Tracking Platform Comparison
- Scaling the MLflow Tracking Server
- Experiment Tracking Best Practices
- Model Registry Lifecycle Management
- CI/CD Integration with GitHub Actions
- Multi-Team Experiment Organization
- Failure Cases and Operational Warnings
- Production Monitoring and Cleanup
- Summary
- References
Introduction
Running ML experiments locally is easy. Running them at scale across multiple teams with reproducibility, auditability, and automated deployment is an entirely different challenge. MLflow has become the de facto open-source standard for experiment tracking and model lifecycle management, but most tutorials stop at mlflow.log_metric() on localhost.
This guide covers the production-grade MLflow workflow: scaling the tracking server with PostgreSQL and S3, structuring experiments for multi-team collaboration, managing the model registry lifecycle with aliases, integrating with CI/CD pipelines via GitHub Actions, and handling the failure modes that only surface at scale.
Experiment Tracking Platform Comparison
Before diving into MLflow, it is important to understand how it compares to other experiment tracking platforms in the ecosystem.
| Feature | MLflow | Weights and Biases | Neptune | ClearML |
|---|---|---|---|---|
| License | Apache 2.0 (OSS) | Proprietary (free tier) | Proprietary (free tier) | Apache 2.0 (OSS) |
| Self-hosted | Yes (full) | Limited | Limited | Yes (full) |
| Experiment Tracking | Strong | Excellent | Excellent | Strong |
| Model Registry | Built-in | Built-in | Metadata-only | Built-in |
| Hyperparameter Sweeps | Manual / Optuna | Built-in (Sweeps) | Via integrations | Built-in (HPO) |
| Artifact Storage | S3/GCS/Azure/HDFS | W and B servers | Neptune servers | S3/GCS/Azure |
| UI Quality | Good | Excellent | Excellent | Good |
| Framework Integration | All major frameworks | All major frameworks | All major frameworks | All major frameworks |
| Pricing (Team) | Free (self-hosted) | ~$50/user/month | ~$79/user/month | Free (self-hosted) |
| CI/CD Integration | Any (open API) | GitHub/GitLab | GitHub/GitLab | GitHub/GitLab |
| Data Governance | Full control (self) | Vendor-managed | Vendor-managed | Full control (self) |
MLflow wins on self-hosting flexibility and vendor independence. Weights and Biases excels in visualization and collaboration UX. Neptune offers superior metadata querying. ClearML provides the most complete open-source pipeline management. Choose based on your team's primary constraint: budget, governance, or UI polish.
Scaling the MLflow Tracking Server
Architecture Overview
A production MLflow deployment separates three concerns:
- Tracking Server - the API and UI process
- Backend Store - PostgreSQL for experiment metadata, parameters, metrics, and tags
- Artifact Store - S3 (or compatible) for model files, plots, and large binary artifacts
PostgreSQL Backend and S3 Artifact Store
```yaml
# docker-compose.production.yml
services:
  mlflow-server:
    image: ghcr.io/mlflow/mlflow:v2.20.0
    ports:
      - '5000:5000'
    environment:
      MLFLOW_BACKEND_STORE_URI: 'postgresql://mlflow:${DB_PASSWORD}@postgres:5432/mlflowdb'
      MLFLOW_DEFAULT_ARTIFACT_ROOT: 's3://mlflow-artifacts-prod/'
      AWS_ACCESS_KEY_ID: '${AWS_ACCESS_KEY_ID}'
      AWS_SECRET_ACCESS_KEY: '${AWS_SECRET_ACCESS_KEY}'
      AWS_DEFAULT_REGION: 'ap-northeast-1'
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:${DB_PASSWORD}@postgres:5432/mlflowdb
      --default-artifact-root s3://mlflow-artifacts-prod/
      --host 0.0.0.0
      --port 5000
      --workers 4
      --app-name basic-auth
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: '${DB_PASSWORD}'
      POSTGRES_DB: mlflowdb
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U mlflow -d mlflowdb']
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  nginx:
    image: nginx:1.27-alpine
    ports:
      - '443:443'
      - '80:80'
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - mlflow-server
    restart: unless-stopped

volumes:
  pgdata:
```
Warning: Never expose the MLflow server directly to the internet without authentication. The --app-name basic-auth flag enables built-in HTTP basic authentication. For production, always place an Nginx reverse proxy with TLS in front of the server.
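The compose file mounts a ./nginx.conf that is not shown above. A minimal sketch of that reverse-proxy config might look like the following (the hostname and certificate paths are placeholders; this is a starting point, not a hardened configuration):

```nginx
events {}

http {
  server {
    listen 80;
    return 301 https://$host$request_uri;  # force TLS
  }

  server {
    listen 443 ssl;
    server_name mlflow.internal.company.com;  # placeholder hostname

    ssl_certificate     /etc/nginx/certs/mlflow.crt;
    ssl_certificate_key /etc/nginx/certs/mlflow.key;

    # Model artifacts can be large when artifact proxying is enabled,
    # so raise the request body limit and the read timeout.
    client_max_body_size 2g;

    location / {
      proxy_pass http://mlflow-server:5000;
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-Proto $scheme;
      proxy_read_timeout 600s;
    }
  }
}
```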
PostgreSQL Tuning for MLflow
MLflow's tracking workload is write-heavy during training (frequent metric logging) and read-heavy during analysis (UI queries). Tune PostgreSQL accordingly:
```ini
# postgresql.conf adjustments for MLflow workloads
# Assuming 8GB RAM dedicated to PostgreSQL
shared_buffers = 2GB
effective_cache_size = 6GB
work_mem = 64MB
maintenance_work_mem = 512MB

# Write-heavy optimizations
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB

# Connection pooling (use PgBouncer for 50+ concurrent training jobs)
max_connections = 200
```
Operational warning: When running more than 50 concurrent training jobs that log metrics frequently (every step), you will encounter connection pool exhaustion. Deploy PgBouncer in transaction mode between MLflow and PostgreSQL. Without this, training jobs will fail with connection refused errors during peak load.
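On the MLflow side, routing through PgBouncer is just a matter of pointing the backend store at PgBouncer's port; you can also cap the tracking server's own SQLAlchemy pool. A sketch (the `pgbouncer` hostname is an assumption, and the pool-size environment variables are available in recent MLflow releases):

```shell
# Route MLflow's backend store through PgBouncer (port 6432)
# instead of connecting to PostgreSQL directly
export MLFLOW_BACKEND_STORE_URI="postgresql://mlflow:${DB_PASSWORD}@pgbouncer:6432/mlflowdb"

# Cap the tracking server's own SQLAlchemy connection pool
export MLFLOW_SQLALCHEMYSTORE_POOL_SIZE=10
export MLFLOW_SQLALCHEMYSTORE_MAX_OVERFLOW=20
```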
Experiment Tracking Best Practices
Structuring Experiments for Teams
```python
import mlflow
from mlflow.tracking import MlflowClient

# Configure the remote tracking server
mlflow.set_tracking_uri("https://mlflow.internal.company.com")

# Naming convention: team/project/experiment-type
# This enables filtering and access control at scale
EXPERIMENT_NAME = "recommendation-team/product-ranking/hyperparameter-search"
mlflow.set_experiment(EXPERIMENT_NAME)

client = MlflowClient()


def train_model(config: dict):
    """Production-grade experiment tracking with proper error handling.

    train_xgboost and evaluate_model are project-specific helpers
    assumed to be defined elsewhere.
    """
    with mlflow.start_run(
        run_name=f"xgb-{config['max_depth']}d-{config['learning_rate']}lr",
        tags={
            "team": "recommendation",
            "project": "product-ranking",
            "environment": "staging",
            "git_commit": config.get("git_sha", "unknown"),
            "data_version": config.get("data_version", "v1"),
        },
    ) as run:
        # Log all hyperparameters
        mlflow.log_params({
            "model_type": "xgboost",
            "max_depth": config["max_depth"],
            "learning_rate": config["learning_rate"],
            "n_estimators": config["n_estimators"],
            "subsample": config["subsample"],
            "colsample_bytree": config["colsample_bytree"],
            "eval_metric": "ndcg",
            "training_data_path": config["data_path"],
            "feature_count": config["feature_count"],
        })

        # Log dataset info as an input
        dataset = mlflow.data.from_pandas(
            config["train_df"],
            source=config["data_path"],
            name="product_ranking_train",
        )
        mlflow.log_input(dataset, context="training")

        # Train the model
        model = train_xgboost(config)

        # Log metrics at each evaluation point
        for epoch, metrics in enumerate(model.eval_history):
            mlflow.log_metrics({
                "train_ndcg": metrics["train_ndcg"],
                "val_ndcg": metrics["val_ndcg"],
                "train_loss": metrics["train_loss"],
                "val_loss": metrics["val_loss"],
            }, step=epoch)

        # Log final metrics
        final_metrics = evaluate_model(model, config["test_data"])
        mlflow.log_metrics({
            "test_ndcg": final_metrics["ndcg"],
            "test_precision_at_10": final_metrics["precision@10"],
            "test_recall_at_50": final_metrics["recall@50"],
            "test_mrr": final_metrics["mrr"],
            "inference_latency_p99_ms": final_metrics["latency_p99"],
        })

        # Log the model with a signature
        signature = mlflow.models.infer_signature(
            config["sample_input"],
            model.predict(config["sample_input"]),
        )
        mlflow.xgboost.log_model(
            model,
            artifact_path="model",
            signature=signature,
            registered_model_name="product-ranking-xgb",
        )

        # Log artifacts
        mlflow.log_artifact("feature_importance.png")
        mlflow.log_artifact("confusion_matrix.png")

    return run.info.run_id
```
Batch Metric Logging for Performance
Warning: Calling mlflow.log_metric() on every training step creates a separate HTTP request per call. For deep learning training with thousands of steps, this saturates the tracking server.
```python
import time

import mlflow
from mlflow.entities import Metric
from mlflow.tracking import MlflowClient

client = MlflowClient()


def log_metrics_batched(run_id: str, metrics_buffer: list, batch_size: int = 100):
    """Batch metric logging to reduce HTTP overhead.

    Instead of logging every step individually, accumulate metrics and
    flush them with a single log_batch request. This reduces tracking
    server load by 50-100x for long training runs. Note that log_batch
    accepts at most 1000 metrics per request, so keep batch_size well
    below 1000 divided by the number of metrics logged per step.
    """
    if len(metrics_buffer) >= batch_size:
        now_ms = int(time.time() * 1000)
        entities = [
            Metric(key=name, value=value, timestamp=now_ms, step=step)
            for step, metrics in metrics_buffer
            for name, value in metrics.items()
        ]
        client.log_batch(run_id, metrics=entities)
        metrics_buffer.clear()


# Usage in a training loop (train_step and scheduler come from your
# own training code)
with mlflow.start_run() as run:
    metrics_buffer = []
    for step in range(100000):
        loss = train_step()
        metrics_buffer.append((step, {
            "train_loss": loss,
            "learning_rate": scheduler.get_last_lr()[0],
        }))
        # Flush every 100 steps instead of every step
        log_metrics_batched(run.info.run_id, metrics_buffer, batch_size=100)

    # Flush remaining metrics
    log_metrics_batched(run.info.run_id, metrics_buffer, batch_size=1)
```
Model Registry Lifecycle Management
Understanding Model Aliases (Post-Stages Deprecation)
Since MLflow 2.9, the legacy stage-based workflow (Staging, Production, Archived) has been deprecated in favor of model aliases, which provide more flexibility for real-world deployment patterns.
```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a new model version (happens automatically with log_model)
# or explicitly:
result = client.create_model_version(
    name="product-ranking-xgb",
    source="s3://mlflow-artifacts-prod/3/abc123/artifacts/model",
    run_id="abc123",
    description="XGBoost v2 with new user features, NDCG@10 improved 3.2%",
)
model_version = result.version  # note: the API returns the version as a string

# Set aliases for the deployment workflow
# Champion = currently serving production traffic
client.set_registered_model_alias(
    name="product-ranking-xgb",
    alias="champion",
    version=model_version,
)

# Challenger = candidate being validated in shadow mode
# (assumes the next version has already been registered)
client.set_registered_model_alias(
    name="product-ranking-xgb",
    alias="challenger",
    version=str(int(model_version) + 1),
)

# Load models by alias in serving code
champion_model = mlflow.pyfunc.load_model("models:/product-ranking-xgb@champion")
challenger_model = mlflow.pyfunc.load_model("models:/product-ranking-xgb@challenger")

# Add tags for additional metadata
client.set_model_version_tag(
    name="product-ranking-xgb",
    version=model_version,
    key="validation_status",
    value="passed",
)
client.set_model_version_tag(
    name="product-ranking-xgb",
    version=model_version,
    key="approved_by",
    value="ml-lead@company.com",
)
```
Model Promotion Workflow
The recommended production workflow uses a three-alias pattern:
- candidate - newly trained model, pending validation
- challenger - validated model, running in shadow mode alongside champion
- champion - serving live production traffic
```python
from datetime import datetime, timezone

from mlflow.exceptions import MlflowException
from mlflow.tracking import MlflowClient


def promote_model(model_name: str, version: int, target_alias: str):
    """Promote a model version through the deployment lifecycle.

    Workflow: candidate -> challenger -> champion
    Each promotion requires passing validation gates:
    - candidate -> challenger: automated test suite passes
    - challenger -> champion: shadow mode metrics within tolerance
    """
    client = MlflowClient()

    # Get current model version info
    mv = client.get_model_version(name=model_name, version=str(version))

    # Validate that the promotion is allowed
    if target_alias == "challenger":
        # Must have passed automated validation
        tags = mv.tags  # already a dict of key -> value
        if tags.get("validation_status") != "passed":
            raise ValueError(
                f"Model version {version} has not passed validation. "
                f"Current status: {tags.get('validation_status', 'unknown')}"
            )
    elif target_alias == "champion":
        # Must currently be the challenger
        try:
            current_challenger = client.get_model_version_by_alias(
                name=model_name, alias="challenger"
            )
            if current_challenger.version != str(version):
                raise ValueError(
                    f"Version {version} is not the current challenger. "
                    f"Current challenger is version {current_challenger.version}"
                )
        except MlflowException:
            raise ValueError("No challenger alias set. Run shadow mode first.")

        # Archive the old champion before replacing it
        try:
            old_champion = client.get_model_version_by_alias(
                name=model_name, alias="champion"
            )
            client.set_model_version_tag(
                name=model_name,
                version=old_champion.version,
                key="archived_at",
                value=datetime.now(timezone.utc).isoformat(),
            )
            client.delete_registered_model_alias(
                name=model_name, alias="champion"
            )
        except MlflowException:
            pass  # No existing champion

    # Set the new alias
    client.set_registered_model_alias(
        name=model_name, alias=target_alias, version=version
    )

    # Tag the promotion event
    client.set_model_version_tag(
        name=model_name,
        version=str(version),
        key=f"promoted_to_{target_alias}_at",
        value=datetime.now(timezone.utc).isoformat(),
    )
    print(f"Model {model_name} v{version} promoted to {target_alias}")
```
Warning: Model alias reassignment is atomic but not transactional across multiple aliases. If you need to swap champion and challenger simultaneously, there will be a brief window where both point to the same version. Design your serving layer to handle this gracefully.
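One simple serving-side guard, sketched here as a pure function (it assumes your model loader exposes the resolved version behind each alias), is to skip challenger scoring whenever both aliases resolve to the same version:

```python
from typing import Optional


def should_score_challenger(
    champion_version: str, challenger_version: Optional[str]
) -> bool:
    """Skip shadow scoring during the brief alias-swap window (or when no
    challenger alias is set), so champion/challenger comparisons never
    compare a model version against itself."""
    if challenger_version is None:
        return False
    return challenger_version != champion_version


print(should_score_challenger("12", "13"))  # True: distinct versions, shadow-score
print(should_score_challenger("13", "13"))  # False: mid-swap, both aliases on v13
```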
CI/CD Integration with GitHub Actions
Automated Training and Validation Pipeline
```yaml
# .github/workflows/ml-pipeline.yml
name: ML Training and Model Validation Pipeline

on:
  push:
    paths:
      - 'ml/**'
      - 'features/**'
    branches: [main]
  workflow_dispatch:
    inputs:
      experiment_name:
        description: 'MLflow experiment name'
        required: true
        default: 'recommendation-team/product-ranking/scheduled'

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_TRACKING_USERNAME }}
  MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_TRACKING_PASSWORD }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  train:
    runs-on: [self-hosted, gpu]
    outputs:
      run_id: ${{ steps.train.outputs.run_id }}
      model_version: ${{ steps.train.outputs.model_version }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run training
        id: train
        run: |
          python ml/train.py \
            --experiment-name "${{ github.event.inputs.experiment_name || 'recommendation-team/product-ranking/ci' }}" \
            --git-sha "${{ github.sha }}" \
            --data-version "$(date +%Y%m%d)"
          echo "run_id=$(cat /tmp/mlflow_run_id)" >> $GITHUB_OUTPUT
          echo "model_version=$(cat /tmp/mlflow_model_version)" >> $GITHUB_OUTPUT

  validate:
    needs: train
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run model validation
        run: |
          python ml/validate.py \
            --model-uri "models:/product-ranking-xgb/${{ needs.train.outputs.model_version }}" \
            --min-ndcg 0.45 \
            --max-latency-p99-ms 50 \
            --min-data-coverage 0.95
      - name: Set candidate alias
        if: success()
        run: |
          python - <<'EOF'
          from mlflow.tracking import MlflowClient
          client = MlflowClient()
          client.set_registered_model_alias(
              'product-ranking-xgb', 'candidate',
              '${{ needs.train.outputs.model_version }}'
          )
          client.set_model_version_tag(
              'product-ranking-xgb',
              '${{ needs.train.outputs.model_version }}',
              'validation_status', 'passed'
          )
          EOF

  promote-to-challenger:
    needs: [train, validate]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install mlflow
      - name: Promote to challenger
        run: |
          python ml/promote.py \
            --model-name "product-ranking-xgb" \
            --version "${{ needs.train.outputs.model_version }}" \
            --target-alias "challenger"
      - name: Deploy to shadow mode
        run: |
          kubectl set image deployment/ranking-shadow \
            model-server=ranking-server:v${{ needs.train.outputs.model_version }} \
            --namespace ml-staging
```
Automated Champion Promotion
```yaml
# .github/workflows/promote-champion.yml
name: Promote Challenger to Champion

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Registered model name'
        required: true
      version:
        description: 'Model version to promote'
        required: true

jobs:
  promote:
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install mlflow
      - name: Verify shadow mode metrics
        run: |
          python ml/verify_shadow_metrics.py \
            --model-name "${{ github.event.inputs.model_name }}" \
            --version "${{ github.event.inputs.version }}" \
            --min-hours-in-shadow 24 \
            --max-metric-degradation 0.02
      - name: Promote to champion
        run: |
          python ml/promote.py \
            --model-name "${{ github.event.inputs.model_name }}" \
            --version "${{ github.event.inputs.version }}" \
            --target-alias "champion"
      - name: Rolling deploy to production
        run: |
          kubectl set image deployment/ranking-prod \
            model-server=ranking-server:v${{ github.event.inputs.version }} \
            --namespace ml-production
          kubectl rollout status deployment/ranking-prod \
            --namespace ml-production --timeout=600s
```
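The ml/verify_shadow_metrics.py gate is referenced by the workflow but not shown. Its core check can be sketched as a pure function over champion and challenger metric dicts (metric names and values below are illustrative, and degradation is measured relative to the champion for higher-is-better metrics):

```python
def within_tolerance(
    champion: dict, challenger: dict, max_degradation: float = 0.02
) -> bool:
    """True if no higher-is-better metric degrades by more than
    max_degradation (relative) versus the champion."""
    for key, champ_value in champion.items():
        chall_value = challenger.get(key)
        if chall_value is None:
            return False  # challenger did not report the metric
        if (champ_value - chall_value) / champ_value > max_degradation:
            return False
    return True


champion = {"ndcg": 0.470, "mrr": 0.310}
print(within_tolerance(champion, {"ndcg": 0.468, "mrr": 0.312}))  # True
print(within_tolerance(champion, {"ndcg": 0.440, "mrr": 0.312}))  # False
```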
Multi-Team Experiment Organization
Access Control and Namespace Strategy
"""
MLflow experiment namespace strategy for multi-team organizations.
Convention:
{team}/{project}/{experiment-type}
Examples:
recommendation-team/product-ranking/hyperparameter-search
recommendation-team/product-ranking/feature-ablation
search-team/query-understanding/weekly-retrain
fraud-team/transaction-scoring/model-comparison
Model naming convention:
{project}-{algorithm}
Examples:
product-ranking-xgb
query-understanding-bert
transaction-scoring-lgbm
"""
import mlflow
from mlflow.tracking import MlflowClient
from dataclasses import dataclass
@dataclass
class ExperimentConfig:
team: str
project: str
experiment_type: str
@property
def experiment_name(self) -> str:
return f"{self.team}/{self.project}/{self.experiment_type}"
@property
def model_name_prefix(self) -> str:
return self.project
def setup_experiment(config: ExperimentConfig) -> str:
"""Create or get experiment with proper tags for discoverability."""
client = MlflowClient()
experiment = client.get_experiment_by_name(config.experiment_name)
if experiment is None:
experiment_id = client.create_experiment(
name=config.experiment_name,
tags={
"team": config.team,
"project": config.project,
"type": config.experiment_type,
"owner": f"{config.team}-lead@company.com",
},
)
else:
experiment_id = experiment.experiment_id
mlflow.set_experiment(experiment_id=experiment_id)
return experiment_id
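As a quick sanity check of the naming convention, the dataclass yields fully qualified experiment names like so (restated self-contained here, without the MLflow calls):

```python
from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    team: str
    project: str
    experiment_type: str

    @property
    def experiment_name(self) -> str:
        return f"{self.team}/{self.project}/{self.experiment_type}"


cfg = ExperimentConfig("search-team", "query-understanding", "weekly-retrain")
print(cfg.experiment_name)  # search-team/query-understanding/weekly-retrain
```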
Failure Cases and Operational Warnings
Common Production Failures
1. Artifact Store Permissions
The most common production failure is S3 permission errors when training jobs run under a different IAM role than the MLflow server:
```shell
# Symptom: training completes but the model is not saved
# Error: botocore.exceptions.ClientError: AccessDenied
#
# Fix: ensure the training job's IAM role has BOTH:
#   - s3:PutObject on the artifact bucket
#   - s3:GetObject on the artifact bucket (for model loading)

# Verify permissions:
aws s3 cp test.txt s3://mlflow-artifacts-prod/test.txt
aws s3 ls s3://mlflow-artifacts-prod/
```
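A minimal IAM policy document granting the training-job role both permissions might look like this (the bucket name matches the compose file above; adapt the ARNs to your account), built here as a plain dict for clarity:

```python
import json

# Bucket name taken from the artifact store configured earlier
BUCKET = "mlflow-artifacts-prod"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MlflowArtifactReadWrite",
            "Effect": "Allow",
            # ListBucket aids debugging (aws s3 ls); Put/Get cover
            # model logging and loading
            "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```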
2. PostgreSQL Connection Exhaustion
When running many concurrent training jobs, each job holds a database connection. Without connection pooling, this causes cascading failures:
```ini
; pgbouncer.ini -- deploy PgBouncer between MLflow and PostgreSQL
[databases]
mlflowdb = host=postgres port=5432 dbname=mlflowdb

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 30
min_pool_size = 10
reserve_pool_size = 5
```
3. Large Artifact Upload Timeout
Deep learning models (multi-GB) can timeout during upload. Configure the client timeout:
```python
import os

# Increase the HTTP timeout for large model uploads (default is 120s)
os.environ["MLFLOW_HTTP_REQUEST_TIMEOUT"] = "600"

# For very large artifacts, use a larger multipart upload chunk size
os.environ["MLFLOW_MULTIPART_UPLOAD_CHUNK_SIZE"] = "104857600"  # 100MB chunks
```
4. Metric Logging Race Conditions
When multiple processes log to the same run (e.g., distributed training), metrics can arrive out of order:
```python
import mlflow

# BAD: multiple workers logging to the same run causes step-ordering
# issues and metric overwrites.

# GOOD: use child runs for distributed training
# (num_workers, train_worker, and aggregate_and_log_metrics come from
# your own training code)
with mlflow.start_run(run_name="distributed-training") as parent_run:
    for worker_id in range(num_workers):
        with mlflow.start_run(
            run_name=f"worker-{worker_id}",
            nested=True,
        ) as child_run:
            # Each worker logs to its own child run
            train_worker(worker_id, child_run.info.run_id)

    # Aggregate metrics in the parent run
    aggregate_and_log_metrics(parent_run.info.run_id)
```
5. Model Registry Name Collisions
Teams accidentally overwriting each other's registered models:
```python
import mlflow


# Enforce the naming convention with a wrapper
def register_model_safe(model_uri: str, name: str, team: str):
    """Register a model with team-prefix validation."""
    allowed_prefixes = {
        "recommendation": ["product-ranking", "user-embedding"],
        "search": ["query-understanding", "document-ranking"],
        "fraud": ["transaction-scoring", "account-risk"],
    }
    valid = any(
        name.startswith(prefix)
        for prefix in allowed_prefixes.get(team, [])
    )
    if not valid:
        raise ValueError(
            f"Team '{team}' cannot register model '{name}'. "
            f"Allowed prefixes: {allowed_prefixes.get(team, [])}"
        )
    return mlflow.register_model(model_uri, name)
```
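The prefix check at the heart of the wrapper behaves like this (restated self-contained, without the MLflow registration call, using the same illustrative prefix map):

```python
allowed_prefixes = {
    "recommendation": ["product-ranking", "user-embedding"],
    "search": ["query-understanding", "document-ranking"],
    "fraud": ["transaction-scoring", "account-risk"],
}


def is_allowed(name: str, team: str) -> bool:
    """True if the model name starts with one of the team's prefixes."""
    return any(name.startswith(p) for p in allowed_prefixes.get(team, []))


print(is_allowed("transaction-scoring-lgbm", "fraud"))  # True
print(is_allowed("product-ranking-xgb", "fraud"))       # False: wrong team
```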
Production Monitoring and Cleanup
Automated Experiment Cleanup
Old runs accumulate and slow down the UI. Schedule periodic cleanup:
```python
from datetime import datetime, timedelta

from mlflow.tracking import MlflowClient


def cleanup_old_runs(
    experiment_name: str,
    max_age_days: int = 90,
    keep_top_n: int = 10,
    dry_run: bool = True,
):
    """Clean up old experiment runs while preserving top performers.

    WARNING: deletion is destructive. Note that client.delete_run only
    soft-deletes (the run moves to the 'deleted' lifecycle stage); run
    `mlflow gc` against the backend store to permanently remove deleted
    runs and their artifacts. Always run with dry_run=True first.
    """
    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        print(f"Experiment '{experiment_name}' not found")
        return

    cutoff = datetime.now() - timedelta(days=max_age_days)
    cutoff_ms = int(cutoff.timestamp() * 1000)

    # Get all runs sorted by the primary metric
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["metrics.val_ndcg DESC"],
    )

    # Protect the top N runs regardless of age
    protected_run_ids = {r.info.run_id for r in runs[:keep_top_n]}

    deleted_count = 0
    for run in runs:
        if run.info.run_id in protected_run_ids:
            continue
        if run.info.end_time and run.info.end_time < cutoff_ms:
            if dry_run:
                print(f"Would delete run {run.info.run_id} "
                      f"(ended: {datetime.fromtimestamp(run.info.end_time / 1000)})")
            else:
                client.delete_run(run.info.run_id)
            deleted_count += 1
    print(f"{'Would delete' if dry_run else 'Deleted'} {deleted_count} runs")
```
Summary
MLflow at production scale requires more than just calling mlflow.log_metric(). The key principles are:
- Separate compute from storage: Use PostgreSQL for metadata and S3 for artifacts. Deploy PgBouncer for connection pooling.
- Structure experiments by team and project: Use a clear naming convention that scales with organizational growth.
- Use aliases, not stages: The champion/challenger/candidate pattern with model aliases provides flexible deployment workflows.
- Integrate with CI/CD: Automate validation gates and deployment through GitHub Actions with environment-based approval flows.
- Plan for failure: Connection exhaustion, permission errors, and race conditions in distributed training are the most common production issues.
- Clean up proactively: Old runs accumulate and degrade UI performance. Schedule automated cleanup with protection for top-performing models.
The shift from stages to aliases, the adoption of model signatures, and the integration of dataset tracking (via mlflow.log_input()) represent MLflow's maturation into a production-grade MLOps platform. Combined with proper infrastructure scaling and CI/CD integration, MLflow provides a solid foundation for managing ML experiments and models at enterprise scale.
References
- MLflow Official Documentation
- MLflow GitHub Repository
- MLflow Model Registry Workflows
- Databricks - MLflow Self-Hosting Overview
- MLflow Backend Stores Documentation
- MLflow Artifact Stores Documentation
- RFC: Deprecating Model Registry Stages (GitHub Issue 10336)
- MLOps Best Practices: Building Production ML Pipelines