MLflow Production Guide: Experiment Tracking, Model Registry, and Scalable MLOps Workflow

Introduction

Running ML experiments locally is easy. Running them at scale across multiple teams with reproducibility, auditability, and automated deployment is an entirely different challenge. MLflow has become the de facto open-source standard for experiment tracking and model lifecycle management, but most tutorials stop at `mlflow.log_metric()` on localhost.

This guide covers the production-grade MLflow workflow: scaling the tracking server with PostgreSQL and S3, structuring experiments for multi-team collaboration, managing the model registry lifecycle with aliases, integrating with CI/CD pipelines via GitHub Actions, and handling the failure modes that only surface at scale.

Experiment Tracking Platform Comparison

Before diving into MLflow, it is important to understand how it compares to other experiment tracking platforms in the ecosystem.

| Feature | MLflow | Weights and Biases | Neptune | ClearML |
| ------------------------- | -------------------- | ----------------------- | ----------------------- | -------------------- |
| **License** | Apache 2.0 (OSS) | Proprietary (free tier) | Proprietary (free tier) | Apache 2.0 (OSS) |
| **Self-hosted** | Yes (full) | Limited | Limited | Yes (full) |
| **Experiment Tracking** | Strong | Excellent | Excellent | Strong |
| **Model Registry** | Built-in | Built-in | Metadata-only | Built-in |
| **Hyperparameter Sweeps** | Manual / Optuna | Built-in (Sweeps) | Via integrations | Built-in (HPO) |
| **Artifact Storage** | S3/GCS/Azure/HDFS | W and B servers | Neptune servers | S3/GCS/Azure |
| **UI Quality** | Good | Excellent | Excellent | Good |
| **Framework Integration** | All major frameworks | All major frameworks | All major frameworks | All major frameworks |
| **Pricing (Team)** | Free (self-hosted) | ~$50/user/month | ~$79/user/month | Free (self-hosted) |
| **CI/CD Integration** | Any (open API) | GitHub/GitLab | GitHub/GitLab | GitHub/GitLab |
| **Data Governance** | Full control (self) | Vendor-managed | Vendor-managed | Full control (self) |

MLflow wins on self-hosting flexibility and vendor independence. Weights and Biases excels in visualization and collaboration UX. Neptune offers superior metadata querying. ClearML provides the most complete open-source pipeline management. Choose based on your team's primary constraint: budget, governance, or UI polish.

Scaling the MLflow Tracking Server

Architecture Overview

A production MLflow deployment separates three concerns:

1. **Tracking Server** - the API and UI process

2. **Backend Store** - PostgreSQL for experiment metadata, parameters, metrics, and tags

3. **Artifact Store** - S3 (or compatible) for model files, plots, and large binary artifacts
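
As a rough sketch of how the three pieces fit together, the snippet below assembles the `mlflow server` command line from a small settings object. The `ServerSettings` class and `build_server_command` helper are hypothetical names introduced for illustration; only the CLI flags are taken from the deployment in this guide, and the URIs are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSettings:
    """Hypothetical container for the three deployment concerns."""

    backend_store_uri: str  # metadata: experiments, params, metrics, tags
    artifact_root: str      # large files: models, plots, binaries
    host: str = "0.0.0.0"   # tracking server bind address
    port: int = 5000        # tracking server port


def build_server_command(settings: ServerSettings) -> list[str]:
    """Return the argv list for launching the tracking server."""
    return [
        "mlflow", "server",
        "--backend-store-uri", settings.backend_store_uri,
        "--default-artifact-root", settings.artifact_root,
        "--host", settings.host,
        "--port", str(settings.port),
    ]


settings = ServerSettings(
    backend_store_uri="postgresql://mlflow:<password>@postgres:5432/mlflowdb",
    artifact_root="s3://mlflow-artifacts-prod/",
)
print(" ".join(build_server_command(settings)))
```

Keeping the two storage URIs as separate fields mirrors the separation itself: either store can be moved or scaled without touching the other.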

PostgreSQL Backend and S3 Artifact Store

    # docker-compose.production.yml
    services:
      mlflow-server:
        image: ghcr.io/mlflow/mlflow:v2.20.0
        ports:
          - '5000:5000'
        environment:
          MLFLOW_BACKEND_STORE_URI: 'postgresql://mlflow:${DB_PASSWORD}@postgres:5432/mlflowdb'
          MLFLOW_DEFAULT_ARTIFACT_ROOT: 's3://mlflow-artifacts-prod/'
          AWS_ACCESS_KEY_ID: '${AWS_ACCESS_KEY_ID}'
          AWS_SECRET_ACCESS_KEY: '${AWS_SECRET_ACCESS_KEY}'
          AWS_DEFAULT_REGION: 'ap-northeast-1'
        command: >
          mlflow server
          --backend-store-uri postgresql://mlflow:${DB_PASSWORD}@postgres:5432/mlflowdb
          --default-artifact-root s3://mlflow-artifacts-prod/
          --host 0.0.0.0
          --port 5000
          --workers 4
          --app-name basic-auth
        depends_on:
          postgres:
            condition: service_healthy
        restart: unless-stopped

      postgres:
        image: postgres:16-alpine
        environment:
          POSTGRES_USER: mlflow
          POSTGRES_PASSWORD: '${DB_PASSWORD}'
          POSTGRES_DB: mlflowdb
        volumes:
          - pgdata:/var/lib/postgresql/data
        healthcheck:
          test: ['CMD-SHELL', 'pg_isready -U mlflow -d mlflowdb']
          interval: 10s
          timeout: 5s
          retries: 5
        restart: unless-stopped

      nginx:
        image: nginx:1.27-alpine
        ports:
          - '443:443'
          - '80:80'
        volumes:
          - ./nginx.conf:/etc/nginx/nginx.conf:ro
          - ./certs:/etc/nginx/certs:ro
        depends_on:
          - mlflow-server
        restart: unless-stopped

    volumes:
      pgdata:

**Warning**: Never expose the MLflow server directly to the internet without authentication. The `--app-name basic-auth` flag enables built-in HTTP basic authentication. For production, always place an Nginx reverse proxy with TLS in front of the server.

PostgreSQL Tuning for MLflow

MLflow's tracking workload is write-heavy during training (frequent metric logging) and read-heavy during analysis (UI queries). Tune PostgreSQL accordingly:

    # postgresql.conf adjustments for MLflow workloads
    # Assuming 8GB RAM dedicated to PostgreSQL
    shared_buffers = 2GB
    effective_cache_size = 6GB
    work_mem = 64MB
    maintenance_work_mem = 512MB

    # Write-heavy optimizations
    wal_buffers = 64MB
    checkpoint_completion_target = 0.9
    max_wal_size = 4GB

    # Connection pooling (use PgBouncer for 50+ concurrent training jobs)
    max_connections = 200
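
The two memory settings above line up with a common PostgreSQL rule of thumb: roughly 25% of dedicated RAM for `shared_buffers` and about 75% for `effective_cache_size`. The sketch below derives starting values for other machine sizes. The helper name `suggest_memory_settings` is made up for this example, and the percentages are a starting point to be validated under real load, not a tuned result.

```python
def suggest_memory_settings(ram_gb: int) -> dict[str, str]:
    """Derive starting values from the RAM dedicated to PostgreSQL.

    Assumption: the 25% / 75% split is a rule of thumb only.
    """
    total_mb = ram_gb * 1024
    return {
        "shared_buffers": f"{total_mb // 4}MB",            # ~25% of RAM
        "effective_cache_size": f"{total_mb * 3 // 4}MB",  # ~75% of RAM
    }


for name, value in suggest_memory_settings(8).items():
    print(f"{name} = {value}")
```

For the 8GB case this prints 2048MB and 6144MB, the same 2GB and 6GB used in the configuration above.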

**Operational warning**: When running more than 50 concurrent training jobs that log metrics frequently (every step), you will encounter connection pool exhaustion. Deploy PgBouncer in transaction mode between MLflow and PostgreSQL. Without this, training jobs will fail with `connection refused` errors during peak load.

Experiment Tracking Best Practices

Structuring Experiments for Teams

    import mlflow
    from mlflow.tracking import MlflowClient

    # Configure remote tracking server
    mlflow.set_tracking_uri("https://mlflow.internal.company.com")

    # Naming convention: team/project/experiment-type
    # This enables filtering and access control at scale
    EXPERIMENT_NAME = "recommendation-team/product-ranking/hyperparameter-search"
    mlflow.set_experiment(EXPERIMENT_NAME)

    client = MlflowClient()


    def train_model(config: dict):
        """Train a model and record parameters, data, metrics, and the model itself.

        train_xgboost() and evaluate_model() are project-specific helpers
        that are not shown here.
        """
        with mlflow.start_run(
            run_name=f"xgb-{config['max_depth']}d-{config['learning_rate']}lr",
            tags={
                "team": "recommendation",
                "project": "product-ranking",
                "environment": "staging",
                "git_commit": config.get("git_sha", "unknown"),
                "data_version": config.get("data_version", "v1"),
            },
        ) as run:
            # Log all hyperparameters
            mlflow.log_params({
                "model_type": "xgboost",
                "max_depth": config["max_depth"],
                "learning_rate": config["learning_rate"],
                "n_estimators": config["n_estimators"],
                "subsample": config["subsample"],
                "colsample_bytree": config["colsample_bytree"],
                "eval_metric": "ndcg",
                "training_data_path": config["data_path"],
                "feature_count": config["feature_count"],
            })

            # Log dataset info as input
            dataset = mlflow.data.from_pandas(
                config["train_df"],
                source=config["data_path"],
                name="product_ranking_train",
            )
            mlflow.log_input(dataset, context="training")

            # Train model
            model = train_xgboost(config)

            # Log metrics at each evaluation point
            for epoch, metrics in enumerate(model.eval_history):
                mlflow.log_metrics({
                    "train_ndcg": metrics["train_ndcg"],
                    "val_ndcg": metrics["val_ndcg"],
                    "train_loss": metrics["train_loss"],
                    "val_loss": metrics["val_loss"],
                }, step=epoch)

            # Log final metrics
            final_metrics = evaluate_model(model, config["test_data"])
            mlflow.log_metrics({
                "test_ndcg": final_metrics["ndcg"],
                "test_precision_at_10": final_metrics["precision@10"],
                "test_recall_at_50": final_metrics["recall@50"],
                "test_mrr": final_metrics["mrr"],
                "inference_latency_p99_ms": final_metrics["latency_p99"],
            })

            # Log model with signature
            signature = mlflow.models.infer_signature(
                config["sample_input"],
                model.predict(config["sample_input"]),
            )
            mlflow.xgboost.log_model(
                model,
                artifact_path="model",
                signature=signature,
                registered_model_name="product-ranking-xgb",
            )

            # Log artifacts (the files must already exist on local disk)
            mlflow.log_artifact("feature_importance.png")
            mlflow.log_artifact("confusion_matrix.png")

            return run.info.run_id

Batch Metric Logging for Performance

**Warning**: Calling `mlflow.log_metric()` on every training step creates a separate HTTP request per call. For deep learning training with thousands of steps, this saturates the tracking server.

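    import time

    import mlflow
    from mlflow.entities import Metric
    from mlflow.tracking import MlflowClient

    client = MlflowClient()


    def log_metrics_batched(run_id: str, metrics_buffer: list, batch_size: int = 100):
        """Batch metric logging to reduce HTTP overhead.

        Instead of logging every step individually, accumulate metrics
        and flush them with log_batch. This reduces tracking server load
        by 50-100x for long training runs.
        """
        if len(metrics_buffer) < batch_size:
            return
        now_ms = int(time.time() * 1000)  # flush time, shared by the whole batch
        batch = [
            Metric(key=key, value=float(value), timestamp=now_ms, step=step)
            for step, metrics in metrics_buffer
            for key, value in metrics.items()
        ]
        # A single log_batch request accepts at most 1000 metrics
        for start in range(0, len(batch), 1000):
            client.log_batch(run_id, metrics=batch[start:start + 1000])
        metrics_buffer.clear()


    # Usage in training loop (train_step and scheduler come from your training code)
    with mlflow.start_run() as run:
        metrics_buffer = []
        for step in range(100000):
            loss = train_step()
            metrics_buffer.append((step, {
                "train_loss": loss,
                "learning_rate": scheduler.get_last_lr()[0],
            }))
            # Flush every 100 steps instead of every step
            log_metrics_batched(run.info.run_id, metrics_buffer, batch_size=100)

        # Flush remaining metrics
        log_metrics_batched(run.info.run_id, metrics_buffer, batch_size=1)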
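
The buffering idea can be tried without a tracking server. The following is a minimal, stdlib-only sketch: `MetricBuffer` and its `flush_fn` callback are hypothetical names, and in real code the callback would send one batched request to MLflow instead of recording batch sizes.

```python
from typing import Callable


class MetricBuffer:
    """Accumulate (step, metrics) pairs and hand them off in batches."""

    def __init__(self, flush_fn: Callable[[list], None], batch_size: int = 100):
        self.flush_fn = flush_fn      # called with the buffered items
        self.batch_size = batch_size
        self._items: list = []

    def add(self, step: int, metrics: dict) -> None:
        self._items.append((step, metrics))
        if len(self._items) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Skip the callback entirely when nothing is buffered
        if self._items:
            self.flush_fn(list(self._items))
            self._items.clear()


flushed_sizes = []
buffer = MetricBuffer(lambda batch: flushed_sizes.append(len(batch)), batch_size=100)
for step in range(250):
    buffer.add(step, {"train_loss": 1.0 / (step + 1)})
buffer.flush()  # send the final partial batch
print(flushed_sizes)  # [100, 100, 50]
```

Here 250 steps turn into three flushes instead of 250 individual calls. The explicit final `flush()` matters: without it the last partial batch is silently dropped.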

Model Registry Lifecycle Management

Understanding Model Aliases (Post-Stages Deprecation)

Since MLflow 2.9, the legacy stage-based workflow (Staging, Production, Archived) has been deprecated in favor of model aliases. An alias is a named, reassignable pointer to one model version, which fits real-world deployment patterns better than a fixed set of stages.

    import mlflow
    from mlflow.tracking import MlflowClient

    client = MlflowClient()

    # Register a new model version (happens automatically with log_model)
    # or explicitly:
    result = client.create_model_version(
        name="product-ranking-xgb",
        source="s3://mlflow-artifacts-prod/3/abc123/artifacts/model",
        run_id="abc123",
        description="XGBoost v2 with new user features, NDCG@10 improved 3.2%",
    )
    model_version = int(result.version)  # the client returns the version as a string

    # Set aliases for deployment workflow
    # Champion = currently serving production traffic
    client.set_registered_model_alias(
        name="product-ranking-xgb",
        alias="champion",
        version=model_version,
    )

    # Challenger = candidate being validated in shadow mode
    # (assumes version model_version + 1 has already been registered)
    client.set_registered_model_alias(
        name="product-ranking-xgb",
        alias="challenger",
        version=model_version + 1,
    )

    # Load model by alias in serving code
    champion_model = mlflow.pyfunc.load_model("models:/product-ranking-xgb@champion")
    challenger_model = mlflow.pyfunc.load_model("models:/product-ranking-xgb@challenger")

    # Add tags for additional metadata
    client.set_model_version_tag(
        name="product-ranking-xgb",
        version=model_version,
        key="validation_status",
        value="passed",
    )
    client.set_model_version_tag(
        name="product-ranking-xgb",
        version=model_version,
        key="approved_by",
        value="ml-lead@company.com",
    )
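
Serving code and CI scripts in this guide use two URI shapes: `models:/<name>@<alias>` and `models:/<name>/<version>`. A small helper keeps them consistent. This is only a sketch, and `registry_uri` is a hypothetical function name, not part of MLflow.

```python
from typing import Optional


def registry_uri(name: str, *, alias: Optional[str] = None,
                 version: Optional[int] = None) -> str:
    """Build a models:/ URI from exactly one of alias or version."""
    if (alias is None) == (version is None):
        raise ValueError("pass exactly one of alias or version")
    if alias is not None:
        return f"models:/{name}@{alias}"
    return f"models:/{name}/{version}"


print(registry_uri("product-ranking-xgb", alias="champion"))
print(registry_uri("product-ranking-xgb", version=12))
```

Alias URIs suit long-lived serving code, since the alias can move to a new version without a redeploy. Pinned version URIs suit validation jobs that must test one specific artifact.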

Model Promotion Workflow

The recommended production workflow uses a three-alias pattern:

1. **candidate** - newly trained model, pending validation

2. **challenger** - validated model, running in shadow mode alongside champion

3. **champion** - serving live production traffic

    from datetime import datetime, timezone

    from mlflow.exceptions import MlflowException
    from mlflow.tracking import MlflowClient


    def promote_model(model_name: str, version: int, target_alias: str):
        """Promote a model version through the deployment lifecycle.

        Workflow: candidate -> challenger -> champion

        Each promotion requires passing validation gates:
        - candidate -> challenger: automated test suite passes
        - challenger -> champion: shadow mode metrics within tolerance
        """
        client = MlflowClient()

        # Get current model version info
        mv = client.get_model_version(name=model_name, version=str(version))

        # Validate promotion is allowed
        if target_alias == "challenger":
            # Must have passed automated validation
            tags = mv.tags  # already a {key: value} dict
            if tags.get("validation_status") != "passed":
                raise ValueError(
                    f"Model version {version} has not passed validation. "
                    f"Current status: {tags.get('validation_status', 'unknown')}"
                )

        elif target_alias == "champion":
            # Must currently be challenger
            try:
                current_challenger = client.get_model_version_by_alias(
                    name=model_name, alias="challenger"
                )
            except MlflowException:
                raise ValueError("No challenger alias set. Run shadow mode first.")
            if str(current_challenger.version) != str(version):
                raise ValueError(
                    f"Version {version} is not the current challenger. "
                    f"Current challenger is version {current_challenger.version}"
                )

            # Archive old champion. The alias is not deleted first: setting it
            # below reassigns it, and deleting would leave a window in which
            # models:/<name>@champion does not resolve.
            try:
                old_champion = client.get_model_version_by_alias(
                    name=model_name, alias="champion"
                )
                client.set_model_version_tag(
                    name=model_name,
                    version=old_champion.version,
                    key="archived_at",
                    value=datetime.now(timezone.utc).isoformat(),
                )
            except MlflowException:
                pass  # No existing champion

        # Set the new alias
        client.set_registered_model_alias(
            name=model_name, alias=target_alias, version=str(version)
        )

        # Tag the promotion event
        client.set_model_version_tag(
            name=model_name,
            version=str(version),
            key=f"promoted_to_{target_alias}_at",
            value=datetime.now(timezone.utc).isoformat(),
        )

        print(f"Model {model_name} v{version} promoted to {target_alias}")

**Warning**: Model alias reassignment is atomic but not transactional across multiple aliases. If you need to swap champion and challenger simultaneously, there will be a brief window where both point to the same version. Design your serving layer to handle this gracefully.
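
To make that window concrete, here is a small in-memory simulation. It uses a plain dictionary rather than the MLflow registry, and the version numbers and function names are made up for illustration.

```python
def swap_in_two_steps(aliases: dict) -> list:
    """Swap champion and challenger with two separate writes.

    Returns a snapshot of the alias table after each write.
    """
    states = []
    old_champion = aliases["champion"]
    aliases["champion"] = aliases["challenger"]  # write 1
    states.append(dict(aliases))
    aliases["challenger"] = old_champion         # write 2
    states.append(dict(aliases))
    return states


def shadow_comparison_enabled(aliases: dict) -> bool:
    """Serving-side guard: comparing a version against itself is meaningless."""
    return aliases["champion"] != aliases["challenger"]


for state in swap_in_two_steps({"champion": 7, "challenger": 8}):
    print(state, "compare:", shadow_comparison_enabled(state))
```

After the first write both aliases point at version 8, so the guard disables the shadow comparison until the second write lands. A serving layer that resolves both aliases on each reload can apply the same check.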

CI/CD Integration with GitHub Actions

Automated Training and Validation Pipeline

    # .github/workflows/ml-pipeline.yml
    name: ML Training and Model Validation Pipeline

    on:
      push:
        paths:
          - 'ml/**'
          - 'features/**'
        branches: [main]
      workflow_dispatch:
        inputs:
          experiment_name:
            description: 'MLflow experiment name'
            required: true
            default: 'recommendation-team/product-ranking/scheduled'

    env:
      MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
      MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_TRACKING_USERNAME }}
      MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_TRACKING_PASSWORD }}
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

    jobs:
      train:
        runs-on: [self-hosted, gpu]
        outputs:
          run_id: ${{ steps.train.outputs.run_id }}
          model_version: ${{ steps.train.outputs.model_version }}
        steps:
          - uses: actions/checkout@v4

          - name: Set up Python
            uses: actions/setup-python@v5
            with:
              python-version: '3.11'
              cache: 'pip'

          - name: Install dependencies
            run: pip install -r requirements.txt

          - name: Run training
            id: train
            run: |
              python ml/train.py \
                --experiment-name "${{ github.event.inputs.experiment_name || 'recommendation-team/product-ranking/ci' }}" \
                --git-sha "${{ github.sha }}" \
                --data-version "$(date +%Y%m%d)"
              echo "run_id=$(cat /tmp/mlflow_run_id)" >> $GITHUB_OUTPUT
              echo "model_version=$(cat /tmp/mlflow_model_version)" >> $GITHUB_OUTPUT

      validate:
        needs: train
        runs-on: [self-hosted, gpu]
        steps:
          - uses: actions/checkout@v4

          - name: Install dependencies
            run: pip install -r requirements.txt

          - name: Run model validation
            run: |
              python ml/validate.py \
                --model-uri "models:/product-ranking-xgb/${{ needs.train.outputs.model_version }}" \
                --min-ndcg 0.45 \
                --max-latency-p99-ms 50 \
                --min-data-coverage 0.95

          - name: Set candidate alias
            if: success()
            run: |
              python -c "
              from mlflow.tracking import MlflowClient
              client = MlflowClient()
              client.set_registered_model_alias(
                  'product-ranking-xgb', 'candidate',
                  ${{ needs.train.outputs.model_version }}
              )
              client.set_model_version_tag(
                  'product-ranking-xgb',
                  '${{ needs.train.outputs.model_version }}',
                  'validation_status', 'passed'
              )
              "

      promote-to-challenger:
        needs: [train, validate]
        runs-on: ubuntu-latest
        environment: staging
        steps:
          - uses: actions/checkout@v4

          - name: Install dependencies
            run: pip install mlflow

          - name: Promote to challenger
            run: |
              python ml/promote.py \
                --model-name "product-ranking-xgb" \
                --version "${{ needs.train.outputs.model_version }}" \
                --target-alias "challenger"

          - name: Deploy to shadow mode
            run: |
              kubectl set image deployment/ranking-shadow \
                model-server=ranking-server:v${{ needs.train.outputs.model_version }} \
                --namespace ml-staging
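
The workflow calls `ml/validate.py`, whose contents are not shown in this guide. As a hedged sketch of what its gate logic could look like, the function below checks a metrics dictionary against the three thresholds passed on the command line. The metric keys (`ndcg`, `latency_p99_ms`, `data_coverage`) and the function name are assumptions.

```python
def failed_gates(
    metrics: dict,
    *,
    min_ndcg: float,
    max_latency_p99_ms: float,
    min_data_coverage: float,
) -> list[str]:
    """Return a human-readable message for every gate that fails."""
    failures = []
    if metrics["ndcg"] < min_ndcg:
        failures.append(f"ndcg {metrics['ndcg']} is below {min_ndcg}")
    if metrics["latency_p99_ms"] > max_latency_p99_ms:
        failures.append(
            f"latency_p99_ms {metrics['latency_p99_ms']} exceeds {max_latency_p99_ms}"
        )
    if metrics["data_coverage"] < min_data_coverage:
        failures.append(
            f"data_coverage {metrics['data_coverage']} is below {min_data_coverage}"
        )
    return failures


thresholds = dict(min_ndcg=0.45, max_latency_p99_ms=50, min_data_coverage=0.95)
problems = failed_gates(
    {"ndcg": 0.47, "latency_p99_ms": 62.0, "data_coverage": 0.97}, **thresholds
)
for problem in problems:
    print("FAILED:", problem)
```

A script built on this would exit with a non-zero status whenever the list is non-empty, which is what makes the `if: success()` condition on the alias step meaningful.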

Automated Champion Promotion

    # .github/workflows/promote-champion.yml
    name: Promote Challenger to Champion

    on:
      workflow_dispatch:
        inputs:
          model_name:
            description: 'Registered model name'
            required: true
          version:
            description: 'Model version to promote'
            required: true

    jobs:
      promote:
        runs-on: ubuntu-latest
        environment: production # Requires manual approval
        steps:
          - uses: actions/checkout@v4

          - name: Install dependencies
            run: pip install mlflow

          - name: Verify shadow mode metrics
            run: |
              python ml/verify_shadow_metrics.py \
                --model-name "${{ github.event.inputs.model_name }}" \
                --version "${{ github.event.inputs.version }}" \
                --min-hours-in-shadow 24 \
                --max-metric-degradation 0.02

          - name: Promote to champion
            run: |
              python ml/promote.py \
                --model-name "${{ github.event.inputs.model_name }}" \
                --version "${{ github.event.inputs.version }}" \
                --target-alias "champion"

          - name: Rolling deploy to production
            run: |
              kubectl set image deployment/ranking-prod \
                model-server=ranking-server:v${{ github.event.inputs.version }} \
                --namespace ml-production
              kubectl rollout status deployment/ranking-prod \
                --namespace ml-production --timeout=600s
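
The `ml/verify_shadow_metrics.py` script is likewise not shown. One plausible reading of its two flags is sketched below. It assumes a higher-is-better metric and interprets `--max-metric-degradation` as a relative drop against the champion, and the function name is hypothetical.

```python
def passes_shadow_gate(
    champion_metric: float,
    challenger_metric: float,
    hours_in_shadow: float,
    *,
    min_hours_in_shadow: float = 24.0,
    max_metric_degradation: float = 0.02,
) -> bool:
    """Check soak time first, then the relative drop versus the champion."""
    if hours_in_shadow < min_hours_in_shadow:
        return False
    if champion_metric <= 0:
        raise ValueError("champion_metric must be positive")
    degradation = (champion_metric - challenger_metric) / champion_metric
    return degradation <= max_metric_degradation


print(passes_shadow_gate(0.50, 0.495, hours_in_shadow=30))  # 1% drop, long enough
print(passes_shadow_gate(0.50, 0.48, hours_in_shadow=30))   # 4% drop
print(passes_shadow_gate(0.50, 0.51, hours_in_shadow=12))   # better, but too early
```

Checking the soak time before the metric prevents a challenger that looks good on a few hours of traffic from being promoted early.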

Multi-Team Experiment Organization

Access Control and Namespace Strategy

    """
    MLflow experiment namespace strategy for multi-team organizations.

    Convention:
        {team}/{project}/{experiment-type}

    Examples:
        recommendation-team/product-ranking/hyperparameter-search
        recommendation-team/product-ranking/feature-ablation
        search-team/query-understanding/weekly-retrain
        fraud-team/transaction-scoring/model-comparison

    Model naming convention:
        {project}-{algorithm}

    Examples:
        product-ranking-xgb
        query-understanding-bert
        transaction-scoring-lgbm
    """

    from dataclasses import dataclass

    import mlflow
    from mlflow.tracking import MlflowClient


    @dataclass
    class ExperimentConfig:
        team: str
        project: str
        experiment_type: str

        @property
        def experiment_name(self) -> str:
            return f"{self.team}/{self.project}/{self.experiment_type}"

        @property
        def model_name_prefix(self) -> str:
            return self.project


    def setup_experiment(config: ExperimentConfig) -> str:
        """Create or get experiment with proper tags for discoverability."""
        client = MlflowClient()

        experiment = client.get_experiment_by_name(config.experiment_name)
        if experiment is None:
            experiment_id = client.create_experiment(
                name=config.experiment_name,
                tags={
                    "team": config.team,
                    "project": config.project,
                    "type": config.experiment_type,
                    "owner": f"{config.team}-lead@company.com",
                },
            )
        else:
            experiment_id = experiment.experiment_id

        mlflow.set_experiment(experiment_id=experiment_id)
        return experiment_id
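
A naming convention only helps if it is enforced. The sketch below validates and splits an experiment name using the standard library alone. The lowercase kebab-case rule is inferred from the examples in this guide rather than required by MLflow, and `parse_experiment_name` is a hypothetical helper.

```python
import re

# One path segment: lowercase letters and digits, separated by single hyphens
_SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
_NAME_RE = re.compile(rf"^({_SEGMENT})/({_SEGMENT})/({_SEGMENT})$")


def parse_experiment_name(name: str) -> tuple:
    """Split '{team}/{project}/{experiment-type}' or raise ValueError."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(
            f"'{name}' does not follow team/project/experiment-type"
        )
    return match.groups()


team, project, experiment_type = parse_experiment_name(
    "search-team/query-understanding/weekly-retrain"
)
print(team, project, experiment_type)
```

Running a check like this before creating an experiment catches typos such as a missing segment or stray uppercase before they fragment the namespace.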

Failure Cases and Operational Warnings

Common Production Failures

**1. Artifact Store Permissions**

The most common production failure is S3 permission errors when training jobs run on different IAM roles than the MLflow server:

    # Symptom: Training completes but model is not saved
    # Error: botocore.exceptions.ClientError: AccessDenied

    # Fix: Ensure the training job IAM role has all of:
    # - s3:PutObject on the artifact bucket
    # - s3:GetObject on the artifact bucket (for model loading)
    # - s3:ListBucket on the bucket itself (for artifact listing and the ls check below)

    # Verify permissions:
    aws s3 cp test.txt s3://mlflow-artifacts-prod/test.txt
    aws s3 ls s3://mlflow-artifacts-prod/

**2. PostgreSQL Connection Exhaustion**

When many training jobs log concurrently, every in-flight request occupies a database connection on the tracking server. Without connection pooling, this causes cascading failures:

    # Deploy PgBouncer between MLflow and PostgreSQL
    # pgbouncer.ini
    [databases]
    mlflowdb = host=postgres port=5432 dbname=mlflowdb

    [pgbouncer]
    listen_port = 6432
    listen_addr = 0.0.0.0
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    max_client_conn = 500
    default_pool_size = 30
    min_pool_size = 10
    reserve_pool_size = 5

**3. Large Artifact Upload Timeout**

Deep learning models (multi-GB) can time out during upload. Configure the client timeout:

    import os

    # Increase timeout for large model uploads (default is 120s)
    os.environ["MLFLOW_HTTP_REQUEST_TIMEOUT"] = "600"

    # For very large artifacts, use multipart upload
    os.environ["MLFLOW_MULTIPART_UPLOAD_CHUNK_SIZE"] = "104857600"  # 100MB chunks
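
When picking a chunk size it helps to know how many parts an upload will produce. The arithmetic is sketched below, assuming the artifact store is S3, which caps a single multipart upload at 10,000 parts. The helper name is made up for this example.

```python
S3_MAX_PARTS = 10_000  # S3 limit on parts per multipart upload


def multipart_part_count(file_size_bytes: int,
                         chunk_size_bytes: int = 104_857_600) -> int:
    """Number of parts needed, rounding the last partial chunk up."""
    if file_size_bytes <= 0 or chunk_size_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-file_size_bytes // chunk_size_bytes)  # integer ceiling division


five_gib = 5 * 1024**3
parts = multipart_part_count(five_gib)
print(parts, "parts; within S3 limit:", parts <= S3_MAX_PARTS)
```

A 5 GiB model with 100MB chunks needs 52 parts, far below the cap. The limit only becomes a concern for very large files combined with small chunk sizes.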

**4. Metric Logging Race Conditions**

When multiple processes log to the same run (e.g., distributed training), metrics can arrive out of order:

    import mlflow

    # BAD: Multiple workers logging to same run
    # This causes step ordering issues and metric overwrites

    # GOOD: Use child runs for distributed training
    # (num_workers, train_worker and aggregate_and_log_metrics are
    # project-specific and not shown here)
    with mlflow.start_run(run_name="distributed-training") as parent_run:
        for worker_id in range(num_workers):
            with mlflow.start_run(
                run_name=f"worker-{worker_id}",
                nested=True,
            ) as child_run:
                # Each worker logs to its own child run
                train_worker(worker_id, child_run.info.run_id)

        # Aggregate metrics in parent run
        aggregate_and_log_metrics(parent_run.info.run_id)
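
The aggregation step is left undefined above. One simple possibility is to average each metric across workers per step before logging the result to the parent run. The sketch below shows only that averaging, with plain Python data in place of MLflow runs; `aggregate_by_step` and the input layout are assumptions.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_step(worker_series: dict) -> list:
    """Average one metric across workers for every step.

    worker_series maps worker_id -> list of (step, value) pairs,
    which may arrive in any order.
    """
    by_step = defaultdict(list)
    for series in worker_series.values():
        for step, value in series:
            by_step[step].append(value)
    # Sorting restores step order regardless of arrival order
    return [(step, mean(values)) for step, values in sorted(by_step.items())]


worker_series = {
    0: [(0, 1.0), (1, 0.5)],
    1: [(1, 0.25), (0, 0.5)],  # worker 1 reported out of order
}
print(aggregate_by_step(worker_series))  # [(0, 0.75), (1, 0.375)]
```

Because values are grouped by step before averaging, the out-of-order arrival that causes trouble in a shared run does not affect the result.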

**5. Model Registry Name Collisions**

Without guardrails, teams can accidentally register new versions under each other's model names:

    import mlflow


    # Enforce naming convention with a wrapper
    def register_model_safe(model_uri: str, name: str, team: str):
        """Register model with team-prefix validation."""
        allowed_prefixes = {
            "recommendation": ["product-ranking", "user-embedding"],
            "search": ["query-understanding", "document-ranking"],
            "fraud": ["transaction-scoring", "account-risk"],
        }

        valid = any(
            name.startswith(prefix)
            for prefix in allowed_prefixes.get(team, [])
        )
        if not valid:
            raise ValueError(
                f"Team '{team}' cannot register model '{name}'. "
                f"Allowed prefixes: {allowed_prefixes.get(team, [])}"
            )

        return mlflow.register_model(model_uri, name)

Production Monitoring and Cleanup

Automated Experiment Cleanup

Old experiments accumulate and slow down the UI. Schedule cleanup:

    from datetime import datetime, timedelta

    from mlflow.tracking import MlflowClient


    def cleanup_old_runs(
        experiment_name: str,
        max_age_days: int = 90,
        keep_top_n: int = 10,
        dry_run: bool = True,
    ):
        """Clean up old experiment runs while preserving top performers.

        WARNING: delete_run() is a soft delete. Runs move to the "deleted"
        lifecycle stage; they and their artifacts are removed permanently
        only when `mlflow gc` is run afterwards.
        Always run with dry_run=True first.
        """
        client = MlflowClient()
        experiment = client.get_experiment_by_name(experiment_name)
        if experiment is None:
            print(f"Experiment '{experiment_name}' not found")
            return

        cutoff = datetime.now() - timedelta(days=max_age_days)
        cutoff_ms = int(cutoff.timestamp() * 1000)

        # Get runs sorted by primary metric. search_runs returns at most
        # 1000 runs per call by default; paginate with page_token for more.
        runs = client.search_runs(
            experiment_ids=[experiment.experiment_id],
            order_by=["metrics.val_ndcg DESC"],
        )

        # Protect top N runs regardless of age
        protected_run_ids = {r.info.run_id for r in runs[:keep_top_n]}

        deleted_count = 0
        for run in runs:
            if run.info.run_id in protected_run_ids:
                continue
            if run.info.end_time and run.info.end_time < cutoff_ms:
                if dry_run:
                    print(f"Would delete run {run.info.run_id} "
                          f"(ended: {datetime.fromtimestamp(run.info.end_time / 1000)})")
                else:
                    client.delete_run(run.info.run_id)
                deleted_count += 1

        print(f"{'Would delete' if dry_run else 'Deleted'} {deleted_count} runs")

Summary

MLflow at production scale requires more than just calling `mlflow.log_metric()`. The key principles are:

1. **Separate compute from storage**: Use PostgreSQL for metadata and S3 for artifacts. Deploy PgBouncer for connection pooling.

2. **Structure experiments by team and project**: Use a clear naming convention that scales with organizational growth.

3. **Use aliases, not stages**: The champion/challenger/candidate pattern with model aliases provides flexible deployment workflows.

4. **Integrate with CI/CD**: Automate validation gates and deployment through GitHub Actions with environment-based approval flows.

5. **Plan for failure**: Connection exhaustion, permission errors, and race conditions in distributed training are the most common production issues.

6. **Clean up proactively**: Old runs accumulate and degrade UI performance. Schedule automated cleanup with protection for top-performing models.

The shift from stages to aliases, the adoption of model signatures, and the integration of dataset tracking (via `mlflow.log_input()`) represent MLflow's maturation into a production-grade MLOps platform. Combined with proper infrastructure scaling and CI/CD integration, MLflow provides a solid foundation for managing ML experiments and models at enterprise scale.

