
Why an MLflow 2.x Operations Guide Is Needed

MLflow is downloaded over 14 million times per month and has become the de facto standard among open-source experiment tracking tools. Installing it and calling `mlflow.autolog()` is easy. The problems come later. As the team grows and the run count passes the thousands, operational issues emerge: no experiment naming conventions, runaway artifact storage, and a confusing model promotion process.

This article covers practical patterns for designing and operating experiment tracking and model registry at production level, encompassing MLflow 2.x (2.9-2.18) and early 3.x versions. It is written with team-level operations in mind, not local experimentation.

Architecture Design: Separating Tracking Server and Artifact Store

Core Components

The MLflow production architecture should be separated into three layers.

1. **Tracking Server**: Stores experiment metadata (parameters, metrics, tags). Uses PostgreSQL or MySQL backend.

2. **Artifact Store**: Stores model binaries, datasets, and visualization files. Uses S3/GCS/Azure Blob.

3. **Model Registry**: Model version management, aliases, stage transitions. Uses the same DB as the Tracking Server.

```bash
# Production tracking server startup command
mlflow server \
  --backend-store-uri postgresql://mlflow_user:${DB_PASSWORD}@db.internal:5432/mlflow_prod \
  --default-artifact-root s3://company-mlflow-artifacts/prod/ \
  --artifacts-destination s3://company-mlflow-artifacts/prod/ \
  --host 0.0.0.0 \
  --port 5000 \
  --workers 4 \
  --gunicorn-opts "--timeout 120 --keep-alive 5"
```

Artifact Store Configuration Considerations

When using S3 as the artifact store, setting `MLFLOW_S3_ENDPOINT_URL` on both client and server sides causes path conflicts. The principle is to specify the path on the server side with `--default-artifact-root` and not set this environment variable on the client.

```python
import os

import mlflow

# Client-side configuration (correct approach)
# Set only the tracking server URI. The artifact path is managed by the server.
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.internal:5000"

# An IAM role is recommended for S3 authentication (EC2/EKS environments).
# Specify credentials only for local development.
os.environ["AWS_PROFILE"] = "mlflow-dev"

mlflow.set_experiment("/team-search/ranking-model-v3")
```

When using GCS, specify in `gs://bucket/path` format, and in production, use Workload Identity instead of Service Account Keys. Artifact upload/download timeout is controlled by the `MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT` environment variable, with a default of 60 seconds for GCS. When handling large model checkpoints, this value should be raised to 300 seconds or more.
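To pick a timeout value rather than guess one, a rough estimate from checkpoint size and available bandwidth can help. The sketch below is only back-of-the-envelope arithmetic; the function name, the decimal unit conversion, and the safety factor of 2 are assumptions for illustration, not MLflow behavior.

```python
import math


def estimate_transfer_timeout(size_gb: float, bandwidth_mbps: float,
                              safety_factor: float = 2.0) -> int:
    """Estimate a timeout (seconds) for transferring `size_gb` gigabytes
    over a link of `bandwidth_mbps` megabits per second."""
    if size_gb <= 0 or bandwidth_mbps <= 0:
        raise ValueError("size_gb and bandwidth_mbps must be positive")
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    transfer_seconds = size_megabits / bandwidth_mbps
    # Pad the ideal transfer time to absorb retries and throughput dips.
    return math.ceil(transfer_seconds * safety_factor)


if __name__ == "__main__":
    # A 10 GB checkpoint over a 500 Mbit/s link
    print(estimate_transfer_timeout(10, 500))
```

The result can then be exported as `MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT` in the training environment.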

Experiment Tracking Design Patterns

Experiment Naming Strategy

Use the `/{team}/{project}/{experiment-type}` pattern for experiment names. Flat naming becomes unmanageable once experiments exceed 100.

```python
# Good examples: hierarchical naming
mlflow.set_experiment("/search-team/query-ranking/bert-finetune")
mlflow.set_experiment("/fraud-team/transaction-classifier/xgboost-baseline")
mlflow.set_experiment("/recommendation/item2vec/hyperopt-sweep")

# Bad examples: flat and ambiguous naming
mlflow.set_experiment("experiment_1")
mlflow.set_experiment("test_model")
mlflow.set_experiment("johns_experiment")
```
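A naming convention tends to hold only if it is checked automatically. As a minimal sketch, the pattern could be enforced with a regular expression before an experiment is created. The exact character set (lowercase letters, digits, hyphens) and the helper name are assumptions; adjust them to your own convention.

```python
import re

# Assumed convention: exactly three segments, "/{team}/{project}/{experiment-type}",
# each made of lowercase letters, digits, and hyphens.
_SEGMENT = r"[a-z0-9][a-z0-9-]*"
EXPERIMENT_NAME_RE = re.compile(rf"^/{_SEGMENT}/{_SEGMENT}/{_SEGMENT}$")


def is_valid_experiment_name(name: str) -> bool:
    """Return True if `name` follows the three-level hierarchical pattern."""
    return EXPERIMENT_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("/search-team/query-ranking/bert-finetune", "experiment_1"):
        print(candidate, is_valid_experiment_name(candidate))
```

A check like this fits naturally in a shared training wrapper or a CI lint step, so that non-conforming names are rejected before they reach the tracking server.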

Tag System Design

Tags are the core of experiment search and governance. At minimum, the following tags must be recorded.

```python
import mlflow

# `model`, `train_loader`, `val_loader`, `train_one_epoch`, and `evaluate`
# are assumed to be defined by your training code.
with mlflow.start_run(run_name="bert-ranking-v3.2.1") as run:
    # Required tags
    mlflow.set_tag("team", "search")
    mlflow.set_tag("owner", "jane.doe@company.com")
    mlflow.set_tag("git.commit", "a1b2c3d4")
    mlflow.set_tag("data.version", "v2026.03.05")
    mlflow.set_tag("environment", "gpu-cluster-a100")
    mlflow.set_tag("purpose", "hyperparameter-sweep")

    # Log parameters
    mlflow.log_params({
        "learning_rate": 2e-5,
        "batch_size": 32,
        "max_epochs": 10,
        "model_architecture": "bert-base-uncased",
        "optimizer": "AdamW",
        "warmup_steps": 500,
    })

    # Log metrics by step (epoch or global step)
    for epoch in range(10):
        train_loss = train_one_epoch(model, train_loader)
        val_ndcg = evaluate(model, val_loader)
        mlflow.log_metrics({
            "train_loss": train_loss,
            # "@" is not an allowed character in MLflow metric names,
            # so NDCG@10 is logged as "val_ndcg_at_10".
            "val_ndcg_at_10": val_ndcg,
        }, step=epoch)

    # Log final model
    mlflow.pytorch.log_model(
        model,
        artifact_path="model",
        registered_model_name="search-ranking-bert",
    )
```
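Required tags are easy to forget, so it can help to verify them before a run starts. The following is a small sketch of such a check; the tag list mirrors the required tags above, while the function name and the idea of failing fast are assumptions for illustration.

```python
# Tag keys treated as mandatory (taken from the required tags listed above).
REQUIRED_TAGS = ("team", "owner", "git.commit", "data.version", "environment", "purpose")


def missing_required_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tag keys that are absent or blank in `tags`."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


if __name__ == "__main__":
    tags = {"team": "search", "owner": "jane.doe@company.com"}
    missing = missing_required_tags(tags)
    if missing:
        print(f"Refusing to start run, missing tags: {missing}")
```

Once the check passes, the whole dictionary can be recorded in one call with `mlflow.set_tags(tags)`.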

The Pitfalls of autolog

`mlflow.autolog()` is useful for quick prototyping, but in production experiments, the following issues arise.

- **Excessive unnecessary artifact logging**: For sklearn, feature importance plots, confusion matrices, etc. are saved for every run. Running hyperparameter searches thousands of times causes artifact storage volume to grow rapidly.

- **Missing custom metrics**: Domain-specific metrics (NDCG, MRR, business KPIs) are not logged by autolog.

- **Inconsistency across frameworks**: PyTorch, TensorFlow, and XGBoost each log with different metric names and structures.

In production, it is recommended to enable only minimal logging with `mlflow.autolog(log_models=False, log_datasets=False)` and explicitly log key metrics and models.

Model Registry Operations Strategy

Model Naming Rules

Name models with a **product focus**. Do not include version numbers or algorithm names in the model name.

```text
# Good examples (product/function focused)
fraud-detector
search-ranker
recommendation-item2vec
churn-predictor

# Bad examples (algorithm/version focused)
xgboost-fraud-v3
bert-search-ranking-2026
lgbm_model_final_final
```

Version numbers are automatically managed by the registry. Algorithm changes are tracked through tags or descriptions.

Alias-Based Deployment Workflow

In MLflow 2.x, using the **Alias** system is recommended over the traditional Stage (Staging/Production). Aliases are more flexible and can manage multiple production environments.

```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Set alias after registering a new model version
model_name = "search-ranker"
version = client.create_model_version(
    name=model_name,
    source="runs:/abc123/model",
    run_id="abc123",
    description="BERT-base finetuned on 2026 Q1 query logs",
)

# Check current champion
try:
    current_champion = client.get_model_version_by_alias(model_name, "champion")
    print(f"Current champion: v{current_champion.version}")
except mlflow.exceptions.MlflowException:
    current_champion = None
    print("No champion alias set yet")

# Canary deployment: assign challenger alias to new version
client.set_registered_model_alias(model_name, "challenger", version.version)

# Promote to champion after canary validation passes
client.set_registered_model_alias(model_name, "champion", version.version)

# Keep the previous champion reachable for rollback (skipped on first deployment)
if current_champion is not None:
    client.set_registered_model_alias(
        model_name, "previous-champion", current_champion.version
    )
```

In serving code, load the model by alias. Replacing a model then only requires changing the alias in the registry: the serving process picks up the new version the next time it loads the model, with no code change or redeploy.

```python
import mlflow

# Serving code: alias-based model loading
model = mlflow.pyfunc.load_model("models:/search-ranker@champion")
predictions = model.predict(input_data)  # input_data: your request payload
```
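The alias transitions are easier to reason about when modeled in isolation. The sketch below is a toy in-memory stand-in for the alias table, not the MLflow API; the class and method names are hypothetical. It shows the invariant the workflow relies on: every promotion remembers the old champion, so a rollback is always a single alias move.

```python
class AliasTable:
    """Toy alias -> version mapping for one registered model (not the MLflow API)."""

    def __init__(self) -> None:
        self.aliases: dict[str, int] = {}

    def promote(self, new_version: int) -> None:
        """Point `champion` at `new_version`, keeping the old one for rollback."""
        old = self.aliases.get("champion")
        if old is not None and old != new_version:
            self.aliases["previous-champion"] = old
        self.aliases["champion"] = new_version
        # The challenger slot is free again once promotion is done.
        self.aliases.pop("challenger", None)

    def rollback(self) -> bool:
        """Restore the previous champion; return False if there is none."""
        previous = self.aliases.get("previous-champion")
        if previous is None:
            return False
        self.aliases["rolled-back"] = self.aliases["champion"]
        self.aliases["champion"] = previous
        return True


if __name__ == "__main__":
    table = AliasTable()
    table.promote(3)
    table.promote(4)
    print(table.aliases)
    table.rollback()
    print(table.aliases)
```

Against a real registry, each assignment in this model corresponds to one `set_registered_model_alias` call.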

Model Version Metadata Management

The following information should be tagged for each model version. Without this information, no one will know "what data was this model trained on" six months later.

```python
# `client`, `model_name`, and `version` come from the registration step above.
client.set_model_version_tag(model_name, version.version, "training_data", "s3://data/query-logs/2026-q1/")
client.set_model_version_tag(model_name, version.version, "training_commit", "a1b2c3d4e5f6")
client.set_model_version_tag(model_name, version.version, "validation_ndcg", "0.847")
client.set_model_version_tag(model_name, version.version, "approved_by", "jane.doe")
client.set_model_version_tag(model_name, version.version, "approval_date", "2026-03-05")
```

Experiment Tracking Tool Comparison

Before choosing MLflow, evaluate tools that match your team's requirements.


**Selection Criteria Summary**:

- **Cost sensitive + Self-hosting required**: MLflow or ClearML

- **Best-in-class visualization + Team collaboration**: Weights & Biases

- **Large enterprise + Governance**: Neptune.ai

- **Using Databricks ecosystem**: MLflow (native integration)

CI/CD Pipeline Integration

GitHub Actions and MLflow Integration

Automating model training and registration ensures reproducibility and reduces human errors.

```yaml
# .github/workflows/train-and-register.yml
name: Train and Register Model

on:
  push:
    paths:
      - 'models/search-ranker/**'
    branches: [main]
  workflow_dispatch:
    inputs:
      experiment_name:
        description: 'MLflow experiment name'
        required: true
        default: '/search-team/query-ranking/scheduled-retrain'

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install "mlflow[extras]"

      - name: Train model
        run: |
          python models/search-ranker/train.py \
            --experiment-name "${{ github.event.inputs.experiment_name || '/search-team/query-ranking/ci-train' }}" \
            --run-name "ci-${{ github.sha }}" \
            --register-model search-ranker

      # Calls the validation script shown in the next section.
      - name: Validate model
        run: |
          python scripts/validate_model.py \
            --model-name search-ranker \
            --alias challenger \
            --threshold 0.82

      - name: Promote to champion
        if: success()
        run: |
          python scripts/promote_model.py \
            --model-name search-ranker \
            --from-alias challenger \
            --to-alias champion
```

Model Validation Script Example

Performance validation must be performed before model promotion in the CI pipeline.

```python
# scripts/validate_model.py
import argparse
import sys

from mlflow import MlflowClient
from mlflow.exceptions import MlflowException

# Metric key the training run is assumed to log. "@" is not an allowed
# character in MLflow metric names, so NDCG@10 is stored as "val_ndcg_at_10".
METRIC_KEY = "val_ndcg_at_10"


def validate_model(model_name: str, alias: str, threshold: float) -> bool:
    """Validate that model performance meets the threshold."""
    client = MlflowClient()

    # Look up model version by alias
    model_version = client.get_model_version_by_alias(model_name, alias)
    run = client.get_run(model_version.run_id)

    # Check validation metrics
    val_ndcg = run.data.metrics.get(METRIC_KEY)
    if val_ndcg is None:
        print(f"ERROR: {METRIC_KEY} metric not found in run {model_version.run_id}")
        return False

    # Compare with current champion
    try:
        champion = client.get_model_version_by_alias(model_name, "champion")
        champion_run = client.get_run(champion.run_id)
        champion_ndcg = champion_run.data.metrics.get(METRIC_KEY, 0)
        print(f"Champion v{champion.version} NDCG: {champion_ndcg:.4f}")
        print(f"Challenger v{model_version.version} NDCG: {val_ndcg:.4f}")

        # Check for performance degradation compared to champion
        if val_ndcg < champion_ndcg * 0.98:  # Fail if more than 2% decline
            print("FAIL: Challenger performs worse than champion by more than 2%")
            return False
    except MlflowException:
        print("No existing champion found. Proceeding with threshold check only.")

    # Absolute threshold check
    if val_ndcg < threshold:
        print(f"FAIL: NDCG {val_ndcg:.4f} below threshold {threshold:.4f}")
        return False

    print(f"PASS: NDCG {val_ndcg:.4f} meets threshold {threshold:.4f}")
    return True


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--alias", default="challenger")
    parser.add_argument("--threshold", type=float, default=0.80)
    args = parser.parse_args()

    if not validate_model(args.model_name, args.alias, args.threshold):
        sys.exit(1)
```
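The promotion rule itself (an absolute threshold plus a maximum 2% drop relative to the champion) can be separated from the registry lookups, which makes it unit-testable without a tracking server. This is a minimal sketch of that idea; the function name and signature are hypothetical.

```python
from typing import Optional


def passes_promotion_gate(challenger: float, threshold: float,
                          champion: Optional[float] = None,
                          max_relative_drop: float = 0.02) -> bool:
    """Decide whether a challenger metric is good enough to promote.

    Fails if the challenger is more than `max_relative_drop` below the
    champion (when one exists) or below the absolute `threshold`.
    """
    if champion is not None and challenger < champion * (1 - max_relative_drop):
        return False
    return challenger >= threshold


if __name__ == "__main__":
    print(passes_promotion_gate(0.85, threshold=0.82, champion=0.86))
    print(passes_promotion_gate(0.83, threshold=0.82, champion=0.86))
```

With a champion at 0.86, the relative floor is 0.86 × 0.98 = 0.8428, so a challenger at 0.83 is rejected even though it clears the absolute threshold of 0.82.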

Multi-Tenancy Configuration

When there are multiple teams and experiment data needs to be isolated, here is how to configure MLflow's multi-tenancy.

Enabling Authentication

MLflow 2.x comes with built-in authentication features.

```bash
# Start server with authentication enabled
mlflow server \
  --backend-store-uri postgresql://mlflow:${DB_PASSWORD}@db:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --app-name basic-auth \
  --host 0.0.0.0 \
  --port 5000
```

Team Isolation Strategy

Even when teams need their data kept apart, logical isolation through experiment naming and permissions is usually cheaper to operate than running a separate MLflow instance per team.

```python
import mlflow

# Logical isolation with team-specific experiment prefixes
TEAM_PREFIX = {
    "search": "/search-team",
    "fraud": "/fraud-team",
    "recommendation": "/rec-team",
}


def get_experiment_name(team: str, project: str, experiment: str) -> str:
    """Generate experiment name with team prefix."""
    prefix = TEAM_PREFIX.get(team)
    if prefix is None:
        raise ValueError(f"Unknown team: {team}. Allowed: {list(TEAM_PREFIX.keys())}")
    return f"{prefix}/{project}/{experiment}"


# Usage example
experiment = get_experiment_name("search", "query-ranking", "bert-v4-sweep")
mlflow.set_experiment(experiment)  # "/search-team/query-ranking/bert-v4-sweep"
```

For larger organizations, MLflow 3.x's Multi-Workspace feature enables experiment/model/prompt isolation at the workspace level on a single tracking server.

Artifact Management and Cost Optimization

Artifact Cleanup Automation

As experiments accumulate, artifact storage costs increase rapidly. This is especially problematic when hyperparameter searches generate hundreds to thousands of model checkpoints.

```python
from datetime import datetime, timedelta

from mlflow import MlflowClient


def cleanup_old_runs(experiment_name: str, days_old: int = 90, dry_run: bool = True):
    """Delete non-running runs that ended before the cutoff.

    Note: delete_run only marks a run as deleted. Its artifacts are removed
    from storage when `mlflow gc` is run afterwards.
    """
    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        print(f"Experiment '{experiment_name}' not found")
        return

    cutoff_ts = int((datetime.now() - timedelta(days=days_old)).timestamp() * 1000)

    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string=f"attributes.end_time < {cutoff_ts} AND attributes.status != 'RUNNING'",
        order_by=["attributes.end_time ASC"],
        max_results=500,
    )

    deleted_count = 0
    for run in runs:
        # Check preservation status via tags
        if run.data.tags.get("keep", "false").lower() == "true":
            continue

        # Skip runs with registered models (relies on this run tag being
        # set by your training pipeline)
        if run.data.tags.get("mlflow.registeredModelName"):
            continue

        if dry_run:
            print(f"[DRY RUN] Would delete run {run.info.run_id} "
                  f"(ended: {datetime.fromtimestamp(run.info.end_time / 1000)})")
        else:
            client.delete_run(run.info.run_id)
        # Count in both modes so the dry-run summary is accurate
        deleted_count += 1

    print(f"{'Would delete' if dry_run else 'Deleted'} {deleted_count} runs "
          f"out of {len(runs)} found")


# Usage: verify with dry_run first, then actually delete
cleanup_old_runs("/search-team/query-ranking/bert-finetune", days_old=60, dry_run=True)
```
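Because a cleanup job deletes data, the selection rule is worth testing on its own before it touches a real server. The sketch below pulls the rule into pure functions that take plain values instead of MLflow objects. The function names are hypothetical, and the tag keys are the ones used in the script above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cutoff_millis(now: datetime, days_old: int) -> int:
    """Epoch milliseconds for the moment `days_old` days before `now`."""
    return int((now - timedelta(days=days_old)).timestamp() * 1000)


def is_cleanup_candidate(status: str, end_time_ms: Optional[int],
                         tags: dict[str, str], cutoff_ms: int) -> bool:
    """Return True if a run with these attributes should be deleted."""
    if status == "RUNNING" or end_time_ms is None:
        return False  # never touch runs that have not finished
    if tags.get("keep", "false").lower() == "true":
        return False  # explicitly preserved
    if tags.get("mlflow.registeredModelName"):
        return False  # tag assumed to mark runs backing a registered model
    return end_time_ms < cutoff_ms


if __name__ == "__main__":
    now = datetime(2026, 3, 5, tzinfo=timezone.utc)
    cutoff = cutoff_millis(now, days_old=60)
    old_end = cutoff - 1
    print(is_cleanup_candidate("FAILED", old_end, {}, cutoff))
    print(is_cleanup_candidate("FINISHED", old_end, {"keep": "true"}, cutoff))
```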

S3 Lifecycle Policy

In addition to MLflow artifact cleanup, you can further reduce costs by setting lifecycle policies at the S3 bucket level.

```json
{
  "Rules": [
    {
      "ID": "MoveOldArtifactsToIA",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "prod/"
      },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "ID": "DeleteTempArtifacts",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "tmp/"
      },
      "Expiration": {
        "Days": 7
      }
    }
  ]
}
```
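If several buckets or prefixes need the same tiering policy, generating the JSON can be less error-prone than editing it by hand. The following sketch builds the transition rule from parameters and rejects obviously invalid settings. The function name and defaults are assumptions; the 30-day minimum reflects the S3 requirement that objects be at least 30 days old before moving to `STANDARD_IA`.

```python
import json


def build_lifecycle_rules(prefix: str, ia_days: int = 90, glacier_days: int = 365) -> dict:
    """Build a lifecycle configuration that tiers objects under `prefix`."""
    if ia_days < 30:
        raise ValueError("ia_days must be at least 30")
    if glacier_days <= ia_days:
        raise ValueError("glacier_days must be greater than ia_days")
    return {
        "Rules": [
            {
                "ID": "MoveOldArtifactsToIA",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_lifecycle_rules("prod/"), indent=2))
```

The printed JSON can be saved to a file and applied with `aws s3api put-bucket-lifecycle-configuration --bucket company-mlflow-artifacts --lifecycle-configuration file://lifecycle.json`.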

Troubleshooting: Common Operational Failures

1. DB Connection Pool Exhaustion

**Symptoms**: `OperationalError: too many connections` occurs when many experiments run simultaneously.

**Cause**: MLflow server's default SQLAlchemy connection pool size (5) is insufficient.

**Solution**:

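```bash
# Adjust connection pool parameters at server startup.
# MLflow reads pool settings from environment variables; query parameters in
# the URI are passed to the database driver, not to the SQLAlchemy pool.
export MLFLOW_SQLALCHEMYSTORE_POOL_SIZE=20
export MLFLOW_SQLALCHEMYSTORE_MAX_OVERFLOW=40

mlflow server \
  --backend-store-uri "postgresql://mlflow:pass@db:5432/mlflow" \
  --default-artifact-root s3://artifacts/ \
  --workers 8
```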
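Raising the pool size only helps if the database can accept the extra connections. As a rough worked calculation, assuming each server worker process keeps its own pool, the worst-case connection count is workers × (pool size + max overflow). The helper names below are hypothetical.

```python
def max_db_connections(workers: int, pool_size: int, max_overflow: int) -> int:
    """Worst-case connections opened by the tracking server,
    assuming one independent pool per worker process."""
    return workers * (pool_size + max_overflow)


def fits_database_limit(workers: int, pool_size: int, max_overflow: int,
                        db_max_connections: int) -> bool:
    """Check the worst case against the database's connection limit."""
    return max_db_connections(workers, pool_size, max_overflow) <= db_max_connections


if __name__ == "__main__":
    # 8 workers, pool_size=20, max_overflow=40
    print(max_db_connections(8, 20, 40))
    # PostgreSQL's default max_connections is 100
    print(fits_database_limit(8, 20, 40, db_max_connections=100))
```

With 8 workers, a pool of 20, and an overflow of 40, the worst case is 480 connections, well above PostgreSQL's default `max_connections` of 100. Either raise the database limit, place a pooler such as PgBouncer in front of it, or choose smaller pool values.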

2. Artifact Upload Timeout

**Symptoms**: `ConnectionError` or timeout when logging large models (several GB).

**Solution**:

```bash
# Extend upload timeout
export MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT=600

# Pass extra arguments to S3 uploads (here: server-side encryption with KMS)
export MLFLOW_S3_UPLOAD_EXTRA_ARGS='{"ServerSideEncryption": "aws:kms"}'
```

3. Run Status Permanently Stuck as RUNNING

**Symptoms**: The training process died but the run continues showing "Running" status in the MLflow UI.

**Solution**:

```python
from datetime import datetime

from mlflow import MlflowClient

client = MlflowClient()

# Force terminate stuck runs
stuck_runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="attributes.status = 'RUNNING'",
)

for run in stuck_runs:
    end_time = run.info.end_time
    # If there is no end_time and the run started more than 24 hours ago
    if end_time is None or end_time == 0:
        start_time = run.info.start_time
        if (datetime.now().timestamp() * 1000 - start_time) > 86400000:  # 24h
            client.set_terminated(run.info.run_id, status="FAILED")
            print(f"Force-terminated stuck run: {run.info.run_id}")
```

4. Model Registry Alias Conflict

**Symptoms**: Two CI pipelines try to set the same alias simultaneously.

**Solution**: Check the current alias state before setting an alias, and use a distributed lock. Redis-based locking is the simplest approach.

```python
import redis
from mlflow import MlflowClient


def safe_promote_model(model_name: str, version: str, alias: str, redis_url: str):
    """Safe model promotion using distributed lock."""
    r = redis.from_url(redis_url)
    lock_key = f"mlflow:promote:{model_name}:{alias}"

    # Acquire distributed lock with 30-second TTL
    lock = r.lock(lock_key, timeout=30)
    if lock.acquire(blocking=True, blocking_timeout=10):
        try:
            client = MlflowClient()
            client.set_registered_model_alias(model_name, alias, version)
            print(f"Successfully promoted {model_name} v{version} to @{alias}")
        finally:
            lock.release()
    else:
        raise RuntimeError(f"Failed to acquire lock for {model_name}@{alias}")
```

5. PostgreSQL Disk Full

**Symptoms**: Metric logging fails with a `DiskFull` error.

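**Solution**: MLflow stores each logged metric value as its own row, so heavy step-level logging makes the DB grow rapidly. Regularly delete old runs and then reclaim the space with `VACUUM FULL`. Because `VACUUM FULL` takes an exclusive lock on each table while it rewrites it, schedule it in a maintenance window. Also reduce metric logging frequency, for example by logging every 100 steps instead of every step.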
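To see how much the logging interval matters, a quick row-count estimate helps. The sketch below assumes one row per logged metric value, as described above; the helper names and the example workload are hypothetical.

```python
def should_log(step: int, every: int = 100) -> bool:
    """Return True on steps 0, `every`, 2 * `every`, ..."""
    return step % every == 0


def estimated_metric_rows(runs: int, metrics_per_step: int, steps: int,
                          log_every: int = 1) -> int:
    """Approximate metric rows written, at one row per logged value."""
    logged_steps = -(-steps // log_every)  # ceiling division
    return runs * metrics_per_step * logged_steps


if __name__ == "__main__":
    # 1,000 runs, 5 metrics, 100,000 steps each
    print(estimated_metric_rows(1000, 5, 100_000))
    print(estimated_metric_rows(1000, 5, 100_000, log_every=100))
```

For this workload, logging every step writes 500 million rows, while logging every 100 steps writes 5 million. In a training loop, the guard is simply `if should_log(step): mlflow.log_metrics(..., step=step)`.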

Operations Checklist

When operating production MLflow, the following items should be checked periodically.

Initial Setup Checklist

- [ ] PostgreSQL/MySQL backend store configuration complete

- [ ] S3/GCS artifact store configuration and IAM permissions set

- [ ] Tracking server high availability (HA) configured (load balancer + multiple workers)

- [ ] Authentication enabled (`--app-name basic-auth`)

- [ ] TLS termination configured (Nginx/ALB frontend)

- [ ] Experiment naming conventions documented and shared with the team

- [ ] Model registry naming conventions agreed upon

Weekly Operations Checklist

- [ ] Artifact store capacity monitoring (threshold alerts set)

- [ ] DB disk usage checked

- [ ] Stuck (RUNNING status) runs cleaned up

- [ ] Failed run artifact cleanup script executed

- [ ] Tracking server response time verified (maintain P95 under 500ms)

Monthly Operations Checklist

- [ ] S3/GCS cost analysis and lifecycle policy review

- [ ] DB performance analysis (slow query check, index optimization)

- [ ] Unused models in model registry cleaned up

- [ ] MLflow version upgrade review

- [ ] Backup/recovery procedure tested

Rollback and Disaster Recovery Procedures

Model Rollback

Procedure for immediately rolling back to the previous version when a production model has issues.

```python
from datetime import datetime

from mlflow import MlflowClient
from mlflow.exceptions import MlflowException


def rollback_model(model_name: str):
    """Roll back the champion model to previous-champion."""
    client = MlflowClient()

    try:
        previous = client.get_model_version_by_alias(model_name, "previous-champion")
    except MlflowException:
        print("ERROR: No previous-champion alias found. Manual intervention required.")
        return False

    current = client.get_model_version_by_alias(model_name, "champion")

    # Execute rollback
    client.set_registered_model_alias(model_name, "champion", previous.version)
    client.set_registered_model_alias(model_name, "rolled-back", current.version)

    # Tag rollback reason
    client.set_model_version_tag(
        model_name, current.version, "rollback_reason", "performance_degradation"
    )
    client.set_model_version_tag(
        model_name, current.version, "rolled_back_at", datetime.now().isoformat()
    )

    print(f"Rolled back {model_name}: v{current.version} -> v{previous.version}")
    return True
```

DB Recovery

When the PostgreSQL backend fails, recover in the following order.

1. Restore from the latest DB snapshot

2. Restart the MLflow server and verify artifact store consistency

3. Clean up orphan artifact references with the `mlflow gc` command

4. Verify that the registry's champion alias points to the correct model version

```bash
# Artifact garbage collection
mlflow gc \
  --backend-store-uri postgresql://mlflow:pass@db:5432/mlflow \
  --older-than 30d
```

Migration Points from MLflow 2.x to 3.x

MLflow 3.0 (released mid-2025) focused on GenAI and AI agent support. Key points for existing 2.x users:

- **Model Registry Extension**: In 3.x, code versions, prompt configurations, evaluation runs, and deployment metadata are linked to models. Backward compatible with existing 2.x registries.

- **Tracing Feature Added**: The `mlflow-tracing` SDK allows adding instrumentation to code/models/agents with minimal dependencies in production environments.

- **search_logged_models() API**: Enables SQL-like syntax for searching across experiments based on performance metrics, parameters, and model attributes.

- **LLM Cost Tracking**: Added functionality to automatically extract model information from LLM spans and calculate costs.

- **UI Improvements**: A sidebar for GenAI app and agent developers has been added, while continuing to support existing model training workflows.

When upgrading from 2.x to 3.x, you must run the DB migration script (`mlflow db upgrade`) and ensure a DB backup is available before upgrading.

Summary

Experiment tracking and model registry in MLflow 2.x are easy to install, but operating at production level requires systematically establishing architecture design, naming conventions, artifact management, CI/CD integration, multi-tenancy, monitoring, and rollback procedures. Artifact storage cost management and DB performance optimization in particular become significant technical debt if not incorporated into the design from the beginning.

Key principles summarized:

1. **Name experiments hierarchically** and leave rich metadata through tags.

2. **Name models product-centric** -- leave versions and algorithms to the registry and tags.

3. **Implement zero-downtime model replacement** with alias-based deployment.

4. **Automate training-validation-promotion** in your CI/CD pipeline.

5. **Automate artifact cleanup** -- otherwise the S3 bill will become frightening every month.

References

- [MLflow Tracking Official Documentation](https://mlflow.org/docs/latest/ml/tracking/)

- [MLflow Model Registry Official Documentation](https://mlflow.org/docs/latest/ml/model-registry/)

- [MLflow Artifact Stores Configuration Guide](https://mlflow.org/docs/latest/self-hosting/architecture/artifact-store/)

- [MLflow Model Registry Workflows](https://mlflow.org/docs/latest/ml/model-registry/workflow/)

- [MLflow Self-Hosting Troubleshooting](https://mlflow.org/docs/latest/self-hosting/troubleshooting/)

- [MLflow Releases](https://mlflow.org/releases)

- [MLflow GitHub Repository](https://github.com/mlflow/mlflow)

- [MLflow Model Serving Official Documentation](https://mlflow.org/docs/latest/ml/deployment/)
