MLflow 2.x Experiment Tracking and Model Registry Operations Guide

Why an MLflow 2.x Operations Guide Is Needed

MLflow is downloaded more than 14 million times per month and has become the de facto standard among open-source experiment tracking tools. Installing it and calling mlflow.autolog() is the easy part. The problems start afterward: as the team grows and experiments run into the thousands, operational issues surface, such as missing experiment naming conventions, runaway artifact storage, and a confused model promotion process.

This article covers practical patterns for designing and operating experiment tracking and a model registry at production level, spanning MLflow 2.x (2.9-2.18) and early 3.x releases. It is written for team-scale operations, not local experimentation.

Architecture Design: Separating the Tracking Server and Artifact Store

Core Components

A production MLflow architecture should be split into three layers.

  1. Tracking Server: stores experiment metadata (parameters, metrics, tags) in a PostgreSQL or MySQL backend.
  2. Artifact Store: stores model binaries, datasets, and visualization files in S3/GCS/Azure Blob.
  3. Model Registry: handles model versioning, aliases, and stage transitions; shares the Tracking Server's database.
# Production tracking server launch command
mlflow server \
  --backend-store-uri postgresql://mlflow_user:${DB_PASSWORD}@db.internal:5432/mlflow_prod \
  --default-artifact-root s3://company-mlflow-artifacts/prod/ \
  --artifacts-destination s3://company-mlflow-artifacts/prod/ \
  --host 0.0.0.0 \
  --port 5000 \
  --workers 4 \
  --gunicorn-opts "--timeout 120 --keep-alive 5"

Artifact Store Configuration Caveats

When S3 is the artifact store, setting MLFLOW_S3_ENDPOINT_URL on both the client and the server causes path conflicts. The rule of thumb: specify the path on the server with --default-artifact-root and leave that environment variable unset on the client.

# Client-side configuration (the correct approach)
import mlflow
import os

# Set only the tracking server URI; the server manages artifact paths.
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.internal:5000"

# Prefer IAM roles for S3 auth (EC2/EKS environments).
# Specify credentials explicitly only for local development:
# os.environ["AWS_PROFILE"] = "mlflow-dev"

mlflow.set_experiment("/team-search/ranking-model-v3")

When using GCS, specify the path in gs://bucket/path form, and in production use Workload Identity rather than Service Account keys. Artifact upload/download timeouts are controlled by the MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT environment variable; for GCS the default is 60 seconds. Raise it to 300 seconds or more when handling large model checkpoints.
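The timeout can also be set from Python before any artifact transfer happens; a minimal sketch (the 300-second value is illustrative, tune it to your checkpoint sizes):

```python
import os

# Raise the artifact transfer timeout for large checkpoints (seconds).
# Must be set before the first artifact upload/download in the process.
os.environ["MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT"] = "300"
```

In containerized deployments the same variable is usually injected through the pod or task environment instead of in code.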

Experiment Tracking Design Patterns

Experiment Naming Strategy

Use a /team/project/experiment-type pattern for experiment names. Flat naming becomes unmanageable once you pass about 100 experiments.

import mlflow

# Good examples: hierarchical naming
mlflow.set_experiment("/search-team/query-ranking/bert-finetune")
mlflow.set_experiment("/fraud-team/transaction-classifier/xgboost-baseline")
mlflow.set_experiment("/recommendation/item2vec/hyperopt-sweep")

# Bad examples: flat, ambiguous naming
# mlflow.set_experiment("experiment_1")
# mlflow.set_experiment("test_model")
# mlflow.set_experiment("johns_experiment")

Tag Taxonomy Design

Tags are the backbone of experiment search and governance. At minimum, record the tags below.

import mlflow
from datetime import datetime

with mlflow.start_run(run_name="bert-ranking-v3.2.1") as run:
    # Required tags
    mlflow.set_tag("team", "search")
    mlflow.set_tag("owner", "jane.doe@company.com")
    mlflow.set_tag("git.commit", "a1b2c3d4")
    mlflow.set_tag("data.version", "v2026.03.05")
    mlflow.set_tag("environment", "gpu-cluster-a100")
    mlflow.set_tag("purpose", "hyperparameter-sweep")

    # Log parameters
    mlflow.log_params({
        "learning_rate": 2e-5,
        "batch_size": 32,
        "max_epochs": 10,
        "model_architecture": "bert-base-uncased",
        "optimizer": "AdamW",
        "warmup_steps": 500,
    })

    # Log metrics per step (epoch or global step); train_one_epoch/evaluate are placeholder helpers
    for epoch in range(10):
        train_loss = train_one_epoch(model, train_loader)
        val_ndcg = evaluate(model, val_loader)

        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_ndcg@10": val_ndcg,
        }, step=epoch)

    # Log the final model
    mlflow.pytorch.log_model(
        model,
        artifact_path="model",
        registered_model_name="search-ranking-bert",
    )

The Pitfalls of autolog

mlflow.autolog() is great for quick prototyping, but in production experiments it causes the following problems.

  • Excessive artifact logging: with sklearn, feature importance plots, confusion matrices, and the like are saved on every run. Across thousands of hyperparameter-search runs, artifact storage balloons.
  • Missing custom metrics: autolog does not record domain-specific metrics (NDCG, MRR, business KPIs).
  • Inconsistency across frameworks: PyTorch, TensorFlow, and XGBoost each log under different metric names and structures.

In production, enable only minimal logging with mlflow.autolog(log_models=False, log_datasets=False) and log key metrics and models explicitly.

Model Registry Operations Strategy

Model Naming Rules

Name models after the product or capability. Do not embed version numbers or algorithm names in the model name.

# Good examples (product/capability focused)
fraud-detector
search-ranker
recommendation-item2vec
churn-predictor

# Bad examples (algorithm/version focused)
xgboost-fraud-v3
bert-search-ranking-2026
lgbm_model_final_final

The registry manages version numbers automatically. Track algorithm changes through tags or the description field.
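One lightweight convention for this is composing a structured description string and attaching it with MlflowClient.update_model_version. The helper below is a hypothetical convention for illustration, not an MLflow API; the model name and version in the comment are placeholders:

```python
def algorithm_change_note(new_algo: str, prev_algo: str, reason: str) -> str:
    """Compose a registry description documenting an algorithm change."""
    return f"Algorithm change: {prev_algo} -> {new_algo}. Reason: {reason}"

# With a running tracking server (names/versions illustrative):
# from mlflow import MlflowClient
# client = MlflowClient()
# client.update_model_version(
#     name="search-ranker",
#     version="7",
#     description=algorithm_change_note("bert-base", "xgboost",
#                                       "better NDCG on long-tail queries"),
# )
```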

Alias-Based Deployment Workflow

In MLflow 2.x, the Alias system is recommended over the legacy Stages (Staging/Production). Aliases are more flexible and can address multiple production environments.

from mlflow import MlflowClient
from mlflow.exceptions import MlflowException

client = MlflowClient()

# Register a new model version, then manage its aliases
model_name = "search-ranker"
version = client.create_model_version(
    name=model_name,
    source="runs:/abc123/model",
    run_id="abc123",
    description="BERT-base finetuned on 2026 Q1 query logs"
)

# Check the current champion
current_champion = None
try:
    current_champion = client.get_model_version_by_alias(model_name, "champion")
    print(f"Current champion: v{current_champion.version}")
except MlflowException:
    print("No champion alias set yet")

# Canary deployment: tag the new version as challenger
client.set_registered_model_alias(model_name, "challenger", version.version)

# After canary validation passes, promote it to champion
client.set_registered_model_alias(model_name, "champion", version.version)

# Keep a pointer to the previous champion (if one existed)
if current_champion is not None:
    client.set_registered_model_alias(model_name, "previous-champion", current_champion.version)

If serving code loads the model by alias, swapping models becomes a zero-downtime operation: only the alias changes in the registry.

import mlflow

# Serving code: alias-based model loading
model = mlflow.pyfunc.load_model("models:/search-ranker@champion")
predictions = model.predict(input_data)

Model Version Metadata Management

Tag every model version with the information below. Without it, six months from now nobody will know what data the model was trained on.

client.set_model_version_tag(model_name, version.version, "training_data", "s3://data/query-logs/2026-q1/")
client.set_model_version_tag(model_name, version.version, "training_commit", "a1b2c3d4e5f6")
client.set_model_version_tag(model_name, version.version, "validation_ndcg", "0.847")
client.set_model_version_tag(model_name, version.version, "approved_by", "jane.doe")
client.set_model_version_tag(model_name, version.version, "approval_date", "2026-03-05")

Experiment Tracking Tool Comparison

Before committing to MLflow, evaluate the tools against your team's requirements.

Item                    | MLflow               | Weights & Biases     | Neptune.ai              | ClearML
License                 | Apache 2.0 (OSS)     | Premium SaaS         | Premium SaaS            | SSPL (restricted OSS)
Self-hosting            | Full support         | Limited              | Limited                 | Full support
Experiment tracking     | Excellent            | Best (visualization) | Best (at scale)         | Excellent
Model registry          | Built-in             | Built-in             | External integration    | Built-in
GenAI support           | Strengthened in 3.x  | LLM eval built-in    | Limited                 | Limited
High-volume logging     | Fair (DB-bound)      | Excellent            | Best (1000x throughput) | Excellent
UI/UX                   | Functional           | Intuitive, best      | Functional              | Excellent
Cost (50-person team)   | Infrastructure only  | $2,500-10,000/mo     | $2,500-10,000/mo        | Infrastructure only
Databricks integration  | Native               | Plugin               | Plugin                  | Limited
Community               | 20K+ GitHub stars    | Active               | Active                  | Active

Selection criteria in short:

  • Cost-sensitive + self-hosting required: MLflow or ClearML
  • Best-in-class visualization + team collaboration: Weights & Biases
  • Large enterprise + governance: Neptune.ai
  • Already on Databricks: MLflow (native integration)

CI/CD Pipeline Integration

Wiring GitHub Actions to MLflow

Automating model training and registration guarantees reproducibility and reduces human error.

# .github/workflows/train-and-register.yml
name: Train and Register Model

on:
  push:
    paths:
      - 'models/search-ranker/**'
    branches: [main]
  workflow_dispatch:
    inputs:
      experiment_name:
        description: 'MLflow experiment name'
        required: true
        default: '/search-team/query-ranking/scheduled-retrain'

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install mlflow[extras]

      - name: Train model
        run: |
          python models/search-ranker/train.py \
            --experiment-name "${{ github.event.inputs.experiment_name || '/search-team/query-ranking/ci-train' }}" \
            --run-name "ci-${{ github.sha }}" \
            --register-model search-ranker

      - name: Validate model
        run: |
          python models/search-ranker/validate.py \
            --model-uri "models:/search-ranker@challenger" \
            --threshold-ndcg 0.82

      - name: Promote to champion
        if: success()
        run: |
          python scripts/promote_model.py \
            --model-name search-ranker \
            --from-alias challenger \
            --to-alias champion

Model Validation Script Example

The CI pipeline must validate performance before promoting a model.

# scripts/validate_model.py
import mlflow
import sys
from mlflow import MlflowClient

def validate_model(model_name: str, alias: str, threshold: float) -> bool:
    """모델 성능이 임계값을 충족하는지 검증한다."""
    client = MlflowClient()

    # Look up the model version by alias
    model_version = client.get_model_version_by_alias(model_name, alias)
    run = client.get_run(model_version.run_id)

    # Check the validation metric
    val_ndcg = run.data.metrics.get("val_ndcg@10")
    if val_ndcg is None:
        print(f"ERROR: val_ndcg@10 metric not found in run {model_version.run_id}")
        return False

    # Compare against the current champion
    try:
        champion = client.get_model_version_by_alias(model_name, "champion")
        champion_run = client.get_run(champion.run_id)
        champion_ndcg = champion_run.data.metrics.get("val_ndcg@10", 0)
        print(f"Champion v{champion.version} NDCG: {champion_ndcg:.4f}")
        print(f"Challenger v{model_version.version} NDCG: {val_ndcg:.4f}")

        # Fail if the challenger underperforms the champion by more than 2%
        if val_ndcg < champion_ndcg * 0.98:
            print("FAIL: Challenger performs worse than champion by more than 2%")
            return False
    except Exception:
        print("No existing champion found. Proceeding with threshold check only.")

    # Absolute threshold check
    if val_ndcg < threshold:
        print(f"FAIL: NDCG {val_ndcg:.4f} below threshold {threshold:.4f}")
        return False

    print(f"PASS: NDCG {val_ndcg:.4f} meets threshold {threshold:.4f}")
    return True

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--alias", default="challenger")
    parser.add_argument("--threshold", type=float, default=0.80)
    args = parser.parse_args()

    if not validate_model(args.model_name, args.alias, args.threshold):
        sys.exit(1)

Multi-Tenancy Configuration

How to configure MLflow multi-tenancy when multiple teams need their experiment data separated.

Enabling Authentication

Basic authentication has been built in since MLflow 2.x.

# Start the server with authentication enabled
mlflow server \
  --backend-store-uri postgresql://mlflow:${DB_PASSWORD}@db:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --app-name basic-auth \
  --host 0.0.0.0 \
  --port 5000

Team Isolation Strategy

Unless strict data isolation is mandated, logical isolation via experiment naming and permissions is cheaper to operate than running a separate MLflow instance per team.

# Logical isolation via team-specific experiment prefixes
TEAM_PREFIX = {
    "search": "/search-team",
    "fraud": "/fraud-team",
    "recommendation": "/rec-team",
}

def get_experiment_name(team: str, project: str, experiment: str) -> str:
    """팀 접두어가 포함된 실험 이름을 생성한다."""
    prefix = TEAM_PREFIX.get(team)
    if prefix is None:
        raise ValueError(f"Unknown team: {team}. Allowed: {list(TEAM_PREFIX.keys())}")
    return f"{prefix}/{project}/{experiment}"

# Usage
experiment = get_experiment_name("search", "query-ranking", "bert-v4-sweep")
mlflow.set_experiment(experiment)  # "/search-team/query-ranking/bert-v4-sweep"

In larger organizations, the Multi-Workspace capability in MLflow 3.x allows workspace-level isolation of experiments, models, and prompts on a single tracking server.

Artifact Management and Cost Optimization

Automating Artifact Cleanup

As experiments pile up, artifact storage costs climb fast, especially when hyperparameter searches emit hundreds to thousands of model checkpoints.

from mlflow import MlflowClient
from datetime import datetime, timedelta

def cleanup_old_runs(experiment_name: str, days_old: int = 90, dry_run: bool = True):
    """Clean up finished runs older than the given cutoff."""
    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)

    if experiment is None:
        print(f"Experiment '{experiment_name}' not found")
        return

    cutoff_ts = int((datetime.now() - timedelta(days=days_old)).timestamp() * 1000)

    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string=f"attributes.end_time < {cutoff_ts} AND attributes.status != 'RUNNING'",
        order_by=["attributes.end_time ASC"],
        max_results=500,
    )

    deleted_count = 0
    for run in runs:
        # Honor an explicit keep tag
        if run.data.tags.get("keep", "false").lower() == "true":
            continue

        # Skip runs whose model is registered in the registry
        if run.data.tags.get("mlflow.registeredModelName"):
            continue

        deleted_count += 1
        if dry_run:
            print(f"[DRY RUN] Would delete run {run.info.run_id} "
                  f"(ended: {datetime.fromtimestamp(run.info.end_time / 1000)})")
        else:
            client.delete_run(run.info.run_id)

    print(f"{'Would delete' if dry_run else 'Deleted'} {deleted_count} runs "
          f"out of {len(runs)} found")

# Usage: verify with dry_run first, then delete for real
cleanup_old_runs("/search-team/query-ranking/bert-finetune", days_old=60, dry_run=True)

S3 Lifecycle Policies

Independent of MLflow-level cleanup, bucket-level S3 lifecycle policies cut costs further.

{
  "Rules": [
    {
      "ID": "MoveOldArtifactsToIA",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "prod/"
      },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "ID": "DeleteTempArtifacts",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "tmp/"
      },
      "Expiration": {
        "Days": 7
      }
    }
  ]
}

Troubleshooting: Common Operational Failures

1. DB Connection Pool Exhaustion

Symptom: OperationalError: too many connections when many experiments run concurrently.

Cause: the MLflow server's default SQLAlchemy connection pool size (5) is too small.

Fix:

# Tune the connection pool when starting the server
mlflow server \
  --backend-store-uri "postgresql://mlflow:pass@db:5432/mlflow?pool_size=20&max_overflow=40" \
  --default-artifact-root s3://artifacts/ \
  --workers 8

2. Artifact Upload Timeouts

Symptom: ConnectionError or timeouts when logging large models (multiple GB).

Fix:

# Extend the upload timeout
export MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT=600

# Pass extra S3 upload arguments (e.g., server-side encryption)
export MLFLOW_S3_UPLOAD_EXTRA_ARGS='{"ServerSideEncryption": "aws:kms"}'

3. Runs Stuck in RUNNING Forever

Symptom: the training process died, but the run still shows "Running" in the MLflow UI.

Fix:

from datetime import datetime

from mlflow import MlflowClient

client = MlflowClient()

# Force-terminate stuck runs
stuck_runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="attributes.status = 'RUNNING'",
)

for run in stuck_runs:
    end_time = run.info.end_time
    # No end_time and started more than 24 hours ago
    if end_time is None or end_time == 0:
        start_time = run.info.start_time
        if (datetime.now().timestamp() * 1000 - start_time) > 86400000:  # 24h
            client.set_terminated(run.info.run_id, status="FAILED")
            print(f"Force-terminated stuck run: {run.info.run_id}")

4. Model Registry Alias Conflicts

Symptom: two CI pipelines try to set the same alias at the same time.

Fix: check the current alias state before setting it, and serialize promotions with a distributed lock. A Redis-based lock is the simplest option.

import redis

from mlflow import MlflowClient

def safe_promote_model(model_name: str, version: str, alias: str, redis_url: str):
    """Promote a model safely under a distributed lock."""
    r = redis.from_url(redis_url)
    lock_key = f"mlflow:promote:{model_name}:{alias}"

    # Acquire a distributed lock with a 30-second TTL
    lock = r.lock(lock_key, timeout=30)
    if lock.acquire(blocking=True, blocking_timeout=10):
        try:
            client = MlflowClient()
            client.set_registered_model_alias(model_name, alias, version)
            print(f"Successfully promoted {model_name} v{version} to @{alias}")
        finally:
            lock.release()
    else:
        raise RuntimeError(f"Failed to acquire lock for {model_name}@{alias}")

5. PostgreSQL Disk Full

Symptom: metric logging fails with DiskFull errors.

Fix: MLflow stores each metric data point as a separate row, so heavy per-step logging grows the DB quickly. Delete old runs regularly and run VACUUM FULL. Also throttle metric logging (e.g., every 100 steps rather than every step).
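The throttling advice can be wrapped in a small guard in the training loop; a sketch, where the 100-step interval is an assumption to tune per workload:

```python
LOG_EVERY_N_STEPS = 100

def should_log(step: int, every: int = LOG_EVERY_N_STEPS) -> bool:
    """Return True on step 0 and on every N-th step thereafter."""
    return step % every == 0

# In the training loop (mlflow calls shown for context):
# if should_log(global_step):
#     mlflow.log_metrics({"train_loss": loss}, step=global_step)
```

For smoother curves, log the running mean of the metric over the interval rather than a single-step snapshot.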

Operations Checklists

Check the items below periodically when running production MLflow.

Initial Setup Checklist

  • PostgreSQL/MySQL backend store configured
  • S3/GCS artifact store configured with IAM permissions
  • Tracking server high availability (load balancer + multiple workers)
  • Authentication enabled (--app-name basic-auth)
  • TLS termination in front (Nginx/ALB)
  • Experiment naming rules documented and shared with teams
  • Model registry naming rules agreed

Weekly Operations Checklist

  • Monitor artifact storage usage (with threshold alerts)
  • Check DB disk usage
  • Clean up stuck (RUNNING) runs
  • Run the artifact cleanup script for failed runs
  • Check tracking server response time (keep P95 under 500 ms)
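The P95 check can be scripted against latency samples collected from periodic health-check requests; the percentile helper below is a plain-Python sketch, and the sample values are illustrative:

```python
import math

def p95_ms(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank
    return ordered[rank - 1]

# Illustrative timings from health-check requests to the tracking server
samples = [120.0, 180.0, 95.0, 410.0, 230.0, 610.0, 150.0, 200.0, 175.0, 140.0]
print(p95_ms(samples))  # 610.0 here: over the 500 ms budget, so alert
```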

Monthly Operations Checklist

  • Analyze S3/GCS costs and review lifecycle policies
  • Analyze DB performance (slow queries, index optimization)
  • Retire unused models in the registry
  • Review MLflow version upgrades
  • Test backup/restore procedures

Rollback and Disaster Recovery

Model Rollback

The procedure for immediately rolling back to the previous version when a production model misbehaves.

from datetime import datetime

from mlflow import MlflowClient

def rollback_model(model_name: str):
    """Roll the champion alias back to previous-champion."""
    client = MlflowClient()

    try:
        previous = client.get_model_version_by_alias(model_name, "previous-champion")
    except Exception:
        print("ERROR: No previous-champion alias found. Manual intervention required.")
        return False

    current = client.get_model_version_by_alias(model_name, "champion")

    # Execute the rollback
    client.set_registered_model_alias(model_name, "champion", previous.version)
    client.set_registered_model_alias(model_name, "rolled-back", current.version)

    # Tag the rollback reason
    client.set_model_version_tag(
        model_name, current.version, "rollback_reason", "performance_degradation"
    )
    client.set_model_version_tag(
        model_name, current.version, "rolled_back_at", datetime.now().isoformat()
    )

    print(f"Rolled back {model_name}: v{current.version} -> v{previous.version}")
    return True

DB Recovery

When the PostgreSQL backend fails, recover in this order.

  1. Restore from the latest DB snapshot
  2. Restart the MLflow server and verify artifact store consistency
  3. Run mlflow gc to clean up orphaned artifact references
  4. Verify the registry's champion aliases point at the correct model versions
# Artifact garbage collection
mlflow gc \
  --backend-store-uri postgresql://mlflow:pass@db:5432/mlflow \
  --older-than 30d

Migration Notes: MLflow 2.x to 3.x

MLflow 3.0 (released mid-2025) focuses on GenAI and AI agent support. Existing 2.x users should note the following.

  • Model Registry expansion: in 3.x, code versions, prompt configurations, evaluation runs, and deployment metadata all link to the model. It remains backward compatible with 2.x registries.

  • Tracing: the mlflow-tracing SDK adds instrumentation to code, models, and agents in production with minimal dependencies.

  • search_logged_models() API: SQL-like syntax now supports searching across experiments by performance metrics, parameters, and model attributes.

  • LLM cost tracking: model information is extracted automatically from LLM spans and costs are computed.

  • UI improvements: a sidebar was added for GenAI app and agent developers, alongside the existing model training workflows.

    When upgrading from 2.x to 3.x, always run the DB migration (mlflow db upgrade) and take a DB backup beforehand.

Wrap-Up

MLflow 2.x experiment tracking and the model registry are easy to install, but operating them at production level requires a systematic approach to architecture design, naming rules, artifact management, CI/CD integration, multi-tenancy, monitoring, and rollback procedures. Artifact storage cost control and DB performance optimization in particular become serious technical debt unless designed in from the start of operations.

The core principles in summary:

  1. Name experiments hierarchically and enrich them with metadata tags.
  2. Name models after the product; leave versions and algorithms to the registry and tags.
  3. Use alias-based deployment for zero-downtime model swaps.
  4. Automate train-validate-promote in the CI/CD pipeline.
  5. Automate artifact cleanup, or dread the monthly S3 bill.

References

MLflow 2.x Experiment Tracking and Model Registry Operations Guide

MLflow 2.x Experiment Tracking and Model Registry Operations Guide

Why an MLflow 2.x Operations Guide Is Needed

MLflow is downloaded over 14 million times per month and has become the de facto standard for open-source experiment tracking tools. Installing it and calling mlflow.autolog() is easy. The problem comes next. As the team grows and experiments exceed thousands, operational issues emerge: lack of experiment naming conventions, artifact storage capacity explosions, and model promotion process confusion.

This article covers practical patterns for designing and operating experiment tracking and model registry at production level, encompassing MLflow 2.x (2.9-2.18) and early 3.x versions. It is written with team-level operations in mind, not local experimentation.

Architecture Design: Separating Tracking Server and Artifact Store

Core Components

The MLflow production architecture should be separated into three layers.

  1. Tracking Server: Stores experiment metadata (parameters, metrics, tags). Uses PostgreSQL or MySQL backend.
  2. Artifact Store: Stores model binaries, datasets, and visualization files. Uses S3/GCS/Azure Blob.
  3. Model Registry: Model version management, aliases, stage transitions. Uses the same DB as the Tracking Server.
# Production tracking server startup command
mlflow server \
  --backend-store-uri postgresql://mlflow_user:${DB_PASSWORD}@db.internal:5432/mlflow_prod \
  --default-artifact-root s3://company-mlflow-artifacts/prod/ \
  --artifacts-destination s3://company-mlflow-artifacts/prod/ \
  --host 0.0.0.0 \
  --port 5000 \
  --workers 4 \
  --gunicorn-opts "--timeout 120 --keep-alive 5"

Artifact Store Configuration Considerations

When using S3 as the artifact store, setting MLFLOW_S3_ENDPOINT_URL on both client and server sides causes path conflicts. The principle is to specify the path on the server side with --default-artifact-root and not set this environment variable on the client.

# Client-side configuration (correct approach)
import mlflow
import os

# Set only the tracking server URI. Artifact path is managed by the server.
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.internal:5000"

# IAM Role is recommended for S3 authentication (EC2/EKS environment)
# Specify credentials only for local development
# os.environ["AWS_PROFILE"] = "mlflow-dev"

mlflow.set_experiment("/team-search/ranking-model-v3")

When using GCS, specify in gs://bucket/path format, and in production, use Workload Identity instead of Service Account Keys. Artifact upload/download timeout is controlled by the MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT environment variable, with a default of 60 seconds for GCS. When handling large model checkpoints, this value should be raised to 300 seconds or more.

Experiment Tracking Design Patterns

Experiment Naming Strategy

Use the /{team}/{project}/{experiment-type} pattern for experiment names. Flat naming becomes unmanageable once experiments exceed 100.

import mlflow

# Good examples: hierarchical naming
mlflow.set_experiment("/search-team/query-ranking/bert-finetune")
mlflow.set_experiment("/fraud-team/transaction-classifier/xgboost-baseline")
mlflow.set_experiment("/recommendation/item2vec/hyperopt-sweep")

# Bad examples: flat and ambiguous naming
# mlflow.set_experiment("experiment_1")
# mlflow.set_experiment("test_model")
# mlflow.set_experiment("johns_experiment")

Tag System Design

Tags are the core of experiment search and governance. At minimum, the following tags must be recorded.

import mlflow
from datetime import datetime

with mlflow.start_run(run_name="bert-ranking-v3.2.1") as run:
    # Required tags
    mlflow.set_tag("team", "search")
    mlflow.set_tag("owner", "jane.doe@company.com")
    mlflow.set_tag("git.commit", "a1b2c3d4")
    mlflow.set_tag("data.version", "v2026.03.05")
    mlflow.set_tag("environment", "gpu-cluster-a100")
    mlflow.set_tag("purpose", "hyperparameter-sweep")

    # Log parameters
    mlflow.log_params({
        "learning_rate": 2e-5,
        "batch_size": 32,
        "max_epochs": 10,
        "model_architecture": "bert-base-uncased",
        "optimizer": "AdamW",
        "warmup_steps": 500,
    })

    # Log metrics by step (epoch or global step)
    for epoch in range(10):
        train_loss = train_one_epoch(model, train_loader)
        val_ndcg = evaluate(model, val_loader)

        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_ndcg@10": val_ndcg,
        }, step=epoch)

    # Log final model
    mlflow.pytorch.log_model(
        model,
        artifact_path="model",
        registered_model_name="search-ranking-bert",
    )

The Pitfalls of autolog

mlflow.autolog() is useful for quick prototyping, but in production experiments, the following issues arise.

  • Excessive unnecessary artifact logging: For sklearn, feature importance plots, confusion matrices, etc. are saved for every run. Running hyperparameter searches thousands of times causes artifact storage volume to grow rapidly.
  • Missing custom metrics: Domain-specific metrics (NDCG, MRR, business KPIs) are not logged by autolog.
  • Inconsistency across frameworks: PyTorch, TensorFlow, and XGBoost each log with different metric names and structures.

In production, it is recommended to enable only minimal logging with mlflow.autolog(log_models=False, log_datasets=False) and explicitly log key metrics and models.

Model Registry Operations Strategy

Model Naming Rules

Name models with a product focus. Do not include version numbers or algorithm names in the model name.

# Good examples (product/function focused)
fraud-detector
search-ranker
recommendation-item2vec
churn-predictor

# Bad examples (algorithm/version focused)
xgboost-fraud-v3
bert-search-ranking-2026
lgbm_model_final_final

Version numbers are automatically managed by the registry. Algorithm changes are tracked through tags or descriptions.

Alias-Based Deployment Workflow

In MLflow 2.x, using the Alias system is recommended over the traditional Stage (Staging/Production). Aliases are more flexible and can manage multiple production environments.

from mlflow import MlflowClient

client = MlflowClient()

# Set alias after registering a new model version
model_name = "search-ranker"
version = client.create_model_version(
    name=model_name,
    source="runs:/abc123/model",
    run_id="abc123",
    description="BERT-base finetuned on 2026 Q1 query logs"
)

# Check current champion
try:
    current_champion = client.get_model_version_by_alias(model_name, "champion")
    print(f"Current champion: v{current_champion.version}")
except mlflow.exceptions.MlflowException:
    print("No champion alias set yet")

# Canary deployment: assign challenger alias to new version
client.set_registered_model_alias(model_name, "challenger", version.version)

# Promote to champion after canary validation passes
client.set_registered_model_alias(model_name, "champion", version.version)

# Move previous champion to archived
client.set_registered_model_alias(model_name, "previous-champion", current_champion.version)

In serving code, loading the model by alias enables zero-downtime model replacement by simply changing the alias in the registry.

import mlflow

# Serving code: alias-based model loading
model = mlflow.pyfunc.load_model("models:/search-ranker@champion")
predictions = model.predict(input_data)

Model Version Metadata Management

The following information should be tagged for each model version. Without this information, no one will know "what data was this model trained on" six months later.

client.set_model_version_tag(model_name, version.version, "training_data", "s3://data/query-logs/2026-q1/")
client.set_model_version_tag(model_name, version.version, "training_commit", "a1b2c3d4e5f6")
client.set_model_version_tag(model_name, version.version, "validation_ndcg", "0.847")
client.set_model_version_tag(model_name, version.version, "approved_by", "jane.doe")
client.set_model_version_tag(model_name, version.version, "approval_date", "2026-03-05")

Experiment Tracking Tool Comparison

Before choosing MLflow, evaluate tools that match your team's requirements.

ItemMLflowWeights & BiasesNeptune.aiClearML
LicenseApache 2.0 (OSS)Premium SaaSPremium SaaSSSPL (limited OSS)
Self-hostingFull supportLimitedLimitedFull support
Experiment TrackingExcellentBest (visualization)Best (at scale)Excellent
Model RegistryBuilt-inBuilt-inExternal integrationBuilt-in
GenAI SupportEnhanced in 3.xLLM eval built-inLimitedLimited
Large-scale LoggingFair (DB dependent)ExcellentBest (1000x throughput)Excellent
UI/UXFunctionalIntuitive, bestFunctionalExcellent
Cost (50-person team)Infrastructure only$2,500-10,000/mo$2,500-10,000/moInfrastructure only
Databricks IntegrationNativePluginPluginLimited
Community20K+ GitHub StarsActiveActiveActive

Selection Criteria Summary:

  • Cost sensitive + Self-hosting required: MLflow or ClearML
  • Best-in-class visualization + Team collaboration: Weights & Biases
  • Large enterprise + Governance: Neptune.ai
  • Using Databricks ecosystem: MLflow (native integration)

CI/CD Pipeline Integration

GitHub Actions and MLflow Integration

Automating model training and registration ensures reproducibility and reduces human errors.

# .github/workflows/train-and-register.yml
name: Train and Register Model

on:
  push:
    paths:
      - 'models/search-ranker/**'
    branches: [main]
  workflow_dispatch:
    inputs:
      experiment_name:
        description: 'MLflow experiment name'
        required: true
        default: '/search-team/query-ranking/scheduled-retrain'

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install mlflow[extras]

      - name: Train model
        run: |
          python models/search-ranker/train.py \
            --experiment-name "${{ github.event.inputs.experiment_name || '/search-team/query-ranking/ci-train' }}" \
            --run-name "ci-${{ github.sha }}" \
            --register-model search-ranker

      - name: Validate model
        run: |
          python models/search-ranker/validate.py \
            --model-uri "models:/search-ranker@challenger" \
            --threshold-ndcg 0.82

      - name: Promote to champion
        if: success()
        run: |
          python scripts/promote_model.py \
            --model-name search-ranker \
            --from-alias challenger \
            --to-alias champion

Model Validation Script Example

Performance validation must be performed before model promotion in the CI pipeline.

# scripts/validate_model.py
import mlflow
import sys
from mlflow import MlflowClient

def validate_model(model_name: str, alias: str, threshold: float) -> bool:
    """Validate that model performance meets the threshold."""
    client = MlflowClient()

    # Look up model version by alias
    model_version = client.get_model_version_by_alias(model_name, alias)
    run = client.get_run(model_version.run_id)

    # Check validation metrics
    val_ndcg = run.data.metrics.get("val_ndcg@10")
    if val_ndcg is None:
        print(f"ERROR: val_ndcg@10 metric not found in run {model_version.run_id}")
        return False

    # Compare with current champion
    try:
        champion = client.get_model_version_by_alias(model_name, "champion")
        champion_run = client.get_run(champion.run_id)
        champion_ndcg = champion_run.data.metrics.get("val_ndcg@10", 0)
        print(f"Champion v{champion.version} NDCG: {champion_ndcg:.4f}")
        print(f"Challenger v{model_version.version} NDCG: {val_ndcg:.4f}")

        # Check for performance degradation compared to champion
        if val_ndcg < champion_ndcg * 0.98:  # Fail if more than 2% decline
            print("FAIL: Challenger performs worse than champion by more than 2%")
            return False
    except Exception:
        print("No existing champion found. Proceeding with threshold check only.")

    # Absolute threshold check
    if val_ndcg < threshold:
        print(f"FAIL: NDCG {val_ndcg:.4f} below threshold {threshold:.4f}")
        return False

    print(f"PASS: NDCG {val_ndcg:.4f} meets threshold {threshold:.4f}")
    return True

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--alias", default="challenger")
    parser.add_argument("--threshold", type=float, default=0.80)
    args = parser.parse_args()

    if not validate_model(args.model_name, args.alias, args.threshold):
        sys.exit(1)

Multi-Tenancy Configuration

When there are multiple teams and experiment data needs to be isolated, here is how to configure MLflow's multi-tenancy.

Enabling Authentication

MLflow 2.x ships with an experimental built-in basic-auth app (available since 2.5), enabled with --app-name basic-auth.

# Start server with authentication enabled
mlflow server \
  --backend-store-uri postgresql://mlflow:${DB_PASSWORD}@db:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --app-name basic-auth \
  --host 0.0.0.0 \
  --port 5000
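The auth app reads its settings from an ini file pointed to by the MLFLOW_AUTH_CONFIG_PATH environment variable. The sketch below follows the shape of the default basic_auth.ini shipped with MLflow — verify the key names against your installed version before relying on them:

```ini
[mlflow]
# Permission granted on resources a user has no explicit grant for
default_permission = READ
# Separate store for users/permissions (keep apart from the tracking DB)
database_uri = sqlite:///basic_auth.db
# Bootstrap admin account -- rotate this password immediately after first login
admin_username = admin
admin_password = change-me
```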

Team Isolation Strategy

Unless complete data isolation is strictly required, logical isolation through experiment naming and permissions is more cost-effective operationally than running separate MLflow instances per team.

# Logical isolation with team-specific experiment prefixes
TEAM_PREFIX = {
    "search": "/search-team",
    "fraud": "/fraud-team",
    "recommendation": "/rec-team",
}

def get_experiment_name(team: str, project: str, experiment: str) -> str:
    """Generate experiment name with team prefix."""
    prefix = TEAM_PREFIX.get(team)
    if prefix is None:
        raise ValueError(f"Unknown team: {team}. Allowed: {list(TEAM_PREFIX.keys())}")
    return f"{prefix}/{project}/{experiment}"

# Usage example
experiment = get_experiment_name("search", "query-ranking", "bert-v4-sweep")
mlflow.set_experiment(experiment)  # "/search-team/query-ranking/bert-v4-sweep"
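Because the prefix map is the single point where the convention is enforced, it is worth pinning down with a couple of plain assertions so it cannot silently drift — a minimal, self-contained sketch repeating the helper above:

```python
# Self-contained copy of the naming helper, plus guard assertions
TEAM_PREFIX = {
    "search": "/search-team",
    "fraud": "/fraud-team",
    "recommendation": "/rec-team",
}

def get_experiment_name(team: str, project: str, experiment: str) -> str:
    """Generate experiment name with team prefix."""
    prefix = TEAM_PREFIX.get(team)
    if prefix is None:
        raise ValueError(f"Unknown team: {team}. Allowed: {list(TEAM_PREFIX.keys())}")
    return f"{prefix}/{project}/{experiment}"

# Known teams are prefixed deterministically
assert get_experiment_name("search", "query-ranking", "bert-v4-sweep") == \
    "/search-team/query-ranking/bert-v4-sweep"

# Unknown teams are rejected loudly instead of silently creating a new namespace
try:
    get_experiment_name("platform", "infra", "load-test")
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for unknown team")
```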

For larger organizations, MLflow 3.x's Multi-Workspace feature enables experiment/model/prompt isolation at the workspace level on a single tracking server.

Artifact Management and Cost Optimization

Artifact Cleanup Automation

As experiments accumulate, artifact storage costs increase rapidly. This is especially problematic when hyperparameter searches generate hundreds to thousands of model checkpoints.

from mlflow import MlflowClient
from datetime import datetime, timedelta

def cleanup_old_runs(experiment_name: str, days_old: int = 90, dry_run: bool = True):
    """Clean up artifacts from failed/cancelled runs past the specified period."""
    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)

    if experiment is None:
        print(f"Experiment '{experiment_name}' not found")
        return

    cutoff_ts = int((datetime.now() - timedelta(days=days_old)).timestamp() * 1000)

    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string=f"attributes.end_time < {cutoff_ts} AND attributes.status != 'RUNNING'",
        order_by=["attributes.end_time ASC"],
        max_results=500,
    )

    deleted_count = 0
    for run in runs:
        # Check preservation status via tags
        if run.data.tags.get("keep", "false").lower() == "true":
            continue

        # Skip runs with registered models
        if run.data.tags.get("mlflow.registeredModelName"):
            continue

        if dry_run:
            print(f"[DRY RUN] Would delete run {run.info.run_id} "
                  f"(ended: {datetime.fromtimestamp(run.info.end_time / 1000)})")
        else:
            client.delete_run(run.info.run_id)
        deleted_count += 1  # count in both modes so the summary line is accurate

    print(f"{'Would delete' if dry_run else 'Deleted'} {deleted_count} runs "
          f"out of {len(runs)} found")

# Usage: verify with dry_run first, then actually delete
cleanup_old_runs("/search-team/query-ranking/bert-finetune", days_old=60, dry_run=True)
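A cleanup like this is usually scheduled rather than run by hand. A hypothetical crontab entry — the script path and log path are assumptions, and dry_run should be disabled in the script only after the dry-run output has been reviewed:

```
# Nightly at 03:00, logging output for audit
0 3 * * * /usr/bin/python3 /opt/mlflow-ops/cleanup_runs.py >> /var/log/mlflow-cleanup.log 2>&1
```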

S3 Lifecycle Policy

In addition to MLflow artifact cleanup, you can further reduce costs by setting lifecycle policies at the S3 bucket level.

{
  "Rules": [
    {
      "ID": "MoveOldArtifactsToIA",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "prod/"
      },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "ID": "DeleteTempArtifacts",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "tmp/"
      },
      "Expiration": {
        "Days": 7
      }
    }
  ]
}

Troubleshooting: Common Operational Failures

1. DB Connection Pool Exhaustion

Symptoms: OperationalError: too many connections occurs when many experiments run simultaneously.

Cause: MLflow server's default SQLAlchemy connection pool size (5) is insufficient.

Solution: Do not append pool parameters to the URI query string — SQLAlchemy passes unknown query parameters through to the DB driver, which rejects them. Use MLflow's SQLAlchemy environment variables instead.

# Adjust the connection pool via environment variables before starting the server
export MLFLOW_SQLALCHEMYSTORE_POOL_SIZE=20
export MLFLOW_SQLALCHEMYSTORE_MAX_OVERFLOW=40

mlflow server \
  --backend-store-uri "postgresql://mlflow:pass@db:5432/mlflow" \
  --default-artifact-root s3://artifacts/ \
  --workers 8

2. Artifact Upload Timeout

Symptoms: ConnectionError or timeout when logging large models (several GB).

Solution:

# Extend upload timeout
export MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT=600

# Pass extra args to S3 uploads (e.g., enforce SSE-KMS encryption)
export MLFLOW_S3_UPLOAD_EXTRA_ARGS='{"ServerSideEncryption": "aws:kms"}'

3. Run Status Permanently Stuck as RUNNING

Symptoms: The training process died but the run continues showing "Running" status in the MLflow UI.

Solution:

from datetime import datetime

from mlflow import MlflowClient

client = MlflowClient()

# Force terminate stuck runs
stuck_runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="attributes.status = 'RUNNING'",
)

for run in stuck_runs:
    end_time = run.info.end_time
    # If no end_time and start time is more than 24 hours ago
    if end_time is None or end_time == 0:
        start_time = run.info.start_time
        if (datetime.now().timestamp() * 1000 - start_time) > 86400000:  # 24h
            client.set_terminated(run.info.run_id, status="FAILED")
            print(f"Force-terminated stuck run: {run.info.run_id}")
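The 24-hour staleness check is easiest to get right (and to unit-test) as a small pure function. A sketch — the helper name is_stale_run is mine, not an MLflow API; the arguments mirror run.info.start_time / run.info.end_time, which are milliseconds since epoch, with end_time None or 0 while a run is open:

```python
from datetime import datetime, timezone
from typing import Optional

def is_stale_run(start_time_ms: int, end_time_ms: Optional[int],
                 max_age_hours: int = 24,
                 now_ms: Optional[int] = None) -> bool:
    """Return True if a run never ended and started more than max_age_hours ago."""
    if end_time_ms:  # a real end timestamp means the run terminated normally
        return False
    if now_ms is None:
        now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
    return (now_ms - start_time_ms) > max_age_hours * 3_600_000

# A run opened 25 hours ago with no end_time is stale; a 1-hour-old one is not
now = 1_700_000_000_000
assert is_stale_run(now - 25 * 3_600_000, None, now_ms=now)
assert not is_stale_run(now - 1 * 3_600_000, None, now_ms=now)
```

Injecting now_ms keeps the function deterministic in tests while defaulting to the wall clock in production use.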

4. Model Registry Alias Conflict

Symptoms: Two CI pipelines try to set the same alias simultaneously.

Solution: Check the current alias state before setting an alias, and use a distributed lock. Redis-based locking is the simplest approach.

import redis

from mlflow import MlflowClient

def safe_promote_model(model_name: str, version: str, alias: str, redis_url: str):
    """Safe model promotion using distributed lock."""
    r = redis.from_url(redis_url)
    lock_key = f"mlflow:promote:{model_name}:{alias}"

    # Acquire distributed lock with 30-second TTL
    lock = r.lock(lock_key, timeout=30)
    if lock.acquire(blocking=True, blocking_timeout=10):
        try:
            client = MlflowClient()
            client.set_registered_model_alias(model_name, alias, version)
            print(f"Successfully promoted {model_name} v{version} to @{alias}")
        finally:
            lock.release()
    else:
        raise RuntimeError(f"Failed to acquire lock for {model_name}@{alias}")

5. PostgreSQL Disk Full

Symptoms: Metric logging fails with a DiskFull error.

Solution: MLflow stores each metric point as an individual row, so heavy step-level logging makes the DB grow rapidly. Regularly delete old runs and run VACUUM (note that VACUUM FULL takes an exclusive lock on the table, so schedule it during a maintenance window). Also reduce metric logging frequency where possible (e.g., log every 100 steps instead of every step).
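The "log every 100 steps" advice can be wrapped in a tiny gate so training loops stay clean. A sketch with an injected log function — in real use you would pass something like lambda k, v, s: mlflow.log_metric(k, v, step=s); here a list stands in as the sink so the example runs without a tracking server:

```python
from typing import Callable

def make_throttled_logger(log_fn: Callable[[str, float, int], None],
                          every: int = 100) -> Callable[[str, float, int], None]:
    """Wrap a log function so only every `every`-th step is actually logged."""
    def log(key: str, value: float, step: int) -> None:
        if step % every == 0:
            log_fn(key, value, step)
    return log

# Demo with a list as the sink instead of a tracking server
sink = []
log = make_throttled_logger(lambda k, v, s: sink.append((k, v, s)), every=100)
for step in range(301):
    log("train_loss", 1.0 / (step + 1), step)

# Only steps 0, 100, 200, 300 reach the sink
assert [s for _, _, s in sink] == [0, 100, 200, 300]
```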

Operations Checklist

When operating production MLflow, the following items should be checked periodically.

Initial Setup Checklist

  • PostgreSQL/MySQL backend store configuration complete
  • S3/GCS artifact store configuration and IAM permissions set
  • Tracking server high availability (HA) configured (load balancer + multiple workers)
  • Authentication enabled (--app-name basic-auth)
  • TLS termination configured (Nginx/ALB frontend)
  • Experiment naming conventions documented and shared with the team
  • Model registry naming conventions agreed upon

Weekly Operations Checklist

  • Artifact store capacity monitoring (threshold alerts set)
  • DB disk usage checked
  • Stuck (RUNNING status) runs cleaned up
  • Failed run artifact cleanup script executed
  • Tracking server response time verified (maintain P95 under 500ms)

Monthly Operations Checklist

  • S3/GCS cost analysis and lifecycle policy review
  • DB performance analysis (slow query check, index optimization)
  • Unused models in model registry cleaned up
  • MLflow version upgrade review
  • Backup/recovery procedure tested

Rollback and Disaster Recovery Procedures

Model Rollback

Procedure for immediately rolling back to the previous version when a production model has issues.

from datetime import datetime

from mlflow import MlflowClient

def rollback_model(model_name: str):
    """Roll back the champion model to previous-champion."""
    client = MlflowClient()

    try:
        previous = client.get_model_version_by_alias(model_name, "previous-champion")
    except Exception:
        print("ERROR: No previous-champion alias found. Manual intervention required.")
        return False

    current = client.get_model_version_by_alias(model_name, "champion")

    # Execute rollback
    client.set_registered_model_alias(model_name, "champion", previous.version)
    client.set_registered_model_alias(model_name, "rolled-back", current.version)

    # Tag rollback reason
    client.set_model_version_tag(
        model_name, current.version, "rollback_reason", "performance_degradation"
    )
    client.set_model_version_tag(
        model_name, current.version, "rolled_back_at", datetime.now().isoformat()
    )

    print(f"Rolled back {model_name}: v{current.version} -> v{previous.version}")
    return True
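This rollback only works if previous-champion was recorded at promotion time, so the promotion path must maintain that invariant. A sketch that takes the client as a parameter so the alias bookkeeping can be unit-tested with a stub — the helper name promote_with_history and the flow are mine, not an MLflow API:

```python
def promote_with_history(client, model_name: str, new_version: str) -> None:
    """Promote new_version to @champion, preserving the old champion as @previous-champion."""
    try:
        current = client.get_model_version_by_alias(model_name, "champion")
        # Record the outgoing champion so rollback_model() has something to return to
        client.set_registered_model_alias(model_name, "previous-champion", current.version)
    except Exception:
        pass  # first promotion: no champion exists yet
    client.set_registered_model_alias(model_name, "champion", new_version)

# Minimal stub standing in for MlflowClient, just enough to exercise the flow
import types

class _StubClient:
    def __init__(self):
        self.aliases = {}
    def get_model_version_by_alias(self, name, alias):
        if alias not in self.aliases:
            raise KeyError(alias)
        return types.SimpleNamespace(version=self.aliases[alias])
    def set_registered_model_alias(self, name, alias, version):
        self.aliases[alias] = version

c = _StubClient()
promote_with_history(c, "ranking-model", "7")
promote_with_history(c, "ranking-model", "8")
assert c.aliases == {"champion": "8", "previous-champion": "7"}
```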

DB Recovery

When the PostgreSQL backend fails, recover in the following order.

  1. Restore from the latest DB snapshot
  2. Restart the MLflow server and verify artifact store consistency
  3. Clean up orphan artifact references with the mlflow gc command
  4. Verify that the registry's champion alias points to the correct model version

# Artifact garbage collection
mlflow gc \
  --backend-store-uri postgresql://mlflow:pass@db:5432/mlflow \
  --older-than 30d

Migration Points from MLflow 2.x to 3.x

MLflow 3.0 (released mid-2025) focused on GenAI and AI agent support. Key points for existing 2.x users:

  • Model Registry Extension: In 3.x, code versions, prompt configurations, evaluation runs, and deployment metadata are linked to models. Backward compatible with existing 2.x registries.
  • Tracing Feature Added: The mlflow-tracing SDK allows adding instrumentation to code/models/agents with minimal dependencies in production environments.
  • search_logged_models() API: Enables SQL-like syntax for searching across experiments based on performance metrics, parameters, and model attributes.
  • LLM Cost Tracking: Added functionality to automatically extract model information from LLM spans and calculate costs.
  • UI Improvements: A sidebar for GenAI app and agent developers has been added, while continuing to support existing model training workflows.

When upgrading from 2.x to 3.x, take a DB backup first, then run the schema migration with mlflow db upgrade <database-uri>.

Summary

Experiment tracking and model registry in MLflow 2.x are easy to install, but operating at production level requires systematically establishing architecture design, naming conventions, artifact management, CI/CD integration, multi-tenancy, monitoring, and rollback procedures. Artifact storage cost management and DB performance optimization in particular become significant technical debt if not incorporated into the design from the beginning.

Key principles summarized:

  1. Name experiments hierarchically and leave rich metadata through tags.
  2. Name models product-centric -- leave versions and algorithms to the registry and tags.
  3. Implement zero-downtime model replacement with alias-based deployment.
  4. Automate training-validation-promotion in your CI/CD pipeline.
  5. Automate artifact cleanup -- otherwise the S3 bill will become frightening every month.
