ML Model Monitoring and Drift Detection: Evidently AI + MLflow Production Operations Guide


1. Introduction: Production Models Silently Degrade

An ML model's accuracy peaks at the moment of deployment. After that, prediction quality gradually declines as the real world changes. The problem is that this degradation progresses without explicit errors. No HTTP 500s are thrown, no CRITICAL logs appear, and the service responds normally. It's just that recommendations become increasingly irrelevant, fraud detection misses new patterns, and demand forecasts start diverging from reality.

According to Google's research, over 60% of failures in production ML systems originate from data-related issues, not model code. The model itself doesn't break -- rather, the gap between the world the model learned and the real world keeps growing.

This article covers how to combine the open-source monitoring tool Evidently AI with the experiment/model management platform MLflow to continuously monitor the health of ML models in production, detect drift, and trigger automatic retraining pipelines.

2. Types of Drift: What Changes

Drift refers to the discrepancy between the data distribution the model was trained on and the data distribution at serving time. It is broadly classified by where and how the change occurs; the three types most often discussed are described below, and the summary table adds a fourth, label drift (a shift in the target distribution P(Y)).

Data Drift (Covariate Shift)

This is the phenomenon where the distribution of input features changes. The model's input space P(X) shifts over time. For example, in an e-commerce recommendation model, the age distribution of users changes, or the proportion of purchase categories shifts with seasons. The relationship P(Y|X) between the target variable Y and features X remains the same, but the statistical characteristics of the inputs themselves change.

Concept Drift

This is the phenomenon where the relationship between features and the target itself changes. P(Y|X) changes. This is a more serious problem than data drift because the correct answer itself changes for the same input. Representative examples include demand forecasting models becoming completely invalidated during the COVID-19 pandemic, and financial fraud detection where fraudsters' methods evolve, making existing patterns no longer valid.

Prediction Drift

This is the phenomenon where the distribution of model output P(Y_pred) changes. It can appear as a result of input drift or occur independently due to internal model issues. It includes cases where the prediction ratio for a specific class in a classification model suddenly skews, or the mean or variance of predicted values in a regression model changes significantly.
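As a concrete (if simplified) illustration of detecting prediction drift in a classifier, a chi-squared test can compare predicted class counts from two serving windows. The class counts below are invented for the example, and this is a standalone SciPy sketch, not Evidently's internal implementation:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical predicted-class counts (classes A, B, C) from two serving windows
reference_counts = np.array([700, 200, 100])   # last month's prediction mix
current_counts = np.array([500, 150, 350])     # class C suddenly over-predicted

# Scale reference counts to the current total so expected frequencies sum correctly
expected = reference_counts / reference_counts.sum() * current_counts.sum()
stat, p_value = chisquare(f_obs=current_counts, f_exp=expected)

print(f"chi2={stat:.1f}, p={p_value:.2e}")
if p_value < 0.05:
    print("Prediction drift: output class distribution has shifted")
```

Scaling the reference counts to the current window's total keeps the expected frequencies valid when the two windows differ in size.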

| Drift Type | What Changes | Detection Difficulty | Representative Detection Methods | Retraining Urgency |
|---|---|---|---|---|
| Data Drift | P(X) input distribution | Medium | PSI, KS test, Wasserstein | Medium |
| Concept Drift | P(Y\|X) relationship | High | Performance metric monitoring, ADWIN | High |
| Prediction Drift | P(Y_pred) output | Low | Output distribution statistics, Chi-squared | Situational |
| Label Drift | P(Y) target distribution | Medium | Label distribution comparison | High |

3. Evidently AI Architecture and Core Features

Evidently AI is an open-source Python library for ML model monitoring and data quality validation. It ships with over 20 built-in statistical drift detection methods.

Core Components

  • Report: One-time data analysis report. Can be output as HTML, JSON, or Python dictionary format. Suitable for exploratory analysis and debugging.
  • Test Suite: Automated validation against predefined conditions. Integrated into CI/CD pipelines as data quality gates.
  • Metric: Individual measurement items. Dozens of metrics such as DataDriftTable, DatasetSummaryMetric, and ColumnCorrelationsMetric are provided out of the box.
  • Collector/Workspace: Evidently server mode. Stores monitoring results as time series and queries them on dashboards.

Key Drift Detection Algorithms

Evidently automatically selects the optimal detection algorithm based on feature type (numerical/categorical) and dataset size.

| Algorithm | Target Type | Principle | Pros | Limitations |
|---|---|---|---|---|
| Kolmogorov-Smirnov (KS) | Numerical, small | Maximum difference in cumulative distributions | No distribution assumptions | Oversensitive with large data |
| Population Stability Index (PSI) | Numerical/Categorical | Weighted sum of log ratios of two distributions | Industry standard, easy to interpret | Sensitive to bin settings |
| Wasserstein Distance | Numerical | Minimum transport cost between two distributions | Reflects distribution shape differences | High computational cost |
| Jensen-Shannon Divergence | Numerical/Categorical | Symmetric version of KL divergence | Always finite, symmetric | Insensitive to tail changes |
| Chi-squared Test | Categorical | Difference between observed/expected frequencies | Intuitive for categorical data | Unstable with low-frequency categories |
| Z-test (Proportion test) | Categorical, large | Standardization of proportion differences | Efficient for large data | Assumes normal approximation |
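To ground a couple of the rows above: the KS test and Wasserstein distance are both available directly in SciPy, which Evidently builds on. The synthetic 0.5-sigma mean shift below is purely illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=2_000)
current = rng.normal(loc=0.5, scale=1.0, size=2_000)   # mean shifted by 0.5 sigma

ks_stat, ks_p = ks_2samp(reference, current)
w_dist = wasserstein_distance(reference, current)

# The KS test rejects distribution equality, and the Wasserstein distance
# lands close to the injected mean shift of 0.5
print(f"KS statistic={ks_stat:.3f} (p={ks_p:.2e}), Wasserstein={w_dist:.3f}")
```

Note the KS limitation from the table: at much larger sample sizes, even trivial shifts produce vanishing p-values, which is why effect-size metrics like PSI or Wasserstein are useful complements.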

4. Evidently AI Practical Usage

Installation and Basic Setup

# Evidently AI installation (including MLflow integration)
# pip install evidently mlflow scikit-learn pandas

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric,
)

# Prepare reference / current data
data = load_iris(as_frame=True)
df = data.frame
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]

reference_data = df.sample(frac=0.5, random_state=42)
current_data = df.drop(reference_data.index)

# Create simulated data with data drift (seeded for reproducibility)
rng = np.random.default_rng(42)
current_drifted = current_data.copy()
current_drifted["sepal_length"] = current_drifted["sepal_length"] + rng.normal(2.0, 0.5, len(current_drifted))
current_drifted["petal_width"] = current_drifted["petal_width"] * 1.8

# Generate drift report
drift_report = Report(metrics=[
    DatasetDriftMetric(),
    DataDriftTable(),
])

drift_report.run(
    reference_data=reference_data,
    current_data=current_drifted,
)

# Extract results as dictionary (for programmatic use)
result = drift_report.as_dict()
dataset_drift = result["metrics"][0]["result"]["dataset_drift"]
drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]

print(f"Dataset drift detected: {dataset_drift}")
print(f"Drifted column ratio: {drift_share:.2%}")

# Save as HTML report
drift_report.save_html("drift_report.html")

Automated Data Quality Validation with Test Suite

from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset, DataQualityTestPreset
from evidently.tests import (
    TestColumnDrift,
    TestShareOfDriftedColumns,
    TestNumberOfMissingValues,
    TestShareOfOutRangeValues,
    TestMeanInNSigmas,
)

# Configure data drift + quality test suite
monitoring_suite = TestSuite(tests=[
    # Drift test: fail if 30% or more of columns drift
    TestShareOfDriftedColumns(lt=0.3),

    # Individual key feature drift validation
    TestColumnDrift(column_name="sepal_length"),
    TestColumnDrift(column_name="petal_width"),

    # Data quality tests
    TestNumberOfMissingValues(eq=0),

    # Mean stability: the current mean of sepal_length must fall within 3 sigmas of the reference mean
    TestMeanInNSigmas(column_name="sepal_length", n=3),
])

monitoring_suite.run(
    reference_data=reference_data,
    current_data=current_drifted,
)

# Programmatically check test results
suite_result = monitoring_suite.as_dict()
all_passed = all(
    test["status"] == "SUCCESS"
    for test in suite_result["tests"]
)

print(f"All tests passed: {all_passed}")
for test in suite_result["tests"]:
    status_icon = "PASS" if test["status"] == "SUCCESS" else "FAIL"
    print(f"  [{status_icon}] {test['name']}: {test['status']}")

# Use as exit code in CI/CD pipelines
if not all_passed:
    print("ALERT: Data drift or quality anomaly detected. Retraining pipeline trigger required.")
    # sys.exit(1)  # Fail the build in CI

5. MLflow Model Registry and Monitoring Integration

MLflow provides experiment tracking, model packaging, and model registry functionality. By recording Evidently's drift detection results in MLflow, you can track performance history and drift status per model version on a single platform.

Logging Drift Metrics to MLflow

import mlflow
from evidently.report import Report
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric,
)
import json
from datetime import datetime

# MLflow tracking server configuration
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("model-monitoring/fraud-detection-v2")

def log_drift_to_mlflow(
    reference_data,
    current_data,
    model_name: str,
    model_version: str,
    batch_id: str,
):
    """Log drift analysis results to MLflow"""

    # Generate Evidently drift report
    drift_report = Report(metrics=[
        DatasetDriftMetric(),
        DataDriftTable(),
    ])
    drift_report.run(
        reference_data=reference_data,
        current_data=current_data,
    )

    result = drift_report.as_dict()
    drift_result = result["metrics"][0]["result"]

    # Record as MLflow Run
    with mlflow.start_run(run_name=f"drift-check-{batch_id}") as run:
        # Basic drift metrics
        mlflow.log_metric("dataset_drift_detected", int(drift_result["dataset_drift"]))
        mlflow.log_metric("drifted_columns_share", drift_result["share_of_drifted_columns"])
        mlflow.log_metric("number_of_drifted_columns", drift_result["number_of_drifted_columns"])
        mlflow.log_metric("total_columns", drift_result["number_of_columns"])

        # Log individual column drift scores
        column_drift = result["metrics"][1]["result"]["drift_by_columns"]
        for col_name, col_info in column_drift.items():
            safe_col_name = col_name.replace(" ", "_").replace("/", "_")
            mlflow.log_metric(
                f"drift_score_{safe_col_name}",
                col_info.get("drift_score", 0.0),
            )
            mlflow.log_metric(
                f"drift_detected_{safe_col_name}",
                int(col_info.get("drift_detected", False)),
            )

        # Record metadata as tags
        mlflow.set_tags({
            "monitoring.type": "drift_detection",
            "monitoring.model_name": model_name,
            "monitoring.model_version": model_version,
            "monitoring.batch_id": batch_id,
            "monitoring.timestamp": datetime.utcnow().isoformat(),
            "monitoring.reference_size": str(len(reference_data)),
            "monitoring.current_size": str(len(current_data)),
        })

        # Save HTML report as artifact
        report_path = f"/tmp/drift_report_{batch_id}.html"
        drift_report.save_html(report_path)
        mlflow.log_artifact(report_path, artifact_path="drift_reports")

        # Save JSON results as artifact
        json_path = f"/tmp/drift_result_{batch_id}.json"
        with open(json_path, "w") as f:
            json.dump(result, f, indent=2, default=str)
        mlflow.log_artifact(json_path, artifact_path="drift_reports")

        print(f"Drift results logged to MLflow. Run ID: {run.info.run_id}")
        return drift_result["dataset_drift"], drift_result["share_of_drifted_columns"]


# Usage example
is_drifted, drift_share = log_drift_to_mlflow(
    reference_data=reference_data,
    current_data=current_drifted,
    model_name="fraud-detector",
    model_version="3",
    batch_id="2026-03-06-batch-001",
)

Alias-Based Model Registry Management

Starting with MLflow 2.x, alias-based model management is recommended over the traditional Stage system (Staging/Production/Archived). You can apply a strategy that automatically switches model aliases based on drift detection results.
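The alias mechanics reduce to a single registry call. The `promote` helper below is a hypothetical convenience wrapper (not part of MLflow) around `MlflowClient.set_registered_model_alias`:

```python
MODEL_NAME = "fraud-detector"

def promote(client, version: str, alias: str = "production") -> str:
    """Point `alias` at `version` on the registered model and return a
    models:/ URI that serving code can resolve through the alias."""
    # MlflowClient.set_registered_model_alias(name, alias, version)
    client.set_registered_model_alias(MODEL_NAME, alias, version)
    return f"models:/{MODEL_NAME}@{alias}"

# Usage (requires a reachable tracking server):
# client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
# uri = promote(client, "3")
# model = mlflow.pyfunc.load_model(uri)
```

Because serving code loads `models:/fraud-detector@production`, promotion or rollback becomes a metadata change rather than a redeploy, which is what makes the automatic fallback switch below cheap.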

from mlflow import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")

MODEL_NAME = "fraud-detector"

def handle_drift_detection(
    is_drifted: bool,
    drift_share: float,
    model_name: str = MODEL_NAME,
    drift_threshold_warn: float = 0.2,
    drift_threshold_critical: float = 0.5,
):
    """Perform model registry actions based on drift detection results"""

    # Check current production model version
    try:
        prod_version = client.get_model_version_by_alias(model_name, "production")
        current_version = prod_version.version
        print(f"Current production model version: {current_version}")
    except Exception as e:
        print(f"Failed to retrieve production model alias: {e}")
        return

    if not is_drifted:
        print("No drift detected. Maintaining current model.")
        client.set_model_version_tag(
            model_name, current_version,
            key="last_drift_check",
            value="passed",
        )
        return

    if drift_share >= drift_threshold_critical:
        # Critical drift: immediately switch to fallback model + trigger retraining
        print(f"CRITICAL: Drift ratio {drift_share:.1%} - Switching to fallback model and triggering retraining")
        client.set_model_version_tag(
            model_name, current_version,
            key="drift_status", value="critical",
        )
        # Switch to fallback model if available
        try:
            fallback = client.get_model_version_by_alias(model_name, "fallback")
            client.set_registered_model_alias(model_name, "production", fallback.version)
            print(f"Switched to fallback model version {fallback.version}")
        except Exception:
            print("WARNING: No fallback model available. Maintaining current model while emergency retraining is needed.")

        # Trigger retraining (external system call)
        trigger_retraining(model_name, reason="critical_drift")

    elif drift_share >= drift_threshold_warn:
        # Warning level drift: record tag + notification
        print(f"WARNING: Drift ratio {drift_share:.1%} - Enhanced monitoring and scheduling retraining")
        client.set_model_version_tag(
            model_name, current_version,
            key="drift_status", value="warning",
        )
        # Add to scheduled retraining queue
        schedule_retraining(model_name, priority="normal")


def trigger_retraining(model_name: str, reason: str):
    """Trigger emergency retraining (call Airflow DAG, Kubeflow Pipeline, etc.)"""
    print(f"Retraining triggered: model={model_name}, reason={reason}")
    # requests.post("http://airflow.internal/api/v1/dags/retrain/dagRuns", ...)


def schedule_retraining(model_name: str, priority: str):
    """Register in scheduled retraining queue"""
    print(f"Retraining scheduled: model={model_name}, priority={priority}")


# Execute
handle_drift_detection(
    is_drifted=True,
    drift_share=0.55,
    model_name=MODEL_NAME,
)

6. Building an Automatic Retraining Pipeline

The automated pipeline from drift detection to retraining consists of the following stages.

Overall Pipeline Flow

  1. Scheduler: Trigger drift check after batch inference or at regular intervals (daily/weekly)
  2. Drift Analyzer: Analyze current data against reference data with Evidently
  3. Decision Engine: Determine whether retraining is needed based on drift thresholds
  4. Retraining Orchestrator: Execute training jobs in Airflow/Kubeflow
  5. Champion/Challenger Evaluation: Compare and evaluate the new model against the existing model
  6. Deployment Gate: Auto-deploy if performance criteria are met, rollback on failure
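Step 5 can be sketched as a plain comparison function. The metric names and the 2% guard band below are illustrative defaults, not a fixed rule:

```python
def challenger_wins(champion_metrics: dict, challenger_metrics: dict,
                    primary: str = "roc_auc", min_uplift: float = 0.0,
                    guard: str = "precision", max_regression: float = 0.02) -> bool:
    """Promote the challenger only if it beats the champion on the primary
    metric and does not regress the guard metric by more than max_regression."""
    uplift = challenger_metrics[primary] - champion_metrics[primary]
    regression = champion_metrics[guard] - challenger_metrics[guard]
    return uplift > min_uplift and regression <= max_regression

champion = {"roc_auc": 0.91, "precision": 0.84}
challenger = {"roc_auc": 0.93, "precision": 0.83}
print(challenger_wins(champion, challenger))  # uplift 0.02, regression 0.01 -> True
```

In practice the two metric dicts would come from evaluating both model versions on the same held-out evaluation set, with the result deciding whether the deployment gate flips the production alias.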

Airflow DAG Integration Pattern

# Airflow DAG example: drift check + conditional retraining
# dag_drift_monitor.py

from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.empty import EmptyOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
import pandas as pd

default_args = {
    "owner": "ml-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=30),
}

dag = DAG(
    dag_id="ml_drift_monitor_fraud_detection",
    default_args=default_args,
    description="Daily drift monitoring and conditional retraining",
    schedule_interval="0 6 * * *",  # Daily at 6 AM
    start_date=days_ago(1),
    catchup=False,
    tags=["ml-monitoring", "drift-detection"],
)


def fetch_data(**context):
    """Load reference data and last 24 hours of serving data"""
    from sqlalchemy import create_engine
    engine = create_engine("postgresql://reader:password@db.internal/features")

    reference = pd.read_sql(
        "SELECT * FROM fraud_features_reference", engine
    )
    current = pd.read_sql(
        """SELECT * FROM fraud_features_serving
           WHERE created_at >= NOW() - INTERVAL '24 hours'""",
        engine,
    )

    # Pass paths via XCom (store large data in S3)
    ref_path = "/tmp/reference_data.parquet"
    cur_path = "/tmp/current_data.parquet"
    reference.to_parquet(ref_path)
    current.to_parquet(cur_path)

    context["ti"].xcom_push(key="reference_path", value=ref_path)
    context["ti"].xcom_push(key="current_path", value=cur_path)
    context["ti"].xcom_push(key="current_size", value=len(current))


def run_drift_check(**context):
    """Run Evidently drift analysis and log to MLflow"""
    from evidently.report import Report
    from evidently.metrics import DatasetDriftMetric, DataDriftTable
    import mlflow

    ti = context["ti"]
    ref_path = ti.xcom_pull(key="reference_path")
    cur_path = ti.xcom_pull(key="current_path")

    reference = pd.read_parquet(ref_path)
    current = pd.read_parquet(cur_path)

    # Validate minimum sample count (branching is handled later by decide_action)
    if len(current) < 100:
        print(f"Insufficient current data samples: {len(current)}. Skipping drift check.")
        ti.xcom_push(key="drift_action", value="skip")
        return

    report = Report(metrics=[DatasetDriftMetric(), DataDriftTable()])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()

    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]

    # Log to MLflow
    mlflow.set_tracking_uri("http://mlflow.internal:5000")
    mlflow.set_experiment("monitoring/fraud-detection")
    with mlflow.start_run(run_name=f"drift-{context['ds']}"):
        mlflow.log_metric("drift_detected", int(drift_detected))
        mlflow.log_metric("drift_share", drift_share)

    ti.xcom_push(key="drift_detected", value=drift_detected)
    ti.xcom_push(key="drift_share", value=drift_share)


def decide_action(**context):
    """Decide whether to retrain based on drift level"""
    ti = context["ti"]
    drift_detected = ti.xcom_pull(key="drift_detected")
    drift_share = ti.xcom_pull(key="drift_share")

    if drift_share is None or drift_share < 0.2:
        return "skip_retraining"
    elif drift_share >= 0.5:
        return "trigger_emergency_retrain"
    else:
        return "trigger_scheduled_retrain"


fetch_task = PythonOperator(
    task_id="fetch_data", python_callable=fetch_data, dag=dag,
)
drift_task = PythonOperator(
    task_id="run_drift_check", python_callable=run_drift_check, dag=dag,
)
branch_task = BranchPythonOperator(
    task_id="decide_action", python_callable=decide_action, dag=dag,
)
skip_task = EmptyOperator(task_id="skip_retraining", dag=dag)
scheduled_retrain = EmptyOperator(task_id="trigger_scheduled_retrain", dag=dag)
emergency_retrain = EmptyOperator(task_id="trigger_emergency_retrain", dag=dag)

fetch_task >> drift_task >> branch_task >> [skip_task, scheduled_retrain, emergency_retrain]

Retraining Trigger Threshold Guidelines

| Drift Level | drift_share Range | Recommended Action | Response Time |
|---|---|---|---|
| Normal | 0% ~ 15% | Maintain monitoring | - |
| Caution | 15% ~ 30% | Send alert, begin root cause analysis | Within 48 hrs |
| Warning | 30% ~ 50% | Register in scheduled retraining queue | Within 24 hrs |
| Critical | 50% or above | Immediate retraining + fallback model switch | Immediately |

Note: Thresholds should be adjusted based on domain and model characteristics. Domains with high missed detection costs like financial fraud detection should use lower thresholds (10-20%), while domains with wider tolerance like recommendation systems should apply higher thresholds (30-50%).
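One way to keep these bands adjustable per model is a plain config mapping. The model names and override values below are illustrative, restating the guideline table and the stricter fraud-detection note:

```python
DEFAULT_THRESHOLDS = {"warn": 0.30, "critical": 0.50}

# Hypothetical per-model overrides; stricter where missed detections are costly
THRESHOLDS_BY_MODEL = {
    "fraud-detector": {"warn": 0.15, "critical": 0.30},
    "product-recommender": {"warn": 0.35, "critical": 0.50},
}

def classify_drift(model_name: str, drift_share: float) -> str:
    """Map a drifted-column share onto an action level for this model."""
    t = THRESHOLDS_BY_MODEL.get(model_name, DEFAULT_THRESHOLDS)
    if drift_share >= t["critical"]:
        return "critical"
    if drift_share >= t["warn"]:
        return "warning"
    return "normal"

print(classify_drift("fraud-detector", 0.2))       # -> warning
print(classify_drift("product-recommender", 0.2))  # -> normal
```

The same 20% drift share then triggers different actions per domain, which is the point of the note above.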

7. Monitoring Tool Comparison: Evidently vs NannyML vs WhyLabs vs Alibi Detect

There are several options for production ML monitoring tools. Let's compare the strengths and weaknesses of each.

| Criteria | Evidently AI | NannyML | WhyLabs | Alibi Detect |
|---|---|---|---|---|
| License | Apache 2.0 (OSS) | BSD-3 (OSS) | SaaS + Free tier | BSD-3 (OSS) |
| Core Strength | General data/model monitoring | Label-free performance estimation (CBPE) | Real-time streaming profiling | Advanced drift detection algorithms |
| Number of Drift Methods | 20+ | 10+ | 15+ | 15+ |
| Label-free Performance Estimation | Limited | Core feature (CBPE, DLE) | Not supported | Not supported |
| Real-time Monitoring | Collector mode | Not supported (batch) | Native support | Not supported (batch) |
| Visualization | Built-in HTML/dashboard | Built-in HTML | Web dashboard (SaaS) | Basic visualization |
| CI/CD Integration | Test Suite (native) | Limited | API-based | Manual configuration required |
| Prometheus Integration | Officially supported | Custom required | Built-in | Custom required |
| MLflow Integration | Easy (Python native) | Manual configuration | API integration | Manual configuration |
| Learning Curve | Low | Medium | Low (SaaS) | High |
| Production Use Cases | General purpose | Label-delayed environments | Large-scale real-time | Research/advanced detection |

Selection Guide:

  • Environments where labels cannot be obtained immediately (e.g., financial fraud detection where label confirmation takes months): NannyML's CBPE (Confidence-Based Performance Estimation) is effectively the only fit among these four tools.
  • Open-source first + rapid adoption: Evidently AI provides the widest feature range with the lowest adoption barrier.
  • Large-scale real-time streaming: WhyLabs' data profiling is optimized for processing tens of thousands of records per second.
  • Research environments needing advanced statistical detection: Alibi Detect's deep kernel MMD and Learned Kernel drift detection are well-suited.

8. Grafana/Prometheus Dashboard Configuration

Let's look at how to expose Evidently's monitoring results as Prometheus metrics and visualize them as time series on Grafana dashboards.

Prometheus Metrics Export

# prometheus_drift_exporter.py
from prometheus_client import start_http_server, Gauge, Counter, Histogram
from evidently.report import Report
from evidently.metrics import DatasetDriftMetric, DataDriftTable
import pandas as pd
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Prometheus metric definitions
DRIFT_DETECTED = Gauge(
    "ml_model_drift_detected",
    "Dataset drift detection status (0/1)",
    ["model_name", "model_version"],
)
DRIFT_SHARE = Gauge(
    "ml_model_drift_column_share",
    "Share of drifted columns",
    ["model_name", "model_version"],
)
COLUMN_DRIFT_SCORE = Gauge(
    "ml_model_column_drift_score",
    "Individual column drift score",
    ["model_name", "model_version", "column_name"],
)
DRIFT_CHECK_TOTAL = Counter(
    "ml_model_drift_checks_total",
    "Total drift check executions",
    ["model_name"],
)
DRIFT_CHECK_DURATION = Histogram(
    "ml_model_drift_check_duration_seconds",
    "Drift check execution duration",
    ["model_name"],
    buckets=[0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0],
)

MODEL_NAME = "fraud-detector"
MODEL_VERSION = "3"


def run_periodic_drift_check(
    reference_path: str,
    current_query_fn,
    interval_seconds: int = 300,
):
    """Periodic drift check and Prometheus metric update"""
    reference = pd.read_parquet(reference_path)

    while True:
        try:
            start_time = time.time()

            # Load recent data
            current = current_query_fn()
            if current is None or len(current) < 50:
                logger.warning(f"Insufficient current data: {len(current) if current is not None else 0} records")
                time.sleep(interval_seconds)
                continue

            # Filter to feature columns only (exclude target and metadata columns)
            feature_cols = [c for c in reference.columns if c not in ["target", "id", "timestamp"]]
            ref_features = reference[feature_cols]
            cur_features = current[feature_cols]

            # Drift analysis
            report = Report(metrics=[DatasetDriftMetric(), DataDriftTable()])
            report.run(reference_data=ref_features, current_data=cur_features)
            result = report.as_dict()

            drift_result = result["metrics"][0]["result"]
            column_results = result["metrics"][1]["result"]["drift_by_columns"]

            # Update Prometheus metrics
            DRIFT_DETECTED.labels(MODEL_NAME, MODEL_VERSION).set(
                int(drift_result["dataset_drift"])
            )
            DRIFT_SHARE.labels(MODEL_NAME, MODEL_VERSION).set(
                drift_result["share_of_drifted_columns"]
            )

            for col_name, col_info in column_results.items():
                COLUMN_DRIFT_SCORE.labels(MODEL_NAME, MODEL_VERSION, col_name).set(
                    col_info.get("drift_score", 0.0)
                )

            DRIFT_CHECK_TOTAL.labels(MODEL_NAME).inc()

            duration = time.time() - start_time
            DRIFT_CHECK_DURATION.labels(MODEL_NAME).observe(duration)

            logger.info(
                f"Drift check complete: drift={drift_result['dataset_drift']}, "
                f"share={drift_result['share_of_drifted_columns']:.2%}, "
                f"duration={duration:.1f}s"
            )

        except Exception as e:
            logger.error(f"Drift check failed: {e}", exc_info=True)

        time.sleep(interval_seconds)


if __name__ == "__main__":
    # Start Prometheus metrics HTTP server (port 8000)
    start_http_server(8000)
    logger.info("Prometheus metrics exporter started (port 8000)")

    # Start periodic drift check (5-minute intervals)
    run_periodic_drift_check(
        reference_path="/data/reference/fraud_features_v3.parquet",
        current_query_fn=lambda: pd.read_parquet("/data/serving/latest_batch.parquet"),
        interval_seconds=300,
    )

Grafana Dashboard Components

Configure the following panels in Grafana to comprehensively monitor ML model health status.

| Panel | Metric | Visualization Type | Alert Rule |
|---|---|---|---|
| Drift Status | ml_model_drift_detected | Stat (latest) | Critical alert when value is 1 |
| Drifted Column Ratio Trend | ml_model_drift_column_share | Time Series | Warning when exceeding 30% |
| Per-Column Drift Score | ml_model_column_drift_score | Heatmap | Highlight columns exceeding threshold |
| Check Duration | ml_model_drift_check_duration_seconds | Histogram | Warning when exceeding 60s |
| Check Execution Count | rate(ml_model_drift_checks_total[1h]) | Time Series | Alert when 0 (check stalled) |

Alertmanager Alert Rules Example

# prometheus-alerts.yaml
groups:
  - name: ml_model_drift_alerts
    rules:
      - alert: MLModelDriftDetected
        expr: ml_model_drift_detected == 1
        for: 5m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: 'ML model data drift detected'
          description: 'Data drift detected in model {{ $labels.model_name }} v{{ $labels.model_version }}. See ml_model_drift_column_share for the drifted column ratio.'

      - alert: MLModelCriticalDrift
        expr: ml_model_drift_column_share > 0.5
        for: 0m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: 'ML model critical drift - Immediate action required'
          description: 'Model {{ $labels.model_name }} drift column ratio is {{ $value | humanizePercentage }}. Immediate retraining or fallback switch is required.'

      - alert: MLDriftCheckStalled
        expr: rate(ml_model_drift_checks_total[1h]) == 0
        for: 30m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: 'ML drift check stalled'
          description: 'No drift checks have been recorded for model {{ $labels.model_name }} in the past hour. Inspect the monitoring pipeline.'

9. Operational Considerations

False Positive Drift Management

The most common pitfall of statistical drift detection is false positives. Drift may be falsely detected in the following situations even when there is no actual problem.

Sample Size Effect: When the current data sample size is very large, KS tests and Chi-squared tests detect statistically significant but practically meaningless differences as drift. Complement with effect-size-based metrics such as PSI or Wasserstein distance to verify practical significance.
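The effect is easy to reproduce. With a practically negligible 0.05-sigma mean shift and 200k samples, the KS p-value signals "significant" drift while PSI stays far below the common 0.1 rule-of-thumb threshold. The `psi` function here is a hand-rolled sketch, not Evidently's implementation:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over quantile bins of the reference."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip into the reference range so out-of-range values land in the edge bins
    ref_pct = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0] / len(reference)
    cur_pct = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) on empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 200_000)
current = rng.normal(0.05, 1.0, 200_000)   # practically negligible 0.05-sigma shift

_, p_value = ks_2samp(reference, current)
print(f"KS p-value: {p_value:.1e} (statistically 'significant' drift)")
print(f"PSI: {psi(reference, current):.4f} (well below the 0.1 rule of thumb)")
```

The KS test is answering "are these exactly the same distribution?", which at this sample size is almost never true; PSI answers "has the distribution moved enough to matter?", which is the operational question.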

Seasonality: Purchase patterns during Black Friday in e-commerce are distinctly different from normal periods. If detected as drift, unnecessary alerts flood in at the same time every year. Set reference data to historical data from the same period, or apply seasonal adjustment logic.

Inter-feature Correlations: Drift detection on individual features alone cannot capture multivariate distribution changes. There are cases where features A and B each have similar distributions, but the correlation between A and B has changed. Evidently's DatasetDriftMetric provides dataset-level judgment, but if explicit multivariate detection is needed, consider Alibi Detect's MMD (Maximum Mean Discrepancy) method.

Reference Data Management Strategies

Reference data is the baseline for drift detection. Incorrect reference data invalidates all detection results.

| Strategy | Description | Suitable For | Caution |
|---|---|---|---|
| Fixed Training Data | Fix the data used for model training as the reference | Stable domains with little change | Reference itself becomes outdated over time |
| Sliding Window | Update reference with data from the most recent N days/weeks | Environments where gradual change is normal | Risk of missing gradual drift |
| Update at Retraining | Update reference each time the model is retrained | Pipelines with regular retraining | Dependent on the retraining cycle |
| Dual Baseline | Compare against both training data and a recent stable period | Environments requiring high accuracy | Increased management complexity |

Key Point: Reference data should be version-controlled and tracked with a 1:1 mapping to model versions. Storing reference data snapshots as MLflow artifacts is recommended.
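A minimal version of that recommendation: write the reference snapshot alongside the run that registered the model version. `snapshot_reference` is a hypothetical helper; the MLflow calls in the usage comment mirror the artifact-logging pattern shown earlier:

```python
import pandas as pd

def snapshot_reference(reference: pd.DataFrame, model_version: str,
                       out_dir: str = "/tmp") -> str:
    """Write a versioned reference snapshot; the caller logs it to MLflow."""
    path = f"{out_dir}/reference_v{model_version}.parquet"
    reference.to_parquet(path)
    return path

# Usage inside the training run that registers the model (requires a tracking server):
# with mlflow.start_run() as run:
#     path = snapshot_reference(reference_data, model_version="3")
#     mlflow.log_artifact(path, artifact_path="reference_data")
#     mlflow.set_tag("reference.model_version", "3")
```

Drift-check jobs can then pull the snapshot for exactly the model version they are monitoring, preserving the 1:1 mapping described above.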

Feature Store Integration

If feature computation logic differs between offline training and online serving (Training-Serving Skew), fake drift caused by implementation inconsistency rather than actual drift will occur. Using a feature store like Feast to ensure feature consistency between training and serving is the fundamental solution.

10. Failure Cases and Recovery Procedures

Case 1: Silent Model Degradation

Situation: An e-commerce recommendation model gradually degraded over 3 months. CTR dropped from 12% to 7%, but drift monitoring was configured only at the individual feature level and failed to detect it.

Root Cause: Multivariate change in user behavior patterns. Individual feature distributions (views, dwell time, category ratio) didn't change significantly, but the correlations between features changed. Specifically, the "dwell time - purchase conversion" relationship weakened due to changes in short-form content consumption patterns.

Recovery Procedure:

  1. Added multivariate drift detection (feature correlation matrix comparison)
  2. Added a concept drift monitoring layer that directly monitors business KPIs (CTR, conversion rate)
  3. Retrained model with last 2 weeks of data and deployed via A/B testing
  4. Shortened retraining cycle from monthly to weekly

Lesson: Data drift alone is insufficient to capture concept drift. Business metric monitoring must always be conducted in parallel.
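The correlation-matrix comparison added in step 1 can start as simple as the sketch below; the 0.2 threshold on absolute correlation change is an assumed starting point, not an established standard:

```python
import numpy as np
import pandas as pd

def correlation_drift(reference: pd.DataFrame, current: pd.DataFrame,
                      threshold: float = 0.2) -> list:
    """Return feature pairs whose Pearson correlation changed by more than
    `threshold` between reference and current data."""
    delta = (current.corr() - reference.corr()).abs()
    cols = delta.columns
    drifted = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if delta.iloc[i, j] > threshold:
                drifted.append((cols[i], cols[j], round(float(delta.iloc[i, j]), 3)))
    return drifted

# Synthetic demo of the case above: dwell_time predicts conversion in the
# reference period but the relationship collapses in the current period
rng = np.random.default_rng(7)
dwell = rng.normal(size=5_000)
ref = pd.DataFrame({"dwell_time": dwell,
                    "conversion": dwell * 0.8 + rng.normal(size=5_000)})
cur = pd.DataFrame({"dwell_time": rng.normal(size=5_000),
                    "conversion": rng.normal(size=5_000)})
print(correlation_drift(ref, cur))  # the (dwell_time, conversion) pair is flagged
```

Note that both marginal distributions in `cur` are still standard-normal-like, so univariate drift checks would stay quiet while this pairwise check fires.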

Case 2: Fake Drift from Data Pipeline Failure

Situation: Late on a Friday night, a flood of critical drift alerts fired: over 80% of columns were flagged as drifted across three models simultaneously.

Root Cause: An upstream data pipeline ETL job failed, causing some columns in the serving feature table to be filled with default values (0 or null). A data quality issue was falsely detected as drift.

Recovery Procedure:

  1. Placed TestNumberOfMissingValues and TestShareOfOutRangeValues in the Evidently TestSuite before the drift check stage
  2. Configured the pipeline to skip the drift check on data quality failure and send a separate data pipeline alert instead
  3. Added data completeness validation gate to upstream ETL
  4. Included "recent data quality check results" information in drift alerts

Lesson: A data quality validation step must always be placed before the drift detection pipeline. Distinguishing between data quality issues and actual distribution changes is the key to operations.
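The ordering in step 1 is the whole point. A minimal plain-pandas stand-in for the Evidently tests named above makes the gate logic explicit (thresholds and range bounds are illustrative):

```python
import pandas as pd

def quality_gate(current: pd.DataFrame, expected_ranges: dict,
                 max_missing_share: float = 0.05,
                 max_out_of_range_share: float = 0.05) -> bool:
    """Return True only if serving data passes basic quality checks.
    Run this BEFORE drift detection: a failed gate means 'page the
    data pipeline team', not 'the distribution drifted'."""
    for col, (lo, hi) in expected_ranges.items():
        # Mirrors TestNumberOfMissingValues: too many nulls -> pipeline issue
        if current[col].isna().mean() > max_missing_share:
            return False
        # Mirrors TestShareOfOutRangeValues: values outside the expected range
        values = current[col].dropna()
        if len(values) and ((values < lo) | (values > hi)).mean() > max_out_of_range_share:
            return False
    return True
```

In the orchestrator, a `False` result routes to a data pipeline alert and the drift report is skipped for that batch, so a broken ETL job never masquerades as 80% column drift.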

Case 3: Reference Data Contamination

Situation: After model retraining, reference data was updated with the new dataset. Afterwards, no drift was detected at all, rendering the monitoring system useless.

Root Cause: The data used for retraining already contained drift, and this contaminated data became the new reference. As a result, the drift was "normalized" and the baseline was reset.

Recovery Procedure:

  1. Automated drift comparison between new and previous reference data during reference update
  2. Added a gate to block reference updates when drift ratio exceeds a certain level
  3. Version-controlled reference data change history as MLflow artifacts
  4. Periodically compared against a golden dataset (manually verified high-quality data)

Lesson: Reference data is the baseline of the monitoring system, so any changes must go through a validation process.
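The gate from step 2 can be sketched with a two-sample Kolmogorov-Smirnov test per column (the 30% drifted-column threshold is illustrative and should be tuned per domain):

```python
import pandas as pd
from scipy import stats

def reference_update_allowed(old_ref: pd.DataFrame, new_ref: pd.DataFrame,
                             columns: list, p_threshold: float = 0.05,
                             max_drifted_share: float = 0.3) -> bool:
    """Before promoting new_ref as the monitoring baseline, measure how far
    it has drifted from the previous reference. Blocking the update when
    too many columns shifted prevents drift from being silently
    'normalized' into the baseline."""
    drifted = 0
    for col in columns:
        _, p_value = stats.ks_2samp(old_ref[col].dropna(), new_ref[col].dropna())
        if p_value < p_threshold:
            drifted += 1
    return drifted / len(columns) <= max_drifted_share
```

A blocked update should route to manual review against the golden dataset (step 4) rather than silently keeping the old reference forever.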

11. Production Monitoring Checklist

Pre-deployment Checklist

  • Is reference data version-controlled alongside model versions?
  • Are Evidently Report/TestSuite checks integrated into the deployment pipeline?
  • Are drift thresholds adjusted for domain characteristics?
  • Is data quality validation placed before the drift detection stage?
  • Is a fallback model registered in the registry?
  • Are Grafana dashboards and alert rules configured?

Operational Checklist

  • Are drift checks running on their normal schedule (monitoring the monitor)?
  • Is the false-positive alert rate at a manageable level (recommended: under 5 per month)?
  • Is reference data being updated at appropriate intervals?
  • Are retraining triggers firing properly, and is champion/challenger evaluation being performed?
  • Are business KPIs and model performance metrics being tracked together?
  • Is the mean time to respond (MTTR) after an alert within SLA?

Concept Drift Response Checklist

  • Is a label acquisition pipeline built (including handling for delayed labels)?
  • Are time-series trends of model performance metrics (accuracy, F1, AUC) being monitored?
  • Are proxy metrics defined for periods when labels are unavailable?
  • Is A/B testing infrastructure ready?

12. References

  1. Evidently AI - Data Drift Official Guide - A comprehensive guide covering data drift concepts, detection methodologies, and real-world cases.
  2. Evidently AI GitHub Repository - Open-source code, example notebooks, and community discussions.
  3. MLflow Model Registry Official Documentation - Model registry API, alias system, and deployment workflow guide.
  4. Evidently AI Official Documentation - Complete API reference and tutorials for Report, TestSuite, and Metric.
  5. Advanced ML Model Monitoring: Drift Detection, Explainability, and Automated Retraining - Advanced patterns for drift detection and automatic retraining pipelines.
  6. Google - ML Technical Debt (Hidden Technical Debt in Machine Learning Systems) - Foundational paper on technical debt and monitoring needs in ML systems.
  7. NannyML - Estimating Model Performance without Ground Truth - CBPE methodology for estimating model performance without labels.