ML Model Monitoring and Drift Detection: Evidently AI + MLflow Production Operations Guide


1. Introduction: Production Models Silently Degrade

An ML model's accuracy peaks at the moment of deployment. After that, prediction quality gradually declines as the real world changes. The problem is that this degradation progresses without explicit errors. No HTTP 500s are thrown, no CRITICAL logs appear, and the service responds normally. It's just that recommendations become increasingly irrelevant, fraud detection misses new patterns, and demand forecasts start diverging from reality.

According to Google's research, over 60% of failures in production ML systems originate from data-related issues, not model code. The model itself doesn't break -- rather, the gap between the world the model learned and the real world keeps growing.

This article covers how to combine the open-source monitoring tool Evidently AI with the experiment/model management platform MLflow to continuously monitor the health of ML models in production, detect drift, and trigger automatic retraining pipelines.

2. Types of Drift: What Changes

Drift refers to the discrepancy between the data distribution the model was trained on and the data distribution at serving time. It is broadly classified by where and how the change occurs; the three types most often discussed are described below, and the summary table adds a fourth, label drift (a shift in the target distribution P(Y)).

Data Drift (Covariate Shift)

This is the phenomenon where the distribution of input features changes. The model's input space P(X) shifts over time. For example, in an e-commerce recommendation model, the age distribution of users changes, or the proportion of purchase categories shifts with seasons. The relationship P(Y|X) between the target variable Y and features X remains the same, but the statistical characteristics of the inputs themselves change.

Concept Drift

This is the phenomenon where the relationship between features and the target itself changes. P(Y|X) changes. This is a more serious problem than data drift because the correct answer itself changes for the same input. Representative examples include demand forecasting models becoming completely invalidated during the COVID-19 pandemic, and financial fraud detection where fraudsters' methods evolve, making existing patterns no longer valid.

Prediction Drift

This is the phenomenon where the distribution of model output P(Y_pred) changes. It can appear as a result of input drift or occur independently due to internal model issues. It includes cases where the prediction ratio for a specific class in a classification model suddenly skews, or the mean or variance of predicted values in a regression model changes significantly.
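As a concrete (if simplified) illustration of detecting prediction drift in a classifier, a chi-squared test can compare predicted class counts from two serving windows. The class counts below are invented for the example, and this is a standalone SciPy sketch, not Evidently's internal implementation:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical predicted-class counts (classes A, B, C) from two serving windows
reference_counts = np.array([700, 200, 100])   # last month's prediction mix
current_counts = np.array([500, 150, 350])     # class C suddenly over-predicted

# Scale reference counts to the current total so expected frequencies sum correctly
expected = reference_counts / reference_counts.sum() * current_counts.sum()
stat, p_value = chisquare(f_obs=current_counts, f_exp=expected)

print(f"chi2={stat:.1f}, p={p_value:.2e}")
if p_value < 0.05:
    print("Prediction drift: output class distribution has shifted")
```

Scaling the reference counts to the current window's total keeps the expected frequencies valid when the two windows differ in size.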

| Drift Type | What Changes | Detection Difficulty | Representative Detection Methods | Retraining Urgency |
|---|---|---|---|---|
| Data Drift | P(X) input distribution | Medium | PSI, KS test, Wasserstein | Medium |
| Concept Drift | P(Y\|X) relationship | High | Performance metric monitoring, ADWIN | High |
| Prediction Drift | P(Y_pred) output | Low | Output distribution statistics, Chi-squared | Situational |
| Label Drift | P(Y) target distribution | Medium | Label distribution comparison | High |

3. Evidently AI Architecture and Core Features

Evidently AI is an open-source Python library for ML model monitoring and data quality validation. It ships with over 20 built-in statistical drift detection methods.

Core Components

  • Report: One-time data analysis report. Can be output as HTML, JSON, or Python dictionary format. Suitable for exploratory analysis and debugging.
  • Test Suite: Automated validation against predefined conditions. Integrated into CI/CD pipelines as data quality gates.
  • Metric: Individual measurement items. Dozens of metrics such as DataDriftTable, DatasetSummaryMetric, and ColumnCorrelationsMetric are provided out of the box.
  • Collector/Workspace: Evidently server mode. Stores monitoring results as time series and queries them on dashboards.

Key Drift Detection Algorithms

Evidently automatically selects the optimal detection algorithm based on feature type (numerical/categorical) and dataset size.

| Algorithm | Target Type | Principle | Pros | Limitations |
|---|---|---|---|---|
| Kolmogorov-Smirnov (KS) | Numerical, small | Maximum difference in cumulative distributions | No distribution assumptions | Oversensitive with large data |
| Population Stability Index (PSI) | Numerical/Categorical | Weighted sum of log ratios of two distributions | Industry standard, easy to interpret | Sensitive to bin settings |
| Wasserstein Distance | Numerical | Minimum transport cost between two distributions | Reflects distribution shape differences | High computational cost |
| Jensen-Shannon Divergence | Numerical/Categorical | Symmetric version of KL divergence | Always finite, symmetric | Insensitive to tail changes |
| Chi-squared Test | Categorical | Difference between observed/expected frequencies | Intuitive for categorical data | Unstable with low-frequency categories |
| Z-test (Proportion test) | Categorical, large | Standardization of proportion differences | Efficient for large data | Assumes normal approximation |
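To ground a couple of the rows above: the KS test and Wasserstein distance are both available directly in SciPy, which Evidently builds on. The synthetic 0.5-sigma mean shift below is purely illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=2_000)
current = rng.normal(loc=0.5, scale=1.0, size=2_000)   # mean shifted by 0.5 sigma

ks_stat, ks_p = ks_2samp(reference, current)
w_dist = wasserstein_distance(reference, current)

# The KS test rejects distribution equality, and the Wasserstein distance
# lands close to the injected mean shift of 0.5
print(f"KS statistic={ks_stat:.3f} (p={ks_p:.2e}), Wasserstein={w_dist:.3f}")
```

Note the KS limitation from the table: at much larger sample sizes, even trivial shifts produce vanishing p-values, which is why effect-size metrics like PSI or Wasserstein are useful complements.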

4. Evidently AI Practical Usage

Installation and Basic Setup

# Evidently AI installation (including MLflow integration)
# pip install evidently mlflow scikit-learn pandas

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric,
)

# Prepare reference / current data
data = load_iris(as_frame=True)
df = data.frame
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]

reference_data = df.sample(frac=0.5, random_state=42)
current_data = df.drop(reference_data.index)

# Create simulated data with data drift (seeded for reproducibility)
rng = np.random.default_rng(42)
current_drifted = current_data.copy()
current_drifted["sepal_length"] = current_drifted["sepal_length"] + rng.normal(2.0, 0.5, len(current_drifted))
current_drifted["petal_width"] = current_drifted["petal_width"] * 1.8

# Generate drift report
drift_report = Report(metrics=[
    DatasetDriftMetric(),
    DataDriftTable(),
])

drift_report.run(
    reference_data=reference_data,
    current_data=current_drifted,
)

# Extract results as dictionary (for programmatic use)
result = drift_report.as_dict()
dataset_drift = result["metrics"][0]["result"]["dataset_drift"]
drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]

print(f"Dataset drift detected: {dataset_drift}")
print(f"Drifted column ratio: {drift_share:.2%}")

# Save as HTML report
drift_report.save_html("drift_report.html")

Automated Data Quality Validation with Test Suite

from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset, DataQualityTestPreset
from evidently.tests import (
    TestColumnDrift,
    TestShareOfDriftedColumns,
    TestNumberOfMissingValues,
    TestShareOfOutRangeValues,
    TestMeanInNSigmas,
)

# Configure data drift + quality test suite
monitoring_suite = TestSuite(tests=[
    # Drift test: fail if 30% or more of columns drift
    TestShareOfDriftedColumns(lt=0.3),

    # Individual key feature drift validation
    TestColumnDrift(column_name="sepal_length"),
    TestColumnDrift(column_name="petal_width"),

    # Data quality tests
    TestNumberOfMissingValues(eq=0),

    # Mean stability: the current mean of sepal_length must fall within 3 sigmas of the reference mean
    TestMeanInNSigmas(column_name="sepal_length", n=3),
])

monitoring_suite.run(
    reference_data=reference_data,
    current_data=current_drifted,
)

# Programmatically check test results
suite_result = monitoring_suite.as_dict()
all_passed = all(
    test["status"] == "SUCCESS"
    for test in suite_result["tests"]
)

print(f"All tests passed: {all_passed}")
for test in suite_result["tests"]:
    status_icon = "PASS" if test["status"] == "SUCCESS" else "FAIL"
    print(f"  [{status_icon}] {test['name']}: {test['status']}")

# Use as exit code in CI/CD pipelines
if not all_passed:
    print("ALERT: Data drift or quality anomaly detected. Retraining pipeline trigger required.")
    # sys.exit(1)  # Fail the build in CI

5. MLflow Model Registry and Monitoring Integration

MLflow provides experiment tracking, model packaging, and model registry functionality. By recording Evidently's drift detection results in MLflow, you can track performance history and drift status per model version on a single platform.

Logging Drift Metrics to MLflow

import mlflow
from evidently.report import Report
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric,
)
import json
from datetime import datetime

# MLflow tracking server configuration
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("model-monitoring/fraud-detection-v2")

def log_drift_to_mlflow(
    reference_data,
    current_data,
    model_name: str,
    model_version: str,
    batch_id: str,
):
    """Log drift analysis results to MLflow"""

    # Generate Evidently drift report
    drift_report = Report(metrics=[
        DatasetDriftMetric(),
        DataDriftTable(),
    ])
    drift_report.run(
        reference_data=reference_data,
        current_data=current_data,
    )

    result = drift_report.as_dict()
    drift_result = result["metrics"][0]["result"]

    # Record as MLflow Run
    with mlflow.start_run(run_name=f"drift-check-{batch_id}") as run:
        # Basic drift metrics
        mlflow.log_metric("dataset_drift_detected", int(drift_result["dataset_drift"]))
        mlflow.log_metric("drifted_columns_share", drift_result["share_of_drifted_columns"])
        mlflow.log_metric("number_of_drifted_columns", drift_result["number_of_drifted_columns"])
        mlflow.log_metric("total_columns", drift_result["number_of_columns"])

        # Log individual column drift scores
        column_drift = result["metrics"][1]["result"]["drift_by_columns"]
        for col_name, col_info in column_drift.items():
            safe_col_name = col_name.replace(" ", "_").replace("/", "_")
            mlflow.log_metric(
                f"drift_score_{safe_col_name}",
                col_info.get("drift_score", 0.0),
            )
            mlflow.log_metric(
                f"drift_detected_{safe_col_name}",
                int(col_info.get("drift_detected", False)),
            )

        # Record metadata as tags
        mlflow.set_tags({
            "monitoring.type": "drift_detection",
            "monitoring.model_name": model_name,
            "monitoring.model_version": model_version,
            "monitoring.batch_id": batch_id,
            "monitoring.timestamp": datetime.utcnow().isoformat(),
            "monitoring.reference_size": str(len(reference_data)),
            "monitoring.current_size": str(len(current_data)),
        })

        # Save HTML report as artifact
        report_path = f"/tmp/drift_report_{batch_id}.html"
        drift_report.save_html(report_path)
        mlflow.log_artifact(report_path, artifact_path="drift_reports")

        # Save JSON results as artifact
        json_path = f"/tmp/drift_result_{batch_id}.json"
        with open(json_path, "w") as f:
            json.dump(result, f, indent=2, default=str)
        mlflow.log_artifact(json_path, artifact_path="drift_reports")

        print(f"Drift results logged to MLflow. Run ID: {run.info.run_id}")
        return drift_result["dataset_drift"], drift_result["share_of_drifted_columns"]


# Usage example
is_drifted, drift_share = log_drift_to_mlflow(
    reference_data=reference_data,
    current_data=current_drifted,
    model_name="fraud-detector",
    model_version="3",
    batch_id="2026-03-06-batch-001",
)

Alias-Based Model Registry Management

Starting with MLflow 2.x, alias-based model management is recommended over the traditional Stage system (Staging/Production/Archived). You can apply a strategy that automatically switches model aliases based on drift detection results.
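The alias mechanics reduce to a single registry call. The `promote` helper below is a hypothetical convenience wrapper (not part of MLflow) around `MlflowClient.set_registered_model_alias`:

```python
MODEL_NAME = "fraud-detector"

def promote(client, version: str, alias: str = "production") -> str:
    """Point `alias` at `version` on the registered model and return a
    models:/ URI that serving code can resolve through the alias."""
    # MlflowClient.set_registered_model_alias(name, alias, version)
    client.set_registered_model_alias(MODEL_NAME, alias, version)
    return f"models:/{MODEL_NAME}@{alias}"

# Usage (requires a reachable tracking server):
# client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
# uri = promote(client, "3")
# model = mlflow.pyfunc.load_model(uri)
```

Because serving code loads `models:/fraud-detector@production`, promotion or rollback becomes a metadata change rather than a redeploy, which is what makes the automatic fallback switch below cheap.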

from mlflow import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")

MODEL_NAME = "fraud-detector"

def handle_drift_detection(
    is_drifted: bool,
    drift_share: float,
    model_name: str = MODEL_NAME,
    drift_threshold_warn: float = 0.2,
    drift_threshold_critical: float = 0.5,
):
    """Perform model registry actions based on drift detection results"""

    # Check current production model version
    try:
        prod_version = client.get_model_version_by_alias(model_name, "production")
        current_version = prod_version.version
        print(f"Current production model version: {current_version}")
    except Exception as e:
        print(f"Failed to retrieve production model alias: {e}")
        return

    if not is_drifted:
        print("No drift detected. Maintaining current model.")
        client.set_model_version_tag(
            model_name, current_version,
            key="last_drift_check",
            value="passed",
        )
        return

    if drift_share >= drift_threshold_critical:
        # Critical drift: immediately switch to fallback model + trigger retraining
        print(f"CRITICAL: Drift ratio {drift_share:.1%} - Switching to fallback model and triggering retraining")
        client.set_model_version_tag(
            model_name, current_version,
            key="drift_status", value="critical",
        )
        # Switch to fallback model if available
        try:
            fallback = client.get_model_version_by_alias(model_name, "fallback")
            client.set_registered_model_alias(model_name, "production", fallback.version)
            print(f"Switched to fallback model version {fallback.version}")
        except Exception:
            print("WARNING: No fallback model available. Maintaining current model while emergency retraining is needed.")

        # Trigger retraining (external system call)
        trigger_retraining(model_name, reason="critical_drift")

    elif drift_share >= drift_threshold_warn:
        # Warning level drift: record tag + notification
        print(f"WARNING: Drift ratio {drift_share:.1%} - Enhanced monitoring and scheduling retraining")
        client.set_model_version_tag(
            model_name, current_version,
            key="drift_status", value="warning",
        )
        # Add to scheduled retraining queue
        schedule_retraining(model_name, priority="normal")


def trigger_retraining(model_name: str, reason: str):
    """Trigger emergency retraining (call Airflow DAG, Kubeflow Pipeline, etc.)"""
    print(f"Retraining triggered: model={model_name}, reason={reason}")
    # requests.post("http://airflow.internal/api/v1/dags/retrain/dagRuns", ...)


def schedule_retraining(model_name: str, priority: str):
    """Register in scheduled retraining queue"""
    print(f"Retraining scheduled: model={model_name}, priority={priority}")


# Execute
handle_drift_detection(
    is_drifted=True,
    drift_share=0.55,
    model_name=MODEL_NAME,
)

6. Building an Automatic Retraining Pipeline

The automated pipeline from drift detection to retraining consists of the following stages.

Overall Pipeline Flow

  1. Scheduler: Trigger drift check after batch inference or at regular intervals (daily/weekly)
  2. Drift Analyzer: Analyze current data against reference data with Evidently
  3. Decision Engine: Determine whether retraining is needed based on drift thresholds
  4. Retraining Orchestrator: Execute training jobs in Airflow/Kubeflow
  5. Champion/Challenger Evaluation: Compare and evaluate the new model against the existing model
  6. Deployment Gate: Auto-deploy if performance criteria are met, rollback on failure
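Step 5 can be sketched as a plain comparison function. The metric names and the 2% guard band below are illustrative defaults, not a fixed rule:

```python
def challenger_wins(champion_metrics: dict, challenger_metrics: dict,
                    primary: str = "roc_auc", min_uplift: float = 0.0,
                    guard: str = "precision", max_regression: float = 0.02) -> bool:
    """Promote the challenger only if it beats the champion on the primary
    metric and does not regress the guard metric by more than max_regression."""
    uplift = challenger_metrics[primary] - champion_metrics[primary]
    regression = champion_metrics[guard] - challenger_metrics[guard]
    return uplift > min_uplift and regression <= max_regression

champion = {"roc_auc": 0.91, "precision": 0.84}
challenger = {"roc_auc": 0.93, "precision": 0.83}
print(challenger_wins(champion, challenger))  # uplift 0.02, regression 0.01 -> True
```

In practice the two metric dicts would come from evaluating both model versions on the same held-out evaluation set, with the result deciding whether the deployment gate flips the production alias.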

Airflow DAG Integration Pattern

# Airflow DAG example: drift check + conditional retraining
# dag_drift_monitor.py

from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.empty import EmptyOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
import pandas as pd

default_args = {
    "owner": "ml-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=30),
}

dag = DAG(
    dag_id="ml_drift_monitor_fraud_detection",
    default_args=default_args,
    description="Daily drift monitoring and conditional retraining",
    schedule_interval="0 6 * * *",  # Daily at 6 AM
    start_date=days_ago(1),
    catchup=False,
    tags=["ml-monitoring", "drift-detection"],
)


def fetch_data(**context):
    """Load reference data and last 24 hours of serving data"""
    from sqlalchemy import create_engine
    engine = create_engine("postgresql://reader:password@db.internal/features")

    reference = pd.read_sql(
        "SELECT * FROM fraud_features_reference", engine
    )
    current = pd.read_sql(
        """SELECT * FROM fraud_features_serving
           WHERE created_at >= NOW() - INTERVAL '24 hours'""",
        engine,
    )

    # Pass paths via XCom (store large data in S3)
    ref_path = "/tmp/reference_data.parquet"
    cur_path = "/tmp/current_data.parquet"
    reference.to_parquet(ref_path)
    current.to_parquet(cur_path)

    context["ti"].xcom_push(key="reference_path", value=ref_path)
    context["ti"].xcom_push(key="current_path", value=cur_path)
    context["ti"].xcom_push(key="current_size", value=len(current))


def run_drift_check(**context):
    """Run Evidently drift analysis and log to MLflow"""
    from evidently.report import Report
    from evidently.metrics import DatasetDriftMetric, DataDriftTable
    import mlflow

    ti = context["ti"]
    ref_path = ti.xcom_pull(key="reference_path")
    cur_path = ti.xcom_pull(key="current_path")

    reference = pd.read_parquet(ref_path)
    current = pd.read_parquet(cur_path)

    # Validate minimum sample count (branching is handled later by decide_action)
    if len(current) < 100:
        print(f"Insufficient current data samples: {len(current)}. Skipping drift check.")
        ti.xcom_push(key="drift_action", value="skip")
        return

    report = Report(metrics=[DatasetDriftMetric(), DataDriftTable()])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()

    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]

    # Log to MLflow
    mlflow.set_tracking_uri("http://mlflow.internal:5000")
    mlflow.set_experiment("monitoring/fraud-detection")
    with mlflow.start_run(run_name=f"drift-{context['ds']}"):
        mlflow.log_metric("drift_detected", int(drift_detected))
        mlflow.log_metric("drift_share", drift_share)

    ti.xcom_push(key="drift_detected", value=drift_detected)
    ti.xcom_push(key="drift_share", value=drift_share)


def decide_action(**context):
    """Decide whether to retrain based on drift level"""
    ti = context["ti"]
    drift_detected = ti.xcom_pull(key="drift_detected")
    drift_share = ti.xcom_pull(key="drift_share")

    if drift_share is None or drift_share < 0.2:
        return "skip_retraining"
    elif drift_share >= 0.5:
        return "trigger_emergency_retrain"
    else:
        return "trigger_scheduled_retrain"


fetch_task = PythonOperator(
    task_id="fetch_data", python_callable=fetch_data, dag=dag,
)
drift_task = PythonOperator(
    task_id="run_drift_check", python_callable=run_drift_check, dag=dag,
)
branch_task = BranchPythonOperator(
    task_id="decide_action", python_callable=decide_action, dag=dag,
)
skip_task = EmptyOperator(task_id="skip_retraining", dag=dag)
scheduled_retrain = EmptyOperator(task_id="trigger_scheduled_retrain", dag=dag)
emergency_retrain = EmptyOperator(task_id="trigger_emergency_retrain", dag=dag)

fetch_task >> drift_task >> branch_task >> [skip_task, scheduled_retrain, emergency_retrain]

Retraining Trigger Threshold Guidelines

| Drift Level | drift_share Range | Recommended Action | Response Time |
|---|---|---|---|
| Normal | 0% ~ 15% | Maintain monitoring | - |
| Caution | 15% ~ 30% | Send alert, begin root cause analysis | Within 48 hrs |
| Warning | 30% ~ 50% | Register in scheduled retraining queue | Within 24 hrs |
| Critical | 50% or above | Immediate retraining + fallback model switch | Immediately |

Note: Thresholds should be adjusted based on domain and model characteristics. Domains with high missed detection costs like financial fraud detection should use lower thresholds (10-20%), while domains with wider tolerance like recommendation systems should apply higher thresholds (30-50%).
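One way to keep these bands adjustable per model is a plain config mapping. The model names and override values below are illustrative, restating the guideline table and the stricter fraud-detection note:

```python
DEFAULT_THRESHOLDS = {"warn": 0.30, "critical": 0.50}

# Hypothetical per-model overrides; stricter where missed detections are costly
THRESHOLDS_BY_MODEL = {
    "fraud-detector": {"warn": 0.15, "critical": 0.30},
    "product-recommender": {"warn": 0.35, "critical": 0.50},
}

def classify_drift(model_name: str, drift_share: float) -> str:
    """Map a drifted-column share onto an action level for this model."""
    t = THRESHOLDS_BY_MODEL.get(model_name, DEFAULT_THRESHOLDS)
    if drift_share >= t["critical"]:
        return "critical"
    if drift_share >= t["warn"]:
        return "warning"
    return "normal"

print(classify_drift("fraud-detector", 0.2))       # -> warning
print(classify_drift("product-recommender", 0.2))  # -> normal
```

The same 20% drift share then triggers different actions per domain, which is the point of the note above.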

7. Monitoring Tool Comparison: Evidently vs NannyML vs WhyLabs vs Alibi Detect

There are several options for production ML monitoring tools. Let's compare the strengths and weaknesses of each.

| Criteria | Evidently AI | NannyML | WhyLabs | Alibi Detect |
|---|---|---|---|---|
| License | Apache 2.0 (OSS) | BSD-3 (OSS) | SaaS + Free tier | BSD-3 (OSS) |
| Core Strength | General data/model monitoring | Label-free performance estimation (CBPE) | Real-time streaming profiling | Advanced drift detection algorithms |
| Number of Drift Methods | 20+ | 10+ | 15+ | 15+ |
| Label-free Performance Estimation | Limited | Core feature (CBPE, DLE) | Not supported | Not supported |
| Real-time Monitoring | Collector mode | Not supported (batch) | Native support | Not supported (batch) |
| Visualization | Built-in HTML/dashboard | Built-in HTML | Web dashboard (SaaS) | Basic visualization |
| CI/CD Integration | Test Suite (native) | Limited | API-based | Manual configuration required |
| Prometheus Integration | Officially supported | Custom required | Built-in | Custom required |
| MLflow Integration | Easy (Python native) | Manual configuration | API integration | Manual configuration |
| Learning Curve | Low | Medium | Low (SaaS) | High |
| Production Use Cases | General purpose | Label-delayed environments | Large-scale real-time | Research/advanced detection |

Selection Guide:

  • Environments where labels cannot be obtained immediately (e.g., financial fraud detection where label confirmation takes months): NannyML's CBPE (Confidence-Based Performance Estimation) is effectively the only fit among these four tools.
  • Open-source first + rapid adoption: Evidently AI provides the widest feature range with the lowest adoption barrier.
  • Large-scale real-time streaming: WhyLabs' data profiling is optimized for processing tens of thousands of records per second.
  • Research environments needing advanced statistical detection: Alibi Detect's deep kernel MMD and Learned Kernel drift detection are well-suited.

8. Grafana/Prometheus Dashboard Configuration

Let's look at how to expose Evidently's monitoring results as Prometheus metrics and visualize them as time series on Grafana dashboards.

Prometheus Metrics Export

# prometheus_drift_exporter.py
from prometheus_client import start_http_server, Gauge, Counter, Histogram
from evidently.report import Report
from evidently.metrics import DatasetDriftMetric, DataDriftTable
import pandas as pd
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Prometheus metric definitions
DRIFT_DETECTED = Gauge(
    "ml_model_drift_detected",
    "Dataset drift detection status (0/1)",
    ["model_name", "model_version"],
)
DRIFT_SHARE = Gauge(
    "ml_model_drift_column_share",
    "Share of drifted columns",
    ["model_name", "model_version"],
)
COLUMN_DRIFT_SCORE = Gauge(
    "ml_model_column_drift_score",
    "Individual column drift score",
    ["model_name", "model_version", "column_name"],
)
DRIFT_CHECK_TOTAL = Counter(
    "ml_model_drift_checks_total",
    "Total drift check executions",
    ["model_name"],
)
DRIFT_CHECK_DURATION = Histogram(
    "ml_model_drift_check_duration_seconds",
    "Drift check execution duration",
    ["model_name"],
    buckets=[0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0],
)

MODEL_NAME = "fraud-detector"
MODEL_VERSION = "3"


def run_periodic_drift_check(
    reference_path: str,
    current_query_fn,
    interval_seconds: int = 300,
):
    """Periodic drift check and Prometheus metric update"""
    reference = pd.read_parquet(reference_path)

    while True:
        try:
            start_time = time.time()

            # Load recent data
            current = current_query_fn()
            if current is None or len(current) < 50:
                logger.warning(f"Insufficient current data: {len(current) if current is not None else 0} records")
                time.sleep(interval_seconds)
                continue

            # Filter to feature columns only (exclude target and metadata columns)
            feature_cols = [c for c in reference.columns if c not in ["target", "id", "timestamp"]]
            ref_features = reference[feature_cols]
            cur_features = current[feature_cols]

            # Drift analysis
            report = Report(metrics=[DatasetDriftMetric(), DataDriftTable()])
            report.run(reference_data=ref_features, current_data=cur_features)
            result = report.as_dict()

            drift_result = result["metrics"][0]["result"]
            column_results = result["metrics"][1]["result"]["drift_by_columns"]

            # Update Prometheus metrics
            DRIFT_DETECTED.labels(MODEL_NAME, MODEL_VERSION).set(
                int(drift_result["dataset_drift"])
            )
            DRIFT_SHARE.labels(MODEL_NAME, MODEL_VERSION).set(
                drift_result["share_of_drifted_columns"]
            )

            for col_name, col_info in column_results.items():
                COLUMN_DRIFT_SCORE.labels(MODEL_NAME, MODEL_VERSION, col_name).set(
                    col_info.get("drift_score", 0.0)
                )

            DRIFT_CHECK_TOTAL.labels(MODEL_NAME).inc()

            duration = time.time() - start_time
            DRIFT_CHECK_DURATION.labels(MODEL_NAME).observe(duration)

            logger.info(
                f"Drift check complete: drift={drift_result['dataset_drift']}, "
                f"share={drift_result['share_of_drifted_columns']:.2%}, "
                f"duration={duration:.1f}s"
            )

        except Exception as e:
            logger.error(f"Drift check failed: {e}", exc_info=True)

        time.sleep(interval_seconds)


if __name__ == "__main__":
    # Start Prometheus metrics HTTP server (port 8000)
    start_http_server(8000)
    logger.info("Prometheus metrics exporter started (port 8000)")

    # Start periodic drift check (5-minute intervals)
    run_periodic_drift_check(
        reference_path="/data/reference/fraud_features_v3.parquet",
        current_query_fn=lambda: pd.read_parquet("/data/serving/latest_batch.parquet"),
        interval_seconds=300,
    )

Grafana Dashboard Components

Configure the following panels in Grafana to comprehensively monitor ML model health status.

| Panel | Metric | Visualization Type | Alert Rule |
|---|---|---|---|
| Drift Status | ml_model_drift_detected | Stat (latest) | Critical alert when value is 1 |
| Drifted Column Ratio Trend | ml_model_drift_column_share | Time Series | Warning when exceeding 30% |
| Per-Column Drift Score | ml_model_column_drift_score | Heatmap | Highlight columns exceeding threshold |
| Check Duration | ml_model_drift_check_duration_seconds | Histogram | Warning when exceeding 60s |
| Check Execution Count | rate(ml_model_drift_checks_total[1h]) | Time Series | Alert when 0 (check stalled) |

Alertmanager Alert Rules Example

# prometheus-alerts.yaml
groups:
  - name: ml_model_drift_alerts
    rules:
      - alert: MLModelDriftDetected
        expr: ml_model_drift_detected == 1
        for: 5m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: 'ML model data drift detected'
          description: 'Data drift detected in model {{ $labels.model_name }} v{{ $labels.model_version }}. See ml_model_drift_column_share for the drifted column ratio.'

      - alert: MLModelCriticalDrift
        expr: ml_model_drift_column_share > 0.5
        for: 0m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: 'ML model critical drift - Immediate action required'
          description: 'Model {{ $labels.model_name }} drift column ratio is {{ $value | humanizePercentage }}. Immediate retraining or fallback switch is required.'

      - alert: MLDriftCheckStalled
        expr: rate(ml_model_drift_checks_total[1h]) == 0
        for: 30m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: 'ML drift check stalled'
          description: 'No drift checks have been recorded for model {{ $labels.model_name }} in the past hour. Inspect the monitoring pipeline.'

9. Operational Considerations

False Positive Drift Management

The most common pitfall of statistical drift detection is false positives. Drift may be falsely detected in the following situations even when there is no actual problem.

Sample Size Effect: When the current data sample size is very large, KS tests and Chi-squared tests detect statistically significant but practically meaningless differences as drift. Complement with effect-size-based metrics such as PSI or Wasserstein distance to verify practical significance.
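The effect is easy to reproduce. With a practically negligible 0.05-sigma mean shift and 200k samples, the KS p-value signals "significant" drift while PSI stays far below the common 0.1 rule-of-thumb threshold. The `psi` function here is a hand-rolled sketch, not Evidently's implementation:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over quantile bins of the reference."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip into the reference range so out-of-range values land in the edge bins
    ref_pct = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0] / len(reference)
    cur_pct = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) on empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 200_000)
current = rng.normal(0.05, 1.0, 200_000)   # practically negligible 0.05-sigma shift

_, p_value = ks_2samp(reference, current)
print(f"KS p-value: {p_value:.1e} (statistically 'significant' drift)")
print(f"PSI: {psi(reference, current):.4f} (well below the 0.1 rule of thumb)")
```

The KS test is answering "are these exactly the same distribution?", which at this sample size is almost never true; PSI answers "has the distribution moved enough to matter?", which is the operational question.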

Seasonality: Purchase patterns during Black Friday in e-commerce are distinctly different from normal periods. If detected as drift, unnecessary alerts flood in at the same time every year. Set reference data to historical data from the same period, or apply seasonal adjustment logic.

Inter-feature Correlations: Drift detection on individual features alone cannot capture multivariate distribution changes. There are cases where features A and B each have similar distributions, but the correlation between A and B has changed. Evidently's DatasetDriftMetric provides dataset-level judgment, but if explicit multivariate detection is needed, consider Alibi Detect's MMD (Maximum Mean Discrepancy) method.

Reference Data Management Strategies

Reference data is the baseline for drift detection. Incorrect reference data invalidates all detection results.

| Strategy | Description | Suitable For | Caution |
|---|---|---|---|
| Fixed Training Data | Fix the data used for model training as the reference | Stable domains with little change | Reference itself becomes outdated over time |
| Sliding Window | Update reference with data from the most recent N days/weeks | Environments where gradual change is normal | Risk of missing gradual drift |
| Update at Retraining | Update reference each time the model is retrained | Pipelines with regular retraining | Dependent on the retraining cycle |
| Dual Baseline | Compare against both training data and a recent stable period | Environments requiring high accuracy | Increased management complexity |

Key Point: Reference data should be version-controlled and tracked with a 1:1 mapping to model versions. Storing reference data snapshots as MLflow artifacts is recommended.
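A minimal version of that recommendation: write the reference snapshot alongside the run that registered the model version. `snapshot_reference` is a hypothetical helper; the MLflow calls in the usage comment mirror the artifact-logging pattern shown earlier:

```python
import pandas as pd

def snapshot_reference(reference: pd.DataFrame, model_version: str,
                       out_dir: str = "/tmp") -> str:
    """Write a versioned reference snapshot; the caller logs it to MLflow."""
    path = f"{out_dir}/reference_v{model_version}.parquet"
    reference.to_parquet(path)
    return path

# Usage inside the training run that registers the model (requires a tracking server):
# with mlflow.start_run() as run:
#     path = snapshot_reference(reference_data, model_version="3")
#     mlflow.log_artifact(path, artifact_path="reference_data")
#     mlflow.set_tag("reference.model_version", "3")
```

Drift-check jobs can then pull the snapshot for exactly the model version they are monitoring, preserving the 1:1 mapping described above.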

Feature Store Integration

If feature computation logic differs between offline training and online serving (Training-Serving Skew), fake drift caused by implementation inconsistency rather than actual drift will occur. Using a feature store like Feast to ensure feature consistency between training and serving is the fundamental solution.

10. Failure Cases and Recovery Procedures

Case 1: Silent Model Degradation

Situation: An e-commerce recommendation model gradually degraded over 3 months. CTR dropped from 12% to 7%, but drift monitoring was configured only at the individual feature level and failed to detect it.

Root Cause: Multivariate change in user behavior patterns. Individual feature distributions (views, dwell time, category ratio) didn't change significantly, but the correlations between features changed. Specifically, the "dwell time - purchase conversion" relationship weakened due to changes in short-form content consumption patterns.

Recovery Procedure:

  1. Added multivariate drift detection (feature correlation matrix comparison)
  2. Added a concept drift monitoring layer that directly monitors business KPIs (CTR, conversion rate)
  3. Retrained model with last 2 weeks of data and deployed via A/B testing
  4. Shortened retraining cycle from monthly to weekly

Lesson: Data drift alone is insufficient to capture concept drift. Business metric monitoring must always be conducted in parallel.
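The correlation-matrix comparison added in step 1 can start as simple as the sketch below; the 0.2 threshold on absolute correlation change is an assumed starting point, not an established standard:

```python
import numpy as np
import pandas as pd

def correlation_drift(reference: pd.DataFrame, current: pd.DataFrame,
                      threshold: float = 0.2) -> list:
    """Return feature pairs whose Pearson correlation changed by more than
    `threshold` between reference and current data."""
    delta = (current.corr() - reference.corr()).abs()
    cols = delta.columns
    drifted = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if delta.iloc[i, j] > threshold:
                drifted.append((cols[i], cols[j], round(float(delta.iloc[i, j]), 3)))
    return drifted

# Synthetic demo of the case above: dwell_time predicts conversion in the
# reference period but the relationship collapses in the current period
rng = np.random.default_rng(7)
dwell = rng.normal(size=5_000)
ref = pd.DataFrame({"dwell_time": dwell,
                    "conversion": dwell * 0.8 + rng.normal(size=5_000)})
cur = pd.DataFrame({"dwell_time": rng.normal(size=5_000),
                    "conversion": rng.normal(size=5_000)})
print(correlation_drift(ref, cur))  # the (dwell_time, conversion) pair is flagged
```

Note that both marginal distributions in `cur` are still standard-normal-like, so univariate drift checks would stay quiet while this pairwise check fires.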

Case 2: Fake Drift from Data Pipeline Failure

Situation: Late on a Friday night, a flood of critical drift alerts fired: over 80% of columns were flagged as drifted across three models simultaneously.

Root Cause: An upstream data pipeline ETL job failed, causing some columns in the serving feature table to be filled with default values (0 or null). A data quality issue was falsely detected as drift.

Recovery Procedure:

  1. Placed TestNumberOfMissingValues and TestShareOfOutRangeValues in the Evidently TestSuite before the drift check stage
  2. Configured the pipeline to skip the drift check on data quality failure and send a separate data pipeline alert instead
  3. Added data completeness validation gate to upstream ETL
  4. Included "recent data quality check results" information in drift alerts

Lesson: A data quality validation step must always be placed before the drift detection pipeline. Distinguishing between data quality issues and actual distribution changes is the key to operations.
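The ordering in step 1 is the whole point. A minimal plain-pandas stand-in for the Evidently tests named above makes the gate logic explicit (thresholds and range bounds are illustrative):

```python
import pandas as pd

def quality_gate(current: pd.DataFrame, expected_ranges: dict,
                 max_missing_share: float = 0.05,
                 max_out_of_range_share: float = 0.05) -> bool:
    """Return True only if serving data passes basic quality checks.
    Run this BEFORE drift detection: a failed gate means 'page the
    data pipeline team', not 'the distribution drifted'."""
    for col, (lo, hi) in expected_ranges.items():
        # Mirrors TestNumberOfMissingValues: too many nulls -> pipeline issue
        if current[col].isna().mean() > max_missing_share:
            return False
        # Mirrors TestShareOfOutRangeValues: values outside the expected range
        values = current[col].dropna()
        if len(values) and ((values < lo) | (values > hi)).mean() > max_out_of_range_share:
            return False
    return True
```

In the orchestrator, a `False` result routes to a data pipeline alert and the drift report is skipped for that batch, so a broken ETL job never masquerades as 80% column drift.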

Case 3: Reference Data Contamination

Situation: After model retraining, reference data was updated with the new dataset. Afterwards, no drift was detected at all, rendering the monitoring system useless.

Root Cause: The data used for retraining already contained drift, and this contaminated data became the new reference. As a result, the drift was "normalized" and the baseline was reset.

Recovery Procedure:

  1. Automated drift comparison between new and previous reference data during reference update
  2. Added a gate to block reference updates when drift ratio exceeds a certain level
  3. Version-controlled reference data change history as MLflow artifacts
  4. Periodically compared against a golden dataset (manually verified high-quality data)

Lesson: Reference data is the baseline of the monitoring system, so any changes must go through a validation process.
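The gate from step 2 can be sketched with a two-sample Kolmogorov-Smirnov test per column (the 30% drifted-column threshold is illustrative and should be tuned per domain):

```python
import pandas as pd
from scipy import stats

def reference_update_allowed(old_ref: pd.DataFrame, new_ref: pd.DataFrame,
                             columns: list, p_threshold: float = 0.05,
                             max_drifted_share: float = 0.3) -> bool:
    """Before promoting new_ref as the monitoring baseline, measure how far
    it has drifted from the previous reference. Blocking the update when
    too many columns shifted prevents drift from being silently
    'normalized' into the baseline."""
    drifted = 0
    for col in columns:
        _, p_value = stats.ks_2samp(old_ref[col].dropna(), new_ref[col].dropna())
        if p_value < p_threshold:
            drifted += 1
    return drifted / len(columns) <= max_drifted_share
```

A blocked update should route to manual review against the golden dataset (step 4) rather than silently keeping the old reference forever.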

11. Production Monitoring Checklist

Pre-deployment Checklist

  • Is reference data version-controlled alongside model versions?
  • Are Evidently Report/TestSuite checks integrated into the deployment pipeline?
  • Are drift thresholds adjusted for domain characteristics?
  • Is data quality validation placed before the drift detection stage?
  • Is a fallback model registered in the registry?
  • Are Grafana dashboards and alert rules configured?

Operational Checklist

  • Are drift checks running on their normal schedule (monitoring the monitor)?
  • Is the false-positive alert rate at a manageable level (recommended: under 5 per month)?
  • Is reference data being updated at appropriate intervals?
  • Are retraining triggers firing properly, and is champion/challenger evaluation being performed?
  • Are business KPIs and model performance metrics being tracked together?
  • Is the mean time to respond (MTTR) after an alert within SLA?

Concept Drift Response Checklist

  • Is a label acquisition pipeline built (including handling for delayed labels)?
  • Are time-series trends of model performance metrics (accuracy, F1, AUC) being monitored?
  • Are proxy metrics defined for periods when labels are unavailable?
  • Is A/B testing infrastructure ready?

12. References

  1. Evidently AI - Data Drift Official Guide - A comprehensive guide covering data drift concepts, detection methodologies, and real-world cases.
  2. Evidently AI GitHub Repository - Open-source code, example notebooks, and community discussions.
  3. MLflow Model Registry Official Documentation - Model registry API, alias system, and deployment workflow guide.
  4. Evidently AI Official Documentation - Complete API reference and tutorials for Report, TestSuite, and Metric.
  5. Advanced ML Model Monitoring: Drift Detection, Explainability, and Automated Retraining - Advanced patterns for drift detection and automatic retraining pipelines.
  6. Google - ML Technical Debt (Hidden Technical Debt in Machine Learning Systems) - Foundational paper on technical debt and monitoring needs in ML systems.
  7. NannyML - Estimating Model Performance without Ground Truth - CBPE methodology for estimating model performance without labels.