- 1. Introduction: Production Models Silently Degrade
- 2. Types of Drift: What Changes
- 3. Evidently AI Architecture and Core Features
- 4. Evidently AI Practical Usage
- 5. MLflow Model Registry and Monitoring Integration
- 6. Building an Automatic Retraining Pipeline
- 7. Monitoring Tool Comparison: Evidently vs NannyML vs WhyLabs vs Alibi Detect
- 8. Grafana/Prometheus Dashboard Configuration
- 9. Operational Considerations
- 10. Failure Cases and Recovery Procedures
- 11. Production Monitoring Checklist
- 12. References

1. Introduction: Production Models Silently Degrade
An ML model's accuracy peaks at the moment of deployment. After that, prediction quality gradually declines as the real world changes. The problem is that this degradation progresses without explicit errors. No HTTP 500s are thrown, no CRITICAL logs appear, and the service responds normally. It's just that recommendations become increasingly irrelevant, fraud detection misses new patterns, and demand forecasts start diverging from reality.
According to Google's research, over 60% of failures in production ML systems originate from data-related issues, not model code. The model itself doesn't break -- rather, the gap between the world the model learned and the real world keeps growing.
This article covers how to combine the open-source monitoring tool Evidently AI with the experiment/model management platform MLflow to continuously monitor the health of ML models in production, detect drift, and trigger automatic retraining pipelines.
2. Types of Drift: What Changes
Drift refers to the discrepancy between the data distribution the model was trained on and the data distribution at the time of serving. Drift is broadly classified into three categories based on where and how it occurs.
Data Drift (Covariate Shift)
This is the phenomenon where the distribution of input features changes. The model's input space P(X) shifts over time. For example, in an e-commerce recommendation model, the age distribution of users changes, or the proportion of purchase categories shifts with seasons. The relationship P(Y|X) between the target variable Y and features X remains the same, but the statistical characteristics of the inputs themselves change.
Concept Drift
This is the phenomenon where the relationship between features and the target itself changes. P(Y|X) changes. This is a more serious problem than data drift because the correct answer itself changes for the same input. Representative examples include demand forecasting models becoming completely invalidated during the COVID-19 pandemic, and financial fraud detection where fraudsters' methods evolve, making existing patterns no longer valid.
Prediction Drift
This is the phenomenon where the distribution of model output P(Y_pred) changes. It can appear as a result of input drift or occur independently due to internal model issues. It includes cases where the prediction ratio for a specific class in a classification model suddenly skews, or the mean or variance of predicted values in a regression model changes significantly.
| Drift Type | What Changes | Detection Difficulty | Representative Detection Methods | Retraining Urgency |
|---|---|---|---|---|
| Data Drift | P(X) input distribution | Medium | PSI, KS test, Wasserstein | Medium |
| Concept Drift | P(Y\|X) relationship | High | Performance metric monitoring, ADWIN | High |
| Prediction Drift | P(Y_pred) output | Low | Output distribution statistics, Chi-squared | Situational |
| Label Drift | P(Y) target distribution | Medium | Label distribution comparison | High |
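To make the distinction concrete, here is a minimal synthetic sketch (the distributions and variable names are ours, purely illustrative): a covariate shift moves P(X) while the labeling rule stays fixed, whereas concept drift keeps P(X) intact and moves the decision boundary itself.

```python
import numpy as np

rng = np.random.default_rng(42)

# Reference world: X ~ N(0, 1), y = 1 if x > 0 (the "concept")
x_ref = rng.normal(0.0, 1.0, 10_000)
y_ref = (x_ref > 0).astype(int)

# Data drift (covariate shift): P(X) moves, P(Y|X) unchanged
x_cov = rng.normal(1.5, 1.0, 10_000)   # input distribution shifted
y_cov = (x_cov > 0).astype(int)        # same decision rule

# Concept drift: P(X) unchanged, P(Y|X) changes
x_con = rng.normal(0.0, 1.0, 10_000)
y_con = (x_con > 0.8).astype(int)      # the decision boundary moved

print(f"mean(X): ref={x_ref.mean():.2f}, covariate={x_cov.mean():.2f}, concept={x_con.mean():.2f}")
print(f"P(y=1):  ref={y_ref.mean():.2f}, covariate={y_cov.mean():.2f}, concept={y_con.mean():.2f}")
```

Note that the concept-drift case is invisible to input-only monitoring: mean(X) is unchanged, and only the label rate (or model error) betrays the shift.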
3. Evidently AI Architecture and Core Features
Evidently AI is an open-source library for ML model monitoring and data quality validation. It operates in a Python-native environment and has over 20 built-in statistical drift detection methods.
Core Components
- Report: One-time data analysis report. Can be output as HTML, JSON, or Python dictionary format. Suitable for exploratory analysis and debugging.
- Test Suite: Automated validation against predefined conditions. Integrated into CI/CD pipelines as data quality gates.
- Metric: Individual measurement items. Dozens of metrics such as DataDriftTable, DatasetSummaryMetric, and ColumnCorrelationsMetric are provided out of the box.
- Collector/Workspace: Evidently server mode. Stores monitoring results as time series and queries them on dashboards.
Key Drift Detection Algorithms
Evidently automatically selects the optimal detection algorithm based on feature type (numerical/categorical) and dataset size.
| Algorithm | Target Type | Principle | Pros | Limitations |
|---|---|---|---|---|
| Kolmogorov-Smirnov (KS) | Numerical, small | Maximum difference in cumulative distribution | No distribution assumptions | Oversensitive with large data |
| Population Stability Index (PSI) | Numerical/Categorical | Weighted sum of log ratios of two distributions | Industry standard, easy to interpret | Sensitive to bin settings |
| Wasserstein Distance | Numerical | Minimum transport cost between two distributions | Reflects distribution shape differences | High computational cost |
| Jensen-Shannon Divergence | Numerical/Categorical | Symmetric version of KL Divergence | Always finite, symmetric | Insensitive to tail changes |
| Chi-squared Test | Categorical | Difference between observed/expected frequencies | Intuitive for categorical | Unstable with low-frequency categories |
| Z-test (Proportion test) | Categorical, large | Standardization of proportion differences | Efficient for large data | Assumes normal approximation |
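As a rough illustration of how two of these methods behave, the sketch below computes a hand-rolled PSI (the bin-edge choice and the 1e-6 floor are our assumptions; quantile binning is also common in practice) alongside SciPy's two-sample KS test, applied to a 0.5-sigma mean shift:

```python
import numpy as np
from scipy import stats

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index: sum((cur% - ref%) * ln(cur% / ref%)) over bins."""
    # Bin edges taken from the reference distribution
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range current values
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor empty bins to avoid division by zero / log(0)
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5_000)
shifted = rng.normal(0.5, 1.0, 5_000)

ks_stat, ks_pvalue = stats.ks_2samp(reference, shifted)
print(f"PSI: {psi(reference, shifted):.3f}  (common rule of thumb: > 0.2 = significant shift)")
print(f"KS:  stat={ks_stat:.3f}, p={ks_pvalue:.1e}")
```

Both methods flag this shift; the interesting differences appear at the extremes (tiny shifts with huge samples, or coarse bins), which is what the pros/limitations columns above describe.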
4. Evidently AI Practical Usage
Installation and Basic Setup
```python
# Evidently AI installation (including MLflow integration)
# pip install evidently mlflow scikit-learn pandas
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from evidently.report import Report
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
)

# Prepare reference / current data
data = load_iris(as_frame=True)
df = data.frame
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
reference_data = df.sample(frac=0.5, random_state=42)
current_data = df.drop(reference_data.index)

# Create simulated data with data drift
current_drifted = current_data.copy()
current_drifted["sepal_length"] = current_drifted["sepal_length"] + np.random.normal(2.0, 0.5, len(current_drifted))
current_drifted["petal_width"] = current_drifted["petal_width"] * 1.8

# Generate drift report
drift_report = Report(metrics=[
    DatasetDriftMetric(),
    DataDriftTable(),
])
drift_report.run(
    reference_data=reference_data,
    current_data=current_drifted,
)

# Extract results as dictionary (for programmatic use)
result = drift_report.as_dict()
dataset_drift = result["metrics"][0]["result"]["dataset_drift"]
drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]
print(f"Dataset drift detected: {dataset_drift}")
print(f"Drifted column ratio: {drift_share:.2%}")

# Save as HTML report
drift_report.save_html("drift_report.html")
```
Automated Data Quality Validation with Test Suite
```python
from evidently.test_suite import TestSuite
from evidently.tests import (
    TestColumnDrift,
    TestShareOfDriftedColumns,
    TestNumberOfMissingValues,
    TestMeanInNSigmas,
)

# Configure data drift + quality test suite
monitoring_suite = TestSuite(tests=[
    # Drift test: fail if 30% or more of columns drift
    TestShareOfDriftedColumns(lt=0.3),
    # Individual key feature drift validation
    TestColumnDrift(column_name="sepal_length"),
    TestColumnDrift(column_name="petal_width"),
    # Data quality tests
    TestNumberOfMissingValues(eq=0),
    # Value range validation: sepal_length mean within +/- 3 sigma of reference data
    TestMeanInNSigmas(column_name="sepal_length", n=3),
])
monitoring_suite.run(
    reference_data=reference_data,
    current_data=current_drifted,
)

# Programmatically check test results
suite_result = monitoring_suite.as_dict()
all_passed = all(
    test["status"] == "SUCCESS"
    for test in suite_result["tests"]
)
print(f"All tests passed: {all_passed}")
for test in suite_result["tests"]:
    status_icon = "PASS" if test["status"] == "SUCCESS" else "FAIL"
    print(f"  [{status_icon}] {test['name']}: {test['status']}")

# Use as exit code in CI/CD pipelines
if not all_passed:
    print("ALERT: Data drift or quality anomaly detected. Retraining pipeline trigger required.")
    # sys.exit(1)  # Fail the build in CI
```
5. MLflow Model Registry and Monitoring Integration
MLflow provides experiment tracking, model packaging, and model registry functionality. By recording Evidently's drift detection results in MLflow, you can track performance history and drift status per model version on a single platform.
Logging Drift Metrics to MLflow
```python
import mlflow
import json
from datetime import datetime

from evidently.report import Report
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
)

# MLflow tracking server configuration
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("model-monitoring/fraud-detection-v2")

def log_drift_to_mlflow(
    reference_data,
    current_data,
    model_name: str,
    model_version: str,
    batch_id: str,
):
    """Log drift analysis results to MLflow"""
    # Generate Evidently drift report
    drift_report = Report(metrics=[
        DatasetDriftMetric(),
        DataDriftTable(),
    ])
    drift_report.run(
        reference_data=reference_data,
        current_data=current_data,
    )
    result = drift_report.as_dict()
    drift_result = result["metrics"][0]["result"]

    # Record as MLflow Run
    with mlflow.start_run(run_name=f"drift-check-{batch_id}") as run:
        # Basic drift metrics
        mlflow.log_metric("dataset_drift_detected", int(drift_result["dataset_drift"]))
        mlflow.log_metric("drifted_columns_share", drift_result["share_of_drifted_columns"])
        mlflow.log_metric("number_of_drifted_columns", drift_result["number_of_drifted_columns"])
        mlflow.log_metric("total_columns", drift_result["number_of_columns"])

        # Log individual column drift scores
        column_drift = result["metrics"][1]["result"]["drift_by_columns"]
        for col_name, col_info in column_drift.items():
            safe_col_name = col_name.replace(" ", "_").replace("/", "_")
            mlflow.log_metric(
                f"drift_score_{safe_col_name}",
                col_info.get("drift_score", 0.0),
            )
            mlflow.log_metric(
                f"drift_detected_{safe_col_name}",
                int(col_info.get("drift_detected", False)),
            )

        # Record metadata as tags
        mlflow.set_tags({
            "monitoring.type": "drift_detection",
            "monitoring.model_name": model_name,
            "monitoring.model_version": model_version,
            "monitoring.batch_id": batch_id,
            "monitoring.timestamp": datetime.utcnow().isoformat(),
            "monitoring.reference_size": str(len(reference_data)),
            "monitoring.current_size": str(len(current_data)),
        })

        # Save HTML report as artifact
        report_path = f"/tmp/drift_report_{batch_id}.html"
        drift_report.save_html(report_path)
        mlflow.log_artifact(report_path, artifact_path="drift_reports")

        # Save JSON results as artifact
        json_path = f"/tmp/drift_result_{batch_id}.json"
        with open(json_path, "w") as f:
            json.dump(result, f, indent=2, default=str)
        mlflow.log_artifact(json_path, artifact_path="drift_reports")

    print(f"Drift results logged to MLflow. Run ID: {run.info.run_id}")
    return drift_result["dataset_drift"], drift_result["share_of_drifted_columns"]

# Usage example
is_drifted, drift_share = log_drift_to_mlflow(
    reference_data=reference_data,
    current_data=current_drifted,
    model_name="fraud-detector",
    model_version="3",
    batch_id="2026-03-06-batch-001",
)
```
Alias-Based Model Registry Management
Starting with MLflow 2.x, alias-based model management is recommended over the traditional Stage system (Staging/Production/Archived). You can apply a strategy that automatically switches model aliases based on drift detection results.
```python
from mlflow import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
MODEL_NAME = "fraud-detector"

def handle_drift_detection(
    is_drifted: bool,
    drift_share: float,
    model_name: str = MODEL_NAME,
    drift_threshold_warn: float = 0.2,
    drift_threshold_critical: float = 0.5,
):
    """Perform model registry actions based on drift detection results"""
    # Check current production model version
    try:
        prod_version = client.get_model_version_by_alias(model_name, "production")
        current_version = prod_version.version
        print(f"Current production model version: {current_version}")
    except Exception as e:
        print(f"Failed to retrieve production model alias: {e}")
        return

    if not is_drifted:
        print("No drift detected. Maintaining current model.")
        client.set_model_version_tag(
            model_name, current_version,
            key="last_drift_check",
            value="passed",
        )
        return

    if drift_share >= drift_threshold_critical:
        # Critical drift: immediately switch to fallback model + trigger retraining
        print(f"CRITICAL: Drift ratio {drift_share:.1%} - Switching to fallback model and triggering retraining")
        client.set_model_version_tag(
            model_name, current_version,
            key="drift_status", value="critical",
        )
        # Switch to fallback model if available
        try:
            fallback = client.get_model_version_by_alias(model_name, "fallback")
            client.set_registered_model_alias(model_name, "production", fallback.version)
            print(f"Switched to fallback model version {fallback.version}")
        except Exception:
            print("WARNING: No fallback model available. Maintaining current model; emergency retraining is needed.")
        # Trigger retraining (external system call)
        trigger_retraining(model_name, reason="critical_drift")
    elif drift_share >= drift_threshold_warn:
        # Warning-level drift: record tag + notification
        print(f"WARNING: Drift ratio {drift_share:.1%} - Enhanced monitoring and scheduling retraining")
        client.set_model_version_tag(
            model_name, current_version,
            key="drift_status", value="warning",
        )
        # Add to scheduled retraining queue
        schedule_retraining(model_name, priority="normal")

def trigger_retraining(model_name: str, reason: str):
    """Trigger emergency retraining (call Airflow DAG, Kubeflow Pipeline, etc.)"""
    print(f"Retraining triggered: model={model_name}, reason={reason}")
    # requests.post("http://airflow.internal/api/v1/dags/retrain/dagRuns", ...)

def schedule_retraining(model_name: str, priority: str):
    """Register in scheduled retraining queue"""
    print(f"Retraining scheduled: model={model_name}, priority={priority}")

# Execute
handle_drift_detection(
    is_drifted=True,
    drift_share=0.55,
    model_name=MODEL_NAME,
)
```
6. Building an Automatic Retraining Pipeline
The automated pipeline from drift detection to retraining consists of the following stages.
Overall Pipeline Flow
- Scheduler: Trigger drift check after batch inference or at regular intervals (daily/weekly)
- Drift Analyzer: Analyze current data against reference data with Evidently
- Decision Engine: Determine whether retraining is needed based on drift thresholds
- Retraining Orchestrator: Execute training jobs in Airflow/Kubeflow
- Champion/Challenger Evaluation: Compare and evaluate the new model against the existing model
- Deployment Gate: Auto-deploy if performance criteria are met, rollback on failure
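Steps 5-6 of the flow can be sketched as a small gate function. The metric names, thresholds, and the EvalResult container below are illustrative assumptions, not part of any library; both models should be scored on the same held-out evaluation set.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Hypothetical evaluation record for one model on a shared held-out set."""
    auc: float
    f1: float

def deployment_gate(
    champion: EvalResult,
    challenger: EvalResult,
    min_auc_gain: float = 0.005,
    max_f1_regression: float = 0.01,
) -> str:
    """Promote the challenger only if it beats the champion on the primary
    metric without regressing the secondary one beyond tolerance."""
    if (challenger.auc >= champion.auc + min_auc_gain
            and challenger.f1 >= champion.f1 - max_f1_regression):
        return "promote_challenger"  # e.g. point the 'production' alias at the new version
    return "keep_champion"           # rollback path: champion stays in place

print(deployment_gate(EvalResult(auc=0.91, f1=0.80), EvalResult(auc=0.93, f1=0.81)))
```

The asymmetric thresholds (require a gain on the primary metric, tolerate only a tiny regression on the secondary) are a deliberately conservative default for automated promotion.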
Airflow DAG Integration Pattern
```python
# Airflow DAG example: drift check + conditional retraining
# dag_drift_monitor.py
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.empty import EmptyOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
import pandas as pd

default_args = {
    "owner": "ml-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=30),
}

dag = DAG(
    dag_id="ml_drift_monitor_fraud_detection",
    default_args=default_args,
    description="Daily drift monitoring and conditional retraining",
    schedule_interval="0 6 * * *",  # Daily at 6 AM
    start_date=days_ago(1),
    catchup=False,
    tags=["ml-monitoring", "drift-detection"],
)

def fetch_data(**context):
    """Load reference data and last 24 hours of serving data"""
    from sqlalchemy import create_engine
    engine = create_engine("postgresql://reader:password@db.internal/features")
    reference = pd.read_sql(
        "SELECT * FROM fraud_features_reference", engine
    )
    current = pd.read_sql(
        """SELECT * FROM fraud_features_serving
           WHERE created_at >= NOW() - INTERVAL '24 hours'""",
        engine,
    )
    # Pass paths via XCom (store large data in S3)
    ref_path = "/tmp/reference_data.parquet"
    cur_path = "/tmp/current_data.parquet"
    reference.to_parquet(ref_path)
    current.to_parquet(cur_path)
    context["ti"].xcom_push(key="reference_path", value=ref_path)
    context["ti"].xcom_push(key="current_path", value=cur_path)
    context["ti"].xcom_push(key="current_size", value=len(current))

def run_drift_check(**context):
    """Run Evidently drift analysis and log to MLflow"""
    from evidently.report import Report
    from evidently.metrics import DatasetDriftMetric, DataDriftTable
    import mlflow
    ti = context["ti"]
    ref_path = ti.xcom_pull(key="reference_path")
    cur_path = ti.xcom_pull(key="current_path")
    reference = pd.read_parquet(ref_path)
    current = pd.read_parquet(cur_path)
    # Validate minimum sample count; decide_action treats a missing
    # drift_share XCom as "skip", so simply returning is enough here
    if len(current) < 100:
        print(f"Insufficient current data samples: {len(current)}. Skipping drift check.")
        return
    report = Report(metrics=[DatasetDriftMetric(), DataDriftTable()])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()
    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]
    # Log to MLflow
    mlflow.set_tracking_uri("http://mlflow.internal:5000")
    mlflow.set_experiment("monitoring/fraud-detection")
    with mlflow.start_run(run_name=f"drift-{context['ds']}"):
        mlflow.log_metric("drift_detected", int(drift_detected))
        mlflow.log_metric("drift_share", drift_share)
    ti.xcom_push(key="drift_detected", value=drift_detected)
    ti.xcom_push(key="drift_share", value=drift_share)

def decide_action(**context):
    """Decide whether to retrain based on drift level"""
    ti = context["ti"]
    drift_share = ti.xcom_pull(key="drift_share")
    if drift_share is None or drift_share < 0.2:
        return "skip_retraining"
    elif drift_share >= 0.5:
        return "trigger_emergency_retrain"
    else:
        return "trigger_scheduled_retrain"

fetch_task = PythonOperator(
    task_id="fetch_data", python_callable=fetch_data, dag=dag,
)
drift_task = PythonOperator(
    task_id="run_drift_check", python_callable=run_drift_check, dag=dag,
)
branch_task = BranchPythonOperator(
    task_id="decide_action", python_callable=decide_action, dag=dag,
)
skip_task = EmptyOperator(task_id="skip_retraining", dag=dag)
scheduled_retrain = EmptyOperator(task_id="trigger_scheduled_retrain", dag=dag)
emergency_retrain = EmptyOperator(task_id="trigger_emergency_retrain", dag=dag)

fetch_task >> drift_task >> branch_task >> [skip_task, scheduled_retrain, emergency_retrain]
```
Retraining Trigger Threshold Guidelines
| Drift Level | drift_share Range | Recommended Action | Response Time |
|---|---|---|---|
| Normal | 0% ~ 15% | Maintain monitoring | - |
| Caution | 15% ~ 30% | Send alert, begin root cause analysis | Within 48 hrs |
| Warning | 30% ~ 50% | Register in scheduled retraining queue | Within 24 hrs |
| Critical | 50% or above | Immediate retraining + fallback model switch | Immediately |
Note: Thresholds should be adjusted based on domain and model characteristics. Domains with high missed detection costs like financial fraud detection should use lower thresholds (10-20%), while domains with wider tolerance like recommendation systems should apply higher thresholds (30-50%).
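The tiers in the table map naturally onto a small policy function; the threshold constants below mirror the guideline values and, as the note says, should be tuned per domain.

```python
def drift_action(drift_share: float) -> tuple[str, str]:
    """Map the drifted-column share onto the action tiers in the table above.
    Threshold values follow the guidelines and are meant to be tuned."""
    if drift_share >= 0.50:
        return "critical", "immediate retraining + fallback model switch"
    if drift_share >= 0.30:
        return "warning", "register in scheduled retraining queue (within 24 hrs)"
    if drift_share >= 0.15:
        return "caution", "send alert, begin root cause analysis (within 48 hrs)"
    return "normal", "maintain monitoring"

for share in (0.05, 0.22, 0.41, 0.63):
    level, action = drift_action(share)
    print(f"drift_share={share:.0%} -> {level}: {action}")
```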
7. Monitoring Tool Comparison: Evidently vs NannyML vs WhyLabs vs Alibi Detect
There are several options for production ML monitoring tools. Let's compare the strengths and weaknesses of each.
| Criteria | Evidently AI | NannyML | WhyLabs | Alibi Detect |
|---|---|---|---|---|
| License | Apache 2.0 (OSS) | BSD-3 (OSS) | SaaS + Free tier | BSD-3 (OSS) |
| Core Strength | General data/model monitoring | Label-free performance estimation (CBPE) | Real-time streaming profiling | Advanced drift detection algorithms |
| Number of Drift Methods | 20+ | 10+ | 15+ | 15+ |
| Label-free Performance Estimation | Limited | Core feature (CBPE, DLE) | Not supported | Not supported |
| Real-time Monitoring | Collector mode | Not supported (batch) | Native support | Not supported (batch) |
| Visualization | Built-in HTML/dashboard | Built-in HTML | Web dashboard (SaaS) | Basic visualization |
| CI/CD Integration | Test Suite (native) | Limited | API-based | Manual configuration required |
| Prometheus Integration | Officially supported | Custom required | Built-in | Custom required |
| MLflow Integration | Easy (Python native) | Manual configuration | API integration | Manual configuration |
| Learning Curve | Low | Medium | Low (SaaS) | High |
| Production Use Cases | General purpose | Label-delayed environments | Large-scale real-time | Research/advanced detection |
Selection Guide:
- Environments where labels cannot be obtained immediately (e.g., financial fraud detection where label confirmation takes months): NannyML's CBPE (Confidence-Based Performance Estimation) is effectively the only option among these four tools.
- Open-source first + rapid adoption: Evidently AI provides the widest feature range with the lowest adoption barrier.
- Large-scale real-time streaming: WhyLabs' data profiling is optimized for processing tens of thousands of records per second.
- Research environments needing advanced statistical detection: Alibi Detect's deep kernel MMD and Learned Kernel drift detection are well-suited.
8. Grafana/Prometheus Dashboard Configuration
Let's look at how to expose Evidently's monitoring results as Prometheus metrics and visualize them as time series on Grafana dashboards.
Prometheus Metrics Export
```python
# prometheus_drift_exporter.py
from prometheus_client import start_http_server, Gauge, Counter, Histogram
from evidently.report import Report
from evidently.metrics import DatasetDriftMetric, DataDriftTable
import pandas as pd
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Prometheus metric definitions
DRIFT_DETECTED = Gauge(
    "ml_model_drift_detected",
    "Dataset drift detection status (0/1)",
    ["model_name", "model_version"],
)
DRIFT_SHARE = Gauge(
    "ml_model_drift_column_share",
    "Share of drifted columns",
    ["model_name", "model_version"],
)
COLUMN_DRIFT_SCORE = Gauge(
    "ml_model_column_drift_score",
    "Individual column drift score",
    ["model_name", "model_version", "column_name"],
)
DRIFT_CHECK_TOTAL = Counter(
    "ml_model_drift_checks_total",
    "Total drift check executions",
    ["model_name"],
)
DRIFT_CHECK_DURATION = Histogram(
    "ml_model_drift_check_duration_seconds",
    "Drift check execution duration",
    ["model_name"],
    buckets=[0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0],
)

MODEL_NAME = "fraud-detector"
MODEL_VERSION = "3"

def run_periodic_drift_check(
    reference_path: str,
    current_query_fn,
    interval_seconds: int = 300,
):
    """Periodic drift check and Prometheus metric update"""
    reference = pd.read_parquet(reference_path)
    while True:
        try:
            start_time = time.time()
            # Load recent data
            current = current_query_fn()
            if current is None or len(current) < 50:
                logger.warning(f"Insufficient current data: {len(current) if current is not None else 0} records")
                time.sleep(interval_seconds)
                continue
            # Filter to feature columns only (exclude target and metadata columns)
            feature_cols = [c for c in reference.columns if c not in ["target", "id", "timestamp"]]
            ref_features = reference[feature_cols]
            cur_features = current[feature_cols]
            # Drift analysis
            report = Report(metrics=[DatasetDriftMetric(), DataDriftTable()])
            report.run(reference_data=ref_features, current_data=cur_features)
            result = report.as_dict()
            drift_result = result["metrics"][0]["result"]
            column_results = result["metrics"][1]["result"]["drift_by_columns"]
            # Update Prometheus metrics
            DRIFT_DETECTED.labels(MODEL_NAME, MODEL_VERSION).set(
                int(drift_result["dataset_drift"])
            )
            DRIFT_SHARE.labels(MODEL_NAME, MODEL_VERSION).set(
                drift_result["share_of_drifted_columns"]
            )
            for col_name, col_info in column_results.items():
                COLUMN_DRIFT_SCORE.labels(MODEL_NAME, MODEL_VERSION, col_name).set(
                    col_info.get("drift_score", 0.0)
                )
            DRIFT_CHECK_TOTAL.labels(MODEL_NAME).inc()
            duration = time.time() - start_time
            DRIFT_CHECK_DURATION.labels(MODEL_NAME).observe(duration)
            logger.info(
                f"Drift check complete: drift={drift_result['dataset_drift']}, "
                f"share={drift_result['share_of_drifted_columns']:.2%}, "
                f"duration={duration:.1f}s"
            )
        except Exception as e:
            logger.error(f"Drift check failed: {e}", exc_info=True)
        time.sleep(interval_seconds)

if __name__ == "__main__":
    # Start Prometheus metrics HTTP server (port 8000)
    start_http_server(8000)
    logger.info("Prometheus metrics exporter started (port 8000)")
    # Start periodic drift check (5-minute intervals)
    run_periodic_drift_check(
        reference_path="/data/reference/fraud_features_v3.parquet",
        current_query_fn=lambda: pd.read_parquet("/data/serving/latest_batch.parquet"),
        interval_seconds=300,
    )
```
Grafana Dashboard Components
Configure the following panels in Grafana to comprehensively monitor ML model health status.
| Panel | Metric | Visualization Type | Alert Rule |
|---|---|---|---|
| Drift Status | ml_model_drift_detected | Stat (latest) | Critical alert when value is 1 |
| Drifted Column Ratio Trend | ml_model_drift_column_share | Time Series | Warning when exceeding 30% |
| Per-Column Drift Score | ml_model_column_drift_score | Heatmap | Highlight columns exceeding threshold |
| Check Duration | ml_model_drift_check_duration_seconds | Histogram | Warning when exceeding 60s |
| Check Execution Count | rate(ml_model_drift_checks_total[1h]) | Time Series | Alert when 0 (check stalled) |
Alertmanager Alert Rules Example
```yaml
# prometheus-alerts.yaml
groups:
  - name: ml_model_drift_alerts
    rules:
      - alert: MLModelDriftDetected
        expr: ml_model_drift_detected == 1
        for: 5m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: 'ML model data drift detected'
          description: 'Data drift detected in model {{ $labels.model_name }} v{{ $labels.model_version }}.'
      - alert: MLModelCriticalDrift
        expr: ml_model_drift_column_share > 0.5
        for: 0m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: 'ML model critical drift - Immediate action required'
          description: 'Model {{ $labels.model_name }} drifted column ratio is {{ $value | humanizePercentage }}. Immediate retraining or fallback switch is required.'
      - alert: MLDriftCheckStalled
        expr: rate(ml_model_drift_checks_total[1h]) == 0
        for: 30m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: 'ML drift check stalled'
          description: 'Drift check for model {{ $labels.model_name }} has not run in the past hour. Monitoring pipeline inspection is needed.'
```
9. Operational Considerations
False Positive Drift Management
The most common pitfall of statistical drift detection is false positives. Drift may be falsely detected in the following situations even when there is no actual problem.
Sample Size Effect: When the current data sample size is very large, KS tests and Chi-squared tests detect statistically significant but practically meaningless differences as drift. Complement with effect-size-based metrics such as PSI or Wasserstein distance to verify practical significance.
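A quick way to see the sample-size effect: the sketch below (the 0.02-sigma shift size and sample counts are arbitrary choices of ours) applies the same practically negligible shift at two sample sizes. At the large size, the KS test flags it as statistically significant, while the Wasserstein distance confirms the effect is tiny.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# A practically negligible mean shift of 0.02 standard deviations
for n in (500, 200_000):
    ref = rng.normal(0.00, 1.0, n)
    cur = rng.normal(0.02, 1.0, n)
    ks = stats.ks_2samp(ref, cur)
    w = stats.wasserstein_distance(ref, cur)
    # At n=200,000 the p-value collapses even though the effect size stays tiny
    print(f"n={n:>7}: KS p-value={ks.pvalue:.4f}, KS stat={ks.statistic:.4f}, Wasserstein={w:.4f}")
```

This is why p-value-only alerting on high-traffic models floods on-call with noise: pair the hypothesis test with an effect-size metric and alert only when both agree.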
Seasonality: Purchase patterns during Black Friday in e-commerce are distinctly different from normal periods. If detected as drift, unnecessary alerts flood in at the same time every year. Set reference data to historical data from the same period, or apply seasonal adjustment logic.
Inter-feature Correlations: Drift detection on individual features alone cannot capture multivariate distribution changes. There are cases where features A and B each have similar distributions, but the correlation between A and B has changed. Evidently's DatasetDriftMetric provides dataset-level judgment, but if explicit multivariate detection is needed, consider Alibi Detect's MMD (Maximum Mean Discrepancy) method.
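If a full MMD test is overkill, a lightweight first step is comparing correlation matrices directly. The helper below (its name and the 0.2 threshold are our illustrative choices) flags feature pairs whose Pearson correlation moved, even when each marginal distribution looks unchanged:

```python
import numpy as np
import pandas as pd

def correlation_drift(reference: pd.DataFrame, current: pd.DataFrame,
                      threshold: float = 0.2):
    """Return feature pairs whose Pearson correlation changed by more
    than `threshold` between reference and current data."""
    delta = (current.corr() - reference.corr()).abs()
    return [
        (a, b, round(float(delta.loc[a, b]), 3))
        for i, a in enumerate(delta.columns)
        for b in delta.columns[i + 1:]
        if delta.loc[a, b] > threshold
    ]

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(0, 1, n)
# Reference: x and y strongly correlated
ref = pd.DataFrame({"x": x, "y": 0.9 * x + rng.normal(0, 0.3, n)})
# Current: nearly identical marginals, but the x-y link is broken
cur = pd.DataFrame({"x": rng.normal(0, 1, n), "y": rng.normal(0, 0.95, n)})

print(correlation_drift(ref, cur))  # the (x, y) pair is flagged
```

Per-column KS or PSI would see almost nothing here, since both marginals barely move; only the joint structure changed.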
Reference Data Management Strategies
Reference data is the baseline for drift detection. Incorrect reference data invalidates all detection results.
| Strategy | Description | Suitable For | Caution |
|---|---|---|---|
| Fixed Training Data | Fix data used for model training as reference | Stable domains with little change | Reference itself becomes outdated over time |
| Sliding Window | Update reference with data from recent N days/weeks | Environments where gradual change is normal | Risk of missing gradual drift |
| Update at Retraining | Update reference each time model is retrained | Pipelines with regular retraining | Dependent on retraining cycle |
| Dual Baseline | Compare against both training data and recent stable period data | Environments requiring high accuracy | Increased management complexity |
Key Point: Reference data should be version-controlled and tracked with a 1:1 mapping to model versions. Storing reference data snapshots as MLflow artifacts is recommended.
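The sliding-window strategy from the table can be sketched in a few lines; the column name and the 28-day window below are illustrative assumptions:

```python
import pandas as pd

def sliding_window_reference(history: pd.DataFrame, window_days: int = 28,
                             timestamp_col: str = "created_at") -> pd.DataFrame:
    """Sliding-window reference: keep only the most recent N days of data."""
    cutoff = history[timestamp_col].max() - pd.Timedelta(days=window_days)
    return history[history[timestamp_col] >= cutoff]

# Toy history spanning 60 days
history = pd.DataFrame({
    "created_at": pd.date_range("2026-01-01", periods=60, freq="D"),
    "value": range(60),
})
ref = sliding_window_reference(history, window_days=28)
print(f"{len(ref)} rows, from {ref['created_at'].min().date()}")
```

The caution from the table applies: because the window keeps moving, slow gradual drift is absorbed into the reference, so this strategy should be paired with a fixed or golden baseline.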
Feature Store Integration
If feature computation logic differs between offline training and online serving (Training-Serving Skew), fake drift caused by implementation inconsistency rather than actual drift will occur. Using a feature store like Feast to ensure feature consistency between training and serving is the fundamental solution.
10. Failure Cases and Recovery Procedures
Case 1: Silent Model Degradation
Situation: An e-commerce recommendation model gradually degraded over 3 months. CTR dropped from 12% to 7%, but drift monitoring was configured only at the individual feature level and failed to detect it.
Root Cause: Multivariate change in user behavior patterns. Individual feature distributions (views, dwell time, category ratio) didn't change significantly, but the correlations between features changed. Specifically, the "dwell time - purchase conversion" relationship weakened due to changes in short-form content consumption patterns.
Recovery Procedure:
- Added multivariate drift detection (feature correlation matrix comparison)
- Added a concept drift monitoring layer that directly monitors business KPIs (CTR, conversion rate)
- Retrained model with last 2 weeks of data and deployed via A/B testing
- Shortened retraining cycle from monthly to weekly
Lesson: Data drift alone is insufficient to capture concept drift. Business metric monitoring must always be conducted in parallel.
Case 2: Fake Drift from Data Pipeline Failure
Situation: Late Friday night, a flood of Critical drift alerts. Over 80% column drift detected across 3 models simultaneously.
Root Cause: An upstream data pipeline ETL job failed, causing some columns in the serving feature table to be filled with default values (0 or null). A data quality issue was falsely detected as drift.
Recovery Procedure:
- Placed TestNumberOfMissingValues and TestShareOfOutRangeValues in the Evidently TestSuite before the drift check stage
- Skip drift check on data quality failure and send a separate data pipeline alert
- Added data completeness validation gate to upstream ETL
- Included "recent data quality check results" information in drift alerts
Lesson: A data quality validation step must always be placed before the drift detection pipeline. Distinguishing between data quality issues and actual distribution changes is the key to operations.
Case 3: Reference Data Contamination
Situation: After model retraining, reference data was updated with the new dataset. Afterwards, no drift was detected at all, rendering the monitoring system useless.
Root Cause: The data used for retraining already contained drift, and this contaminated data became the new reference. As a result, the drift was "normalized" and the baseline was reset.
Recovery Procedure:
- Automated drift comparison between new and previous reference data during reference update
- Added a gate to block reference updates when drift ratio exceeds a certain level
- Version-controlled reference data change history as MLflow artifacts
- Periodically compared against a golden dataset (manually verified high-quality data)
Lesson: Reference data is the baseline of the monitoring system, so any changes must go through a validation process.
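The update gate from the recovery procedure can be sketched as follows. Per-column KS here is a stand-in for a full Evidently comparison, and the function name and thresholds are our illustrative assumptions:

```python
import numpy as np
import pandas as pd
from scipy import stats

def reference_update_gate(old_ref: pd.DataFrame, new_ref: pd.DataFrame,
                          max_drifted_share: float = 0.3,
                          p_threshold: float = 0.05) -> bool:
    """Allow a reference swap only if the candidate reference does not
    drift too far from the previous one (guards against contamination)."""
    numeric_cols = old_ref.select_dtypes(include="number").columns
    drifted = sum(
        stats.ks_2samp(old_ref[col], new_ref[col]).pvalue < p_threshold
        for col in numeric_cols
    )
    share = drifted / len(numeric_cols)
    print(f"Drifted columns: {drifted}/{len(numeric_cols)} ({share:.0%})")
    return bool(share <= max_drifted_share)

rng = np.random.default_rng(3)
old = pd.DataFrame({"a": rng.normal(0, 1, 2000), "b": rng.normal(5, 2, 2000)})
# Same rows reshuffled: identical distribution, update allowed
print(reference_update_gate(old, old.sample(frac=1.0, random_state=0)))
# Heavily shifted candidate: update blocked, escalate for review
bad_new = pd.DataFrame({"a": rng.normal(2, 1, 2000), "b": rng.normal(9, 2, 2000)})
print(reference_update_gate(old, bad_new))
```

A blocked update should not silently keep the old reference forever; it should page a human, since either the world genuinely changed (retrain first, then update) or the candidate data is contaminated.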
11. Production Monitoring Checklist
Pre-deployment Checklist
- Is reference data version-controlled alongside model versions
- Are Evidently Report/TestSuite integrated into the deployment pipeline
- Are drift thresholds adjusted for domain characteristics
- Is data quality validation placed before the drift detection stage
- Is a fallback model registered in the registry
- Are Grafana dashboards and alert rules configured
Operational Checklist
- Are drift checks running on a normal schedule (monitoring the monitor)
- Is the false positive rate at a manageable level (recommended under 5 per month)
- Is reference data being updated at appropriate intervals
- Are retraining triggers working properly, and is champion/challenger evaluation being performed
- Are business KPIs and model performance metrics being tracked together
- Is the mean time to respond (MTTR) after receiving alerts within SLA
Concept Drift Response Checklist
- Is a label acquisition pipeline built (including delayed labels)
- Are model performance metrics (Accuracy, F1, AUC) time-series trends being monitored
- Are proxy metrics defined for periods without labels
- Is A/B testing infrastructure ready
12. References
- Evidently AI - Data Drift Official Guide - A comprehensive guide covering data drift concepts, detection methodologies, and real-world cases.
- Evidently AI GitHub Repository - Open-source code, example notebooks, and community discussions.
- MLflow Model Registry Official Documentation - Model registry API, alias system, and deployment workflow guide.
- Evidently AI Official Documentation - Complete API reference and tutorials for Report, TestSuite, and Metric.
- Advanced ML Model Monitoring: Drift Detection, Explainability, and Automated Retraining - Advanced patterns for drift detection and automatic retraining pipelines.
- Google - ML Technical Debt (Hidden Technical Debt in Machine Learning Systems) - Foundational paper on technical debt and monitoring needs in ML systems.
- NannyML - Estimating Model Performance without Ground Truth - CBPE methodology for estimating model performance without labels.