MLOps Complete Guide — Model Serving, Feature Store, Drift, A/B Testing, GPU Economics (Season 2 Ep 7, 2025)

Intro — Why MLOps isn't just "DevOps + ML"

DevOps deploys code. MLOps deploys, monitors, and versions code + data + models simultaneously.

Four reasons MLOps is uniquely hard:

Reproducibility: same code + same data yields different models (randomness, hardware differences)
Drift: when data distributions shift, models rot in real time
Latency: training is batch, serving is real-time — architecture must split
Cost: one GPU costs $2,000 to $30,000/month. A poorly designed setup can burn through a startup's entire budget

In 2024 to 2025, with the LLM era, MLOps expanded into LLMOps. This post covers both.

Part 1 — 5 Levels of MLOps Maturity (an extension of Google's Level 0 to 2 framework)

Level	Characteristics
Lv.0	Manual: notebook to manual deploy. Small-scale experiments
Lv.1	Automated training pipeline + manual deploy
Lv.2	Auto training + auto deploy + monitoring
Lv.3	Auto retraining (triggered on drift detection)
Lv.4	Fully automated + linked to business metrics

Most enterprises live at Lv.1 to Lv.2. Lv.3+ is Netflix, Uber, Airbnb territory.

Part 2 — Model Serving: the Inference System

2.1 General ML Serving

Tool	Strength	Use
TorchServe	PyTorch native	PyTorch standard
TensorFlow Serving	Mature, long-standing	TF models
Triton Inference Server (NVIDIA)	Multi-framework, dynamic batching	Production standard
BentoML	Python-friendly	Fast prototyping
KServe	Kubernetes native	K8s environments

2.2 LLM Serving (2024 to 2025 standard)

Tool	Characteristics
vLLM	PagedAttention, dominant throughput, open-source standard
TGI (HuggingFace)	Written in Rust, stable
TensorRT-LLM	NVIDIA optimized, top performance
SGLang	Optimized for complex workflows
llama.cpp	CPU, Mac, edge

Default for open-source LLM production in 2025: vLLM.

2.3 vLLM's Innovation: PagedAttention

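Classic attention KV caching reserves one contiguous memory region per request, sized for the maximum sequence length. Fragmentation and over-reservation leave 60 to 80% of that KV-cache memory unused.

PagedAttention manages the KV cache in fixed-size blocks, much like OS virtual memory pages. Waste drops below 4%, which allows larger batches and 2 to 4x higher throughput on concurrent requests.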
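The difference between the two allocation schemes can be illustrated with a little arithmetic. The sketch below is not vLLM code: the block size of 16 tokens, the 2048-token reservation, and the request lengths are all assumptions chosen for illustration.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV block (assumed value for this sketch)


def contiguous_waste(seq_lens, max_len):
    """Fraction of reserved KV slots left unused when every request
    pre-reserves max_len contiguous token slots."""
    reserved = len(seq_lens) * max_len
    used = sum(seq_lens)
    return 1 - used / reserved


def paged_waste(seq_lens, block_tokens=BLOCK_TOKENS):
    """Fraction of reserved KV slots left unused when slots are handed out
    in fixed-size blocks on demand. Only the last block of each request
    can be partially empty."""
    reserved = sum(math.ceil(n / block_tokens) * block_tokens for n in seq_lens)
    used = sum(seq_lens)
    return 1 - used / reserved


seq_lens = [130, 512, 57, 1024, 300]  # hypothetical final sequence lengths
print(f"contiguous: {contiguous_waste(seq_lens, max_len=2048):.1%}")
print(f"paged:      {paged_waste(seq_lens):.1%}")
```

With these hypothetical lengths the contiguous scheme leaves about 80% of the reserved slots empty, while the block-based scheme leaves about 1%.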

2.4 Four Serving Patterns

Online (real-time): millisecond response. API server.
Batch: bulk prediction (nightly jobs). Efficient.
Streaming: event-driven (Kafka to model).
Edge: on-device (mobile, IoT).

2.5 Serving Performance Metrics

Latency (P50, P95, P99): response time
Throughput (QPS): requests per second
TTFT (Time to First Token, LLM): time to first token
TPS (Tokens Per Second, LLM): generation speed
GPU Utilization: target 60 to 80% (too low wastes; too high explodes latency)
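As a rough sketch of how the latency and throughput numbers above might be derived from raw request logs, the snippet below uses the nearest-rank percentile definition. The function names and the choice of percentile method are assumptions; monitoring systems often use interpolated or histogram-based estimates instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, window_seconds):
    """Latency percentiles and average QPS for one observation window."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "qps": len(latencies_ms) / window_seconds,
    }


# Hypothetical window: mostly fast requests plus a slow tail
latencies = [20] * 95 + [400] * 5
print(summarize(latencies, window_seconds=10))
```

The tail percentiles matter because an average hides them: in the hypothetical window P50 is 20 ms, while P99 is 400 ms.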

Part 3 — Feature Store: the ML Feature Compute Layer

3.1 Why a Feature Store

Prevents the disaster of feature computation diverging between training and serving.

Example: "purchase amount over last 7 days". If the time boundary or aggregation logic differs even slightly between the training data and live requests (training-serving skew), the model sees inputs it was never trained on and its performance can drop sharply.
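To make the 7-day example concrete, here is a minimal sketch of how two slightly different window definitions produce different feature values for the same user. The purchase data and both function names are hypothetical.

```python
from datetime import datetime, timedelta

purchases = [  # (timestamp, amount) for one hypothetical user
    (datetime(2025, 1, 1, 0, 0), 100.0),
    (datetime(2025, 1, 4, 12, 0), 40.0),
    (datetime(2025, 1, 8, 0, 0), 25.0),
]


def purchase_amount_7d(events, as_of):
    """Serving-side definition: sum over the window (as_of - 7d, as_of]."""
    start = as_of - timedelta(days=7)
    return sum(amount for ts, amount in events if start < ts <= as_of)


def purchase_amount_7d_batch(events, as_of):
    """Divergent training-side copy: same window, but the start is inclusive."""
    start = as_of - timedelta(days=7)
    return sum(amount for ts, amount in events if start <= ts <= as_of)


as_of = datetime(2025, 1, 8, 0, 0)
print(purchase_amount_7d(purchases, as_of))        # excludes the Jan 1 purchase
print(purchase_amount_7d_batch(purchases, as_of))  # includes it
```

A single `<` versus `<=` changes the feature from 65.0 to 165.0 for this user, which is the kind of mismatch a shared feature definition is meant to rule out.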

3.2 Three Roles of a Feature Store

Offline Store (training): Parquet, BigQuery, Snowflake — bulk lookup for training
Online Store (serving): Redis, DynamoDB — low-latency lookup
Feature Definition: central registry of "what this feature is and how it's computed"

3.3 2025 Feature Store Options

Tool	Characteristics
Feast	Open-source standard, lightweight
Tecton	Commercial, enterprise
Hopsworks	End-to-end platform
Databricks Feature Store	Delta Lake integrated
Self-built	Redis + S3 + metadata DB

3.4 Feast Example

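from datetime import timedelta

import pandas as pd
from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field, ValueType
from feast.types import Float32, Int64

# Placeholder table name; point it at the table holding the precomputed 7-day aggregates
bigquery_source = BigQuerySource(
    table="my_project.my_dataset.user_activity_7d",
    timestamp_field="event_timestamp",
)

user = Entity(name="user_id", join_keys=["user_id"], value_type=ValueType.INT64)

user_activity = FeatureView(
    name="user_activity_7d",
    entities=[user],
    schema=[  # recent Feast releases take `schema` with Field objects
        Field(name="purchase_amount_7d", dtype=Float32),
        Field(name="click_count_7d", dtype=Int64),
    ],
    source=bigquery_source,
    online=True,
    ttl=timedelta(days=7),
)

store = FeatureStore(repo_path=".")  # assumes a Feast repo in the current directory
feature_refs = [
    "user_activity_7d:purchase_amount_7d",
    "user_activity_7d:click_count_7d",
]

# Training: point-in-time join against an entity DataFrame (user_id + event_timestamp)
entity_df = pd.DataFrame({"user_id": [1], "event_timestamp": [pd.Timestamp.now(tz="UTC")]})
features = store.get_historical_features(entity_df=entity_df, features=feature_refs).to_df()

# Serving: low-latency lookup from the online store
features = store.get_online_features(
    features=feature_refs, entity_rows=[{"user_id": 1}]
).to_dict()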

Part 4 — Training Infra: Large-Scale Training

4.1 Single GPU to Distributed Training Progression

Single GPU (models up to roughly 7B parameters)
Data Parallel (DP): multiple GPUs replicate model + different data
Distributed Data Parallel (DDP): DP improved, All-Reduce for gradient sync
Model Parallel: model too big, split it
Tensor Parallel: split within a single layer (Megatron-LM)
Pipeline Parallel: distribute by layer across GPUs
3D Parallel: DP + TP + PP combined (GPT-4 class)

4.2 2025 Distributed Training Tools

Tool	Characteristics
PyTorch DDP	Standard
DeepSpeed (MS)	ZeRO optimization, essential for LLMs
FSDP (Meta)	PyTorch native, DeepSpeed alternative
Megatron-LM (NVIDIA)	Ultra-large models
Ray Train	Unified interface
Determined AI	Integrated experiment tracking

4.3 ZeRO (Zero Redundancy Optimizer)

Shards optimizer state across GPUs, dramatically reducing memory:

ZeRO-1: optimizer state sharding
ZeRO-2: + gradient sharding
ZeRO-3: + model parameter sharding (similar to FSDP)
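The effect of each stage can be estimated with the accounting commonly cited from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The sketch below applies that accounting; it covers model state only and ignores activations and buffers, and the function name is an assumption.

```python
def zero_memory_gb(params_billion, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (1e9 bytes).

    Bytes per parameter with mixed-precision Adam:
    2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 weights, momentum, variance).
    stage 0 means plain data parallelism with full replication.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3: also shard parameters
    return params_billion * (weights + grads + optim)


for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7.5, 64, stage):.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this gives roughly 120 GB, 31.4 GB, 16.6 GB, and 1.9 GB per GPU for stages 0 through 3.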

4.4 Lightweight Fine-tuning Techniques

Technique	Savings
LoRA	Only ~1% of parameters trained
QLoRA	LoRA + 4-bit quantization of the frozen base model; makes fine-tuning a model in the 65B to 70B range feasible on a single high-memory GPU
DoRA	LoRA improved (magnitude/direction split)
GaLore	Near full-parameter quality + memory savings
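The savings in the table follow from simple parameter counting. As a sketch, assume a frozen weight matrix of shape d_out x d_in adapted with two low-rank matrices of rank r, and a base model stored at a given bit width. The function names are hypothetical, and the memory figure covers base weights only, ignoring quantization constants, adapters, optimizer state, and activations.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable share when a frozen d_out x d_in weight W is adapted as
    W + B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return adapter / full


def base_weight_gb(params_billion, bits):
    """Memory for the frozen base weights alone, in GB (1e9 bytes)."""
    return params_billion * bits / 8


print(f"{lora_trainable_fraction(4096, 4096, rank=8):.2%} trainable at rank 8")
print(f"70B base weights: {base_weight_gb(70, 16):.0f} GB at 16-bit, "
      f"{base_weight_gb(70, 4):.0f} GB at 4-bit")
```

A rank-8 adapter on a 4096 x 4096 layer trains about 0.39% of that layer's parameters, and 4-bit storage shrinks 70B base weights from 140 GB to 35 GB, which is why the quantized base model can fit on one large GPU.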

Part 5 — Experiment Tracking: Recording Experiments

5.1 Why It Matters

"Why was that model 3 months ago so good?" — no answer = no reproducibility = worthless.

5.2 2025 Tool Comparison

Tool	Pros	Cons
MLflow	Open-source, self-hostable	Plain UI
Weights & Biases	Best UI, great for collab	SaaS, cost
Neptune.ai	Strong metadata	Best suited to small and mid-sized teams
Comet	W&B alternative	Best suited to small and mid-sized teams
ClearML	Open-source, includes pipelines	Learning curve

5.3 MLflow Basics

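import mlflow
import mlflow.pytorch

# Assumes model, loader, val_loader, train() and validate() are defined elsewhere
epochs = 10

mlflow.set_experiment("image_classifier")

with mlflow.start_run():
    # Hyperparameters are logged once per run
    mlflow.log_params({"lr": 0.001, "batch_size": 64, "epochs": epochs})

    for epoch in range(epochs):
        train_loss = train(model, loader)
        val_acc = validate(model, val_loader)
        # step=epoch lets the UI plot each metric as a curve over training
        mlflow.log_metrics({"train_loss": train_loss, "val_acc": val_acc}, step=epoch)

    # Store the trained model and any extra files alongside the run
    mlflow.pytorch.log_model(model, "model")
    mlflow.log_artifact("confusion_matrix.png")  # the file must already exist on disk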

5.4 Ten Things to Track

Hyperparameters: LR, batch size, optimizer
Data version: DVC or Delta Lake hash
Code version: git commit hash
Metrics: train/val loss, acc, AUC, etc.
Model artifact: weights, architecture
Environment: Python version, GPU type, requirements
Training time: total and per-epoch
Resources: GPU memory, CPU usage
Random seed: reproducibility
Dataset stats: class distribution, sample count

Part 6 — Drift Detection: How Models Rot

6.1 Three Kinds of Drift

Data Drift (Covariate Shift): input distribution changes
- Example: pre/post-COVID shopping patterns
Concept Drift: input to output relationship changes
- Example: the definition of spam itself shifts
Label Drift: label distribution changes
- Example: fraud rate jumps from 1% to 5%

6.2 Detection Methods

Statistical:

KS Test (single feature)
PSI (Population Stability Index)
Wasserstein Distance
Chi-Square (categorical)
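As an illustration of the statistical approach, here is a minimal PSI sketch for a single numeric feature. The bin edges, the epsilon used to avoid taking the log of zero, and the sample data are assumptions; values outside the bin edges are simply ignored in this version.

```python
import math


def _proportions(values, edges, eps):
    """Share of values per bin; the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            below_upper = v < edges[i + 1] or (is_last and v == edges[-1])
            if edges[i] <= v and below_upper:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [max(c / total, eps) for c in counts]


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between a baseline sample (expected)
    and a production sample (actual) over shared bins."""
    e = _proportions(expected, edges, eps)
    a = _proportions(actual, edges, eps)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [0, 1, 2, 3, 4]
baseline = [0.5] * 25 + [1.5] * 25 + [2.5] * 25 + [3.5] * 25
production = [0.5] * 40 + [1.5] * 30 + [2.5] * 20 + [3.5] * 10
print(f"PSI = {psi(baseline, production, edges):.3f}")
```

A widely used rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant shift. These thresholds are conventions rather than statistical guarantees, so they are worth calibrating per feature.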

ML-based:

Domain Classifier (training vs. production classifier)
Autoencoder reconstruction error

Performance-based:

Actual performance after delayed labels arrive
Proxy metrics (CTR, conversion rate)

6.3 2025 Drift Tools

Evidently AI: open-source, dashboards
Arize AI: commercial, LLM + ML unified
WhyLabs: data quality + drift
Fiddler: enterprise

6.4 LLM-Specific Problems

Prompt Drift: prompt distribution changes (user trends)
Response Drift: response quality degradation (model update effects)
Cost Drift: average token count creeps up — costs explode

Part 7 — Model A/B Testing and Deployment Strategies

7.1 Five Deployment Strategies

Strategy	Description	Risk
Blue-Green	Full swap of old/new environments	Medium
Canary	N% on new model, gradual expansion	Low
A/B	Randomly split users into control and treatment groups (often 50/50)	Low (stats required)
Shadow	New model processes real traffic, response comes from old	Safest
Multi-armed Bandit	Automatic traffic reallocation	Intelligent

7.2 A/B Test Design

Hypothesis: "new model lifts CTR by at least 5%"
Sample size calculation: 80% statistical power, 5% significance level (alpha = 0.05)
Randomization: user ID hash-based
Duration: include weekly patterns (minimum 1 week)
Guardrail metrics: primary metric + safety-net metrics (latency, error rate)
Analysis: significance + subgroup impact
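Two of the design steps above lend themselves to a short sketch: the sample size calculation and the hash-based randomization. The code assumes a two-sided two-proportion z-test with the normal approximation and a 10% baseline CTR. The function names, the experiment name, and the use of SHA-256 are illustrative choices, not a prescribed method.

```python
import hashlib
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect the given relative lift."""
    p_new = p_base * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2
    return math.ceil(n)


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministic assignment: the same user always lands in the same arm,
    and different experiments get independent splits."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


print(sample_size_per_arm(p_base=0.10, relative_lift=0.05))
print(assign_variant(user_id=42, experiment="ranker_v2"))
```

Under these assumptions, detecting a 5% relative CTR lift from a 10% baseline needs on the order of 58,000 users per arm, which is why the duration and traffic share are worth planning before the test starts.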

7.3 Shadow Deployment

User Request
  |
[Router]
  |-- Prod Model --> User Response (returned)
  +-- Shadow Model --> Log (invisible to user)

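Pros: no user-facing risk, validated with real traffic. Cons: roughly 2x inference cost, and because users never see the shadow output, side effects of changed logic (such as shifts in user behavior) are hard to detect.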
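The routing in the diagram can be sketched in a few lines. This is a simplified, synchronous version under stated assumptions: models are plain callables and the log is an in-memory list. In a real service the shadow call would normally run asynchronously so it adds no latency to the user response.

```python
import logging

logger = logging.getLogger("shadow")


def serve(request, prod_model, shadow_model, shadow_log):
    """Return the production prediction; record the shadow prediction only."""
    response = prod_model(request)
    try:
        shadow_log.append(
            {"request": request, "prod": response, "shadow": shadow_model(request)}
        )
    except Exception:  # a failing shadow model must never affect the user
        logger.exception("shadow model failed")
    return response


# Hypothetical stand-ins for the two models
log = []
print(serve(2, prod_model=lambda x: x * 2, shadow_model=lambda x: x * 3, shadow_log=log))
print(log)
```

Comparing the `prod` and `shadow` fields offline shows where the new model disagrees with the current one before any user sees its output.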

7.4 2024 to 2025 Experiment Platforms

Eppo: statistically rigorous
GrowthBook: open-source
Statsig: founded by ex-Facebook engineers
Self-built: favored by large enterprises

Part 8 — GPU Economics 2025

8.1 GPU Option Comparison

Option	Price (H100 baseline)	Flexibility	Good For
On-demand Cloud	~$3/hr	Highest	Small, irregular
Spot/Preemptible	$1 to $1.5/hr (60 to 70% off)	Low	Batch training
Reserved (1 to 3 years)	~$1.5 to $2/hr	Medium	Predictable workloads
Dedicated	thousands of dollars/month	High	Long-term production
Owned	H100 ~$30K, DGX ~$400K	Highest	Large-scale, long-term

8.2 2024 to 2025 Trends

H100 to B200 (Blackwell): 2.5x performance at similar price
AMD MI300X: H100 alternative, 192GB memory
Groq LPU: inference specialized, highest tokens/sec
AWS Trainium/Inferentia: in-house chips, better price/perf
Google TPU v5: training and inference

8.3 Ten Cost-Optimization Tactics

Spot instances + checkpointing: training is resumable. 70% off.
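Mixed precision (FP16/BF16): up to ~2x speed and roughly half the activation memory
Gradient accumulation: large batch effect on small GPUs
Gradient checkpointing: much lower activation memory for roughly 20% slower training
Quantization (INT8/INT4): 2 to 4x inference memory reduction versus FP16
LoRA/QLoRA: trains ~1% of parameters, sharply cutting fine-tuning memory and cost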
Model distillation: replicate performance in a smaller model
Batching + dynamic batching: serving throughput
Request caching: reuse results for repeated prompts
Right-sizing: avoid overprovisioning (don't use A100 when A10 suffices)
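Of the tactics above, request caching is the easiest to sketch without any ML framework. The class below is a hypothetical exact-match cache keyed on model name, prompt, and sampling parameters. It only makes sense when identical inputs should give identical outputs, for example with deterministic decoding, and a production version would also need eviction and expiry.

```python
import hashlib
import json


class PromptCache:
    """Exact-match cache for generated responses."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, params):
        # sort_keys makes the key independent of dict ordering in params
        payload = json.dumps([model, prompt, params], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_generate(self, model, prompt, params, generate):
        """Return a cached response, or call generate(prompt) and store it."""
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = generate(prompt)
        self._store[key] = result
        return result


cache = PromptCache()
fake_llm = lambda prompt: prompt.upper()  # stand-in for a real model call
cache.get_or_generate("demo-model", "hello", {"temperature": 0}, fake_llm)
cache.get_or_generate("demo-model", "hello", {"temperature": 0}, fake_llm)
print(cache.hits, cache.misses)
```

Including the sampling parameters in the key avoids serving a response generated under different settings.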

8.4 Cloud vs. Owning: Break-Even Point

Simple rule: 24/7 GPU usage, 3+ units, 1+ year expected — consider buying.

Reality: factor in infra team headcount, power/cooling, refresh cycles (3 to 4 years).
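A back-of-the-envelope version of this break-even reasoning, using the prices from the comparison table, might look like the sketch below. The running-cost parameter, the 730 hours per month, and the function name are assumptions; staffing, cooling, and hardware refresh would need to be folded into the hourly running cost for a realistic figure.

```python
def breakeven_months(purchase_usd, cloud_usd_per_hour, owned_opex_usd_per_hour=0.0,
                     utilization=1.0, hours_per_month=730):
    """Months until cumulative cloud spend matches the purchase price plus
    the running cost of an owned GPU. Returns None if owning never pays off."""
    saving_per_used_hour = cloud_usd_per_hour - owned_opex_usd_per_hour
    if saving_per_used_hour <= 0 or utilization <= 0:
        return None
    return purchase_usd / (saving_per_used_hour * utilization * hours_per_month)


# H100 at ~$30K versus on-demand (~$3/hr) and reserved (~$1.5/hr) cloud pricing
print(f"vs on-demand: {breakeven_months(30_000, 3.0):.1f} months")
print(f"vs reserved:  {breakeven_months(30_000, 1.5):.1f} months")
print(f"vs on-demand at 50% utilization: {breakeven_months(30_000, 3.0, utilization=0.5):.1f} months")
```

With zero running cost and 24/7 use, the purchase pays off after about 13.7 months against on-demand pricing and about 27.4 months against reserved pricing, which is consistent with the "1+ year" rule only when utilization stays high.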

Part 9 — Data Pipelines

9.1 Batch vs. Streaming

Batch	Streaming
Airflow, Prefect, Dagster	Kafka, Flink, Spark Streaming
Hourly/daily batches	Real-time
Lower cost	Higher cost
Delay tolerated	ms to s required

9.2 2025 Orchestration

Airflow 2.x: standard, mature
Prefect: Pythonic, great UX
Dagster: type-safe, data-aware
Temporal: workflow specialized, restartable

9.3 ML Data Quality Checks

Nulls: threshold on null ratio
Outliers: Z-score or Isolation Forest
Schema drift: column add/remove/type change
Range checks: age > 0, price > 0
Distribution: histogram vs. baseline
Uniqueness: no ID duplicates
Relationships: FK integrity

9.4 Great Expectations / Soda

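# Legacy pandas API from great_expectations 0.x; GX 1.x replaced it with a context/batch workflow
import great_expectations as ge

# df is assumed to be an existing pandas DataFrame with user_id, age and email columns
df = ge.from_pandas(df)  # wrap it so the expect_* methods become available

# Each call returns a validation result whose success flag can gate the pipeline
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("age", 0, 150)
df.expect_column_value_lengths_to_be_between("email", 5, 100)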

Part 10 — Observability and Debugging

10.1 Three Pillars + ML-Specific

Generic apps:

Metrics (Prometheus)
Logs (Loki, Elasticsearch)
Traces (OpenTelemetry)

ML additions:

Prediction log: input + output + model version
Feature log: Feature Store lookup records
Drift metrics: distribution statistics
Explanations: SHAP, LIME
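A prediction log is easiest to work with when every request produces one structured record. The sketch below emits a JSON line per prediction. The field names are assumptions rather than a standard schema, and in practice the record would be sent to a log pipeline instead of printed.

```python
import json
import time
import uuid


def prediction_record(model_version, features, prediction, latency_ms):
    """One JSON line per prediction: enough to replay the request, debug it,
    and join delayed labels later via request_id."""
    return json.dumps(
        {
            "request_id": str(uuid.uuid4()),
            "ts": time.time(),
            "model_version": model_version,
            "features": features,
            "prediction": prediction,
            "latency_ms": latency_ms,
        },
        sort_keys=True,
    )


print(prediction_record("v3", {"purchase_amount_7d": 65.0}, 0.91, 12.5))
```

Logging the model version with every record is what makes it possible to attribute a later drop in quality to a specific deployment.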

10.2 LLM Observability Tools

LangSmith: LangChain team
Langfuse: open-source
Helicone: proxy-based
Phoenix (Arize): open-source, strong

Part 11 — MLOps Mastery Roadmap (6 Months)

Month 1: Fundamentals + Serving

FastAPI + PyTorch model serving
Docker + K8s basics

Month 2: Training Infra

PyTorch DDP
Ray Train or DeepSpeed hands-on
MLflow for experiment tracking

Month 3: Feature Store + Data

Feast install and operations
Airflow or Dagster pipelines
Great Expectations data quality

Month 4: LLM Specialization

vLLM operations
Prompt management (Langfuse)
LLM-as-a-Judge eval

Month 5: Drift + A/B

Evidently AI for drift detection
GrowthBook for A/B testing
Shadow Deployment in practice

Month 6: Optimization + Scale

GPU cost monitoring
LoRA fine-tuning
Model distillation experiments

Part 12 — MLOps Checklist of 12

Know your team's position on the 5-level MLOps maturity scale
Understand why vLLM beats legacy serving (PagedAttention)
Articulate Feature Store's three roles
Know the differences between DDP, FSDP, and ZeRO
Know the memory-saving principle of LoRA vs. QLoRA
Know MLflow's 5 logging targets
Explain the 3 kinds of drift (Data/Concept/Label)
Know Shadow Deployment's trade-offs
Know when to pick Canary vs. A/B testing
Know GPU spot instance pitfalls during training
Know the difference between batch and dynamic batching
Know the 3 elements of LLM observability (prompt, completion, metadata)

Part 13 — 10 MLOps Anti-Patterns

Deploying from notebooks only: zero reproducibility. Move to pipelines.
Splitting feature computation across training and serving: use a Feature Store or shared library.
Deploying without an eval set: no way to detect regressions.
No drift monitoring: quiet failure in 6 months.
Single A/B metric: guardrails are mandatory.
Picking GPUs "by feel": A10/A100/H100 need clear criteria.
Spot instances without checkpointing: termination erases training.
Big-bang deploy without Shadow: risk underestimated.
Observability later: must be baked in from day one.
Treating MLOps as ML team's job only: DevOps/Data team collaboration required.

Closing — MLOps is the "Invisible 70%"

The beauty of papers lies in model architecture, but the beauty of production lies in 30 invisible systems meshing together.

The 2025 AI/ML engineer divide:

"Can run a model" = entry level
"Can draw a pipeline/serving architecture" = senior
"Designs drift, cost, eval end-to-end" = staff+

Papers are public; operational know-how is not. That's why this area drives salary gaps.

Next Post — "Data Engineering Complete Guide: Lakehouse, Streaming, dbt, Orchestration, Data Mesh"

Season 2 Ep 8 is about the foundation beneath ML: data engineering. Next up:

Lakehouse architecture (Iceberg, Delta, Hudi)
Batch vs. Streaming (Flink, Kafka Streams, Spark Structured Streaming)
Data modeling with dbt + Elementary
Airflow vs. Prefect vs. Dagster vs. Temporal
The real meaning of Data Mesh
Data Contracts and Schema Registry

"The way data works" has changed in 2025 — continues in the next post.