
MLOps Complete Guide — Model Serving, Feature Store, Drift, A/B Testing, GPU Economics (Season 2 Ep 7, 2025)

Intro — Why MLOps isn't just "DevOps + ML"

DevOps deploys code. MLOps deploys, monitors, and versions code + data + models simultaneously.

Four reasons MLOps is uniquely hard:

  1. Reproducibility: same code + same data yields different models (randomness, hardware differences)
  2. Drift: when data distributions shift, models rot in real time
  3. Latency: training is batch, serving is real-time — architecture must split
  4. Cost: one GPU costs $2,000 to $30,000/month. Bad design evaporates a startup's entire budget

In 2024 to 2025, with the LLM era, MLOps expanded into LLMOps. This post covers both.


Part 1 — Google's 5 Levels of MLOps Maturity (2021)

Level | Characteristics
Lv.0 | Manual: notebook to manual deploy. Small-scale experiments
Lv.1 | Automated training pipeline + manual deploy
Lv.2 | Auto training + auto deploy + monitoring
Lv.3 | Auto retraining (triggered on drift detection)
Lv.4 | Fully automated + linked to business metrics

Most enterprises live at Lv.1 to Lv.2. Lv.3+ is Netflix, Uber, Airbnb territory.


Part 2 — Model Serving: the Inference System

2.1 General ML Serving

Tool | Strength | Use
TorchServe | PyTorch native | PyTorch standard
TensorFlow Serving | Mature, long-standing | TF models
Triton Inference Server (NVIDIA) | Multi-framework, dynamic batching | Production standard
BentoML | Python-friendly | Fast prototyping
KServe | Kubernetes native | K8s environments

2.2 LLM Serving (2024 to 2025 standard)

Tool | Characteristics
vLLM | PagedAttention, dominant throughput, open-source standard
TGI (HuggingFace) | Written in Rust, stable
TensorRT-LLM | NVIDIA optimized, top performance
SGLang | Optimized for complex workflows
llama.cpp | CPU, Mac, edge

Default for open-source LLM production in 2025: vLLM.

2.3 vLLM's Innovation: PagedAttention

Classic attention KV cache uses contiguous memory allocation — heavy fragmentation wastes 60 to 80% of GPU memory.

PagedAttention manages block-wise like OS virtual memory — under 4% waste, 2 to 4x throughput on concurrent requests.
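
A minimal offline-inference sketch with vLLM's Python API; the model name is only a placeholder, and PagedAttention plus continuous batching are applied automatically under the hood:

# Minimal vLLM sketch (model name is illustrative; any HF-format model works)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.9)
params = SamplingParams(temperature=0.7, max_tokens=128)

# Submitting many prompts at once lets vLLM batch them across KV-cache blocks
outputs = llm.generate(["What is PagedAttention?", "Define MLOps."], params)
for out in outputs:
    print(out.outputs[0].text)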

2.4 Four Serving Patterns

  1. Online (real-time): millisecond response. API server (see the sketch after this list).
  2. Batch: bulk prediction (nightly jobs). Efficient.
  3. Streaming: event-driven (Kafka to model).
  4. Edge: on-device (mobile, IoT).
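
For pattern 1, a minimal online-serving sketch with FastAPI; the linear layer is a stand-in for a real trained artifact:

from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()
model = torch.nn.Linear(4, 2)  # stand-in; load a trained artifact in practice
model.eval()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # No-grad inference keeps latency and memory down
    with torch.no_grad():
        logits = model(torch.tensor(req.features).unsqueeze(0))
    return {"prediction": int(logits.argmax())}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000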

2.5 Serving Performance Metrics

  • Latency (P50, P95, P99): response time
  • Throughput (QPS): requests per second
  • TTFT (Time to First Token, LLM): time to first token
  • TPS (Tokens Per Second, LLM): generation speed
  • GPU Utilization: target 60 to 80% (too low wastes; too high explodes latency)
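
A quick sketch of how the latency percentiles above fall out of raw request timings (the samples here are synthetic):

import numpy as np

# Synthetic latency samples in milliseconds, standing in for real request logs
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.5, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")

# Throughput: completed requests divided by the window they arrived in
qps = len(latencies_ms) / 60  # e.g. 10k requests over a 60 s window
print(f"QPS~{qps:.0f}")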

Part 3 — Feature Store: the ML Feature Compute Layer

3.1 Why a Feature Store

Prevents the disaster of feature computation diverging between training and serving.

Example: "purchase amount over last 7 days" — if the time boundary or aggregation logic differs by even 0.001% between training data and live requests, model performance craters.

3.2 Three Roles of a Feature Store

  1. Offline Store (training): Parquet, BigQuery, Snowflake — bulk lookup for training
  2. Online Store (serving): Redis, DynamoDB — low-latency lookup
  3. Feature Definition: central registry of "what this feature is and how it's computed"

3.3 2025 Feature Store Options

Tool | Characteristics
Feast | Open-source standard, lightweight
Tecton | Commercial, enterprise
Hopsworks | End-to-end platform
Databricks Feature Store | Delta Lake integrated
Self-built | Redis + S3 + metadata DB

3.4 Feast Example

from datetime import timedelta

import pandas as pd
from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Offline source holding the precomputed features (table name is illustrative)
bigquery_source = BigQuerySource(
    table="project.dataset.user_activity_7d",
    timestamp_field="event_timestamp",
)

user = Entity(name="user_id", join_keys=["user_id"])

user_activity = FeatureView(
    name="user_activity_7d",
    entities=[user],
    schema=[
        Field(name="purchase_amount_7d", dtype=Float32),
        Field(name="click_count_7d", dtype=Int64),
    ],
    source=bigquery_source,
    online=True,
    ttl=timedelta(days=7),
)

store = FeatureStore(repo_path=".")
feature_refs = [
    "user_activity_7d:purchase_amount_7d",
    "user_activity_7d:click_count_7d",
]

# Training: point-in-time-correct join against an entity dataframe
entity_df = pd.DataFrame(
    {"user_id": [1, 2], "event_timestamp": pd.to_datetime(["2025-01-01"] * 2)}
)
features = store.get_historical_features(
    entity_df=entity_df, features=feature_refs
).to_df()

# Serving: low-latency lookup from the online store
features = store.get_online_features(
    features=feature_refs, entity_rows=[{"user_id": 1}]
).to_dict()

Part 4 — Training Infra: Large-Scale Training

4.1 Single GPU to Distributed Training Progression

  1. Single GPU (~7B parameters)
  2. Data Parallel (DP): multiple GPUs replicate model + different data
  3. Distributed Data Parallel (DDP): DP improved, All-Reduce for gradient sync (see the sketch after this list)
  4. Model Parallel: model too big, split it
  5. Tensor Parallel: split within a single layer (Megatron-LM)
  6. Pipeline Parallel: distribute by layer across GPUs
  7. 3D Parallel: DP + TP + PP combined (GPT-4 class)
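
A minimal DDP sketch, assuming torchrun launches one process per GPU; the tiny linear model and random batch are placeholders:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; NCCL backend for GPUs
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randn(64, 128, device=local_rank)
    y = torch.randint(0, 10, (64,), device=local_rank)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()  # gradients are All-Reduced across processes here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
    # Launch with: torchrun --nproc_per_node=4 train_ddp.py (filename hypothetical)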

4.2 2025 Distributed Training Tools

Tool | Characteristics
PyTorch DDP | Standard
DeepSpeed (MS) | ZeRO optimization, essential for LLMs
FSDP (Meta) | PyTorch native, DeepSpeed alternative
Megatron-LM (NVIDIA) | Ultra-large models
Ray Train | Unified interface
Determined AI | Integrated experiment tracking

4.3 ZeRO (Zero Redundancy Optimizer)

Shards optimizer state across GPUs, dramatically reducing memory:

  • ZeRO-1: optimizer state sharding
  • ZeRO-2: + gradient sharding
  • ZeRO-3: + model parameter sharding (similar to FSDP)
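
As an illustration, a hedged DeepSpeed sketch enabling ZeRO-2 via the standard config dict; the model is assumed to be defined elsewhere, and batch size and learning rate are example values:

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,            # 1: optimizer state, 2: +gradients, 3: +parameters
        "overlap_comm": True,  # overlap gradient communication with compute
    },
}

# `model` is assumed to be a torch.nn.Module defined elsewhere
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)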

4.4 Lightweight Fine-tuning Techniques

Technique | Savings
LoRA | Only ~1% of parameters trained
QLoRA | LoRA + 4-bit quantization: fine-tune a 70B model on a single GPU
DoRA | LoRA improved (magnitude/direction decomposition)
GaLore | Near full-parameter quality with large memory savings
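
A hedged LoRA setup sketch with the Hugging Face PEFT library; the model name and target modules are illustrative and vary by architecture:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora = LoraConfig(
    r=16,                                # adapter rank
    lora_alpha=32,                       # scaling factor
    target_modules=["q_proj", "v_proj"], # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights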

Part 5 — Experiment Tracking: Recording Experiments

5.1 Why It Matters

"Why was that model 3 months ago so good?" — no answer = no reproducibility = worthless.

5.2 2025 Tool Comparison

Tool | Pros | Cons
MLflow | Open-source, self-hostable | Plain UI
Weights & Biases | Best UI, great for collaboration | SaaS, cost
Neptune.ai | Strong metadata | Best for small/mid scale
Comet | W&B alternative | Best for small/mid scale
ClearML | Open-source, includes pipelines | Learning curve

5.3 MLflow Basics

import mlflow
import mlflow.pytorch

# `model`, `train`, `validate`, `loader`, `val_loader`, and `epochs` are
# assumed to be defined by the surrounding training script.
mlflow.set_experiment("image_classifier")

with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "batch_size": 64})

    for epoch in range(epochs):
        train_loss = train(model, loader)
        val_acc = validate(model, val_loader)
        mlflow.log_metrics({"train_loss": train_loss, "val_acc": val_acc}, step=epoch)

    # Log the trained weights and evaluation artifacts alongside the run
    mlflow.pytorch.log_model(model, "model")
    mlflow.log_artifact("confusion_matrix.png")

5.4 Ten Things to Track

  1. Hyperparameters: LR, batch size, optimizer
  2. Data version: DVC or Delta Lake hash
  3. Code version: git commit hash
  4. Metrics: train/val loss, acc, AUC, etc.
  5. Model artifact: weights, architecture
  6. Environment: Python version, GPU type, requirements
  7. Training time: total and per-epoch
  8. Resources: GPU memory, CPU usage
  9. Random seed: reproducibility
  10. Dataset stats: class distribution, sample count

Part 6 — Drift Detection: How Models Rot

6.1 Three Kinds of Drift

  1. Data Drift (Covariate Shift): input distribution changes
    • Example: pre/post-COVID shopping patterns
  2. Concept Drift: input to output relationship changes
    • Example: the definition of spam itself shifts
  3. Label Drift: label distribution changes
    • Example: fraud rate jumps from 1% to 5%

6.2 Detection Methods

Statistical:

  • KS Test (single feature)
  • PSI (Population Stability Index), sketched at the end of this section
  • Wasserstein Distance
  • Chi-Square (categorical)

ML-based:

  • Domain Classifier (train a classifier to separate training from production data; if it succeeds, the distributions have drifted)
  • Autoencoder reconstruction error

Performance-based:

  • Actual performance after delayed labels arrive
  • Proxy metrics (CTR, conversion rate)
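
To make the statistical bucket concrete, a minimal PSI sketch; the bin count and the thresholds in the docstring are conventional rules of thumb, not mandates:

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI rule of thumb: <0.1 stable, 0.1-0.25 moderate, >0.25 major shift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside baseline range
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

baseline = np.random.normal(0.0, 1.0, 10_000)    # training-time feature
production = np.random.normal(0.3, 1.2, 10_000)  # drifted live feature
print(psi(baseline, production))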

6.3 2025 Drift Tools

  • Evidently AI: open-source, dashboards
  • Arize AI: commercial, LLM + ML unified
  • WhyLabs: data quality + drift
  • Fiddler: enterprise

6.4 LLM-Specific Problems

  • Prompt Drift: prompt distribution changes (user trends)
  • Response Drift: response quality degradation (model update effects)
  • Cost Drift: average token count creeps up — costs explode

Part 7 — Model A/B Testing and Deployment Strategies

7.1 Five Deployment Strategies

Strategy | Description | Risk
Blue-Green | Full swap of old/new environments | Medium
Canary | N% on new model, gradual expansion | Low
A/B | Split users precisely in half | Low (stats required)
Shadow | New model receives mirrored traffic; users only ever see the old model's responses | Safest
Multi-armed Bandit | Automatic traffic reallocation | Low (adaptive)

7.2 A/B Test Design

  1. Hypothesis: "new model lifts CTR by at least 5%"
  2. Sample size calculation: 80% statistical power, 5% significance (sketch after this list)
  3. Randomization: user ID hash-based
  4. Duration: include weekly patterns (minimum 1 week)
  5. Guardrail metrics: primary metric + safety-net metrics (latency, error rate)
  6. Analysis: significance + subgroup impact
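
A hedged sketch of step 2, using statsmodels' power analysis for a two-proportion test; the baseline CTR and lift are example numbers:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.10  # current click-through rate (example)
target_ctr = 0.105   # +5% relative lift from the hypothesis

effect = proportion_effectsize(target_ctr, baseline_ctr)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, power=0.80, alpha=0.05, ratio=1.0
)
print(f"~{int(n_per_arm):,} users per arm")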

7.3 Shadow Deployment

User Request
  |
[Router]
  |-- Prod Model --> User Response (returned)
  +-- Shadow Model --> Log (invisible to user)

Pros: zero risk, validated with real traffic. Cons: 2x cost, side effects hard to detect on logic changes.

7.4 2024 to 2025 Experiment Platforms

  • Eppo: statistically rigorous
  • GrowthBook: open-source
  • Statsig: Facebook alumni
  • Self-built: favored by large enterprises

Part 8 — GPU Economics 2025

8.1 GPU Option Comparison

Option | Price (H100 baseline) | Flexibility | Good For
On-demand Cloud | ~$3/hr | Highest | Small, irregular workloads
Spot/Preemptible | ~$1 to $1.5/hr (60 to 70% off) | Low | Batch training
Reserved (1 to 3 years) | ~$1.5 to $2/hr | Medium | Predictable workloads
Dedicated | Thousands of $/month | High | Long-term production
Owned | H100 ~$30K, DGX ~$400K | Highest | Large-scale, long-term

8.2 2024 to 2025 Trends
  • H100 to B200 (Blackwell): 2.5x performance at similar price
  • AMD MI300X: H100 alternative, 192GB memory
  • Groq LPU: inference specialized, highest tokens/sec
  • AWS Trainium/Inferentia: in-house chips, better price/perf
  • Google TPU v5: training and inference

8.3 Ten Cost-Optimization Tactics

  1. Spot instances + checkpointing: training is resumable. 70% off.
  2. Mixed precision (FP16/BF16): 2x speed and memory (combined with tactic 3 in the sketch after this list)
  3. Gradient accumulation: large batch effect on small GPUs
  4. Gradient checkpointing: half memory, 20% slower
  5. Quantization (INT8/INT4): 2 to 4x inference memory reduction
  6. LoRA/QLoRA: 99% savings on fine-tuning
  7. Model distillation: replicate performance in a smaller model
  8. Batching + dynamic batching: serving throughput
  9. Request caching: reuse results for repeated prompts
  10. Right-sizing: avoid overprovisioning (don't use A100 when A10 suffices)
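
A sketch combining tactics 2 and 3: bf16 autocast plus gradient accumulation. `model`, `opt`, and `loader` are assumed to be defined by the surrounding script:

import torch

accum_steps = 8  # effective batch = 8 x micro-batch, with no extra memory

for step, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    # Tactic 2: run forward/backward in bf16 to cut memory and speed up compute
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate in the .grad buffers
    # Tactic 3: step the optimizer only every accum_steps micro-batches
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()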

8.4 Cloud vs. Owning: Break-Even Point

Simple rule: 24/7 GPU usage, 3+ units, 1+ year expected — consider buying.

Reality: factor in infra team headcount, power/cooling, refresh cycles (3 to 4 years).
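
Back-of-envelope arithmetic for the rule above, using the rough figures from 8.1 (the 30% overhead rate is a guess standing in for power, cooling, and ops):

gpus = 3
cloud_per_year = 3.0 * 24 * 365 * gpus  # ~$78,840/yr on-demand
purchase = 30_000 * gpus                # ~$90,000 up front
overhead_per_year = 0.3 * purchase      # assumed power/cooling/ops rate

for year in (1, 2, 3):
    cloud = cloud_per_year * year
    owned = purchase + overhead_per_year * year
    print(year, f"cloud=${cloud:,.0f}", f"owned=${owned:,.0f}")
# Owning pulls ahead around year 2, well inside a 3-4 year refresh cycle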


Part 9 — Data Pipelines

9.1 Batch vs. Streaming

Batch | Streaming
Airflow, Prefect, Dagster | Kafka, Flink, Spark Streaming
Hourly/daily batches | Real-time
Lower cost | Higher cost
Delay tolerated | ms to s latency required

9.2 2025 Orchestration

  • Airflow 2.x: standard, mature
  • Prefect: Pythonic, great UX
  • Dagster: type-safe, data-aware
  • Temporal: workflow specialized, restartable

9.3 ML Data Quality Checks

  • Nulls: threshold on null ratio
  • Outliers: Z-score or Isolation Forest
  • Schema drift: column add/remove/type change
  • Range checks: age > 0, price > 0
  • Distribution: histogram vs. baseline
  • Uniqueness: no ID duplicates
  • Relationships: FK integrity

9.4 Great Expectations / Soda

import great_expectations as ge
import pandas as pd

# Toy frame for illustration; in practice this is a pipeline's output
df = pd.DataFrame({"user_id": [1, 2], "age": [34, 29], "email": ["a@b.co", "c@d.io"]})

# Legacy pandas API: wraps the DataFrame with expectation methods, each
# returning a validation result (success flag + details)
df = ge.from_pandas(df)
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("age", 0, 150)
df.expect_column_value_lengths_to_be_between("email", 5, 100)

Part 10 — Observability and Debugging

10.1 Three Pillars + ML-Specific

Generic apps:

  • Metrics (Prometheus)
  • Logs (Loki, Elasticsearch)
  • Traces (OpenTelemetry)

ML additions:

  • Prediction log: input + output + model version (see the sketch after this list)
  • Feature log: Feature Store lookup records
  • Drift metrics: distribution statistics
  • Explanations: SHAP, LIME
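
A minimal structured prediction-log sketch; the field names are illustrative, and in production the JSON line would go to a log shipper rather than stdout:

import json
import time
import uuid

def log_prediction(features: dict, prediction, model_version: str) -> None:
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,  # ties the output to an exact artifact
        "features": features,            # inputs exactly as served
        "prediction": prediction,
    }
    print(json.dumps(record))  # stdout -> log shipper in practice

log_prediction({"purchase_amount_7d": 120.5}, 0.87, "fraud-clf-v42")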

10.2 LLM Observability Tools

  • LangSmith: LangChain team
  • Langfuse: open-source
  • Helicone: proxy-based
  • Phoenix (Arize): open-source, strong

Part 11 — MLOps Mastery Roadmap (6 Months)

Month 1: Fundamentals + Serving

  • FastAPI + PyTorch model serving
  • Docker + K8s basics

Month 2: Training Infra

  • PyTorch DDP
  • Ray Train or DeepSpeed hands-on
  • MLflow for experiment tracking

Month 3: Feature Store + Data

  • Feast install and operations
  • Airflow or Dagster pipelines
  • Great Expectations data quality

Month 4: LLM Specialization

  • vLLM operations
  • Prompt management (Langfuse)
  • LLM-as-a-Judge eval

Month 5: Drift + A/B

  • Evidently AI for drift detection
  • GrowthBook for A/B testing
  • Shadow Deployment in practice

Month 6: Optimization + Scale

  • GPU cost monitoring
  • LoRA fine-tuning
  • Model distillation experiments

Part 12 — MLOps Checklist of 12

  1. Know your team's position on the 5-level MLOps maturity scale
  2. Understand why vLLM beats legacy serving (PagedAttention)
  3. Articulate Feature Store's three roles
  4. Know the differences between DDP, FSDP, and ZeRO
  5. Know the memory-saving principle of LoRA vs. QLoRA
  6. Know MLflow's 5 logging targets
  7. Explain the 3 kinds of drift (Data/Concept/Label)
  8. Know Shadow Deployment's trade-offs
  9. Know when to pick Canary vs. A/B testing
  10. Know GPU spot instance pitfalls during training
  11. Know the difference between batch and dynamic batching
  12. Know the 3 elements of LLM observability (prompt, completion, metadata)

Part 13 — 10 MLOps Anti-Patterns

  1. Deploying from notebooks only: zero reproducibility. Move to pipelines.
  2. Splitting feature computation across training and serving: use a Feature Store or shared library.
  3. Deploying without an eval set: no way to detect regressions.
  4. No drift monitoring: quiet failure in 6 months.
  5. Single A/B metric: guardrails are mandatory.
  6. Picking GPUs "by feel": A10/A100/H100 need clear criteria.
  7. Spot instances without checkpointing: termination erases training.
  8. Big-bang deploy without Shadow: risk underestimated.
  9. Observability later: must be baked in from day one.
  10. Treating MLOps as ML team's job only: DevOps/Data team collaboration required.

Closing — MLOps is the "Invisible 70%"

The beauty of papers lies in model architecture, but the beauty of production lies in 30 invisible systems meshing together.

The 2025 AI/ML engineer divide:

  • "Can run a model" = entry level
  • "Can draw a pipeline/serving architecture" = senior
  • "Designs drift, cost, eval end-to-end" = staff+

Papers are public; operational know-how is not. That's why this area drives salary gaps.


Next Post — "Data Engineering Complete Guide: Lakehouse, Streaming, dbt, Orchestration, Data Mesh"

Season 2 Ep 8 is about the foundation beneath ML: data engineering. Next up:

  • Lakehouse architecture (Iceberg, Delta, Hudi)
  • Batch vs. Streaming (Flink, Kafka Streams, Spark Structured Streaming)
  • Data modeling with dbt + Elementary
  • Airflow vs. Prefect vs. Dagster vs. Temporal
  • The real meaning of Data Mesh
  • Data Contracts and Schema Registry

"The way data works" has changed in 2025 — continues in the next post.