MLOps Complete Guide — Model Serving, Feature Store, Drift, A/B Testing, GPU Economics (Season 2 Ep 7, 2025)

Intro — Why MLOps isn't just "DevOps + ML"

DevOps deploys code. MLOps deploys, monitors, and versions code + data + models simultaneously.

Four reasons MLOps is uniquely hard:

  1. Reproducibility: same code + same data yields different models (randomness, hardware differences)
  2. Drift: when data distributions shift, models rot in real time
  3. Latency: training is batch, serving is real-time — architecture must split
  4. Cost: one GPU costs $2,000 to $30,000 per month. Bad design evaporates a startup's entire budget

In 2024 to 2025, with the LLM era, MLOps expanded into LLMOps. This post covers both.


Part 1 — The 5 Levels of MLOps Maturity

| Level | Characteristics |
| --- | --- |
| Lv.0 | Manual: notebook to manual deploy. Small-scale experiments |
| Lv.1 | Automated training pipeline + manual deploy |
| Lv.2 | Auto training + auto deploy + monitoring |
| Lv.3 | Auto retraining (triggered on drift detection) |
| Lv.4 | Fully automated + linked to business metrics |

Most enterprises live at Lv.1 to Lv.2. Lv.3+ is Netflix, Uber, Airbnb territory.


Part 2 — Model Serving: the Inference System

2.1 General ML Serving

| Tool | Strength | Use |
| --- | --- | --- |
| TorchServe | PyTorch native | PyTorch standard |
| TensorFlow Serving | Mature, long-standing | TF models |
| Triton Inference Server (NVIDIA) | Multi-framework, dynamic batching | Production standard |
| BentoML | Python-friendly | Fast prototyping |
| KServe | Kubernetes native | K8s environments |

2.2 LLM Serving (2024 to 2025 standard)

| Tool | Characteristics |
| --- | --- |
| vLLM | PagedAttention, dominant throughput, open-source standard |
| TGI (Hugging Face) | Written in Rust, stable |
| TensorRT-LLM | NVIDIA optimized, top performance |
| SGLang | Optimized for complex workflows |
| llama.cpp | CPU, Mac, edge |

Default for open-source LLM production in 2025: vLLM.

2.3 vLLM's Innovation: PagedAttention

Classic attention serving allocates the KV cache in contiguous memory blocks; fragmentation and over-reservation waste 60 to 80% of that cache memory.

PagedAttention manages block-wise like OS virtual memory — under 4% waste, 2 to 4x throughput on concurrent requests.
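
The quickest way to feel this in practice is vLLM's offline batch API. A minimal sketch, assuming the checkpoint name below is only an example and that it fits in GPU memory:

from vllm import LLM, SamplingParams

# Example checkpoint; swap in whatever open-source model you actually serve
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)

For production serving you would typically run vLLM's OpenAI-compatible API server behind a load balancer rather than the offline API.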

2.4 Four Serving Patterns

  1. Online (real-time): millisecond response via an API server (minimal sketch after this list).
  2. Batch: bulk prediction (nightly jobs). Efficient.
  3. Streaming: event-driven (Kafka to model).
  4. Edge: on-device (mobile, IoT).
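
A minimal sketch of pattern 1, online serving with FastAPI and a PyTorch model; the model path, input schema, and lack of preprocessing are placeholder assumptions:

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load("model.pt")  # placeholder: a TorchScript model exported earlier
model.eval()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    x = torch.tensor(req.features).unsqueeze(0)   # single-row batch
    with torch.no_grad():
        y = model(x)
    return {"prediction": y.squeeze(0).tolist()}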

2.5 Serving Performance Metrics

  • Latency (P50, P95, P99): response time
  • Throughput (QPS): requests per second
  • TTFT (Time to First Token, LLM): time to first token
  • TPS (Tokens Per Second, LLM): generation speed
  • GPU Utilization: target 60 to 80% (too low wastes; too high explodes latency)

Part 3 — Feature Store: the ML Feature Compute Layer

3.1 Why a Feature Store

Prevents the disaster of feature computation diverging between training and serving.

Example: "purchase amount over last 7 days" — if the time boundary or aggregation logic differs by even 0.001% between training data and live requests, model performance craters.

3.2 Three Roles of a Feature Store

  1. Offline Store (training): Parquet, BigQuery, Snowflake — bulk lookup for training
  2. Online Store (serving): Redis, DynamoDB — low-latency lookup
  3. Feature Definition: central registry of "what this feature is and how it's computed"

3.3 2025 Feature Store Options

| Tool | Characteristics |
| --- | --- |
| Feast | Open-source standard, lightweight |
| Tecton | Commercial, enterprise |
| Hopsworks | End-to-end platform |
| Databricks Feature Store | Delta Lake integrated |
| Self-built | Redis + S3 + metadata DB |

3.4 Feast Example

from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

user = Entity(name="user_id", join_keys=["user_id"])

user_activity = FeatureView(
    name="user_activity_7d",
    entities=[user],
    schema=[
        Field(name="purchase_amount_7d", dtype=Float32),
        Field(name="click_count_7d", dtype=Int64),
    ],
    source=bigquery_source,  # a BigQuerySource defined elsewhere in the feature repo
    online=True,
    ttl=timedelta(days=7),
)

store = FeatureStore(repo_path=".")
feature_refs = [
    "user_activity_7d:purchase_amount_7d",
    "user_activity_7d:click_count_7d",
]

# Training: point-in-time correct join against the offline store
# (entity_df holds user_ids plus event timestamps)
features = store.get_historical_features(entity_df=entity_df, features=feature_refs).to_df()

# Serving: low-latency lookup from the online store
features = store.get_online_features(
    features=feature_refs, entity_rows=[{"user_id": 1}]
).to_dict()

Part 4 — Training Infra: Large-Scale Training

4.1 Single GPU to Distributed Training Progression

  1. Single GPU (~7B parameters)
  2. Data Parallel (DP): multiple GPUs replicate model + different data
  3. Distributed Data Parallel (DDP): DP improved, All-Reduce for gradient sync
  4. Model Parallel: model too big, split it
  5. Tensor Parallel: split within a single layer (Megatron-LM)
  6. Pipeline Parallel: distribute by layer across GPUs
  7. 3D Parallel: DP + TP + PP combined (GPT-4 class)

4.2 2025 Distributed Training Tools

| Tool | Characteristics |
| --- | --- |
| PyTorch DDP | Standard |
| DeepSpeed (Microsoft) | ZeRO optimization, essential for LLMs |
| FSDP (Meta) | PyTorch native, DeepSpeed alternative |
| Megatron-LM (NVIDIA) | Ultra-large models |
| Ray Train | Unified interface |
| Determined AI | Integrated experiment tracking |
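
A minimal DDP sketch, launched with `torchrun --nproc_per_node=<num_gpus> train.py`; the model and dataset here are toy placeholders:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group("nccl")              # one process per GPU, started by torchrun
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 2).cuda(rank)   # toy model
    model = DDP(model, device_ids=[rank])        # gradients synced via All-Reduce

    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(data)           # each rank sees a distinct shard
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
            loss.backward()                      # All-Reduce of gradients happens here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()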

4.3 ZeRO (Zero Redundancy Optimizer)

Shards optimizer state across GPUs, dramatically reducing memory:

  • ZeRO-1: optimizer state sharding
  • ZeRO-2: + gradient sharding
  • ZeRO-3: + model parameter sharding (similar to FSDP)
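
For orientation, a minimal ZeRO-2 DeepSpeed config expressed as a Python dict; the batch sizes are placeholders, and in practice you pass this (or an equivalent JSON file) to deepspeed.initialize:

# Minimal DeepSpeed config sketch: ZeRO stage 2 (optimizer states + gradients sharded)
ds_config = {
    "train_micro_batch_size_per_gpu": 4,         # placeholder
    "gradient_accumulation_steps": 8,            # placeholder
    "bf16": {"enabled": True},                   # mixed precision
    "zero_optimization": {
        "stage": 2,                              # 1: optimizer states, 2: + gradients, 3: + parameters
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload for extra headroom
    },
}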

4.4 Lightweight Fine-tuning Techniques

| Technique | Savings |
| --- | --- |
| LoRA | Only ~1% of parameters trained |
| QLoRA | LoRA + 4-bit quantization; fine-tune 70B on a single GPU |
| DoRA | LoRA improved (magnitude/direction split) |
| GaLore | Near full-parameter quality + memory savings |
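
A minimal LoRA setup with Hugging Face PEFT; the base checkpoint and target modules below are examples and vary by architecture:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example checkpoint

lora = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()          # typically well under 1% of total parameters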

Part 5 — Experiment Tracking: Recording Experiments

5.1 Why It Matters

"Why was that model 3 months ago so good?" — no answer = no reproducibility = worthless.

5.2 2025 Tool Comparison

| Tool | Pros | Cons |
| --- | --- | --- |
| MLflow | Open-source, self-hostable | Plain UI |
| Weights & Biases | Best UI, great for collaboration | SaaS, cost |
| Neptune.ai | Strong metadata | SMB-sized |
| Comet | W&B alternative | SMB-sized |
| ClearML | Open-source, includes pipelines | Learning curve |

5.3 MLflow Basics

import mlflow
import mlflow.pytorch

mlflow.set_experiment("image_classifier")

# Assumes model, loader, val_loader, epochs, and the train()/validate() helpers exist elsewhere
with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "batch_size": 64})       # hyperparameters

    for epoch in range(epochs):
        train_loss = train(model, loader)
        val_acc = validate(model, val_loader)
        mlflow.log_metrics({"train_loss": train_loss, "val_acc": val_acc}, step=epoch)

    mlflow.pytorch.log_model(model, "model")                  # weights + architecture
    mlflow.log_artifact("confusion_matrix.png")               # any extra file as an artifact

5.4 Ten Things to Track

  1. Hyperparameters: LR, batch size, optimizer
  2. Data version: DVC or Delta Lake hash
  3. Code version: git commit hash
  4. Metrics: train/val loss, acc, AUC, etc.
  5. Model artifact: weights, architecture
  6. Environment: Python version, GPU type, requirements
  7. Training time: total and per-epoch
  8. Resources: GPU memory, CPU usage
  9. Random seed: reproducibility
  10. Dataset stats: class distribution, sample count

Part 6 — Drift Detection: How Models Rot

6.1 Three Kinds of Drift

  1. Data Drift (Covariate Shift): input distribution changes
    • Example: pre/post-COVID shopping patterns
  2. Concept Drift: input to output relationship changes
    • Example: the definition of spam itself shifts
  3. Label Drift: label distribution changes
    • Example: fraud rate jumps from 1% to 5%

6.2 Detection Methods

Statistical (a KS/PSI sketch follows after these lists):

  • KS Test (single feature)
  • PSI (Population Stability Index)
  • Wasserstein Distance
  • Chi-Square (categorical)

ML-based:

  • Domain Classifier (training vs. production classifier)
  • Autoencoder reconstruction error

Performance-based:

  • Actual performance after delayed labels arrive
  • Proxy metrics (CTR, conversion rate)
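
A minimal sketch of the two most common statistical checks, the KS test (via SciPy) and a hand-rolled PSI; the 0.1/0.2 thresholds are common rules of thumb, not laws:

import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)   # avoid log(0) / division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

baseline = np.random.normal(0, 1, 10_000)         # training-time feature distribution
production = np.random.normal(0.3, 1.2, 10_000)   # shifted production distribution

stat, p_value = ks_2samp(baseline, production)
print(f"KS p-value: {p_value:.4f}, PSI: {psi(baseline, production):.3f}")
# Rule of thumb: PSI < 0.1 stable, 0.1 to 0.2 watch, > 0.2 investigate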

6.3 2025 Drift Tools

  • Evidently AI: open-source, dashboards
  • Arize AI: commercial, LLM + ML unified
  • WhyLabs: data quality + drift
  • Fiddler: enterprise

6.4 LLM-Specific Problems

  • Prompt Drift: prompt distribution changes (user trends)
  • Response Drift: response quality degradation (model update effects)
  • Cost Drift: average token count creeps up — costs explode

Part 7 — Model A/B Testing and Deployment Strategies

7.1 Five Deployment Strategies

| Strategy | Description | Risk |
| --- | --- | --- |
| Blue-Green | Full swap of old/new environments | Medium |
| Canary | N% on new model, gradual expansion | Low |
| A/B | Split users precisely in half | Low (stats required) |
| Shadow | New model processes real traffic, response comes from old | Safest |
| Multi-armed Bandit | Automatic traffic reallocation | Intelligent |

7.2 A/B Test Design

  1. Hypothesis: "new model lifts CTR by at least 5%"
  2. Sample size calculation: 80% statistical power, 5% significance
  3. Randomization: user ID hash-based
  4. Duration: include weekly patterns (minimum 1 week)
  5. Guardrail metrics: primary metric + safety-net metrics (latency, error rate)
  6. Analysis: significance + subgroup impact
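
A minimal sketch of step 2, sample-size calculation with statsmodels, for detecting a 5% relative CTR lift at 80% power and 5% significance; the baseline CTR is an assumption:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.05                       # assumed current CTR
target_ctr = baseline_ctr * 1.05          # hypothesis: at least a 5% relative lift

effect = proportion_effectsize(target_ctr, baseline_ctr)
n = NormalIndPower().solve_power(effect_size=effect, power=0.8, alpha=0.05)
print(f"~{int(n):,} users per variant")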

7.3 Shadow Deployment

User Request
  |
[Router]
  |-- Prod Model --> User Response (returned)
  +-- Shadow Model --> Log (invisible to user)

Pros: zero user-facing risk, validation on real traffic. Cons: roughly 2x serving cost, and models with side effects (writes, downstream actions) are hard to shadow safely.
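
A minimal sketch of the router above in FastAPI: the prod model answers the user, the shadow model runs in the background and only logs. prod_predict and shadow_predict are placeholder functions:

import asyncio
import logging
from fastapi import FastAPI

app = FastAPI()
log = logging.getLogger("shadow")

async def prod_predict(payload: dict) -> dict:      # placeholder: call the current model
    return {"score": 0.87, "model": "prod-v1"}

async def shadow_predict(payload: dict) -> dict:    # placeholder: call the candidate model
    return {"score": 0.91, "model": "candidate-v2"}

@app.post("/predict")
async def predict(payload: dict):
    result = await prod_predict(payload)            # the user only ever sees this

    async def run_shadow():
        try:
            shadow = await shadow_predict(payload)
            log.info("shadow comparison: prod=%s shadow=%s", result, shadow)
        except Exception:                            # shadow failures must never reach the user
            log.exception("shadow model failed")

    asyncio.create_task(run_shadow())                # fire and forget
    return result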

7.4 2024 to 2025 Experiment Platforms

  • Eppo: statistically rigorous
  • GrowthBook: open-source
  • Statsig: Facebook alumni
  • Self-built: favored by large enterprises

Part 8 — GPU Economics 2025

8.1 GPU Option Comparison

| Option | Price (H100 baseline) | Flexibility | Good For |
| --- | --- | --- | --- |
| On-demand cloud | ~$3/hr | Highest | Small, irregular workloads |
| Spot/Preemptible | $1 to $1.5/hr (60 to 70% off) | Low | Batch training |
| Reserved (1 to 3 years) | ~$1.5 to $2/hr | Medium | Predictable workloads |
| Dedicated | Thousands of $/month | High | Long-term production |
| Owned | H100 ~$30K, DGX ~$400K | Highest | Large-scale, long-term |

8.2 The 2025 Accelerator Landscape

  • H100 to B200 (Blackwell): 2.5x performance at similar price
  • AMD MI300X: H100 alternative, 192GB memory
  • Groq LPU: inference specialized, highest tokens/sec
  • AWS Trainium/Inferentia: in-house chips, better price/perf
  • Google TPU v5: training and inference

8.3 Ten Cost-Optimization Tactics

  1. Spot instances + checkpointing: training is resumable. 70% off.
  2. Mixed precision (FP16/BF16): roughly 2x speed and memory savings (sketch after this list)
  3. Gradient accumulation: large batch effect on small GPUs
  4. Gradient checkpointing: half memory, 20% slower
  5. Quantization (INT8/INT4): 2 to 4x inference memory reduction
  6. LoRA/QLoRA: 99% savings on fine-tuning
  7. Model distillation: replicate performance in a smaller model
  8. Batching + dynamic batching: serving throughput
  9. Request caching: reuse results for repeated prompts
  10. Right-sizing: avoid overprovisioning (don't use A100 when A10 suffices)
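
A minimal sketch of tactic 2, mixed precision with PyTorch AMP; the model and data are toy placeholders and assume a CUDA GPU:

import torch

model = torch.nn.Linear(128, 2).cuda()          # toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()            # scales the loss to avoid FP16 underflow

for _ in range(100):
    x = torch.randn(64, 128).cuda()
    y = torch.randint(0, 2, (64,)).cuda()
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)             # forward pass runs in FP16 where safe
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()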

8.4 Cloud vs. Owning: Break-Even Point

Simple rule: 24/7 GPU usage, 3+ units, 1+ year expected — consider buying.

Reality: factor in infra team headcount, power/cooling, refresh cycles (3 to 4 years).
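
A rough back-of-the-envelope sketch using the prices from the table above; the power/ops overhead and lifetime figures are assumptions, not quotes:

# Rough single-H100 break-even sketch; all numbers are assumptions from the table above
on_demand_hr = 3.0                        # $/hr on-demand
hours_per_year = 24 * 365
cloud_per_year = on_demand_hr * hours_per_year               # ~$26,280 per GPU per year

purchase = 30_000                         # one H100, rough street price
lifetime_years = 4                        # typical refresh cycle
power_and_ops = 3_000                     # assumed $/yr share of power, cooling, ops
owned_per_year = purchase / lifetime_years + power_and_ops   # ~$10,500 per GPU per year

print(f"cloud: ${cloud_per_year:,.0f}/yr vs owned: ${owned_per_year:,.0f}/yr")
# At 24/7 utilization owning wins quickly; at 20 to 30% utilization the cloud usually wins.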


Part 9 — Data Pipelines

9.1 Batch vs. Streaming

| Batch | Streaming |
| --- | --- |
| Airflow, Prefect, Dagster | Kafka, Flink, Spark Streaming |
| Hourly/daily batches | Real-time |
| Lower cost | Higher cost |
| Delay tolerated | ms to s required |

9.2 2025 Orchestration

  • Airflow 2.x: standard, mature
  • Prefect: Pythonic, great UX
  • Dagster: type-safe, data-aware
  • Temporal: workflow specialized, restartable
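
A minimal daily-batch sketch with the Airflow 2.x TaskFlow API; the task bodies and the S3 path are placeholders:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def feature_pipeline():
    @task
    def extract() -> str:
        return "s3://bucket/raw/latest.parquet"     # placeholder path

    @task
    def validate(path: str) -> str:
        # run data-quality checks here (e.g. Great Expectations)
        return path

    @task
    def materialize(path: str) -> None:
        # compute features and push them to the offline/online stores
        print(f"materializing features from {path}")

    materialize(validate(extract()))

feature_pipeline()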

9.3 ML Data Quality Checks

  • Nulls: threshold on null ratio
  • Outliers: Z-score or Isolation Forest
  • Schema drift: column add/remove/type change
  • Range checks: age > 0, price > 0
  • Distribution: histogram vs. baseline
  • Uniqueness: no ID duplicates
  • Relationships: FK integrity

9.4 Great Expectations / Soda

import great_expectations as ge

# df is an existing pandas DataFrame of raw features;
# ge.from_pandas is the legacy shortcut API (newer GE versions validate via a DataContext)
df = ge.from_pandas(df)
df.expect_column_values_to_not_be_null("user_id")              # no missing IDs
df.expect_column_values_to_be_between("age", 0, 150)           # plausible range
df.expect_column_value_lengths_to_be_between("email", 5, 100)  # string length sanity check

Part 10 — Observability and Debugging

10.1 Three Pillars + ML-Specific

Generic apps:

  • Metrics (Prometheus)
  • Logs (Loki, Elasticsearch)
  • Traces (OpenTelemetry)

ML additions:

  • Prediction log: input + output + model version
  • Feature log: Feature Store lookup records
  • Drift metrics: distribution statistics
  • Explanations: SHAP, LIME

10.2 LLM Observability Tools

  • LangSmith: LangChain team
  • Langfuse: open-source
  • Helicone: proxy-based
  • Phoenix (Arize): open-source, strong

Part 11 — MLOps Mastery Roadmap (6 Months)

Month 1: Fundamentals + Serving

  • FastAPI + PyTorch model serving
  • Docker + K8s basics

Month 2: Training Infra

  • PyTorch DDP
  • Ray Train or DeepSpeed hands-on
  • MLflow for experiment tracking

Month 3: Feature Store + Data

  • Feast install and operations
  • Airflow or Dagster pipelines
  • Great Expectations data quality

Month 4: LLM Specialization

  • vLLM operations
  • Prompt management (Langfuse)
  • LLM-as-a-Judge eval

Month 5: Drift + A/B

  • Evidently AI for drift detection
  • GrowthBook for A/B testing
  • Shadow Deployment in practice

Month 6: Optimization + Scale

  • GPU cost monitoring
  • LoRA fine-tuning
  • Model distillation experiments

Part 12 — The 12-Point MLOps Checklist

  1. Know your team's position on the 5-level MLOps maturity scale
  2. Understand why vLLM beats legacy serving (PagedAttention)
  3. Articulate Feature Store's three roles
  4. Know the differences between DDP, FSDP, and ZeRO
  5. Know the memory-saving principle of LoRA vs. QLoRA
  6. Know MLflow's 5 logging targets
  7. Explain the 3 kinds of drift (Data/Concept/Label)
  8. Know Shadow Deployment's trade-offs
  9. Know when to pick Canary vs. A/B testing
  10. Know GPU spot instance pitfalls during training
  11. Know the difference between batch and dynamic batching
  12. Know the 3 elements of LLM observability (prompt, completion, metadata)

Part 13 — 10 MLOps Anti-Patterns

  1. Deploying from notebooks only: zero reproducibility. Move to pipelines.
  2. Splitting feature computation across training and serving: use a Feature Store or shared library.
  3. Deploying without an eval set: no way to detect regressions.
  4. No drift monitoring: quiet failure in 6 months.
  5. Single A/B metric: guardrails are mandatory.
  6. Picking GPUs "by feel": A10/A100/H100 need clear criteria.
  7. Spot instances without checkpointing: termination erases training.
  8. Big-bang deploy without Shadow: risk underestimated.
  9. Observability later: must be baked in from day one.
  10. Treating MLOps as ML team's job only: DevOps/Data team collaboration required.

Closing — MLOps is the "Invisible 70%"

The beauty of papers lies in model architecture, but the beauty of production lies in 30 invisible systems meshing together.

The 2025 AI/ML engineer divide:

  • "Can run a model" = entry level
  • "Can draw a pipeline/serving architecture" = senior
  • "Designs drift, cost, eval end-to-end" = staff+

Papers are public; operational know-how is not. That's why this area drives salary gaps.


Next Post — "Data Engineering Complete Guide: Lakehouse, Streaming, dbt, Orchestration, Data Mesh"

Season 2 Ep 8 is about the foundation beneath ML: data engineering. Next up:

  • Lakehouse architecture (Iceberg, Delta, Hudi)
  • Batch vs. Streaming (Flink, Kafka Streams, Spark Structured Streaming)
  • Data modeling with dbt + Elementary
  • Airflow vs. Prefect vs. Dagster vs. Temporal
  • The real meaning of Data Mesh
  • Data Contracts and Schema Registry

"The way data works" has changed in 2025 — continues in the next post.