MLOps Complete Guide — Model Serving, Feature Store, Drift, A/B Testing, GPU Economics (Season 2 Ep 7, 2025)

Intro — Why MLOps isn't just "DevOps + ML"

DevOps deploys code. MLOps deploys, monitors, and versions code + data + models simultaneously.

Four reasons MLOps is uniquely hard:

  1. Reproducibility: same code + same data yields different models (randomness, hardware differences)
  2. Drift: when data distributions shift, models rot in real time
  3. Latency: training is batch, serving is real-time — architecture must split
  4. Cost: one GPU costs $2,000 to $30,000 per month. Bad design evaporates a startup's entire budget

In 2024 to 2025, with the LLM era, MLOps expanded into LLMOps. This post covers both.


Part 1 — The 5 Levels of MLOps Maturity

| Level | Characteristics |
| --- | --- |
| Lv.0 | Manual: notebook to manual deploy. Small-scale experiments |
| Lv.1 | Automated training pipeline + manual deploy |
| Lv.2 | Auto training + auto deploy + monitoring |
| Lv.3 | Auto retraining (triggered on drift detection) |
| Lv.4 | Fully automated + linked to business metrics |

Most enterprises live at Lv.1 to Lv.2. Lv.3+ is Netflix, Uber, Airbnb territory.


Part 2 — Model Serving: the Inference System

2.1 General ML Serving

| Tool | Strength | Use |
| --- | --- | --- |
| TorchServe | PyTorch native | PyTorch standard |
| TensorFlow Serving | Mature, long-standing | TF models |
| Triton Inference Server (NVIDIA) | Multi-framework, dynamic batching | Production standard |
| BentoML | Python-friendly | Fast prototyping |
| KServe | Kubernetes native | K8s environments |

2.2 LLM Serving (2024 to 2025 standard)

| Tool | Characteristics |
| --- | --- |
| vLLM | PagedAttention, dominant throughput, open-source standard |
| TGI (Hugging Face) | Written in Rust, stable |
| TensorRT-LLM | NVIDIA optimized, top performance |
| SGLang | Optimized for complex workflows |
| llama.cpp | CPU, Mac, edge |

Default for open-source LLM production in 2025: vLLM.

2.3 vLLM's Innovation: PagedAttention

Classic attention serving allocates the KV cache in contiguous memory blocks; fragmentation and over-reservation waste 60 to 80% of that cache memory.

PagedAttention manages block-wise like OS virtual memory — under 4% waste, 2 to 4x throughput on concurrent requests.
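
The quickest way to feel this in practice is vLLM's offline batch API. A minimal sketch, assuming the checkpoint name below is only an example and that it fits in GPU memory:

from vllm import LLM, SamplingParams

# Example checkpoint; swap in whatever open-source model you actually serve
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)

For production serving you would typically run vLLM's OpenAI-compatible API server behind a load balancer rather than the offline API.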

2.4 Four Serving Patterns

  1. Online (real-time): millisecond response via an API server (minimal sketch after this list).
  2. Batch: bulk prediction (nightly jobs). Efficient.
  3. Streaming: event-driven (Kafka to model).
  4. Edge: on-device (mobile, IoT).
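
A minimal sketch of pattern 1, online serving with FastAPI and a PyTorch model; the model path, input schema, and lack of preprocessing are placeholder assumptions:

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load("model.pt")  # placeholder: a TorchScript model exported earlier
model.eval()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    x = torch.tensor(req.features).unsqueeze(0)   # single-row batch
    with torch.no_grad():
        y = model(x)
    return {"prediction": y.squeeze(0).tolist()}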

2.5 Serving Performance Metrics

  • Latency (P50, P95, P99): response time
  • Throughput (QPS): requests per second
  • TTFT (Time to First Token, LLM): time to first token
  • TPS (Tokens Per Second, LLM): generation speed
  • GPU Utilization: target 60 to 80% (too low wastes; too high explodes latency)

Part 3 — Feature Store: the ML Feature Compute Layer

3.1 Why a Feature Store

Prevents the disaster of feature computation diverging between training and serving.

Example: "purchase amount over last 7 days" — if the time boundary or aggregation logic differs by even 0.001% between training data and live requests, model performance craters.

3.2 Three Roles of a Feature Store

  1. Offline Store (training): Parquet, BigQuery, Snowflake — bulk lookup for training
  2. Online Store (serving): Redis, DynamoDB — low-latency lookup
  3. Feature Definition: central registry of "what this feature is and how it's computed"

3.3 2025 Feature Store Options

| Tool | Characteristics |
| --- | --- |
| Feast | Open-source standard, lightweight |
| Tecton | Commercial, enterprise |
| Hopsworks | End-to-end platform |
| Databricks Feature Store | Delta Lake integrated |
| Self-built | Redis + S3 + metadata DB |

3.4 Feast Example

from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

user = Entity(name="user_id", join_keys=["user_id"])

user_activity = FeatureView(
    name="user_activity_7d",
    entities=[user],
    schema=[
        Field(name="purchase_amount_7d", dtype=Float32),
        Field(name="click_count_7d", dtype=Int64),
    ],
    source=bigquery_source,  # a BigQuerySource defined elsewhere in the feature repo
    online=True,
    ttl=timedelta(days=7),
)

store = FeatureStore(repo_path=".")
feature_refs = [
    "user_activity_7d:purchase_amount_7d",
    "user_activity_7d:click_count_7d",
]

# Training: point-in-time correct join against the offline store
# (entity_df holds user_ids plus event timestamps)
features = store.get_historical_features(entity_df=entity_df, features=feature_refs).to_df()

# Serving: low-latency lookup from the online store
features = store.get_online_features(
    features=feature_refs, entity_rows=[{"user_id": 1}]
).to_dict()

Part 4 — Training Infra: Large-Scale Training

4.1 Single GPU to Distributed Training Progression

  1. Single GPU (~7B parameters)
  2. Data Parallel (DP): multiple GPUs replicate model + different data
  3. Distributed Data Parallel (DDP): DP improved, All-Reduce for gradient sync
  4. Model Parallel: model too big, split it
  5. Tensor Parallel: split within a single layer (Megatron-LM)
  6. Pipeline Parallel: distribute by layer across GPUs
  7. 3D Parallel: DP + TP + PP combined (GPT-4 class)

4.2 2025 Distributed Training Tools

| Tool | Characteristics |
| --- | --- |
| PyTorch DDP | Standard |
| DeepSpeed (Microsoft) | ZeRO optimization, essential for LLMs |
| FSDP (Meta) | PyTorch native, DeepSpeed alternative |
| Megatron-LM (NVIDIA) | Ultra-large models |
| Ray Train | Unified interface |
| Determined AI | Integrated experiment tracking |
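
A minimal DDP sketch, launched with `torchrun --nproc_per_node=<num_gpus> train.py`; the model and dataset here are toy placeholders:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group("nccl")              # one process per GPU, started by torchrun
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 2).cuda(rank)   # toy model
    model = DDP(model, device_ids=[rank])        # gradients synced via All-Reduce

    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(data)           # each rank sees a distinct shard
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
            loss.backward()                      # All-Reduce of gradients happens here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()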

4.3 ZeRO (Zero Redundancy Optimizer)

Shards optimizer state across GPUs, dramatically reducing memory:

  • ZeRO-1: optimizer state sharding
  • ZeRO-2: + gradient sharding
  • ZeRO-3: + model parameter sharding (similar to FSDP)
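
For orientation, a minimal ZeRO-2 DeepSpeed config expressed as a Python dict; the batch sizes are placeholders, and in practice you pass this (or an equivalent JSON file) to deepspeed.initialize:

# Minimal DeepSpeed config sketch: ZeRO stage 2 (optimizer states + gradients sharded)
ds_config = {
    "train_micro_batch_size_per_gpu": 4,         # placeholder
    "gradient_accumulation_steps": 8,            # placeholder
    "bf16": {"enabled": True},                   # mixed precision
    "zero_optimization": {
        "stage": 2,                              # 1: optimizer states, 2: + gradients, 3: + parameters
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload for extra headroom
    },
}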

4.4 Lightweight Fine-tuning Techniques

| Technique | Savings |
| --- | --- |
| LoRA | Only ~1% of parameters trained |
| QLoRA | LoRA + 4-bit quantization; fine-tune 70B on a single GPU |
| DoRA | LoRA improved (magnitude/direction split) |
| GaLore | Near full-parameter quality + memory savings |
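
A minimal LoRA setup with Hugging Face PEFT; the base checkpoint and target modules below are examples and vary by architecture:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example checkpoint

lora = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()          # typically well under 1% of total parameters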

Part 5 — Experiment Tracking: Recording Experiments

5.1 Why It Matters

"Why was that model 3 months ago so good?" — no answer = no reproducibility = worthless.

5.2 2025 Tool Comparison

| Tool | Pros | Cons |
| --- | --- | --- |
| MLflow | Open-source, self-hostable | Plain UI |
| Weights & Biases | Best UI, great for collaboration | SaaS, cost |
| Neptune.ai | Strong metadata | SMB-sized |
| Comet | W&B alternative | SMB-sized |
| ClearML | Open-source, includes pipelines | Learning curve |

5.3 MLflow Basics

import mlflow
import mlflow.pytorch

mlflow.set_experiment("image_classifier")

# Assumes model, loader, val_loader, epochs, and the train()/validate() helpers exist elsewhere
with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "batch_size": 64})       # hyperparameters

    for epoch in range(epochs):
        train_loss = train(model, loader)
        val_acc = validate(model, val_loader)
        mlflow.log_metrics({"train_loss": train_loss, "val_acc": val_acc}, step=epoch)

    mlflow.pytorch.log_model(model, "model")                  # weights + architecture
    mlflow.log_artifact("confusion_matrix.png")               # any extra file as an artifact

5.4 Ten Things to Track

  1. Hyperparameters: LR, batch size, optimizer
  2. Data version: DVC or Delta Lake hash
  3. Code version: git commit hash
  4. Metrics: train/val loss, acc, AUC, etc.
  5. Model artifact: weights, architecture
  6. Environment: Python version, GPU type, requirements
  7. Training time: total and per-epoch
  8. Resources: GPU memory, CPU usage
  9. Random seed: reproducibility
  10. Dataset stats: class distribution, sample count

Part 6 — Drift Detection: How Models Rot

6.1 Three Kinds of Drift

  1. Data Drift (Covariate Shift): input distribution changes
    • Example: pre/post-COVID shopping patterns
  2. Concept Drift: input to output relationship changes
    • Example: the definition of spam itself shifts
  3. Label Drift: label distribution changes
    • Example: fraud rate jumps from 1% to 5%

6.2 Detection Methods

Statistical (a KS/PSI sketch follows after these lists):

  • KS Test (single feature)
  • PSI (Population Stability Index)
  • Wasserstein Distance
  • Chi-Square (categorical)

ML-based:

  • Domain Classifier (training vs. production classifier)
  • Autoencoder reconstruction error

Performance-based:

  • Actual performance after delayed labels arrive
  • Proxy metrics (CTR, conversion rate)
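
A minimal sketch of the two most common statistical checks, the KS test (via SciPy) and a hand-rolled PSI; the 0.1/0.2 thresholds are common rules of thumb, not laws:

import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)   # avoid log(0) / division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

baseline = np.random.normal(0, 1, 10_000)         # training-time feature distribution
production = np.random.normal(0.3, 1.2, 10_000)   # shifted production distribution

stat, p_value = ks_2samp(baseline, production)
print(f"KS p-value: {p_value:.4f}, PSI: {psi(baseline, production):.3f}")
# Rule of thumb: PSI < 0.1 stable, 0.1 to 0.2 watch, > 0.2 investigate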

6.3 2025 Drift Tools

  • Evidently AI: open-source, dashboards
  • Arize AI: commercial, LLM + ML unified
  • WhyLabs: data quality + drift
  • Fiddler: enterprise

6.4 LLM-Specific Problems

  • Prompt Drift: prompt distribution changes (user trends)
  • Response Drift: response quality degradation (model update effects)
  • Cost Drift: average token count creeps up — costs explode

Part 7 — Model A/B Testing and Deployment Strategies

7.1 Five Deployment Strategies

| Strategy | Description | Risk |
| --- | --- | --- |
| Blue-Green | Full swap of old/new environments | Medium |
| Canary | N% on new model, gradual expansion | Low |
| A/B | Split users precisely in half | Low (stats required) |
| Shadow | New model processes real traffic, response comes from old | Safest |
| Multi-armed Bandit | Automatic traffic reallocation | Intelligent |

7.2 A/B Test Design

  1. Hypothesis: "new model lifts CTR by at least 5%"
  2. Sample size calculation: 80% statistical power, 5% significance
  3. Randomization: user ID hash-based
  4. Duration: include weekly patterns (minimum 1 week)
  5. Guardrail metrics: primary metric + safety-net metrics (latency, error rate)
  6. Analysis: significance + subgroup impact
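
A minimal sketch of step 2, sample-size calculation with statsmodels, for detecting a 5% relative CTR lift at 80% power and 5% significance; the baseline CTR is an assumption:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.05                       # assumed current CTR
target_ctr = baseline_ctr * 1.05          # hypothesis: at least a 5% relative lift

effect = proportion_effectsize(target_ctr, baseline_ctr)
n = NormalIndPower().solve_power(effect_size=effect, power=0.8, alpha=0.05)
print(f"~{int(n):,} users per variant")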

7.3 Shadow Deployment

User Request
  |
[Router]
  |-- Prod Model --> User Response (returned)
  +-- Shadow Model --> Log (invisible to user)

Pros: zero user-facing risk, validation on real traffic. Cons: roughly 2x serving cost, and models with side effects (writes, downstream actions) are hard to shadow safely.
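
A minimal sketch of the router above in FastAPI: the prod model answers the user, the shadow model runs in the background and only logs. prod_predict and shadow_predict are placeholder functions:

import asyncio
import logging
from fastapi import FastAPI

app = FastAPI()
log = logging.getLogger("shadow")

async def prod_predict(payload: dict) -> dict:      # placeholder: call the current model
    return {"score": 0.87, "model": "prod-v1"}

async def shadow_predict(payload: dict) -> dict:    # placeholder: call the candidate model
    return {"score": 0.91, "model": "candidate-v2"}

@app.post("/predict")
async def predict(payload: dict):
    result = await prod_predict(payload)            # the user only ever sees this

    async def run_shadow():
        try:
            shadow = await shadow_predict(payload)
            log.info("shadow comparison: prod=%s shadow=%s", result, shadow)
        except Exception:                            # shadow failures must never reach the user
            log.exception("shadow model failed")

    asyncio.create_task(run_shadow())                # fire and forget
    return result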

7.4 2024 to 2025 Experiment Platforms

  • Eppo: statistically rigorous
  • GrowthBook: open-source
  • Statsig: Facebook alumni
  • Self-built: favored by large enterprises

Part 8 — GPU Economics 2025

8.1 GPU Option Comparison

| Option | Price (H100 baseline) | Flexibility | Good For |
| --- | --- | --- | --- |
| On-demand cloud | ~$3/hr | Highest | Small, irregular workloads |
| Spot/Preemptible | $1 to $1.5/hr (60 to 70% off) | Low | Batch training |
| Reserved (1 to 3 years) | ~$1.5 to $2/hr | Medium | Predictable workloads |
| Dedicated | Thousands of $/month | High | Long-term production |
| Owned | H100 ~$30K, DGX ~$400K | Highest | Large-scale, long-term |

8.2 The 2025 Accelerator Landscape

  • H100 to B200 (Blackwell): 2.5x performance at similar price
  • AMD MI300X: H100 alternative, 192GB memory
  • Groq LPU: inference specialized, highest tokens/sec
  • AWS Trainium/Inferentia: in-house chips, better price/perf
  • Google TPU v5: training and inference

8.3 Ten Cost-Optimization Tactics

  1. Spot instances + checkpointing: training is resumable. 70% off.
  2. Mixed precision (FP16/BF16): roughly 2x speed and memory savings (sketch after this list)
  3. Gradient accumulation: large batch effect on small GPUs
  4. Gradient checkpointing: half memory, 20% slower
  5. Quantization (INT8/INT4): 2 to 4x inference memory reduction
  6. LoRA/QLoRA: 99% savings on fine-tuning
  7. Model distillation: replicate performance in a smaller model
  8. Batching + dynamic batching: serving throughput
  9. Request caching: reuse results for repeated prompts
  10. Right-sizing: avoid overprovisioning (don't use A100 when A10 suffices)
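
A minimal sketch of tactic 2, mixed precision with PyTorch AMP; the model and data are toy placeholders and assume a CUDA GPU:

import torch

model = torch.nn.Linear(128, 2).cuda()          # toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()            # scales the loss to avoid FP16 underflow

for _ in range(100):
    x = torch.randn(64, 128).cuda()
    y = torch.randint(0, 2, (64,)).cuda()
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)             # forward pass runs in FP16 where safe
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()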

8.4 Cloud vs. Owning: Break-Even Point

Simple rule: 24/7 GPU usage, 3+ units, 1+ year expected — consider buying.

Reality: factor in infra team headcount, power/cooling, refresh cycles (3 to 4 years).
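
A rough back-of-the-envelope sketch using the prices from the table above; the power/ops overhead and lifetime figures are assumptions, not quotes:

# Rough single-H100 break-even sketch; all numbers are assumptions from the table above
on_demand_hr = 3.0                        # $/hr on-demand
hours_per_year = 24 * 365
cloud_per_year = on_demand_hr * hours_per_year               # ~$26,280 per GPU per year

purchase = 30_000                         # one H100, rough street price
lifetime_years = 4                        # typical refresh cycle
power_and_ops = 3_000                     # assumed $/yr share of power, cooling, ops
owned_per_year = purchase / lifetime_years + power_and_ops   # ~$10,500 per GPU per year

print(f"cloud: ${cloud_per_year:,.0f}/yr vs owned: ${owned_per_year:,.0f}/yr")
# At 24/7 utilization owning wins quickly; at 20 to 30% utilization the cloud usually wins.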


Part 9 — Data Pipelines

9.1 Batch vs. Streaming

| Batch | Streaming |
| --- | --- |
| Airflow, Prefect, Dagster | Kafka, Flink, Spark Streaming |
| Hourly/daily batches | Real-time |
| Lower cost | Higher cost |
| Delay tolerated | ms to s required |

9.2 2025 Orchestration

  • Airflow 2.x: standard, mature
  • Prefect: Pythonic, great UX
  • Dagster: type-safe, data-aware
  • Temporal: workflow specialized, restartable
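
A minimal daily-batch sketch with the Airflow 2.x TaskFlow API; the task bodies and the S3 path are placeholders:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def feature_pipeline():
    @task
    def extract() -> str:
        return "s3://bucket/raw/latest.parquet"     # placeholder path

    @task
    def validate(path: str) -> str:
        # run data-quality checks here (e.g. Great Expectations)
        return path

    @task
    def materialize(path: str) -> None:
        # compute features and push them to the offline/online stores
        print(f"materializing features from {path}")

    materialize(validate(extract()))

feature_pipeline()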

9.3 ML Data Quality Checks

  • Nulls: threshold on null ratio
  • Outliers: Z-score or Isolation Forest
  • Schema drift: column add/remove/type change
  • Range checks: age > 0, price > 0
  • Distribution: histogram vs. baseline
  • Uniqueness: no ID duplicates
  • Relationships: FK integrity

9.4 Great Expectations / Soda

import great_expectations as ge

# df is an existing pandas DataFrame of raw features;
# ge.from_pandas is the legacy shortcut API (newer GE versions validate via a DataContext)
df = ge.from_pandas(df)
df.expect_column_values_to_not_be_null("user_id")              # no missing IDs
df.expect_column_values_to_be_between("age", 0, 150)           # plausible range
df.expect_column_value_lengths_to_be_between("email", 5, 100)  # string length sanity check

Part 10 — Observability and Debugging

10.1 Three Pillars + ML-Specific

Generic apps:

  • Metrics (Prometheus)
  • Logs (Loki, Elasticsearch)
  • Traces (OpenTelemetry)

ML additions:

  • Prediction log: input + output + model version
  • Feature log: Feature Store lookup records
  • Drift metrics: distribution statistics
  • Explanations: SHAP, LIME

10.2 LLM Observability Tools

  • LangSmith: LangChain team
  • Langfuse: open-source
  • Helicone: proxy-based
  • Phoenix (Arize): open-source, strong

Part 11 — MLOps Mastery Roadmap (6 Months)

Month 1: Fundamentals + Serving

  • FastAPI + PyTorch model serving
  • Docker + K8s basics

Month 2: Training Infra

  • PyTorch DDP
  • Ray Train or DeepSpeed hands-on
  • MLflow for experiment tracking

Month 3: Feature Store + Data

  • Feast install and operations
  • Airflow or Dagster pipelines
  • Great Expectations data quality

Month 4: LLM Specialization

  • vLLM operations
  • Prompt management (Langfuse)
  • LLM-as-a-Judge eval

Month 5: Drift + A/B

  • Evidently AI for drift detection
  • GrowthBook for A/B testing
  • Shadow Deployment in practice

Month 6: Optimization + Scale

  • GPU cost monitoring
  • LoRA fine-tuning
  • Model distillation experiments

Part 12 — The 12-Point MLOps Checklist

  1. Know your team's position on the 5-level MLOps maturity scale
  2. Understand why vLLM beats legacy serving (PagedAttention)
  3. Articulate Feature Store's three roles
  4. Know the differences between DDP, FSDP, and ZeRO
  5. Know the memory-saving principle of LoRA vs. QLoRA
  6. Know MLflow's 5 logging targets
  7. Explain the 3 kinds of drift (Data/Concept/Label)
  8. Know Shadow Deployment's trade-offs
  9. Know when to pick Canary vs. A/B testing
  10. Know GPU spot instance pitfalls during training
  11. Know the difference between batch and dynamic batching
  12. Know the 3 elements of LLM observability (prompt, completion, metadata)

Part 13 — 10 MLOps Anti-Patterns

  1. Deploying from notebooks only: zero reproducibility. Move to pipelines.
  2. Splitting feature computation across training and serving: use a Feature Store or shared library.
  3. Deploying without an eval set: no way to detect regressions.
  4. No drift monitoring: quiet failure in 6 months.
  5. Single A/B metric: guardrails are mandatory.
  6. Picking GPUs "by feel": A10/A100/H100 need clear criteria.
  7. Spot instances without checkpointing: termination erases training.
  8. Big-bang deploy without Shadow: risk underestimated.
  9. Observability later: must be baked in from day one.
  10. Treating MLOps as ML team's job only: DevOps/Data team collaboration required.

Closing — MLOps is the "Invisible 70%"

The beauty of papers lies in model architecture, but the beauty of production lies in 30 invisible systems meshing together.

The 2025 AI/ML engineer divide:

  • "Can run a model" = entry level
  • "Can draw a pipeline/serving architecture" = senior
  • "Designs drift, cost, eval end-to-end" = staff+

Papers are public; operational know-how is not. That's why this area drives salary gaps.


Next Post — "Data Engineering Complete Guide: Lakehouse, Streaming, dbt, Orchestration, Data Mesh"

Season 2 Ep 8 is about the foundation beneath ML: data engineering. Next up:

  • Lakehouse architecture (Iceberg, Delta, Hudi)
  • Batch vs. Streaming (Flink, Kafka Streams, Spark Structured Streaming)
  • Data modeling with dbt + Elementary
  • Airflow vs. Prefect vs. Dagster vs. Temporal
  • The real meaning of Data Mesh
  • Data Contracts and Schema Registry

"The way data works" has changed in 2025 — continues in the next post.