✍️ Transcription Mode: MLOps Complete Guide — Model Serving, Feature Store, Drift, A/B Testing, GPU Economics (Season 2 Ep 7, 2025)
Intro — Why MLOps isn't just "DevOps + ML"
DevOps deploys code. MLOps deploys, monitors, and versions code + data + models simultaneously.
Four reasons MLOps is uniquely hard:
- Reproducibility: same code + same data yields different models (randomness, hardware differences)
- Drift: when data distributions shift, models rot in real time
- Latency: training is batch, serving is real-time — architecture must split
- Cost: a single GPU server can run $30,000/month. Bad design evaporates a startup's entire budget
In 2024 to 2025, the LLM era expanded MLOps into LLMOps. This post covers both.
Part 1 — The 5 Levels of MLOps Maturity (after Microsoft's Azure maturity model; Google's own guide defines only 3 levels)
| Level | Characteristics |
|---|---|
| Lv.0 | Manual: notebook to manual deploy. Small-scale experiments |
| Lv.1 | Automated training pipeline + manual deploy |
| Lv.2 | Auto training + auto deploy + monitoring |
| Lv.3 | Auto retraining (triggered on drift detection) |
| Lv.4 | Fully automated + linked to business metrics |
Most enterprises live at Lv.1 to Lv.2. Lv.3+ is Netflix, Uber, Airbnb territory.
Part 2 — Model Serving: the Inference System
2.1 General ML Serving
| Tool | Strength | Use |
|---|---|---|
| TorchServe | PyTorch native | PyTorch standard |
| TensorFlow Serving | Mature, long-standing | TF models |
| Triton Inference Server (NVIDIA) | Multi-framework, dynamic batching | Production standard |
| BentoML | Python-friendly | Fast prototyping |
| KServe | Kubernetes native | K8s environments |
2.2 LLM Serving (2024 to 2025 standard)
| Tool | Characteristics |
|---|---|
| vLLM | PagedAttention, dominant throughput, open-source standard |
| TGI (HuggingFace) | Written in Rust, stable |
| TensorRT-LLM | NVIDIA optimized, top performance |
| SGLang | Optimized for complex workflows |
| llama.cpp | CPU, Mac, edge |
Default for open-source LLM production in 2025: vLLM.
2.3 vLLM's Innovation: PagedAttention
Classic attention KV cache uses contiguous memory allocation — heavy fragmentation wastes 60 to 80% of GPU memory.
PagedAttention manages block-wise like OS virtual memory — under 4% waste, 2 to 4x throughput on concurrent requests.
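A minimal taste of vLLM's offline API (the model name and sampling values here are illustrative, not recommendations):
from vllm import LLM, SamplingParams
# PagedAttention is vLLM's default memory manager; nothing extra to enable.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any HF-format model path
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
For production, the same engine runs behind vLLM's OpenAI-compatible HTTP server (the vllm serve CLI in recent releases).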
2.4 Four Serving Patterns
- Online (real-time): millisecond response. API server.
- Batch: bulk prediction (nightly jobs). Efficient.
- Streaming: event-driven (Kafka to model).
- Edge: on-device (mobile, IoT).
2.5 Serving Performance Metrics
- Latency (P50, P95, P99): response time
- Throughput (QPS): requests per second
- TTFT (Time to First Token, LLM): time to first token
- TPS (Tokens Per Second, LLM): generation speed
- GPU Utilization: target 60 to 80% (too low wastes; too high explodes latency)
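A tiny sketch of turning raw latency samples into the percentiles above (the lognormal data is fake, purely for illustration):
import numpy as np
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.5, size=10_000)  # fake samples
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms P95={p95:.0f}ms P99={p99:.0f}ms")
# P99 is what your unluckiest 1% of users feel; averages hide it.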
Part 3 — Feature Store: the ML Feature Compute Layer
3.1 Why a Feature Store
Prevents the disaster of feature computation diverging between training and serving.
Example: "purchase amount over last 7 days" — if the time boundary or aggregation logic differs by even 0.001% between training data and live requests, model performance craters.
3.2 Three Roles of a Feature Store
- Offline Store (training): Parquet, BigQuery, Snowflake — bulk lookup for training
- Online Store (serving): Redis, DynamoDB — low-latency lookup
- Feature Definition: central registry of "what this feature is and how it's computed"
3.3 2025 Feature Store Options
| Tool | Characteristics |
|---|---|
| Feast | Open-source standard, lightweight |
| Tecton | Commercial, enterprise |
| Hopsworks | End-to-end platform |
| Databricks Feature Store | Delta Lake integrated |
| Self-built | Redis + S3 + metadata DB |
3.4 Feast Example
from datetime import timedelta
from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

user = Entity(name="user_id", join_keys=["user_id"])  # current Feast uses join_keys, not value_type

bigquery_source = BigQuerySource(  # offline source; table name is illustrative
    table="project.dataset.user_activity",
    timestamp_field="event_timestamp",
)

user_activity = FeatureView(
    name="user_activity_7d",
    entities=[user],
    schema=[  # current Feast uses schema= rather than features=
        Field(name="purchase_amount_7d", dtype=Float32),
        Field(name="click_count_7d", dtype=Int64),
    ],
    source=bigquery_source,
    online=True,
    ttl=timedelta(days=7),
)

store = FeatureStore(repo_path=".")
feature_refs = ["user_activity_7d:purchase_amount_7d", "user_activity_7d:click_count_7d"]

# Training: point-in-time-correct bulk lookup (entity_df holds entity keys + event timestamps)
training_df = store.get_historical_features(entity_df, feature_refs).to_df()

# Serving: low-latency lookup from the online store
online = store.get_online_features(
    features=feature_refs, entity_rows=[{"user_id": 1}]
).to_dict()
Part 4 — Training Infra: Large-Scale Training
4.1 Single GPU to Distributed Training Progression
- Single GPU (~7B parameters)
- Data Parallel (DP): multiple GPUs replicate model + different data
- Distributed Data Parallel (DDP): DP improved, All-Reduce for gradient sync; minimal sketch after this list
- Model Parallel: model too big, split it
- Tensor Parallel: split within a single layer (Megatron-LM)
- Pipeline Parallel: distribute by layer across GPUs
- 3D Parallel: DP + TP + PP combined (GPT-4 class)
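A minimal single-node DDP sketch, assuming a torchrun launch; the model and data are placeholders:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(128, 10).to(rank), device_ids=[rank])  # placeholder model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(10):  # placeholder loop; real data needs a DistributedSampler
        x = torch.randn(64, 128, device=rank)
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # gradients are All-Reduced across ranks here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
Launch with torchrun --nproc_per_node=4 train.py; torchrun sets LOCAL_RANK for each process.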
4.2 2025 Distributed Training Tools
| Tool | Characteristics |
|---|---|
| PyTorch DDP | Standard |
| DeepSpeed (MS) | ZeRO optimization, essential for LLMs |
| FSDP (Meta) | PyTorch native, DeepSpeed alternative |
| Megatron-LM (NVIDIA) | Ultra-large models |
| Ray Train | Unified interface |
| Determined AI | Integrated experiment tracking |
4.3 ZeRO (Zero Redundancy Optimizer)
Shards optimizer state across GPUs, dramatically reducing memory:
- ZeRO-1: optimizer state sharding
- ZeRO-2: + gradient sharding
- ZeRO-3: + model parameter sharding (similar to FSDP)
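The stages map directly onto a DeepSpeed config; an illustrative, untuned ZeRO-2 setup:
ds_config = {  # all values are placeholders, not tuned recommendations
    "train_batch_size": 256,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # 1: optimizer states, 2: + gradients, 3: + parameters
        "overlap_comm": True,
    },
}
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)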
4.4 Lightweight Fine-tuning Techniques
| Technique | Savings |
|---|---|
| LoRA | Only ~1% of parameters trained |
| QLoRA | LoRA + 4-bit quantization — fine-tune 70B on a single GPU |
| DoRA | LoRA improved (magnitude/direction split) |
| GaLore | Near full-parameter quality + memory savings (gradient low-rank projection) |
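What LoRA looks like in practice with HF PEFT (model name, rank, and target modules are illustrative):
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])  # attention projections only
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters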
Part 5 — Experiment Tracking: Recording Experiments
5.1 Why It Matters
"Why was that model 3 months ago so good?" — no answer = no reproducibility = worthless.
5.2 2025 Tool Comparison
| Tool | Pros | Cons |
|---|---|---|
| MLflow | Open-source, self-hostable | Plain UI |
| Weights & Biases | Best UI, great for collab | SaaS, cost |
| Neptune.ai | Strong metadata | Smaller vendor |
| Comet | W&B alternative | Smaller vendor |
| ClearML | Open-source, includes pipelines | Learning curve |
5.3 MLflow Basics
import mlflow
import mlflow.pytorch

mlflow.set_experiment("image_classifier")
with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "batch_size": 64})
    for epoch in range(epochs):  # epochs, model, loaders, train(), validate() assumed defined
        train_loss = train(model, loader)
        val_acc = validate(model, val_loader)
        mlflow.log_metrics({"train_loss": train_loss, "val_acc": val_acc}, step=epoch)
    mlflow.pytorch.log_model(model, "model")
    mlflow.log_artifact("confusion_matrix.png")
5.4 Ten Things to Track
- Hyperparameters: LR, batch size, optimizer
- Data version: DVC or Delta Lake hash
- Code version: git commit hash
- Metrics: train/val loss, acc, AUC, etc.
- Model artifact: weights, architecture
- Environment: Python version, GPU type, requirements
- Training time: total and per-epoch
- Resources: GPU memory, CPU usage
- Random seed: reproducibility
- Dataset stats: class distribution, sample count
Part 6 — Drift Detection: How Models Rot
6.1 Three Kinds of Drift
- Data Drift (Covariate Shift): input distribution changes
- Example: pre/post-COVID shopping patterns
- Concept Drift: input to output relationship changes
- Example: the definition of spam itself shifts
- Label Drift: label distribution changes
- Example: fraud rate jumps from 1% to 5%
6.2 Detection Methods
Statistical:
- KS Test (single feature)
- PSI (Population Stability Index); hand-rolled sketch after this list
- Wasserstein Distance
- Chi-Square (categorical)
ML-based:
- Domain Classifier: train a classifier to separate training data from production data; if it can, the distributions have drifted
- Autoencoder reconstruction error
Performance-based:
- Actual performance after delayed labels arrive
- Proxy metrics (CTR, conversion rate)
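PSI is simple enough to hand-roll; a sketch (10 bins and the 0.1/0.25 thresholds are common conventions, not laws):
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c = np.histogram(current, bins=edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)  # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))

# Rule of thumb: < 0.1 stable, 0.1 to 0.25 moderate shift, > 0.25 act now.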
6.3 2025 Drift Tools
- Evidently AI: open-source, dashboards (example below)
- Arize AI: commercial, LLM + ML unified
- WhyLabs: data quality + drift
- Fiddler: enterprise
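Evidently's drift report takes a few lines (0.4.x-style imports; newer releases reorganized the API):
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)  # two pandas DataFrames
report.save_html("drift_report.html")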
6.4 LLM-Specific Problems
- Prompt Drift: prompt distribution changes (user trends)
- Response Drift: response quality degradation (model update effects)
- Cost Drift: average token count creeps up — costs explode
Part 7 — Model A/B Testing and Deployment Strategies
7.1 Five Deployment Strategies
| Strategy | Description | Risk |
|---|---|---|
| Blue-Green | Full swap of old/new environments | Medium |
| Canary | N% on new model, gradual expansion | Low |
| A/B | Split users precisely in half | Low (stats required) |
| Shadow | New model processes real traffic, response comes from old | Safest |
| Multi-armed Bandit | Automatic traffic reallocation | Intelligent |
7.2 A/B Test Design
- Hypothesis: "new model lifts CTR by at least 5%"
- Sample size calculation: 80% statistical power, 5% significance (sketch after this list)
- Randomization: user ID hash-based
- Duration: include weekly patterns (minimum 1 week)
- Guardrail metrics: primary metric + safety-net metrics (latency, error rate)
- Analysis: significance + subgroup impact
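The sample-size step, sketched with statsmodels for a proportion metric like CTR (baseline rate and lift are illustrative):
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, treated = 0.050, 0.0525  # +5% relative lift on a 5% CTR
effect = proportion_effectsize(baseline, treated)
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"~{int(n):,} users per arm")  # small lifts need surprisingly large samples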
7.3 Shadow Deployment
User Request
     |
  [Router]
     |-- Prod Model   --> user response (returned)
     +-- Shadow Model --> log only (invisible to user)
Pros: zero user-facing risk, validated on real traffic. Cons: roughly 2x inference cost, and side-effecting logic (writes, downstream calls) is hard to validate safely in shadow.
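The routing logic as a Python sketch; prod_model, shadow_model, and log_shadow are placeholders for your own serving stack:
import asyncio

async def handle(request):
    result = await prod_model(request)                # user-facing answer
    asyncio.create_task(run_shadow(request, result))  # fire-and-forget, never blocks
    return result

async def run_shadow(request, prod_result):
    try:
        shadow_result = await shadow_model(request)
        log_shadow(request, prod_result, shadow_result)  # diffed offline later
    except Exception:
        pass  # a shadow failure must never touch the user path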
7.4 2024 to 2025 Experiment Platforms
- Eppo: statistically rigorous
- GrowthBook: open-source
- Statsig: Facebook alumni
- Self-built: favored by large enterprises
Part 8 — GPU Economics 2025
8.1 GPU Option Comparison
| Option | Price (H100 baseline) | Flexibility | Good For |
|---|---|---|---|
| On-demand Cloud | ~$3/hr | Highest | Small, irregular |
| Spot/Preemptible | ~$1.50/hr (60 to 70% off) | Low | Batch training |
| Reserved (1 to 3 years) | ~$2/hr | Medium | Predictable workloads |
| Dedicated | Thousands of $/month | High | Long-term production |
| Owned | ~$400K upfront | Highest | Large-scale, long-term |
8.2 2024 to 2025 Trends
- H100 to B200 (Blackwell): 2.5x performance at similar price
- AMD MI300X: H100 alternative, 192GB memory
- Groq LPU: inference specialized, highest tokens/sec
- AWS Trainium/Inferentia: in-house chips, better price/perf
- Google TPU v5: training and inference
8.3 Ten Cost-Optimization Tactics
- Spot instances + checkpointing: training is resumable. 70% off.
- Mixed precision (FP16/BF16): ~2x speed, roughly half the memory (combined sketch after this list)
- Gradient accumulation: large batch effect on small GPUs
- Gradient checkpointing: half memory, 20% slower
- Quantization (INT8/INT4): 2 to 4x inference memory reduction
- LoRA/QLoRA: 99% savings on fine-tuning
- Model distillation: replicate performance in a smaller model
- Batching + dynamic batching: multiplies serving throughput
- Request caching: reuse results for repeated prompts
- Right-sizing: avoid overprovisioning (don't use A100 when A10 suffices)
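Two of the tactics above combined, BF16 autocast plus gradient accumulation (model, loader, and opt assumed defined; ACCUM is illustrative):
import torch

ACCUM = 8  # effective batch = micro-batch x ACCUM
for step, (x, y) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.cross_entropy(model(x), y) / ACCUM
    loss.backward()
    if (step + 1) % ACCUM == 0:
        opt.step()  # one optimizer step per ACCUM micro-batches
        opt.zero_grad()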
8.4 Cloud vs. Owning: Break-Even Point
Simple rule: 24/7 GPU usage, 3+ units, 1+ year expected — consider buying.
Reality: factor in infra team headcount, power/cooling, refresh cycles (3 to 4 years).
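Back-of-envelope version of that rule; every number below is an assumption to replace with your own quotes:
cloud_per_hr = 3.0                            # on-demand H100, $/hr (assumed)
cloud_per_gpu_year = cloud_per_hr * 24 * 365  # about $26,280
own_capex_per_gpu = 30_000                    # rough per-card price (assumed)
own_opex_per_gpu_year = 5_000                 # power, cooling, ops share (assumed)
# One year of 24/7 cloud roughly equals capex plus a year of opex,
# which is why "24/7 for a year" is the usual break-even threshold.
print(cloud_per_gpu_year, own_capex_per_gpu + own_opex_per_gpu_year)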
Part 9 — Data Pipelines
9.1 Batch vs. Streaming
| Batch | Streaming |
|---|---|
| Airflow, Prefect, Dagster | Kafka, Flink, Spark Streaming |
| Hourly/daily batches | Real-time |
| Lower cost | Higher cost |
| Delay tolerated | ms to s required |
9.2 2025 Orchestration
- Airflow 2.x: standard, mature
- Prefect: Pythonic, great UX
- Dagster: type-safe, data-aware
- Temporal: durable workflow execution, resumes after failures
9.3 ML Data Quality Checks
- Nulls: threshold on null ratio
- Outliers: Z-score or Isolation Forest
- Schema drift: column add/remove/type change
- Range checks: age > 0, price > 0
- Distribution: histogram vs. baseline
- Uniqueness: no ID duplicates
- Relationships: FK integrity
9.4 Great Expectations / Soda
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("events.parquet")  # placeholder input
gdf = ge.from_pandas(df)  # classic pandas API; newer GE versions use validators/suites
gdf.expect_column_values_to_not_be_null("user_id")
gdf.expect_column_values_to_be_between("age", 0, 150)
gdf.expect_column_value_lengths_to_be_between("email", 5, 100)
Part 10 — Observability and Debugging
10.1 Three Pillars + ML-Specific
Generic apps:
- Metrics (Prometheus)
- Logs (Loki, Elasticsearch)
- Traces (OpenTelemetry)
ML additions:
- Prediction log: input + output + model version (sketch below)
- Feature log: Feature Store lookup records
- Drift metrics: distribution statistics
- Explanations: SHAP, LIME
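A sketch of one structured prediction-log line (field names are illustrative, not a standard):
import json, time, uuid

def log_prediction(features: dict, prediction, model_version: str) -> None:
    print(json.dumps({
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model_version": model_version,  # ties every prediction to a deployable
        "features": features,            # or a hash/pointer if payloads are large
        "prediction": prediction,
    }))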
10.2 LLM Observability Tools
- LangSmith: LangChain team
- Langfuse: open-source
- Helicone: proxy-based
- Phoenix (Arize): open-source, strong
Part 11 — MLOps Mastery Roadmap (6 Months)
Month 1: Fundamentals + Serving
- FastAPI + PyTorch model serving
- Docker + K8s basics
Month 2: Training Infra
- PyTorch DDP
- Ray Train or DeepSpeed hands-on
- MLflow for experiment tracking
Month 3: Feature Store + Data
- Feast install and operations
- Airflow or Dagster pipelines
- Great Expectations data quality
Month 4: LLM Specialization
- vLLM operations
- Prompt management (Langfuse)
- LLM-as-a-Judge eval
Month 5: Drift + A/B
- Evidently AI for drift detection
- GrowthBook for A/B testing
- Shadow Deployment in practice
Month 6: Optimization + Scale
- GPU cost monitoring
- LoRA fine-tuning
- Model distillation experiments
Part 12 — The 12-Item MLOps Checklist
- Know your team's position on the 5-level MLOps maturity scale
- Understand why vLLM beats legacy serving (PagedAttention)
- Articulate Feature Store's three roles
- Know the differences between DDP, FSDP, and ZeRO
- Know the memory-saving principle of LoRA vs. QLoRA
- Know MLflow's 5 logging targets
- Explain the 3 kinds of drift (Data/Concept/Label)
- Know Shadow Deployment's trade-offs
- Know when to pick Canary vs. A/B testing
- Know GPU spot instance pitfalls during training
- Know the difference between batch and dynamic batching
- Know the 3 elements of LLM observability (prompt, completion, metadata)
Part 13 — 10 MLOps Anti-Patterns
- Deploying from notebooks only: zero reproducibility. Move to pipelines.
- Splitting feature computation across training and serving: use a Feature Store or shared library.
- Deploying without an eval set: no way to detect regressions.
- No drift monitoring: quiet failure in 6 months.
- Single A/B metric: guardrails are mandatory.
- Picking GPUs "by feel": A10/A100/H100 need clear criteria.
- Spot instances without checkpointing: termination erases training.
- Big-bang deploy without Shadow: risk underestimated.
- Observability later: must be baked in from day one.
- Treating MLOps as ML team's job only: DevOps/Data team collaboration required.
Closing — MLOps is the "Invisible 70%"
The beauty of papers lies in model architecture, but the beauty of production lies in 30 invisible systems meshing together.
The 2025 AI/ML engineer divide:
- "Can run a model" = entry level
- "Can draw a pipeline/serving architecture" = senior
- "Designs drift, cost, eval end-to-end" = staff+
Papers are public; operational know-how is not. That's why this area drives salary gaps.
Next Post — "Data Engineering Complete Guide: Lakehouse, Streaming, dbt, Orchestration, Data Mesh"
Season 2 Ep 8 is about the foundation beneath ML: data engineering. Next up:
- Lakehouse architecture (Iceberg, Delta, Hudi)
- Batch vs. Streaming (Flink, Kafka Streams, Spark Structured Streaming)
- Data modeling with dbt + Elementary
- Airflow vs. Prefect vs. Dagster vs. Temporal
- The real meaning of Data Mesh
- Data Contracts and Schema Registry
"The way data works" has changed in 2025 — continues in the next post.