Toss Bank ML Engineer (MLOps) Complete Guide: From MLFlow to LLM Platform — Tech Stack Deep Dive
Author: Youngju Kim (@fjvbn20031)
- Introduction: Why Toss Bank ML Platform Is Different
- 1. Team Analysis: ML Platform Team at Toss Bank
- 2. JD Line-by-Line Analysis
- 3. Tech Stack Deep Dive
- 3.1 Kubernetes for MLOps
- 3.2 MLFlow: Experiment Tracking and Model Registry
- 3.3 Apache Airflow: Workflow Orchestration
- 3.4 Kubeflow: Kubernetes-Native ML Pipelines
- 3.5 JupyterHub: Multi-User Notebook Platform
- 3.6 Triton Inference Server: Production Model Serving
- 3.7 ScyllaDB Feature Store: Low-Latency Feature Serving
- 3.8 LLM Platform: GPU Infrastructure and Inference Optimization
- 3.9 GPU Frameworks and CUDA Fundamentals
- 3.10 Distributed Database Fundamentals
- 4. MLOps Maturity Model
- 5. Interview Preparation: 30 Expected Questions
- 6. Eight-Month Study Roadmap
- 7. Resume Strategy for Toss Bank ML Platform
- 8. Portfolio Projects
- 9. Knowledge Check Quiz
- 10. References and Resources
Introduction: Why Toss Bank ML Platform Is Different
Toss Bank is not just another fintech with a machine learning team. As part of the Viva Republica ecosystem (the parent company behind the Toss super-app), Toss Bank operates one of the most aggressive ML-driven financial platforms in Asia. The ML Platform team is the infrastructure backbone that makes this possible — building and maintaining the systems that allow data scientists to go from Jupyter notebook to production model in hours rather than weeks.
The MLOps Engineer position on this team is not a typical "deploy a model and forget it" role. The JD reveals a team operating at MLOps Maturity Level 3-4 (more on this below), with ambitions to reach Level 5 — full autonomous ML operations. This means you are expected to understand not just individual tools, but the entire lifecycle from experimentation to serving, monitoring, retraining, and governance.
This guide dissects every line of the job description, maps each requirement to specific technologies and study resources, and gives you a concrete 8-month plan to become a competitive candidate. Whether you are a backend engineer pivoting into MLOps, a data scientist who wants to understand infrastructure, or an experienced MLOps practitioner evaluating this role — this document is your comprehensive preparation resource.
1. Team Analysis: ML Platform Team at Toss Bank
What the Team Actually Does
The ML Platform Team sits at the intersection of data engineering, ML engineering, and platform engineering. Their mandate is threefold:
- Build the ML infrastructure layer — Training pipelines, experiment tracking, model registry, feature stores, and serving infrastructure
- Enable self-service ML for data scientists — JupyterHub environments, automated pipeline creation, one-click model deployment
- Operate the LLM platform — Inference optimization, GPU cluster management, RAG pipelines for banking-specific use cases
Team Positioning Within Toss Bank
Understanding where the team sits in the organizational hierarchy matters for interview preparation.
| Layer | Function | Example |
|---|---|---|
| Business Teams | Define ML use cases | Credit scoring, fraud detection, personalized recommendations |
| Data Science Team | Build and validate models | Feature engineering, model training, evaluation |
| ML Platform Team (this role) | Build and operate the platform | MLFlow, Kubeflow, Triton, Feature Store |
| Infrastructure Team | Provide compute and network | Kubernetes clusters, GPU nodes, networking |
| Security/Compliance | Ensure regulatory adherence | Model audit trails, data governance |
The ML Platform Team is the critical middle layer. They do not build business models themselves, but they make it possible for dozens of data scientists to work efficiently and deploy models safely into production.
Why This Role Matters in 2025-2026
Three trends make this position especially significant:
- LLM integration in banking — Every major Korean financial institution is racing to deploy LLMs for customer service, document processing, and internal tooling. Toss Bank needs infrastructure that can handle both traditional ML (XGBoost, LightGBM for credit scoring) and generative AI workloads simultaneously.
- Regulatory pressure — Korean financial regulators (FSC/FSS) now require model explainability and audit trails for any ML system that affects credit decisions. The ML platform must provide governance capabilities out of the box.
- Scale challenges — Toss Bank serves millions of active users. The ML platform must handle thousands of feature computations per second, serve models at sub-10ms latency, and manage dozens of concurrent experiments — all while maintaining five-nines reliability for financial transactions.
2. JD Line-by-Line Analysis
Let us break down each requirement from the job description and understand what the hiring team is really asking for.
Core Responsibilities
"Design and develop ML platform services (MLFlow, Airflow, JupyterHub, Kubeflow)"
This is the heart of the role. You are not just using these tools — you are building and customizing them. The four tools mentioned form the complete ML lifecycle:
- MLFlow — Experiment tracking, model registry, model versioning
- Airflow — Workflow orchestration for data pipelines and training jobs
- JupyterHub — Multi-user notebook environment for data scientists
- Kubeflow — Kubernetes-native ML pipeline orchestration
The word "design" is critical. It signals they want someone who can architect solutions, not just follow tutorials.
"Build and operate inference serving infrastructure (Triton Inference Server)"
Model serving is often the hardest part of MLOps. Triton Inference Server (by NVIDIA) is an enterprise-grade serving solution that supports multiple model frameworks (TensorFlow, PyTorch, ONNX, TensorRT) simultaneously. Operating Triton at scale means understanding:
- Model ensemble patterns
- Dynamic batching configuration
- GPU memory management
- A/B testing and canary deployment for models
"Design and develop feature store based on distributed database (ScyllaDB)"
This is a strong architectural signal. Most companies use off-the-shelf feature stores (Feast, Tecton, Hopsworks). Toss Bank has built a custom feature store on ScyllaDB — a high-performance Cassandra-compatible database written in C++. This means:
- They need sub-millisecond feature lookups at scale
- They prioritize consistency and low tail latency (critical for financial services)
- You need to understand distributed database internals, not just API usage
"Build and operate LLM platform (GPU infrastructure, inference optimization)"
This is the forward-looking part of the role. LLM operations require a fundamentally different skill set from traditional ML:
- GPU cluster management (NVIDIA A100/H100, multi-GPU serving)
- Inference optimization (quantization, KV-cache optimization, continuous batching)
- RAG pipeline architecture for grounding LLM responses in banking data
- Cost optimization (GPU compute is expensive — efficient utilization is essential)
Required Qualifications
"3+ years of experience in backend development or ML engineering"
The dual framing (backend OR ML engineering) is intentional. They want someone who can write production-quality code. A data scientist with only notebook experience will struggle here. A backend engineer with no ML understanding will also struggle. The sweet spot is an engineer who can write Go/Python services AND understands ML concepts.
"Experience with Kubernetes and container orchestration"
This is non-negotiable. Every tool in their stack (MLFlow, Airflow, JupyterHub, Kubeflow, Triton) runs on Kubernetes. You need to understand:
- Pod scheduling, resource requests/limits
- Custom operators and CRDs (Custom Resource Definitions)
- Helm charts and Kustomize for deployment management
- Persistent volume management for model artifacts and training data
"Understanding of ML lifecycle (training, serving, monitoring)"
They want to confirm you see the big picture. ML in production is not "train a model and deploy it." It is a continuous cycle:
Training leads to validation, which leads to registry, then deployment, then monitoring, then retraining. Understanding each transition point and what can go wrong is essential.
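The transition points can be made concrete with a tiny sketch. The stage names below are illustrative only — they are not any specific registry's API — but encoding the lifecycle as a transition map is one way to catch invalid jumps (say, deploying a model that never passed validation) early:

```python
# Illustrative ML lifecycle transition map. Stage names are hypothetical,
# not tied to MLFlow or any specific tool's stage vocabulary.
VALID_TRANSITIONS = {
    "training": {"validation"},
    "validation": {"registry", "training"},  # failed validation -> back to training
    "registry": {"deployment"},
    "deployment": {"monitoring"},
    "monitoring": {"retraining"},
    "retraining": {"validation"},
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is a valid lifecycle step."""
    return target in VALID_TRANSITIONS.get(current, set())
```

Being able to narrate what goes wrong at each transition (validation drift, registry approval gaps, silent monitoring failures) is exactly what this qualification probes.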
Preferred Qualifications
"Experience with GPU infrastructure and CUDA"
This separates senior candidates from junior ones. If you have hands-on experience with GPU memory profiling, CUDA kernel optimization, or multi-GPU training with NCCL — you will stand out significantly.
"Experience with distributed systems or databases"
The ScyllaDB feature store requirement makes this especially relevant. Experience with Cassandra, DynamoDB, or any LSM-tree based database gives you a huge advantage.
"Contributions to open-source ML tools"
This is the strongest signal in the "preferred" section. Active open-source contributors demonstrate both technical depth and community engagement. Even small contributions to MLFlow, Kubeflow, or Triton will be noticed.
3. Tech Stack Deep Dive
3.1 Kubernetes for MLOps
Kubernetes is the foundation of the entire Toss Bank ML platform. Every other tool runs on top of it. Your Kubernetes knowledge needs to go beyond basic deployments.
Key Concepts for MLOps on Kubernetes
| Concept | MLOps Relevance |
|---|---|
| Namespaces | Isolating dev/staging/prod ML environments |
| Resource Quotas | Preventing runaway training jobs from starving other workloads |
| Node Affinity and Taints | Directing GPU workloads to GPU nodes, CPU workloads to CPU nodes |
| Custom Resource Definitions | Kubeflow Pipelines, TFJob, PyTorchJob all use CRDs |
| Persistent Volume Claims | Storing training data, model artifacts, checkpoints |
| Horizontal Pod Autoscaler | Scaling inference servers based on request load |
GPU Scheduling on Kubernetes
GPU management is critical for this role. NVIDIA provides the nvidia-device-plugin for Kubernetes, and you need to understand:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
    - name: training
      image: training-image:v1
      resources:
        limits:
          nvidia.com/gpu: 2
  nodeSelector:
    accelerator: nvidia-a100
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
This example shows a training pod requesting 2 NVIDIA A100 GPUs with appropriate node selection and tolerations. In production, you would also configure:
- GPU partitioning for smaller workloads — time-slicing or MIG (Multi-Instance GPU)
- RDMA networking for multi-node training
- Topology-aware scheduling for optimal GPU-to-GPU communication
Study Resources
- Kubernetes official documentation: Concepts section (free)
- "Kubernetes in Action" by Marko Luksa — the definitive deep-dive book
- NVIDIA GPU Operator documentation
- CKA (Certified Kubernetes Administrator) certification — strongly recommended
3.2 MLFlow: Experiment Tracking and Model Registry
MLFlow is the de facto standard for experiment tracking in the ML industry. At Toss Bank, it serves as both the experiment tracking system and the model registry.
Architecture Overview
MLFlow has four core components:
- Tracking — Logs parameters, metrics, and artifacts for each experiment run
- Projects — Packages ML code in a reusable, reproducible format
- Models — Standardizes model packaging across frameworks
- Model Registry — Centralized model store with versioning, staging, and approval workflows
How Toss Bank Likely Uses MLFlow
In a financial institution, model governance is not optional. The MLFlow Model Registry becomes the central control plane:
```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a new model version
model_uri = "runs:/abc123/model"
mv = client.create_model_version(
    name="credit-scoring-v2",
    source=model_uri,
    run_id="abc123",
    description="LightGBM credit scoring model with 47 features",
)

# Transition through stages with approval
client.transition_model_version_stage(
    name="credit-scoring-v2",
    version=mv.version,
    stage="Staging",
)

# After validation, promote to production
client.transition_model_version_stage(
    name="credit-scoring-v2",
    version=mv.version,
    stage="Production",
    archive_existing_versions=True,
)
```
Key Topics to Study
- MLFlow Tracking Server deployment on Kubernetes (PostgreSQL backend, S3/MinIO artifact store)
- Custom MLFlow plugins (e.g., custom authentication, custom artifact stores)
- MLFlow Model Serving vs dedicated serving solutions (Triton)
- Integration with Airflow for automated training pipelines
- Metric comparison and experiment analysis APIs
Study Resources
- MLFlow official documentation (comprehensive and well-written)
- "Practical MLOps" by Noah Gift and Alfredo Deza (O'Reilly)
- MLFlow GitHub repository — read the source code for the tracking server
3.3 Apache Airflow: Workflow Orchestration
Airflow is the industry standard for data pipeline orchestration. In the ML context, it manages the complex dependencies between data preparation, feature computation, model training, evaluation, and deployment.
Why Airflow for ML Pipelines
| Capability | ML Application |
|---|---|
| DAG (Directed Acyclic Graph) scheduling | Defining dependencies between data prep, training, evaluation steps |
| Retry and error handling | Recovering from transient GPU failures during training |
| SLA monitoring | Ensuring daily model retraining completes before business hours |
| Parameterized DAGs | Running the same pipeline with different hyperparameters |
| Custom operators | Building Kubernetes-native training job operators |
Example: ML Training DAG
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

default_args = {
    "owner": "ml-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="credit_scoring_daily_retrain",
    default_args=default_args,
    schedule="0 2 * * *",  # daily at 02:00
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    feature_extraction = KubernetesPodOperator(
        task_id="extract_features",
        name="feature-extraction",
        namespace="ml-pipelines",
        image="feature-pipeline:v3",
        arguments=["--date", "{{ ds }}"],
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "16Gi"},
            limits={"cpu": "8", "memory": "32Gi"},
        ),
    )

    model_training = KubernetesPodOperator(
        task_id="train_model",
        name="model-training",
        namespace="ml-pipelines",
        image="training-pipeline:v5",
        arguments=["--date", "{{ ds }}", "--experiment", "credit-scoring"],
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
            limits={"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "1"},
        ),
    )

    model_evaluation = KubernetesPodOperator(
        task_id="evaluate_model",
        name="model-evaluation",
        namespace="ml-pipelines",
        image="evaluation-pipeline:v2",
        arguments=["--date", "{{ ds }}"],
    )

    feature_extraction >> model_training >> model_evaluation
```
Key Topics to Study
- KubernetesExecutor vs CeleryExecutor — tradeoffs for ML workloads
- Custom operators for MLFlow integration
- XCom for passing metadata between tasks (model metrics, artifact URIs)
- Connection and variable management for secrets
- Airflow on Kubernetes: Helm chart deployment and configuration
- DAG versioning and testing strategies
Study Resources
- Apache Airflow official documentation
- "Data Pipelines with Apache Airflow" by Bas Harenslak and Julian de Ruiter (Manning)
- Astronomer.io blog and guides (Astronomer is the commercial Airflow company)
3.4 Kubeflow: Kubernetes-Native ML Pipelines
Kubeflow is the Kubernetes-native ML platform that provides pipeline orchestration, hyperparameter tuning, and distributed training capabilities. While Airflow handles general workflow orchestration, Kubeflow is purpose-built for ML.
Kubeflow Components Relevant to This Role
| Component | Purpose |
|---|---|
| Kubeflow Pipelines (KFP) | Define and run ML workflows as reusable pipelines |
| Katib | Automated hyperparameter tuning |
| Training Operators | Distributed training for TensorFlow, PyTorch, XGBoost |
| KServe | Model serving (may overlap with Triton in Toss Bank setup) |
| Notebooks | Jupyter notebook management (may overlap with JupyterHub) |
Kubeflow Pipelines Example
```python
from kfp import compiler, dsl


@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn", "pandas", "pyarrow"],
)
def preprocess_data(input_path: str, output_path: dsl.OutputPath(str)):
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_parquet(input_path)
    scaler = StandardScaler()
    df_scaled = pd.DataFrame(
        scaler.fit_transform(df),
        columns=df.columns,
    )
    df_scaled.to_parquet(output_path)


@dsl.component(
    base_image="python:3.11",
    packages_to_install=["lightgbm", "mlflow", "pandas", "pyarrow"],
)
def train_model(data_path: dsl.InputPath(str), model_name: str):
    import lightgbm as lgb
    import mlflow
    import pandas as pd

    df = pd.read_parquet(data_path)
    # Training logic here, e.g. model = lgb.train(params, lgb.Dataset(X, y))
    mlflow.lightgbm.log_model(model, model_name)


@dsl.pipeline(name="credit-scoring-pipeline")
def credit_scoring_pipeline(input_path: str = "s3://data/features/"):
    preprocess_task = preprocess_data(input_path=input_path)
    train_task = train_model(
        data_path=preprocess_task.outputs["output_path"],
        model_name="credit-scoring",
    )


compiler.Compiler().compile(credit_scoring_pipeline, "pipeline.yaml")
```
Kubeflow vs Airflow: When to Use Which
- Airflow: Best for complex DAGs with mixed workloads (data pipelines + ML + ETL), mature ecosystem with hundreds of operators, strong scheduling capabilities
- Kubeflow: Best for pure ML pipelines, native Kubernetes integration, built-in hyperparameter tuning and distributed training, better experiment tracking integration
Many teams (including likely Toss Bank) use both: Airflow for top-level orchestration and data pipelines, Kubeflow for the ML-specific pipeline steps within those workflows.
Study Resources
- Kubeflow official documentation
- Kubeflow Pipelines SDK v2 documentation (this is the current version)
- Google Cloud Vertex AI Pipelines (uses KFP under the hood)
3.5 JupyterHub: Multi-User Notebook Platform
JupyterHub is the multi-user server for Jupyter notebooks. In an ML platform context, it provides the self-service environment where data scientists experiment and develop models.
Why JupyterHub Matters for This Role
You are not just deploying JupyterHub — you are building a customized, secure, enterprise-grade notebook platform for a financial institution. This involves:
- Authentication and Authorization — Integrating with the company identity provider (LDAP, OIDC, SAML)
- Resource Management — Allowing users to request specific compute profiles (CPU-only, single GPU, multi-GPU)
- Image Management — Maintaining curated Docker images with pre-installed ML frameworks
- Persistent Storage — Ensuring notebooks and data persist across server restarts
- Security — Network isolation, secret management, preventing data exfiltration in a banking environment
Kubernetes-Native JupyterHub Architecture
On Kubernetes, JupyterHub uses the kubespawner to create individual pods for each user:
```yaml
# JupyterHub Helm values (simplified)
singleuser:
  profileList:
    - display_name: "CPU - Small (2 CPU, 8GB)"
      description: "For data exploration and light processing"
      kubespawner_override:
        cpu_limit: 2
        mem_limit: "8G"
    - display_name: "GPU - A100 (8 CPU, 32GB, 1 GPU)"
      description: "For model training and fine-tuning"
      kubespawner_override:
        cpu_limit: 8
        mem_limit: "32G"
        extra_resource_limits:
          nvidia.com/gpu: "1"
        node_selector:
          accelerator: nvidia-a100
  storage:
    type: dynamic
    capacity: 50Gi
    dynamic:
      storageClass: fast-ssd
```
Study Resources
- Zero to JupyterHub with Kubernetes documentation
- JupyterHub for Kubernetes Helm chart documentation
- KubeSpawner documentation
3.6 Triton Inference Server: Production Model Serving
NVIDIA Triton Inference Server is the industry-leading solution for serving ML models in production. It supports multiple frameworks simultaneously and provides advanced features like dynamic batching, model ensemble, and GPU utilization optimization.
Why Triton for Financial Services
| Feature | Banking Benefit |
|---|---|
| Multi-framework support | Serve XGBoost credit models and PyTorch NLP models from the same server |
| Dynamic batching | Maximize throughput while meeting latency SLAs |
| Model ensemble | Chain preprocessing, model inference, and postprocessing |
| Model versioning | Seamless A/B testing and canary deployments |
| Metrics and monitoring | Prometheus-compatible metrics for model performance tracking |
| gRPC and HTTP endpoints | Flexible integration with existing banking microservices |
Triton Model Repository Structure
```
model_repository/
├── credit_scoring/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx
├── fraud_detection/
│   ├── config.pbtxt
│   └── 1/
│       └── model.plan
└── text_classifier/
    ├── config.pbtxt
    └── 1/
        └── model.pt
```
Triton Configuration Example
```protobuf
name: "credit_scoring"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 47 ]
  }
]

output [
  {
    name: "probability"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}
```
Key Topics to Study
- Model conversion: PyTorch to ONNX, TensorFlow to TensorRT
- Dynamic batching configuration and tuning
- Model ensemble for pre/post-processing pipelines
- Triton Inference Server on Kubernetes (Helm chart deployment)
- Performance analysis with Triton Model Analyzer
- Custom backends for non-standard model formats
- Health checks and readiness probes for Kubernetes integration
Study Resources
- NVIDIA Triton Inference Server documentation
- NVIDIA Deep Learning Examples GitHub repository
- Triton Model Analyzer documentation
- "Serving Machine Learning Models" by Yaron Haviv (O'Reilly)
3.7 ScyllaDB Feature Store: Low-Latency Feature Serving
The decision to build a feature store on ScyllaDB is one of the most distinctive aspects of Toss Bank's ML platform. Understanding why they chose this architecture reveals a lot about the team's priorities.
What Is a Feature Store
A feature store is a centralized repository for storing, managing, and serving ML features. It solves several critical problems:
- Feature consistency — Ensuring the same feature computation is used during both training and inference
- Feature reuse — Allowing multiple models to share the same features without redundant computation
- Low-latency serving — Providing precomputed features at inference time with sub-millisecond latency
- Point-in-time correctness — Retrieving features as they existed at a specific historical timestamp for training (avoiding data leakage)
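To make point-in-time correctness concrete, here is a minimal stdlib-only sketch (purely illustrative — not Toss Bank's implementation, and the field names are hypothetical) that returns the latest feature value at or before a given training timestamp, so no future information leaks into the training set:

```python
import bisect

# Hypothetical feature history per user, sorted ascending by timestamp.
history = {
    "user-1": [
        (100, {"avg_txn_7d": 52.0}),
        (200, {"avg_txn_7d": 61.5}),
        (300, {"avg_txn_7d": 47.3}),
    ]
}

def features_as_of(user_id: str, ts: int):
    """Return the feature row with the largest timestamp <= ts (None if none exists)."""
    rows = history.get(user_id, [])
    timestamps = [t for t, _ in rows]
    idx = bisect.bisect_right(timestamps, ts)  # index of first entry strictly after ts
    return rows[idx - 1][1] if idx > 0 else None
```

A training pipeline would call this with each label's event timestamp; an online store answers the same question with `ts = now`, which is why both paths must share one feature definition.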
Why ScyllaDB Instead of Redis or DynamoDB
| Requirement | ScyllaDB Advantage |
|---|---|
| Consistent low-latency (p99) | Shard-per-core architecture eliminates context switching overhead |
| Large feature sets | Supports wide rows with hundreds of columns per entity |
| Time-series features | Native TTL and time-windowed compaction for historical features |
| Operational simplicity | Cassandra-compatible but with C++ performance (no JVM tuning) |
| Cost at scale | Better price-performance ratio than DynamoDB for predictable workloads |
Feature Store Architecture Pattern
```
Batch features:
  Training data --> [Airflow + Spark] ----> [ScyllaDB (offline store)]
                                                        |
Online features:                                        v
  Live requests --> [Streaming pipeline] ---> [Feature Serving API]
                                                        |
                                                        v
                                             [Model Serving (Triton)]
```
ScyllaDB Data Modeling for Features
```sql
CREATE TABLE feature_store.user_features (
    user_id text,
    feature_timestamp timestamp,
    avg_transaction_amount_7d double,
    transaction_count_30d int,
    max_single_transaction_90d double,
    credit_utilization_ratio double,
    days_since_last_late_payment int,
    PRIMARY KEY (user_id, feature_timestamp)
) WITH CLUSTERING ORDER BY (feature_timestamp DESC)
  AND default_time_to_live = 7776000;  -- 90 days
```
Key Topics to Study
- ScyllaDB architecture: shard-per-core model, seastar framework
- Data modeling for wide-column databases (Cassandra/ScyllaDB)
- Feature store concepts: online vs offline store, feature freshness, time-travel
- Comparison with existing feature stores (Feast, Tecton, Hopsworks)
- ScyllaDB performance tuning: compaction strategies, caching, read/write consistency levels
- Driver selection and connection pooling for high-throughput workloads
Study Resources
- ScyllaDB University (free online courses)
- "Cassandra: The Definitive Guide" by Jeff Carpenter and Eben Hewitt (concepts transfer directly)
- Feast documentation (to understand general feature store concepts)
- ScyllaDB Architecture documentation
3.8 LLM Platform: GPU Infrastructure and Inference Optimization
The LLM platform responsibility is the most forward-looking part of this role. Building infrastructure for Large Language Models requires understanding a completely different set of constraints than traditional ML.
LLM Infrastructure Challenges in Banking
- Data privacy — Banking data cannot leave the organization, ruling out most cloud LLM APIs. On-premises or VPC-hosted inference is required.
- Latency requirements — Customer-facing chatbots need first-token latency under 500ms and generation speed above 30 tokens per second.
- Cost management — A single NVIDIA H100 GPU costs over $30,000. Efficient utilization of GPU clusters is essential for ROI.
- Model governance — Regulatory requirements mean every LLM response in banking must be traceable, auditable, and explainable.
Key LLM Serving Technologies
| Technology | Purpose |
|---|---|
| vLLM | High-throughput LLM serving with PagedAttention |
| TensorRT-LLM | NVIDIA optimized LLM inference engine |
| Triton + TensorRT-LLM backend | Enterprise-grade LLM serving on Triton |
| NVIDIA NIM | Containerized, optimized inference microservices |
| Ray Serve | Distributed serving framework for complex inference graphs |
LLM Inference Optimization Techniques
- Quantization: Reducing model precision from FP16 to INT8 or INT4. AWQ and GPTQ are the most common methods. This typically reduces GPU memory usage by 50-75% with minimal accuracy loss.
- KV-Cache Optimization: PagedAttention (used by vLLM) manages the key-value cache like virtual memory pages, dramatically improving throughput for concurrent requests.
- Continuous Batching: Unlike static batching, continuous batching allows new requests to join a batch as previous requests complete, maximizing GPU utilization.
- Speculative Decoding: Using a small draft model to generate candidate tokens that a larger model verifies, potentially speeding up inference by 2-3x.
- Tensor Parallelism: Splitting a single model across multiple GPUs for serving models too large for one GPU's memory.
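The memory arithmetic behind quantization is simple enough to sanity-check in a few lines. This is generic back-of-envelope math, not tied to any serving framework:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Back-of-envelope GPU memory for model weights alone (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params_7b = 7e9
fp16 = weight_memory_gb(params_7b, 16)  # 14.0 GB
int8 = weight_memory_gb(params_7b, 8)   # 7.0 GB  (50% reduction)
int4 = weight_memory_gb(params_7b, 4)   # 3.5 GB  (75% reduction)
```

This is where the "50-75% memory reduction" claim comes from; actual savings vary slightly because activations and the KV cache are usually kept at higher precision.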
RAG Pipeline for Banking
Retrieval-Augmented Generation (RAG) is essential for banking LLM applications to ground responses in accurate, up-to-date information:
```
User query --> [Embedding model] --> [Vector DB search]
                                              |
                                              v
                                     Retrieved context
                                              |
                                              v
[Prompt template + context + query] --> [LLM] --> Response
                                                     |
                                                     v
                                                [Guardrails]
                                                     |
                                                     v
                                              Final response
```
Study Resources
- vLLM documentation and GitHub repository
- NVIDIA TensorRT-LLM documentation
- "LLM Engineer's Handbook" by Paul Iusztin and Maxime Labonne
- Hugging Face Text Generation Inference documentation
- NVIDIA NIM documentation
3.9 GPU Frameworks and CUDA Fundamentals
Understanding GPU computing at a deeper level sets apart strong candidates from average ones. You do not need to be a CUDA kernel developer, but you need to understand the fundamentals.
GPU Architecture Basics
| Concept | Description |
|---|---|
| Streaming Multiprocessor (SM) | The basic processing unit of a GPU. An A100 has 108 SMs |
| CUDA Core | Individual processing units within each SM |
| Tensor Core | Specialized hardware for matrix operations (critical for ML) |
| HBM (High Bandwidth Memory) | GPU memory (40GB or 80GB on A100, 80GB on H100) |
| NVLink | High-speed GPU-to-GPU interconnect (900 GB/s on H100) |
| PCIe | CPU-to-GPU interconnect (slower than NVLink) |
Multi-Instance GPU (MIG)
MIG allows partitioning a single GPU into multiple isolated instances. This is essential for maximizing GPU utilization:
```bash
# List MIG-capable GPUs and available profiles
nvidia-smi mig -lgip

# Create MIG instances (two 3g.40gb profiles on an 80GB A100)
nvidia-smi mig -cgi 9,9 -C

# List created instances
nvidia-smi mig -lgi
```
In a banking context, MIG allows running smaller inference workloads on fractions of an A100/H100 rather than dedicating an entire GPU to each model.
CUDA Memory Management for ML Engineers
Understanding GPU memory is critical for debugging out-of-memory errors and optimizing training:
- Model Parameters: Stored in GPU memory (e.g., a 7B parameter model in FP16 needs about 14GB)
- Gradients: Same size as parameters during training (another 14GB)
- Optimizer States: Adam keeps two extra state tensors per parameter — momentum and variance (another 28GB at FP16; in practice these are often stored in FP32, roughly doubling that)
- Activations: Intermediate values during forward pass (varies with batch size and sequence length)
Total training memory for a 7B model with Adam, keeping everything in FP16, is roughly 14 + 14 + 28 = 56GB minimum, plus activations.
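The same arithmetic as a reusable sketch (back-of-envelope only — real frameworks add activation memory, buffers, and fragmentation that this ignores):

```python
def training_memory_gb(num_params: float, bytes_per_value: int = 2) -> dict:
    """Estimate minimum training memory (GB) for params, grads, and Adam states.

    Assumes every tensor uses `bytes_per_value` bytes (2 = FP16); activations excluded.
    """
    gb = num_params * bytes_per_value / 1e9
    return {
        "parameters": gb,
        "gradients": gb,
        "optimizer_states": 2 * gb,  # Adam: momentum + variance
        "total": 4 * gb,
    }

# A 7B-parameter model in FP16: 14 + 14 + 28 = 56 GB before activations
breakdown = training_memory_gb(7e9)
```

Being able to do this estimate mentally is a common screening question for GPU infrastructure roles.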
NCCL (NVIDIA Collective Communications Library)
NCCL is the library that enables efficient multi-GPU and multi-node communication:
- AllReduce: Aggregating gradients across GPUs during distributed training
- AllGather: Collecting tensor shards from all GPUs for tensor parallelism
- ReduceScatter: Combining reduction and distribution — used by sharded data parallelism schemes such as ZeRO and FSDP
Study Resources
- NVIDIA CUDA Programming Guide
- "Programming Massively Parallel Processors" by David Kirk and Wen-mei Hwu
- NVIDIA Deep Learning Performance Guide
- PyTorch Distributed Training documentation
3.10 Distributed Database Fundamentals
Since Toss Bank uses ScyllaDB for its feature store and operates at financial-grade scale, a solid understanding of distributed database theory and practice is essential.
CAP Theorem and Its Practical Implications
The CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. In practice:
- ScyllaDB/Cassandra: AP system with tunable consistency (you can configure per-query consistency levels)
- For feature serving: Usually QUORUM reads where consistency matters most, or LOCAL_QUORUM to keep read latency low within a single datacenter
- For feature writing: ONE or LOCAL_ONE for high write throughput during batch feature computation
Consistency Levels in Practice
| Consistency Level | Reads From | Use Case |
|---|---|---|
| ONE | Any single replica | Maximum speed, eventual consistency |
| QUORUM | Majority of replicas | Strong consistency for critical features |
| LOCAL_QUORUM | Majority in local datacenter | Consistent reads with low latency |
| ALL | All replicas | Maximum consistency (rarely used in production) |
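The rule of thumb underlying this table is that reads and writes are guaranteed to overlap on at least one replica when R + W > RF (the replication factor). A tiny sketch of that check — generic quorum math for a single datacenter, not a ScyllaDB API:

```python
def replicas_for(level: str, rf: int) -> int:
    """Number of replica acknowledgements required at a given consistency level."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

def is_strongly_consistent(read_level: str, write_level: str, rf: int = 3) -> bool:
    """True if every read is guaranteed to see the latest acknowledged write (R + W > RF)."""
    return replicas_for(read_level, rf) + replicas_for(write_level, rf) > rf

# QUORUM + QUORUM at RF=3: 2 + 2 > 3, so reads see the latest write.
# ONE + ONE at RF=3: 1 + 1 <= 3, so stale reads are possible.
```

This is why the common feature-store pattern of fast ONE/LOCAL_ONE writes plus QUORUM reads trades some read-time consistency guarantees for batch write throughput.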
LSM-Tree Architecture
ScyllaDB and Cassandra use Log-Structured Merge Trees for storage:
- Writes go to an in-memory Memtable
- When the Memtable is full, it is flushed to disk as an SSTable
- Background compaction merges SSTables to reclaim space and optimize reads
- Bloom filters and partition indexes enable fast lookups
Understanding compaction strategies (Size-Tiered, Leveled, Time-Window) is critical for feature store performance tuning.
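The write path above can be illustrated with a toy in-memory model (stdlib Python, purely pedagogical — real engines add commit logs, bloom filters, and background compaction, which this omits):

```python
class ToyLSM:
    """Minimal memtable + SSTable flush, illustrating the LSM write path."""

    def __init__(self, memtable_limit: int = 3):
        self.memtable = {}
        self.sstables = []  # newest last; each is an immutable sorted snapshot
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: sort and freeze the memtable as an "on-disk" SSTable
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search SSTables newest-first: later writes shadow earlier ones
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None
```

Compaction's job, in these terms, is to merge old SSTables so that `get` touches fewer tables per lookup.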
Consistent Hashing and Data Distribution
ScyllaDB distributes data across nodes using consistent hashing:
- Each node owns a range of token values
- Partition keys are hashed to determine which node stores the data
- Virtual nodes (vnodes) improve data distribution uniformity
- Replication factor determines how many copies of each piece of data exist
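A compact sketch of a token ring with virtual nodes (illustrative only — node names are hypothetical, and real ScyllaDB uses Murmur3 tokens rather than MD5):

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Map a key onto the ring via a stable hash (ScyllaDB uses Murmur3)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes: int = 8):
        # Each node owns several virtual tokens for more uniform distribution
        self.ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.tokens = [t for t, _ in self.ring]

    def owner(self, key: str) -> str:
        """The first node clockwise from the key's token owns the partition."""
        idx = bisect.bisect(self.tokens, token(key)) % len(self.ring)
        return self.ring[idx][1]
```

Replication then follows the ring: with replication factor 3, the next two distinct nodes clockwise from the owner hold the additional copies.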
Study Resources
- "Designing Data-Intensive Applications" by Martin Kleppmann — essential reading
- ScyllaDB University courses (free)
- "Database Internals" by Alex Petrov (O'Reilly)
- Jepsen.io for distributed systems correctness analysis
4. MLOps Maturity Model
Understanding where Toss Bank sits on the MLOps maturity model helps you frame your interview answers and demonstrate strategic thinking.
The Five Levels
| Level | Name | Characteristics |
|---|---|---|
| 0 | No MLOps | Manual everything — notebooks to production via copy-paste |
| 1 | DevOps but no MLOps | CI/CD exists but ML-specific pipelines do not |
| 2 | Automated Training | Automated training pipelines, basic experiment tracking |
| 3 | Automated Deployment | Automated model deployment, A/B testing, monitoring |
| 4 | Full MLOps | Automated retraining, feature stores, model governance |
| 5 | Autonomous ML | Self-healing pipelines, automated feature engineering, continuous optimization |
Toss Bank's Current Position (Estimated Level 3-4)
Based on the JD analysis, Toss Bank appears to be at Level 3-4:
Evidence for Level 3+:
- Dedicated ML Platform team (not ad-hoc)
- Mature tool stack (MLFlow, Kubeflow, Airflow)
- Production model serving (Triton)
- Custom feature store (ScyllaDB-based)
Aspirations toward Level 4-5:
- LLM platform development (cutting-edge capabilities)
- GPU infrastructure management (scaling up)
- The fact that they are hiring suggests expansion and maturation
In your interview, frame your contributions as helping the team advance from their current level to the next. Show that you understand not just the tools, but the organizational and process changes needed for MLOps maturity.
5. Interview Preparation: 30 Expected Questions
Kubernetes and Infrastructure (Questions 1-6)
Q1. How would you design a Kubernetes cluster to support both ML training workloads and model inference serving? What considerations differ between the two?
Q2. Explain how the NVIDIA device plugin works in Kubernetes. How do you handle GPU scheduling, and what happens when a GPU node becomes unhealthy during a training job?
Q3. A training job is consuming all available GPU memory on a node, preventing other pods from scheduling. How would you prevent this using Kubernetes resource management?
Q4. Describe how you would implement a blue-green deployment strategy for ML model updates on Kubernetes. What metrics would you monitor during the canary phase?
Q5. How does Multi-Instance GPU (MIG) work, and in what scenarios would you choose MIG over dedicated GPU allocation?
Q6. Explain the tradeoffs between using a single large Kubernetes cluster versus multiple smaller clusters for separating training and serving workloads.
MLFlow and Experiment Management (Questions 7-12)
Q7. How would you design an MLFlow deployment for a team of 50+ data scientists with requirements for high availability and security?
Q8. Describe your approach to organizing MLFlow experiments and runs for a large organization. How do you prevent experiment sprawl and maintain discoverability?
Q9. How would you implement a model approval workflow using the MLFlow Model Registry? What stages would you define, and what automated checks would you add?
Q10. MLFlow's tracking server is experiencing performance issues with a large volume of experiments. How would you diagnose and resolve this?
Q11. How do you handle model reproducibility? Walk through the steps from experiment to production deployment, ensuring you can recreate any model version.
Q12. Describe how you would integrate MLFlow with your CI/CD pipeline to automate model testing and deployment.
Airflow and Pipeline Orchestration (Questions 13-18)
Q13. Compare the KubernetesExecutor and CeleryExecutor in Airflow for ML pipeline workloads. Which would you recommend for Toss Bank and why?
Q14. How would you design a DAG that handles daily model retraining with automatic rollback if the new model underperforms the current production model?
Q15. A critical Airflow DAG failed at 3 AM and the morning model predictions are stale. Walk through your incident response process.
Q16. How would you implement data quality checks within an Airflow ML pipeline to prevent training on corrupted or incomplete data?
Q17. Describe your strategy for testing Airflow DAGs. How do you ensure DAG changes do not break production workflows?
Q18. How would you manage secrets and credentials in Airflow for connecting to various data sources and ML services?
Model Serving and Triton (Questions 19-24)
Q19. You need to serve a model that requires sub-5ms p99 latency for credit scoring decisions. How would you design the serving infrastructure using Triton?
Q20. Explain dynamic batching in Triton Inference Server. How do you tune the batch size and queue delay parameters for optimal throughput-latency tradeoff?
Q21. How would you implement an A/B test between two model versions using Triton? What metrics would you track, and how long would you run the test?
Q22. A deployed model is experiencing gradual performance degradation over two weeks. Describe your approach to diagnosing and resolving model drift.
Q23. How would you design a model ensemble in Triton that chains a feature preprocessor, a primary model, and a postprocessor together?
Q24. Describe the steps to convert a PyTorch model to an optimized TensorRT engine for deployment on Triton. What potential pitfalls should you watch for?
Feature Store and Distributed Databases (Questions 25-27)
Q25. Why would you choose ScyllaDB over Redis for a feature store backend? Under what circumstances might Redis be the better choice?
Q26. Explain how you would handle feature freshness requirements for a real-time fraud detection model. What is your architecture for updating features in near-real-time?
Q27. Describe point-in-time correctness in a feature store. Why is it important, and how do you implement it with ScyllaDB?
LLM Platform (Questions 28-30)
Q28. How would you design an LLM serving platform for a banking environment where all data must remain on-premises? What are the key architectural decisions?
Q29. Explain the difference between tensor parallelism and pipeline parallelism for serving large language models. When would you use each?
Q30. You are tasked with reducing LLM inference costs by 50% without significantly impacting response quality. What approaches would you consider?
6. Eight-Month Study Roadmap
This roadmap assumes you are currently a backend engineer or junior ML engineer with basic Python and cloud experience. Adjust the timeline based on your starting point.
Month 1-2: Kubernetes and Container Foundations
Goal: Achieve CKA-level Kubernetes proficiency
| Week | Focus Area | Deliverable |
|---|---|---|
| 1-2 | Core Kubernetes concepts | Deploy a multi-service application on a local K8s cluster |
| 3-4 | Advanced scheduling, storage, networking | Configure GPU node pools with taints and tolerations |
| 5-6 | Helm, Kustomize, GitOps | Create Helm charts for ML services deployment |
| 7-8 | CKA exam preparation | Pass CKA certification |
Daily practice: 1.5 hours on weekdays, 3 hours on weekends
Resources: KodeKloud CKA course, Kubernetes documentation, killer.sh practice exams
Month 3-4: Core ML Platform Tools
Goal: Deploy and customize MLFlow, Airflow, and JupyterHub on Kubernetes
| Week | Focus Area | Deliverable |
|---|---|---|
| 9-10 | MLFlow deep dive | Deploy MLFlow with PostgreSQL backend and S3 artifact store on K8s |
| 11-12 | Airflow deep dive | Build an ML training DAG with KubernetesPodOperator |
| 13-14 | JupyterHub deployment | Configure multi-profile JupyterHub with GPU support |
| 15-16 | Integration project | End-to-end pipeline: JupyterHub experiment to MLFlow to Airflow training to model registry |
Daily practice: 2 hours on weekdays, 4 hours on weekends
Resources: Official documentation for each tool, "Practical MLOps" book
Month 5-6: Model Serving and Feature Engineering
Goal: Master Triton deployment and build a feature store prototype
| Week | Focus Area | Deliverable |
|---|---|---|
| 17-18 | Triton Inference Server | Deploy models in 3 different formats on Triton with dynamic batching |
| 19-20 | Model optimization | Convert PyTorch model to ONNX and TensorRT, benchmark performance |
| 21-22 | ScyllaDB fundamentals | Complete ScyllaDB University courses, build a feature serving API |
| 23-24 | Feature store integration | Build a complete feature store with online/offline serving using ScyllaDB |
Daily practice: 2 hours on weekdays, 4 hours on weekends
Resources: Triton documentation, ScyllaDB University, "Designing Data-Intensive Applications"
Month 7-8: LLM Platform and Interview Preparation
Goal: Build LLM serving experience and prepare for interviews
| Week | Focus Area | Deliverable |
|---|---|---|
| 25-26 | LLM serving fundamentals | Deploy an open-source LLM with vLLM and Triton TensorRT-LLM backend |
| 27-28 | RAG pipeline | Build a RAG system with vector database and LLM serving |
| 29-30 | GPU optimization | Implement quantization, benchmark different serving configurations |
| 31-32 | Interview preparation | Mock interviews, review all 30 questions, prepare STAR-format stories |
Daily practice: 2 hours on weekdays, 5 hours on weekends
Resources: vLLM documentation, Hugging Face resources, mock interview practice
Study Schedule Summary
Month 1-2: [========== Kubernetes + CKA ==========]
Month 3-4: [==== MLFlow ==][== Airflow ==][= JupyterHub =]
Month 5-6: [=== Triton ===][=== ScyllaDB Feature Store ===]
Month 7-8: [=== LLM Platform ===][== Interview Prep ==]
7. Resume Strategy for Toss Bank ML Platform
Resume Structure
Your resume should directly map to the JD requirements. Here is the recommended structure:
Header Section
- Name, contact information, GitHub profile, blog/portfolio URL
Summary (3-4 lines)
- Years of experience, primary domain (MLOps/ML Engineering)
- Key technologies matching the JD (Kubernetes, MLFlow, Triton)
- Quantified achievement (e.g., "reduced model deployment time from 2 weeks to 4 hours")
Experience Section (STAR format)
For each position, structure bullets as:
- Situation: What was the context
- Task: What was your specific responsibility
- Action: What did you do (technical details)
- Result: What was the measurable outcome
Example bullet:
"Designed and deployed an MLFlow-based experiment tracking system on Kubernetes for 30+ data scientists, reducing experiment-to-production time by 80% and establishing model governance workflows that satisfied SOC 2 audit requirements."
Keywords to Include
These terms should appear naturally in your resume:
| Category | Terms |
|---|---|
| Infrastructure | Kubernetes, Docker, Helm, GitOps, ArgoCD |
| ML Platform | MLFlow, Kubeflow, Airflow, JupyterHub |
| Model Serving | Triton Inference Server, ONNX, TensorRT, gRPC |
| Data | ScyllaDB, Cassandra, Feature Store, Kafka |
| LLM | vLLM, TensorRT-LLM, RAG, Vector Database, Quantization |
| GPU | CUDA, MIG, NCCL, A100, H100 |
| Practices | CI/CD, Monitoring, A/B Testing, Canary Deployment |
Common Resume Mistakes for MLOps Roles
- Listing tools without context — "Experience with MLFlow" is weak. "Deployed MLFlow tracking server handling 10,000+ experiment runs across 5 teams" is strong.
- Focusing only on model accuracy — For a platform role, infrastructure metrics matter more (deployment frequency, serving latency, platform uptime).
- Ignoring scale indicators — Always include numbers: how many models, how many users, what throughput, what latency.
- Missing the governance angle — Financial services care deeply about audit trails, compliance, and model explainability. Mention these explicitly.
8. Portfolio Projects
Project 1: End-to-End MLOps Platform on Kubernetes
Objective: Demonstrate your ability to build and integrate the core ML platform stack.
Architecture:
JupyterHub (experimentation)
|
v
MLFlow (experiment tracking + model registry)
|
v
Airflow (automated training pipeline)
|
v
Triton (model serving)
|
v
Prometheus + Grafana (monitoring)
Implementation Details:
- Set up a local Kubernetes cluster (kind or minikube with GPU support)
- Deploy MLFlow with PostgreSQL and MinIO (S3-compatible storage)
- Deploy Airflow with KubernetesExecutor
- Build a training DAG that:
- Pulls data from a feature table
- Trains a LightGBM model
- Logs metrics and artifacts to MLFlow
- Registers the model in MLFlow Model Registry
- Deploys the model to Triton if performance meets threshold
- Deploy JupyterHub for interactive experimentation
- Set up Prometheus scraping for all services plus Grafana dashboards
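The "deploy if performance meets threshold" step in the DAG is a gate between evaluation and Triton. A minimal sketch of that gate as a plain function (hypothetical threshold values; in the real DAG the metrics would come from MLFlow):

```python
def should_deploy(candidate_auc: float, production_auc: float,
                  min_auc: float = 0.75, min_gain: float = 0.002) -> bool:
    """Gate between training and deployment: the candidate model must
    clear an absolute quality bar AND beat production by a margin, so
    noise-level improvements do not trigger churn in serving."""
    if candidate_auc < min_auc:
        return False
    return candidate_auc - production_auc >= min_gain

print(should_deploy(0.81, 0.80))   # True: clear improvement
print(should_deploy(0.801, 0.80))  # False: gain below the margin
print(should_deploy(0.74, 0.70))   # False: below the absolute bar
```

In the DAG this would back a BranchPythonOperator-style decision: proceed to the Triton deployment task or short-circuit to a no-op.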
GitHub Repository Structure:
mlops-platform/
infrastructure/
kubernetes/
mlflow/
airflow/
jupyterhub/
triton/
monitoring/
pipelines/
training/
evaluation/
deployment/
models/
credit_scoring/
fraud_detection/
docs/
architecture.md
setup.md
Makefile
README.md
What This Demonstrates: Platform engineering skills, Kubernetes proficiency, tool integration, and understanding of the full ML lifecycle.
Project 2: Feature Store on ScyllaDB
Objective: Show that you understand distributed database design and feature store concepts at a deep level.
Implementation Details:
- Deploy a 3-node ScyllaDB cluster on Kubernetes
- Design a feature schema for a credit scoring use case:
- User demographic features (slowly changing)
- Transaction aggregate features (computed daily)
- Real-time transaction features (updated per transaction)
- Build a batch pipeline (using Airflow) that computes daily aggregate features and writes them to ScyllaDB
- Build a streaming pipeline (using Kafka and a Python consumer) that updates real-time features
- Build a feature serving API (FastAPI) that retrieves features by user ID with sub-5ms p99 latency
- Implement point-in-time correctness for training data generation
- Add monitoring: feature freshness, serving latency, error rates
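The point-in-time correctness item above comes down to one rule: for each label event, join the newest feature value observed at or before the label timestamp — never after. A toy sketch with assumed in-memory data:

```python
import bisect

def point_in_time_lookup(feature_history, label_ts):
    """feature_history: list of (timestamp, value) sorted ascending.
    Return the latest value with timestamp <= label_ts. Taking any
    later value would leak post-event information into training."""
    timestamps = [ts for ts, _ in feature_history]
    i = bisect.bisect_right(timestamps, label_ts)
    return feature_history[i - 1][1] if i > 0 else None

history = [(100, 0.2), (200, 0.5), (300, 0.9)]
print(point_in_time_lookup(history, 250))  # 0.5 (value as of ts=200)
print(point_in_time_lookup(history, 50))   # None (no feature existed yet)
```

In the ScyllaDB-backed version, the same logic maps to a clustering key on timestamp in descending order with a `WHERE ts <= ?` limit-1 query per feature row.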
Key Design Decisions to Document:
- Partition key design and why
- Compaction strategy selection
- Consistency level choices for reads and writes
- TTL strategy for feature expiration
- Connection pooling and driver configuration
What This Demonstrates: Distributed database expertise, feature store architecture understanding, real-time systems design, and performance optimization skills.
Project 3: LLM Serving Platform with RAG
Objective: Demonstrate hands-on experience with LLM infrastructure and inference optimization.
Implementation Details:
- Deploy an open-source LLM (Llama 3 8B or Mistral 7B) using vLLM on Kubernetes with GPU
- Implement a RAG pipeline:
- Document ingestion pipeline (PDF/text to embeddings)
- Vector database (ChromaDB or Milvus) for similarity search
- Prompt template system with context injection
- Streaming response generation
- Optimize inference:
- Quantize the model to INT4 using AWQ
- Benchmark latency and throughput before and after quantization
- Configure continuous batching parameters
- Implement request-level caching for common queries
- Build a simple chat interface to demonstrate the system
- Add monitoring: tokens per second, time to first token, GPU utilization, cache hit rate
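The request-level cache above can be sketched with an LRU keyed on a normalized prompt — a toy version; a production cache would also key on sampling parameters and apply TTLs:

```python
from collections import OrderedDict

class PromptCache:
    """Tiny LRU cache for LLM responses, keyed on a normalized prompt."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.cache = OrderedDict()

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse whitespace and case so trivially different phrasings hit.
        return " ".join(prompt.lower().split())

    def get(self, prompt):
        key = self._key(prompt)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        return None

    def put(self, prompt, response):
        key = self._key(prompt)
        self.cache[key] = response
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

cache = PromptCache()
cache.put("What is my card limit?", "cached answer")
print(cache.get("what is  my card limit?"))  # cache hit despite spacing/case
```

Cache hit rate then becomes one of the monitoring metrics listed above, alongside tokens per second and time to first token.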
Bonus: Deploy the same model using Triton with the TensorRT-LLM backend and compare performance with vLLM.
What This Demonstrates: LLM infrastructure skills, optimization expertise, RAG architecture understanding, and the ability to make quantitative engineering decisions.
9. Knowledge Check Quiz
Q1. What is the primary advantage of ScyllaDB's shard-per-core architecture over Cassandra's JVM thread-pool model?
ScyllaDB's shard-per-core architecture assigns each CPU core its own data shard and processing thread, eliminating context switching overhead and lock contention. This results in predictable, consistent low-latency performance (especially at p99) compared to Cassandra's JVM-based thread-pool model, which suffers from garbage collection pauses and cross-thread coordination overhead. For a feature store where p99 latency matters as much as average latency, this architectural difference is critical.
Q2. Why is continuous batching superior to static batching for LLM inference, and what system implements this?
Static batching requires all requests in a batch to complete before any response can be returned, meaning shorter requests are delayed by longer ones. Continuous batching (also called iteration-level batching) allows new requests to enter the batch as soon as any request in the current batch completes a generation step. This maximizes GPU utilization by keeping the GPU busy processing tokens rather than waiting for the longest request to finish. vLLM popularized this scheduling approach, pairing it with its PagedAttention mechanism for KV-cache memory management, and Triton supports it through the TensorRT-LLM backend.
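The utilization gap can be made concrete with a toy step-count model (one decode step per batch slot per tick; lengths and numbers are illustrative, and the continuous case assumes instant refill of freed slots):

```python
def static_batching_steps(request_lengths, batch_size):
    """Every request in a batch holds its slot until the LONGEST
    request in that batch finishes."""
    steps = 0
    for i in range(0, len(request_lengths), batch_size):
        steps += max(request_lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(request_lengths, batch_size):
    """A freed slot is immediately refilled with the next waiting
    request, so total steps ~ total tokens / parallel slots, bounded
    below by the single longest request."""
    total, longest = sum(request_lengths), max(request_lengths)
    return max(longest, -(-total // batch_size))  # ceil division

lengths = [100, 100, 100, 10]  # three long requests, one short one
print(static_batching_steps(lengths, 2))      # 200 steps
print(continuous_batching_steps(lengths, 2))  # 155 steps
```

Even in this tiny example, continuous batching finishes the same work in ~22% fewer GPU steps because no slot sits idle waiting for a batch-mate.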
Q3. Explain the difference between Kubeflow Pipelines and Airflow for ML workflow orchestration. When would you use each?
Airflow is a general-purpose workflow orchestration tool optimized for scheduled, dependency-driven DAGs. It excels at data pipelines, ETL jobs, and complex scheduling logic with a mature ecosystem of 500+ operators. Kubeflow Pipelines is a Kubernetes-native ML pipeline orchestrator optimized for ML-specific workflows. It provides first-class support for experiment tracking, artifact management, and pipeline visualization. Use Airflow when you need complex scheduling, mixed workload orchestration (data + ML), and integration with diverse data sources. Use Kubeflow Pipelines when you need pure ML pipelines with tight Kubernetes integration, pipeline versioning and comparison, and integration with Kubeflow's hyperparameter tuning (Katib) and training operators.
Q4. In the context of MLFlow Model Registry, what are the model stages, and how would you implement an automated promotion workflow?
MLFlow Model Registry defines four stages: None, Staging, Production, and Archived (newer MLFlow releases deprecate fixed stages in favor of model version aliases, but the promotion logic carries over). An automated promotion workflow:
- A new model version registered from the Airflow training DAG enters the None stage
- Automated tests run against it (data validation, performance benchmarks, bias checks)
- If all tests pass, the version is promoted to Staging
- A canary deployment serves a small percentage of production traffic with the Staging model
- If canary metrics (latency, accuracy, error rate) meet thresholds over a defined period, the version is promoted to Production
- The previously active Production version is moved to Archived
Implement this workflow using MLFlow webhooks or API polling, combined with Airflow DAGs for orchestration.
Q5. How would you design a zero-downtime model update strategy on Triton Inference Server running on Kubernetes?
Triton supports model versioning natively through its model repository. The strategy:
- Upload the new model version to the model repository (S3/MinIO) as a new version directory (e.g., version 2 alongside version 1)
- Run Triton in explicit model control mode and call the model load API to load the new version
- Update the model configuration so the version policy points to the new version as default
- At the Kubernetes level, use a rolling update strategy with readiness probes that check the model health endpoints
- Configure appropriate max surge and max unavailable parameters in the Deployment spec so at least one healthy pod always serves the previous model version
- For more sophisticated traffic shifting, use an Istio or Linkerd service mesh to gradually shift traffic from the old version to the new one based on custom metrics
- Implement automated rollback by monitoring model-specific metrics (accuracy, latency p99) and reverting if they degrade beyond thresholds
10. References and Resources
Official Documentation
- Kubernetes Documentation — https://kubernetes.io/docs/
- MLFlow Documentation — https://mlflow.org/docs/latest/
- Apache Airflow Documentation — https://airflow.apache.org/docs/
- Kubeflow Documentation — https://www.kubeflow.org/docs/
- JupyterHub Documentation — https://jupyterhub.readthedocs.io/
- NVIDIA Triton Inference Server — https://docs.nvidia.com/deeplearning/triton-inference-server/
- ScyllaDB Documentation — https://docs.scylladb.com/
- vLLM Documentation — https://docs.vllm.ai/
- NVIDIA TensorRT-LLM — https://nvidia.github.io/TensorRT-LLM/
- Feast Feature Store — https://docs.feast.dev/
Books
- "Designing Data-Intensive Applications" by Martin Kleppmann (O'Reilly) — essential for distributed systems concepts
- "Kubernetes in Action" by Marko Luksa (Manning) — comprehensive Kubernetes deep dive
- "Practical MLOps" by Noah Gift and Alfredo Deza (O'Reilly) — MLOps practices and patterns
- "Data Pipelines with Apache Airflow" by Bas Harenslak and Julian de Ruiter (Manning) — Airflow best practices
- "Programming Massively Parallel Processors" by David Kirk and Wen-mei Hwu — GPU computing fundamentals
- "Cassandra: The Definitive Guide" by Jeff Carpenter and Eben Hewitt (O'Reilly) — distributed database concepts applicable to ScyllaDB
- "Database Internals" by Alex Petrov (O'Reilly) — storage engine and distributed system internals
- "Machine Learning Engineering" by Andriy Burkov — practical ML engineering patterns
Online Courses and Certifications
- CKA (Certified Kubernetes Administrator) — https://www.cncf.io/certification/cka/
- ScyllaDB University — https://university.scylladb.com/
- NVIDIA Deep Learning Institute — https://www.nvidia.com/en-us/training/
- Made With ML (MLOps Course) — https://madewithml.com/
- Full Stack Deep Learning — https://fullstackdeeplearning.com/
Community Resources
- MLOps Community Slack — https://mlops.community/
- Kubeflow Slack Channel
- NVIDIA Developer Forums — https://forums.developer.nvidia.com/
- Toss Tech Blog — https://toss.tech/
- Airflow Summit conference talks (YouTube)