토스뱅크 ML Engineer (MLOps) 합격 완벽 가이드: MLFlow부터 LLM 플랫폼까지 기술스택 총정리

들어가며: 왜 토스뱅크 ML Platform Team인가

토스뱅크는 2021년 출범 이후 국내 인터넷전문은행의 판도를 바꿔왔습니다. 특히 ML Platform Team은 전사 머신러닝 인프라를 책임지는 핵심 조직으로, 금융 도메인 특화 MLOps라는 희소성 높은 경험을 쌓을 수 있는 곳입니다.

이 글은 토스뱅크 ML Engineer (MLOps) JD를 한 줄 한 줄 해부하고, 각 기술스택을 실무 수준까지 깊이 있게 다룹니다. 단순히 "이런 기술이 있다" 수준이 아니라, "면접관이 왜 이 기술을 물어보는지", "실무에서 어떤 문제를 해결하는지"까지 파고들겠습니다.


1. 토스뱅크 ML Platform Team 분석

1-1. 팀 미션

토스뱅크 ML Platform Team의 핵심 미션은 전사 머신러닝 플랫폼 구축 및 운영입니다. 이것이 의미하는 바를 구체적으로 풀어보겠습니다.

  • 플랫폼 엔지니어링: ML 엔지니어와 데이터 사이언티스트가 모델을 빠르게 실험하고 배포할 수 있는 셀프서비스 플랫폼 구축
  • ML 챕터 내 공통 기술 개발: 여신(대출 심사), 수신(이자율 예측), 이상거래탐지(FDS), 마케팅(추천) 등 각 도메인 ML 팀이 공통으로 사용하는 인프라 제공
  • End-to-End 자동화: 데이터 수집 → 피처 엔지니어링 → 모델 학습 → 평가 → 배포 → 모니터링의 전 과정 자동화

1-2. 핀테크 x MLOps의 특수성

일반 IT 회사의 MLOps와 금융 MLOps는 근본적으로 다릅니다.

규제 준수 (Compliance)

  • 금융위원회 AI 가이드라인: 모델 설명가능성(Explainability) 필수
  • 개인정보보호법: 학습 데이터의 비식별화, 암호화 필수
  • 모델 감사 추적(Audit Trail): 어떤 데이터로 어떤 모델이 어떤 결정을 내렸는지 완전 추적 가능해야 함
  • 모델 거버넌스: 모델 승인 프로세스, 변경 관리, 폐기 절차

실시간성 (Low Latency)

  • 대출 심사: P99 레이턴시 100ms 이하 요구
  • 이상거래탐지: 거래 발생 시 밀리초 단위로 판단
  • Feature Store의 Online Serving: P99 기준 한 자릿수 ms 요구

설명가능성 (Explainability)

  • SHAP, LIME 등을 활용한 모델 판단 근거 제공
  • KServe Explainer를 통한 추론 시 실시간 설명 생성
  • 금감원 검사 대응을 위한 모델 문서화 자동화
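
위 항목 중 SHAP을 예로 들면, 트리 기반 심사 모델의 판단 근거는 아래처럼 추론 결과와 함께 기록할 수 있습니다. model, applicant_features, feature_names, prediction은 설명을 위한 가정이며, 실제 구현은 모델 종류와 감사 요건에 따라 달라집니다.

import shap

# 학습된 트리 기반 모델(XGBoost/LightGBM 등)에 대한 설명자 생성 (model은 가정)
explainer = shap.TreeExplainer(model)

# 단일 심사 건에 대한 피처별 기여도 계산 (applicant_features는 가정)
shap_values = explainer.shap_values(applicant_features)

# 감사 추적(Audit Trail)용으로 판단 근거를 추론 결과와 함께 저장
audit_record = {
    "model_version": "v3",
    "decision": int(prediction),
    "feature_contributions": dict(zip(feature_names, shap_values[0])),
}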

1-3. 토스의 기술 문화

토스는 사일로 해체와 마이크로서비스 아키텍처를 지향합니다. ML Platform Team도 이 철학 아래 동작하므로:

  • 각 ML 팀이 독립적으로 모델을 배포할 수 있는 셀프서비스 구조
  • 플랫폼 팀은 인프라만 제공하고, 도메인 팀이 자율적으로 운영
  • Slack Bot, Internal Dashboard 등 개발자 경험(DX) 중시

2. JD 완전 해부: 라인 바이 라인

토스뱅크 ML Engineer (MLOps) JD의 주요 항목을 하나씩 분석합니다.

자격 요건 분석

"Kubernetes 기반 인프라 운영 경험"

  • 단순 kubectl 명령어가 아닌, 클러스터 설계부터 운영까지의 깊은 이해
  • GPU 노드 관리, 리소스 할당 최적화, 네트워크 정책 설정 포함
  • Helm Chart 작성, Operator 패턴 이해 필수

"MLFlow, Airflow, Kubeflow 등 ML 플랫폼 도구 경험"

  • 세 가지 도구 모두를 사용해본 경험이 이상적
  • 각 도구의 역할 구분을 명확히 이해해야 함
  • MLFlow = 실험 추적 + 모델 관리, Airflow = 워크플로우 스케줄링, Kubeflow = K8s 네이티브 ML 파이프라인

"모델 서빙 파이프라인 구축 및 운영 경험"

  • Triton Inference Server 또는 TFServing, Seldon Core 경험
  • 전처리 → 추론 → 후처리 파이프라인 설계
  • 오토스케일링, 카나리 배포, A/B 테스트 구현

"Feature Store 설계 및 운영 경험"

  • ScyllaDB를 Online Store로 사용하는 아키텍처 이해
  • Offline Store와 Online Store 간 동기화 전략
  • Feature computation 파이프라인 (Spark/Flink 기반)

우대 사항 분석

"LLM 서빙 및 플랫폼 구축 경험"

  • 2024년부터 토스뱅크도 LLM을 적극 도입 중
  • vLLM, TensorRT-LLM 등 고성능 LLM 추론 엔진 경험
  • RAG 파이프라인, Prompt Management 등 LLMOps 전반

"GPU 클러스터 운영 및 최적화 경험"

  • NVIDIA A100/H100 클러스터 운영
  • MIG, time-slicing 등 GPU 공유 전략
  • DCGM을 활용한 GPU 모니터링

"분산 데이터베이스 운영 경험"

  • ScyllaDB 또는 Cassandra 운영 경험
  • Compaction 전략, repair, topology 변경 경험
  • 데이터 모델링 (Partition Key 설계의 중요성)

3. 기술스택 딥다이브

이 섹션이 이 글의 핵심입니다. 각 기술을 면접에서 자신 있게 설명할 수 있는 수준까지 다룹니다.

3-1. Kubernetes (기반 인프라)

아키텍처 완전 이해

Kubernetes의 아키텍처는 Control Plane과 Data Plane(Worker Nodes)으로 나뉩니다.

Control Plane 컴포넌트:

  • kube-apiserver: 모든 컴포넌트의 통신 허브. RESTful API 제공. etcd의 유일한 클라이언트
  • etcd: 클러스터 상태를 저장하는 분산 Key-Value 스토어. Raft 합의 알고리즘 사용
  • kube-scheduler: Pod를 적절한 노드에 배치. Predicates(필터링) → Priorities(스코어링)
  • kube-controller-manager: ReplicaSet, Deployment, StatefulSet 등의 컨트롤러 실행

Worker Node 컴포넌트:

  • kubelet: 노드에서 Pod 라이프사이클 관리. CRI(Container Runtime Interface)를 통해 컨테이너 실행
  • kube-proxy: 서비스 네트워킹. iptables 또는 IPVS 모드
  • Container Runtime: containerd, CRI-O

Deployment 전략

# Canary Deployment 예시 (Argo Rollouts)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ml-model-rollout
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause:
            duration: 1h
        - setWeight: 30
        - pause:
            duration: 1h
        - setWeight: 60
        - pause:
            duration: 30m
      analysis:
        templates:
          - templateName: model-accuracy-check

ML 모델 배포에서 Canary 배포가 특히 중요한 이유: 새 모델이 프로덕션 트래픽의 일부에서 정확도를 검증한 후에야 전체 배포를 진행합니다.

Resource Management 심화

# GPU Pod의 리소스 설정
apiVersion: v1
kind: Pod
metadata:
  name: triton-server
spec:
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.01-py3
      resources:
        requests:
          cpu: '4'
          memory: '16Gi'
          nvidia.com/gpu: '1'
        limits:
          cpu: '8'
          memory: '32Gi'
          nvidia.com/gpu: '1'
  nodeSelector:
    accelerator: nvidia-a100
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

QoS 클래스 이해:

  • Guaranteed: requests == limits (GPU 워크로드에 권장)
  • Burstable: requests는 설정했지만 limits가 더 크거나 미설정
  • BestEffort: requests/limits 미설정 (ML 워크로드에 비추천)

GPU on Kubernetes

MLOps에서 가장 중요한 부분 중 하나입니다.

NVIDIA Device Plugin:

  • K8s가 GPU를 인식하도록 해주는 DaemonSet
  • nvidia.com/gpu 리소스 타입을 등록

GPU 공유 전략:

| 전략 | 설명 | 장점 | 단점 |
| --- | --- | --- | --- |
| MIG (Multi-Instance GPU) | A100을 최대 7개 인스턴스로 분할 | 하드웨어 수준 격리 | A100/H100만 지원 |
| Time-Slicing | 시분할로 GPU 공유 | 모든 GPU 지원 | 격리 없음, 간섭 가능 |
| MPS (Multi-Process Service) | CUDA 컨텍스트 공유 | 오버헤드 낮음 | 메모리 보호 제한적 |
| vGPU | NVIDIA GRID 기반 | 강한 격리 | 라이선스 비용 |

NVIDIA GPU Operator:

  • Device Plugin, Container Toolkit, DCGM Exporter를 자동으로 설치/관리
  • 노드에 GPU가 추가되면 자동으로 드라이버와 플러그인 배포

# GPU Operator 설치 (Helm)
# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# helm install gpu-operator nvidia/gpu-operator

GitOps with ArgoCD

# ArgoCD Application 예시 - ML Model Deployment
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-detection-model
spec:
  project: ml-platform
  source:
    repoURL: https://github.com/tossbank/ml-deployments
    targetRevision: main
    path: models/fraud-detection
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

실습 로드맵:

  1. Minikube/kind로 로컬 K8s 클러스터 구축
  2. NVIDIA Device Plugin 설치 후 GPU Pod 실행 테스트
  3. Helm Chart 작성으로 Triton Server 배포
  4. ArgoCD 설치 후 GitOps 파이프라인 구축
  5. EKS/GKE에서 실제 GPU 노드 그룹 운영

추천 자료:

  • CKA/CKAD 자격증 (실전 역량 증명에 최고)
  • killer.sh (CKA 모의시험)
  • Kubernetes in Action 2nd Edition (Manning)
  • NVIDIA GPU Operator 공식 문서

3-2. MLFlow (실험 관리 및 모델 레지스트리)

4대 컴포넌트 이해

1. MLFlow Tracking

  • 실험(Experiment) 안에 여러 실행(Run)을 기록
  • 파라미터, 메트릭, 아티팩트, 소스 코드 자동 추적

import mlflow

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run(run_name="xgboost-baseline"):
    params = {"max_depth": 6, "learning_rate": 0.1, "n_estimators": 1000}
    mlflow.log_params(params)

    model = train_model(params)

    mlflow.log_metric("auc", 0.9542)
    mlflow.log_metric("f1", 0.8731)
    mlflow.log_metric("precision", 0.9012)
    mlflow.log_metric("recall", 0.8467)

    # 모델 아티팩트 저장
    mlflow.xgboost.log_model(model, "model")

    # 학습 데이터 메타데이터 기록 (금융 규제 대응)
    mlflow.log_param("training_data_version", "2024-03-15")
    mlflow.log_param("data_hash", "sha256:abc123...")

2. MLFlow Projects

  • 재현 가능한 실행 환경 정의
  • MLproject 파일 + conda.yaml (또는 Dockerfile)

3. MLFlow Models

  • 다양한 ML 프레임워크의 모델을 통일된 포맷으로 패키징
  • Flavor 시스템: PyTorch, XGBoost, TensorFlow, ONNX 등

4. MLFlow Model Registry

  • 모델 버전 관리의 핵심

# 모델 등록
result = mlflow.register_model(
    "runs:/abc123/model",
    "fraud-detection-model"
)

# Stage 전환 (Staging -> Production)
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=3,
    stage="Production",
    archive_existing_versions=True  # 기존 Production 모델 자동 아카이브
)

Tracking Server 프로덕션 구축

# docker-compose.yaml for MLFlow Server
version: '3'
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.11.0
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:password@postgres:5432/mlflow
      --default-artifact-root s3://tossbank-ml-artifacts/
      --host 0.0.0.0
      --port 5000
    environment:
      AWS_ACCESS_KEY_ID: ...
      AWS_SECRET_ACCESS_KEY: ...
    ports:
      - '5000:5000'

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: password
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

왜 PostgreSQL + S3 조합인가?

  • PostgreSQL: 실험 메타데이터 (파라미터, 메트릭) — 쿼리 성능 우수
  • S3: 모델 아티팩트 (대용량 바이너리) — 확장성 무한
  • SQLite + 로컬 파일은 단일 장비 한계로 프로덕션에 부적합

MLFlow + K8s 연동

# MLFlow 모델을 K8s에 배포
# mlflow models build-docker 명령으로 Docker 이미지 생성
# 또는 KServe InferenceService로 직접 배포

# KServe로 MLFlow 모델 서빙
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model
spec:
  predictor:
    mlflow:
      protocolVersion: v2
      storageUri: s3://tossbank-ml-artifacts/fraud-model/v3
      resources:
        requests:
          cpu: '2'
          memory: '4Gi'

실습 체크리스트:

  1. Docker Compose로 MLFlow 서버 구축 (PostgreSQL + MinIO)
  2. XGBoost 모델 학습 → 파라미터/메트릭 로깅
  3. Model Registry에 모델 등록 → Stage 전환
  4. MLFlow Models로 Docker 이미지 빌드 → K8s 배포
  5. A/B 테스트: 두 모델 버전 동시 서빙 후 비교

추천 자료:

  • MLFlow 공식 문서: mlflow.org/docs/latest
  • "MLOps: Continuous delivery and automation pipelines in machine learning" (Google Cloud)
  • "Practical MLOps" (O'Reilly, Noah Gift)

3-3. Apache Airflow (워크플로우 오케스트레이션)

아키텍처 깊이 이해

Airflow의 아키텍처는 4개 핵심 컴포넌트로 구성됩니다.

Scheduler:

  • DAG 파일을 파싱하여 실행 계획 수립
  • DagRun과 TaskInstance 생성
  • Executor에게 태스크 실행 위임
  • min_file_process_interval로 DAG 파싱 주기 조절

Webserver:

  • Flask 기반 UI
  • DAG 상태 모니터링, 로그 확인, 수동 트리거

Workers:

  • 실제 태스크를 실행하는 프로세스
  • Executor 종류에 따라 다르게 동작

Metadata DB:

  • PostgreSQL 또는 MySQL
  • DAG 정의, 실행 이력, 변수, 커넥션 정보 저장

Executor 비교 (면접 핵심!)

| Executor | 특징 | 적합한 환경 |
| --- | --- | --- |
| LocalExecutor | 단일 머신, 멀티프로세스 | 개발/소규모 |
| CeleryExecutor | Redis/RabbitMQ 기반 분산 | 중규모, 안정적 |
| KubernetesExecutor | 태스크마다 Pod 생성 | K8s 환경, GPU 워크로드 |
| CeleryKubernetesExecutor | Celery + K8s 혼합 | 대규모, 유연성 필요 |

토스뱅크에서는 KubernetesExecutor 또는 CeleryKubernetesExecutor를 사용할 가능성이 높습니다. GPU 학습 태스크는 K8s Pod로, 경량 태스크는 Celery Worker로 처리하는 하이브리드 전략입니다.
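
이 하이브리드 전략은 태스크의 queue 지정만으로 구현됩니다. CeleryKubernetesExecutor는 설정된 kubernetes_queue(기본값 kubernetes)에 배정된 태스크만 전용 Pod로 실행하고, 나머지는 Celery Worker로 보냅니다. 아래는 개념을 보여주는 간단한 스케치입니다 (python_callable 함수들은 가정).

from airflow.operators.python import PythonOperator

# 경량 태스크: 기본 queue → Celery Worker에서 실행
light_task = PythonOperator(
    task_id="validate_schema",
    python_callable=validate_schema,  # 가정: 스키마 검증 함수
)

# GPU 학습 태스크: kubernetes queue → 태스크 전용 K8s Pod 생성
gpu_task = PythonOperator(
    task_id="train_on_gpu",
    python_callable=train_on_gpu,  # 가정: 학습 함수
    queue="kubernetes",  # [celery_kubernetes_executor] kubernetes_queue 설정값과 일치해야 함
)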

ML 파이프라인 DAG 작성

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "ml-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(hours=2),
}

with DAG(
    dag_id="fraud_detection_training_pipeline",
    default_args=default_args,
    schedule_interval="0 2 * * *",  # 매일 새벽 2시
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["ml", "fraud-detection"],
) as dag:

    # 1. 데이터 수집 (경량 - Python)
    extract_data = PythonOperator(
        task_id="extract_data",
        python_callable=extract_training_data,
    )

    # 2. 피처 엔지니어링 (Spark on K8s)
    feature_engineering = KubernetesPodOperator(
        task_id="feature_engineering",
        name="feature-eng-pod",
        namespace="ml-jobs",
        image="tossbank/spark-feature-eng:latest",
        arguments=["--date", "{{ ds }}"],
        resources={
            "requests": {"cpu": "4", "memory": "16Gi"},
            "limits": {"cpu": "8", "memory": "32Gi"},
        },
        is_delete_operator_pod=True,
        get_logs=True,
    )

    # 3. 모델 학습 (GPU Pod)
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="model-training-pod",
        namespace="ml-jobs",
        image="tossbank/fraud-model-trainer:latest",
        arguments=[
            "--experiment-name", "fraud-detection-v2",
            "--date", "{{ ds }}",
        ],
        resources={
            "requests": {"cpu": "4", "memory": "32Gi", "nvidia.com/gpu": "1"},
            "limits": {"cpu": "8", "memory": "64Gi", "nvidia.com/gpu": "1"},
        },
        node_selector={"accelerator": "nvidia-a100"},
        tolerations=[{
            "key": "nvidia.com/gpu",
            "operator": "Exists",
            "effect": "NoSchedule",
        }],
        is_delete_operator_pod=True,
        get_logs=True,
    )

    # 4. 모델 평가
    evaluate_model = KubernetesPodOperator(
        task_id="evaluate_model",
        name="model-evaluation-pod",
        namespace="ml-jobs",
        image="tossbank/model-evaluator:latest",
        is_delete_operator_pod=True,
        get_logs=True,
    )

    # 5. 모델 배포 (조건부)
    deploy_model = KubernetesPodOperator(
        task_id="deploy_model",
        name="model-deploy-pod",
        namespace="ml-serving",
        image="tossbank/model-deployer:latest",
        arguments=["--model-name", "fraud-detection", "--stage", "canary"],
        is_delete_operator_pod=True,
        get_logs=True,
    )

    extract_data >> feature_engineering >> train_model >> evaluate_model >> deploy_model

XCom으로 태스크 간 데이터 전달

# 모델 학습 태스크에서 메트릭 push
def train_and_push_metrics(**context):
    model, metrics = train_model()
    context["ti"].xcom_push(key="model_auc", value=metrics["auc"])
    context["ti"].xcom_push(key="model_version", value="v3.2.1")

# 평가 태스크에서 메트릭 pull하여 배포 결정
def evaluate_and_decide(**context):
    auc = context["ti"].xcom_pull(task_ids="train_model", key="model_auc")
    if auc >= 0.95:
        return "deploy_model"  # BranchPythonOperator로 분기
    else:
        return "notify_team"

Secrets 관리 (Vault 연동)

토스뱅크 같은 금융사에서는 민감 정보 관리가 필수입니다.

# airflow.cfg
# [secrets]
# backend = airflow.providers.hashicorp.secrets.vault.VaultBackend
# backend_kwargs = {"connections_path": "airflow/connections", "variables_path": "airflow/variables", "url": "https://vault.tossbank.com:8200"}

실습 체크리스트:

  1. Docker Compose로 Airflow 로컬 환경 구축
  2. 간단한 Python DAG 작성 → Webserver에서 실행 확인
  3. KubernetesPodOperator로 GPU 태스크 실행
  4. MLFlow와 연동하여 학습-평가-등록 자동화
  5. SLA miss alert 설정, 모니터링 대시보드 구축

추천 자료:

  • "Data Pipelines with Apache Airflow" (Manning, Bas Harenslak)
  • Apache Airflow 공식 문서
  • Astronomer Guides (astronomer.io/guides)
  • Marc Lamberti의 Airflow 강의 (Udemy)

3-4. Kubeflow (K8s 기반 ML 플랫폼)

핵심 컴포넌트

Kubeflow Pipelines (KFP):

  • K8s 네이티브 ML 파이프라인 오케스트레이션
  • Argo Workflows 기반 (최신 버전은 KFP v2)
  • 파이프라인 = 컴포넌트들의 DAG

from kfp import dsl
from kfp import compiler

@dsl.component(
    base_image="python:3.11",
    packages_to_install=["pandas", "scikit-learn", "mlflow"]
)
def train_component(
    data_path: str,
    learning_rate: float,
    max_depth: int,
) -> str:
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    import mlflow

    # 학습 로직
    df = pd.read_parquet(data_path)
    model = GradientBoostingClassifier(
        learning_rate=learning_rate,
        max_depth=max_depth,
    )
    model.fit(df.drop("target", axis=1), df["target"])

    with mlflow.start_run():
        mlflow.sklearn.log_model(model, "model")
        run_id = mlflow.active_run().info.run_id

    return run_id

@dsl.pipeline(name="Fraud Detection Pipeline")
def fraud_pipeline(data_path: str = "s3://data/fraud/latest"):
    train_task = train_component(
        data_path=data_path,
        learning_rate=0.1,
        max_depth=6,
    )
    # KFP v2 API 기준: accelerator 타입/개수 지정 (v1의 set_gpu_limit은 deprecated)
    train_task.set_accelerator_type("nvidia.com/gpu")
    train_task.set_accelerator_limit(1)
    train_task.set_memory_limit("32Gi")

compiler.Compiler().compile(fraud_pipeline, "pipeline.yaml")

Katib (하이퍼파라미터 자동 튜닝):

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: fraud-model-tuning
spec:
  objective:
    type: maximize
    goal: 0.98
    objectiveMetricName: auc
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 3
  maxTrialCount: 30
  maxFailedTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: '0.001'
        max: '0.3'
    - name: max_depth
      parameterType: int
      feasibleSpace:
        min: '3'
        max: '12'
    - name: n_estimators
      parameterType: int
      feasibleSpace:
        min: '100'
        max: '2000'

KServe (모델 서빙):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      storageUri: s3://models/fraud-detection/v3
      resources:
        requests:
          cpu: '2'
          memory: '4Gi'
  transformer:
    containers:
      - image: tossbank/fraud-preprocessor:latest
        name: transformer
  explainer:
    containers:
      - image: tossbank/fraud-explainer:latest
        name: explainer

Kubeflow vs Airflow: 언제 무엇을 쓸 것인가

| 기준 | Airflow | Kubeflow Pipelines |
| --- | --- | --- |
| 주 용도 | 범용 워크플로우 | ML 특화 파이프라인 |
| 스케줄링 | 강력 (cron, 센서) | 제한적 |
| K8s 통합 | KubernetesExecutor | 네이티브 |
| 파이프라인 캐싱 | 없음 | 컴포넌트 단위 캐싱 |
| UI | 강력한 모니터링 | 파이프라인 시각화 |
| 학습 곡선 | 상대적으로 낮음 | 상대적으로 높음 |

토스뱅크의 예상 조합:

  • Airflow: 데이터 파이프라인, 배치 스케줄링, 외부 시스템 연동
  • Kubeflow: ML 학습 파이프라인, 하이퍼파라미터 튜닝, 모델 서빙

추천 자료:

  • Kubeflow 공식 문서: kubeflow.org/docs
  • Google Cloud MLOps Examples
  • KFP SDK v2 마이그레이션 가이드

3-5. JupyterHub (노트북 환경)

JupyterHub on K8s

데이터 사이언티스트의 일상 도구인 JupyterHub를 K8s 위에서 운영하는 것은 MLOps 엔지니어의 중요한 역할입니다.

Zero to JupyterHub:

# values.yaml (Helm)
singleuser:
  image:
    name: tossbank/ml-notebook
    tag: latest
  profileList:
    - display_name: 'CPU Notebook (Small)'
      description: '2 CPU, 4GB RAM'
      kubespawner_override:
        cpu_limit: 2
        mem_limit: '4G'
    - display_name: 'GPU Notebook (A100)'
      description: '8 CPU, 32GB RAM, 1 A100 GPU'
      kubespawner_override:
        cpu_limit: 8
        mem_limit: '32G'
        extra_resource_limits:
          nvidia.com/gpu: '1'
        node_selector:
          accelerator: nvidia-a100
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule

hub:
  config:
    Authenticator:
      admin_users:
        - admin
    GenericOAuthenticator:
      client_id: jupyterhub
      client_secret: ...
      oauth_callback_url: https://jupyter.tossbank.com/hub/oauth_callback
      authorize_url: https://auth.tossbank.com/oauth/authorize
      token_url: https://auth.tossbank.com/oauth/token

proxy:
  service:
    type: ClusterIP

커스텀 노트북 이미지: GPU 노트북에는 CUDA, cuDNN, PyTorch, TensorFlow, MLFlow 클라이언트 등이 사전 설치되어야 합니다.

FROM nvidia/cuda:12.3.1-cudnn9-runtime-ubuntu22.04

# runtime 베이스 이미지에는 Python이 없으므로 먼저 설치
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install \
    jupyterlab==4.1.0 \
    torch==2.2.0 \
    tensorflow==2.15.0 \
    mlflow==2.11.0 \
    xgboost==2.0.3 \
    scikit-learn==1.4.0 \
    pandas==2.2.0 \
    boto3==1.34.0

# MLFlow 트래킹 서버 기본 설정
ENV MLFLOW_TRACKING_URI=http://mlflow-server:5000

보안 고려사항 (금융사 필수):

  • NetworkPolicy로 노트북 간 네트워크 격리
  • PodSecurityPolicy/PodSecurityStandard 적용
  • 리소스 쿼터 설정 (GPU 독점 방지)
  • PVC 기반 persistent workspace (노트북 재시작 시 데이터 보존)

3-6. Triton Inference Server (모델 서빙)

이 섹션은 토스뱅크 ML Platform에서 가장 실무적으로 중요한 기술입니다.

아키텍처 이해

Triton은 세 가지 핵심 개념으로 동작합니다.

Model Repository:

  • 파일 시스템 구조로 모델 관리
  • 로컬, S3, GCS, Azure Blob 지원

model_repository/
├── fraud_detection/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx
├── text_classifier/
│   ├── config.pbtxt
│   └── 1/
│       └── model.pt
└── ensemble_pipeline/
    ├── config.pbtxt
    └── 1/

Backend System:

  • TensorRT: NVIDIA GPU 최적 추론
  • ONNX Runtime: 범용 프레임워크
  • PyTorch (LibTorch): PyTorch 네이티브
  • Python Backend: 커스텀 전/후처리
  • vLLM Backend: LLM 서빙

Dynamic Batching (면접 핵심!)

Dynamic Batching은 Triton의 킬러 기능입니다. 개별 추론 요청을 모아서 배치로 처리하여 GPU 활용도를 극대화합니다.

# config.pbtxt
name: "fraud_detection"
platform: "onnxruntime_onnx"
max_batch_size: 64

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]

input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [128]
  }
]

output [
  {
    name: "probability"
    data_type: TYPE_FP32
    dims: [1]
  }
]

preferred_batch_size 설정 전략:

  • GPU 메모리와 모델 크기에 따라 결정
  • 너무 크면: 대기 시간 증가 (레이턴시 악화)
  • 너무 작으면: GPU 활용도 저하 (처리량 악화)
  • Model Analyzer로 최적값 탐색 필요

max_queue_delay_microseconds:

  • 배치를 채우기 위해 요청을 대기시키는 최대 시간
  • 금융 서비스에서는 보통 100us ~ 1000us
  • 실시간 FDS: 100us 이하 / 배치 추천: 1000us~5000us

Model Ensemble (파이프라인 서빙)

# ensemble_config.pbtxt
name: "fraud_pipeline"
platform: "ensemble"
max_batch_size: 32

ensemble_scheduling {
  step [
    {
      model_name: "preprocessor"
      model_version: -1
      input_map {
        key: "raw_transaction"
        value: "RAW_INPUT"
      }
      output_map {
        key: "processed_features"
        value: "FEATURES"
      }
    },
    {
      model_name: "fraud_detection"
      model_version: -1
      input_map {
        key: "features"
        value: "FEATURES"
      }
      output_map {
        key: "probability"
        value: "FRAUD_SCORE"
      }
    },
    {
      model_name: "postprocessor"
      model_version: -1
      input_map {
        key: "score"
        value: "FRAUD_SCORE"
      }
      output_map {
        key: "result"
        value: "FINAL_RESULT"
      }
    }
  ]
}

모델 최적화 파이프라인

원본 모델 (PyTorch/TF)
  → ONNX 변환 (torch.onnx.export)
  → ONNX 최적화 (onnxoptimizer, onnxsim)
  → TensorRT 변환 (trtexec)
  → FP16/INT8 양자화
  → Triton 배포 (TensorRT backend)

# TensorRT 변환 예시
trtexec \
    --onnx=fraud_model.onnx \
    --saveEngine=fraud_model.plan \
    --fp16 \
    --workspace=4096 \
    --minShapes=features:1x128 \
    --optShapes=features:32x128 \
    --maxShapes=features:64x128

Perf Analyzer (성능 벤치마크)

# Triton Perf Analyzer로 성능 측정
perf_analyzer \
    -m fraud_detection \
    -u localhost:8001 \
    --concurrency-range 1:64:4 \
    --input-data random \
    -b 1 \
    --measurement-interval 5000 \
    --percentile 99

결과 분석 포인트:

  • P50/P90/P99 레이턴시
  • Throughput (infer/sec)
  • GPU utilization
  • Batch size vs Latency 트레이드오프

Triton + K8s 오토스케일링

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: nv_inference_request_success
        target:
          type: AverageValue
          averageValue: '100'
    - type: Pods
      pods:
        metric:
          name: nv_gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'

실습 체크리스트:

  1. ONNX 모델을 Triton에 배포하고 gRPC/HTTP로 추론 요청
  2. Dynamic Batching 설정 후 Perf Analyzer로 성능 비교
  3. Model Ensemble로 전처리-추론-후처리 파이프라인 구축
  4. TensorRT로 모델 최적화 후 FP32 대비 성능 비교
  5. K8s HPA로 오토스케일링 설정 후 부하 테스트

추천 자료:

  • NVIDIA Triton 공식 문서: docs.nvidia.com/triton
  • NVIDIA Triton GitHub Examples
  • "Deploying AI Models at Scale" (NVIDIA DLI 과정)

3-7. ScyllaDB와 Feature Store

ScyllaDB 아키텍처 심화

ScyllaDB는 Cassandra를 C++로 재작성한 고성능 분산 NoSQL입니다. 토스뱅크가 Feature Store의 Online Store로 ScyllaDB를 선택한 이유를 깊이 이해해야 합니다.

Shard-per-Core 아키텍처:

  • 각 CPU 코어가 독립적인 샤드를 담당
  • 코어 간 락(lock) 없음 → 극한의 성능
  • Seastar 프레임워크 기반: 비동기 이벤트 루프, share-nothing 디자인
  • JVM GC 없음 (C++) → P99 레이턴시 안정적

Cassandra와의 결정적 차이:

| 항목 | Cassandra | ScyllaDB |
| --- | --- | --- |
| 언어 | Java (JVM) | C++ (Seastar) |
| P99 레이턴시 | 수십 ms | 1-2 ms |
| GC 이슈 | 있음 (Stop-the-World) | 없음 |
| CPU 활용 | JVM 오버헤드 | 코어 100% 활용 |
| 스케일링 | 노드 수로 | 코어 수 + 노드 수 |
| 호환성 | - | CQL 호환 |

데이터 모델링 (면접 핵심!)

ScyllaDB 데이터 모델링에서 Partition Key 설계는 성능을 좌우합니다.

-- Feature Store 테이블 설계 예시
CREATE TABLE feature_store.user_features (
    user_id text,
    feature_name text,
    feature_value blob,
    updated_at timestamp,
    PRIMARY KEY ((user_id), feature_name)
) WITH CLUSTERING ORDER BY (feature_name ASC)
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
  }
  AND gc_grace_seconds = 86400;

-- 실시간 트랜잭션 피처 테이블
CREATE TABLE feature_store.transaction_features (
    user_id text,
    feature_time timestamp,
    feature_name text,
    feature_value double,
    PRIMARY KEY ((user_id), feature_time, feature_name)
) WITH CLUSTERING ORDER BY (feature_time DESC, feature_name ASC)
  AND default_time_to_live = 604800;  -- 7일 TTL

Partition Key 설계 원칙:

  • Hot partition 방지: 특정 파티션에 데이터 쏠림 금지
  • 파티션 크기 100MB 이하 유지
  • 읽기 패턴에 맞춘 설계 (Feature Store는 user_id로 조회가 대부분)
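
이 설계가 조회 코드에서 어떻게 드러나는지 보면, 아래처럼 Partition Key(user_id) 하나로 단일 파티션만 읽게 됩니다. Python cassandra-driver 기준의 간단한 예시이며, ScyllaDB는 CQL 호환이므로 동일하게 동작합니다 (호스트명은 가정).

from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node1", "scylla-node2"])
session = cluster.connect("feature_store")

# Partition Key만으로 조회: 토큰 계산으로 담당 노드에 바로 라우팅됨
stmt = session.prepare(
    "SELECT feature_name, feature_value "
    "FROM user_features WHERE user_id = ?"
)

rows = session.execute(stmt, ["user-12345"])
features = {row.feature_name: row.feature_value for row in rows}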

Compaction 전략 (면접에서 자주 나옴)

| 전략 | 특징 | 적합한 워크로드 |
| --- | --- | --- |
| STCS (Size-Tiered) | 비슷한 크기의 SSTable 병합 | 쓰기 위주 |
| LCS (Leveled) | 레벨별 SSTable 관리 | 읽기 위주 |
| TWCS (Time-Window) | 시간 윈도우별 관리 | 시계열 데이터 |
| ICS (Incremental) | ScyllaDB 독자 전략, 점진적 병합 | 범용 (권장) |

Feature Store의 Online Store는 읽기 위주 워크로드이므로 LCS 또는 ICS가 적합합니다.

Feature Store 개념 정복

Offline Store vs Online Store:

[배치 학습]                      [실시간 서빙]
     │                               │
     ▼                               ▼
┌────────────────┐   동기화   ┌──────────────┐
│ Offline Store  │ ────────→ │ Online Store │
│ (Parquet/Hive) │           │  (ScyllaDB)  │
└────────────────┘           └──────────────┘
     │                               │
     ▼                               ▼
 학습용 피처셋                  실시간 추론 피처
(point-in-time correct)        (P99 < 5ms)

Point-in-Time Correctness:

  • 시간 여행 쿼리의 핵심 개념
  • 학습 시점의 피처를 정확히 재현 (Data Leakage 방지)
  • 예: 3월 1일 모델 학습 시, 3월 1일 기준으로만 알 수 있었던 피처 사용

# Feast를 활용한 Feature Store 구축 예시
from feast import FeatureStore, Entity, FeatureView, Field
from feast.types import Float64, String

# ScyllaDB는 CQL 호환이므로 Feast의 Cassandra contrib online store로 연동 가능

# feature_store.yaml
# project: tossbank_features
# registry: s3://tossbank-feast/registry.db
# provider: local
# online_store:
#   type: cassandra  # ScyllaDB는 Cassandra 호환 online store로 연동
#   hosts:
#     - scylla-node1:9042
#     - scylla-node2:9042
#   keyspace: feature_store
#   replication_factor: 3

user = Entity(
    name="user_id",
    description="Bank customer ID",
)

user_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    schema=[
        Field(name="avg_transaction_amount_7d", dtype=Float64),
        Field(name="transaction_count_24h", dtype=Float64),
        Field(name="unique_merchants_30d", dtype=Float64),
        Field(name="max_single_transaction_7d", dtype=Float64),
    ],
    online=True,
    source=transaction_source,  # BatchSource (Spark, BigQuery, etc.)
)
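
정의한 FeatureView를 materialize한 뒤에는 아래처럼 Online Store에서 실시간으로 피처 벡터를 조회할 수 있습니다 (Feast 표준 API 기준의 간단한 예시).

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Online Store(ScyllaDB)에서 추론 시점의 피처 벡터 조회
feature_vector = store.get_online_features(
    features=[
        "user_transaction_features:avg_transaction_amount_7d",
        "user_transaction_features:transaction_count_24h",
    ],
    entity_rows=[{"user_id": "user-12345"}],
).to_dict()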

오픈소스 Feature Store 비교:

| 항목 | Feast | Tecton | Hopsworks |
| --- | --- | --- | --- |
| 라이선스 | Apache 2.0 | 상용 | AGPL/상용 |
| Online Store | Redis, DynamoDB, ScyllaDB 등 | 자체 구현 | RonDB |
| Offline Store | BigQuery, Redshift, Spark | Spark, Snowflake | Hive |
| 실시간 피처 | 제한적 | 강력 | 강력 |
| 관리형 서비스 | 없음 (셀프호스팅) | SaaS | SaaS/자체호스팅 |

실습 체크리스트:

  1. ScyllaDB Docker 클러스터 (3노드) 구축
  2. CQL로 Feature Store 테이블 설계
  3. Feast + ScyllaDB Online Store 연동
  4. Spark에서 피처 계산 → ScyllaDB Materialization
  5. 실시간 피처 조회 P99 레이턴시 측정

추천 자료:

  • ScyllaDB University (free.scylladb.com) - 무료 과정!
  • "Designing Data-Intensive Applications" (Martin Kleppmann) - DDIA
  • Feast 공식 문서: docs.feast.dev

3-8. LLM 플랫폼 구축 (최신 트렌드!)

2024년 이후 모든 핀테크 기업이 LLM을 도입하고 있으며, 토스뱅크도 예외가 아닙니다. 이 섹션은 면접에서 차별화할 수 있는 핵심 영역입니다.

LLM 서빙 아키텍처

vLLM:

  • PagedAttention: GPU 메모리를 가상 메모리처럼 관리
  • KV Cache를 페이지 단위로 관리하여 메모리 낭비 최소화
  • Continuous Batching: 요청이 완료되면 즉시 새 요청 삽입
  • Speculative Decoding: 작은 모델로 초안 생성 → 큰 모델로 검증

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # 2 GPU 병렬
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    enforce_eager=False,  # CUDA graph 활성화
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

outputs = llm.generate(prompts, sampling_params)

TensorRT-LLM:

  • NVIDIA의 LLM 최적화 라이브러리
  • FP8/INT4 양자화, In-flight Batching
  • KV Cache 재활용, Paged KV Cache

Triton + vLLM Backend:

# config.pbtxt for vLLM backend
name: "llama3-8b"
backend: "vllm"
max_batch_size: 0  # vLLM이 배칭 관리

model_transaction_policy {
  decoupled: True
}

parameters {
  key: "model"
  value: {
    string_value: "meta-llama/Llama-3-8B-Instruct"
  }
}
parameters {
  key: "tensor_parallel_size"
  value: {
    string_value: "2"
  }
}
parameters {
  key: "gpu_memory_utilization"
  value: {
    string_value: "0.9"
  }
}

SGLang:

  • RadixAttention: prefix 공유를 통한 KV Cache 재활용
  • 구조화된 출력 지원 (JSON schema 강제)
  • vLLM 대비 특정 워크로드에서 더 높은 처리량

LLM 서빙 성능 지표

| 지표 | 설명 | 목표 (프로덕션) |
| --- | --- | --- |
| TTFT (Time to First Token) | 첫 토큰까지 걸리는 시간 | 200ms 이하 |
| TPS (Tokens Per Second) | 초당 생성 토큰 수 | 30+ TPS/user |
| Throughput | 전체 시스템 처리량 | QPS x 평균 출력 길이 |
| P99 Latency | 99번째 백분위 지연 | TTFT의 2배 이하 |

LLM Gateway 아키텍처

클라이언트 요청
      │
      ▼
┌─────────────┐
│ LLM Gateway │ ← 인증, 레이트리밋, 라우팅
│ (Kong/Envoy)│ ← 비용 추적, 로깅
└──────┬──────┘
   ┌───┴───┐
   ▼       ▼
┌─────┐ ┌─────┐
│vLLM │ │vLLM │ ← 모델별 서버 풀
│Pool1│ │Pool2│
│(8B) │ │(70B)│
└─────┘ └─────┘

Gateway 핵심 기능:

  • 모델 라우팅: 요청 복잡도에 따라 작은/큰 모델로 분배
  • 레이트 리밋: 사용자/팀별 토큰 사용량 제한
  • 비용 관리: 토큰당 비용 추적, 예산 알림
  • 폴백: 프라이머리 모델 장애 시 대체 모델로 전환
  • 캐싱: 동일 프롬프트 결과 캐싱 (semantic cache 포함)
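
이 중 semantic cache의 핵심은 프롬프트 임베딩 간 유사도로 캐시 히트를 판단하는 것입니다. 아래는 개념 확인용 최소 스케치입니다 (embed 함수와 유사도 임계값은 가정이며, 실제로는 Vector DB를 캐시 저장소로 사용합니다).

import numpy as np

CACHE: list[tuple[np.ndarray, str]] = []  # (프롬프트 임베딩, 캐시된 응답)
SIM_THRESHOLD = 0.95  # 가정: 도메인별 튜닝 필요

def semantic_cache_lookup(prompt: str):
    query_emb = embed(prompt)  # 가정: 임베딩 모델 호출 함수
    for cached_emb, cached_response in CACHE:
        sim = float(np.dot(query_emb, cached_emb)
                    / (np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)))
        if sim >= SIM_THRESHOLD:
            return cached_response  # 캐시 히트: LLM 호출 생략
    return None  # 캐시 미스: LLM 호출 후 semantic_cache_store로 저장

def semantic_cache_store(prompt: str, response: str) -> None:
    CACHE.append((embed(prompt), response))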

RAG 파이프라인

사용자 쿼리
      │
      ▼
┌──────────────┐
│ Query        │ ← 쿼리 임베딩 생성
│ Embedding    │
└──────┬───────┘
       ▼
┌──────────────┐
│ Vector DB    │ ← 유사 문서 검색 (Top-K)
│ (Milvus)     │
└──────┬───────┘
       ▼
┌──────────────┐
│ Reranker     │ ← 검색 결과 재정렬
│ (Cross-Enc.) │
└──────┬───────┘
       ▼
┌──────────────┐
│ LLM          │ ← Context + Query → 답변 생성
│ (Llama 3)    │
└──────────────┘
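
위 흐름을 코드 구조로 옮기면 대략 다음과 같습니다. 각 단계의 래퍼 함수(embed, vector_search, rerank, llm_generate)는 설명을 위한 가정이며, 실제로는 Milvus SDK, Cross-encoder 모델, vLLM 엔드포인트 호출로 대체됩니다.

def answer_with_rag(query: str, top_k: int = 20, rerank_k: int = 5) -> str:
    # 1. 쿼리 임베딩 생성
    query_emb = embed(query)

    # 2. Vector DB에서 Top-K 후보 검색
    candidates = vector_search(query_emb, top_k=top_k)

    # 3. Cross-encoder로 재정렬 후 상위 문서만 사용
    context_docs = rerank(query, candidates)[:rerank_k]

    # 4. Context + Query 프롬프트로 LLM 호출
    context = "\n\n".join(doc.text for doc in context_docs)
    prompt = f"{context}\n\n질문: {query}"
    return llm_generate(prompt)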

Vector DB 비교:

| DB | 특징 | 장점 | 단점 |
| --- | --- | --- | --- |
| Milvus | 분산 아키텍처, GPU 가속 | 대규모, 높은 성능 | 복잡한 운영 |
| Qdrant | Rust 기반, 필터링 강력 | 빠른 시작, 좋은 API | 상대적으로 작은 커뮤니티 |
| pgvector | PostgreSQL 확장 | 기존 PG 인프라 활용 | 대규모에서 한계 |
| Weaviate | 모듈식, 멀티모달 | 쉬운 시작 | 메모리 사용량 |
| Pinecone | 완전 관리형 | 운영 부담 없음 | 벤더 종속, 비용 |

Fine-tuning 인프라

# K8s에서 LoRA Fine-tuning Job
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llama3-lora-finetune
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer
              image: tossbank/lora-trainer:latest
              args:
                - '--model_name=meta-llama/Llama-3-8B'
                - '--lora_r=16'
                - '--lora_alpha=32'
                - '--learning_rate=2e-4'
                - '--num_epochs=3'
                - '--batch_size=4'
                - '--gradient_accumulation_steps=8'
              resources:
                limits:
                  nvidia.com/gpu: '4'
          nodeSelector:
            accelerator: nvidia-a100-80g

LLM 모니터링

모니터링해야 할 핵심 메트릭:

  • 토큰 사용량: 입력/출력 토큰 수, 사용자별/팀별 집계
  • 비용 추적: 토큰당 비용 계산, 예산 대비 사용량
  • 품질 지표: 환각(hallucination) 탐지, 유해 콘텐츠 필터링
  • 성능 지표: TTFT, TPS, 큐 깊이, GPU 활용률
  • 보안: PII 노출 탐지, guardrails 위반 로그

# Prometheus 메트릭 예시
from prometheus_client import Counter, Histogram

llm_tokens_total = Counter(
    "llm_tokens_total",
    "Total tokens processed",
    ["model", "direction", "team"]
)

llm_latency = Histogram(
    "llm_request_duration_seconds",
    "LLM request latency",
    ["model"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

실습 체크리스트:

  1. vLLM으로 오픈소스 LLM 로컬 서빙
  2. Triton + vLLM Backend로 프로덕션급 서빙
  3. RAG 파이프라인: Milvus + LangChain + vLLM
  4. LoRA Fine-tuning 후 A/B 서빙
  5. LLM 모니터링 대시보드 (Grafana) 구축

추천 자료:

  • vLLM 공식 문서: docs.vllm.ai
  • NVIDIA NIM (NVIDIA Inference Microservices)
  • "Building LLM Powered Applications" (Chip Huyen)
  • LangChain / LlamaIndex 공식 문서

3-9. GPU 프레임워크와 성능 최적화

CUDA 기초 이해

MLOps 엔지니어가 직접 CUDA 코드를 작성할 일은 드물지만, GPU가 어떻게 동작하는지 이해해야 최적화 판단을 할 수 있습니다.

GPU 메모리 계층:

  • Global Memory: 가장 크지만 가장 느림 (HBM)
  • Shared Memory: 블록 내 스레드 공유, 매우 빠름
  • Register: 스레드별 전용, 가장 빠름
  • L1/L2 Cache: 자동 관리

실전에서 중요한 것:

  • GPU 메모리 크기 → 모델 크기 제한 결정
  • HBM 대역폭 → Throughput 상한 결정
  • A100 80GB: HBM 대역폭 2TB/s
  • H100 80GB: HBM 대역폭 3.35TB/s
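
이 대역폭 숫자가 왜 중요한지 간단히 계산해 보면, LLM decode 단계는 토큰 하나를 생성할 때마다 가중치 전체를 읽어야 하므로 HBM 대역폭이 단일 시퀀스 생성 속도의 상한을 정합니다. 아래는 메모리 바운드를 가정한 대략적인 추정입니다.

# 가정: decode는 토큰당 모델 가중치 전체를 한 번 읽는다 (배칭 없음)
model_params = 8e9          # 8B 파라미터 모델
bytes_per_param = 2         # FP16
hbm_bandwidth = 2e12        # A100 기준 약 2TB/s

weight_bytes = model_params * bytes_per_param       # 16GB
max_tokens_per_sec = hbm_bandwidth / weight_bytes   # 약 125 tokens/s

print(f"단일 시퀀스 decode 상한: 약 {max_tokens_per_sec:.0f} tokens/s")
# 배칭하면 한 번의 가중치 읽기로 여러 토큰을 생성하므로 전체 처리량은 이보다 높아짐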

Mixed Precision Training

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        output = model(batch["input"])
        loss = criterion(output, batch["target"])

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

정밀도 비교:

| 타입 | 비트 | 용도 | 성능 향상 |
| --- | --- | --- | --- |
| FP32 | 32 | 기본 학습 | 기준 |
| FP16 | 16 | Mixed Precision | 2x |
| BF16 | 16 | 큰 모델 학습 (A100+) | 2x (범위 넓음) |
| FP8 | 8 | 추론/학습 (H100+) | 4x |
| INT8 | 8 | 추론 최적화 | 4x |
| INT4 | 4 | LLM 양자화 (GPTQ, AWQ) | 8x |

Multi-GPU 전략

DataParallel (DP): 가장 단순, 단일 노드

  • 한 GPU에 모델 복사 → 데이터 분배 → 그래디언트 평균
  • GIL 이슈, GPU 0에 부하 집중

DistributedDataParallel (DDP): 프로덕션 표준

  • 프로세스당 하나의 GPU
  • All-Reduce로 그래디언트 동기화
  • NCCL 통신 백엔드
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
model = DDP(model.to(local_rank), device_ids=[local_rank])

FSDP (Fully Sharded Data Parallel): 대규모 모델용

  • 모델 파라미터를 GPU간 분산 저장
  • 필요할 때만 파라미터를 모아서 연산
  • ZeRO Stage 3와 유사
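
FSDP도 DDP처럼 모델 래핑이 중심입니다 (PyTorch 기본 API 기준의 간단한 예시, local_rank는 torchrun이 주입하는 값으로 가정).

import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")

# 파라미터/그래디언트/옵티마이저 상태를 GPU 간 샤딩하여 저장
model = FSDP(model.to(local_rank))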

DeepSpeed: Microsoft의 분산 학습 라이브러리

  • ZeRO Stage 1/2/3: 점점 더 많은 것을 샤딩
  • Offloading: GPU 메모리 부족 시 CPU/NVMe로 오프로드

GPU 모니터링

# nvidia-smi 기본 모니터링
nvidia-smi --query-gpu=gpu_name,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.used,power.draw --format=csv -l 1

# DCGM (Data Center GPU Manager)
# dcgmi dmon -e 155,156,203,204,1001,1002,1003,1004

Prometheus + DCGM Exporter:

# dcgm-exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.3.0-ubuntu22.04
          ports:
            - containerPort: 9400

주요 모니터링 메트릭:

  • DCGM_FI_DEV_GPU_UTIL: GPU 연산 활용률
  • DCGM_FI_DEV_MEM_COPY_UTIL: 메모리 대역폭 활용률
  • DCGM_FI_DEV_GPU_TEMP: GPU 온도
  • DCGM_FI_PROF_SM_ACTIVE: SM(Streaming Multiprocessor) 활성률
  • DCGM_FI_DEV_POWER_USAGE: 전력 소비량

3-10. 분산 데이터베이스 기초 (CS 지식)

ScyllaDB/Cassandra를 운영하려면 분산 시스템 기초가 필수입니다.

CAP 이론

  • Consistency: 모든 노드가 같은 데이터 반환
  • Availability: 모든 요청에 응답
  • Partition Tolerance: 네트워크 분할에도 동작

ScyllaDB/Cassandra는 AP 시스템 (기본 설정), 하지만 Consistency Level 조절로 CP처럼 동작 가능:

  • ONE: 빠른 읽기, 약한 일관성
  • QUORUM: 과반수 노드 합의, 강한 일관성
  • ALL: 모든 노드 합의, 가장 강하지만 가장 느림
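
Consistency Level은 쿼리 단위로 지정할 수 있습니다. 예를 들어 RF=3에서 읽기/쓰기 모두 QUORUM(2)을 쓰면 R + W > N을 만족하여 항상 최신 복제본을 읽게 됩니다 (Python driver 기준 예시, session은 앞의 Feature Store 예시에서 만든 연결을 재사용한다고 가정).

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# 읽기 QUORUM(2) + 쓰기 QUORUM(2) > RF(3) → 강한 일관성에 가까운 동작
stmt = SimpleStatement(
    "SELECT feature_value FROM user_features WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(stmt, ["user-12345"])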

Consistent Hashing

ScyllaDB가 데이터를 분산하는 핵심 메커니즘입니다.

Token Ring (0 ~ 2^63)
┌───────────┐
│  Node A   │ (token: 0 ~ 33%)
├───────────┤
│  Node B   │ (token: 33% ~ 66%)
├───────────┤
│  Node C   │ (token: 66% ~ 100%)
└───────────┘

Partition Key → Murmur3 Hash → Token → 담당 노드

면접 포인트:

  • 노드 추가 시: 기존 노드의 토큰 범위를 분할 (데이터 이동 최소화)
  • Virtual Node (vnode): 각 물리 노드가 여러 토큰 범위 담당 → 더 균일한 분배
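
개념을 확인하기 위한 최소 구현입니다 (vnode 없이 노드당 토큰 하나, Murmur3 대신 MD5를 쓴 설명용 스케치).

import bisect
import hashlib

def token(key: str) -> int:
    # 설명용: 실제 ScyllaDB는 Murmur3 해시 사용
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes: list[str]):
        # 각 노드를 링 위의 토큰 위치에 배치
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def owner(self, partition_key: str) -> str:
        # 키의 토큰 이상인 첫 노드가 담당, 링 끝이면 처음으로 wrap
        idx = bisect.bisect(self.tokens, token(partition_key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("user-12345"))  # 노드 추가/제거 시 인접 구간의 키만 재배치됨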

Gossip Protocol

노드 간 클러스터 상태 정보를 교환하는 프로토콜:

  • 매 초마다 임의의 노드와 상태 교환
  • 생존/장애 감지
  • 스키마 정보 전파
  • 장점: 중앙 서버 불필요, 확장성 우수

Merkle Tree (Anti-Entropy Repair)

  • 데이터 일관성 검증을 위한 해시 트리
  • 노드 간 데이터 비교를 O(log N)으로 수행
  • repair 명령으로 불일치 데이터 복구

LSM-Tree vs B-Tree

| 특성 | LSM-Tree (ScyllaDB) | B-Tree (PostgreSQL) |
| --- | --- | --- |
| 쓰기 | 매우 빠름 (순차 쓰기) | 느림 (랜덤 쓰기) |
| 읽기 | 상대적으로 느림 | 빠름 |
| 공간 효율 | Compaction 필요 | 즉시 정리 |
| 적합 워크로드 | 쓰기 위주, 시계열 | 읽기 위주, OLTP |

추천 자료:

  • "Designing Data-Intensive Applications" (Martin Kleppmann) - DDIA
  • ScyllaDB 아키텍처 백서
  • MIT 6.824: Distributed Systems (무료 강의)

4. MLOps 성숙도 모델

Google에서 정의한 MLOps 성숙도 모델을 토스뱅크에 적용해봅니다.

Level 0: 수동 프로세스

데이터 사이언티스트가 Jupyter Notebook에서 모델 학습
     ↓ (수동)
모델 파일을 엔지니어에게 전달
     ↓ (수동)
엔지니어가 모델을 서버에 배포
     ↓ (수동)
모니터링? 뭐 그런 거...

문제점: 느림, 재현 불가, 감사 추적 불가 (금융 규제 위반)

Level 1: ML Pipeline 자동화

Airflow/Kubeflow로 파이프라인 자동화
데이터 수집 → 피처 엔지니어링 → 학습 → 평가 → 배포
자동 스케줄링 (매일/매주 재학습)

개선점: 재현성, 자동 재학습, 기본적 추적

Level 2: CI/CD for ML

코드 변경 → CI (단위 테스트 + 데이터 검증)
자동 학습 → 모델 품질 게이트 (AUC > threshold)
카나리 배포 → A/B 테스트
자동 모니터링 → 드리프트 감지 → 자동 재학습 트리거

개선점: 완전 자동화, 품질 보증, 지속적 개선

토스뱅크의 예상 성숙도

토스뱅크는 Level 1 ~ Level 2 사이로 추정됩니다. ML Platform Team은 Level 2를 완성하고 더 나아가는 역할입니다. 지원자가 기여할 수 있는 영역:

  • Level 2 완성: CI/CD 파이프라인 고도화, 자동 모니터링
  • Feature Store 성숙: ScyllaDB 기반 고성능 Online Store
  • LLM 플랫폼 신규 구축: vLLM + Triton으로 LLM 서빙 인프라
  • GPU 클러스터 최적화: MIG, time-slicing으로 비용 효율화

5. 면접 예상 질문 30선

K8s 및 인프라 (10문제)

Q1. K8s에서 GPU 노드를 관리하는 방법을 설명해주세요.

  • NVIDIA Device Plugin, Taint/Toleration, Node Selector
  • MIG vs Time-Slicing 트레이드오프
  • GPU Operator로 자동화

Q2. Pod의 QoS 클래스 3가지와 ML 워크로드에서의 선택 기준은?

  • Guaranteed: GPU 학습/서빙 (자원 보장 필수)
  • Burstable: 배치 전처리 (탄력적 자원 활용)
  • BestEffort: 개발 환경 (비용 절감)

Q3. K8s Deployment와 StatefulSet의 차이, 그리고 ML 인프라에서 각각 언제 사용하나요?

  • Deployment: Triton Server, 모델 서빙 (stateless)
  • StatefulSet: ScyllaDB, Kafka (유니크 네트워크 ID, 안정적 스토리지)

Q4. HPA와 VPA의 차이, 모델 서빙에 어떤 것이 적합한가요?

  • HPA: Pod 수 조절 (추론 서버 스케일링에 적합)
  • VPA: Pod 리소스 조절 (학습 잡 리소스 최적화에 적합)
  • 커스텀 메트릭 기반 HPA (GPU 활용률, 큐 깊이)

Q5. K8s 네트워크 정책으로 ML 플랫폼의 보안을 어떻게 구현하나요?

  • Namespace 단위 격리
  • Jupyter 노트북 간 통신 차단
  • 모델 서빙 → Feature Store 간만 허용

Q6. etcd의 역할과 장애 시 클러스터에 미치는 영향은?

  • 모든 클러스터 상태 저장소
  • Raft 합의 알고리즘, 과반수 노드 필요
  • etcd 장애 = 새로운 리소스 생성/수정 불가

Q7. K8s에서 볼륨 관리: PV/PVC/StorageClass 각각의 역할은?

  • PV: 물리적 스토리지 (EBS, NFS)
  • PVC: Pod가 요청하는 스토리지 추상화
  • StorageClass: 동적 프로비저닝 정책

Q8. Helm과 Kustomize의 차이, 언제 무엇을 쓰나요?

  • Helm: 패키지 관리자, 복잡한 애플리케이션 (Triton, Airflow)
  • Kustomize: 오버레이 기반 커스터마이징, 환경별 설정

Q9. K8s에서 CronJob과 Airflow 스케줄링의 차이점은?

  • CronJob: 단순 반복 작업, 의존성 관리 불가
  • Airflow: 복잡한 DAG, 재시도, XCom, 모니터링

Q10. K8s 클러스터 업그레이드 전략을 설명해주세요.

  • Control Plane 먼저, 그 다음 Worker Node
  • 롤링 업그레이드: 노드별 drain → upgrade → uncordon
  • Blue-Green: 새 노드 그룹 생성 후 워크로드 이동

MLOps 플랫폼 (10문제)

Q11. MLFlow Tracking Server의 프로덕션 아키텍처를 설계해주세요.

  • PostgreSQL (메타데이터) + S3 (아티팩트) + Nginx (프록시)
  • 고가용성: 다중 Tracking Server + LB
  • 보안: OAuth2 인증, HTTPS

Q12. MLFlow Model Registry의 Stage 관리 전략은?

  • None → Staging (자동 테스트 통과) → Production (인간 승인) → Archived
  • 금융에서는 Production 전환 시 반드시 승인 프로세스 필요

Q13. Airflow에서 KubernetesExecutor를 쓸 때의 장단점은?

  • 장점: 태스크별 격리, GPU 리소스 동적 할당, 비용 효율
  • 단점: Pod 생성 오버헤드 (수십 초), cold start

Q14. Airflow DAG에서 Data Leakage를 방지하는 방법은?

  • 실행 날짜 기반 데이터 파티셔닝
  • Jinja 템플릿으로 날짜 주입 (ds, execution_date)
  • Point-in-time Feature Store 활용

Q15. Kubeflow Pipelines vs Airflow, 언제 무엇을 쓰나요?

  • Airflow: 범용 워크플로우, 외부 시스템 연동, 데이터 파이프라인
  • KFP: ML 특화, 컴포넌트 캐싱, K8s 네이티브, 파이프라인 버전 관리

Q16. ML 파이프라인의 데이터 검증(Data Validation)을 어떻게 구현하나요?

  • Great Expectations / TensorFlow Data Validation
  • 스키마 검증, 통계 검증 (분포 변화 감지)
  • Airflow에서 검증 실패 시 파이프라인 중단 및 알림

Q17. Model Drift를 감지하고 대응하는 전략은?

  • Data Drift: 입력 데이터 분포 변화 (KS test, PSI)
  • Concept Drift: 입출력 관계 변화 (모델 정확도 하락)
  • 대응: 자동 재학습 트리거, 그림자 배포(Shadow Deploy)
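
Q17에서 언급한 PSI(Population Stability Index)는 다음처럼 계산할 수 있습니다. 관례적으로 PSI < 0.1은 안정, 0.1~0.25는 주의, 0.25 초과는 드리프트로 봅니다 (분위수 경계가 중복되는 극단적 분포는 별도 처리가 필요한 스케치입니다).

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """학습 시점(expected) 대비 서빙 시점(actual) 분포 변화량."""
    # 학습 데이터 기준 분위수로 버킷 경계 생성
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # 서빙 데이터가 학습 범위를 벗어나도 양 끝 버킷에 포함되도록 클리핑
    actual = np.clip(actual, edges[0], edges[-1])

    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # 빈 버킷으로 인한 0 나눗셈 방지
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))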

Q18. Feature Store의 Online/Offline 일관성을 어떻게 보장하나요?

  • Offline 피처 계산 → Online Store Materialization
  • 최종 일관성(Eventual Consistency) 수용
  • 피처 메타데이터로 버전 관리

Q19. ML 모델의 A/B 테스트를 인프라 수준에서 어떻게 구현하나요?

  • Istio/Envoy로 트래픽 분할 (weight-based routing)
  • KServe의 Canary 배포 기능 활용
  • 실시간 메트릭 비교 → 자동 승격/롤백

Q20. MLOps에서 Reproducibility(재현성)를 보장하는 방법은?

  • 코드 버전 (Git), 데이터 버전 (DVC), 모델 버전 (MLFlow)
  • 실행 환경 (Docker 이미지 태그)
  • 하이퍼파라미터, 랜덤 시드 기록

모델 서빙 및 LLM (10문제)

Q21. Triton의 Dynamic Batching이 왜 필요하고, 어떻게 최적화하나요?

  • GPU는 배치 처리에 최적화 → 개별 요청은 GPU 활용도 낮음
  • preferred_batch_size: GPU 메모리와 모델 크기 기반 설정
  • max_queue_delay: 레이턴시 SLO에 맞춰 조절

Q22. Triton Model Ensemble과 Python Backend의 차이점은?

  • Ensemble: 선언적 파이프라인, Triton 내부 최적화, 오버헤드 최소
  • Python Backend: 유연한 커스텀 로직, 외부 라이브러리 사용 가능
  • 성능이 중요하면 Ensemble, 복잡한 로직이면 Python Backend

Q23. 모델 양자화(Quantization) 방법들과 각각의 적합한 상황은?

  • Post-training Quantization: 학습 없이 변환 (FP16, INT8)
  • Quantization-Aware Training: 학습 중 양자화 시뮬레이션
  • LLM 특화: GPTQ (3/4bit), AWQ (4bit), SmoothQuant

Q24. vLLM의 PagedAttention이 해결하는 문제는?

  • 기존: KV Cache를 연속 메모리에 할당 → 내부 단편화, 외부 단편화
  • PagedAttention: 가상 메모리처럼 페이지 단위 관리 → 메모리 낭비 96% 감소
  • 동적 KV Cache 할당으로 더 많은 동시 요청 처리 가능

Q25. LLM 서빙에서 TTFT와 TPS의 트레이드오프를 설명해주세요.

  • TTFT: prefill 단계 (입력 토큰 처리) → GPU 연산 바운드
  • TPS: decode 단계 (토큰 하나씩 생성) → 메모리 대역폭 바운드
  • 배치 크기 증가: TPS 향상, TTFT 악화
  • Speculative decoding: 두 지표 모두 개선 가능

Q26. ScyllaDB를 Feature Store Online Store로 선택한 이유는?

  • P99 레이턴시 1-2ms (Cassandra의 10분의 1)
  • JVM GC 없음 → 안정적인 tail latency
  • CQL 호환 → Cassandra 경험 활용 가능
  • Shard-per-core: 코어 수에 비례하는 선형 성능

Q27. RAG 파이프라인의 성능을 최적화하는 방법은?

  • Chunking 전략: 의미 단위 분할, 오버랩
  • Embedding 모델: 도메인 특화 fine-tuning
  • Reranking: Cross-encoder로 검색 정확도 향상
  • 캐싱: 빈번한 쿼리 결과 캐싱

Q28. LLM Gateway에서 비용 관리를 어떻게 하나요?

  • 토큰 사용량 추적 (입력/출력 분리)
  • 팀/프로젝트별 예산 할당, 알림
  • 모델 라우팅: 간단한 쿼리는 작은 모델로
  • Semantic caching: 유사 질문 캐시 활용

Q29. 모델 서빙에서 Canary 배포와 Shadow 배포의 차이는?

  • Canary: 실제 트래픽의 일부를 새 모델로 라우팅 (실제 사용자 영향)
  • Shadow: 모든 트래픽을 새 모델에도 복제하되 응답은 기존 모델 사용
  • 금융에서는 Shadow 배포로 충분히 검증 후 Canary → Full 배포

Q30. GPU 클러스터의 비용 최적화 전략을 설명해주세요.

  • Spot/Preemptible 인스턴스 (학습용)
  • MIG로 A100을 분할하여 소규모 추론 서빙
  • Cluster Autoscaler + 오프피크 스케일다운
  • 모델 양자화로 GPU 수 절감

6. 8개월 학습 로드맵

1개월차: 기초 다지기

| 주차 | 주제 | 목표 | 핵심 활동 |
| --- | --- | --- | --- |
| 1주 | Linux/Docker 기초 | 컨테이너 완전 이해 | Dockerfile 작성, 멀티스테이지 빌드 |
| 2주 | Python 고급 | 비동기, 데코레이터, 타입힌트 | FastAPI 프로젝트 |
| 3주 | K8s 입문 | Pod, Deployment, Service | Minikube 실습 |
| 4주 | K8s 심화 | ConfigMap, Secret, RBAC | kind 클러스터 구축 |

핵심 프로젝트: Docker로 ML 모델 서빙 API 구축 (FastAPI + PyTorch)

2개월차: Kubernetes 마스터

| 주차 | 주제 | 목표 | 핵심 활동 |
| --- | --- | --- | --- |
| 1주 | K8s 네트워킹 | Service, Ingress, DNS | Ingress Controller 설정 |
| 2주 | K8s 스토리지 | PV/PVC, StorageClass | StatefulSet 배포 |
| 3주 | Helm/Kustomize | 패키지 관리 | 커스텀 Helm Chart 작성 |
| 4주 | GPU on K8s | Device Plugin, MIG | GPU Pod 실행, 모니터링 |

핵심 프로젝트: K8s 클러스터에 GPU 기반 추론 서버 배포
자격증 목표: CKA 준비 시작

3개월차: MLFlow + Airflow

| 주차 | 주제 | 목표 | 핵심 활동 |
| --- | --- | --- | --- |
| 1주 | MLFlow Tracking | 실험 추적 완전 이해 | MLFlow 서버 구축 |
| 2주 | MLFlow Registry | 모델 버전 관리 | Stage 전환 파이프라인 |
| 3주 | Airflow 입문 | DAG 작성, Executor | 로컬 Airflow 환경 구축 |
| 4주 | Airflow + MLFlow | ML 파이프라인 자동화 | 학습-평가-등록 DAG |

핵심 프로젝트: MLFlow + Airflow로 자동 재학습 파이프라인 구축

4개월차: Triton + 모델 서빙

| 주차 | 주제 | 목표 | 핵심 활동 |
| --- | --- | --- | --- |
| 1주 | Triton 기초 | 모델 배포, Dynamic Batching | ResNet 모델 Triton 배포 |
| 2주 | Triton 심화 | Ensemble, Python Backend | 전처리-추론-후처리 파이프라인 |
| 3주 | 모델 최적화 | ONNX, TensorRT, 양자화 | FP16/INT8 변환 및 벤치마크 |
| 4주 | KServe + Triton | InferenceService | K8s에서 프로덕션급 서빙 |

핵심 프로젝트: BERT 모델 → ONNX → TensorRT → Triton 서빙 + 성능 벤치마크

5개월차: Feature Store + ScyllaDB

| 주차 | 주제 | 목표 | 핵심 활동 |
| --- | --- | --- | --- |
| 1주 | ScyllaDB 기초 | 아키텍처, CQL, 데이터 모델링 | ScyllaDB University 수강 |
| 2주 | ScyllaDB 운영 | Compaction, Repair, 모니터링 | 3노드 클러스터 운영 |
| 3주 | Feature Store 개념 | Feast, Offline/Online 구분 | Feast 설치 및 실습 |
| 4주 | Feast + ScyllaDB | Online Store 연동 | 실시간 피처 서빙 구현 |

핵심 프로젝트: Feast + ScyllaDB Online Store + Spark Offline Store 풀 구축

6개월차: LLM 플랫폼

| 주차 | 주제 | 목표 | 핵심 활동 |
| --- | --- | --- | --- |
| 1주 | vLLM 기초 | LLM 서빙, PagedAttention | 오픈소스 LLM 로컬 서빙 |
| 2주 | Triton + vLLM | 프로덕션급 LLM 서빙 | Triton vLLM Backend 배포 |
| 3주 | RAG 파이프라인 | Vector DB, Embedding, 생성 | Milvus + LangChain 구축 |
| 4주 | LLM 모니터링 | 토큰, 비용, 품질 추적 | Grafana 대시보드 구축 |

핵심 프로젝트: vLLM + Triton + RAG + 모니터링 풀스택 LLM 플랫폼

7개월차: 통합 및 프로젝트

| 주차 | 주제 | 목표 | 핵심 활동 |
| --- | --- | --- | --- |
| 1주 | GitOps | ArgoCD, CI/CD for ML | ArgoCD 파이프라인 구축 |
| 2주 | 모니터링 | Prometheus, Grafana | 통합 모니터링 대시보드 |
| 3주 | 포트폴리오 정리 | GitHub, 블로그 | 프로젝트 README 작성 |
| 4주 | 시스템 디자인 연습 | ML 시스템 설계 | 면접 예상 질문 풀이 |

핵심 프로젝트: End-to-End MLOps 플랫폼 (전체 기술스택 통합)

8개월차: 면접 준비

| 주차 | 주제 | 목표 | 핵심 활동 |
| --- | --- | --- | --- |
| 1주 | 코딩 테스트 | Python, 알고리즘 | LeetCode Medium 풀이 |
| 2주 | 시스템 디자인 | ML 시스템 설계 | 모의 면접 |
| 3주 | 기술 면접 | 딥다이브 질문 대비 | 본 글의 30선 복습 |
| 4주 | 행동 면접 | STAR 기법 | 프로젝트 경험 정리 |

7. 이력서 작성 전략

JD 기반 키워드 매핑

이력서에 반드시 포함해야 할 키워드:

  • 인프라: Kubernetes, Docker, Helm, ArgoCD, Terraform
  • ML 플랫폼: MLFlow, Airflow, Kubeflow, JupyterHub
  • 모델 서빙: Triton, KServe, ONNX, TensorRT
  • 데이터: ScyllaDB, Feature Store, Feast, Spark
  • LLM: vLLM, RAG, Vector DB, LoRA
  • GPU: CUDA, MIG, DCGM, Multi-GPU Training

STAR 기법 활용

각 프로젝트 경험을 다음 구조로 정리:

  • Situation: 어떤 문제 상황이었는가
  • Task: 무엇을 해결해야 했는가
  • Action: 어떤 기술적 접근을 취했는가
  • Result: 정량적 결과 (레이턴시 50% 감소, 처리량 3배 증가 등)

금융 도메인 강조

  • 규제 준수 경험 (데이터 보호, 모델 감사)
  • 설명가능성 구현 경험
  • 고가용성 시스템 운영 경험
  • 장애 대응 경험 (RCA, 포스트모템)

8. 포트폴리오 프로젝트 아이디어

프로젝트 1: MLOps 풀 파이프라인

목표: 데이터 수집부터 모델 서빙까지 완전 자동화된 ML 파이프라인

기술스택:

  • K8s (kind/EKS), MLFlow, Airflow, Triton, ArgoCD
  • PostgreSQL, MinIO (S3 대체), Redis

구성:

  1. Airflow DAG: 데이터 수집 → 전처리 → 학습 → 평가
  2. MLFlow: 실험 추적, 모델 레지스트리
  3. Triton: Dynamic Batching 모델 서빙
  4. ArgoCD: GitOps 기반 자동 배포
  5. Prometheus + Grafana: 모니터링 대시보드

차별화 포인트:

  • Canary 배포 + 자동 롤백 (정확도 기반)
  • Data Validation 단계 포함
  • Model Drift 감지 → 자동 재학습 트리거

프로젝트 2: Feature Store (Feast + ScyllaDB)

목표: 실시간 피처 서빙이 가능한 Feature Store 구축

기술스택:

  • Feast, ScyllaDB (Online), MinIO/Parquet (Offline)
  • Spark (피처 계산), FastAPI (피처 서빙 API)

구성:

  1. Offline Store: Parquet 파일, Spark로 배치 피처 계산
  2. Online Store: ScyllaDB, P99 5ms 이하 서빙
  3. Feature Registry: Feast에서 피처 정의/등록/검색
  4. Materialization: Offline → Online 동기화 파이프라인
  5. API 서버: 피처 벡터 실시간 조회

차별화 포인트:

  • Point-in-time correctness 구현
  • ScyllaDB 성능 벤치마크 (vs Redis, vs Cassandra)
  • 피처 모니터링 (분포 변화 감지)

프로젝트 3: LLM 서빙 플랫폼

목표: 프로덕션급 LLM 서빙 + RAG + 모니터링

기술스택:

  • vLLM, Triton (vLLM Backend), Milvus, LangChain
  • K8s, Prometheus, Grafana

구성:

  1. LLM 서빙: Triton + vLLM Backend (Llama 3 8B)
  2. RAG: 문서 임베딩 → Milvus → Retrieval → Generation
  3. LLM Gateway: 레이트 리밋, 라우팅, 비용 추적
  4. 모니터링: TTFT, TPS, 토큰 사용량, GPU 활용률
  5. 보안: PII 마스킹, Guardrails

차별화 포인트:

  • vLLM vs TensorRT-LLM 성능 비교 벤치마크
  • Semantic caching 구현
  • LoRA 어댑터 핫스왑 기능
  • A/B 서빙 (기본 모델 vs Fine-tuned 모델)

실전 퀴즈

Q1. Triton Inference Server에서 Dynamic Batching의 preferred_batch_size를 32로 설정하고 max_queue_delay를 100 마이크로초로 설정했습니다. 요청이 초당 10개만 들어오는 상황에서 예상되는 동작은?

A: 초당 10개 요청은 preferred_batch_size 32를 채우기에 턱없이 부족합니다. max_queue_delay가 100 마이크로초이므로, 대부분의 요청은 100 마이크로초 대기 후 1~2개씩 소규모 배치로 처리됩니다.

이 상황에서의 최적화: preferred_batch_size를 작게 조정하거나 (예: 1, 4, 8), 요청 패턴에 맞게 max_queue_delay를 늘려서 더 많은 요청을 모을 수 있습니다. 단, 레이턴시 SLO와의 트레이드오프를 고려해야 합니다. Model Analyzer를 사용하면 최적의 설정을 자동으로 탐색할 수 있습니다.

Q2. ScyllaDB에서 Partition Key를 날짜 (예: 2024-03-15)로 설정한 Feature Store 테이블이 있습니다. 어떤 문제가 발생할 수 있나요?

A: Hot Partition 문제가 발생합니다. 모든 당일 데이터가 하나의 파티션에 집중되므로, 해당 파티션을 담당하는 노드에 부하가 몰립니다.

해결 방법:

  1. Composite Partition Key: 날짜 + user_id 해시의 버킷 (예: date, bucket)
  2. Time-bucketing: 시간 단위로 분할 (예: 2024-03-15T14)
  3. Shard Key 추가: 임의의 shard 번호 (0-15)를 Partition Key에 추가

Feature Store에서는 user_id를 Partition Key로 사용하는 것이 일반적입니다. 사용자별 조회가 주 패턴이기 때문입니다.

Q3. vLLM의 PagedAttention과 전통적인 KV Cache 관리의 메모리 효율 차이를 설명하세요.

A: 전통적인 방식에서는 각 요청의 KV Cache를 연속된 메모리 블록으로 할당합니다. 이로 인해:

  • 내부 단편화: 최대 시퀀스 길이만큼 미리 할당 (실제 사용량보다 과다)
  • 외부 단편화: 요청 완료 후 빈 공간이 파편화
  • 결과적으로 GPU 메모리의 60-80%만 활용

PagedAttention은 가상 메모리 개념을 적용합니다:

  • KV Cache를 고정 크기 페이지 (Block)로 분할
  • 논리적 KV Cache → 물리적 페이지 매핑 (Page Table)
  • 필요한 만큼만 페이지 할당 (내부 단편화 제거)
  • 비연속 메모리 사용 가능 (외부 단편화 제거)
  • 결과: 메모리 낭비를 최대 96% 줄이고, 동시 처리 가능한 요청 수 2-4배 증가

Q4. Airflow의 KubernetesExecutor에서 GPU 학습 태스크의 Cold Start 문제를 어떻게 완화하나요?

A: KubernetesExecutor는 태스크마다 새 Pod를 생성하므로, GPU 워크로드의 경우 다음과 같은 Cold Start 오버헤드가 있습니다:

  • Pod 스케줄링: 5-10초
  • 컨테이너 이미지 Pull: 30초-수분 (GPU 이미지는 수 GB)
  • GPU 드라이버 초기화: 수 초
  • 모델 로딩: 수 초-수분

완화 전략:

  1. 이미지 Pre-pulling: DaemonSet으로 GPU 노드에 이미지 미리 캐시
  2. PVC 기반 모델 캐시: 모델 파일을 PVC에 저장하여 반복 다운로드 방지
  3. Warm Pod Pool: Airflow 대신 Kubeflow의 Training Operator 사용 (Pod 재활용)
  4. CeleryKubernetesExecutor: 경량 태스크는 Celery, GPU 태스크만 K8s Pod
  5. Resource Quota: GPU 노드를 ML 전용으로 확보 (스케줄링 대기 감소)

Q5. ML 모델의 Canary 배포에서 "성공"을 판단하는 기준을 어떻게 설계하나요? 금융 도메인 특화 관점에서 답해주세요.

A: 금융 ML 모델의 Canary 배포 성공 기준은 일반 서비스보다 훨씬 엄격합니다.

기술적 메트릭 (자동 판단):

  • P99 레이턴시: 기존 모델 대비 120% 이하
  • Error Rate: 기존 모델 대비 동일 이하
  • GPU 활용률: 예상 범위 이내

비즈니스 메트릭 (자동 + 수동 판단):

  • FDS 모델: False Positive Rate 변화율 5% 이내, False Negative Rate 감소 확인
  • 대출 심사: 승인율 변동 3% 이내, 예상 부실률 모니터링
  • 추천 모델: CTR 유지 또는 개선

규제 메트릭 (수동 판단):

  • 설명가능성 검증: SHAP 값 분포가 합리적인지
  • 공정성 검증: 특정 그룹에 대한 편향 없는지
  • 감사 추적: 모든 판단 근거가 기록되는지

배포 전략:

  1. Shadow 배포 (1주): 실 트래픽 복제, 결과 비교만
  2. Canary 5% (3일): 소수 사용자에 적용, 모든 메트릭 모니터링
  3. Canary 30% (3일): 확대 적용
  4. Full 배포: 모든 기준 통과 시

Argo Rollouts의 AnalysisRun으로 자동 판단을 구현하고, 금융 규제 관련 항목은 인간 승인 게이트를 추가합니다.


참고 자료

도서

  • "Designing Data-Intensive Applications" - Martin Kleppmann (분산 시스템 바이블)
  • "Kubernetes in Action, 2nd Edition" - Marko Luksa (K8s 실전)
  • "Data Pipelines with Apache Airflow" - Bas Harenslak (Airflow 깊이 학습)
  • "Practical MLOps" - Noah Gift, Alfredo Deza (MLOps 전반)
  • "Designing Machine Learning Systems" - Chip Huyen (ML 시스템 설계)
  • "Building LLM Powered Applications" - Valentina Alto (LLM 애플리케이션)
  • "Database Internals" - Alex Petrov (DB 내부 구조)


마무리: 당신만의 MLOps 여정

토스뱅크 ML Platform Team은 금융 x ML x 인프라라는 세 가지 축이 만나는 희소한 교차점에 있습니다. 이 글에서 다룬 기술스택은 방대하지만, 모든 것을 처음부터 완벽하게 알 필요는 없습니다.

핵심은 기초를 단단히 하고, 하나의 기술을 깊이 파고들 수 있는 능력을 보여주는 것입니다. K8s를 모르면서 Triton을 논할 수 없고, 분산 시스템을 이해하지 못하면서 ScyllaDB를 운영할 수 없습니다.

8개월 로드맵을 따라가되, 자신만의 속도를 찾으세요. 가장 중요한 것은 직접 만들어보는 것입니다. 이론 100시간보다 실습 10시간이 면접에서 더 빛납니다.

토스뱅크 ML Platform의 일원이 되어 대한민국 금융 AI의 미래를 함께 만들어가시기를 응원합니다.

Toss Bank ML Engineer (MLOps) Complete Guide: From MLFlow to LLM Platform — Tech Stack Deep Dive

Introduction: Why Toss Bank ML Platform Is Different

Toss Bank is not just another fintech with a machine learning team. As part of the Viva Republica ecosystem (the parent company behind the Toss super-app), Toss Bank operates one of the most aggressive ML-driven financial platforms in Asia. The ML Platform team is the infrastructure backbone that makes this possible — building and maintaining the systems that allow data scientists to go from Jupyter notebook to production model in hours rather than weeks.

The MLOps Engineer position on this team is not a typical "deploy a model and forget it" role. The JD reveals a team operating at MLOps Maturity Level 3-4 (more on this below), with ambitions to reach Level 5 — full autonomous ML operations. This means you are expected to understand not just individual tools, but the entire lifecycle from experimentation to serving, monitoring, retraining, and governance.

This guide dissects every line of the job description, maps each requirement to specific technologies and study resources, and gives you a concrete 8-month plan to become a competitive candidate. Whether you are a backend engineer pivoting into MLOps, a data scientist who wants to understand infrastructure, or an experienced MLOps practitioner evaluating this role — this document is your comprehensive preparation resource.


1. Team Analysis: ML Platform Team at Toss Bank

What the Team Actually Does

The ML Platform Team sits at the intersection of data engineering, ML engineering, and platform engineering. Their mandate is threefold:

  1. Build the ML infrastructure layer — Training pipelines, experiment tracking, model registry, feature stores, and serving infrastructure
  2. Enable self-service ML for data scientists — JupyterHub environments, automated pipeline creation, one-click model deployment
  3. Operate the LLM platform — Inference optimization, GPU cluster management, RAG pipelines for banking-specific use cases

Team Positioning Within Toss Bank

Understanding where the team sits in the organizational hierarchy matters for interview preparation.

| Layer | Function | Example |
| --- | --- | --- |
| Business Teams | Define ML use cases | Credit scoring, fraud detection, personalized recommendations |
| Data Science Team | Build and validate models | Feature engineering, model training, evaluation |
| ML Platform Team (this role) | Build and operate the platform | MLFlow, Kubeflow, Triton, Feature Store |
| Infrastructure Team | Provide compute and network | Kubernetes clusters, GPU nodes, networking |
| Security/Compliance | Ensure regulatory adherence | Model audit trails, data governance |

The ML Platform Team is the critical middle layer. They do not build business models themselves, but they make it possible for dozens of data scientists to work efficiently and deploy models safely into production.

Why This Role Matters in 2025-2026

Three trends make this position especially significant:

  1. LLM integration in banking — Every major Korean financial institution is racing to deploy LLMs for customer service, document processing, and internal tooling. Toss Bank needs infrastructure that can handle both traditional ML (XGBoost, LightGBM for credit scoring) and generative AI workloads simultaneously.

  2. Regulatory pressure — Korean financial regulators (FSC/FSS) now require model explainability and audit trails for any ML system that affects credit decisions. The ML platform must provide governance capabilities out of the box.

  3. Scale challenges — Toss Bank serves millions of active users. The ML platform must handle thousands of feature computations per second, serve models at sub-10ms latency, and manage dozens of concurrent experiments — all while maintaining five-nines reliability for financial transactions.


2. JD Line-by-Line Analysis

Let us break down each requirement from the job description and understand what the hiring team is really asking for.

Core Responsibilities

"Design and develop ML platform services (MLFlow, Airflow, JupyterHub, Kubeflow)"

This is the heart of the role. You are not just using these tools — you are building and customizing them. The four tools mentioned form the complete ML lifecycle:

  • MLFlow — Experiment tracking, model registry, model versioning
  • Airflow — Workflow orchestration for data pipelines and training jobs
  • JupyterHub — Multi-user notebook environment for data scientists
  • Kubeflow — Kubernetes-native ML pipeline orchestration

The word "design" is critical. It signals they want someone who can architect solutions, not just follow tutorials.

"Build and operate inference serving infrastructure (Triton Inference Server)"

Model serving is often the hardest part of MLOps. Triton Inference Server (by NVIDIA) is an enterprise-grade serving solution that supports multiple model frameworks (TensorFlow, PyTorch, ONNX, TensorRT) simultaneously. Operating Triton at scale means understanding:

  • Model ensemble patterns
  • Dynamic batching configuration
  • GPU memory management
  • A/B testing and canary deployment for models

"Design and develop feature store based on distributed database (ScyllaDB)"

This is a strong architectural signal. Most companies use off-the-shelf feature stores (Feast, Tecton, Hopsworks). Toss Bank has built a custom feature store on ScyllaDB — a high-performance Cassandra-compatible database written in C++. This means:

  • They need sub-millisecond feature lookups at scale
  • They prioritize consistency and low tail latency (critical for financial services)
  • You need to understand distributed database internals, not just API usage

"Build and operate LLM platform (GPU infrastructure, inference optimization)"

This is the forward-looking part of the role. LLM operations require a fundamentally different skill set from traditional ML:

  • GPU cluster management (NVIDIA A100/H100, multi-GPU serving)
  • Inference optimization (quantization, KV-cache optimization, continuous batching)
  • RAG pipeline architecture for grounding LLM responses in banking data
  • Cost optimization (GPU compute is expensive — efficient utilization is essential)

Required Qualifications

"3+ years of experience in backend development or ML engineering"

The dual framing (backend OR ML engineering) is intentional. They want someone who can write production-quality code. A data scientist with only notebook experience will struggle here. A backend engineer with no ML understanding will also struggle. The sweet spot is an engineer who can write Go/Python services AND understands ML concepts.

"Experience with Kubernetes and container orchestration"

This is non-negotiable. Every tool in their stack (MLFlow, Airflow, JupyterHub, Kubeflow, Triton) runs on Kubernetes. You need to understand:

  • Pod scheduling, resource requests/limits
  • Custom operators and CRDs (Custom Resource Definitions)
  • Helm charts and Kustomize for deployment management
  • Persistent volume management for model artifacts and training data

"Understanding of ML lifecycle (training, serving, monitoring)"

They want to confirm you see the big picture. ML in production is not "train a model and deploy it." It is a continuous cycle:

Training leads to validation, which leads to registry, then deployment, then monitoring, then retraining. Understanding each transition point and what can go wrong is essential.

Preferred Qualifications

"Experience with GPU infrastructure and CUDA"

This separates senior candidates from junior ones. If you have hands-on experience with GPU memory profiling, CUDA kernel optimization, or multi-GPU training with NCCL — you will stand out significantly.

"Experience with distributed systems or databases"

The ScyllaDB feature store requirement makes this especially relevant. Experience with Cassandra, DynamoDB, or any LSM-tree based database gives you a huge advantage.

"Contributions to open-source ML tools"

This is the strongest signal in the "preferred" section. Active open-source contributors demonstrate both technical depth and community engagement. Even small contributions to MLFlow, Kubeflow, or Triton will be noticed.


3. Tech Stack Deep Dive

3.1 Kubernetes for MLOps

Kubernetes is the foundation of the entire Toss Bank ML platform. Every other tool runs on top of it. Your Kubernetes knowledge needs to go beyond basic deployments.

Key Concepts for MLOps on Kubernetes

| Concept | MLOps Relevance |
| --- | --- |
| Namespaces | Isolating dev/staging/prod ML environments |
| Resource Quotas | Preventing runaway training jobs from starving other workloads |
| Node Affinity and Taints | Directing GPU workloads to GPU nodes, CPU workloads to CPU nodes |
| Custom Resource Definitions | Kubeflow Pipelines, TFJob, PyTorchJob all use CRDs |
| Persistent Volume Claims | Storing training data, model artifacts, checkpoints |
| Horizontal Pod Autoscaler | Scaling inference servers based on request load |

GPU Scheduling on Kubernetes

GPU management is critical for this role. NVIDIA provides the nvidia-device-plugin for Kubernetes, and you need to understand:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
    - name: training
      image: training-image:v1
      resources:
        limits:
          nvidia.com/gpu: 2
  nodeSelector:
    accelerator: nvidia-a100
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

This example shows a training pod requesting 2 NVIDIA A100 GPUs with appropriate node selection and tolerations. In production, you would also configure:

  • GPU sharing for smaller workloads, via time-slicing or MIG (Multi-Instance GPU)
  • RDMA networking for multi-node training
  • Topology-aware scheduling for optimal GPU-to-GPU communication

Study Resources

  • Kubernetes official documentation: Concepts section (free)
  • "Kubernetes in Action" by Marko Luksa — the definitive deep-dive book
  • NVIDIA GPU Operator documentation
  • CKA (Certified Kubernetes Administrator) certification — strongly recommended

3.2 MLFlow: Experiment Tracking and Model Registry

MLFlow is the de facto standard for experiment tracking in the ML industry. At Toss Bank, it serves as both the experiment tracking system and the model registry.

Architecture Overview

MLFlow has four core components:

  1. Tracking — Logs parameters, metrics, and artifacts for each experiment run
  2. Projects — Packages ML code in a reusable, reproducible format
  3. Models — Standardizes model packaging across frameworks
  4. Model Registry — Centralized model store with versioning, staging, and approval workflows
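
To make the Tracking component concrete, here is a minimal sketch of logging a run. The experiment name, hyperparameters, and metric values are placeholders:

```python
import mlflow

# Without set_tracking_uri, runs log to a local ./mlruns directory;
# in practice you would point at your team's server (hypothetical URL):
# mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("credit-scoring")

with mlflow.start_run(run_name="lightgbm-baseline"):
    mlflow.log_param("num_leaves", 31)        # placeholder hyperparameters
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_metric("auc", 0.91)            # placeholder evaluation metrics
    mlflow.log_metric("ks_statistic", 0.42)
```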

How Toss Bank Likely Uses MLFlow

In a financial institution, model governance is not optional. The MLFlow Model Registry becomes the central control plane:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a new model version (assumes the registered model
# "credit-scoring-v2" already exists in the registry)
model_uri = "runs:/abc123/model"
mv = client.create_model_version(
    name="credit-scoring-v2",
    source=model_uri,
    run_id="abc123",
    description="LightGBM credit scoring model with 47 features"
)

# Transition through stages with approval
client.transition_model_version_stage(
    name="credit-scoring-v2",
    version=mv.version,
    stage="Staging"
)

# After validation, promote to production
client.transition_model_version_stage(
    name="credit-scoring-v2",
    version=mv.version,
    stage="Production",
    archive_existing_versions=True
)

Key Topics to Study

  • MLFlow Tracking Server deployment on Kubernetes (PostgreSQL backend, S3/MinIO artifact store)
  • Custom MLFlow plugins (e.g., custom authentication, custom artifact stores)
  • MLFlow Model Serving vs dedicated serving solutions (Triton)
  • Integration with Airflow for automated training pipelines
  • Metric comparison and experiment analysis APIs

Study Resources

  • MLFlow official documentation (comprehensive and well-written)
  • "Practical MLOps" by Noah Gift and Alfredo Deza (O'Reilly)
  • MLFlow GitHub repository — read the source code for the tracking server

3.3 Apache Airflow: Workflow Orchestration

Airflow is the industry standard for data pipeline orchestration. In the ML context, it manages the complex dependencies between data preparation, feature computation, model training, evaluation, and deployment.

Why Airflow for ML Pipelines

| Capability | ML Application |
| --- | --- |
| DAG (Directed Acyclic Graph) scheduling | Defining dependencies between data prep, training, evaluation steps |
| Retry and error handling | Recovering from transient GPU failures during training |
| SLA monitoring | Ensuring daily model retraining completes before business hours |
| Parameterized DAGs | Running the same pipeline with different hyperparameters |
| Custom operators | Building Kubernetes-native training job operators |

Example: ML Training DAG

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from datetime import datetime, timedelta
from kubernetes.client import models as k8s

default_args = {
    "owner": "ml-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="credit_scoring_daily_retrain",
    default_args=default_args,
    schedule="0 2 * * *",
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:

    feature_extraction = KubernetesPodOperator(
        task_id="extract_features",
        name="feature-extraction",
        namespace="ml-pipelines",
        image="feature-pipeline:v3",
        arguments=["--date", "{{ ds }}"],
        # Recent cncf.kubernetes providers take a V1ResourceRequirements
        # object via container_resources (the old resources dict was removed).
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "16Gi"},
            limits={"cpu": "8", "memory": "32Gi"},
        ),
    )

    model_training = KubernetesPodOperator(
        task_id="train_model",
        name="model-training",
        namespace="ml-pipelines",
        image="training-pipeline:v5",
        arguments=["--date", "{{ ds }}", "--experiment", "credit-scoring"],
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
            limits={"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "1"},
        ),
    )

    model_evaluation = KubernetesPodOperator(
        task_id="evaluate_model",
        name="model-evaluation",
        namespace="ml-pipelines",
        image="evaluation-pipeline:v2",
        arguments=["--date", "{{ ds }}"],
    )

    feature_extraction >> model_training >> model_evaluation

Key Topics to Study

  • KubernetesExecutor vs CeleryExecutor — tradeoffs for ML workloads
  • Custom operators for MLFlow integration
  • XCom for passing metadata between tasks (model metrics, artifact URIs); see the sketch after this list
  • Connection and variable management for secrets
  • Airflow on Kubernetes: Helm chart deployment and configuration
  • DAG versioning and testing strategies
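
As an illustration of the XCom point above, here is a minimal TaskFlow-style sketch (assuming Airflow 2.4+); the threshold, metric values, and model URI are hypothetical:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="0 3 * * *", start_date=datetime(2025, 1, 1), catchup=False)
def evaluate_and_gate():

    @task
    def evaluate_model() -> dict:
        # A real task would load the candidate model and score a holdout set;
        # the returned dict travels to downstream tasks via XCom.
        return {"auc": 0.91, "model_uri": "runs:/abc123/model"}

    @task
    def promote_if_better(metrics: dict) -> None:
        # Hypothetical gate: promote only if the candidate clears a threshold.
        if metrics["auc"] >= 0.90:
            print(f"Promoting {metrics['model_uri']}")

    promote_if_better(evaluate_model())

evaluate_and_gate()
```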

Study Resources

  • Apache Airflow official documentation
  • "Data Pipelines with Apache Airflow" by Bas Harenslak and Julian de Ruiter (Manning)
  • Astronomer.io blog and guides (Astronomer is the commercial Airflow company)

3.4 Kubeflow: Kubernetes-Native ML Pipelines

Kubeflow is the Kubernetes-native ML platform that provides pipeline orchestration, hyperparameter tuning, and distributed training capabilities. While Airflow handles general workflow orchestration, Kubeflow is purpose-built for ML.

Kubeflow Components Relevant to This Role

| Component | Purpose |
| --- | --- |
| Kubeflow Pipelines (KFP) | Define and run ML workflows as reusable pipelines |
| Katib | Automated hyperparameter tuning |
| Training Operators | Distributed training for TensorFlow, PyTorch, XGBoost |
| KServe | Model serving (may overlap with Triton in the Toss Bank setup) |
| Notebooks | Jupyter notebook management (may overlap with JupyterHub) |

Kubeflow Pipelines Example

from kfp import dsl
from kfp import compiler

@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn", "pandas"]
)
def preprocess_data(input_path: str, output_path: dsl.OutputPath(str)):
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    df = pd.read_parquet(input_path)
    scaler = StandardScaler()
    df_scaled = pd.DataFrame(
        scaler.fit_transform(df),
        columns=df.columns
    )
    df_scaled.to_parquet(output_path)

@dsl.component(
    base_image="python:3.11",
    packages_to_install=["lightgbm", "mlflow", "pandas", "pyarrow"]
)
def train_model(data_path: str, model_name: str):
    import lightgbm as lgb
    import mlflow
    import pandas as pd
    df = pd.read_parquet(data_path)
    # Minimal training step so the component is runnable;
    # assumes the dataset carries a "label" column.
    X, y = df.drop(columns=["label"]), df["label"]
    model = lgb.LGBMClassifier().fit(X, y)
    mlflow.lightgbm.log_model(model, model_name)

@dsl.pipeline(name="credit-scoring-pipeline")
def credit_scoring_pipeline(input_path: str = "s3://data/features/"):
    preprocess_task = preprocess_data(input_path=input_path)
    train_task = train_model(
        data_path=preprocess_task.outputs["output_path"],
        model_name="credit-scoring"
    )

compiler.Compiler().compile(credit_scoring_pipeline, "pipeline.yaml")

Kubeflow vs Airflow: When to Use Which

  • Airflow: Best for complex DAGs with mixed workloads (data pipelines + ML + ETL), mature ecosystem with hundreds of operators, strong scheduling capabilities
  • Kubeflow: Best for pure ML pipelines, native Kubernetes integration, built-in hyperparameter tuning and distributed training, better experiment tracking integration

Many teams (likely including Toss Bank) use both: Airflow for top-level orchestration and data pipelines, Kubeflow for the ML-specific pipeline steps within those workflows.

Study Resources

  • Kubeflow official documentation
  • Kubeflow Pipelines SDK v2 documentation (this is the current version)
  • Google Cloud Vertex AI Pipelines (uses KFP under the hood)

3.5 JupyterHub: Multi-User Notebook Platform

JupyterHub is the multi-user server for Jupyter notebooks. In an ML platform context, it provides the self-service environment where data scientists experiment and develop models.

Why JupyterHub Matters for This Role

You are not just deploying JupyterHub — you are building a customized, secure, enterprise-grade notebook platform for a financial institution. This involves:

  1. Authentication and Authorization — Integrating with the company identity provider (LDAP, OIDC, SAML)
  2. Resource Management — Allowing users to request specific compute profiles (CPU-only, single GPU, multi-GPU)
  3. Image Management — Maintaining curated Docker images with pre-installed ML frameworks
  4. Persistent Storage — Ensuring notebooks and data persist across server restarts
  5. Security — Network isolation, secret management, preventing data exfiltration in a banking environment
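
For the authentication piece, a jupyterhub_config.py sketch using the oauthenticator package might look like the following; every URL, client ID, and group name is a placeholder for your identity provider:

```python
# jupyterhub_config.py sketch; `c` is injected by JupyterHub at load time.
from oauthenticator.generic import GenericOAuthenticator

c.JupyterHub.authenticator_class = GenericOAuthenticator

# All URLs, IDs, and group names below are placeholders for your IdP.
c.GenericOAuthenticator.client_id = "jupyterhub"
c.GenericOAuthenticator.client_secret = "REPLACE_ME"
c.GenericOAuthenticator.oauth_callback_url = "https://hub.internal/hub/oauth_callback"
c.GenericOAuthenticator.authorize_url = "https://idp.internal/oauth2/authorize"
c.GenericOAuthenticator.token_url = "https://idp.internal/oauth2/token"
c.GenericOAuthenticator.userdata_url = "https://idp.internal/oauth2/userinfo"
c.GenericOAuthenticator.claim_groups_key = "groups"
c.GenericOAuthenticator.allowed_groups = {"ml-platform", "data-science"}
```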

Kubernetes-Native JupyterHub Architecture

On Kubernetes, JupyterHub uses KubeSpawner to create an individual pod for each user:

# JupyterHub Helm values (simplified)
singleuser:
  profileList:
    - display_name: 'CPU - Small (2 CPU, 8GB)'
      description: 'For data exploration and light processing'
      kubespawner_override:
        cpu_limit: 2
        mem_limit: '8G'
    - display_name: 'GPU - A100 (8 CPU, 32GB, 1 GPU)'
      description: 'For model training and fine-tuning'
      kubespawner_override:
        cpu_limit: 8
        mem_limit: '32G'
        extra_resource_limits:
          nvidia.com/gpu: '1'
        node_selector:
          accelerator: nvidia-a100
  storage:
    type: dynamic
    capacity: 50Gi
    storageClass: fast-ssd

Study Resources

  • Zero to JupyterHub with Kubernetes documentation
  • JupyterHub for Kubernetes Helm chart documentation
  • KubeSpawner documentation

3.6 Triton Inference Server: Production Model Serving

NVIDIA Triton Inference Server is the industry-leading solution for serving ML models in production. It supports multiple frameworks simultaneously and provides advanced features like dynamic batching, model ensemble, and GPU utilization optimization.

Why Triton for Financial Services

| Feature | Banking Benefit |
| --- | --- |
| Multi-framework support | Serve XGBoost credit models and PyTorch NLP models from the same server |
| Dynamic batching | Maximize throughput while meeting latency SLAs |
| Model ensemble | Chain preprocessing, model inference, and postprocessing |
| Model versioning | Seamless A/B testing and canary deployments |
| Metrics and monitoring | Prometheus-compatible metrics for model performance tracking |
| gRPC and HTTP endpoints | Flexible integration with existing banking microservices |

Triton Model Repository Structure

model_repository/
  credit_scoring/
    config.pbtxt
    1/
      model.onnx
    2/
      model.onnx
  fraud_detection/
    config.pbtxt
    1/
      model.plan
  text_classifier/
    config.pbtxt
    1/
      model.pt

Triton Configuration Example

name: "credit_scoring"
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 47 ]
  }
]
output [
  {
    name: "probability"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}
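
On the client side, querying the credit_scoring model above could look like this sketch using the official tritonclient package; the endpoint and feature values are placeholders:

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder endpoint and random features for a batch of one.
client = httpclient.InferenceServerClient(url="localhost:8000")
features = np.random.rand(1, 47).astype(np.float32)

infer_input = httpclient.InferInput("features", list(features.shape), "FP32")
infer_input.set_data_from_numpy(features)

response = client.infer(model_name="credit_scoring", inputs=[infer_input])
print(response.as_numpy("probability"))
```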

Key Topics to Study

  • Model conversion: PyTorch to ONNX, TensorFlow to TensorRT
  • Dynamic batching configuration and tuning
  • Model ensemble for pre/post-processing pipelines
  • Triton Inference Server on Kubernetes (Helm chart deployment)
  • Performance analysis with Triton Model Analyzer
  • Custom backends for non-standard model formats
  • Health checks and readiness probes for Kubernetes integration

Study Resources

  • NVIDIA Triton Inference Server documentation
  • NVIDIA Deep Learning Examples GitHub repository
  • Triton Model Analyzer documentation
  • "Serving Machine Learning Models" by Yaron Haviv (O'Reilly)

3.7 ScyllaDB Feature Store: Low-Latency Feature Serving

The decision to build a feature store on ScyllaDB is one of the most distinctive aspects of Toss Bank's ML platform. Understanding why they chose this architecture reveals a lot about the team's priorities.

What Is a Feature Store

A feature store is a centralized repository for storing, managing, and serving ML features. It solves several critical problems:

  1. Feature consistency — Ensuring the same feature computation is used during both training and inference
  2. Feature reuse — Allowing multiple models to share the same features without redundant computation
  3. Low-latency serving — Providing precomputed features at inference time with sub-millisecond latency
  4. Point-in-time correctness — Retrieving features as they existed at a specific historical timestamp for training (avoiding data leakage)
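
Point-in-time correctness (the fourth problem) is easiest to see in code. This toy pandas sketch joins each training label with the latest feature row known at the event time, never a later one; all column names and values are illustrative:

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "event_time": pd.to_datetime(["2025-01-10", "2025-02-10"]),
    "defaulted": [0, 1],
}).sort_values("event_time")

features = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "feature_time": pd.to_datetime(["2025-01-01", "2025-02-01"]),
    "avg_txn_7d": [120.0, 95.0],
}).sort_values("feature_time")

# direction="backward" matches each label with the most recent feature row
# at or before the event time, so no future information leaks into training.
training_set = pd.merge_asof(
    labels, features,
    left_on="event_time", right_on="feature_time",
    by="user_id", direction="backward",
)
print(training_set)
```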

Why ScyllaDB Instead of Redis or DynamoDB

| Requirement | ScyllaDB Advantage |
| --- | --- |
| Consistent low latency (p99) | Shard-per-core architecture eliminates context switching overhead |
| Large feature sets | Supports wide rows with hundreds of columns per entity |
| Time-series features | Native TTL and time-windowed compaction for historical features |
| Operational simplicity | Cassandra-compatible but with C++ performance (no JVM tuning) |
| Cost at scale | Better price-performance ratio than DynamoDB for predictable workloads |

Feature Store Architecture Pattern

Batch features (daily):
    Raw Data --> [Airflow + Spark] --+--> Offline Store (training datasets)
                                     |
                                     +--> ScyllaDB (online store)

Streaming features (near-real-time):
    Live Events --> [Streaming Pipeline] --> ScyllaDB (online store)

Inference path:
    Request --> [Feature Serving API] --(reads ScyllaDB)--> [Model Serving (Triton)]

ScyllaDB Data Modeling for Features

CREATE TABLE feature_store.user_features (
    user_id text,
    feature_timestamp timestamp,
    avg_transaction_amount_7d double,
    transaction_count_30d int,
    max_single_transaction_90d double,
    credit_utilization_ratio double,
    days_since_last_late_payment int,
    PRIMARY KEY (user_id, feature_timestamp)
) WITH CLUSTERING ORDER BY (feature_timestamp DESC)
  AND default_time_to_live = 7776000;
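
Reading the freshest feature row from this table with the Python cassandra-driver (which speaks the same CQL protocol ScyllaDB uses) might look like the following sketch; the contact points and consistency choice are illustrative:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster

# Placeholder contact points for the cluster.
cluster = Cluster(["scylla-0.internal", "scylla-1.internal"])
session = cluster.connect("feature_store")

# Prepared statements are parsed once and reused, which matters at high QPS.
stmt = session.prepare(
    "SELECT * FROM user_features WHERE user_id = ? LIMIT 1"
)
stmt.consistency_level = ConsistencyLevel.LOCAL_QUORUM  # local-DC quorum reads

# DESC clustering order means LIMIT 1 returns the most recent feature row.
row = session.execute(stmt, ["user-12345"]).one()
print(row.avg_transaction_amount_7d if row else "no features for user")
```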

Key Topics to Study

  • ScyllaDB architecture: shard-per-core model, seastar framework
  • Data modeling for wide-column databases (Cassandra/ScyllaDB)
  • Feature store concepts: online vs offline store, feature freshness, time-travel
  • Comparison with existing feature stores (Feast, Tecton, Hopsworks)
  • ScyllaDB performance tuning: compaction strategies, caching, read/write consistency levels
  • Driver selection and connection pooling for high-throughput workloads

Study Resources

  • ScyllaDB University (free online courses)
  • "Cassandra: The Definitive Guide" by Jeff Carpenter and Eben Hewitt (concepts transfer directly)
  • Feast documentation (to understand general feature store concepts)
  • ScyllaDB Architecture documentation

3.8 LLM Platform: GPU Infrastructure and Inference Optimization

The LLM platform responsibility is the most forward-looking part of this role. Building infrastructure for Large Language Models requires understanding a completely different set of constraints than traditional ML.

LLM Infrastructure Challenges in Banking

  1. Data privacy — Banking data cannot leave the organization, ruling out most cloud LLM APIs. On-premises or VPC-hosted inference is required.
  2. Latency requirements — Customer-facing chatbots need first-token latency under 500ms and generation speed above 30 tokens per second.
  3. Cost management — A single NVIDIA H100 GPU costs over $30,000. Efficient utilization of GPU clusters is essential for ROI.
  4. Model governance — Regulatory requirements mean every LLM response in banking must be traceable, auditable, and explainable.

Key LLM Serving Technologies

| Technology | Purpose |
| --- | --- |
| vLLM | High-throughput LLM serving with PagedAttention |
| TensorRT-LLM | NVIDIA's optimized LLM inference engine |
| Triton + TensorRT-LLM backend | Enterprise-grade LLM serving on Triton |
| NVIDIA NIM | Containerized, optimized inference microservices |
| Ray Serve | Distributed serving framework for complex inference graphs |

LLM Inference Optimization Techniques

  • Quantization: Reducing model precision from FP16 to INT8 or INT4. AWQ and GPTQ are the most common methods. This typically reduces GPU memory usage by 50-75% with minimal accuracy loss.
  • KV-Cache Optimization: PagedAttention (used by vLLM) manages the key-value cache like virtual memory pages, dramatically improving throughput for concurrent requests.
  • Continuous Batching: Unlike static batching, continuous batching allows new requests to join a batch as previous requests complete, maximizing GPU utilization.
  • Speculative Decoding: Using a small draft model to generate candidate tokens that a larger model verifies, potentially speeding up inference by 2-3x.
  • Tensor Parallelism: Splitting a single model across multiple GPUs for serving models too large for one GPU's memory.
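
Several of these techniques come for free in vLLM, which applies PagedAttention and continuous batching automatically. A minimal offline-inference sketch with an illustrative model name and settings (not a vetted production configuration):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=2,        # shard across 2 GPUs (assumes 2 are present)
    gpu_memory_utilization=0.90,   # leave headroom for the KV cache
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this customer's recent transactions."], params)
print(outputs[0].outputs[0].text)
```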

RAG Pipeline for Banking

Retrieval-Augmented Generation (RAG) is essential for banking LLM applications to ground responses in accurate, up-to-date information:

User Query --> [Embedding Model] --> [Vector DB Search]
                                          |
                                     Retrieved Context
                                          |
                                          v
              [Prompt Template + Context + Query] --> [LLM] --> Response
                                                                   |
                                                              [Guardrails]
                                                                   |
                                                              Final Response

Study Resources

  • vLLM documentation and GitHub repository
  • NVIDIA TensorRT-LLM documentation
  • "LLM Engineer's Handbook" by Paul Iusztin and Maxime Labonne
  • Hugging Face Text Generation Inference documentation
  • NVIDIA NIM documentation

3.9 GPU Frameworks and CUDA Fundamentals

Understanding GPU computing at a deeper level sets apart strong candidates from average ones. You do not need to be a CUDA kernel developer, but you need to understand the fundamentals.

GPU Architecture Basics

| Concept | Description |
| --- | --- |
| Streaming Multiprocessor (SM) | The basic processing unit of a GPU; an A100 has 108 SMs |
| CUDA Core | Individual processing units within each SM |
| Tensor Core | Specialized hardware for matrix operations (critical for ML) |
| HBM (High Bandwidth Memory) | On-package GPU memory (40 or 80 GB on A100, 80 GB on H100) |
| NVLink | High-speed GPU-to-GPU interconnect (900 GB/s on H100) |
| PCIe | CPU-to-GPU interconnect (slower than NVLink) |

Multi-Instance GPU (MIG)

MIG allows partitioning a single GPU into multiple isolated instances. This is essential for maximizing GPU utilization:

# List MIG-capable GPUs
nvidia-smi mig -lgip

# Create two GPU instances using profile 9 (3g.40gb on an 80GB A100),
# plus the matching compute instances (-C)
nvidia-smi mig -cgi 9,9 -C

# List created instances
nvidia-smi mig -lgi

In a banking context, MIG allows running smaller inference workloads on fractions of an A100/H100 rather than dedicating an entire GPU to each model.

CUDA Memory Management for ML Engineers

Understanding GPU memory is critical for debugging out-of-memory errors and optimizing training:

  • Model Parameters: Stored in GPU memory (e.g., a 7B parameter model in FP16 needs about 14GB)
  • Gradients: Same size as parameters during training (another 14GB)
  • Optimizer States: Adam optimizer stores two additional copies (another 28GB)
  • Activations: Intermediate values during forward pass (varies with batch size and sequence length)

Total training memory for a 7B model with Adam in FP16 is roughly: 14 + 14 + 28 = 56GB minimum, plus activations.
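
The same arithmetic as a small helper, useful for quick capacity checks (the defaults mirror the simplified FP16-plus-Adam assumptions above):

```python
def training_memory_gb(params_billion: float,
                       bytes_per_param: int = 2,      # FP16
                       optimizer_copies: int = 2) -> float:
    """Rough lower bound mirroring the arithmetic above:
    weights + gradients + optimizer states, excluding activations."""
    weights = params_billion * bytes_per_param
    gradients = weights                     # same dtype as the weights
    optimizer = weights * optimizer_copies  # e.g. Adam's two moment buffers
    return weights + gradients + optimizer

print(training_memory_gb(7))  # 7B params in FP16 -> 56.0 GB
```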

NCCL (NVIDIA Collective Communications Library)

NCCL is the library that enables efficient multi-GPU and multi-node communication:

  • AllReduce: Aggregating gradients across GPUs during distributed training
  • AllGather: Collecting tensor shards from all GPUs for tensor parallelism
  • ReduceScatter: Combining reduction and distribution; used in sharded data-parallel training (e.g., ZeRO/FSDP gradient reduction) and tensor parallelism
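
In PyTorch these collectives are exposed through torch.distributed with the NCCL backend. A minimal AllReduce sketch, intended to be launched with torchrun on a single multi-GPU machine:

```python
import torch
import torch.distributed as dist

# Launch with e.g.: torchrun --nproc_per_node=2 allreduce_demo.py
def main():
    dist.init_process_group(backend="nccl")  # NCCL handles GPU transport
    rank = dist.get_rank()
    torch.cuda.set_device(rank)              # one GPU per process (single node)

    # Each rank contributes its own gradient-like tensor...
    t = torch.ones(4, device="cuda") * (rank + 1)
    # ...and AllReduce sums it in place across all ranks,
    # exactly what data-parallel training does with gradients.
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```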

Study Resources

  • NVIDIA CUDA Programming Guide
  • "Programming Massively Parallel Processors" by David Kirk and Wen-mei Hwu
  • NVIDIA Deep Learning Performance Guide
  • PyTorch Distributed Training documentation

3.10 Distributed Database Fundamentals

Since Toss Bank uses ScyllaDB for its feature store and operates at financial-grade scale, a solid understanding of distributed database theory and practice is essential.

CAP Theorem and Its Practical Implications

The CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. In practice:

  • ScyllaDB/Cassandra: AP system with tunable consistency (you can configure per-query consistency levels)
  • For feature serving: typically LOCAL_QUORUM reads, which give quorum-level consistency within the local datacenter without paying cross-datacenter latency
  • For feature writing: ONE or LOCAL_ONE for high write throughput during batch feature computation

Consistency Levels in Practice

| Consistency Level | Reads From | Use Case |
| --- | --- | --- |
| ONE | Any single replica | Maximum speed, eventual consistency |
| QUORUM | Majority of replicas | Strong consistency for critical features |
| LOCAL_QUORUM | Majority in local datacenter | Consistent reads with low latency |
| ALL | All replicas | Maximum consistency (rarely used in production) |

LSM-Tree Architecture

ScyllaDB and Cassandra use Log-Structured Merge Trees for storage:

  1. Writes go to an in-memory Memtable
  2. When the Memtable is full, it is flushed to disk as an SSTable
  3. Background compaction merges SSTables to reclaim space and optimize reads
  4. Bloom filters and partition indexes enable fast lookups

Understanding compaction strategies (Size-Tiered, Leveled, Time-Window) is critical for feature store performance tuning.

Consistent Hashing and Data Distribution

ScyllaDB distributes data across nodes using consistent hashing:

  • Each node owns a range of token values
  • Partition keys are hashed to determine which node stores the data
  • Virtual nodes (vnodes) improve data distribution uniformity
  • Replication factor determines how many copies of each piece of data exist
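
A toy sketch of the idea (real ScyllaDB uses Murmur3 tokens and many vnodes per node; this just shows how partition keys land on a ring):

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in hash; ScyllaDB actually uses the Murmur3 partitioner.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
# 8 virtual nodes per physical node smooth out the distribution.
ring = sorted((token(f"{n}:vnode{i}"), n) for n in nodes for i in range(8))
tokens = [t for t, _ in ring]

def owner(partition_key: str) -> str:
    # The first vnode clockwise from the key's token owns the partition.
    idx = bisect.bisect(tokens, token(partition_key)) % len(ring)
    return ring[idx][1]

print(owner("user-12345"))
```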

Study Resources

  • "Designing Data-Intensive Applications" by Martin Kleppmann — essential reading
  • ScyllaDB University courses (free)
  • "Database Internals" by Alex Petrov (O'Reilly)
  • Jepsen.io for distributed systems correctness analysis

4. MLOps Maturity Model

Understanding where Toss Bank sits on the MLOps maturity model helps you frame your interview answers and demonstrate strategic thinking.

The Six Maturity Levels

| Level | Name | Characteristics |
| --- | --- | --- |
| 0 | No MLOps | Manual everything: notebooks to production via copy-paste |
| 1 | DevOps but no MLOps | CI/CD exists but ML-specific pipelines do not |
| 2 | Automated Training | Automated training pipelines, basic experiment tracking |
| 3 | Automated Deployment | Automated model deployment, A/B testing, monitoring |
| 4 | Full MLOps | Automated retraining, feature stores, model governance |
| 5 | Autonomous ML | Self-healing pipelines, automated feature engineering, continuous optimization |

Toss Bank's Current Position (Estimated Level 3-4)

Based on the JD analysis, Toss Bank appears to be at Level 3-4:

Evidence for Level 3+:

  • Dedicated ML Platform team (not ad-hoc)
  • Mature tool stack (MLFlow, Kubeflow, Airflow)
  • Production model serving (Triton)
  • Custom feature store (ScyllaDB-based)

Aspirations toward Level 4-5:

  • LLM platform development (cutting-edge capabilities)
  • GPU infrastructure management (scaling up)
  • The fact that they are hiring suggests expansion and maturation

In your interview, frame your contributions as helping the team advance from their current level to the next. Show that you understand not just the tools, but the organizational and process changes needed for MLOps maturity.


5. Interview Preparation: 30 Expected Questions

Kubernetes and Infrastructure (Questions 1-6)

Q1. How would you design a Kubernetes cluster to support both ML training workloads and model inference serving? What considerations differ between the two?

Q2. Explain how the NVIDIA device plugin works in Kubernetes. How do you handle GPU scheduling, and what happens when a GPU node becomes unhealthy during a training job?

Q3. A training job is consuming all available GPU memory on a node, preventing other pods from scheduling. How would you prevent this using Kubernetes resource management?

Q4. Describe how you would implement a blue-green deployment strategy for ML model updates on Kubernetes. What metrics would you monitor during the canary phase?

Q5. How does Multi-Instance GPU (MIG) work, and in what scenarios would you choose MIG over dedicated GPU allocation?

Q6. Explain the tradeoffs between using a single large Kubernetes cluster versus multiple smaller clusters for separating training and serving workloads.

MLFlow and Experiment Management (Questions 7-12)

Q7. How would you design an MLFlow deployment for a team of 50+ data scientists with requirements for high availability and security?

Q8. Describe your approach to organizing MLFlow experiments and runs for a large organization. How do you prevent experiment sprawl and maintain discoverability?

Q9. How would you implement a model approval workflow using the MLFlow Model Registry? What stages would you define, and what automated checks would you add?

Q10. MLFlow's tracking server is experiencing performance issues with a large volume of experiments. How would you diagnose and resolve this?

Q11. How do you handle model reproducibility? Walk through the steps from experiment to production deployment, ensuring you can recreate any model version.

Q12. Describe how you would integrate MLFlow with your CI/CD pipeline to automate model testing and deployment.

Airflow and Pipeline Orchestration (Questions 13-18)

Q13. Compare the KubernetesExecutor and CeleryExecutor in Airflow for ML pipeline workloads. Which would you recommend for Toss Bank and why?

Q14. How would you design a DAG that handles daily model retraining with automatic rollback if the new model underperforms the current production model?

Q15. A critical Airflow DAG failed at 3 AM and the morning model predictions are stale. Walk through your incident response process.

Q16. How would you implement data quality checks within an Airflow ML pipeline to prevent training on corrupted or incomplete data?

Q17. Describe your strategy for testing Airflow DAGs. How do you ensure DAG changes do not break production workflows?

Q18. How would you manage secrets and credentials in Airflow for connecting to various data sources and ML services?

Model Serving and Triton (Questions 19-24)

Q19. You need to serve a model that requires sub-5ms p99 latency for credit scoring decisions. How would you design the serving infrastructure using Triton?

Q20. Explain dynamic batching in Triton Inference Server. How do you tune the batch size and queue delay parameters for optimal throughput-latency tradeoff?

Q21. How would you implement an A/B test between two model versions using Triton? What metrics would you track, and how long would you run the test?

Q22. A deployed model is experiencing gradual performance degradation over two weeks. Describe your approach to diagnosing and resolving model drift.

Q23. How would you design a model ensemble in Triton that chains a feature preprocessor, a primary model, and a postprocessor together?

Q24. Describe the steps to convert a PyTorch model to an optimized TensorRT engine for deployment on Triton. What potential pitfalls should you watch for?

Feature Store and Distributed Databases (Questions 25-27)

Q25. Why would you choose ScyllaDB over Redis for a feature store backend? Under what circumstances might Redis be the better choice?

Q26. Explain how you would handle feature freshness requirements for a real-time fraud detection model. What is your architecture for updating features in near-real-time?

Q27. Describe point-in-time correctness in a feature store. Why is it important, and how do you implement it with ScyllaDB?

LLM Platform (Questions 28-30)

Q28. How would you design an LLM serving platform for a banking environment where all data must remain on-premises? What are the key architectural decisions?

Q29. Explain the difference between tensor parallelism and pipeline parallelism for serving large language models. When would you use each?

Q30. You are tasked with reducing LLM inference costs by 50% without significantly impacting response quality. What approaches would you consider?


6. Eight-Month Study Roadmap

This roadmap assumes you are currently a backend engineer or junior ML engineer with basic Python and cloud experience. Adjust the timeline based on your starting point.

Month 1-2: Kubernetes and Container Foundations

Goal: Achieve CKA-level Kubernetes proficiency

| Week | Focus Area | Deliverable |
| --- | --- | --- |
| 1-2 | Core Kubernetes concepts | Deploy a multi-service application on a local K8s cluster |
| 3-4 | Advanced scheduling, storage, networking | Configure GPU node pools with taints and tolerations |
| 5-6 | Helm, Kustomize, GitOps | Create Helm charts for ML services deployment |
| 7-8 | CKA exam preparation | Pass the CKA certification |

Daily practice: 1.5 hours on weekdays, 3 hours on weekends
Resources: KodeKloud CKA course, Kubernetes documentation, killer.sh practice exams

Month 3-4: Core ML Platform Tools

Goal: Deploy and customize MLFlow, Airflow, and JupyterHub on Kubernetes

| Week | Focus Area | Deliverable |
| --- | --- | --- |
| 9-10 | MLFlow deep dive | Deploy MLFlow with PostgreSQL backend and S3 artifact store on K8s |
| 11-12 | Airflow deep dive | Build an ML training DAG with KubernetesPodOperator |
| 13-14 | JupyterHub deployment | Configure multi-profile JupyterHub with GPU support |
| 15-16 | Integration project | End-to-end pipeline: JupyterHub experiment to MLFlow to Airflow training to model registry |

Daily practice: 2 hours on weekdays, 4 hours on weekends
Resources: Official documentation for each tool, "Practical MLOps" book

Month 5-6: Model Serving and Feature Engineering

Goal: Master Triton deployment and build a feature store prototype

| Week | Focus Area | Deliverable |
| --- | --- | --- |
| 17-18 | Triton Inference Server | Deploy models in 3 different formats on Triton with dynamic batching |
| 19-20 | Model optimization | Convert a PyTorch model to ONNX and TensorRT, benchmark performance |
| 21-22 | ScyllaDB fundamentals | Complete ScyllaDB University courses, build a feature serving API |
| 23-24 | Feature store integration | Build a complete feature store with online/offline serving using ScyllaDB |

Daily practice: 2 hours on weekdays, 4 hours on weekends
Resources: Triton documentation, ScyllaDB University, "Designing Data-Intensive Applications"

Month 7-8: LLM Platform and Interview Preparation

Goal: Build LLM serving experience and prepare for interviews

| Week | Focus Area | Deliverable |
| --- | --- | --- |
| 25-26 | LLM serving fundamentals | Deploy an open-source LLM with vLLM and the Triton TensorRT-LLM backend |
| 27-28 | RAG pipeline | Build a RAG system with a vector database and LLM serving |
| 29-30 | GPU optimization | Implement quantization, benchmark different serving configurations |
| 31-32 | Interview preparation | Mock interviews, review all 30 questions, prepare STAR-format stories |

Daily practice: 2 hours on weekdays, 5 hours on weekends
Resources: vLLM documentation, Hugging Face resources, mock interview practice

Study Schedule Summary

Month 1-2: [========== Kubernetes + CKA ==========]
Month 3-4: [==== MLFlow ==][== Airflow ==][= JupyterHub =]
Month 5-6: [=== Triton ===][=== ScyllaDB Feature Store ===]
Month 7-8: [=== LLM Platform ===][== Interview Prep ==]

7. Resume Strategy for Toss Bank ML Platform

Resume Structure

Your resume should directly map to the JD requirements. Here is the recommended structure:

Header Section

  • Name, contact information, GitHub profile, blog/portfolio URL

Summary (3-4 lines)

  • Years of experience, primary domain (MLOps/ML Engineering)
  • Key technologies matching the JD (Kubernetes, MLFlow, Triton)
  • Quantified achievement (e.g., "reduced model deployment time from 2 weeks to 4 hours")

Experience Section (STAR format)

For each position, structure bullets as:

  • Situation: What was the context
  • Task: What was your specific responsibility
  • Action: What did you do (technical details)
  • Result: What was the measurable outcome

Example bullet:

"Designed and deployed an MLFlow-based experiment tracking system on Kubernetes for 30+ data scientists, reducing experiment-to-production time by 80% and establishing model governance workflows that satisfied SOC 2 audit requirements."

Keywords to Include

These terms should appear naturally in your resume:

| Category | Terms |
| --- | --- |
| Infrastructure | Kubernetes, Docker, Helm, GitOps, ArgoCD |
| ML Platform | MLFlow, Kubeflow, Airflow, JupyterHub |
| Model Serving | Triton Inference Server, ONNX, TensorRT, gRPC |
| Data | ScyllaDB, Cassandra, Feature Store, Kafka |
| LLM | vLLM, TensorRT-LLM, RAG, Vector Database, Quantization |
| GPU | CUDA, MIG, NCCL, A100, H100 |
| Practices | CI/CD, Monitoring, A/B Testing, Canary Deployment |

Common Resume Mistakes for MLOps Roles

  1. Listing tools without context — "Experience with MLFlow" is weak. "Deployed MLFlow tracking server handling 10,000+ experiment runs across 5 teams" is strong.
  2. Focusing only on model accuracy — For a platform role, infrastructure metrics matter more (deployment frequency, serving latency, platform uptime).
  3. Ignoring scale indicators — Always include numbers: how many models, how many users, what throughput, what latency.
  4. Missing the governance angle — Financial services care deeply about audit trails, compliance, and model explainability. Mention these explicitly.

8. Portfolio Projects

Project 1: End-to-End MLOps Platform on Kubernetes

Objective: Demonstrate your ability to build and integrate the core ML platform stack.

Architecture:

JupyterHub (experimentation)
     |
     v
MLFlow (experiment tracking + model registry)
     |
     v
Airflow (automated training pipeline)
     |
     v
Triton (model serving)
     |
     v
Prometheus + Grafana (monitoring)

Implementation Details:

  1. Set up a local Kubernetes cluster (kind or minikube with GPU support)
  2. Deploy MLFlow with PostgreSQL and MinIO (S3-compatible storage)
  3. Deploy Airflow with KubernetesExecutor
  4. Build a training DAG that:
    • Pulls data from a feature table
    • Trains a LightGBM model
    • Logs metrics and artifacts to MLFlow
    • Registers the model in MLFlow Model Registry
    • Deploys the model to Triton if performance meets threshold
  5. Deploy JupyterHub for interactive experimentation
  6. Set up Prometheus scraping for all services plus Grafana dashboards

GitHub Repository Structure:

mlops-platform/
  infrastructure/
    kubernetes/
      mlflow/
      airflow/
      jupyterhub/
      triton/
      monitoring/
  pipelines/
    training/
    evaluation/
    deployment/
  models/
    credit_scoring/
    fraud_detection/
  docs/
    architecture.md
    setup.md
  Makefile
  README.md

What This Demonstrates: Platform engineering skills, Kubernetes proficiency, tool integration, and understanding of the full ML lifecycle.


Project 2: Feature Store on ScyllaDB

Objective: Show that you understand distributed database design and feature store concepts at a deep level.

Implementation Details:

  1. Deploy a 3-node ScyllaDB cluster on Kubernetes
  2. Design a feature schema for a credit scoring use case:
    • User demographic features (slowly changing)
    • Transaction aggregate features (computed daily)
    • Real-time transaction features (updated per transaction)
  3. Build a batch pipeline (using Airflow) that computes daily aggregate features and writes them to ScyllaDB
  4. Build a streaming pipeline (using Kafka and a Python consumer) that updates real-time features
  5. Build a feature serving API (FastAPI) that retrieves features by user ID with sub-5ms p99 latency (a minimal sketch follows this list)
  6. Implement point-in-time correctness for training data generation
  7. Add monitoring: feature freshness, serving latency, error rates
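
A minimal sketch of the serving API from step 5, combining FastAPI with the cassandra-driver; the contact point and keyspace are placeholders, and a production service would add connection tuning and async I/O:

```python
from cassandra.cluster import Cluster
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Placeholder contact point; reuses the feature_store schema from this project.
session = Cluster(["scylla.internal"]).connect("feature_store")
stmt = session.prepare(
    "SELECT * FROM user_features WHERE user_id = ? LIMIT 1"
)

@app.get("/features/{user_id}")
def get_features(user_id: str):
    # LIMIT 1 with DESC clustering returns the freshest feature row.
    row = session.execute(stmt, [user_id]).one()
    if row is None:
        raise HTTPException(status_code=404, detail="no features for user")
    return row._asdict()  # cassandra-driver rows are namedtuples
```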

Key Design Decisions to Document:

  • Partition key design and why
  • Compaction strategy selection
  • Consistency level choices for reads and writes
  • TTL strategy for feature expiration
  • Connection pooling and driver configuration

What This Demonstrates: Distributed database expertise, feature store architecture understanding, real-time systems design, and performance optimization skills.


Project 3: LLM Serving Platform with RAG

Objective: Demonstrate hands-on experience with LLM infrastructure and inference optimization.

Implementation Details:

  1. Deploy an open-source LLM (Llama 3 8B or Mistral 7B) using vLLM on Kubernetes with GPU
  2. Implement a RAG pipeline:
    • Document ingestion pipeline (PDF/text to embeddings)
    • Vector database (ChromaDB or Milvus) for similarity search
    • Prompt template system with context injection
    • Streaming response generation
  3. Optimize inference:
    • Quantize the model to INT4 using AWQ
    • Benchmark latency and throughput before and after quantization
    • Configure continuous batching parameters
    • Implement request-level caching for common queries
  4. Build a simple chat interface to demonstrate the system
  5. Add monitoring: tokens per second, time to first token, GPU utilization, cache hit rate

Bonus: Deploy the same model using Triton with the TensorRT-LLM backend and compare performance with vLLM.

What This Demonstrates: LLM infrastructure skills, optimization expertise, RAG architecture understanding, and the ability to make quantitative engineering decisions.


9. Knowledge Check Quiz

Q1. What is the primary advantage of ScyllaDB's shard-per-core architecture over Cassandra's thread-per-request model?

ScyllaDB's shard-per-core architecture assigns each CPU core its own data shard and processing thread, eliminating context switching overhead and lock contention. This results in predictable, consistent low-latency performance (especially at p99) compared to Cassandra's JVM-based thread-per-request model which suffers from garbage collection pauses and cross-thread coordination overhead. For a feature store where p99 latency matters as much as average latency, this architectural difference is critical.

Q2. Why is continuous batching superior to static batching for LLM inference, and what system implements this?

Static batching requires all requests in a batch to complete before any response can be returned, meaning shorter requests are delayed by longer ones. Continuous batching (also called iteration-level batching) allows new requests to enter the batch as soon as any request in the current batch completes a generation step. This maximizes GPU utilization by keeping the GPU busy processing tokens rather than waiting for the longest request to finish. vLLM implements this through its PagedAttention mechanism, and Triton supports it through the TensorRT-LLM backend.

Q3. Explain the difference between Kubeflow Pipelines and Airflow for ML workflow orchestration. When would you use each?

Airflow is a general-purpose workflow orchestration tool optimized for scheduled, dependency-driven DAGs. It excels at data pipelines, ETL jobs, and complex scheduling logic with a mature ecosystem of 500+ operators. Kubeflow Pipelines is a Kubernetes-native ML pipeline orchestrator optimized for ML-specific workflows. It provides first-class support for experiment tracking, artifact management, and pipeline visualization. Use Airflow when you need complex scheduling, mixed workload orchestration (data + ML), and integration with diverse data sources. Use Kubeflow Pipelines when you need pure ML pipelines with tight Kubernetes integration, pipeline versioning and comparison, and integration with Kubeflow's hyperparameter tuning (Katib) and training operators.

Q4. In the context of MLFlow Model Registry, what are the model stages, and how would you implement an automated promotion workflow?

MLFlow Model Registry defines four stages: None, Staging, Production, and Archived. An automated promotion workflow would work as follows. First, when a new model version is registered (from an Airflow training DAG), it enters the None stage. Second, automated tests run against the model (data validation, performance benchmarks, bias checks). Third, if all tests pass, the model is promoted to Staging. Fourth, a canary deployment runs in production with the Staging model serving a small percentage of traffic. Fifth, if canary metrics (latency, accuracy, error rate) meet thresholds over a defined period, the model is promoted to Production. Sixth, the previously active Production model is moved to Archived. This workflow should be implemented using MLFlow webhooks or API polling combined with Airflow DAGs for orchestration.

Q5. How would you design a zero-downtime model update strategy on Triton Inference Server running on Kubernetes?

Triton supports model versioning natively through its model repository. The strategy works as follows. First, upload the new model version to the model repository (S3/MinIO) as a new version directory (e.g., version 2 alongside version 1). Second, configure Triton's model control mode to use explicit loading, and call the model load API to load the new version. Third, update the model configuration to set the new version as the default version policy. Fourth, at the Kubernetes level, use a rolling update strategy with readiness probes that check model health endpoints. Fifth, configure appropriate max surge and max unavailable parameters in the Deployment spec to ensure there is always at least one healthy pod serving the previous model version. Sixth, for more sophisticated traffic shifting, use an Istio or Linkerd service mesh to gradually shift traffic from the old version to the new version based on custom metrics. Seventh, implement automated rollback by monitoring model-specific metrics (accuracy, latency p99) and triggering a rollback if they degrade beyond thresholds.


10. References and Resources

Official Documentation

  1. Kubernetes Documentation — https://kubernetes.io/docs/
  2. MLFlow Documentation — https://mlflow.org/docs/latest/
  3. Apache Airflow Documentation — https://airflow.apache.org/docs/
  4. Kubeflow Documentation — https://www.kubeflow.org/docs/
  5. JupyterHub Documentation — https://jupyterhub.readthedocs.io/
  6. NVIDIA Triton Inference Server — https://docs.nvidia.com/deeplearning/triton-inference-server/
  7. ScyllaDB Documentation — https://docs.scylladb.com/
  8. vLLM Documentation — https://docs.vllm.ai/
  9. NVIDIA TensorRT-LLM — https://nvidia.github.io/TensorRT-LLM/
  10. Feast Feature Store — https://docs.feast.dev/

Books

  1. "Designing Data-Intensive Applications" by Martin Kleppmann (O'Reilly) — essential for distributed systems concepts
  2. "Kubernetes in Action" by Marko Luksa (Manning) — comprehensive Kubernetes deep dive
  3. "Practical MLOps" by Noah Gift and Alfredo Deza (O'Reilly) — MLOps practices and patterns
  4. "Data Pipelines with Apache Airflow" by Bas Harenslak and Julian de Ruiter (Manning) — Airflow best practices
  5. "Programming Massively Parallel Processors" by David Kirk and Wen-mei Hwu — GPU computing fundamentals
  6. "Cassandra: The Definitive Guide" by Jeff Carpenter and Eben Hewitt (O'Reilly) — distributed database concepts applicable to ScyllaDB
  7. "Database Internals" by Alex Petrov (O'Reilly) — storage engine and distributed system internals
  8. "Machine Learning Engineering" by Andriy Burkov — practical ML engineering patterns

Online Courses and Certifications

  1. CKA (Certified Kubernetes Administrator) — https://www.cncf.io/certification/cka/
  2. ScyllaDB University — https://university.scylladb.com/
  3. NVIDIA Deep Learning Institute — https://www.nvidia.com/en-us/training/
  4. Made With ML (MLOps Course) — https://madewithml.com/
  5. Full Stack Deep Learning — https://fullstackdeeplearning.com/

Community Resources

  1. MLOps Community Slack — https://mlops.community/
  2. Kubeflow Slack Channel
  3. NVIDIA Developer Forums — https://forums.developer.nvidia.com/
  4. Toss Tech Blog — https://toss.tech/
  5. Airflow Summit conference talks (YouTube)