Toss Bank ML Backend Engineer Study Guide: Server Architecture, ML Serving, and Career Roadmap
- Author: Youngju Kim (@fjvbn20031)
- Introduction: Where ML Meets Banking at Scale
- 1. JD Deep Analysis: What Toss Bank Actually Wants
- 2. Tech Stack Deep Dive
- 3. Interview Preparation: 30 Questions
- 4. Six-Month Study Roadmap
- 5. Resume Strategy
- 6. Portfolio Project Ideas
- 7. Quiz: Test Your Knowledge
- 8. References and Resources
- Conclusion
Introduction: Where ML Meets Banking at Scale
Machine learning in production is not about Jupyter notebooks. It is about serving millions of predictions per second with sub-10ms latency while maintaining model accuracy, handling graceful degradation when models fail, and operating in a regulatory environment where every prediction must be explainable and auditable.
Toss Bank's ML Service Team sits at this exact intersection. They build the backend infrastructure that serves AI models for credit scoring, fraud detection, recommendation systems, and real-time risk assessment. When Toss Bank posts an ML Backend Engineer position, they are not looking for someone who can train a model — they are looking for someone who can build and operate the production systems that serve those models to millions of users.
This is a role that requires deep backend engineering skills (server architecture, distributed systems, performance optimization) combined with practical ML serving knowledge (model deployment, A/B testing infrastructure, feature stores). You need to be equally comfortable designing a high-availability microservice architecture and deploying a TensorFlow Serving cluster on Kubernetes.
This guide breaks down every requirement in the job description, maps each one to specific technologies and study materials, and provides a 6-month plan to get you interview-ready.
1. JD Deep Analysis: What Toss Bank Actually Wants
1.1 Core Responsibilities
"Design and implement robust server architectures for ML model serving"
This is the primary responsibility. You will build the systems that accept inference requests, route them to the appropriate model, return predictions, and handle failures gracefully. This involves:
- Designing APIs for synchronous (REST, gRPC) and asynchronous (event-driven) inference
- Building model registries that track model versions, metadata, and deployment status
- Implementing A/B testing and canary deployment infrastructure for models
- Managing model lifecycle: training pipeline integration, model validation, promotion, and rollback
- Shadow mode deployments for new models (serve traffic but do not use predictions)
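The shadow-mode item above can be sketched in a few lines of Python. The `primary`, `shadow`, and `log` objects here are hypothetical stand-ins for real model clients and a prediction logger, not part of any specific framework:

```python
import concurrent.futures

# Long-lived pool; in a real service this would be shared across all requests.
_shadow_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def predict_with_shadow(primary, shadow, features, log):
    """Serve the primary model's prediction and mirror the same request to the
    shadow model, logging its output for offline comparison only."""
    def shadow_call():
        try:
            log.append(("shadow", shadow.predict(features)))
        except Exception:
            pass  # shadow failures must never affect user-facing traffic

    future = _shadow_pool.submit(shadow_call)
    result = primary.predict(features)  # only this value is returned to the caller
    future.result()  # waited on here only to keep the example deterministic;
                     # a real service fires and forgets the shadow call
    return result
```

The key property is that the shadow model's output never reaches the caller; it only lands in the log, where it can later be compared against the primary model's decisions.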
"Handle large-scale traffic with high availability and low latency"
Toss Bank serves millions of users. The ML backend must:
- Handle tens of thousands of requests per second per model endpoint
- Maintain p99 latency under 10ms for critical inference paths (fraud detection)
- Achieve 99.99% availability (roughly 53 minutes of downtime per year)
- Implement circuit breakers, bulkheads, and fallback strategies
- Auto-scale based on traffic patterns (morning peaks, payday spikes)
"Deploy and operate AI model serving infrastructure on Kubernetes"
Kubernetes is the deployment platform. You need to:
- Deploy model serving frameworks (TensorFlow Serving, Triton Inference Server, TorchServe) on K8s
- Configure GPU scheduling and resource quotas for model workloads
- Implement custom HPA (Horizontal Pod Autoscaler) metrics based on inference latency and queue depth
- Manage model artifacts with persistent volumes or object storage integration
- Set up canary deployments using Istio or Argo Rollouts
"Build and maintain MSA (Microservice Architecture) with distributed tracing"
The ML platform is built as microservices. You need to:
- Design service boundaries for feature computation, model inference, post-processing, and logging
- Implement distributed tracing with OpenTelemetry across all services
- Set up service mesh (Istio) for traffic management, mTLS, and observability
- Build centralized logging with ELK or Loki stack
- Implement health checks, readiness probes, and liveness probes for all services
"Develop core platform services in Python, Kotlin, and Go"
Toss Bank uses a polyglot tech stack:
- Python: ML model serving, feature engineering, data pipelines
- Kotlin: Core backend services, Spring Boot applications, API gateways
- Go: High-performance infrastructure services, CLI tools, operators
You are not expected to be an expert in all three, but you need working proficiency in at least two and willingness to learn the third.
1.2 Required Qualifications
"5+ years of backend engineering experience"
This is a mid-to-senior role. They want someone who has built and operated production systems, not just written code. Experience with production incidents, performance optimization under pressure, and architectural decision-making is expected.
"Experience designing and operating microservice architectures"
You should be able to discuss service decomposition strategies, inter-service communication patterns (sync vs async), distributed transaction management (saga pattern, eventual consistency), and the operational overhead of microservices (deployment pipelines, monitoring, debugging).
"Proficiency in at least one of Python, Kotlin, or Go"
Deep expertise in one language is more valuable than shallow knowledge of all three. However, you should be able to read and understand code in the other two languages. The interview will likely focus on your strongest language but may include pair programming in another.
"Understanding of ML model serving concepts"
You do not need to train models, but you need to understand:
- Model formats (SavedModel, ONNX, TorchScript) and their trade-offs
- Batch inference vs online inference
- Feature stores and feature computation pipelines
- Model monitoring (data drift, concept drift, performance degradation)
- A/B testing methodology for ML models
"Kubernetes operational experience"
This means more than kubectl apply. You should understand:
- StatefulSets vs Deployments for model serving
- Resource requests and limits for GPU workloads
- Network policies and service mesh configuration
- Custom Resource Definitions (CRDs) and operators
- Cluster autoscaling and node pool management
1.3 Preferred Qualifications
"Experience with GPU infrastructure for ML inference"
GPU resource management is a specialized skill. Understanding NVIDIA GPU scheduling, CUDA, multi-instance GPU (MIG) partitioning, and GPU monitoring (DCGM) sets you apart.
"Contributions to open-source ML infrastructure projects"
Projects like KServe, Seldon Core, MLflow, or Kubeflow contributions demonstrate that you understand the ML infrastructure ecosystem.
"Experience with real-time feature serving"
Feature stores like Feast or Tecton, and the ability to serve features with sub-millisecond latency, are highly valued for online ML inference.
2. Tech Stack Deep Dive
2.1 Server Architecture Patterns for ML Serving
Synchronous Inference Architecture
The most common pattern for real-time ML inference:
Client -> API Gateway -> Model Router -> Model Server -> Response
                              |
                        Feature Store
                       (Redis/DynamoDB)
// Kotlin - Model Router Service
@RestController
@RequestMapping("/api/v1/predict")
class PredictionController(
    private val modelRouter: ModelRouter,
    private val featureService: FeatureService,
    private val metricsService: MetricsService
) {
    @PostMapping("/{modelName}")
    suspend fun predict(
        @PathVariable modelName: String,
        @RequestBody request: PredictionRequest
    ): ResponseEntity<PredictionResponse> {
        val timer = metricsService.startTimer("prediction_latency")
        try {
            // 1. Fetch features
            val features = featureService.getFeatures(
                request.entityId,
                modelName
            )
            // 2. Route to appropriate model version
            val modelEndpoint = modelRouter.route(
                modelName,
                request.headers
            )
            // 3. Call model server
            val prediction = modelEndpoint.predict(features)
            // 4. Log prediction for monitoring
            metricsService.recordPrediction(modelName, prediction)
            return ResponseEntity.ok(prediction)
        } catch (e: ModelUnavailableException) {
            // Fallback to default model or rule-based system
            return ResponseEntity.ok(fallbackPrediction(modelName, request))
        } finally {
            timer.stop()
        }
    }
}
Asynchronous Inference Architecture
For non-real-time workloads (batch scoring, pre-computation):
Producer -> Kafka Topic -> Flink/Consumer -> Model Server -> Result Topic
                                  |
                            Feature Store
# Python - Async inference worker
from confluent_kafka import Consumer, Producer
import tritonclient.grpc as grpc_client

class InferenceWorker:
    def __init__(self, config):
        self.consumer = Consumer(config.kafka_consumer_config)
        self.producer = Producer(config.kafka_producer_config)
        self.triton = grpc_client.InferenceServerClient(
            url=config.triton_url
        )

    def process_batch(self, messages):
        # Batch multiple requests for efficient GPU utilization
        inputs = self._prepare_batch_inputs(messages)
        # Triton Inference Server call
        result = self.triton.infer(
            model_name="fraud_detection_v3",
            inputs=inputs,
            outputs=[grpc_client.InferRequestedOutput("predictions")]
        )
        predictions = result.as_numpy("predictions")
        # Publish results
        for msg, pred in zip(messages, predictions):
            self.producer.produce(
                topic="prediction-results",
                key=msg.key(),
                value=self._serialize_prediction(pred)
            )
        self.producer.flush()
2.2 Kubernetes for ML Workloads
GPU Scheduling on Kubernetes
# GPU-enabled model serving deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-serving
  labels:
    app: fraud-model
    version: v3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-model
  template:
    metadata:
      labels:
        app: fraud-model
        version: v3
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8002'
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: '16Gi'
              nvidia.com/gpu: '1'
            limits:
              cpu: '8'
              memory: '32Gi'
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 15
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-artifacts-pvc
      nodeSelector:
        gpu-type: nvidia-a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
Custom HPA for ML Workloads
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model-serving
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_request_queue_depth
        target:
          type: AverageValue
          averageValue: '10'
    - type: Pods
      pods:
        metric:
          name: inference_latency_p99_ms
        target:
          type: AverageValue
          averageValue: '8'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
2.3 Model Serving Frameworks
Triton Inference Server
Triton (by NVIDIA) supports multiple frameworks and provides advanced features:
# Model configuration for Triton: config.pbtxt in the model repository
name: "fraud_detection"
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 128 ]
  }
]
output [
  {
    name: "probability"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}
TensorFlow Serving
# Model export for TF Serving
import tensorflow as tf

model = tf.keras.models.load_model("fraud_model")

# Wrap the forward pass in a tf.function to build a serving signature
# (model.predict is a NumPy convenience method and cannot be exported directly)
serving_fn = tf.function(
    lambda features: model(features)
).get_concrete_function(
    tf.TensorSpec(shape=[None, 128], dtype=tf.float32, name="features")
)

# Export as SavedModel with signature
tf.saved_model.save(
    model,
    "exported_model/1",
    signatures={"serving_default": serving_fn},
)
KServe on Kubernetes
KServe (formerly KFServing) provides a standardized ML serving layer on K8s:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: 's3://models/fraud-detection/v3'
      resources:
        requests:
          cpu: '2'
          memory: '8Gi'
          nvidia.com/gpu: '1'
    minReplicas: 2
    maxReplicas: 10
  transformer:
    containers:
      - name: feature-transformer
        image: registry.tossbank.com/feature-transformer:v1
        resources:
          requests:
            cpu: '1'
            memory: '2Gi'
2.4 Feature Store Architecture
A feature store provides consistent feature computation for training and serving:
# Feature definition with Feast
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define entity
user = Entity(
    name="user_id",
    join_keys=["user_id"],
)

# Define feature view
user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    ttl=timedelta(hours=24),
    schema=[
        Field(name="transaction_count_7d", dtype=Float32),
        Field(name="avg_transaction_amount_30d", dtype=Float32),
        Field(name="max_transaction_amount_7d", dtype=Float32),
        Field(name="unique_merchants_30d", dtype=Int64),
        Field(name="late_night_transaction_ratio", dtype=Float32),
    ],
    online=True,
    source=FileSource(
        path="s3://features/user_transactions.parquet",
        timestamp_field="event_timestamp",
    ),
)
// Go - High-performance feature serving endpoint
package main

import (
    "context"
    "encoding/json"
    "net/http"
    "time"

    "github.com/redis/go-redis/v9"
)

type FeatureServer struct {
    redis *redis.Client
}

type FeatureRequest struct {
    EntityID string   `json:"entity_id"`
    Features []string `json:"features"`
}

type FeatureResponse struct {
    Features map[string]float64 `json:"features"`
    Latency  int64              `json:"latency_us"`
}

func (s *FeatureServer) GetFeatures(
    w http.ResponseWriter, r *http.Request,
) {
    start := time.Now()
    var req FeatureRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    // Batch fetch from Redis with a pipeline
    ctx := context.Background()
    pipe := s.redis.Pipeline()
    cmds := make([]*redis.StringCmd, len(req.Features))
    for i, feature := range req.Features {
        key := req.EntityID + ":" + feature
        cmds[i] = pipe.Get(ctx, key)
    }
    _, err := pipe.Exec(ctx)
    if err != nil && err != redis.Nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    features := make(map[string]float64)
    for i, cmd := range cmds {
        val, err := cmd.Float64()
        if err == nil {
            features[req.Features[i]] = val
        }
    }
    resp := FeatureResponse{
        Features: features,
        Latency:  time.Since(start).Microseconds(),
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(resp)
}
2.5 Distributed Tracing with OpenTelemetry
Tracing across ML microservices is essential for debugging latency issues:
// Kotlin - OpenTelemetry instrumentation
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.SpanKind
import io.opentelemetry.api.trace.StatusCode

class TracedModelInference(
    private val modelClient: ModelClient
) {
    private val tracer = GlobalOpenTelemetry.getTracer("ml-inference")

    suspend fun infer(
        modelName: String,
        features: Map<String, Float>
    ): PredictionResult {
        val span = tracer.spanBuilder("model-inference")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("model.name", modelName)
            .setAttribute("model.features.count", features.size.toLong())
            .startSpan()
        return try {
            span.makeCurrent().use {
                val result = modelClient.predict(modelName, features)
                span.setAttribute("model.version", result.modelVersion)
                span.setAttribute(
                    "model.latency_ms",
                    result.inferenceLatencyMs.toDouble()
                )
                span.setStatus(StatusCode.OK)
                result
            }
        } catch (e: Exception) {
            span.setStatus(StatusCode.ERROR, e.message ?: "Unknown error")
            span.recordException(e)
            throw e
        } finally {
            span.end()
        }
    }
}
2.6 Circuit Breaker Pattern for Model Serving
When a model server becomes unhealthy, the circuit breaker prevents cascading failures:
// Kotlin - Resilience4j circuit breaker for model inference
import io.github.resilience4j.circuitbreaker.CircuitBreaker
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import java.time.Duration

val circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50f)
    .slowCallRateThreshold(80f)
    .slowCallDurationThreshold(Duration.ofMillis(200))
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(100)
    .minimumNumberOfCalls(20)
    .build()

val modelCircuitBreaker = CircuitBreaker.of(
    "fraud-model", circuitBreakerConfig
)

// Usage with fallback
fun predictWithFallback(
    features: Map<String, Float>
): PredictionResult {
    return try {
        modelCircuitBreaker.executeSupplier {
            modelClient.predict("fraud_detection_v3", features)
        }
    } catch (e: Exception) {
        // Fallback to rule-based system when model is unavailable
        ruleBasedFraudDetection(features)
    }
}
2.7 Model Monitoring and Observability
# Python - Model monitoring with custom metrics
from prometheus_client import Counter, Histogram, Gauge
import numpy as np

# Metrics
prediction_counter = Counter(
    "model_predictions_total",
    "Total predictions",
    ["model_name", "model_version"]
)
prediction_latency = Histogram(
    "model_prediction_latency_seconds",
    "Prediction latency",
    ["model_name"],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
)
feature_drift_gauge = Gauge(
    "model_feature_drift_score",
    "Feature drift score (PSI)",
    ["model_name", "feature_name"]
)
prediction_distribution = Histogram(
    "model_prediction_distribution",
    "Distribution of prediction values",
    ["model_name"],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

class ModelMonitor:
    def __init__(self, model_name, reference_data):
        self.model_name = model_name
        self.reference_distributions = self._compute_distributions(
            reference_data
        )

    def record_prediction(self, features, prediction, latency):
        prediction_counter.labels(
            model_name=self.model_name,
            model_version="v3"
        ).inc()
        prediction_latency.labels(
            model_name=self.model_name
        ).observe(latency)
        prediction_distribution.labels(
            model_name=self.model_name
        ).observe(prediction)

    def check_feature_drift(self, recent_features):
        """Calculate PSI (Population Stability Index) for drift detection"""
        for feature_name, values in recent_features.items():
            psi = self._calculate_psi(
                self.reference_distributions[feature_name],
                values
            )
            feature_drift_gauge.labels(
                model_name=self.model_name,
                feature_name=feature_name
            ).set(psi)
            if psi > 0.2:
                self._alert_drift(feature_name, psi)

    def _calculate_psi(self, expected, actual, bins=10):
        expected_hist, edges = np.histogram(expected, bins=bins)
        actual_hist, _ = np.histogram(actual, bins=edges)
        # Laplace smoothing avoids division by zero in empty bins
        expected_pct = (expected_hist + 1) / (sum(expected_hist) + bins)
        actual_pct = (actual_hist + 1) / (sum(actual_hist) + bins)
        psi = sum(
            (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
        )
        return psi
3. Interview Preparation: 30 Questions
3.1 Backend Architecture (Questions 1-10)
Q1. How would you design a model serving system that handles 50,000 requests per second with p99 latency under 10ms?
Start with a multi-tier architecture: an API gateway (Kong or Envoy) for rate limiting and routing, a model router service that directs requests to the correct model version, and model server pods running Triton or TF Serving. Use gRPC instead of REST for inter-service communication to reduce serialization overhead. Pre-load models into GPU memory and use dynamic batching to maximize GPU utilization. Cache frequently used features in Redis with sub-millisecond access. Scale horizontally with custom HPA metrics based on inference queue depth.
Q2. Explain the differences between online inference, batch inference, and near-real-time inference. When would you use each?
Online inference serves predictions synchronously per request, used for real-time decisions like fraud detection (latency requirement: less than 10ms). Batch inference processes large datasets periodically, used for pre-computed recommendations or credit score updates (latency: minutes to hours). Near-real-time inference processes streaming data with small delays, used for event-driven scoring where slight delay is acceptable (latency: 100ms to seconds). The choice depends on latency requirements, cost efficiency, and data freshness needs.
Q3. How do you implement graceful degradation when a model server fails?
Layer 1: Circuit breaker per model endpoint that opens after a threshold of failures. Layer 2: Fallback to a simpler model (a lightweight model or rule-based system) when the primary model is unavailable. Layer 3: Cached predictions for frequently seen inputs. Layer 4: Default safe values for critical paths (e.g., flag for manual review instead of auto-approve). Monitor fallback rates and alert when they exceed thresholds.
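The four layers can be sketched as a single fallback chain. The client objects and the `cache` dict here are hypothetical stand-ins, not a specific framework's API:

```python
def predict_with_degradation(model_client, fallback_model, cache, request_key, features):
    """Layered graceful degradation: primary model -> lightweight fallback
    model -> cached prediction -> safe default (flag for manual review)."""
    try:
        return model_client.predict(features)        # layer 1: primary model
    except Exception:
        pass
    try:
        return fallback_model.predict(features)      # layer 2: simpler model
    except Exception:
        pass
    cached = cache.get(request_key)                  # layer 3: cached prediction
    if cached is not None:
        return cached
    # layer 4: safe default for critical paths — never auto-approve blindly
    return {"decision": "manual_review", "score": None}
```

In production each fallback would also increment a metric, so that an elevated fallback rate triggers an alert.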
Q4. Describe the saga pattern and how you would apply it to an ML pipeline that writes to multiple services.
The saga pattern manages distributed transactions by breaking them into a sequence of local transactions, each with a compensating action. In an ML pipeline: Step 1 writes the prediction to the database (compensate: delete). Step 2 publishes the prediction event to Kafka (compensate: publish rollback event). Step 3 updates the feature store with the new data point (compensate: revert feature). If any step fails, the compensating actions execute in reverse order. Use an orchestrator (saga execution coordinator) or choreography (event-driven).
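An orchestrated saga can be sketched generically as a list of (action, compensate) pairs; this is an illustrative skeleton, not a specific saga library:

```python
def run_saga(steps):
    """Execute (action, compensate) pairs in order. If any action fails,
    run the compensating actions of the completed steps in reverse order.
    Returns True on success, False after a rollback."""
    done = []
    try:
        for action, compensate in steps:
            action()            # local transaction for this step
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()        # undo completed steps, newest first
        return False
    return True
```

For the ML pipeline above, the steps would be (write prediction, delete prediction), (publish event, publish rollback event), and (update feature store, revert feature).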
Q5. How would you handle zero-downtime deployment of a new model version?
Use canary deployment: deploy the new model version alongside the current one. Route a small percentage (1-5%) of traffic to the new version. Monitor prediction quality metrics (accuracy, latency, error rate) for the canary. Gradually increase traffic to the new version if metrics are healthy. Use Argo Rollouts or Istio traffic splitting for Kubernetes-native canary deployments. Maintain the ability to instantly rollback by keeping the old model version running.
Q6. What is the difference between horizontal and vertical scaling for model serving? When would you prefer each?
Horizontal scaling adds more model server replicas (more pods). It is preferred for CPU-bound models and when you need to handle more concurrent requests. Vertical scaling adds more resources (GPU, memory) to existing instances. It is preferred for large models that do not fit in a single GPU (model parallelism) or when model loading time makes horizontal scaling slow. In practice, use horizontal scaling for most workloads and vertical scaling for very large models (LLMs).
Q7. How do you implement request-level tracing across an ML inference pipeline with multiple microservices?
Use OpenTelemetry as the instrumentation standard. Inject a trace context (trace ID, span ID) at the API gateway. Each service extracts the context, creates a child span, and propagates the context to downstream calls. Instrument key operations: feature fetching, model inference, post-processing. Send traces to Jaeger or Tempo. Add custom attributes to spans: model name, model version, feature count, prediction value. Use trace-based alerting for latency anomalies.
Q8. Explain the CAP theorem trade-offs in a feature store that serves both training and online inference.
A feature store must be consistent (training and serving use the same features), available (online inference cannot wait), and partition-tolerant. For online serving: prioritize AP with eventual consistency — serve from Redis with best-effort freshness. For training: prioritize CP — use a batch store (S3, BigQuery) with strict point-in-time correctness. Bridge the gap with a dual-write architecture: features are written to both the online store (Redis) and the offline store (S3) with reconciliation checks.
Q9. How do you prevent cascading failures in a microservice architecture?
Implement bulkheads to isolate failure domains (separate thread pools per downstream service). Use circuit breakers with configurable thresholds. Implement timeouts on all external calls (no unbounded waits). Use retry with exponential backoff and jitter. Limit request queues with bounded buffers (reject excess load instead of queuing indefinitely). Health check endpoints that downstream services can probe. Load shedding at the API gateway when the system is overloaded.
Q10. Describe your approach to API versioning for ML model endpoints.
Use URL-based versioning (e.g., /v1/predict, /v2/predict) for breaking changes. Use header-based versioning for minor changes. The model router maps API versions to model versions. Maintain backward compatibility: v1 callers should continue working when v2 is deployed. Deprecation policy: announce 3 months before removing an API version. Include version information in all responses for debugging. Run both versions simultaneously during the migration period.
3.2 ML Serving (Questions 11-15)
Q11. Compare TensorFlow Serving, Triton Inference Server, and TorchServe. When would you use each?
TensorFlow Serving is optimized for TF models, has mature gRPC API, and supports model versioning out of the box. Use it for TF-only shops. Triton Inference Server supports multiple frameworks (TF, PyTorch, ONNX, TensorRT), offers dynamic batching, and has the best GPU utilization. Use it for multi-framework environments or when GPU efficiency matters most. TorchServe is the native serving solution for PyTorch models, integrates with TorchScript, and has the simplest setup. Use it for PyTorch-only workloads with simpler requirements.
Q12. What is dynamic batching and why is it important for GPU-based inference?
Dynamic batching groups multiple individual inference requests into a single batch before sending to the GPU. GPUs are optimized for parallel computation and process a batch of 32 almost as fast as a single input. Without batching, GPU utilization can be as low as 5-10%. Dynamic batching adds a small delay (configurable, typically 100-500 microseconds) to wait for more requests before executing. The trade-off is slightly higher latency per request but dramatically higher throughput.
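The queue-and-flush logic behind dynamic batching can be illustrated with a small single-threaded sketch. The function and parameter names are illustrative, not Triton's API; `run_batch` stands in for the actual GPU call:

```python
import queue
import time

def micro_batcher(requests, run_batch, max_batch=32, max_delay_s=0.0005):
    """Group individual requests into batches: flush a batch when it is full
    or when max_delay_s has elapsed since its first request was queued."""
    q = queue.Queue()
    for r in requests:
        q.put(r)
    results = []
    while not q.empty():
        batch = [q.get()]
        deadline = time.monotonic() + max_delay_s
        # Wait briefly for more requests to fill the batch
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                batch.append(q.get(timeout=max(0, deadline - time.monotonic())))
            except queue.Empty:
                break
        results.extend(run_batch(batch))  # one GPU call for the whole batch
    return results
```

The `max_delay_s` knob is the latency/throughput trade-off described above: a longer wait yields fuller batches and higher GPU utilization at the cost of per-request latency.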
Q13. Explain the ONNX format and when you would use it.
ONNX (Open Neural Network Exchange) is a cross-framework model format. A model trained in PyTorch can be exported to ONNX and served by TF Serving, Triton, or ONNX Runtime. Benefits: framework-agnostic serving, optimizations through ONNX Runtime (graph optimization, quantization), deployment flexibility. Use ONNX when you want to decouple training framework choice from serving infrastructure, or when ONNX Runtime provides better performance than native serving.
Q14. How do you implement A/B testing for ML models in production?
At the infrastructure level: deploy both model versions as separate endpoints. The model router splits traffic based on a hash of the user ID (ensures consistent treatment per user). Log predictions with model version, user ID, and timestamp. On the analytics side: define success metrics (click-through rate, conversion, fraud detection rate), calculate statistical significance, and monitor guardrail metrics (latency, error rate). Use Bayesian or frequentist methods depending on traffic volume. Minimum experiment duration: 1-2 weeks for most ML models.
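The deterministic hash-based split can be sketched in a few lines; the function and argument names are illustrative:

```python
import hashlib

def assign_variant(user_id, experiment, treatment_pct=5):
    """Deterministically assign a user to control/treatment by hashing the
    (experiment, user_id) pair, so each user always sees the same model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"
```

Salting the hash with the experiment name is what prevents the same users from always landing in the treatment group across different experiments.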
Q15. What is model warm-up and why does it matter?
Model warm-up is the process of sending dummy inference requests to a newly deployed model before routing real traffic. Without warm-up, the first real requests experience high latency because: the model must be loaded from disk to memory (or GPU), TensorFlow and PyTorch perform lazy initialization (compilation on first call), caches (CPU cache, GPU cache) are cold. Warm-up eliminates this cold-start penalty. In Triton, configure warm-up samples via the model_warmup field in config.pbtxt. In TF Serving, place warm-up requests in the model's assets.extra directory.
3.3 Kubernetes and Infrastructure (Questions 16-20)
Q16. How do you manage GPU resources on Kubernetes for model serving?
Install the NVIDIA device plugin for K8s. Define GPU resource requests and limits in pod specs. Use node selectors and tolerations to schedule GPU workloads on GPU nodes. Implement MIG (Multi-Instance GPU) for sharing a single A100 across multiple pods. Monitor GPU utilization, memory, and temperature with DCGM-exporter and Prometheus. Set up cluster autoscaler with GPU node pools that scale based on pending GPU pod requests.
Q17. Compare StatefulSet and Deployment for model serving. When would you use each?
Use Deployments for stateless model servers where any replica can handle any request (most common). Use StatefulSets when models require persistent local storage for model caching (each pod maintains its own model cache), when you need stable network identities for model ensembles, or when model sharding requires deterministic pod-to-shard mapping. In practice, most model serving workloads use Deployments with external storage (S3, PVC) for model artifacts.
Q18. How would you implement canary deployment for a model on Kubernetes?
Option 1 — Istio traffic splitting: Deploy new model version as a separate deployment. Create Istio VirtualService rules that split traffic by percentage. Gradually shift traffic from stable to canary. Option 2 — Argo Rollouts: Define a Rollout resource with canary strategy. Configure analysis templates that check model metrics. Auto-promote or rollback based on analysis results. Option 3 — KServe canary: Use KServe InferenceService with canaryTrafficPercent field. Both options integrate with Prometheus metrics for automated decision-making.
Q19. How do you handle model artifact storage and versioning on Kubernetes?
Use an object store (S3, GCS, MinIO) as the source of truth for model artifacts. Each model version is stored at a unique path: s3://models/model-name/version/. Use a model registry (MLflow Model Registry, custom) to track metadata: version, metrics, training run ID, approval status. For K8s: use init containers that download the model from object storage on pod startup, or mount object storage directly using CSI drivers. Cache frequently used models on local NVMe storage for faster loading.
Q20. Explain how you would set up a service mesh for ML microservices.
Install Istio with sidecar injection. Configure mTLS for all service-to-service communication (zero-trust security). Use VirtualService for traffic routing (canary, A/B testing). Use DestinationRule for circuit breaking and outlier detection. Enable Envoy access logging for request-level visibility. Set up Kiali for service mesh visualization. Configure rate limiting at the mesh level. Use PeerAuthentication for strict mTLS enforcement. Monitor mesh health with Istio metrics exported to Prometheus.
3.4 Distributed Systems (Questions 21-25)
Q21. How do you ensure exactly-once processing in an ML prediction pipeline that spans multiple services?
True exactly-once is very hard across services. Instead, aim for effectively-once by making each service idempotent. Use a unique request ID generated at the gateway and propagated through all services. Each service checks if it has already processed this request ID (idempotency key in Redis or database). For Kafka-based pipelines, use Kafka transactions. For database writes, use upserts with the request ID as the unique constraint. Log and reconcile to catch edge cases.
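The idempotency-key check can be sketched with a plain dict standing in for Redis or a database table; the names here are illustrative:

```python
def process_once(store, request_id, handler, payload):
    """Effectively-once processing: if this request_id was already handled,
    return the stored result instead of re-running the side effect."""
    if request_id in store:
        return store[request_id]   # duplicate delivery: replay the prior result
    result = handler(payload)      # the actual (side-effecting) work
    store[request_id] = result     # record the result under the idempotency key
    return result
```

With Redis, the check-and-set would be done atomically (for example with SET NX) so that two concurrent duplicates cannot both run the handler.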
Q22. How would you design a feature computation pipeline that serves both real-time and batch consumers?
Use the Lambda or Kappa architecture. Lambda: a speed layer (Flink) computes real-time features from streaming data, while a batch layer (Spark) recomputes features periodically for accuracy. Kappa: use only Flink with a reprocessing capability. In both cases, write features to a dual-store: Redis for online serving, S3/BigQuery for offline training. Ensure schema consistency between real-time and batch features with a shared feature definition registry. Run automated consistency checks between the two stores.
Q23. Explain the challenges of distributed logging in a microservice architecture and your solution.
Challenges: log volume (millions of log lines per minute), correlation across services, log format inconsistency, and storage cost. Solution: structured logging in JSON format with mandatory fields (trace ID, service name, timestamp, severity). Ship logs via Fluent Bit sidecar to Kafka, then to OpenSearch or Loki. Implement sampling for verbose logs (log 10% of debug-level requests). Use log-based alerting for error patterns. Set retention policies: 7 days hot, 30 days warm, 90 days cold in S3.
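The structured-logging piece of that solution can be sketched with the Python standard library: a formatter that emits one JSON object per line with the mandatory fields. The field names here are illustrative, not a fixed schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Minimal structured formatter: every log line is one JSON object
    carrying the mandatory correlation fields."""
    def format(self, record):
        return json.dumps({
            "trace_id": getattr(record, "trace_id", "unknown"),
            "service": getattr(record, "service", "unknown"),
            "severity": record.levelname,
            "message": record.getMessage(),
        })

formatter = JsonFormatter()
record = logging.LogRecord(
    name="api", level=logging.ERROR, pathname="", lineno=0,
    msg="model timeout", args=None, exc_info=None,
)
record.trace_id = "abc-123"      # propagated from the inbound request
record.service = "model-router"

parsed = json.loads(formatter.format(record))
assert parsed["trace_id"] == "abc-123"
assert parsed["severity"] == "ERROR"
```

Because every line is parseable JSON with a trace ID, Fluent Bit can ship it unchanged and OpenSearch/Loki can index and join it across services.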
Q24. How do you handle clock skew in distributed tracing across services?
Use NTP synchronization on all nodes with tight drift targets (under 1ms). In trace collection, rely on agent-side timestamp adjustment in Jaeger or Tempo. For causality ordering when clocks disagree, use Lamport timestamps or vector clocks. In practice: point all K8s nodes at the same NTP server pool, configure chrony with a tight maxdist, and alert when clock drift exceeds 5ms. OpenTelemetry SDKs handle most cases by using monotonic clocks for duration measurement.
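The Lamport-timestamp idea mentioned above fits in a few lines. This is the textbook algorithm, not anything Jaeger-specific: tick on local events, take the max on message receipt.

```python
class LamportClock:
    """Logical clock for causal ordering when wall clocks disagree."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event (e.g. emitting a span): advance the clock.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # A received message's timestamp pushes our clock forward past it.
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # service A emits a span at logical time 1
t_recv = b.receive(t_send)   # service B receives it
assert t_recv > t_send       # causal order holds regardless of wall clocks
```

Logical clocks give you ordering, not durations; that is why the answer still leans on NTP and monotonic clocks for latency measurement.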
Q25. How would you design a rate limiter for an ML prediction API?
Use a token bucket or sliding window algorithm. Implement at two levels: per-client rate limiting at the API gateway (Kong, Envoy) using Redis for shared state, and per-model rate limiting at the model router to prevent a single model from consuming all resources. Configuration: allow burst traffic (bucket capacity), set sustained rate per client, and implement priority queues for critical requests (fraud detection gets higher priority than recommendations). Return 429 status with Retry-After header when limits are exceeded.
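The token bucket algorithm itself is small enough to sketch. This in-memory version illustrates the burst-vs-sustained-rate behavior described above; a real gateway deployment would keep the token state in Redis (often via a Lua script) so all replicas share it.

```python
class TokenBucket:
    """Token bucket: `capacity` allows bursts, `refill_rate` caps the
    sustained request rate (tokens per second)."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, then try to spend one token.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
# Burst of 4 requests at t=0: the first 3 pass, the 4th gets a 429.
results = [bucket.allow(0.0) for _ in range(4)]
assert results == [True, True, True, False]
# Two seconds later, two tokens have refilled.
assert bucket.allow(2.0) is True
```

A rejected request is where you return 429 with a Retry-After header; the priority-queue behavior would sit in front of this check.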
3.5 Python, Kotlin, and Go (Questions 26-30)
Q26. Compare Python's asyncio, Kotlin coroutines, and Go goroutines for concurrent ML serving.
Python asyncio is single-threaded with cooperative multitasking. Good for I/O-bound work (waiting for model server responses) but limited by the GIL for CPU-bound work. Use with uvicorn/FastAPI for async inference endpoints. Kotlin coroutines are lightweight threads on the JVM with structured concurrency. Good for Spring Boot services that need to call multiple downstream services concurrently. Go goroutines are lightweight green threads with preemptive scheduling. Excellent for high-concurrency infrastructure services (feature servers, API gateways) with minimal overhead.
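The asyncio case is worth a concrete sketch: fanning out to several model servers concurrently, which is exactly the I/O-bound pattern where asyncio shines. `asyncio.sleep` stands in for a real network call (e.g. an httpx request), and the model names and scores are made up.

```python
import asyncio

async def call_model(name, delay, value):
    # Stand-in for an async inference call; the sleep simulates
    # network plus inference time.
    await asyncio.sleep(delay)
    return name, value

async def fan_out():
    # Fire all downstream calls concurrently; total wall time is
    # roughly the slowest call, not the sum of all three.
    results = await asyncio.gather(
        call_model("fraud", 0.01, 0.97),
        call_model("credit", 0.02, 0.55),
        call_model("reco", 0.01, 0.73),
    )
    return dict(results)

scores = asyncio.run(fan_out())
assert scores == {"fraud": 0.97, "credit": 0.55, "reco": 0.73}
```

Kotlin's `async`/`await` with coroutines and Go's goroutines-plus-channels express the same fan-out; the differences are in scheduling (cooperative vs preemptive) and what happens when the work turns CPU-bound.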
Q27. How would you implement a connection pool for gRPC model server calls in Kotlin?
Use a channel pool that maintains multiple gRPC channels to the model server. Each channel can multiplex multiple requests. Configure max connections per endpoint, idle timeout, and health checking:
import io.grpc.ConnectivityState
import io.grpc.ManagedChannel
import io.grpc.ManagedChannelBuilder
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.TimeUnit

class GrpcChannelPool(
    private val endpoint: String,
    private val poolSize: Int = 10
) {
    private val channels = ConcurrentLinkedQueue<ManagedChannel>()

    init {
        repeat(poolSize) { channels.offer(createNewChannel()) }
    }

    private fun createNewChannel(): ManagedChannel =
        ManagedChannelBuilder
            .forTarget(endpoint)
            .usePlaintext() // use TLS in production
            .keepAliveTime(30, TimeUnit.SECONDS)
            .maxInboundMessageSize(16 * 1024 * 1024) // 16 MB for large tensors
            .build()

    // Borrow a channel; create a fresh one if the pool is empty.
    fun getChannel(): ManagedChannel =
        channels.poll() ?: createNewChannel()

    // Return a channel to the pool unless it has been shut down.
    fun returnChannel(channel: ManagedChannel) {
        if (channel.getState(false) != ConnectivityState.SHUTDOWN) {
            channels.offer(channel)
        }
    }
}
Q28. Write a Go middleware that implements request logging and latency tracking for an ML API.
import (
	"net/http"
	"strconv"
	"time"

	"github.com/google/uuid"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/rs/zerolog/log"
)

// Prometheus collectors (register with prometheus.MustRegister at startup).
var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "http_requests_total"},
		[]string{"method", "path", "status"})
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "http_request_duration_seconds"},
		[]string{"method", "path"})
)

// statusResponseWriter captures the status code set by downstream handlers.
type statusResponseWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusResponseWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

func MetricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		traceID := r.Header.Get("X-Trace-ID")
		if traceID == "" {
			traceID = uuid.New().String()
		}
		// Wrap the response writer to capture the status code.
		wrapped := &statusResponseWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(wrapped, r)
		duration := time.Since(start)
		// Record metrics.
		httpRequestsTotal.WithLabelValues(
			r.Method, r.URL.Path,
			strconv.Itoa(wrapped.status)).Inc()
		httpRequestDuration.WithLabelValues(
			r.Method, r.URL.Path).Observe(duration.Seconds())
		// Structured log with correlation ID (zerolog).
		log.Info().
			Str("trace_id", traceID).
			Str("method", r.Method).
			Str("path", r.URL.Path).
			Int("status", wrapped.status).
			Dur("latency", duration).
			Msg("request completed")
	})
}
Q29. How would you structure a Python ML serving application for testability?
Separate concerns into layers: API layer (FastAPI routes), service layer (business logic, feature assembly), model layer (model loading, inference). Use dependency injection for all external dependencies (model client, feature store, database). Create interfaces (Protocol classes in Python) for model inference so you can mock them in tests. Test each layer independently: unit tests for business logic, integration tests with a test model server, end-to-end tests with the full pipeline. Use pytest fixtures for common test setup.
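The Protocol-based dependency injection described above looks like this in practice. A minimal sketch with invented names (`ModelClient`, `FraudService`, `FakeModel`); the point is that the service layer depends only on the interface.

```python
from typing import Protocol

class ModelClient(Protocol):
    """Interface for inference, so the service layer never imports a
    concrete model-server client."""
    def predict(self, features: dict) -> float: ...

class FraudService:
    def __init__(self, model: ModelClient, threshold: float = 0.9):
        self.model = model
        self.threshold = threshold

    def is_fraud(self, features: dict) -> bool:
        # Business logic stays unit-testable: inject any ModelClient.
        return self.model.predict(features) >= self.threshold

class FakeModel:
    """Test double used in unit tests instead of a live model server."""
    def __init__(self, score: float):
        self.score = score

    def predict(self, features: dict) -> float:
        return self.score

# Unit tests exercise the threshold logic with no network or model runtime.
assert FraudService(FakeModel(0.95)).is_fraud({"amount": 9999}) is True
assert FraudService(FakeModel(0.10)).is_fraud({"amount": 12}) is False
```

Because `Protocol` uses structural typing, `FakeModel` never has to inherit from anything; the real gRPC client satisfies the same interface by simply having a matching `predict` method.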
Q30. When would you use Python vs Kotlin vs Go for an ML platform service?
Python: model serving wrappers, feature engineering pipelines, data validation, any service that needs tight integration with ML libraries (scikit-learn, pandas, TensorFlow). Kotlin: API gateways, orchestration services, anything that benefits from the JVM ecosystem (Spring Boot, Kafka client, JDBC), services that need to integrate with existing Java infrastructure. Go: high-performance stateless services (feature servers, request routers), CLI tools, Kubernetes operators, any service where startup time and memory footprint matter. The key principle: use the right language for the job, not the language you are most comfortable with.
4. Six-Month Study Roadmap
Month 1: Backend Fundamentals
Week 1-2: Server Architecture
- Study clean architecture and hexagonal architecture patterns
- Build a REST API in Kotlin with Spring Boot (CRUD, validation, error handling)
- Implement the same API in Go with Gin or Echo framework
- Compare performance characteristics using load testing with k6
Week 3-4: Concurrency and Performance
- Study Kotlin coroutines with the official guide and Kotlin in Action
- Implement a concurrent HTTP client in Go using goroutines and channels
- Build a Python async API with FastAPI and asyncio
- Profile all three implementations: memory usage, latency under load, CPU utilization
Month 2: Kubernetes Deep Dive
Week 1-2: Core K8s Concepts
- Complete the CKA (Certified Kubernetes Administrator) curriculum
- Deploy a multi-service application on Minikube or kind
- Implement Deployments, Services, ConfigMaps, Secrets, and PVCs
- Practice with resource requests, limits, and QoS classes
Week 3-4: Advanced K8s for ML
- Set up GPU scheduling with NVIDIA device plugin (on cloud or local GPU)
- Configure HPA with custom metrics from Prometheus
- Implement canary deployments with Argo Rollouts
- Set up Istio service mesh with traffic splitting
Month 3: ML Serving Fundamentals
Week 1-2: Model Serving Frameworks
- Deploy TensorFlow Serving with a sample model
- Deploy Triton Inference Server and test with ONNX model
- Compare latency and throughput between the two
- Implement dynamic batching and measure the throughput improvement
Week 3-4: KServe and Model Management
- Install KServe on a Kubernetes cluster
- Deploy a model with InferenceService
- Implement canary deployment for a model version upgrade
- Set up model monitoring with Prometheus metrics
Month 4: Distributed Systems
Week 1-2: Theory and Patterns
- Read "Designing Data-Intensive Applications" chapters 5, 8, 9
- Study microservice patterns: circuit breaker, bulkhead, saga, CQRS
- Implement a circuit breaker in Kotlin using Resilience4j
- Practice distributed systems design questions
Week 3-4: Observability
- Set up OpenTelemetry instrumentation in a multi-service application
- Deploy Jaeger or Tempo for distributed tracing
- Set up Prometheus + Grafana for metrics collection and dashboards
- Implement structured logging with correlation IDs across services
Month 5: Feature Stores and Data Pipelines
Week 1-2: Feature Store
- Deploy Feast locally and define feature views
- Implement online feature serving with Redis backend
- Build a feature computation pipeline with Python
- Test point-in-time correctness for training data
Week 3-4: End-to-End ML Pipeline
- Build a complete ML serving pipeline: feature store to model server to API
- Implement A/B testing infrastructure with traffic splitting
- Add model monitoring (data drift, prediction distribution)
- Set up automated model retraining triggers
Month 6: Integration and Interview Prep
Week 1-2: Portfolio Project
- Build a production-quality ML serving platform with all components
- Document the architecture with diagrams and decision records
- Create a monitoring dashboard with key ML metrics
- Write load tests that validate SLA requirements
Week 3-4: Interview Preparation
- Practice all 30 interview questions with a partner
- Do 5 system design mock interviews focused on ML infrastructure
- Review and refine your resume with quantifiable metrics
- Prepare a 5-minute architecture walkthrough of your portfolio project
- Study Toss Bank tech blog posts for culture and technical context
5. Resume Strategy
What Toss Bank Wants to See
Your resume should prove that you can build and operate ML infrastructure at scale. Frame everything in terms of impact.
Lead with Scale Metrics
- "Designed and operated ML serving infrastructure handling 100K predictions per second with p99 latency under 8ms"
- "Reduced model deployment time from 2 days to 15 minutes by building a KServe-based CI/CD pipeline"
- "Built a feature store serving 50M features per day with sub-millisecond latency using Redis cluster"
Show Polyglot Experience
- List primary language first with depth indicators ("5 years Python, production ML systems")
- Include secondary languages with context ("Kotlin for Spring Boot microservices", "Go for high-performance tools")
- Highlight language selection decisions: "Chose Go for the feature server due to low latency requirements, reducing p99 from 5ms to 0.8ms"
Demonstrate Operational Maturity
- On-call experience for ML systems (model failures, data drift incidents)
- Cost optimization: GPU utilization improvements, right-sizing model serving infrastructure
- Zero-downtime deployments and migration stories
- Monitoring and alerting setup for ML-specific metrics
Resume Format
- 2 pages maximum
- "Technical Skills" section that maps directly to the JD keywords
- XYZ format for achievements: "Accomplished X by implementing Y, resulting in Z"
- Include links to relevant open-source contributions or public talks
- GitHub profile with at least one relevant project
6. Portfolio Project Ideas
Project 1: Full-Stack ML Serving Platform
Build a complete ML serving platform that demonstrates all the JD requirements:
- Model Server: Triton Inference Server on Kubernetes with GPU support
- API Gateway: Kotlin Spring Boot with gRPC model client
- Feature Store: Redis-backed feature serving with Python feature pipelines
- Monitoring: Prometheus + Grafana dashboards with model-specific metrics
- Canary Deployments: Argo Rollouts with automated analysis
- Distributed Tracing: OpenTelemetry across all services
Project 2: Real-Time Fraud Detection System
Build an end-to-end fraud detection system:
- Kotlin API that accepts transaction events
- Go feature server that computes real-time features
- Python model serving with ensemble of models
- A/B testing infrastructure for model comparison
- Circuit breaker with fallback to rule-based detection
- Dashboard showing detection rate, false positive rate, latency
Project 3: Model Deployment Pipeline
Build a CI/CD pipeline for ML models:
- MLflow for model tracking and registry
- KServe for model serving with canary support
- Automated model validation (schema check, performance benchmark)
- Shadow mode deployment for new models
- Automated rollback on metric degradation
- Slack/webhook notifications for deployment events
Project 4: Kubernetes Operator for ML Workloads
Build a custom K8s operator in Go:
- Custom Resource Definition for MLModel resources
- Controller that manages model serving deployments
- Automatic scaling based on inference metrics
- Model artifact syncing from S3
- Health checking and auto-recovery for model pods
- Integration with Prometheus for metrics
7. Quiz: Test Your Knowledge
Q1: Your ML serving system handles 10K requests per second. You deploy a new model version that increases inference latency from 5ms to 50ms. What happens to the system and how do you respond?
With 10K RPS and 50ms latency per request, the system needs 500 concurrent connections (up from 50). This can exhaust thread pools, connection limits, and downstream capacity, causing cascading timeouts. Immediate response: roll back the canary deployment to the old model version. Then investigate: Is the model larger (more GPU memory needed)? Is it not optimized (needs TensorRT optimization or quantization)? Is it making extra network calls? Fix the latency issue before redeploying. Always have automated canary analysis that catches latency regressions before they reach 100% traffic.
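The concurrency arithmetic above is Little's law: in-flight requests = arrival rate x latency. Worth checking explicitly, since it drives capacity planning for every latency regression:

```python
def required_concurrency(rps, latency_seconds):
    """Little's law: average in-flight requests = arrival rate x latency."""
    return rps * latency_seconds

# Old model: 10K RPS at 5ms -> 50 in-flight requests.
assert required_concurrency(10_000, 0.005) == 50
# New model: 10K RPS at 50ms -> 500, a 10x jump in needed capacity.
assert required_concurrency(10_000, 0.050) == 500
```

Any fixed-size resource sized for 50 in-flight requests (thread pools, connection pools, downstream quotas) becomes the bottleneck the moment latency grows tenfold.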
Q2: You notice that your fraud detection model's precision drops from 95% to 80% over two weeks, but the model itself has not changed. What is happening?
This is data drift or concept drift. Possible causes: (1) Input feature distributions have shifted — new transaction patterns, seasonality, or a business change (new merchant category, promotional event). (2) Upstream data pipeline bug producing incorrect feature values. (3) Feature store stale data (TTL too long, pipeline failure). Diagnosis: check feature drift metrics (PSI for each feature), compare recent vs training data distributions, verify upstream pipeline health. Solution: retrain on recent data, implement automated drift detection alerts, set up a regular retraining schedule.
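The PSI check mentioned in the diagnosis step is a short formula over binned feature distributions. A sketch with made-up histograms; the 0.1/0.25 thresholds are the common rule of thumb, not a universal standard:

```python
import math

def psi(expected, actual):
    """Population Stability Index over pre-binned distributions:
    sum((a - e) * ln(a / e)) per bin. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

training_dist = [0.25, 0.25, 0.25, 0.25]  # feature histogram at training time
current_dist = [0.10, 0.20, 0.30, 0.40]   # same feature in recent traffic

drift = psi(training_dist, current_dist)
assert drift > 0.1                               # moderate shift -> alert
assert psi(training_dist, training_dist) == 0.0  # identical -> no drift
```

Running this per feature against the training snapshot is usually enough to locate which input shifted; in practice you also need smoothing for empty bins, which this sketch omits.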
Q3: You need to serve an LLM (3B parameters, 6GB model) on Kubernetes. The model takes 30 seconds to load. How do you handle scaling and deployment?
Use StatefulSets or Deployments with PVC-backed model storage to avoid downloading the model on each pod start. Pre-pull the model image during off-peak hours. Set minimum replicas high enough to handle baseline traffic (avoid scaling from zero). Configure HPA to scale up aggressively (preemptive) and scale down slowly (stabilization window of 10+ minutes) to avoid cold-start latency. Use pod disruption budgets to ensure rolling updates never take all replicas offline. Consider model warm-up with health checks that only pass after the model has served a test inference.
Q4: Your Kubernetes cluster has 8 A100 GPUs shared by 5 ML model serving teams. How do you manage GPU resources fairly?
Implement resource quotas per namespace (each team gets a namespace). Set GPU limits: e.g., team A gets 2 GPUs, team B gets 2, and 2 are shared. Use NVIDIA MIG to partition A100s into smaller instances for models that do not need a full GPU. Implement priority classes: production models get higher priority than development workloads. Use cluster autoscaler to add GPU nodes when demand exceeds capacity. Monitor GPU utilization per team and reclaim underutilized allocations quarterly. Set up a GPU scheduling dashboard with Grafana.
Q5: You are designing the interface between the feature store and the model server. What protocol and format would you choose, and why?
Use gRPC with Protocol Buffers. gRPC provides lower latency than REST due to HTTP/2 multiplexing and binary serialization. Protocol Buffers are strongly typed, which prevents schema mismatches between the feature store and model server. The schema definition serves as a contract between teams. For the feature tensor format, use a flat buffer or numpy-compatible binary format to avoid deserialization overhead. Cache frequently requested feature sets in the model server's local memory (LRU cache with TTL). For batch requests, support streaming gRPC to avoid large single-message overhead.
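The LRU-with-TTL cache suggested above can be sketched with `collections.OrderedDict`. This single-threaded version is illustrative only; real model-server code would add locking and a clock source rather than passing `now` explicitly.

```python
from collections import OrderedDict

class FeatureCache:
    """LRU cache with TTL for hot feature sets held in the model
    server's local memory."""
    def __init__(self, max_size, ttl_seconds):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (inserted_at, value)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or now - entry[0] > self.ttl:
            self._data.pop(key, None)   # expired or missing
            return None
        self._data.move_to_end(key)     # mark as recently used
        return entry[1]

    def put(self, key, value, now):
        self._data[key] = (now, value)
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

cache = FeatureCache(max_size=2, ttl_seconds=60)
cache.put("user:1", [0.4, 1.2], now=0)
cache.put("user:2", [0.9, 0.1], now=1)
assert cache.get("user:1", now=2) == [0.4, 1.2]
cache.put("user:3", [0.0, 0.0], now=3)       # evicts user:2 (LRU)
assert cache.get("user:2", now=4) is None
assert cache.get("user:1", now=100) is None  # expired after TTL
```

The TTL bounds staleness against the feature store, while the LRU bound caps memory; both limits are per-replica since this cache is local.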
8. References and Resources
Books
- Martin Kleppmann — Designing Data-Intensive Applications, O'Reilly, 2017
- Sam Newman — Building Microservices, 2nd Edition, O'Reilly, 2021
- Chip Huyen — Designing Machine Learning Systems, O'Reilly, 2022
- Chris Richardson — Microservices Patterns, Manning, 2018
Official Documentation
- Kubernetes Documentation — https://kubernetes.io/docs/
- NVIDIA Triton Inference Server — https://github.com/triton-inference-server/server
- KServe Documentation — https://kserve.github.io/website/
- Feast Feature Store — https://feast.dev/
- OpenTelemetry Documentation — https://opentelemetry.io/docs/
Courses and Tutorials
- Made With ML — MLOps course — https://madewithml.com/
- Full Stack Deep Learning — https://fullstackdeeplearning.com/
- Google ML Engineering Best Practices — https://developers.google.com/machine-learning/guides/rules-of-ml
- kubectl Reference (useful for CKA exam preparation) — https://kubernetes.io/docs/reference/kubectl/
Community and Talks
- MLOps Community — https://mlops.community/
- KubeCon + CloudNativeCon recordings — https://www.cncf.io/kubecon/
- Chip Huyen's blog — https://huyenchip.com/blog/
- Toss Tech Blog — https://toss.tech/
- Netflix Tech Blog — https://netflixtechblog.com/
Conclusion
The Toss Bank ML Backend Engineer role demands a rare combination of strong backend engineering fundamentals, practical ML serving expertise, and Kubernetes operational skills. You are not being hired to train models — you are being hired to build the reliable, scalable, observable systems that serve those models to millions of banking customers.
The six-month roadmap focuses on building depth in backend architecture first, then layering on ML-specific infrastructure knowledge. This sequence matters: a solid backend engineer who learns ML serving will outperform an ML specialist who struggles with distributed systems.
Your portfolio should demonstrate three things: you can build high-performance systems in multiple languages, you can deploy and operate ML models on Kubernetes, and you understand the operational realities of running ML in production (monitoring, scaling, incident response).
The demand for ML Backend Engineers will only increase as financial services companies integrate more AI into their core operations. Toss Bank is at the forefront of this trend in Korea. Prepare thoroughly, build real projects, and demonstrate that you can operate at the intersection of backend engineering and machine learning.
Start building. The models are waiting to be served.