Toss Bank ML Backend Engineer Study Guide: Server Architecture, ML Serving, and Career Roadmap
- Author: Youngju Kim (@fjvbn20031)
- Introduction: Where ML Meets Banking at Scale
- 1. JD Deep Analysis: What Toss Bank Actually Wants
- 2. Tech Stack Deep Dive
- 3. Interview Preparation: 30 Questions
- 4. Six-Month Study Roadmap
- 5. Resume Strategy
- 6. Portfolio Project Ideas
- 7. Quiz: Test Your Knowledge
- 8. References and Resources
- Conclusion
Introduction: Where ML Meets Banking at Scale
Machine learning in production is not about Jupyter notebooks. It is about serving millions of predictions per second with sub-10ms latency while maintaining model accuracy, handling graceful degradation when models fail, and operating in a regulatory environment where every prediction must be explainable and auditable.
Toss Bank's ML Service Team sits at this exact intersection. They build the backend infrastructure that serves AI models for credit scoring, fraud detection, recommendation systems, and real-time risk assessment. When Toss Bank posts an ML Backend Engineer position, they are not looking for someone who can train a model — they are looking for someone who can build and operate the production systems that serve those models to millions of users.
This is a role that requires deep backend engineering skills (server architecture, distributed systems, performance optimization) combined with practical ML serving knowledge (model deployment, A/B testing infrastructure, feature stores). You need to be equally comfortable designing a high-availability microservice architecture and deploying a TensorFlow Serving cluster on Kubernetes.
This guide breaks down every requirement in the job description, maps each one to specific technologies and study materials, and provides a 6-month plan to get you interview-ready.
1. JD Deep Analysis: What Toss Bank Actually Wants
1.1 Core Responsibilities
"Design and implement robust server architectures for ML model serving"
This is the primary responsibility. You will build the systems that accept inference requests, route them to the appropriate model, return predictions, and handle failures gracefully. This involves:
- Designing APIs for synchronous (REST, gRPC) and asynchronous (event-driven) inference
- Building model registries that track model versions, metadata, and deployment status
- Implementing A/B testing and canary deployment infrastructure for models
- Managing model lifecycle: training pipeline integration, model validation, promotion, and rollback
- Shadow mode deployments for new models (serve traffic but do not use predictions)
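The shadow-mode item above can be sketched in a few lines of Python. The `primary`, `shadow`, and `log` objects here are hypothetical stand-ins for real model clients and a prediction logger, not part of any specific framework:

```python
import concurrent.futures

# Long-lived pool; in a real service this would be shared across all requests.
_shadow_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def predict_with_shadow(primary, shadow, features, log):
    """Serve the primary model's prediction and mirror the same request to the
    shadow model, logging its output for offline comparison only."""
    def shadow_call():
        try:
            log.append(("shadow", shadow.predict(features)))
        except Exception:
            pass  # shadow failures must never affect user-facing traffic

    future = _shadow_pool.submit(shadow_call)
    result = primary.predict(features)  # only this value is returned to the caller
    future.result()  # waited on here only to keep the example deterministic;
                     # a real service fires and forgets the shadow call
    return result
```

The key property is that the shadow model's output never reaches the caller; it only lands in the log, where it can later be compared against the primary model's decisions.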
"Handle large-scale traffic with high availability and low latency"
Toss Bank serves millions of users. The ML backend must:
- Handle tens of thousands of requests per second per model endpoint
- Maintain p99 latency under 10ms for critical inference paths (fraud detection)
- Achieve 99.99% availability (roughly 53 minutes of downtime per year)
- Implement circuit breakers, bulkheads, and fallback strategies
- Auto-scale based on traffic patterns (morning peaks, payday spikes)
"Deploy and operate AI model serving infrastructure on Kubernetes"
Kubernetes is the deployment platform. You need to:
- Deploy model serving frameworks (TensorFlow Serving, Triton Inference Server, TorchServe) on K8s
- Configure GPU scheduling and resource quotas for model workloads
- Implement custom HPA (Horizontal Pod Autoscaler) metrics based on inference latency and queue depth
- Manage model artifacts with persistent volumes or object storage integration
- Set up canary deployments using Istio or Argo Rollouts
"Build and maintain MSA (Microservice Architecture) with distributed tracing"
The ML platform is built as microservices. You need to:
- Design service boundaries for feature computation, model inference, post-processing, and logging
- Implement distributed tracing with OpenTelemetry across all services
- Set up service mesh (Istio) for traffic management, mTLS, and observability
- Build centralized logging with ELK or Loki stack
- Implement health checks, readiness probes, and liveness probes for all services
"Develop core platform services in Python, Kotlin, and Go"
Toss Bank uses a polyglot tech stack:
- Python: ML model serving, feature engineering, data pipelines
- Kotlin: Core backend services, Spring Boot applications, API gateways
- Go: High-performance infrastructure services, CLI tools, operators
You are not expected to be an expert in all three, but you need working proficiency in at least two and willingness to learn the third.
1.2 Required Qualifications
"5+ years of backend engineering experience"
This is a mid-to-senior role. They want someone who has built and operated production systems, not just written code. Experience with production incidents, performance optimization under pressure, and architectural decision-making is expected.
"Experience designing and operating microservice architectures"
You should be able to discuss service decomposition strategies, inter-service communication patterns (sync vs async), distributed transaction management (saga pattern, eventual consistency), and the operational overhead of microservices (deployment pipelines, monitoring, debugging).
"Proficiency in at least one of Python, Kotlin, or Go"
Deep expertise in one language is more valuable than shallow knowledge of all three. However, you should be able to read and understand code in the other two languages. The interview will likely focus on your strongest language but may include pair programming in another.
"Understanding of ML model serving concepts"
You do not need to train models, but you need to understand:
- Model formats (SavedModel, ONNX, TorchScript) and their trade-offs
- Batch inference vs online inference
- Feature stores and feature computation pipelines
- Model monitoring (data drift, concept drift, performance degradation)
- A/B testing methodology for ML models
"Kubernetes operational experience"
This means more than kubectl apply. You should understand:
- StatefulSets vs Deployments for model serving
- Resource requests and limits for GPU workloads
- Network policies and service mesh configuration
- Custom Resource Definitions (CRDs) and operators
- Cluster autoscaling and node pool management
1.3 Preferred Qualifications
"Experience with GPU infrastructure for ML inference"
GPU resource management is a specialized skill. Understanding NVIDIA GPU scheduling, CUDA, multi-instance GPU (MIG) partitioning, and GPU monitoring (DCGM) sets you apart.
"Contributions to open-source ML infrastructure projects"
Projects like KServe, Seldon Core, MLflow, or Kubeflow contributions demonstrate that you understand the ML infrastructure ecosystem.
"Experience with real-time feature serving"
Feature stores like Feast or Tecton, and the ability to serve features with sub-millisecond latency, are highly valued for online ML inference.
2. Tech Stack Deep Dive
2.1 Server Architecture Patterns for ML Serving
Synchronous Inference Architecture
The most common pattern for real-time ML inference:
Client -> API Gateway -> Model Router -> Model Server -> Response
                              |
                        Feature Store
                       (Redis/DynamoDB)
// Kotlin - Model Router Service
@RestController
@RequestMapping("/api/v1/predict")
class PredictionController(
    private val modelRouter: ModelRouter,
    private val featureService: FeatureService,
    private val metricsService: MetricsService
) {
    @PostMapping("/{modelName}")
    suspend fun predict(
        @PathVariable modelName: String,
        @RequestBody request: PredictionRequest
    ): ResponseEntity<PredictionResponse> {
        val timer = metricsService.startTimer("prediction_latency")
        try {
            // 1. Fetch features
            val features = featureService.getFeatures(
                request.entityId,
                modelName
            )
            // 2. Route to appropriate model version
            val modelEndpoint = modelRouter.route(
                modelName,
                request.headers
            )
            // 3. Call model server
            val prediction = modelEndpoint.predict(features)
            // 4. Log prediction for monitoring
            metricsService.recordPrediction(modelName, prediction)
            return ResponseEntity.ok(prediction)
        } catch (e: ModelUnavailableException) {
            // Fallback to default model or rule-based system
            return ResponseEntity.ok(fallbackPrediction(modelName, request))
        } finally {
            timer.stop()
        }
    }
}
Asynchronous Inference Architecture
For non-real-time workloads (batch scoring, pre-computation):
Producer -> Kafka Topic -> Flink/Consumer -> Model Server -> Result Topic
                                  |
                            Feature Store
# Python - Async inference worker
from confluent_kafka import Consumer, Producer
import tritonclient.grpc as grpc_client

class InferenceWorker:
    def __init__(self, config):
        self.consumer = Consumer(config.kafka_consumer_config)
        self.producer = Producer(config.kafka_producer_config)
        self.triton = grpc_client.InferenceServerClient(
            url=config.triton_url
        )

    def process_batch(self, messages):
        # Batch multiple requests for efficient GPU utilization
        inputs = self._prepare_batch_inputs(messages)
        # Triton Inference Server call
        result = self.triton.infer(
            model_name="fraud_detection_v3",
            inputs=inputs,
            outputs=[grpc_client.InferRequestedOutput("predictions")]
        )
        predictions = result.as_numpy("predictions")
        # Publish results
        for msg, pred in zip(messages, predictions):
            self.producer.produce(
                topic="prediction-results",
                key=msg.key(),
                value=self._serialize_prediction(pred)
            )
        self.producer.flush()
2.2 Kubernetes for ML Workloads
GPU Scheduling on Kubernetes
# GPU-enabled model serving deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-serving
  labels:
    app: fraud-model
    version: v3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-model
  template:
    metadata:
      labels:
        app: fraud-model
        version: v3
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8002'
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: '16Gi'
              nvidia.com/gpu: '1'
            limits:
              cpu: '8'
              memory: '32Gi'
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 15
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-artifacts-pvc
      nodeSelector:
        gpu-type: nvidia-a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
Custom HPA for ML Workloads
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model-serving
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_request_queue_depth
        target:
          type: AverageValue
          averageValue: '10'
    - type: Pods
      pods:
        metric:
          name: inference_latency_p99_ms
        target:
          type: AverageValue
          averageValue: '8'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
2.3 Model Serving Frameworks
Triton Inference Server
Triton (by NVIDIA) supports multiple frameworks and provides advanced features:
# Model configuration for Triton: config.pbtxt in the model repository
name: "fraud_detection"
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 128 ]
  }
]
output [
  {
    name: "probability"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}
TensorFlow Serving
# Model export for TF Serving
import tensorflow as tf

model = tf.keras.models.load_model("fraud_model")

# Wrap the forward pass in a tf.function to build a serving signature
# (model.predict is a NumPy convenience method and cannot be exported directly)
serving_fn = tf.function(
    lambda features: model(features)
).get_concrete_function(
    tf.TensorSpec(shape=[None, 128], dtype=tf.float32, name="features")
)

# Export as SavedModel with signature
tf.saved_model.save(
    model,
    "exported_model/1",
    signatures={"serving_default": serving_fn},
)
KServe on Kubernetes
KServe (formerly KFServing) provides a standardized ML serving layer on K8s:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: 's3://models/fraud-detection/v3'
      resources:
        requests:
          cpu: '2'
          memory: '8Gi'
          nvidia.com/gpu: '1'
    minReplicas: 2
    maxReplicas: 10
  transformer:
    containers:
      - name: feature-transformer
        image: registry.tossbank.com/feature-transformer:v1
        resources:
          requests:
            cpu: '1'
            memory: '2Gi'
2.4 Feature Store Architecture
A feature store provides consistent feature computation for training and serving:
# Feature definition with Feast
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define entity
user = Entity(
    name="user_id",
    join_keys=["user_id"],
)

# Define feature view
user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    ttl=timedelta(hours=24),
    schema=[
        Field(name="transaction_count_7d", dtype=Float32),
        Field(name="avg_transaction_amount_30d", dtype=Float32),
        Field(name="max_transaction_amount_7d", dtype=Float32),
        Field(name="unique_merchants_30d", dtype=Int64),
        Field(name="late_night_transaction_ratio", dtype=Float32),
    ],
    online=True,
    source=FileSource(
        path="s3://features/user_transactions.parquet",
        timestamp_field="event_timestamp",
    ),
)
// Go - High-performance feature serving endpoint
package main

import (
    "context"
    "encoding/json"
    "net/http"
    "time"

    "github.com/redis/go-redis/v9"
)

type FeatureServer struct {
    redis *redis.Client
}

type FeatureRequest struct {
    EntityID string   `json:"entity_id"`
    Features []string `json:"features"`
}

type FeatureResponse struct {
    Features map[string]float64 `json:"features"`
    Latency  int64              `json:"latency_us"`
}

func (s *FeatureServer) GetFeatures(
    w http.ResponseWriter, r *http.Request,
) {
    start := time.Now()
    var req FeatureRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    // Batch fetch from Redis with a pipeline
    ctx := context.Background()
    pipe := s.redis.Pipeline()
    cmds := make([]*redis.StringCmd, len(req.Features))
    for i, feature := range req.Features {
        key := req.EntityID + ":" + feature
        cmds[i] = pipe.Get(ctx, key)
    }
    _, err := pipe.Exec(ctx)
    if err != nil && err != redis.Nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    features := make(map[string]float64)
    for i, cmd := range cmds {
        val, err := cmd.Float64()
        if err == nil {
            features[req.Features[i]] = val
        }
    }
    resp := FeatureResponse{
        Features: features,
        Latency:  time.Since(start).Microseconds(),
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(resp)
}
2.5 Distributed Tracing with OpenTelemetry
Tracing across ML microservices is essential for debugging latency issues:
// Kotlin - OpenTelemetry instrumentation
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.SpanKind
import io.opentelemetry.api.trace.StatusCode

class TracedModelInference(
    private val modelClient: ModelClient
) {
    private val tracer = GlobalOpenTelemetry.getTracer("ml-inference")

    suspend fun infer(
        modelName: String,
        features: Map<String, Float>
    ): PredictionResult {
        val span = tracer.spanBuilder("model-inference")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("model.name", modelName)
            .setAttribute("model.features.count", features.size.toLong())
            .startSpan()
        return try {
            span.makeCurrent().use {
                val result = modelClient.predict(modelName, features)
                span.setAttribute("model.version", result.modelVersion)
                span.setAttribute(
                    "model.latency_ms",
                    result.inferenceLatencyMs.toDouble()
                )
                span.setStatus(StatusCode.OK)
                result
            }
        } catch (e: Exception) {
            span.setStatus(StatusCode.ERROR, e.message ?: "Unknown error")
            span.recordException(e)
            throw e
        } finally {
            span.end()
        }
    }
}
2.6 Circuit Breaker Pattern for Model Serving
When a model server becomes unhealthy, the circuit breaker prevents cascading failures:
// Kotlin - Resilience4j circuit breaker for model inference
import io.github.resilience4j.circuitbreaker.CircuitBreaker
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import java.time.Duration

val circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50f)
    .slowCallRateThreshold(80f)
    .slowCallDurationThreshold(Duration.ofMillis(200))
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(100)
    .minimumNumberOfCalls(20)
    .build()

val modelCircuitBreaker = CircuitBreaker.of(
    "fraud-model", circuitBreakerConfig
)

// Usage with fallback
fun predictWithFallback(
    features: Map<String, Float>
): PredictionResult {
    return try {
        modelCircuitBreaker.executeSupplier {
            modelClient.predict("fraud_detection_v3", features)
        }
    } catch (e: Exception) {
        // Fallback to rule-based system when model is unavailable
        ruleBasedFraudDetection(features)
    }
}
2.7 Model Monitoring and Observability
# Python - Model monitoring with custom metrics
from prometheus_client import Counter, Histogram, Gauge
import numpy as np

# Metrics
prediction_counter = Counter(
    "model_predictions_total",
    "Total predictions",
    ["model_name", "model_version"]
)
prediction_latency = Histogram(
    "model_prediction_latency_seconds",
    "Prediction latency",
    ["model_name"],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
)
feature_drift_gauge = Gauge(
    "model_feature_drift_score",
    "Feature drift score (PSI)",
    ["model_name", "feature_name"]
)
prediction_distribution = Histogram(
    "model_prediction_distribution",
    "Distribution of prediction values",
    ["model_name"],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

class ModelMonitor:
    def __init__(self, model_name, reference_data):
        self.model_name = model_name
        self.reference_distributions = self._compute_distributions(
            reference_data
        )

    def record_prediction(self, features, prediction, latency):
        prediction_counter.labels(
            model_name=self.model_name,
            model_version="v3"
        ).inc()
        prediction_latency.labels(
            model_name=self.model_name
        ).observe(latency)
        prediction_distribution.labels(
            model_name=self.model_name
        ).observe(prediction)

    def check_feature_drift(self, recent_features):
        """Calculate PSI (Population Stability Index) for drift detection"""
        for feature_name, values in recent_features.items():
            psi = self._calculate_psi(
                self.reference_distributions[feature_name],
                values
            )
            feature_drift_gauge.labels(
                model_name=self.model_name,
                feature_name=feature_name
            ).set(psi)
            if psi > 0.2:
                self._alert_drift(feature_name, psi)

    def _calculate_psi(self, expected, actual, bins=10):
        expected_hist, edges = np.histogram(expected, bins=bins)
        actual_hist, _ = np.histogram(actual, bins=edges)
        # Laplace smoothing avoids division by zero in empty bins
        expected_pct = (expected_hist + 1) / (sum(expected_hist) + bins)
        actual_pct = (actual_hist + 1) / (sum(actual_hist) + bins)
        psi = sum(
            (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
        )
        return psi
3. Interview Preparation: 30 Questions
3.1 Backend Architecture (Questions 1-10)
Q1. How would you design a model serving system that handles 50,000 requests per second with p99 latency under 10ms?
Start with a multi-tier architecture: an API gateway (Kong or Envoy) for rate limiting and routing, a model router service that directs requests to the correct model version, and model server pods running Triton or TF Serving. Use gRPC instead of REST for inter-service communication to reduce serialization overhead. Pre-load models into GPU memory and use dynamic batching to maximize GPU utilization. Cache frequently used features in Redis with sub-millisecond access. Scale horizontally with custom HPA metrics based on inference queue depth.
Q2. Explain the differences between online inference, batch inference, and near-real-time inference. When would you use each?
Online inference serves predictions synchronously per request, used for real-time decisions like fraud detection (latency requirement: less than 10ms). Batch inference processes large datasets periodically, used for pre-computed recommendations or credit score updates (latency: minutes to hours). Near-real-time inference processes streaming data with small delays, used for event-driven scoring where slight delay is acceptable (latency: 100ms to seconds). The choice depends on latency requirements, cost efficiency, and data freshness needs.
Q3. How do you implement graceful degradation when a model server fails?
Layer 1: Circuit breaker per model endpoint that opens after a threshold of failures. Layer 2: Fallback to a simpler model (a lightweight model or rule-based system) when the primary model is unavailable. Layer 3: Cached predictions for frequently seen inputs. Layer 4: Default safe values for critical paths (e.g., flag for manual review instead of auto-approve). Monitor fallback rates and alert when they exceed thresholds.
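The four layers can be sketched as a single fallback chain. The client objects and the `cache` dict here are hypothetical stand-ins, not a specific framework's API:

```python
def predict_with_degradation(model_client, fallback_model, cache, request_key, features):
    """Layered graceful degradation: primary model -> lightweight fallback
    model -> cached prediction -> safe default (flag for manual review)."""
    try:
        return model_client.predict(features)        # layer 1: primary model
    except Exception:
        pass
    try:
        return fallback_model.predict(features)      # layer 2: simpler model
    except Exception:
        pass
    cached = cache.get(request_key)                  # layer 3: cached prediction
    if cached is not None:
        return cached
    # layer 4: safe default for critical paths — never auto-approve blindly
    return {"decision": "manual_review", "score": None}
```

In production each fallback would also increment a metric, so that an elevated fallback rate triggers an alert.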
Q4. Describe the saga pattern and how you would apply it to an ML pipeline that writes to multiple services.
The saga pattern manages distributed transactions by breaking them into a sequence of local transactions, each with a compensating action. In an ML pipeline: Step 1 writes the prediction to the database (compensate: delete). Step 2 publishes the prediction event to Kafka (compensate: publish rollback event). Step 3 updates the feature store with the new data point (compensate: revert feature). If any step fails, the compensating actions execute in reverse order. Use an orchestrator (saga execution coordinator) or choreography (event-driven).
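An orchestrated saga can be sketched generically as a list of (action, compensate) pairs; this is an illustrative skeleton, not a specific saga library:

```python
def run_saga(steps):
    """Execute (action, compensate) pairs in order. If any action fails,
    run the compensating actions of the completed steps in reverse order.
    Returns True on success, False after a rollback."""
    done = []
    try:
        for action, compensate in steps:
            action()            # local transaction for this step
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()        # undo completed steps, newest first
        return False
    return True
```

For the ML pipeline above, the steps would be (write prediction, delete prediction), (publish event, publish rollback event), and (update feature store, revert feature).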
Q5. How would you handle zero-downtime deployment of a new model version?
Use canary deployment: deploy the new model version alongside the current one. Route a small percentage (1-5%) of traffic to the new version. Monitor prediction quality metrics (accuracy, latency, error rate) for the canary. Gradually increase traffic to the new version if metrics are healthy. Use Argo Rollouts or Istio traffic splitting for Kubernetes-native canary deployments. Maintain the ability to instantly rollback by keeping the old model version running.
Q6. What is the difference between horizontal and vertical scaling for model serving? When would you prefer each?
Horizontal scaling adds more model server replicas (more pods). It is preferred for CPU-bound models and when you need to handle more concurrent requests. Vertical scaling adds more resources (GPU, memory) to existing instances. It is preferred for large models that do not fit in a single GPU (model parallelism) or when model loading time makes horizontal scaling slow. In practice, use horizontal scaling for most workloads and vertical scaling for very large models (LLMs).
Q7. How do you implement request-level tracing across an ML inference pipeline with multiple microservices?
Use OpenTelemetry as the instrumentation standard. Inject a trace context (trace ID, span ID) at the API gateway. Each service extracts the context, creates a child span, and propagates the context to downstream calls. Instrument key operations: feature fetching, model inference, post-processing. Send traces to Jaeger or Tempo. Add custom attributes to spans: model name, model version, feature count, prediction value. Use trace-based alerting for latency anomalies.
Q8. Explain the CAP theorem trade-offs in a feature store that serves both training and online inference.
A feature store must be consistent (training and serving use the same features), available (online inference cannot wait), and partition-tolerant. For online serving: prioritize AP with eventual consistency — serve from Redis with best-effort freshness. For training: prioritize CP — use a batch store (S3, BigQuery) with strict point-in-time correctness. Bridge the gap with a dual-write architecture: features are written to both the online store (Redis) and the offline store (S3) with reconciliation checks.
Q9. How do you prevent cascading failures in a microservice architecture?
Implement bulkheads to isolate failure domains (separate thread pools per downstream service). Use circuit breakers with configurable thresholds. Implement timeouts on all external calls (no unbounded waits). Use retry with exponential backoff and jitter. Limit request queues with bounded buffers (reject excess load instead of queuing indefinitely). Health check endpoints that downstream services can probe. Load shedding at the API gateway when the system is overloaded.
Q10. Describe your approach to API versioning for ML model endpoints.
Use URL-based versioning (e.g., /v1/predict, /v2/predict) for breaking changes. Use header-based versioning for minor changes. The model router maps API versions to model versions. Maintain backward compatibility: v1 callers should continue working when v2 is deployed. Deprecation policy: announce 3 months before removing an API version. Include version information in all responses for debugging. Run both versions simultaneously during the migration period.
3.2 ML Serving (Questions 11-15)
Q11. Compare TensorFlow Serving, Triton Inference Server, and TorchServe. When would you use each?
TensorFlow Serving is optimized for TF models, has mature gRPC API, and supports model versioning out of the box. Use it for TF-only shops. Triton Inference Server supports multiple frameworks (TF, PyTorch, ONNX, TensorRT), offers dynamic batching, and has the best GPU utilization. Use it for multi-framework environments or when GPU efficiency matters most. TorchServe is the native serving solution for PyTorch models, integrates with TorchScript, and has the simplest setup. Use it for PyTorch-only workloads with simpler requirements.
Q12. What is dynamic batching and why is it important for GPU-based inference?
Dynamic batching groups multiple individual inference requests into a single batch before sending to the GPU. GPUs are optimized for parallel computation and process a batch of 32 almost as fast as a single input. Without batching, GPU utilization can be as low as 5-10%. Dynamic batching adds a small delay (configurable, typically 100-500 microseconds) to wait for more requests before executing. The trade-off is slightly higher latency per request but dramatically higher throughput.
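The queue-and-flush logic behind dynamic batching can be illustrated with a small single-threaded sketch. The function and parameter names are illustrative, not Triton's API; `run_batch` stands in for the actual GPU call:

```python
import queue
import time

def micro_batcher(requests, run_batch, max_batch=32, max_delay_s=0.0005):
    """Group individual requests into batches: flush a batch when it is full
    or when max_delay_s has elapsed since its first request was queued."""
    q = queue.Queue()
    for r in requests:
        q.put(r)
    results = []
    while not q.empty():
        batch = [q.get()]
        deadline = time.monotonic() + max_delay_s
        # Wait briefly for more requests to fill the batch
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                batch.append(q.get(timeout=max(0, deadline - time.monotonic())))
            except queue.Empty:
                break
        results.extend(run_batch(batch))  # one GPU call for the whole batch
    return results
```

The `max_delay_s` knob is the latency/throughput trade-off described above: a longer wait yields fuller batches and higher GPU utilization at the cost of per-request latency.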
Q13. Explain the ONNX format and when you would use it.
ONNX (Open Neural Network Exchange) is a cross-framework model format. A model trained in PyTorch can be exported to ONNX and served by TF Serving, Triton, or ONNX Runtime. Benefits: framework-agnostic serving, optimizations through ONNX Runtime (graph optimization, quantization), deployment flexibility. Use ONNX when you want to decouple training framework choice from serving infrastructure, or when ONNX Runtime provides better performance than native serving.
Q14. How do you implement A/B testing for ML models in production?
At the infrastructure level: deploy both model versions as separate endpoints. The model router splits traffic based on a hash of the user ID (ensures consistent treatment per user). Log predictions with model version, user ID, and timestamp. On the analytics side: define success metrics (click-through rate, conversion, fraud detection rate), calculate statistical significance, and monitor guardrail metrics (latency, error rate). Use Bayesian or frequentist methods depending on traffic volume. Minimum experiment duration: 1-2 weeks for most ML models.
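The deterministic hash-based split can be sketched in a few lines; the function and argument names are illustrative:

```python
import hashlib

def assign_variant(user_id, experiment, treatment_pct=5):
    """Deterministically assign a user to control/treatment by hashing the
    (experiment, user_id) pair, so each user always sees the same model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"
```

Salting the hash with the experiment name is what prevents the same users from always landing in the treatment group across different experiments.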
Q15. What is model warm-up and why does it matter?
Model warm-up is the process of sending dummy inference requests to a newly deployed model before routing real traffic. Without warm-up, the first real requests experience high latency because: the model must be loaded from disk to memory (or GPU), TensorFlow and PyTorch perform lazy initialization (compilation on first call), caches (CPU cache, GPU cache) are cold. Warm-up eliminates this cold-start penalty. In Triton, configure warm-up samples via the model_warmup field in config.pbtxt. In TF Serving, place warm-up requests in the model's assets.extra directory.
3.3 Kubernetes and Infrastructure (Questions 16-20)
Q16. How do you manage GPU resources on Kubernetes for model serving?
Install the NVIDIA device plugin for K8s. Define GPU resource requests and limits in pod specs. Use node selectors and tolerations to schedule GPU workloads on GPU nodes. Implement MIG (Multi-Instance GPU) for sharing a single A100 across multiple pods. Monitor GPU utilization, memory, and temperature with DCGM-exporter and Prometheus. Set up cluster autoscaler with GPU node pools that scale based on pending GPU pod requests.
Q17. Compare StatefulSet and Deployment for model serving. When would you use each?
Use Deployments for stateless model servers where any replica can handle any request (most common). Use StatefulSets when models require persistent local storage for model caching (each pod maintains its own model cache), when you need stable network identities for model ensembles, or when model sharding requires deterministic pod-to-shard mapping. In practice, most model serving workloads use Deployments with external storage (S3, PVC) for model artifacts.
Q18. How would you implement canary deployment for a model on Kubernetes?
Option 1 — Istio traffic splitting: Deploy new model version as a separate deployment. Create Istio VirtualService rules that split traffic by percentage. Gradually shift traffic from stable to canary. Option 2 — Argo Rollouts: Define a Rollout resource with canary strategy. Configure analysis templates that check model metrics. Auto-promote or rollback based on analysis results. Option 3 — KServe canary: Use KServe InferenceService with canaryTrafficPercent field. Both options integrate with Prometheus metrics for automated decision-making.
Q19. How do you handle model artifact storage and versioning on Kubernetes?
Use an object store (S3, GCS, MinIO) as the source of truth for model artifacts. Each model version is stored at a unique path: s3://models/model-name/version/. Use a model registry (MLflow Model Registry, custom) to track metadata: version, metrics, training run ID, approval status. For K8s: use init containers that download the model from object storage on pod startup, or mount object storage directly using CSI drivers. Cache frequently used models on local NVMe storage for faster loading.
Q20. Explain how you would set up a service mesh for ML microservices.
Install Istio with sidecar injection. Configure mTLS for all service-to-service communication (zero-trust security). Use VirtualService for traffic routing (canary, A/B testing). Use DestinationRule for circuit breaking and outlier detection. Enable Envoy access logging for request-level visibility. Set up Kiali for service mesh visualization. Configure rate limiting at the mesh level. Use PeerAuthentication for strict mTLS enforcement. Monitor mesh health with Istio metrics exported to Prometheus.
3.4 Distributed Systems (Questions 21-25)
Q21. How do you ensure exactly-once processing in an ML prediction pipeline that spans multiple services?
True exactly-once is very hard across services. Instead, aim for effectively-once by making each service idempotent. Use a unique request ID generated at the gateway and propagated through all services. Each service checks if it has already processed this request ID (idempotency key in Redis or database). For Kafka-based pipelines, use Kafka transactions. For database writes, use upserts with the request ID as the unique constraint. Log and reconcile to catch edge cases.
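The idempotency-key check can be sketched with a plain dict standing in for Redis or a database table; the names here are illustrative:

```python
def process_once(store, request_id, handler, payload):
    """Effectively-once processing: if this request_id was already handled,
    return the stored result instead of re-running the side effect."""
    if request_id in store:
        return store[request_id]   # duplicate delivery: replay the prior result
    result = handler(payload)      # the actual (side-effecting) work
    store[request_id] = result     # record the result under the idempotency key
    return result
```

With Redis, the check-and-set would be done atomically (for example with SET NX) so that two concurrent duplicates cannot both run the handler.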
Q22. How would you design a feature computation pipeline that serves both real-time and batch consumers?
Use the Lambda or Kappa architecture. Lambda: a speed layer (Flink) computes real-time features from streaming data, while a batch layer (Spark) recomputes features periodically for accuracy. Kappa: use only Flink with a reprocessing capability. In both cases, write features to a dual-store: Redis for online serving, S3/BigQuery for offline training. Ensure schema consistency between real-time and batch features with a shared feature definition registry. Run automated consistency checks between the two stores.
Q23. Explain the challenges of distributed logging in a microservice architecture and your solution.
Challenges: log volume (millions of log lines per minute), correlation across services, log format inconsistency, and storage cost. Solution: structured logging in JSON format with mandatory fields (trace ID, service name, timestamp, severity). Ship logs via Fluent Bit sidecar to Kafka, then to OpenSearch or Loki. Implement sampling for verbose logs (log 10% of debug-level requests). Use log-based alerting for error patterns. Set retention policies: 7 days hot, 30 days warm, 90 days cold in S3.
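The structured-logging piece of that solution can be sketched with the Python standard library: a formatter that emits one JSON object per line with the mandatory fields. The field names here are illustrative, not a fixed schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Minimal structured formatter: every log line is one JSON object
    carrying the mandatory correlation fields."""
    def format(self, record):
        return json.dumps({
            "trace_id": getattr(record, "trace_id", "unknown"),
            "service": getattr(record, "service", "unknown"),
            "severity": record.levelname,
            "message": record.getMessage(),
        })

formatter = JsonFormatter()
record = logging.LogRecord(
    name="api", level=logging.ERROR, pathname="", lineno=0,
    msg="model timeout", args=None, exc_info=None,
)
record.trace_id = "abc-123"      # propagated from the inbound request
record.service = "model-router"

parsed = json.loads(formatter.format(record))
assert parsed["trace_id"] == "abc-123"
assert parsed["severity"] == "ERROR"
```

Because every line is parseable JSON with a trace ID, Fluent Bit can ship it unchanged and OpenSearch/Loki can index and join it across services.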
Q24. How do you handle clock skew in distributed tracing across services?
Use NTP synchronization on all nodes with tight drift targets (under 1ms). In trace collection, rely on agent-side timestamp adjustment in Jaeger or Tempo. For causality ordering when clocks disagree, use Lamport timestamps or vector clocks. In practice: point all K8s nodes at the same NTP server pool, configure chrony with a tight maxdist, and alert when clock drift exceeds 5ms. OpenTelemetry SDKs handle most cases by using monotonic clocks for duration measurement.
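The Lamport-timestamp idea mentioned above fits in a few lines. This is the textbook algorithm, not anything Jaeger-specific: tick on local events, take the max on message receipt.

```python
class LamportClock:
    """Logical clock for causal ordering when wall clocks disagree."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event (e.g. emitting a span): advance the clock.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # A received message's timestamp pushes our clock forward past it.
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # service A emits a span at logical time 1
t_recv = b.receive(t_send)   # service B receives it
assert t_recv > t_send       # causal order holds regardless of wall clocks
```

Logical clocks give you ordering, not durations; that is why the answer still leans on NTP and monotonic clocks for latency measurement.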
Q25. How would you design a rate limiter for an ML prediction API?
Use a token bucket or sliding window algorithm. Implement at two levels: per-client rate limiting at the API gateway (Kong, Envoy) using Redis for shared state, and per-model rate limiting at the model router to prevent a single model from consuming all resources. Configuration: allow burst traffic (bucket capacity), set sustained rate per client, and implement priority queues for critical requests (fraud detection gets higher priority than recommendations). Return 429 status with Retry-After header when limits are exceeded.
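The token bucket algorithm itself is small enough to sketch. This in-memory version illustrates the burst-vs-sustained-rate behavior described above; a real gateway deployment would keep the token state in Redis (often via a Lua script) so all replicas share it.

```python
class TokenBucket:
    """Token bucket: `capacity` allows bursts, `refill_rate` caps the
    sustained request rate (tokens per second)."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, then try to spend one token.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
# Burst of 4 requests at t=0: the first 3 pass, the 4th gets a 429.
results = [bucket.allow(0.0) for _ in range(4)]
assert results == [True, True, True, False]
# Two seconds later, two tokens have refilled.
assert bucket.allow(2.0) is True
```

A rejected request is where you return 429 with a Retry-After header; the priority-queue behavior would sit in front of this check.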
3.5 Python, Kotlin, and Go (Questions 26-30)
Q26. Compare Python's asyncio, Kotlin coroutines, and Go goroutines for concurrent ML serving.
Python asyncio is single-threaded with cooperative multitasking. Good for I/O-bound work (waiting for model server responses) but limited by the GIL for CPU-bound work. Use with uvicorn/FastAPI for async inference endpoints. Kotlin coroutines are lightweight threads on the JVM with structured concurrency. Good for Spring Boot services that need to call multiple downstream services concurrently. Go goroutines are lightweight green threads with preemptive scheduling. Excellent for high-concurrency infrastructure services (feature servers, API gateways) with minimal overhead.
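The asyncio case is worth a concrete sketch: fanning out to several model servers concurrently, which is exactly the I/O-bound pattern where asyncio shines. `asyncio.sleep` stands in for a real network call (e.g. an httpx request), and the model names and scores are made up.

```python
import asyncio

async def call_model(name, delay, value):
    # Stand-in for an async inference call; the sleep simulates
    # network plus inference time.
    await asyncio.sleep(delay)
    return name, value

async def fan_out():
    # Fire all downstream calls concurrently; total wall time is
    # roughly the slowest call, not the sum of all three.
    results = await asyncio.gather(
        call_model("fraud", 0.01, 0.97),
        call_model("credit", 0.02, 0.55),
        call_model("reco", 0.01, 0.73),
    )
    return dict(results)

scores = asyncio.run(fan_out())
assert scores == {"fraud": 0.97, "credit": 0.55, "reco": 0.73}
```

Kotlin's `async`/`await` with coroutines and Go's goroutines-plus-channels express the same fan-out; the differences are in scheduling (cooperative vs preemptive) and what happens when the work turns CPU-bound.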
Q27. How would you implement a connection pool for gRPC model server calls in Kotlin?
Use a channel pool that maintains multiple gRPC channels to the model server. Each channel can multiplex multiple requests. Configure max connections per endpoint, idle timeout, and health checking:
import io.grpc.ConnectivityState
import io.grpc.ManagedChannel
import io.grpc.ManagedChannelBuilder
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.TimeUnit

class GrpcChannelPool(
    private val endpoint: String,
    private val poolSize: Int = 10
) {
    private val channels = ConcurrentLinkedQueue<ManagedChannel>()

    init {
        repeat(poolSize) { channels.offer(createNewChannel()) }
    }

    private fun createNewChannel(): ManagedChannel =
        ManagedChannelBuilder
            .forTarget(endpoint)
            .usePlaintext() // use TLS in production
            .keepAliveTime(30, TimeUnit.SECONDS)
            .maxInboundMessageSize(16 * 1024 * 1024) // 16 MB for large tensors
            .build()

    // Borrow a channel; create a fresh one if the pool is empty.
    fun getChannel(): ManagedChannel =
        channels.poll() ?: createNewChannel()

    // Return a channel to the pool unless it has been shut down.
    fun returnChannel(channel: ManagedChannel) {
        if (channel.getState(false) != ConnectivityState.SHUTDOWN) {
            channels.offer(channel)
        }
    }
}
Q28. Write a Go middleware that implements request logging and latency tracking for an ML API.
import (
	"net/http"
	"strconv"
	"time"

	"github.com/google/uuid"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/rs/zerolog/log"
)

// Prometheus collectors (register with prometheus.MustRegister at startup).
var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "http_requests_total"},
		[]string{"method", "path", "status"})
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "http_request_duration_seconds"},
		[]string{"method", "path"})
)

// statusResponseWriter captures the status code set by downstream handlers.
type statusResponseWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusResponseWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

func MetricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		traceID := r.Header.Get("X-Trace-ID")
		if traceID == "" {
			traceID = uuid.New().String()
		}
		// Wrap the response writer to capture the status code.
		wrapped := &statusResponseWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(wrapped, r)
		duration := time.Since(start)
		// Record metrics.
		httpRequestsTotal.WithLabelValues(
			r.Method, r.URL.Path,
			strconv.Itoa(wrapped.status)).Inc()
		httpRequestDuration.WithLabelValues(
			r.Method, r.URL.Path).Observe(duration.Seconds())
		// Structured log with correlation ID (zerolog).
		log.Info().
			Str("trace_id", traceID).
			Str("method", r.Method).
			Str("path", r.URL.Path).
			Int("status", wrapped.status).
			Dur("latency", duration).
			Msg("request completed")
	})
}
Q29. How would you structure a Python ML serving application for testability?
Separate concerns into layers: API layer (FastAPI routes), service layer (business logic, feature assembly), model layer (model loading, inference). Use dependency injection for all external dependencies (model client, feature store, database). Create interfaces (Protocol classes in Python) for model inference so you can mock them in tests. Test each layer independently: unit tests for business logic, integration tests with a test model server, end-to-end tests with the full pipeline. Use pytest fixtures for common test setup.
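The Protocol-based dependency injection described above looks like this in practice. A minimal sketch with invented names (`ModelClient`, `FraudService`, `FakeModel`); the point is that the service layer depends only on the interface.

```python
from typing import Protocol

class ModelClient(Protocol):
    """Interface for inference, so the service layer never imports a
    concrete model-server client."""
    def predict(self, features: dict) -> float: ...

class FraudService:
    def __init__(self, model: ModelClient, threshold: float = 0.9):
        self.model = model
        self.threshold = threshold

    def is_fraud(self, features: dict) -> bool:
        # Business logic stays unit-testable: inject any ModelClient.
        return self.model.predict(features) >= self.threshold

class FakeModel:
    """Test double used in unit tests instead of a live model server."""
    def __init__(self, score: float):
        self.score = score

    def predict(self, features: dict) -> float:
        return self.score

# Unit tests exercise the threshold logic with no network or model runtime.
assert FraudService(FakeModel(0.95)).is_fraud({"amount": 9999}) is True
assert FraudService(FakeModel(0.10)).is_fraud({"amount": 12}) is False
```

Because `Protocol` uses structural typing, `FakeModel` never has to inherit from anything; the real gRPC client satisfies the same interface by simply having a matching `predict` method.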
Q30. When would you use Python vs Kotlin vs Go for an ML platform service?
Python: model serving wrappers, feature engineering pipelines, data validation, any service that needs tight integration with ML libraries (scikit-learn, pandas, TensorFlow). Kotlin: API gateways, orchestration services, anything that benefits from the JVM ecosystem (Spring Boot, Kafka client, JDBC), services that need to integrate with existing Java infrastructure. Go: high-performance stateless services (feature servers, request routers), CLI tools, Kubernetes operators, any service where startup time and memory footprint matter. The key principle: use the right language for the job, not the language you are most comfortable with.
4. Six-Month Study Roadmap
Month 1: Backend Fundamentals
Week 1-2: Server Architecture
- Study clean architecture and hexagonal architecture patterns
- Build a REST API in Kotlin with Spring Boot (CRUD, validation, error handling)
- Implement the same API in Go with Gin or Echo framework
- Compare performance characteristics using load testing with k6
Week 3-4: Concurrency and Performance
- Study Kotlin coroutines with the official guide and Kotlin in Action
- Implement a concurrent HTTP client in Go using goroutines and channels
- Build a Python async API with FastAPI and asyncio
- Profile all three implementations: memory usage, latency under load, CPU utilization
Month 2: Kubernetes Deep Dive
Week 1-2: Core K8s Concepts
- Complete the CKA (Certified Kubernetes Administrator) curriculum
- Deploy a multi-service application on Minikube or kind
- Implement Deployments, Services, ConfigMaps, Secrets, and PVCs
- Practice with resource requests, limits, and QoS classes
Week 3-4: Advanced K8s for ML
- Set up GPU scheduling with NVIDIA device plugin (on cloud or local GPU)
- Configure HPA with custom metrics from Prometheus
- Implement canary deployments with Argo Rollouts
- Set up Istio service mesh with traffic splitting
Month 3: ML Serving Fundamentals
Week 1-2: Model Serving Frameworks
- Deploy TensorFlow Serving with a sample model
- Deploy Triton Inference Server and test with ONNX model
- Compare latency and throughput between the two
- Implement dynamic batching and measure the throughput improvement
Week 3-4: KServe and Model Management
- Install KServe on a Kubernetes cluster
- Deploy a model with InferenceService
- Implement canary deployment for a model version upgrade
- Set up model monitoring with Prometheus metrics
Month 4: Distributed Systems
Week 1-2: Theory and Patterns
- Read "Designing Data-Intensive Applications" chapters 5, 8, 9
- Study microservice patterns: circuit breaker, bulkhead, saga, CQRS
- Implement a circuit breaker in Kotlin using Resilience4j
- Practice distributed systems design questions
Week 3-4: Observability
- Set up OpenTelemetry instrumentation in a multi-service application
- Deploy Jaeger or Tempo for distributed tracing
- Set up Prometheus + Grafana for metrics collection and dashboards
- Implement structured logging with correlation IDs across services
Month 5: Feature Stores and Data Pipelines
Week 1-2: Feature Store
- Deploy Feast locally and define feature views
- Implement online feature serving with Redis backend
- Build a feature computation pipeline with Python
- Test point-in-time correctness for training data
Week 3-4: End-to-End ML Pipeline
- Build a complete ML serving pipeline: feature store to model server to API
- Implement A/B testing infrastructure with traffic splitting
- Add model monitoring (data drift, prediction distribution)
- Set up automated model retraining triggers
Month 6: Integration and Interview Prep
Week 1-2: Portfolio Project
- Build a production-quality ML serving platform with all components
- Document the architecture with diagrams and decision records
- Create a monitoring dashboard with key ML metrics
- Write load tests that validate SLA requirements
Week 3-4: Interview Preparation
- Practice all 30 interview questions with a partner
- Do 5 system design mock interviews focused on ML infrastructure
- Review and refine your resume with quantifiable metrics
- Prepare a 5-minute architecture walkthrough of your portfolio project
- Study Toss Bank tech blog posts for culture and technical context
5. Resume Strategy
What Toss Bank Wants to See
Your resume should prove that you can build and operate ML infrastructure at scale. Frame everything in terms of impact.
Lead with Scale Metrics
- "Designed and operated ML serving infrastructure handling 100K predictions per second with p99 latency under 8ms"
- "Reduced model deployment time from 2 days to 15 minutes by building a KServe-based CI/CD pipeline"
- "Built a feature store serving 50M features per day with sub-millisecond latency using Redis cluster"
Show Polyglot Experience
- List primary language first with depth indicators ("5 years Python, production ML systems")
- Include secondary languages with context ("Kotlin for Spring Boot microservices", "Go for high-performance tools")
- Highlight language selection decisions: "Chose Go for the feature server due to low latency requirements, reducing p99 from 5ms to 0.8ms"
Demonstrate Operational Maturity
- On-call experience for ML systems (model failures, data drift incidents)
- Cost optimization: GPU utilization improvements, right-sizing model serving infrastructure
- Zero-downtime deployments and migration stories
- Monitoring and alerting setup for ML-specific metrics
Resume Format
- 2 pages maximum
- "Technical Skills" section that maps directly to the JD keywords
- XYZ format for achievements: "Accomplished X by implementing Y, resulting in Z"
- Include links to relevant open-source contributions or public talks
- GitHub profile with at least one relevant project
6. Portfolio Project Ideas
Project 1: Full-Stack ML Serving Platform
Build a complete ML serving platform that demonstrates all the JD requirements:
- Model Server: Triton Inference Server on Kubernetes with GPU support
- API Gateway: Kotlin Spring Boot with gRPC model client
- Feature Store: Redis-backed feature serving with Python feature pipelines
- Monitoring: Prometheus + Grafana dashboards with model-specific metrics
- Canary Deployments: Argo Rollouts with automated analysis
- Distributed Tracing: OpenTelemetry across all services
Project 2: Real-Time Fraud Detection System
Build an end-to-end fraud detection system:
- Kotlin API that accepts transaction events
- Go feature server that computes real-time features
- Python model serving with ensemble of models
- A/B testing infrastructure for model comparison
- Circuit breaker with fallback to rule-based detection
- Dashboard showing detection rate, false positive rate, latency
Project 3: Model Deployment Pipeline
Build a CI/CD pipeline for ML models:
- MLflow for model tracking and registry
- KServe for model serving with canary support
- Automated model validation (schema check, performance benchmark)
- Shadow mode deployment for new models
- Automated rollback on metric degradation
- Slack/webhook notifications for deployment events
Project 4: Kubernetes Operator for ML Workloads
Build a custom K8s operator in Go:
- Custom Resource Definition for MLModel resources
- Controller that manages model serving deployments
- Automatic scaling based on inference metrics
- Model artifact syncing from S3
- Health checking and auto-recovery for model pods
- Integration with Prometheus for metrics
7. Quiz: Test Your Knowledge
Q1: Your ML serving system handles 10K requests per second. You deploy a new model version that increases inference latency from 5ms to 50ms. What happens to the system and how do you respond?
With 10K RPS and 50ms latency per request, the system needs 500 concurrent connections (up from 50). This can exhaust thread pools, connection limits, and downstream capacity, causing cascading timeouts. Immediate response: roll back the canary deployment to the old model version. Then investigate: Is the model larger (more GPU memory needed)? Is it not optimized (needs TensorRT optimization or quantization)? Is it making extra network calls? Fix the latency issue before redeploying. Always have automated canary analysis that catches latency regressions before they reach 100% traffic.
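The concurrency arithmetic above is Little's law: in-flight requests = arrival rate x latency. Worth checking explicitly, since it drives capacity planning for every latency regression:

```python
def required_concurrency(rps, latency_seconds):
    """Little's law: average in-flight requests = arrival rate x latency."""
    return rps * latency_seconds

# Old model: 10K RPS at 5ms -> 50 in-flight requests.
assert required_concurrency(10_000, 0.005) == 50
# New model: 10K RPS at 50ms -> 500, a 10x jump in needed capacity.
assert required_concurrency(10_000, 0.050) == 500
```

Any fixed-size resource sized for 50 in-flight requests (thread pools, connection pools, downstream quotas) becomes the bottleneck the moment latency grows tenfold.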
Q2: You notice that your fraud detection model's precision drops from 95% to 80% over two weeks, but the model itself has not changed. What is happening?
This is data drift or concept drift. Possible causes: (1) Input feature distributions have shifted — new transaction patterns, seasonality, or a business change (new merchant category, promotional event). (2) Upstream data pipeline bug producing incorrect feature values. (3) Feature store stale data (TTL too long, pipeline failure). Diagnosis: check feature drift metrics (PSI for each feature), compare recent vs training data distributions, verify upstream pipeline health. Solution: retrain on recent data, implement automated drift detection alerts, set up a regular retraining schedule.
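The PSI check mentioned in the diagnosis step is a short formula over binned feature distributions. A sketch with made-up histograms; the 0.1/0.25 thresholds are the common rule of thumb, not a universal standard:

```python
import math

def psi(expected, actual):
    """Population Stability Index over pre-binned distributions:
    sum((a - e) * ln(a / e)) per bin. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

training_dist = [0.25, 0.25, 0.25, 0.25]  # feature histogram at training time
current_dist = [0.10, 0.20, 0.30, 0.40]   # same feature in recent traffic

drift = psi(training_dist, current_dist)
assert drift > 0.1                               # moderate shift -> alert
assert psi(training_dist, training_dist) == 0.0  # identical -> no drift
```

Running this per feature against the training snapshot is usually enough to locate which input shifted; in practice you also need smoothing for empty bins, which this sketch omits.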
Q3: You need to serve an LLM (3B parameters, 6GB model) on Kubernetes. The model takes 30 seconds to load. How do you handle scaling and deployment?
Use StatefulSets or Deployments with PVC-backed model storage to avoid downloading the model on each pod start. Pre-pull the model image during off-peak hours. Set minimum replicas high enough to handle baseline traffic (avoid scaling from zero). Configure HPA to scale up aggressively (preemptive) and scale down slowly (stabilization window of 10+ minutes) to avoid cold-start latency. Use pod disruption budgets to ensure rolling updates never take all replicas offline. Consider model warm-up with health checks that only pass after the model has served a test inference.
Q4: Your Kubernetes cluster has 8 A100 GPUs shared by 5 ML model serving teams. How do you manage GPU resources fairly?
Implement resource quotas per namespace (each team gets a namespace). Set GPU limits: e.g., team A gets 2 GPUs, team B gets 2, and 2 are shared. Use NVIDIA MIG to partition A100s into smaller instances for models that do not need a full GPU. Implement priority classes: production models get higher priority than development workloads. Use cluster autoscaler to add GPU nodes when demand exceeds capacity. Monitor GPU utilization per team and reclaim underutilized allocations quarterly. Set up a GPU scheduling dashboard with Grafana.
Q5: You are designing the interface between the feature store and the model server. What protocol and format would you choose, and why?
Use gRPC with Protocol Buffers. gRPC provides lower latency than REST due to HTTP/2 multiplexing and binary serialization. Protocol Buffers are strongly typed, which prevents schema mismatches between the feature store and model server. The schema definition serves as a contract between teams. For the feature tensor format, use a flat buffer or numpy-compatible binary format to avoid deserialization overhead. Cache frequently requested feature sets in the model server's local memory (LRU cache with TTL). For batch requests, support streaming gRPC to avoid large single-message overhead.
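The LRU-with-TTL cache suggested above can be sketched with `collections.OrderedDict`. This single-threaded version is illustrative only; real model-server code would add locking and a clock source rather than passing `now` explicitly.

```python
from collections import OrderedDict

class FeatureCache:
    """LRU cache with TTL for hot feature sets held in the model
    server's local memory."""
    def __init__(self, max_size, ttl_seconds):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (inserted_at, value)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or now - entry[0] > self.ttl:
            self._data.pop(key, None)   # expired or missing
            return None
        self._data.move_to_end(key)     # mark as recently used
        return entry[1]

    def put(self, key, value, now):
        self._data[key] = (now, value)
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

cache = FeatureCache(max_size=2, ttl_seconds=60)
cache.put("user:1", [0.4, 1.2], now=0)
cache.put("user:2", [0.9, 0.1], now=1)
assert cache.get("user:1", now=2) == [0.4, 1.2]
cache.put("user:3", [0.0, 0.0], now=3)       # evicts user:2 (LRU)
assert cache.get("user:2", now=4) is None
assert cache.get("user:1", now=100) is None  # expired after TTL
```

The TTL bounds staleness against the feature store, while the LRU bound caps memory; both limits are per-replica since this cache is local.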
8. References and Resources
Books
- Martin Kleppmann — Designing Data-Intensive Applications, O'Reilly, 2017
- Sam Newman — Building Microservices, 2nd Edition, O'Reilly, 2021
- Chip Huyen — Designing Machine Learning Systems, O'Reilly, 2022
- Chris Richardson — Microservices Patterns, Manning, 2018
Official Documentation
- Kubernetes Documentation — https://kubernetes.io/docs/
- NVIDIA Triton Inference Server — https://github.com/triton-inference-server/server
- KServe Documentation — https://kserve.github.io/website/
- Feast Feature Store — https://feast.dev/
- OpenTelemetry Documentation — https://opentelemetry.io/docs/
Courses and Tutorials
- Made With ML — MLOps course — https://madewithml.com/
- Full Stack Deep Learning — https://fullstackdeeplearning.com/
- Google ML Engineering Best Practices — https://developers.google.com/machine-learning/guides/rules-of-ml
- kubectl Reference (useful for CKA exam preparation) — https://kubernetes.io/docs/reference/kubectl/
Community and Talks
- MLOps Community — https://mlops.community/
- KubeCon + CloudNativeCon recordings — https://www.cncf.io/kubecon/
- Chip Huyen's blog — https://huyenchip.com/blog/
- Toss Tech Blog — https://toss.tech/
- Netflix Tech Blog — https://netflixtechblog.com/
Conclusion
The Toss Bank ML Backend Engineer role demands a rare combination of strong backend engineering fundamentals, practical ML serving expertise, and Kubernetes operational skills. You are not being hired to train models — you are being hired to build the reliable, scalable, observable systems that serve those models to millions of banking customers.
The six-month roadmap focuses on building depth in backend architecture first, then layering on ML-specific infrastructure knowledge. This sequence matters: a solid backend engineer who learns ML serving will outperform an ML specialist who struggles with distributed systems.
Your portfolio should demonstrate three things: you can build high-performance systems in multiple languages, you can deploy and operate ML models on Kubernetes, and you understand the operational realities of running ML in production (monitoring, scaling, incident response).
The demand for ML Backend Engineers will only increase as financial services companies integrate more AI into their core operations. Toss Bank is at the forefront of this trend in Korea. Prepare thoroughly, build real projects, and demonstrate that you can operate at the intersection of backend engineering and machine learning.
Start building. The models are waiting to be served.