Kubernetes ML Model Serving: Complete Analysis of KServe and NVIDIA Triton


1. Why Serve ML Models on Kubernetes

When deploying machine learning models to production, serving them from a single server with Flask or FastAPI is fine for initial prototyping but has fundamental limitations under real production traffic. Kubernetes is the most mature platform for addressing those limitations.

1.1 Scalability

The core strength of Kubernetes is horizontal scaling based on workload. When inference requests spike, additional Pods are added automatically, and when traffic subsides they are removed. With the HPA (Horizontal Pod Autoscaler), scaling can be driven by CPU, memory, or custom metrics (e.g., GPU utilization, request queue length). Furthermore, the Knative-based KPA (Knative Pod Autoscaler) supports Scale-to-Zero, reducing Pods to 0 when there is no traffic at all.

1.2 Resource Management

ML inference workloads use expensive GPU resources. Kubernetes can precisely control the CPU, memory, and GPU used by each Pod through resources.requests and resources.limits. Using the NVIDIA Device Plugin, nvidia.com/gpu resources can be natively scheduled in Kubernetes, and sharing a single GPU among multiple Pods through MIG (Multi-Instance GPU) or Time-Slicing is also possible.

1.3 Operational Reliability

Kubernetes' Self-Healing mechanism enhances the stability of model serving. If a Pod terminates abnormally, it is automatically restarted, and the model server's health is continuously monitored through liveness/readiness probes. Through Rolling Update and Canary Deployment, models can be safely deployed without downtime during updates.


2. KServe Architecture Analysis

KServe (formerly KFServing) is a standardized platform designed for ML model serving on Kubernetes. KServe leverages Kubernetes CRD (Custom Resource Definition) to manage the entire lifecycle of model serving.

2.1 Layered Architecture

According to the KServe official documentation, KServe consists of four major layers.

  • Control Plane (Go): The Kubernetes Controller manages the entire lifecycle of InferenceService. It orchestrates operations such as model deployment, updates, deletion, and scaling.
  • Data Plane (Python): A protocol-agnostic model server framework that handles actual inference requests through V1/V2 inference protocols.
  • Configuration Layer: Manages the entire system configuration through CRDs and ConfigMaps.
  • Storage Layer: Supports various model artifact storage backends including S3, GCS, Azure Blob Storage, PVC, and more.

2.2 InferenceService CRD

InferenceService is the core CRD of KServe, encapsulating the complexity of autoscaling, networking, health checking, and server configuration required for ML model serving. A single YAML definition can declaratively manage everything needed for model serving.

A basic InferenceService definition is as follows.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'

With this single simple YAML, model download, server startup, network endpoint configuration, and autoscaling are all automatically handled.

2.3 Three Core Components: Predictor, Transformer, Explainer

KServe InferenceService consists of three components, of which only Predictor is required and the rest are optional.

Predictor

Predictor is the core workload of InferenceService, consisting of the model and model server. It handles actual inference requests and is exposed externally through network endpoints. KServe provides various Serving Runtimes: TensorFlow Serving, TorchServe, Triton Inference Server, SKLearn, XGBoost, LightGBM, and more are built-in by default.

Transformer

Transformer handles data transformation (pre/post-processing) before and after inference. For example, it performs tasks such as normalizing image inputs, tokenizing text, or converting model outputs into human-readable formats. KServe also provides out-of-the-box Transformers that integrate with Feature Stores like Feast. Custom Transformers are deployed as separate containers and connected to the Predictor through environment variables such as the prediction endpoint.

Explainer

Explainer is an optional component that provides explanations (interpretability) for model predictions. Using XAI (eXplainable AI) techniques such as SHAP and LIME, it explains why the model made a specific prediction. It is served on a separate endpoint from the Predictor's prediction path and is invoked only when explanation requests are made.

Request Flow

When an inference request arrives, it is processed in the following flow.

  1. Client request enters through the Ingress Gateway
  2. Transformer performs input data preprocessing
  3. Preprocessed data is forwarded to the Predictor for inference execution
  4. Transformer performs output data post-processing
  5. Final result is returned to the client

For Explanation requests, the Explainer additionally intervenes to generate model explanations.
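The Transformer → Predictor → Transformer flow above can be sketched as plain functions. This is a toy illustration of the data flow only: in a real deployment the Predictor is a separate container reached over HTTP/gRPC, and the normalization step and stand-in "model" here are made up.

```python
# Toy sketch of the KServe request flow: preprocess -> predict -> postprocess.
# The "model" is a stand-in; KServe runs these steps in separate containers.

def preprocess(instances):
    # Transformer step: e.g. scale raw pixel values into [0, 1]
    return [[v / 255.0 for v in row] for row in instances]

def predict(batch):
    # Predictor step: stand-in model returning the max score per row
    return [max(row) for row in batch]

def postprocess(outputs):
    # Transformer step: convert raw scores into a readable response
    return {"predictions": [round(o, 4) for o in outputs]}

def handle_request(payload):
    return postprocess(predict(preprocess(payload["instances"])))

result = handle_request({"instances": [[0, 128, 255], [64, 32, 16]]})
print(result)  # {'predictions': [1.0, 0.251]}
```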


3. KServe Installation and Basic Usage

3.1 Prerequisites

According to the KServe official QuickStart guide, the following are required.

  • Kubernetes 1.32 or higher
  • Properly configured kubeconfig
  • For local development/testing environments, kind (Kubernetes in Docker) or minikube is recommended

3.2 Quick Install

The quickest way to get started in development and testing environments is to use the Quick Install Script.

# KServe Quick Install (Serverless mode - includes Knative)
curl -s "https://raw.githubusercontent.com/kserve/kserve/master/hack/quick_install.sh" | bash

# Raw Deployment mode (install without Knative)
curl -s "https://raw.githubusercontent.com/kserve/kserve/master/hack/quick_install.sh" | bash -s -- -r

This script automatically handles dependency installation, platform detection, and mode configuration.

3.3 Deployment Modes

KServe supports two major deployment modes.

  • Serverless Mode (Knative): The default installation option, operating based on Knative Serving. It provides serverless features such as Scale-to-Zero, revision-based traffic management, and Canary Deployment.
  • Standard Mode (Raw Deployment): Operates using only Kubernetes native resources (Deployment, Service, HPA) without Knative. It supports both predictive inference and generative inference workloads while minimizing external dependencies.

3.4 First Model Deployment

Once installation is complete, you can deploy a simple SKLearn model as follows.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'
      resources:
        requests:
          cpu: '100m'
          memory: '256Mi'
        limits:
          cpu: '1'
          memory: '512Mi'

# Deploy model
kubectl apply -f sklearn-iris.yaml

# Check status
kubectl get inferenceservice sklearn-iris

# Test inference request
curl -v -H "Content-Type: application/json" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
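The same V1 predict call can be issued from Python with only the standard library. This is a sketch: the host, port, and optional Host header depend on your ingress setup, and the values below are placeholders.

```python
import json
import urllib.request

# Placeholder ingress address -- substitute your cluster's values
INGRESS_HOST = "localhost"
INGRESS_PORT = 8080

def build_predict_request(model_name, instances, service_hostname=None):
    # Build a KServe V1 protocol ":predict" request
    url = f"http://{INGRESS_HOST}:{INGRESS_PORT}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode()
    headers = {"Content-Type": "application/json"}
    if service_hostname:  # needed when routing through a shared ingress gateway
        headers["Host"] = service_hostname
    return urllib.request.Request(url, data=body, headers=headers)

req = build_predict_request("sklearn-iris", [[6.8, 2.8, 4.8, 1.4]])
# resp = urllib.request.urlopen(req)   # uncomment against a live cluster
# print(json.loads(resp.read()))
```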

4. NVIDIA Triton Inference Server Feature Analysis

NVIDIA Triton Inference Server is a high-performance inference server capable of simultaneously serving models from various deep learning and machine learning frameworks. It can be used as a KServe Serving Runtime integration or deployed independently.

4.1 Key Features

According to Triton official documentation, Triton provides the following key features.

  • Multi-Framework Support: Simultaneously serving multiple models from different frameworks on a single server instance
  • Dynamic Batching: Dynamically grouping individual inference requests to maximize GPU utilization
  • Concurrent Model Execution: Running multiple models concurrently on the same GPU
  • Model Ensemble: Connecting multiple models as a pipeline for composite inference
  • Model Analyzer: Automatically exploring optimal deployment configurations

5. Model Format Support

5.1 Triton Backend Architecture

Triton supports various model formats through its Backend plugin architecture. Each model must be associated with exactly one Backend, specified through the backend or platform field in config.pbtxt.

5.2 Supported Frameworks and Model Formats

ONNX Runtime Backend

Executes models in ONNX (Open Neural Network Exchange) format. It supports models converted from various frameworks including PyTorch, TensorFlow, and scikit-learn. It provides optimized inference on both CPU and GPU, leveraging ONNX Runtime's Graph Optimization and Execution Providers.

# ONNX model file example
model_repository/
  resnet50_onnx/
    config.pbtxt
    1/
      model.onnx

TensorRT Backend

Executes models optimized with NVIDIA TensorRT (engine/plan files). TensorRT is an inference optimization engine specialized for NVIDIA GPUs, providing the highest level of GPU inference performance through FP16/INT8 quantization, Layer Fusion, Kernel Auto-Tuning, and more.

# TensorRT model file example
model_repository/
  resnet50_tensorrt/
    config.pbtxt
    1/
      model.plan

PyTorch Backend

Supports both TorchScript format and PyTorch 2.0's torch.compile format. TorchScript models are converted using torch.jit.trace or torch.jit.script, and the same Backend can handle both TorchScript and PyTorch 2.0 models.

# PyTorch model file example
model_repository/
  bert_torchscript/
    config.pbtxt
    1/
      model.pt

TensorFlow Backend

Supports both TensorFlow's SavedModel and GraphDef formats. The same Backend can execute both TensorFlow 1 and TensorFlow 2 models. SavedModel is the recommended format.

# TensorFlow SavedModel example
model_repository/
  resnet50_tf/
    config.pbtxt
    1/
      model.savedmodel/
        saved_model.pb
        variables/

Other Backends

  • OpenVINO Backend: Supports inference optimized for Intel hardware
  • FIL Backend (Forest Inference Library): Supports tree-based models including XGBoost, LightGBM, scikit-learn Random Forest, and cuML Random Forest
  • Python Backend: Enables implementing custom Python code for pre/post-processing pipelines

6. Model Repository Structure and config.pbtxt Configuration

6.1 Directory Structure

The basic layout of the Model Repository as defined in Triton official documentation is as follows.

<model-repository-path>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  <model-name>/
    [config.pbtxt]
    ...

The role of each component is as follows.

  • model-name directory: The top-level directory corresponding to the model name
  • config.pbtxt: Model configuration file in ModelConfig protobuf format
  • version directory: Numeric directory names represent model versions. Directories starting with "0" or with non-numeric names are ignored
  • model-definition-file: Backend-specific model files (model.onnx, model.plan, model.pt, etc.)

A practical example looks like this.

model_repository/
  text_detection/
    config.pbtxt
    1/
      model.onnx
  text_recognition/
    config.pbtxt
    output_labels.txt
    1/
      model.plan
    2/
      model.plan
  ensemble_ocr/
    config.pbtxt
    1/
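The version-directory rule above (numeric names only; names that are non-numeric or start with "0" are ignored) can be mimicked with a small script. This is a sketch of the rule as described, not Triton's actual loader code.

```python
import os
import tempfile

def valid_versions(model_dir):
    # Keep only numeric directory names that do not start with "0",
    # mirroring Triton's version-directory rule described above.
    versions = []
    for entry in os.listdir(model_dir):
        path = os.path.join(model_dir, entry)
        if os.path.isdir(path) and entry.isdigit() and not entry.startswith("0"):
            versions.append(int(entry))
    return sorted(versions)

# Build a throwaway model directory to demonstrate the rule
with tempfile.TemporaryDirectory() as repo:
    model = os.path.join(repo, "text_recognition")
    for name in ["1", "2", "07", "tmp"]:
        os.makedirs(os.path.join(model, name))
    print(valid_versions(model))  # [1, 2]  ("07" and "tmp" are ignored)
```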

6.2 Detailed config.pbtxt Configuration

config.pbtxt is a text representation of the ModelConfig protobuf message. At minimum, it should include name, backend (or platform), max_batch_size, input, and output. However, for some models, Triton can automatically generate the minimum configuration.

name: "resnet50"
backend: "onnxruntime"
max_batch_size: 8

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "labels.txt"
  }
]

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

The meaning of each item in the above example is as follows.

  • name: Model name (must match the directory name)
  • backend: Backend to use ("onnxruntime", "tensorrt", "pytorch", "tensorflow", etc.)
  • max_batch_size: Maximum batch size for dynamic batching. Setting to 0 disables batching
  • input/output: Input/output tensor names, data types, and dimensions of the model
  • instance_group: Number of model instances and assigned GPU specification
  • dynamic_batching: Dynamic batching configuration
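Since config.pbtxt is plain text, the fields above can be generated programmatically, which is handy when registering many similar models. The helper below renders only the subset of fields discussed here; it is a sketch, not a full ModelConfig serializer.

```python
# Render a minimal config.pbtxt from parameters. The field names follow
# Triton's ModelConfig protobuf, but only a simplified subset is emitted.

def render_config(name, backend, max_batch_size, inputs, outputs):
    def tensor(block, t):
        return (f'{block} [\n  {{\n    name: "{t["name"]}"\n'
                f'    data_type: {t["dtype"]}\n'
                f'    dims: [ {", ".join(map(str, t["dims"]))} ]\n  }}\n]')
    parts = [f'name: "{name}"',
             f'backend: "{backend}"',
             f'max_batch_size: {max_batch_size}']
    parts += [tensor("input", t) for t in inputs]
    parts += [tensor("output", t) for t in outputs]
    return "\n".join(parts)

cfg = render_config(
    "resnet50", "onnxruntime", 8,
    inputs=[{"name": "input", "dtype": "TYPE_FP32", "dims": [3, 224, 224]}],
    outputs=[{"name": "output", "dtype": "TYPE_FP32", "dims": [1000]}],
)
print(cfg)
```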

7. Dynamic Batching and Concurrent Model Execution

7.1 Dynamic Batching

Dynamic Batching is one of Triton's most powerful performance optimization features. It significantly improves throughput by grouping individually arriving inference requests into a single batch and sending them to the GPU.

According to Triton official documentation, Dynamic Batching can be enabled and configured independently per model. By default, the dynamic batcher forms batches from whichever requests are already waiting, without delay, but users can configure a bounded wait time during which the scheduler collects additional requests.

dynamic_batching {
  # Preferred batch sizes (batch to these sizes when possible)
  preferred_batch_size: [ 4, 8 ]

  # Maximum time to wait to form a batch (microseconds)
  max_queue_delay_microseconds: 100

  # Priority settings (optional)
  priority_levels: 2
  default_priority_level: 1

  # Queue policy (optional)
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 5000000
    allow_timeout_override: true
    max_queue_size: 100
  }
}

The roles of key parameters are as follows.

  • preferred_batch_size: Batch sizes that Triton preferentially tries to form. When a batch of this size is formed, it executes immediately.
  • max_queue_delay_microseconds: Maximum time to wait for additional requests to fill the batch. When this time expires, it executes immediately with the currently collected requests. This is the key parameter for adjusting the trade-off between latency and throughput.
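The interplay of these two parameters can be illustrated with a toy simulation: requests arriving over time are dispatched either when a preferred batch size is reached or when the oldest queued request has waited too long. This is a simplified model of the trade-off, not Triton's actual scheduler.

```python
# Toy dynamic-batching simulation. Returns the sizes of the batches formed
# for a sequence of request arrival times (in microseconds).

def form_batches(arrival_times_us, preferred_batch_size, max_queue_delay_us):
    batches, queue, queue_start = [], [], None
    for t in sorted(arrival_times_us):
        if queue and t - queue_start >= max_queue_delay_us:
            batches.append(queue)          # delay expired: flush what we have
            queue, queue_start = [], None
        if queue_start is None:
            queue_start = t
        queue.append(t)
        if len(queue) == preferred_batch_size:
            batches.append(queue)          # preferred size reached: execute now
            queue, queue_start = [], None
    if queue:
        batches.append(queue)              # flush the remainder
    return [len(b) for b in batches]

# 6 requests in a burst, then 1 straggler 500us later
print(form_batches([0, 10, 20, 30, 40, 50, 550], 4, 100))  # [4, 2, 1]
```

A larger `max_queue_delay_microseconds` lets more stragglers join a batch (better throughput) at the cost of added tail latency; the simulation makes that trade-off easy to eyeball.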

7.2 Concurrent Model Execution

Triton can run multiple model instances concurrently on the same GPU. This allows more efficient utilization of GPU computational resources.

# Place 2 model instances on a single GPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Distribute instances across multiple GPUs
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1, 2 ]
  }
]

# Place instances on both CPU and GPU simultaneously
instance_group [
  {
    count: 1
    kind: KIND_CPU
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

The formula for maximum throughput presented in Triton official documentation is as follows.

Optimal Concurrency = 2 x <max_batch_size> x <instance_count>

For example, if max_batch_size is 8 and instance_count is 2, the optimal concurrency is 2 x 8 x 2 = 32. This means 32 concurrent requests need to be sent to maximize Triton's performance.


8. GPU Resource Allocation

8.1 NVIDIA Device Plugin for Kubernetes

The NVIDIA Device Plugin is a DaemonSet that enables managing GPUs as native resources in Kubernetes clusters. Once installed, it automatically detects GPUs on each node and registers them as nvidia.com/gpu resources.

# Install NVIDIA Device Plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

How to request GPUs in a Pod is as follows.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: inference
      image: nvcr.io/nvidia/tritonserver:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

8.2 MIG (Multi-Instance GPU)

MIG is a feature supported by the latest GPUs such as NVIDIA A100 and H100, which partitions a single physical GPU into multiple independent GPU instances. Each instance has memory and computational resources isolated at the hardware level, guaranteeing fault isolation.

The key characteristics of MIG are as follows.

  • Hardware-level isolation: Each MIG instance has independent memory, cache, and compute units
  • Performance guarantee: Workloads on one instance do not affect other instances
  • Predefined profiles: Available partition configurations are predetermined based on the GPU model (e.g., splitting an A100 80GB into 7 instances of 10GB)

To use MIG in Kubernetes, the MIG strategy is configured through the NVIDIA GPU Operator.

# MIG profile configuration example (GPU Operator ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-3g.40gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2

8.3 GPU Time-Slicing

Time-Slicing is a method that enables GPU sharing even on older generation GPUs that do not support MIG. According to the NVIDIA GPU Operator official documentation, Time-Slicing allows system administrators to define GPU replicas that can be independently allocated to each Pod.

The key characteristics are as follows.

  • Software-level sharing: Distributes GPU time evenly for multiple Pods to share
  • No memory/fault isolation: Unlike MIG, memory and fault isolation are not provided
  • Flexible partitioning: Allows more users to share GPUs
  • Combinable with MIG: Time-Slicing can be additionally applied to MIG instances

# Time-Slicing configuration (GPU Operator ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

The above configuration exposes each GPU as 4 replicas, so when 4 Pods each request nvidia.com/gpu: 1, they share the same physical GPU through time-slicing.


9. Auto-scaling Strategies

9.1 HPA (Horizontal Pod Autoscaler)

In KServe's Standard Mode (Raw Deployment), the default Kubernetes HPA is used. It automatically adjusts the number of Pods based on CPU, memory, or custom metrics.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
  annotations:
    serving.kserve.io/autoscalerClass: 'hpa'
    serving.kserve.io/targetUtilizationPercentage: '80'
    serving.kserve.io/minReplicas: '1'
    serving.kserve.io/maxReplicas: '10'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'

9.2 KPA (Knative Pod Autoscaler)

In KServe's Serverless Mode (Knative), KPA is used. The most notable feature of KPA is Scale-to-Zero, which can reduce Pods to 0 when there is no traffic to save resources. HPA and KPA cannot be used simultaneously on the same service, and annotations specify which Autoscaler to use.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
  annotations:
    autoscaling.knative.dev/class: 'kpa.autoscaling.knative.dev'
    autoscaling.knative.dev/metric: 'concurrency'
    autoscaling.knative.dev/target: '10'
    autoscaling.knative.dev/minScale: '0'
    autoscaling.knative.dev/maxScale: '10'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'

The key metrics of KPA are as follows.

  • concurrency: Number of concurrent requests per Pod (default)
  • rps: Requests per second per Pod
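The core of the KPA scaling decision can be written down in simplified form: desired replicas are the observed total concurrency divided by the per-Pod target, clamped to minScale/maxScale. The real autoscaler adds stable/panic windows and rate limiting; this sketch shows only the central ratio, for intuition.

```python
import math

def desired_pods(total_concurrency, target_per_pod, min_scale=0, max_scale=10):
    # Simplified KPA-style decision: concurrency / target, clamped
    if total_concurrency <= 0:
        return max(min_scale, 0)           # scale-to-zero when minScale is 0
    desired = math.ceil(total_concurrency / target_per_pod)
    return max(min_scale, min(desired, max_scale))

print(desired_pods(35, 10))   # 4
print(desired_pods(0, 10))    # 0  (scale-to-zero)
print(desired_pods(500, 10))  # 10 (capped at maxScale)
```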

9.3 GPU Metric-based Auto-scaling

For GPU workloads, CPU/memory-based scaling alone is insufficient. By combining the NVIDIA DCGM (Data Center GPU Manager) Exporter with the Prometheus Adapter, GPU metric-based scaling becomes possible.

# GPU metric-based scaling using KEDA ScaledObject
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-gpu-scaler
spec:
  scaleTargetRef:
    name: triton-inference-server
  pollingInterval: 15
  cooldownPeriod: 60
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: gpu_utilization
        query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-.*"})
        threshold: '80'

The KServe Controller Manager supports KEDA (Kubernetes Event-Driven Autoscaling) alongside HPA, enabling custom metric-based scaling configurations.


10. Canary / Blue-Green Deployment

10.1 Canary Rollout

According to KServe official documentation, the Canary Rollout strategy is supported in Serverless Deployment Mode. KServe automatically tracks the last known good revision that received 100% traffic, and when the canaryTrafficPercent field is set, it automatically distributes traffic between the new version and the stable version.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'my-model'
  annotations:
    serving.kserve.io/enable-tag-routing: 'true'
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: tensorflow
      storageUri: 'gs://kfserving-examples/models/tensorflow/flowers-2'

In the above configuration, 10% of traffic is directed to the new model version, and the remaining 90% is routed to the previous stable version. Once the new version's performance is verified, canaryTrafficPercent is gradually increased to eventually reach 100%.

The typical Canary Rollout workflow is as follows.

  1. Set canaryTrafficPercent: 10 when deploying a new model version
  2. Monitor the new version's latency, error rate, and other health signals
  3. If no issues appear, gradually increase canaryTrafficPercent from 20 to 50 to 100
  4. If issues occur, roll back immediately by setting canaryTrafficPercent to 0
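The effect of a percentage split can be illustrated by deterministically hashing each request id into a bucket in [0, 100). This only mirrors the outcome of canaryTrafficPercent for intuition; KServe itself splits traffic at the Knative routing layer, not in client code.

```python
import hashlib

def route(request_id, canary_traffic_percent):
    # Hash the request id into a stable bucket 0..99 and compare to the split
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_traffic_percent else "stable"

hits = [route(f"req-{i}", 10) for i in range(1000)]
share = hits.count("canary") / len(hits)
print(f"canary share: {share:.1%}")  # close to 10%
```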

10.2 Blue-Green Deployment

In Blue-Green Deployment, two complete environments (Blue: current production, Green: new version) are operated simultaneously, and traffic is switched all at once after verification.

In KServe, Blue-Green Deployment can be implemented by setting canaryTrafficPercent: 0 to deploy the new version without sending traffic, then switching to 100 after verification. Additionally, using the serving.kserve.io/enable-tag-routing: "true" annotation enables tag-based routing to directly call and test specific versions.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'my-model'
  annotations:
    serving.kserve.io/enable-tag-routing: 'true'
spec:
  predictor:
    # Deploy new version with 0% traffic (Green)
    canaryTrafficPercent: 0
    model:
      modelFormat:
        name: tensorflow
      storageUri: 'gs://kfserving-examples/models/tensorflow/flowers-v2'

When tag routing is enabled, you can directly call and test the new version with the latest tag and the previous version with the prev tag. Once verification is complete, set canaryTrafficPercent to 100 to switch all traffic to the new version.


11. Prometheus + Grafana Monitoring Setup

11.1 Metrics Collection Architecture

Monitoring is essential for production operations of ML model serving. KServe natively supports metrics collection through Prometheus and visualization through Grafana.

The metrics collection path is as follows.

  1. KServe Serving Container: Each model server (Triton, TorchServe, etc.) exposes its own metrics
  2. Knative Queue Proxy: In Serverless Mode, the queue-proxy container automatically generates request metrics
  3. Prometheus: Dynamically discovers and scrapes Pod metric endpoints through ServiceMonitor
  4. Grafana: Visualizes dashboards using Prometheus as a data source

11.2 Prometheus Configuration

Using the Prometheus Operator and ServiceMonitor, KServe Pod metrics can be automatically collected.

# Prometheus ServiceMonitor for KServe
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kserve-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: 'my-model'
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default

Triton exposes Prometheus metrics on port 8002 by default. The key metrics are as follows.

  • nv_inference_request_success: Number of successful inference requests
  • nv_inference_request_failure: Number of failed inference requests
  • nv_inference_count: Total number of inference executions
  • nv_inference_exec_count: Total number of inference batch executions
  • nv_inference_request_duration_us: Inference request processing time (microseconds)
  • nv_inference_queue_duration_us: Time requests spent waiting in the queue
  • nv_inference_compute_input_duration_us: Input preprocessing time
  • nv_inference_compute_infer_duration_us: Actual inference time
  • nv_inference_compute_output_duration_us: Output post-processing time
  • nv_gpu_utilization: GPU utilization rate
  • nv_gpu_memory_used_bytes: GPU memory usage
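Because these counters are cumulative, average per-request latency falls out of a ratio of two of them. The snippet below parses a Prometheus text exposition and computes it; nv_inference_request_duration_us and nv_inference_request_success are real Triton metric names, but the sample text is fabricated for illustration.

```python
# Fabricated sample of Triton's Prometheus exposition format
SAMPLE = """\
nv_inference_request_success{model="resnet50",version="1"} 1200
nv_inference_request_duration_us{model="resnet50",version="1"} 3600000
"""

def parse_metrics(text):
    # Minimal exposition-format parser: "name{labels} value" per line.
    # Label sets are ignored, so this assumes one series per metric name.
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_labels, value = line.rsplit(" ", 1)
        metrics[name_labels.split("{", 1)[0]] = float(value)
    return metrics

m = parse_metrics(SAMPLE)
avg_ms = m["nv_inference_request_duration_us"] / m["nv_inference_request_success"] / 1000
print(f"avg latency: {avg_ms:.1f} ms")  # avg latency: 3.0 ms
```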

11.3 Grafana Dashboards

The KServe official documentation provides template dashboards for each Serving Runtime.

KServe ModelServer Latency Dashboard

A dashboard for default KServe ModelServer (sklearn, xgboost, lgb, paddle, pmml, custom) runtimes that visualizes latency in milliseconds for each stage: pre/post-process, predict, and explain.

KServe Triton Latency Dashboard

A Triton-specific dashboard that provides five latency graphs.

  • Input (preprocess) latency
  • Infer (predict) latency
  • Output (postprocess) latency
  • Internal queue latency
  • Total latency

It also includes a percentage gauge for GPU memory usage.

KServe TorchServe Latency Dashboard

A TorchServe-specific dashboard that visualizes ts_inference_latency_microseconds and ts_queue_latency_microseconds metrics in milliseconds.

11.4 GPU Monitoring (DCGM Exporter)

For GPU-specific monitoring, deploying the NVIDIA DCGM Exporter additionally enables collecting detailed GPU metrics.

# Install DCGM Exporter (Helm)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true

The key GPU metrics provided by DCGM Exporter are as follows.

  • DCGM_FI_DEV_GPU_UTIL: GPU core utilization (%)
  • DCGM_FI_DEV_MEM_COPY_UTIL: GPU memory bandwidth utilization (%)
  • DCGM_FI_DEV_FB_USED: Framebuffer usage (MB)
  • DCGM_FI_DEV_FB_FREE: Framebuffer free space (MB)
  • DCGM_FI_DEV_GPU_TEMP: GPU temperature (°C)
  • DCGM_FI_DEV_POWER_USAGE: Power consumption (W)
  • DCGM_FI_DEV_SM_CLOCK: SM clock speed (MHz)
  • DCGM_FI_PROF_GR_ENGINE_ACTIVE: Graphics Engine active ratio

By comprehensively visualizing these metrics on Grafana dashboards, you can identify model-specific inference performance, GPU resource utilization, and bottleneck areas in real-time for optimization purposes.


12. Summary

ML model serving in Kubernetes environments requires a systematic approach to simultaneously achieve scalability, resource efficiency, and operational reliability beyond simple deployment. KServe abstracts the complexity of model serving through the InferenceService CRD and enables flexible inference pipelines with the Predictor/Transformer/Explainer structure. NVIDIA Triton Inference Server maximizes GPU utilization through multi-framework support, Dynamic Batching, and Concurrent Model Execution.

For GPU resource management, the NVIDIA Device Plugin serves as the foundation, with MIG and Time-Slicing enabling efficient sharing of expensive GPU resources. Auto-scaling should be selected among HPA, KPA, and KEDA based on workload characteristics, and Canary/Blue-Green Deployment ensures safe model updates. Finally, the Prometheus + Grafana + DCGM Exporter combination should be used for comprehensive monitoring of model performance and GPU resources.

By organically combining all these components, you can build a production-grade model serving platform capable of reliably and efficiently operating large-scale ML workloads.
