Kubernetes ML Model Serving: Complete Analysis of KServe and NVIDIA Triton

1. Why Serve ML Models on Kubernetes
2. KServe Architecture Analysis
3. KServe Installation and Basic Usage
4. NVIDIA Triton Inference Server Feature Analysis
- 4.1 Key Features
5. Model Format Support
- 5.1 Triton Backend Architecture
- 5.2 Supported Frameworks and Model Formats
6. Model Repository Structure and config.pbtxt Configuration
- 6.1 Directory Structure
- 6.2 Detailed config.pbtxt Configuration
7. Dynamic Batching and Concurrent Model Execution
- 7.1 Dynamic Batching
- 7.2 Concurrent Model Execution
8. GPU Resource Allocation
9. Auto-scaling Strategies
10. Canary / Blue-Green Deployment
- 10.1 Canary Rollout
- 10.2 Blue-Green Deployment
11. Prometheus + Grafana Monitoring Setup
12. Summary
References

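1. Why Serve ML Models on Kubernetes

When deploying machine learning models to production, serving models on a single server with Flask or FastAPI is suitable for initial prototyping, but has fundamental limitations for handling real production traffic. Kubernetes is the most mature platform for addressing these limitations.

1.1 Scalability

The core strength of Kubernetes is horizontal scaling based on workload. When inference requests spike, Pods can be automatically scaled up, and when traffic decreases, they can be scaled back down. Using HPA (Horizontal Pod Autoscaler), automatic scaling based on CPU, memory, or custom metrics (e.g., GPU utilization, request queue length) is possible. Furthermore, using Knative-based KPA (Knative Pod Autoscaler), Scale-to-Zero is also possible, reducing Pods to 0 when there is no traffic at all.

1.2 Resource Management

ML inference workloads use expensive GPU resources. Kubernetes can precisely control the CPU, memory, and GPU used by each Pod through resources.requests and resources.limits. Using the NVIDIA Device Plugin, nvidia.com/gpu resources can be natively scheduled in Kubernetes, and sharing a single GPU among multiple Pods through MIG (Multi-Instance GPU) or Time-Slicing is also possible.

1.3 Operational Reliability

Kubernetes' Self-Healing mechanism enhances the stability of model serving. If a Pod terminates abnormally, it is automatically restarted, and the model server's health is continuously monitored through liveness/readiness probes. Through Rolling Update and Canary Deployment, models can be safely deployed without downtime during updates.

2. KServe Architecture Analysis

KServe (formerly KFServing) is a standardized platform designed for ML model serving on Kubernetes. KServe leverages Kubernetes CRD (Custom Resource Definition) to manage the entire lifecycle of model serving.

2.1 Layered Architecture

According to the KServe official documentation, KServe consists of four major layers.

Control Plane (Go): The Kubernetes Controller manages the entire lifecycle of InferenceService. It orchestrates operations such as model deployment, updates, deletion, and scaling.
Data Plane (Python): A protocol-agnostic model server framework that handles actual inference requests through V1/V2 inference protocols.
Configuration Layer: Manages the entire system configuration through CRDs and ConfigMaps.
Storage Layer: Supports various model artifact storage backends including S3, GCS, Azure Blob Storage, PVC, and more.

2.2 InferenceService CRD

InferenceService is the core CRD of KServe, encapsulating the complexity of autoscaling, networking, health checking, and server configuration required for ML model serving. A single YAML definition can declaratively manage everything needed for model serving.

A basic InferenceService definition is as follows.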

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'

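With this single simple YAML, model download, server startup, network endpoint configuration, and autoscaling are all automatically handled.

2.3 Three Core Components: Predictor, Transformer, Explainer

KServe InferenceService consists of three components, of which only Predictor is required and the rest are optional.

Predictor

Predictor is the core workload of InferenceService, consisting of the model and model server. It handles actual inference requests and is exposed externally through network endpoints. KServe provides various Serving Runtimes: TensorFlow Serving, TorchServe, Triton Inference Server, SKLearn, XGBoost, LightGBM, and more are built-in by default.

Transformer

Transformer handles data transformation (pre/post-processing) before and after inference. For example, it performs tasks such as normalizing image inputs, tokenizing text, or converting model outputs into human-readable formats. KServe also provides out-of-the-box Transformers that integrate with Feature Stores like Feast. Custom Transformers are deployed as separate containers and are told where to reach the Predictor through configuration such as an environment variable holding the prediction endpoint.

Explainer

Explainer is an optional component that provides explanations (interpretability) for model predictions. It provides explanations of why the model made specific predictions using XAI (eXplainable AI) techniques such as SHAP and LIME. It operates on a separate data plane from the Predictor's inference results and is activated only when explanation requests are made.

Request Flow

When an inference request arrives, it is processed in the following flow.

Client request enters through the Ingress Gateway
Transformer performs input data preprocessing
Preprocessed data is forwarded to the Predictor for inference execution
Transformer performs output data post-processing
Final result is returned to the client

For Explanation requests, the Explainer additionally intervenes to generate model explanations.

3. KServe Installation and Basic Usage

3.1 Prerequisites

According to the KServe official QuickStart guide, the following are required.

Kubernetes 1.32 or higher
Properly configured kubeconfig
For local development/testing environments, kind (Kubernetes in Docker) or minikube is recommended

3.2 Quick Install

The quickest way to get started in development and testing environments is to use the Quick Install Script.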

# KServe Quick Install (Serverless mode - includes Knative)
curl -s "https://raw.githubusercontent.com/kserve/kserve/master/hack/quick_install.sh" | bash

# Raw Deployment mode (install without Knative)
curl -s "https://raw.githubusercontent.com/kserve/kserve/master/hack/quick_install.sh" | bash -s -- -r

This script automatically handles dependency installation, platform detection, and mode configuration.

3.3 Deployment Modes

KServe supports two major deployment modes.

Serverless Mode (Knative): The default installation option, operating based on Knative Serving. It provides serverless features such as Scale-to-Zero, revision-based traffic management, and Canary Deployment.
Standard Mode (Raw Deployment): Operates using only Kubernetes native resources (Deployment, Service, HPA) without Knative. It supports both predictive inference and generative inference workloads while minimizing external dependencies.

3.4 First Model Deployment

Once installation is complete, you can deploy a simple SKLearn model as follows.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'
      resources:
        requests:
          cpu: '100m'
          memory: '256Mi'
        limits:
          cpu: '1'
          memory: '512Mi'

# Deploy model
kubectl apply -f sklearn-iris.yaml

# Check status
kubectl get inferenceservice sklearn-iris

# Test inference request
curl -v -H "Content-Type: application/json" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'

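4. NVIDIA Triton Inference Server Feature Analysis

NVIDIA Triton Inference Server is a high-performance inference server capable of simultaneously serving models from various deep learning and machine learning frameworks. It can be used as a KServe Serving Runtime integration or deployed independently.

4.1 Key Features

According to Triton official documentation, Triton provides the following key features.

Multi-Framework Support: Simultaneously serving multiple models from different frameworks on a single server instance
Dynamic Batching: Dynamically grouping individual inference requests to maximize GPU utilization
Concurrent Model Execution: Running multiple models concurrently on the same GPU
Model Ensemble: Connecting multiple models as a pipeline for composite inference
Model Analyzer: Automatically exploring optimal deployment configurations

5. Model Format Support

5.1 Triton Backend Architecture

Triton supports various model formats through its Backend plugin architecture. Each model must be associated with exactly one Backend, specified through the backend or platform field in config.pbtxt.

5.2 Supported Frameworks and Model Formats

ONNX Runtime Backend

Executes models in ONNX (Open Neural Network Exchange) format. It supports models converted from various frameworks including PyTorch, TensorFlow, and scikit-learn. It provides optimized inference on both CPU and GPU, leveraging ONNX Runtime's Graph Optimization and Execution Providers.

# ONNX model file example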
model_repository/
  resnet50_onnx/
    config.pbtxt
    1/
      model.onnx

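TensorRT Backend

Executes models optimized with NVIDIA TensorRT (engine/plan files). TensorRT is an inference optimization engine specialized for NVIDIA GPUs, providing the highest level of GPU inference performance through FP16/INT8 quantization, Layer Fusion, Kernel Auto-Tuning, and more.

# TensorRT model file example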
model_repository/
  resnet50_tensorrt/
    config.pbtxt
    1/
      model.plan

PyTorch Backend

Supports both TorchScript format and PyTorch 2.0's torch.compile format. TorchScript models are converted using torch.jit.trace or torch.jit.script, and the same Backend can handle both TorchScript and PyTorch 2.0 models.

# PyTorch model file example
model_repository/
  bert_torchscript/
    config.pbtxt
    1/
      model.pt

TensorFlow Backend

Supports both TensorFlow's SavedModel and GraphDef formats. The same Backend can execute both TensorFlow 1 and TensorFlow 2 models. SavedModel is the recommended format.

# TensorFlow SavedModel example
model_repository/
  resnet50_tf/
    config.pbtxt
    1/
      model.savedmodel/
        saved_model.pb
        variables/

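Other Backends

OpenVINO Backend: Supports inference optimized for Intel hardware
FIL Backend (Forest Inference Library): Supports tree-based models including XGBoost, LightGBM, scikit-learn Random Forest, and cuML Random Forest
Python Backend: Enables implementing custom Python code for pre/post-processing pipelines

6. Model Repository Structure and config.pbtxt Configuration

6.1 Directory Structure

The basic layout of the Model Repository as defined in Triton official documentation is as follows.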

<model-repository-path>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  <model-name>/
    [config.pbtxt]
    ...

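The role of each component is as follows.

model-name directory: The top-level directory corresponding to the model name
config.pbtxt: Model configuration file in ModelConfig protobuf format
version directory: Numeric directory names represent model versions. Directories starting with "0" or with non-numeric names are ignored
model-definition-file: Backend-specific model files (model.onnx, model.plan, model.pt, etc.)

A practical example looks like this.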

model_repository/
  text_detection/
    config.pbtxt
    1/
      model.onnx
  text_recognition/
    config.pbtxt
    output_labels.txt
    1/
      model.plan
    2/
      model.plan
  ensemble_ocr/
    config.pbtxt
    1/

6.2 Detailed config.pbtxt Configuration

config.pbtxt is a text representation of the ModelConfig protobuf message. At minimum, it should include name, backend (or platform), max_batch_size, input, and output. However, for some models, Triton can automatically generate the minimum configuration.

name: "resnet50"
backend: "onnxruntime"
max_batch_size: 8

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "labels.txt"
  }
]

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

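The meaning of each item in the above example is as follows.

name: Model name (must match the directory name)
backend: Backend to use ("onnxruntime", "tensorrt", "pytorch", "tensorflow", etc.)
max_batch_size: Maximum batch size the model supports; the dynamic batcher never forms a batch larger than this. A value of 0 means the model does not support batching
input/output: Input/output tensor names, data types, and dimensions of the model
instance_group: Number of model instances and assigned GPU specification
dynamic_batching: Dynamic batching settings

7. Dynamic Batching and Concurrent Model Execution

7.1 Dynamic Batching

Dynamic Batching is one of Triton's most powerful performance optimization features. It combines individually arriving inference requests into a single batch before passing them to the GPU, which significantly improves throughput.

According to the Triton official documentation, Dynamic Batching can be enabled and configured independently for each model. By default, Triton batches whatever requests have already arrived, without any added delay, but users can configure a bounded wait time so that the scheduler can collect more requests.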

dynamic_batching {
  # Preferred batch sizes (batch to one of these sizes when possible)
  preferred_batch_size: [ 4, 8 ]

  # Maximum time to wait while forming a batch (microseconds)
  max_queue_delay_microseconds: 100

  # Priority settings (optional)
  priority_levels: 2
  default_priority_level: 1

  # Queue policy (optional)
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 5000000
    allow_timeout_override: true
    max_queue_size: 100
  }
}

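The key parameters work as follows.

preferred_batch_size: The batch sizes that Triton tries to form first. Once a batch of one of these sizes has been formed, it is executed immediately.
max_queue_delay_microseconds: The maximum time to wait for additional requests in order to fill a batch. Once this time elapses, the requests collected so far are executed right away. This is the key parameter for tuning the trade-off between latency and throughput.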
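
The interaction between the two parameters can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not Triton's actual scheduler: every request carries a single item, a model instance is always free, and a batch is dispatched either when the queue length equals one of the preferred sizes or when the oldest queued request has waited longer than the delay. The function name simulate_batches and the arrival times are hypothetical.

```python
def simulate_batches(arrivals_us, preferred_sizes, max_queue_delay_us):
    """Group request arrival times (microseconds) into batches (toy model)."""
    batches = []
    pending = []  # arrival times of queued requests, oldest first
    for t in sorted(arrivals_us):
        # The oldest queued request has waited longer than the allowed delay:
        # dispatch whatever has been collected so far.
        if pending and t - pending[0] > max_queue_delay_us:
            batches.append(pending)
            pending = []
        pending.append(t)
        # A preferred batch size has been reached: dispatch immediately.
        if len(pending) in preferred_sizes:
            batches.append(pending)
            pending = []
    if pending:  # the tail is dispatched once its delay expires
        batches.append(pending)
    return batches


# Four requests arrive close together, then two, then a single straggler.
arrivals = [0, 10, 20, 30, 500, 520, 900]
sizes = [len(b) for b in simulate_batches(arrivals, [4, 8], 100)]
print(sizes)  # prints [4, 2, 1]
```

In this toy model, raising max_queue_delay_us lets sparse traffic accumulate into larger batches at the cost of extra waiting time for the first request in each batch.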

7.2 Concurrent Model Execution

Triton can run multiple model instances concurrently on the same GPU, which makes more efficient use of the GPU's compute resources.

# Place 2 model instances on a single GPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Distribute instances across multiple GPUs (count applies to each listed GPU)
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1, 2 ]
  }
]

# Place instances on both CPU and GPU
instance_group [
  {
    count: 1
    kind: KIND_CPU
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

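The Triton official documentation gives the following concurrency formula for maximum throughput.

Optimal concurrency = 2 x <max_batch_size> x <instance_count>

For example, if max_batch_size is 8 and instance_count is 2, the optimal concurrency is 2 x 8 x 2 = 32. In other words, about 32 concurrent requests must be in flight to get the most out of Triton.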
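
As a worked example of the formula, the sketch below counts the instances described by an instance_group list and derives the suggested client concurrency. It assumes that count applies to each GPU listed in gpus for KIND_GPU groups, and it requires gpus to be listed explicitly because the sketch cannot know how many GPUs are visible to the server. The InstanceGroup class and both function names are hypothetical helpers, not part of any Triton API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstanceGroup:
    """Mirror of one instance_group entry in config.pbtxt (illustrative only)."""
    count: int
    kind: str = "KIND_GPU"
    gpus: List[int] = field(default_factory=list)


def total_instances(groups):
    """Count model instances, treating `count` as per-GPU for KIND_GPU groups."""
    total = 0
    for group in groups:
        if group.kind == "KIND_GPU":
            if not group.gpus:
                raise ValueError("this sketch needs an explicit gpus list")
            total += group.count * len(group.gpus)
        else:
            total += group.count
    return total


def suggested_concurrency(max_batch_size, instance_count):
    """Concurrency = 2 x max_batch_size x instance_count."""
    return 2 * max_batch_size * instance_count


# Second example above: 1 instance on GPU 0, 2 instances on each of GPUs 1 and 2.
groups = [InstanceGroup(count=1, gpus=[0]), InstanceGroup(count=2, gpus=[1, 2])]
print(total_instances(groups))       # prints 5
print(suggested_concurrency(8, 2))   # prints 32
```

The result is a starting point for load tests rather than a guarantee; measuring with a load generator is still needed to confirm where throughput saturates.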

8. GPU Resource Allocation

8.1 NVIDIA Device Plugin for Kubernetes

The NVIDIA Device Plugin is a DaemonSet that lets a Kubernetes cluster manage GPUs as a native resource. Once installed, it automatically discovers the GPUs on each node and registers them as the nvidia.com/gpu resource.

# Install the NVIDIA Device Plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

A Pod requests a GPU as follows.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: inference
      image: nvcr.io/nvidia/tritonserver:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

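8.2 MIG (Multi-Instance GPU)

MIG is a feature supported on recent GPUs such as the NVIDIA A100 and H100 that partitions one physical GPU into multiple independent GPU instances. Each instance has its memory and compute resources isolated at the hardware level, which provides fault isolation.

The main characteristics of MIG are as follows.

Hardware-level isolation: Each MIG instance has its own memory, cache, and compute units
Performance guarantee: A workload on one instance does not affect other instances
Predefined profiles: The available partition layouts are fixed per GPU model (e.g., splitting an A100 80GB into seven 10GB instances)

To use MIG on Kubernetes, configure the MIG strategy through the NVIDIA GPU Operator.

# MIG profile configuration example (GPU Operator ConfigMap)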
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-3g.40gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2

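8.3 GPU Time-Slicing

Time-Slicing is a method that enables GPU sharing even on older-generation GPUs that do not support MIG. According to the NVIDIA GPU Operator official documentation, Time-Slicing lets a system administrator define replicas of a GPU, each of which can be assigned independently to a Pod.

Its key characteristics are as follows.

Software-level sharing: GPU time is divided equally so that multiple Pods can share the GPU
No memory/fault isolation: Unlike MIG, memory and fault isolation are not provided
Flexible partitioning: A larger number of users can share a GPU
Can be combined with MIG: Time-Slicing can be applied on top of MIG instances

# Time-Slicing configuration (GPU Operator ConfigMap)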
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

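With the above configuration, each GPU is exposed as 4 replicas, so 4 Pods that each request nvidia.com/gpu: 1 can share the same physical GPU through time-slicing.

9. Auto-scaling Strategies

9.1 HPA (Horizontal Pod Autoscaler)

In KServe's Standard Mode (Raw Deployment), the default Kubernetes HPA is used. It automatically adjusts the number of Pods based on CPU, memory, or custom metrics.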

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
  annotations:
    serving.kserve.io/autoscalerClass: 'hpa'
    serving.kserve.io/targetUtilizationPercentage: '80'
spec:
  predictor:
    # Replica bounds are fields of the predictor spec, not annotations
    minReplicas: 1
    maxReplicas: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'
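
How a utilization target turns into a replica count can be sketched with the scaling rule described in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas x currentMetric / targetMetric). The sketch below is a simplification: it ignores the HPA tolerance band, stabilization windows, and unready Pods, and the function name hpa_desired_replicas is hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas, max_replicas):
    """Simplified HPA rule: scale proportionally to metric / target, then clamp."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


# 2 replicas averaging 95% CPU against an 80% target -> scale out to 3.
print(hpa_desired_replicas(2, 95, 80, min_replicas=1, max_replicas=10))   # prints 3
# 8 replicas at 160% of their request would need 16, but maxReplicas caps at 10.
print(hpa_desired_replicas(8, 160, 80, min_replicas=1, max_replicas=10))  # prints 10
```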

9.2 KPA (Knative Pod Autoscaler)

In KServe's Serverless Mode (Knative), KPA is used. KPA's most notable feature is Scale-to-Zero: when there is no traffic, Pods are scaled down to 0 to save resources. HPA and KPA cannot be used on the same service at the same time; an annotation selects which autoscaler to use.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
  annotations:
    autoscaling.knative.dev/class: 'kpa.autoscaling.knative.dev'
    autoscaling.knative.dev/metric: 'concurrency'
    autoscaling.knative.dev/target: '10'
    autoscaling.knative.dev/minScale: '0'
    autoscaling.knative.dev/maxScale: '10'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'

The main KPA metrics are as follows.

concurrency: Number of concurrent requests per Pod (default)
rps: Requests per second per Pod
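9.3 GPU Metric-Based Auto-scaling

For GPU workloads, scaling on CPU/memory alone is not enough. Combining the NVIDIA DCGM (Data Center GPU Manager) Exporter with Prometheus and a metrics-driven scaler such as Prometheus Adapter or KEDA enables scaling on GPU metrics.

# GPU metric-based scaling with a KEDA ScaledObject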
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-gpu-scaler
spec:
  scaleTargetRef:
    name: triton-inference-server
  pollingInterval: 15
  cooldownPeriod: 60
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: gpu_utilization
        query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-.*"})
        threshold: '80'

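Alongside HPA, the KServe Controller Manager also supports KEDA (Kubernetes Event-Driven Autoscaling), which makes it possible to configure scaling on custom metrics.

10. Canary / Blue-Green Deployment

10.1 Canary Rollout

According to the KServe official documentation, the Canary Rollout strategy is supported in Serverless Deployment Mode. KServe automatically tracks the last stable revision that received 100% of traffic (the last known good revision), and when the canaryTrafficPercent field is set, it automatically splits traffic between the new revision and the stable revision.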

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'my-model'
  annotations:
    serving.kserve.io/enable-tag-routing: 'true'
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: tensorflow
      storageUri: 'gs://kfserving-examples/models/tensorflow/flowers-2'

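With the above configuration, 10% of traffic is sent to the new model version and the remaining 90% is routed to the previous stable version. Once the new version's performance has been validated, raise canaryTrafficPercent step by step until it finally reaches 100%.

A typical Canary Rollout workflow is as follows.

Set canaryTrafficPercent: 10 when deploying the new model version
Use monitoring to check the new version's latency, error rate, and so on
If there are no problems, increase canaryTrafficPercent gradually: 20 -> 50 -> 100
If a problem occurs, roll back immediately with canaryTrafficPercent: 0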
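
The workflow above can be expressed as a small decision function that a rollout script might call after each observation window. This is only a sketch: the step schedule follows the percentages in the list, while the health thresholds, the metric names, and both function names are hypothetical and would come from your own monitoring setup. Applying the returned value to the InferenceService (for example by patching canaryTrafficPercent) is left out.

```python
ROLLOUT_STEPS = [10, 20, 50, 100]  # canaryTrafficPercent schedule from the workflow


def is_healthy(error_rate, p99_latency_ms, max_error_rate=0.01, max_p99_ms=200.0):
    """Hypothetical health check; the thresholds are placeholders, not recommendations."""
    return error_rate <= max_error_rate and p99_latency_ms <= max_p99_ms


def next_canary_percent(current_percent, healthy):
    """Return the next canaryTrafficPercent: advance one step, or roll back to 0."""
    if not healthy:
        return 0
    for step in ROLLOUT_STEPS:
        if step > current_percent:
            return step
    return 100  # already fully promoted


print(next_canary_percent(10, is_healthy(0.002, 120.0)))  # prints 20
print(next_canary_percent(50, is_healthy(0.002, 120.0)))  # prints 100
print(next_canary_percent(20, is_healthy(0.05, 120.0)))   # prints 0
```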

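10.2 Blue-Green Deployment

In a Blue-Green Deployment, two complete environments (Blue: current production, Green: new version) run side by side, and traffic is switched over all at once after validation.

In KServe, a Blue-Green Deployment can be implemented by deploying the new version with canaryTrafficPercent: 0 so that it receives no traffic, then switching to 100 after validation. In addition, the serving.kserve.io/enable-tag-routing: "true" annotation enables tag-based routing, so a specific version can be called directly for testing.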

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'my-model'
  annotations:
    serving.kserve.io/enable-tag-routing: 'true'
spec:
  predictor:
    # Deploy the new version (Green) with 0% of traffic
    canaryTrafficPercent: 0
    model:
      modelFormat:
        name: tensorflow
      storageUri: 'gs://kfserving-examples/models/tensorflow/flowers-2'

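With tag routing enabled, the new version can be called directly through the latest tag and the previous version through the prev tag for testing. Once validation is complete, set canaryTrafficPercent to 100 to switch all traffic to the new version.

11. Prometheus + Grafana Monitoring Setup

11.1 Metrics Collection Architecture

Monitoring is essential when operating ML model serving in production. KServe supports metrics collection through Prometheus and visualization through Grafana out of the box.

The metrics collection path is as follows.

KServe serving container: Each model server (Triton, TorchServe, etc.) exposes its own metrics
Knative Queue Proxy: In Serverless Mode, the queue-proxy container automatically generates request metrics
Prometheus: Dynamically discovers and scrapes the Pods' metrics endpoints through a ServiceMonitor
Grafana: Visualizes dashboards using Prometheus as the data source

11.2 Prometheus Configuration

With the Prometheus Operator and a ServiceMonitor, metrics from KServe Pods can be collected automatically.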

# Prometheus ServiceMonitor for KServe
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kserve-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: 'my-model'
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default

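By default, Triton exposes Prometheus metrics on port 8002. The main metrics are as follows.

nv_inference_request_success: Number of successful inference requests
nv_inference_request_failure: Number of failed inference requests
nv_inference_count: Total number of inferences performed
nv_inference_exec_count: Total number of inference batch executions
nv_inference_request_duration_us: Cumulative inference request processing time (microseconds)
nv_inference_queue_duration_us: Cumulative time requests spent waiting in the queue (microseconds)
nv_inference_compute_input_duration_us: Cumulative input preprocessing time (microseconds)
nv_inference_compute_infer_duration_us: Cumulative inference execution time (microseconds)
nv_inference_compute_output_duration_us: Cumulative output post-processing time (microseconds)
nv_gpu_utilization: GPU utilization
nv_gpu_memory_used_bytes: GPU memory usage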
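
Because the request and duration metrics are running totals, useful numbers come from ratios between them. The sketch below parses a small sample in the Prometheus text format and derives the average batch size and average per-request latencies. The sample values are made up for illustration, the parser handles only simple counter lines, and the helper names are hypothetical; in practice the same ratios would usually be computed in PromQL over rate() windows.

```python
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
nv_inference_request_success{model="resnet50",version="1"} 200
nv_inference_count{model="resnet50",version="1"} 200
nv_inference_exec_count{model="resnet50",version="1"} 50
nv_inference_request_duration_us{model="resnet50",version="1"} 1000000
nv_inference_queue_duration_us{model="resnet50",version="1"} 200000
"""


def parse_counters(text):
    """Sum sample values per metric name, ignoring comments and labels."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        name = series.split("{", 1)[0]
        counters[name] = counters.get(name, 0.0) + float(value)
    return counters


def summarize(counters):
    """Derive averages from the cumulative counters."""
    ok = counters["nv_inference_request_success"]
    return {
        # inferences per batch execution: how well dynamic batching is working
        "avg_batch_size": counters["nv_inference_count"] / counters["nv_inference_exec_count"],
        "avg_request_ms": counters["nv_inference_request_duration_us"] / ok / 1000.0,
        "avg_queue_ms": counters["nv_inference_queue_duration_us"] / ok / 1000.0,
    }


print(summarize(parse_counters(SAMPLE)))
# prints {'avg_batch_size': 4.0, 'avg_request_ms': 5.0, 'avg_queue_ms': 1.0}
```

An average batch size close to 1 under load suggests that dynamic batching is not collecting requests, while a queue time that dominates the request time points to too few model instances.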

11.3 Grafana Dashboards

The KServe official documentation provides template dashboards for each Serving Runtime.

KServe ModelServer Latency Dashboard

A dashboard for the default KServe ModelServer runtimes (sklearn, xgboost, lgb, paddle, pmml, custom) that visualizes the latency of each stage (pre/post-process, predict, explain) in milliseconds.

KServe Triton Latency Dashboard

A Triton-specific dashboard that provides five latency graphs.

Input (preprocess) latency
Infer (predict) latency
Output (postprocess) latency
Internal queue latency
Total latency

It also includes a percentage gauge of GPU memory usage.

KServe TorchServe Latency Dashboard

A TorchServe-specific dashboard that visualizes the ts_inference_latency_microseconds and ts_queue_latency_microseconds metrics in milliseconds.

11.4 GPU Monitoring (DCGM Exporter)

For dedicated GPU monitoring, additionally deploying the NVIDIA DCGM Exporter makes it possible to collect detailed GPU metrics.

# Install DCGM Exporter (Helm)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true

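The main GPU metrics provided by the DCGM Exporter are as follows.

DCGM_FI_DEV_GPU_UTIL: GPU core utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL: GPU memory bandwidth utilization (%)
DCGM_FI_DEV_FB_USED: Framebuffer memory used (MB)
DCGM_FI_DEV_FB_FREE: Framebuffer memory free (MB)
DCGM_FI_DEV_GPU_TEMP: GPU temperature (C)
DCGM_FI_DEV_POWER_USAGE: Power usage (W)
DCGM_FI_DEV_SM_CLOCK: SM clock speed (MHz)
DCGM_FI_PROF_GR_ENGINE_ACTIVE: Ratio of time the graphics engine is active

Visualizing these metrics together in Grafana dashboards makes it possible to see per-model inference performance, GPU resource utilization, and bottlenecks in real time and to use them for optimization.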

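12. Summary

ML model serving on Kubernetes goes beyond simple deployment: it requires a systematic approach that achieves scalability, resource efficiency, and operational reliability at the same time. KServe abstracts the complexity of model serving through the InferenceService CRD and allows flexible inference pipelines to be built with the Predictor/Transformer/Explainer structure. NVIDIA Triton Inference Server maximizes GPU utilization through multi-framework support, Dynamic Batching, Concurrent Model Execution, and more.

For GPU resource management, the NVIDIA Device Plugin is the baseline, and MIG and Time-Slicing allow expensive GPU resources to be shared efficiently. For auto-scaling, choose among HPA, KPA, and KEDA according to the workload's characteristics, and use Canary/Blue-Green Deployment to roll out model updates safely. Finally, the combination of Prometheus + Grafana + DCGM Exporter should be used to monitor model performance and GPU resources together.

Combining all of these components makes it possible to build a production-grade model serving platform that runs large-scale ML workloads reliably and efficiently.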

References

Kubernetes ML Model Serving: Complete Analysis of KServe and NVIDIA Triton

1. Why Serve ML Models on Kubernetes
2. KServe Architecture Analysis
3. KServe Installation and Basic Usage
4. NVIDIA Triton Inference Server Feature Analysis
- 4.1 Key Features
5. Model Format Support
- 5.1 Triton Backend Architecture
- 5.2 Supported Frameworks and Model Formats
6. Model Repository Structure and config.pbtxt Configuration
- 6.1 Directory Structure
- 6.2 Detailed config.pbtxt Configuration
7. Dynamic Batching and Concurrent Model Execution
- 7.1 Dynamic Batching
- 7.2 Concurrent Model Execution
8. GPU Resource Allocation
9. Auto-scaling Strategies
10. Canary / Blue-Green Deployment
- 10.1 Canary Rollout
- 10.2 Blue-Green Deployment
11. Prometheus + Grafana Monitoring Setup
12. Summary
References
- Quiz

1. Why Serve ML Models on Kubernetes

When deploying machine learning models to production, serving models on a single server with Flask or FastAPI is suitable for initial prototyping, but has fundamental limitations for handling real production traffic. Kubernetes is the most mature platform for addressing these limitations.

1.1 Scalability

The core strength of Kubernetes is horizontal scaling based on workload. When inference requests spike, Pods can be automatically scaled up, and when traffic decreases, they can be scaled back down. Using HPA (Horizontal Pod Autoscaler), automatic scaling based on CPU, memory, or custom metrics (e.g., GPU utilization, request queue length) is possible. Furthermore, using Knative-based KPA (Knative Pod Autoscaler), Scale-to-Zero is also possible, reducing Pods to 0 when there is no traffic at all.

1.2 Resource Management

ML inference workloads use expensive GPU resources. Kubernetes can precisely control the CPU, memory, and GPU used by each Pod through resources.requests and resources.limits. Using the NVIDIA Device Plugin, nvidia.com/gpu resources can be natively scheduled in Kubernetes, and sharing a single GPU among multiple Pods through MIG (Multi-Instance GPU) or Time-Slicing is also possible.

1.3 Operational Reliability

Kubernetes' Self-Healing mechanism enhances the stability of model serving. If a Pod terminates abnormally, it is automatically restarted, and the model server's health is continuously monitored through liveness/readiness probes. Through Rolling Update and Canary Deployment, models can be safely deployed without downtime during updates.

2. KServe Architecture Analysis

KServe (formerly KFServing) is a standardized platform designed for ML model serving on Kubernetes. KServe leverages Kubernetes CRD (Custom Resource Definition) to manage the entire lifecycle of model serving.

2.1 Layered Architecture

According to the KServe official documentation, KServe consists of four major layers.

Control Plane (Go): The Kubernetes Controller manages the entire lifecycle of InferenceService. It orchestrates operations such as model deployment, updates, deletion, and scaling.
Data Plane (Python): A protocol-agnostic model server framework that handles actual inference requests through V1/V2 inference protocols.
Configuration Layer: Manages the entire system configuration through CRDs and ConfigMaps.
Storage Layer: Supports various model artifact storage backends including S3, GCS, Azure Blob Storage, PVC, and more.

2.2 InferenceService CRD

InferenceService is the core CRD of KServe, encapsulating the complexity of autoscaling, networking, health checking, and server configuration required for ML model serving. A single YAML definition can declaratively manage everything needed for model serving.

A basic InferenceService definition is as follows.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'

With this single simple YAML, model download, server startup, network endpoint configuration, and autoscaling are all automatically handled.

2.3 Three Core Components: Predictor, Transformer, Explainer

KServe InferenceService consists of three components, of which only Predictor is required and the rest are optional.

Predictor

Predictor is the core workload of InferenceService, consisting of the model and model server. It handles actual inference requests and is exposed externally through network endpoints. KServe provides various Serving Runtimes: TensorFlow Serving, TorchServe, Triton Inference Server, SKLearn, XGBoost, LightGBM, and more are built-in by default.

Transformer

Transformer handles data transformation (pre/post-processing) before and after inference. For example, it performs tasks such as normalizing image inputs, tokenizing text, or converting model outputs into human-readable formats. KServe also provides out-of-the-box Transformers that integrate with Feature Stores like Feast. Custom Transformers are deployed as separate containers and are told where to reach the Predictor through configuration such as an environment variable holding the prediction endpoint.

Explainer

Explainer is an optional component that provides explanations (interpretability) for model predictions. It provides explanations of why the model made specific predictions using XAI (eXplainable AI) techniques such as SHAP and LIME. It operates on a separate data plane from the Predictor's inference results and is activated only when explanation requests are made.

Request Flow

When an inference request arrives, it is processed in the following flow.

Client request enters through the Ingress Gateway
Transformer performs input data preprocessing
Preprocessed data is forwarded to the Predictor for inference execution
Transformer performs output data post-processing
Final result is returned to the client

For Explanation requests, the Explainer additionally intervenes to generate model explanations.

3. KServe Installation and Basic Usage

3.1 Prerequisites

According to the KServe official QuickStart guide, the following are required.

Kubernetes 1.32 or higher
Properly configured kubeconfig
For local development/testing environments, kind (Kubernetes in Docker) or minikube is recommended

3.2 Quick Install

The quickest way to get started in development and testing environments is to use the Quick Install Script.

# KServe Quick Install (Serverless mode - includes Knative)
curl -s "https://raw.githubusercontent.com/kserve/kserve/master/hack/quick_install.sh" | bash

# Raw Deployment mode (install without Knative)
curl -s "https://raw.githubusercontent.com/kserve/kserve/master/hack/quick_install.sh" | bash -s -- -r

This script automatically handles dependency installation, platform detection, and mode configuration.

3.3 Deployment Modes

KServe supports two major deployment modes.

Serverless Mode (Knative): The default installation option, operating based on Knative Serving. It provides serverless features such as Scale-to-Zero, revision-based traffic management, and Canary Deployment.
Standard Mode (Raw Deployment): Operates using only Kubernetes native resources (Deployment, Service, HPA) without Knative. It supports both predictive inference and generative inference workloads while minimizing external dependencies.

3.4 First Model Deployment

Once installation is complete, you can deploy a simple SKLearn model as follows.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'
      resources:
        requests:
          cpu: '100m'
          memory: '256Mi'
        limits:
          cpu: '1'
          memory: '512Mi'

# Deploy model
kubectl apply -f sklearn-iris.yaml

# Check status
kubectl get inferenceservice sklearn-iris

# Test inference request
curl -v -H "Content-Type: application/json" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
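
The same V1 predict call can be prepared from Python. The sketch below only builds the request object with the standard library, mirroring the URL and JSON body of the curl command; it does not send anything. The function name build_predict_request and the optional service_hostname parameter are assumptions for illustration: in Serverless mode, requests sent through the ingress gateway usually need a Host header set to the InferenceService hostname, which is why the parameter is included.

```python
import json
from urllib.request import Request


def build_predict_request(host, port, model_name, instances, service_hostname=None):
    """Build (but do not send) a KServe V1 protocol predict request."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if service_hostname:
        # Lets the ingress gateway route the request to the right InferenceService.
        headers["Host"] = service_hostname
    return Request(url, data=body, headers=headers, method="POST")


req = build_predict_request("localhost", 8080, "sklearn-iris", [[6.8, 2.8, 4.8, 1.4]])
print(req.get_method(), req.full_url)
# prints POST http://localhost:8080/v1/models/sklearn-iris:predict
```

To actually send it, pass the object to urllib.request.urlopen(req) and decode the JSON response, which in the V1 protocol carries the results under a "predictions" key.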

4. NVIDIA Triton Inference Server Feature Analysis

NVIDIA Triton Inference Server is a high-performance inference server capable of simultaneously serving models from various deep learning and machine learning frameworks. It can be used as a KServe Serving Runtime integration or deployed independently.

4.1 Key Features

According to Triton official documentation, Triton provides the following key features.

Multi-Framework Support: Simultaneously serving multiple models from different frameworks on a single server instance
Dynamic Batching: Dynamically grouping individual inference requests to maximize GPU utilization
Concurrent Model Execution: Running multiple models concurrently on the same GPU
Model Ensemble: Connecting multiple models as a pipeline for composite inference
Model Analyzer: Automatically exploring optimal deployment configurations

5. Model Format Support

5.1 Triton Backend Architecture

Triton supports various model formats through its Backend plugin architecture. Each model must be associated with exactly one Backend, specified through the backend or platform field in config.pbtxt.

5.2 Supported Frameworks and Model Formats

ONNX Runtime Backend

Executes models in ONNX (Open Neural Network Exchange) format. It supports models converted from various frameworks including PyTorch, TensorFlow, and scikit-learn. It provides optimized inference on both CPU and GPU, leveraging ONNX Runtime's Graph Optimization and Execution Providers.

# ONNX model file example
model_repository/
  resnet50_onnx/
    config.pbtxt
    1/
      model.onnx

TensorRT Backend

Executes models optimized with NVIDIA TensorRT (engine/plan files). TensorRT is an inference optimization engine specialized for NVIDIA GPUs, providing the highest level of GPU inference performance through FP16/INT8 quantization, Layer Fusion, Kernel Auto-Tuning, and more.

# TensorRT model file example
model_repository/
  resnet50_tensorrt/
    config.pbtxt
    1/
      model.plan

PyTorch Backend

Supports both TorchScript format and PyTorch 2.0's torch.compile format. TorchScript models are converted using torch.jit.trace or torch.jit.script, and the same Backend can handle both TorchScript and PyTorch 2.0 models.

# PyTorch model file example
model_repository/
  bert_torchscript/
    config.pbtxt
    1/
      model.pt

TensorFlow Backend

Supports both TensorFlow's SavedModel and GraphDef formats. The same Backend can execute both TensorFlow 1 and TensorFlow 2 models. SavedModel is the recommended format.

# TensorFlow SavedModel example
model_repository/
  resnet50_tf/
    config.pbtxt
    1/
      model.savedmodel/
        saved_model.pb
        variables/

Other Backends

OpenVINO Backend: Supports inference optimized for Intel hardware
FIL Backend (Forest Inference Library): Supports tree-based models including XGBoost, LightGBM, scikit-learn Random Forest, and cuML Random Forest
Python Backend: Enables implementing custom Python code for pre/post-processing pipelines

6. Model Repository Structure and config.pbtxt Configuration

6.1 Directory Structure

The basic layout of the Model Repository as defined in Triton official documentation is as follows.

<model-repository-path>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  <model-name>/
    [config.pbtxt]
    ...

The role of each component is as follows.

model-name directory: The top-level directory corresponding to the model name
config.pbtxt: Model configuration file in ModelConfig protobuf format
version directory: Numeric directory names represent model versions. Directories starting with "0" or with non-numeric names are ignored
model-definition-file: Backend-specific model files (model.onnx, model.plan, model.pt, etc.)

A practical example looks like this.

model_repository/
  text_detection/
    config.pbtxt
    1/
      model.onnx
  text_recognition/
    config.pbtxt
    output_labels.txt
    1/
      model.plan
    2/
      model.plan
  ensemble_ocr/
    config.pbtxt
    1/
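The version-directory rule described above can be checked before a repository is handed to Triton. The following is a minimal sketch, not part of Triton itself: the function names `servable_versions` and `scan_repository` are hypothetical, and the check covers only the naming rule (numeric, not starting with "0"), not whether the model files inside are valid.

```python
from pathlib import Path


def servable_versions(model_dir: Path) -> list[int]:
    """Return the version numbers that pass the naming rule for one model.

    A version directory must have a purely numeric name that does not
    start with "0"; everything else is skipped.
    """
    versions = []
    for child in model_dir.iterdir():
        name = child.name
        numeric = name.isascii() and name.isdigit()
        if child.is_dir() and numeric and not name.startswith("0"):
            versions.append(int(name))
    return sorted(versions)


def scan_repository(repo: Path) -> dict[str, list[int]]:
    """Map every model directory in the repository to its usable versions."""
    return {
        model.name: servable_versions(model)
        for model in sorted(repo.iterdir())
        if model.is_dir()
    }
```

Calling `scan_repository(Path("model_repository"))` on the practical example above would be expected to report versions 1 and 2 for text_recognition and version 1 for the other two models.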

6.2 Detailed config.pbtxt Configuration

config.pbtxt is a text representation of the ModelConfig protobuf message. At minimum, it should include name, backend (or platform), max_batch_size, input, and output. However, for some models, Triton can automatically generate the minimum configuration.

name: "resnet50"
backend: "onnxruntime"
max_batch_size: 8

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "labels.txt"
  }
]

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

The meaning of each item in the above example is as follows.

name: Model name (must match the directory name)
backend: Backend to use ("onnxruntime", "tensorrt", "pytorch", "tensorflow", etc.)
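max_batch_size: Maximum batch size the model accepts, which is also the upper bound for dynamic batching. When it is greater than 0, Triton adds an implicit batch dimension in front of dims. Setting it to 0 disables batching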
input/output: Input/output tensor names, data types, and dimensions of the model
instance_group: Number of model instances and assigned GPU specification
dynamic_batching: Dynamic batching configuration
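One detail that is easy to miss in the example is how max_batch_size interacts with dims: when max_batch_size is greater than 0, the batch dimension is implicit and is not listed in dims. The sketch below illustrates that relationship; `full_shape` is a hypothetical helper name, and -1 is used here only to mark the variable batch dimension.

```python
def full_shape(dims: list[int], max_batch_size: int) -> list[int]:
    """Return the complete tensor shape implied by a config.pbtxt entry.

    With max_batch_size > 0 the first dimension is an implicit, variable
    batch dimension (shown as -1). With max_batch_size == 0 the dims are
    already the complete shape.
    """
    if max_batch_size < 0:
        raise ValueError("max_batch_size must be >= 0")
    if max_batch_size > 0:
        return [-1] + list(dims)
    return list(dims)


# The resnet50 input above: dims [3, 224, 224] with max_batch_size 8
print(full_shape([3, 224, 224], 8))
```

For the resnet50 configuration this means a client sends tensors shaped [N, 3, 224, 224] with N between 1 and 8.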

7. Dynamic Batching and Concurrent Model Execution

7.1 Dynamic Batching

Dynamic Batching is one of Triton's most powerful performance optimization features. It significantly improves throughput by grouping individually arriving inference requests into a single batch and sending them to the GPU.

According to Triton official documentation, Dynamic Batching can be enabled and configured independently for each model. By default, the scheduler does not wait: it batches whatever requests are already queued as soon as a model instance is available. Users can instead allow a bounded delay so that the scheduler can collect more requests into each batch.

dynamic_batching {
  # Preferred batch sizes (batch to these sizes when possible)
  preferred_batch_size: [ 4, 8 ]

  # Maximum time to wait to form a batch (microseconds)
  max_queue_delay_microseconds: 100

  # Priority settings (optional)
  priority_levels: 2
  default_priority_level: 1

  # Queue policy (optional)
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 5000000
    allow_timeout_override: true
    max_queue_size: 100
  }
}

The roles of key parameters are as follows.

preferred_batch_size: Batch sizes that Triton preferentially tries to form. When a batch of this size is formed, it executes immediately.
max_queue_delay_microseconds: Maximum time to wait for additional requests to fill the batch. When this time expires, it executes immediately with the currently collected requests. This is the key parameter for adjusting the trade-off between latency and throughput.
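To make the interplay of the two parameters concrete, here is a deliberately simplified simulation. It is not Triton's scheduler: it assumes a model instance is always free, ignores priorities and queue policies, and the function name `simulate_batches` is hypothetical. A batch is dispatched either when the queue reaches a preferred size or when the oldest queued request has waited for the maximum delay.

```python
def simulate_batches(arrivals_us, preferred_sizes, max_delay_us):
    """Group request arrival times (microseconds, ascending) into batches.

    Returns a list of (dispatch_time_us, batch_size) tuples.
    """
    preferred = set(preferred_sizes)
    batches = []
    pending = []  # arrival times of requests still waiting in the queue
    for t in arrivals_us:
        # If the oldest request's deadline passed before this arrival,
        # the queued requests were already dispatched at that deadline.
        if pending and t > pending[0] + max_delay_us:
            batches.append((pending[0] + max_delay_us, len(pending)))
            pending = []
        pending.append(t)
        # Reaching a preferred size triggers an immediate dispatch.
        if len(pending) in preferred:
            batches.append((t, len(pending)))
            pending = []
    if pending:
        # No further arrivals: the remainder leaves when its delay expires.
        batches.append((pending[0] + max_delay_us, len(pending)))
    return batches


arrivals = [0, 10, 20, 30, 500, 520, 900]
print(simulate_batches(arrivals, preferred_sizes=[4, 8], max_delay_us=100))
```

In this model a burst of four requests leaves as one batch without waiting, while sparse requests pay at most max_queue_delay_microseconds of extra latency, which is the trade-off described above.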

7.2 Concurrent Model Execution

Triton can run multiple model instances concurrently on the same GPU. This allows more efficient utilization of GPU computational resources.

# Place 2 model instances on a single GPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Distribute instances across multiple GPUs: 1 on GPU 0, and 2 on each of GPUs 1 and 2
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1, 2 ]
  }
]

# Place instances on both CPU and GPU simultaneously
instance_group [
  {
    count: 1
    kind: KIND_CPU
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

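The Triton official documentation gives the following rule of thumb for the request concurrency needed to reach maximum throughput.

Suggested Concurrency = 2 x <max_batch_size> x <instance_count>

For example, if max_batch_size is 8 and instance_count is 2, the suggested concurrency is 2 x 8 x 2 = 32. In other words, a load generator should keep about 32 requests in flight, so that every instance can run a full batch while the next batch is already being formed. Treat the result as a starting point for benchmarking rather than a guaranteed optimum.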
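The arithmetic above can be wrapped in a small helper when generating benchmark parameters for several models. This is only a sketch: `suggested_concurrency` is a hypothetical name, and treating a non-batching model (max_batch_size of 0) as batch size 1 is an assumption made here, not something stated in the source.

```python
def suggested_concurrency(max_batch_size: int, instance_count: int) -> int:
    """Apply the 2 x max_batch_size x instance_count rule of thumb."""
    if instance_count < 1:
        raise ValueError("instance_count must be >= 1")
    # Assumption: a model with batching disabled handles one request at a time.
    effective_batch = max(max_batch_size, 1)
    return 2 * effective_batch * instance_count


print(suggested_concurrency(max_batch_size=8, instance_count=2))
```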

8. GPU Resource Allocation

8.1 NVIDIA Device Plugin for Kubernetes

The NVIDIA Device Plugin is a DaemonSet that enables managing GPUs as native resources in Kubernetes clusters. Once installed, it automatically detects GPUs on each node and registers them as nvidia.com/gpu resources.

# Install NVIDIA Device Plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

A Pod requests GPUs through its resource limits, as follows.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: inference
      image: nvcr.io/nvidia/tritonserver:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

8.2 MIG (Multi-Instance GPU)

MIG is a feature of data center GPUs from the Ampere generation onward, such as NVIDIA A100 and H100, which partitions a single physical GPU into multiple independent GPU instances. Each instance has memory and computational resources isolated at the hardware level, guaranteeing fault isolation.

The key characteristics of MIG are as follows.

Hardware-level isolation: Each MIG instance has independent memory, cache, and compute units
Performance guarantee: Workloads on one instance do not affect other instances
Predefined profiles: Available partition configurations are predetermined based on the GPU model (e.g., splitting an A100 80GB into 7 instances of 10GB)

To use MIG in Kubernetes, the MIG strategy is configured through the NVIDIA GPU Operator.

# MIG profile configuration example (GPU Operator ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-3g.40gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2

8.3 GPU Time-Slicing

Time-Slicing is a method that enables GPU sharing even on older generation GPUs that do not support MIG. According to the NVIDIA GPU Operator official documentation, Time-Slicing allows system administrators to define GPU replicas that can be independently allocated to each Pod.

The key characteristics are as follows.

Software-level sharing: Distributes GPU time evenly for multiple Pods to share
No memory/fault isolation: Unlike MIG, memory and fault isolation are not provided
Flexible partitioning: Allows more users to share GPUs
Combinable with MIG: Time-Slicing can be additionally applied to MIG instances

# Time-Slicing configuration (GPU Operator ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

The above configuration exposes each GPU as 4 replicas, so when 4 Pods each request nvidia.com/gpu: 1, they share the same physical GPU through time-slicing.
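For capacity planning, the effect of the replicas setting is simple multiplication. The sketch below uses hypothetical helper names and assumes every GPU on the node uses the same replicas value. Because time-slicing provides no memory isolation, the number it returns is a scheduling limit, not a promise that the workloads fit in GPU memory.

```python
def advertised_gpu_count(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def pods_that_fit(physical_gpus: int, replicas: int, gpus_per_pod: int = 1) -> int:
    """How many Pods requesting `gpus_per_pod` each can be scheduled on the node."""
    return advertised_gpu_count(physical_gpus, replicas) // gpus_per_pod


# One physical GPU with replicas: 4, as in the ConfigMap above
print(advertised_gpu_count(1, 4), pods_that_fit(1, 4))
```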

9. Auto-scaling Strategies

9.1 HPA (Horizontal Pod Autoscaler)

In KServe's Standard Mode (Raw Deployment), the default Kubernetes HPA is used. It automatically adjusts the number of Pods based on CPU, memory, or custom metrics.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
  annotations:
    serving.kserve.io/autoscalerClass: 'hpa'
    serving.kserve.io/targetUtilizationPercentage: '80'
spec:
  predictor:
    # Replica bounds are fields of the predictor spec, not annotations
    minReplicas: 1
    maxReplicas: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'
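How the HPA turns a target such as 80% into a replica count follows the formula in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas x currentMetricValue / desiredMetricValue), with no change while the ratio stays within a tolerance (0.1 by default). The sketch below reproduces only that core calculation; it leaves out stabilization windows, pod readiness handling, and missing metrics, and the function name is hypothetical.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Core HPA calculation: scale in proportion to current/target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 2 replicas running at 120% average utilization against an 80% target
print(hpa_desired_replicas(2, current_value=120, target_value=80))
```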

9.2 KPA (Knative Pod Autoscaler)

In KServe's Serverless Mode (Knative), KPA is used. The most notable feature of KPA is Scale-to-Zero, which can reduce Pods to 0 when there is no traffic to save resources. HPA and KPA cannot be used simultaneously on the same service, and annotations specify which Autoscaler to use.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'sklearn-iris'
  annotations:
    autoscaling.knative.dev/class: 'kpa.autoscaling.knative.dev'
    autoscaling.knative.dev/metric: 'concurrency'
    autoscaling.knative.dev/target: '10'
spec:
  predictor:
    # minReplicas: 0 enables Scale-to-Zero; KServe derives the Knative scale bounds from these fields
    minReplicas: 0
    maxReplicas: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: 'gs://kfserving-examples/models/sklearn/1.0/model'

The key metrics of KPA are as follows.

concurrency: Number of concurrent requests per Pod (default)
rps: Requests per second per Pod
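Both metrics lead to the same kind of calculation: the observed total (in-flight requests or requests per second) divided by the per-Pod target. The following is a simplified sketch under stated assumptions, not the Knative autoscaler itself. It ignores the target utilization percentage, the averaging windows, and panic mode, and `kpa_desired_pods` is a hypothetical name.

```python
import math


def kpa_desired_pods(
    observed_total: float,
    target_per_pod: float,
    min_scale: int = 0,
    max_scale: int = 10,
) -> int:
    """Pods needed so that each one handles about `target_per_pod`.

    `observed_total` is the total concurrency (or total rps) across the
    whole service. With min_scale=0 and no traffic, the result is 0.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    desired = math.ceil(observed_total / target_per_pod)
    return max(min_scale, min(max_scale, desired))


# 35 requests in flight with a per-Pod concurrency target of 10
print(kpa_desired_pods(35, target_per_pod=10))
```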

9.3 GPU Metric-based Auto-scaling

For GPU workloads, CPU/memory-based scaling alone is insufficient. Once the NVIDIA DCGM (Data Center GPU Manager) Exporter publishes GPU metrics to Prometheus, they can drive scaling either through the Prometheus Adapter or through KEDA. The example below uses KEDA.

# GPU metric-based scaling using KEDA ScaledObject
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-gpu-scaler
spec:
  scaleTargetRef:
    name: triton-inference-server
  pollingInterval: 15
  cooldownPeriod: 60
  minReplicaCount: 1
  maxReplicaCount: 5
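  triggers:
    - type: prometheus
      # Compare the query result with the threshold directly. The default
      # AverageValue type treats the result as a total to be divided across replicas.
      metricType: Value
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: gpu_utilization
        query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-.*"})
        threshold: '80'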

The KServe Controller Manager supports KEDA (Kubernetes Event-Driven Autoscaling) alongside HPA, enabling custom metric-based scaling configurations.

10. Canary / Blue-Green Deployment

10.1 Canary Rollout

According to KServe official documentation, the Canary Rollout strategy is supported in Serverless Deployment Mode. KServe automatically tracks the last known good revision that received 100% traffic, and when the canaryTrafficPercent field is set, it automatically distributes traffic between the new version and the stable version.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'my-model'
  annotations:
    serving.kserve.io/enable-tag-routing: 'true'
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: tensorflow
      storageUri: 'gs://kfserving-examples/models/tensorflow/flowers-2'

In the above configuration, 10% of traffic is directed to the new model version, and the remaining 90% is routed to the previous stable version. Once the new version's performance is verified, canaryTrafficPercent is gradually increased to eventually reach 100%.

The typical Canary Rollout workflow is as follows.

Set canaryTrafficPercent: 10 when deploying a new model version
Monitor the new version's latency, error rate, and other key metrics
If no issues, gradually increase canaryTrafficPercent from 20 to 50 to 100
Immediately rollback by setting canaryTrafficPercent: 0 if issues occur
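The workflow above amounts to a small state machine over canaryTrafficPercent. The sketch below captures only that decision logic. The step list mirrors the percentages named in the workflow, `next_canary_percent` is a hypothetical name, and how the `healthy` flag is derived (latency and error-rate thresholds) as well as how the new value is applied to the InferenceService are left to the surrounding tooling.

```python
CANARY_STEPS = [10, 20, 50, 100]  # percentages from the workflow above


def next_canary_percent(current: int, healthy: bool) -> int:
    """Return the next canaryTrafficPercent value to apply.

    An unhealthy canary is rolled back to 0 immediately; a healthy one
    advances to the next step and stays at 100 once fully promoted.
    """
    if not healthy:
        return 0
    for step in CANARY_STEPS:
        if step > current:
            return step
    return 100


print(next_canary_percent(10, healthy=True))   # advance to the next step
print(next_canary_percent(20, healthy=False))  # roll back
```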

10.2 Blue-Green Deployment

In Blue-Green Deployment, two complete environments (Blue: current production, Green: new version) are operated simultaneously, and traffic is switched all at once after verification.

In KServe, Blue-Green Deployment can be implemented by setting canaryTrafficPercent: 0 to deploy the new version without sending traffic, then switching to 100 after verification. Additionally, using the serving.kserve.io/enable-tag-routing: "true" annotation enables tag-based routing to directly call and test specific versions.

apiVersion: 'serving.kserve.io/v1beta1'
kind: 'InferenceService'
metadata:
  name: 'my-model'
  annotations:
    serving.kserve.io/enable-tag-routing: 'true'
spec:
  predictor:
    # Deploy new version with 0% traffic (Green)
    canaryTrafficPercent: 0
    model:
      modelFormat:
        name: tensorflow
      storageUri: 'gs://kfserving-examples/models/tensorflow/flowers-2'

When tag routing is enabled, you can directly call and test the new version with the latest tag and the previous version with the prev tag. Once verification is complete, set canaryTrafficPercent to 100 to switch all traffic to the new version.

11. Prometheus + Grafana Monitoring Setup

11.1 Metrics Collection Architecture

Monitoring is essential for production operations of ML model serving. KServe natively supports metrics collection through Prometheus and visualization through Grafana.

The metrics collection path is as follows.

KServe Serving Container: Each model server (Triton, TorchServe, etc.) exposes its own metrics
Knative Queue Proxy: In Serverless Mode, the queue-proxy container automatically generates request metrics
Prometheus: Dynamically discovers and scrapes Pod metric endpoints through ServiceMonitor
Grafana: Visualizes dashboards using Prometheus as a data source

11.2 Prometheus Configuration

Using the Prometheus Operator and ServiceMonitor, KServe Pod metrics can be automatically collected.

# Prometheus ServiceMonitor for KServe
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kserve-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: 'my-model'
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default

Triton exposes Prometheus metrics on port 8002 by default. The key metrics are as follows.

nv_inference_request_success: Number of successful inference requests
nv_inference_request_failure: Number of failed inference requests
nv_inference_count: Number of inferences performed (a batch of n requests counts as n)
nv_inference_exec_count: Number of inference batch executions
nv_inference_request_duration_us: Cumulative end-to-end request handling time (microseconds)
nv_inference_queue_duration_us: Cumulative time requests spent waiting in the scheduling queue (microseconds)
nv_inference_compute_input_duration_us: Cumulative time spent processing inference inputs (microseconds)
nv_inference_compute_infer_duration_us: Cumulative time spent executing the model (microseconds)
nv_inference_compute_output_duration_us: Cumulative time spent processing inference outputs (microseconds)
nv_gpu_utilization: GPU utilization rate
nv_gpu_memory_used_bytes: GPU memory usage
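Because the count and duration metrics are cumulative counters, useful numbers come from the difference between two scrapes. The sketch below shows the usual derivations: average latency as duration delta divided by request delta, and average batch size as nv_inference_count divided by nv_inference_exec_count. The function name and the dict-based input are assumptions for illustration; in practice the same ratios are normally written as PromQL rate() expressions.

```python
def summarize_triton_interval(prev: dict, curr: dict) -> dict:
    """Derive per-interval averages from two snapshots of Triton counters."""

    def delta(key: str) -> float:
        return curr[key] - prev[key]

    requests = delta("nv_inference_request_success")
    execs = delta("nv_inference_exec_count")
    return {
        # microseconds -> milliseconds, averaged over successful requests
        "avg_request_ms": delta("nv_inference_request_duration_us") / requests / 1000
        if requests else 0.0,
        "avg_queue_ms": delta("nv_inference_queue_duration_us") / requests / 1000
        if requests else 0.0,
        # inferences per batch execution shows how well dynamic batching works
        "avg_batch_size": delta("nv_inference_count") / execs if execs else 0.0,
    }
```

An average batch size close to 1 suggests that dynamic batching is rarely combining requests, while a growing share of queue time in the total request time points to too few model instances.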

11.3 Grafana Dashboards

The KServe official documentation provides template dashboards for each Serving Runtime.

KServe ModelServer Latency Dashboard

A dashboard for default KServe ModelServer (sklearn, xgboost, lgb, paddle, pmml, custom) runtimes that visualizes latency in milliseconds for each stage: pre/post-process, predict, and explain.

KServe Triton Latency Dashboard

A Triton-specific dashboard that provides five latency graphs.

Input (preprocess) latency
Infer (predict) latency
Output (postprocess) latency
Internal queue latency
Total latency

It also includes a percentage gauge for GPU memory usage.

KServe TorchServe Latency Dashboard

A TorchServe-specific dashboard that visualizes ts_inference_latency_microseconds and ts_queue_latency_microseconds metrics in milliseconds.

11.4 GPU Monitoring (DCGM Exporter)

For GPU-specific monitoring, deploying the NVIDIA DCGM Exporter additionally enables collecting detailed GPU metrics.

# Install DCGM Exporter (Helm)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true

The key GPU metrics provided by DCGM Exporter are as follows.

DCGM_FI_DEV_GPU_UTIL: GPU core utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL: GPU memory bandwidth utilization (%)
DCGM_FI_DEV_FB_USED: Framebuffer memory used (MiB)
DCGM_FI_DEV_FB_FREE: Framebuffer memory free (MiB)
DCGM_FI_DEV_GPU_TEMP: GPU temperature (C)
DCGM_FI_DEV_POWER_USAGE: Power consumption (W)
DCGM_FI_DEV_SM_CLOCK: SM clock speed (MHz)
DCGM_FI_PROF_GR_ENGINE_ACTIVE: Graphics Engine active ratio
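A memory usage percentage, as shown on a dashboard gauge, can be derived from the two framebuffer metrics. This is a rough sketch: it assumes used plus free approximates the total framebuffer, which ignores any memory the driver reserves, and the function name is hypothetical.

```python
def framebuffer_used_percent(fb_used: float, fb_free: float) -> float:
    """Approximate GPU memory usage (%) from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE."""
    total = fb_used + fb_free
    if total <= 0:
        raise ValueError("used + free must be positive")
    return 100.0 * fb_used / total


print(framebuffer_used_percent(fb_used=20480, fb_free=61440))
```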

Visualizing these metrics together on Grafana dashboards makes it possible to track per-model inference performance and GPU utilization in real time and to locate bottlenecks that are worth optimizing.

12. Summary

ML model serving in Kubernetes environments requires a systematic approach to simultaneously achieve scalability, resource efficiency, and operational reliability beyond simple deployment. KServe abstracts the complexity of model serving through the InferenceService CRD and enables flexible inference pipelines with the Predictor/Transformer/Explainer structure. NVIDIA Triton Inference Server maximizes GPU utilization through multi-framework support, Dynamic Batching, and Concurrent Model Execution.

For GPU resource management, the NVIDIA Device Plugin serves as the foundation, with MIG and Time-Slicing enabling efficient sharing of expensive GPU resources. Auto-scaling should be selected among HPA, KPA, and KEDA based on workload characteristics, and Canary/Blue-Green Deployment ensures safe model updates. Finally, the Prometheus + Grafana + DCGM Exporter combination should be used for comprehensive monitoring of model performance and GPU resources.

By organically combining all these components, you can build a production-grade model serving platform capable of reliably and efficiently operating large-scale ML workloads.
