Kubernetes GPU Workload Management: The Complete NVIDIA GPU Operator Guide

1. Why You Need GPUs in Kubernetes

With the explosive growth of AI/ML workloads, GPUs are no longer optional but essential infrastructure. GPU-accelerated computing is demanded across diverse domains including LLM training, inference serving, computer vision, and scientific simulations, and operating these workloads at scale makes Kubernetes-based orchestration all but inevitable.

Using GPUs in Kubernetes provides the following benefits:

  • Automated resource scheduling: GPUs can be declaratively requested and allocated on a per-Pod basis through the nvidia.com/gpu resource type.
  • Multi-tenancy: Namespace, ResourceQuota, and LimitRange can be leveraged to isolate and fairly distribute GPU resources across teams.
  • GPU sharing: Technologies such as MIG, Time-Slicing, and MPS allow multiple workloads to share a single GPU, maximizing cost efficiency.
  • Auto scaling: Combined with tools like HPA (Horizontal Pod Autoscaler) or Karpenter, GPU workloads can be scaled automatically based on demand.
  • Operational standardization: Tasks such as GPU driver installation, monitoring, and incident response can be automated using the Operator pattern.

However, properly leveraging GPUs in Kubernetes requires correctly installing and managing multiple software components: the GPU driver, Container Toolkit, Device Plugin, and monitoring tools. NVIDIA provides the GPU Operator precisely to tame this complexity.


2. NVIDIA GPU Operator Architecture

The NVIDIA GPU Operator uses the Kubernetes Operator Framework to automatically provision and manage all NVIDIA software components needed on GPU nodes. According to the official documentation, the GPU Operator includes the following components.

2.1 Core Components

  • NVIDIA GPU Driver: provides the interface between the GPU and the operating system, enabling CUDA.
  • NVIDIA Container Toolkit: enables container runtimes (containerd, CRI-O) to access GPUs.
  • NVIDIA Device Plugin: exposes GPUs as Kubernetes resources through the kubelet API.
  • GPU Feature Discovery (GFD): detects the properties of a node's GPUs (model, memory, CUDA version, etc.) and automatically generates Node Labels.
  • DCGM Exporter: exposes GPU metrics in Prometheus format based on NVIDIA DCGM (Data Center GPU Manager).
  • MIG Manager: manages Multi-Instance GPU configurations using the Kubernetes controller pattern.
  • GPU Direct Storage (GDS): enables direct data transfer between storage devices and GPU memory.

2.2 How It Works

When the GPU Operator is deployed, initialization proceeds in the following order.

  1. gpu-operator Pod startup: acts as the controller that watches and reconciles the state of all components.
  2. Node Feature Discovery (NFD) deployment: deploys Pods to each node in the cluster to detect GPU presence and add the relevant Labels.
  3. Driver and Toolkit installation: deploys the NVIDIA driver and Container Toolkit as DaemonSets on nodes where GPUs are detected.
  4. Device Plugin deployment: registers GPUs with kubelet as the nvidia.com/gpu resource.
  5. Auxiliary component deployment: deploys GFD, DCGM Exporter, MIG Manager, and so on.

All components are declaratively managed through a Custom Resource called ClusterPolicy. The Operator continuously compares the ClusterPolicy's desired state with the actual state and automatically corrects any drift.
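
The reconciliation pattern described above can be sketched in a few lines of Python: diff a desired component map against the observed one and emit corrective actions. This is an illustrative sketch of the pattern only, not the Operator's actual internals.

```python
# Minimal reconcile-loop sketch: compare desired vs. actual state and
# produce the corrective actions needed to converge (illustrative only).

def reconcile(desired: dict, actual: dict) -> list:
    actions = []
    for component, want in desired.items():
        have = actual.get(component)
        if have != want:
            actions.append(f"apply {component}: {have!r} -> {want!r}")
    for component in actual.keys() - desired.keys():
        actions.append(f"delete {component}")  # prune components not in the spec
    return actions

desired = {"device-plugin": "enabled", "dcgm-exporter": "enabled"}
actual = {"device-plugin": "enabled", "dcgm-exporter": "disabled", "stale": "enabled"}
print(reconcile(desired, actual))
# ["apply dcgm-exporter: 'disabled' -> 'enabled'", 'delete stale']
```

A real operator runs this comparison continuously on watch events rather than once.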


3. Installing the GPU Operator

The GPU Operator is installed via Helm chart. The steps below follow the official installation guide.

3.1 Prerequisites

  • The kubectl and helm CLIs must be installed.
  • containerd or CRI-O must be used as the container runtime.
  • All Worker Nodes that will run GPU workloads must run the same OS version (unless drivers are pre-installed separately).
  • If Pod Security Admission (PSA) is in use, the Namespace must be set to the privileged level.

3.2 설치 절차

# 1. Namespace 생성 및 PSA 설정
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator \
  pod-security.kubernetes.io/enforce=privileged

# 2. NVIDIA Helm 리포지토리 추가
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 3. GPU Operator 설치 (기본 구성)
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1

3.3 Key Installation Options

  • driver.enabled: whether to deploy the NVIDIA driver (default: true)
  • toolkit.enabled: whether to deploy the Container Toolkit (default: true)
  • nfd.enabled: whether to deploy Node Feature Discovery (default: true)
  • dcgmExporter.enabled: whether to enable GPU telemetry (default: true)
  • cdi.enabled: whether to use the Container Device Interface (default: true)
  • driver.version: pin a specific driver version (default varies by release)
  • mig.strategy: MIG strategy (none, single, mixed; default: none)

When the NVIDIA driver is already installed on the host:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set driver.enabled=false

When both the driver and the Toolkit are pre-installed:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set driver.enabled=false \
  --set toolkit.enabled=false

3.4 Installation Verification

Once installation completes, verify GPU operation with a simple CUDA sample.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: 'nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04'
      resources:
        limits:
          nvidia.com/gpu: 1

kubectl apply -f cuda-vectoradd.yaml
kubectl logs pod/cuda-vectoradd
# [Vector addition of 50000 elements]
# Test PASSED

4. NVIDIA Device Plugin In Depth

The NVIDIA Device Plugin implements the Kubernetes Device Plugin Framework and is a core component of the GPU Operator.

4.1 Key Features

  • GPU enumeration: detects all GPUs installed on a node and reports the count to kubelet.
  • GPU health monitoring: continuously checks GPU status and excludes unhealthy GPUs from scheduling.
  • GPU allocation: allocates available GPUs when a Pod requests nvidia.com/gpu resources.
  • Resource sharing: supports GPU sharing strategies such as Time-Slicing and MIG.

4.2 How It Works

The Device Plugin is deployed as a DaemonSet and runs on each GPU node. The flow is as follows.

  1. The Plugin starts a gRPC server and registers with kubelet's Device Plugin socket.
  2. kubelet reflects the capacity of the nvidia.com/gpu resource in the Node object.
  3. When a Pod requests that resource, kube-scheduler schedules it onto an appropriate node.
  4. kubelet asks the Device Plugin for a GPU allocation, and the Plugin configures the device files and environment variables the container needs.
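
Steps 1-4 can be mirrored in a small bookkeeping sketch. The class below is purely illustrative (not the real plugin code); NVIDIA_VISIBLE_DEVICES, however, is the actual mechanism the NVIDIA container runtime uses to expose selected GPUs to a container.

```python
# Illustrative allocation bookkeeping for the flow above (not plugin code).

class GpuPluginSketch:
    """Tracks healthy devices and hands out exclusive device IDs."""

    def __init__(self, device_ids):
        self.healthy = set(device_ids)   # devices advertised to kubelet (step 2)
        self.allocated = {}              # container name -> assigned device ids

    def free(self):
        in_use = {d for ids in self.allocated.values() for d in ids}
        return sorted(self.healthy - in_use)

    def allocate(self, container, count):
        free = self.free()
        if len(free) < count:
            raise RuntimeError("Insufficient nvidia.com/gpu")
        self.allocated[container] = free[:count]
        # NVIDIA_VISIBLE_DEVICES is how the NVIDIA runtime exposes only
        # the selected GPUs inside the container (step 4)
        return {"NVIDIA_VISIBLE_DEVICES": ",".join(free[:count])}

plugin = GpuPluginSketch(["GPU-0", "GPU-1", "GPU-2", "GPU-3"])
env = plugin.allocate("trainer", 2)
print(env["NVIDIA_VISIBLE_DEVICES"])  # GPU-0,GPU-1
print(len(plugin.free()))             # 2
```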

4.3 Resource Request/Limit Configuration

GPU resources can only be requested through resources.limits. Unlike CPU or memory, requests cannot be set separately; setting limits automatically applies the same value as requests.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 2 # request 2 GPUs

Important constraints:

  • GPUs can only be requested in integer units (nvidia.com/gpu: 0.5 is not allowed).
  • GPUs are not shared across nodes (a single Pod cannot use GPUs on multiple nodes at once).
  • If limits is not specified, no GPU is allocated.
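
These rules are easy to mirror in plain Python. The helper below is a hypothetical illustration of the semantics (whole-number counts, limit implies an equal request), not Kubernetes code.

```python
# Hypothetical helper mirroring the GPU request rules described above.

def normalize_gpu_resources(resources):
    limits = dict(resources.get("limits", {}))
    gpu = limits.get("nvidia.com/gpu")
    if gpu is None:
        return resources               # no GPU requested at all
    if int(gpu) != gpu:
        raise ValueError("nvidia.com/gpu must be an integer count")
    requests = dict(resources.get("requests", {}))
    requests["nvidia.com/gpu"] = gpu   # limit implies an identical request
    return {"limits": limits, "requests": requests}

print(normalize_gpu_resources({"limits": {"nvidia.com/gpu": 2}}))
# {'limits': {'nvidia.com/gpu': 2}, 'requests': {'nvidia.com/gpu': 2}}
```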

5. GPU Resource Requests/Limits in Detail

5.1 Basic Resource Model

In Kubernetes, GPUs are treated as Extended Resources. They are requested under the resource name nvidia.com/gpu, which the NVIDIA Device Plugin registers with kubelet.

resources:
  limits:
    nvidia.com/gpu: 1 # dedicate 1 GPU
    memory: '16Gi' # system memory is set separately from the GPU
    cpu: '4'

5.2 Capping GPU Usage with ResourceQuota

In multi-tenant environments, per-Namespace GPU usage can be capped. Note that for extended resources such as nvidia.com/gpu, a ResourceQuota only supports quota items with the requests. prefix.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml
spec:
  hard:
    requests.nvidia.com/gpu: '4' # at most 4 GPUs

5.3 Setting Defaults with LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limit-range
  namespace: team-ml
spec:
  limits:
    - type: Container
      default:
        nvidia.com/gpu: '1'
      defaultRequest:
        nvidia.com/gpu: '1'
      max:
        nvidia.com/gpu: '2'

6. GPU Sharing Strategy: MIG (Multi-Instance GPU)

6.1 MIG Overview

Multi-Instance GPU (MIG) is a hardware-level GPU partitioning technology supported on NVIDIA Ampere-architecture and later GPUs (A100, A30, H100, etc.). A single GPU can be divided into up to 7 independent GPU instances, each with dedicated compute resources, memory, and cache.

The key advantage of MIG is hardware-level isolation. Each MIG instance has an independent memory space and fault isolation, so a failure in one instance does not affect the others.

6.2 MIG Strategies

The GPU Operator supports two MIG strategies.

  • Single strategy: all GPUs on a node share the same MIG configuration. The resource name remains nvidia.com/gpu.
  • Mixed strategy: some GPUs run in MIG mode while others run as full GPUs. MIG instances are exposed as separate resources of the form nvidia.com/mig-<slice>.
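
A small helper makes the naming difference between the two strategies concrete (an illustrative sketch, not GPU Operator code):

```python
# Illustrative: which resource name a MIG slice is advertised under.

def mig_resource_name(strategy, profile=None):
    if strategy == "single" or profile is None:
        return "nvidia.com/gpu"             # single strategy keeps the generic name
    if strategy == "mixed":
        return f"nvidia.com/mig-{profile}"  # each slice size becomes its own resource
    raise ValueError(f"unknown strategy: {strategy}")

print(mig_resource_name("single", "1g.10gb"))  # nvidia.com/gpu
print(mig_resource_name("mixed", "1g.10gb"))   # nvidia.com/mig-1g.10gb
```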

6.3 Installing and Configuring MIG

Install the GPU Operator with MIG enabled.

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set mig.strategy=single

Apply a predefined MIG profile to a node.

# apply a MIG configuration Label to the node
kubectl label nodes gpu-node-01 \
  nvidia.com/mig.config=all-1g.10gb --overwrite

Key predefined profiles:

  • all-1g.10gb: split every GPU into 1g.10gb instances
  • all-3g.40gb: split every GPU into 3g.40gb instances
  • all-balanced: mix instances of several sizes
  • all-disabled: disable MIG mode

6.4 Custom MIG Configuration

Beyond the default profiles, user-defined configurations are also possible.

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      custom-profile:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 5
            "2g.20gb": 1

Point the ClusterPolicy at the custom ConfigMap.

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/migManager/config/name","value":"custom-mig-config"}]'

Check the configuration status.

kubectl get node gpu-node-01 -o=jsonpath='{.metadata.labels}' | jq .
# confirm "nvidia.com/mig.config.state": "success"

6.5 Consuming MIG Resources

A Pod that uses a MIG instance requests the resource as follows.

apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1 # request one 1g.10gb MIG instance

7. Configuring GPU Time-Slicing

7.1 Time-Slicing Overview

Time-Slicing uses the NVIDIA GPU's time-division scheduler to let multiple Pods share one GPU over time. Unlike MIG it provides no memory or fault isolation, but it works even on older GPUs without MIG support (T4, V100, etc.) and lets a GPU be shared across a larger number of users and workloads.

7.2 ConfigMap Configuration

Time-Slicing is configured through a ConfigMap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

Key configuration fields:

  • renameByDefault (boolean): if true, advertise the resource as nvidia.com/gpu.shared; if false, append a -SHARED suffix to the product label instead.
  • failRequestsGreaterThanOne (boolean): reject Pods that request more than one GPU replica; setting this to true is recommended.
  • resources.name (string): the resource name (e.g. nvidia.com/gpu).
  • resources.replicas (integer): number of time-slice replicas per GPU.
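
The effect of replicas and renameByDefault on what a node advertises reduces to simple arithmetic. The helper below is illustrative only; the field semantics follow the sharing configuration above.

```python
# Sketch of what Time-Slicing changes in a node's advertised resources.

def advertised_resources(physical_gpus, replicas, rename_by_default, product):
    name = "nvidia.com/gpu.shared" if rename_by_default else "nvidia.com/gpu"
    label = product if rename_by_default else f"{product}-SHARED"
    # each physical GPU is advertised `replicas` times
    return {name: physical_gpus * replicas, "nvidia.com/gpu.product": label}

print(advertised_resources(1, 4, False, "Tesla-T4"))
# {'nvidia.com/gpu': 4, 'nvidia.com/gpu.product': 'Tesla-T4-SHARED'}
```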

7.3 Applying to the Cluster

Apply cluster-wide:

kubectl apply -f time-slicing-config.yaml

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

Apply to specific nodes only:

# when the ConfigMap defines per-GPU-model settings
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config"}}}}'

# apply the Label to the node
kubectl label node gpu-node-01 \
  nvidia.com/device-plugin.config=tesla-t4

7.4 Verifying the Result

Once Time-Slicing is applied, the node's Labels and allocatable resources change.

kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable}' | jq .
# "nvidia.com/gpu": "4"  (1 physical GPU * replicas 4)

kubectl get node gpu-node-01 --show-labels | grep nvidia
# nvidia.com/gpu.replicas=4
# nvidia.com/gpu.product=Tesla-T4-SHARED

7.5 Key Constraints

  • No memory/fault isolation: GPU memory is shared across time-slice replicas, and a crash in one process can affect the others.
  • No proportional compute guarantee: requesting several GPU replicas does not proportionally grant more compute power.
  • DCGM Exporter limitation: with Time-Slicing enabled, attributing metrics to individual containers is not supported.
  • Manual restart on ConfigMap changes: the Operator does not automatically pick up ConfigMap changes, so the DaemonSet must be restarted manually.

kubectl rollout restart -n gpu-operator \
  daemonset/nvidia-device-plugin-daemonset

8. Configuring MPS (Multi-Process Service)

8.1 MPS Overview

NVIDIA MPS (Multi-Process Service) is a client-server architecture that lets multiple CUDA processes run concurrently on a single GPU. It reduces the overhead of CUDA context switching and multiplexes the CUDA kernels of several processes into one GPU context, raising GPU utilization.

8.2 MIG vs Time-Slicing vs MPS

  • Memory isolation: MIG yes (hardware) / Time-Slicing no / MPS partial (software)
  • Fault isolation: MIG yes / Time-Slicing no / MPS no
  • Supported GPUs: MIG Ampere and later / Time-Slicing all NVIDIA GPUs / MPS Volta and later recommended
  • Maximum partitions: MIG 7 / Time-Slicing unlimited / MPS 48 (pre-Volta: 16)
  • Concurrent kernel execution: MIG yes / Time-Slicing no (time-shared) / MPS yes
  • Partition size flexibility: MIG fixed profiles / Time-Slicing equal shares / MPS arbitrary sizes

8.3 MPS Pros and Cons

Pros:

  • Unlike MIG, GPU slices of arbitrary size can be created.
  • Unlike Time-Slicing, memory allocation limits can be enforced, reducing OOM errors.
  • CUDA kernels from multiple processes genuinely run concurrently, so GPU utilization is high.

Cons:

  • No fault isolation: a crash in one client process can trigger a GPU reset and affect every other process.
  • Memory protection is not complete.

8.4 Using MPS in Kubernetes

The NVIDIA Device Plugin does not currently support MPS directly, but MPS can be configured through NVIDIA's DRA (Dynamic Resource Allocation) driver, or by installing a separate Device Plugin that supports MPS.

Some managed Kubernetes services, such as GKE (Google Kubernetes Engine), support MPS natively.

# MPS usage example on GKE
apiVersion: v1
kind: Pod
metadata:
  name: mps-workload
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

To use MPS in a typical on-premises Kubernetes environment, deploy the MPS Control Daemon as a sidecar or DaemonSet and set the CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY environment variables.


9. Separating GPU Nodes with Node Affinity and Taints/Tolerations

GPU nodes are expensive, so workloads that do not need GPUs should be kept off them. Kubernetes Taints/Tolerations combined with Node Affinity achieve this.

9.1 Tainting GPU Nodes

# add a Taint to the GPU nodes
kubectl taint nodes gpu-node-01 \
  nvidia.com/gpu=present:NoSchedule

kubectl taint nodes gpu-node-02 \
  nvidia.com/gpu=present:NoSchedule

Once this Taint is in place, Pods without a matching Toleration are not scheduled onto the GPU nodes.

9.2 Adding a Toleration to GPU Workloads

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'present'
      effect: 'NoSchedule'
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

9.3 Selecting Specific GPU Models with Node Affinity

Labels generated by GPU Feature Discovery can be used to schedule workloads only onto nodes with specific GPU models.

apiVersion: v1
kind: Pod
metadata:
  name: a100-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - 'A100-SXM4-80GB'
                  - 'A100-SXM4-40GB'
              - key: nvidia.com/gpu.memory
                operator: Gt
                values:
                  - '40000'
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'present'
      effect: 'NoSchedule'
  containers:
    - name: llm-training
      image: nvcr.io/nvidia/pytorch:24.01-py3
      command: ['python', 'train.py']
      resources:
        limits:
          nvidia.com/gpu: 4

9.4 In Practice: Combining Taints and Affinity

The best practice for GPU node management is a two-pronged strategy: use Taints to block non-GPU workloads, and use Node Affinity to steer GPU workloads onto suitable nodes.

  • Taint: a "hard" constraint restricting which Pods may land on GPU nodes
  • Node Affinity: a "soft" or "hard" preference placing Pods on nodes with the desired GPU specs

A Toleration merely allows a Pod to be scheduled onto a tainted node; it does not force it there. Node Affinity must therefore be used alongside it to ensure GPU workloads land exactly on GPU nodes.


10. Monitoring with DCGM Exporter + Prometheus + Grafana

10.1 DCGM Exporter Overview

DCGM Exporter uses the Go API of NVIDIA Data Center GPU Manager (DCGM) to collect GPU telemetry and expose it in Prometheus format. Installing the GPU Operator deploys DCGM Exporter as a DaemonSet by default.

Key metrics:

  • DCGM_FI_DEV_GPU_UTIL: GPU utilization (%)
  • DCGM_FI_DEV_MEM_COPY_UTIL: memory copy utilization (%)
  • DCGM_FI_DEV_FB_USED: framebuffer memory used (MB)
  • DCGM_FI_DEV_FB_FREE: framebuffer memory free (MB)
  • DCGM_FI_DEV_GPU_TEMP: GPU temperature (°C)
  • DCGM_FI_DEV_POWER_USAGE: power usage (W)
  • DCGM_FI_DEV_SM_CLOCK: SM clock frequency (MHz)
  • DCGM_FI_DEV_PCIE_TX_THROUGHPUT: PCIe TX throughput
  • DCGM_FI_DEV_PCIE_RX_THROUGHPUT: PCIe RX throughput
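
The two framebuffer gauges combine into a memory-utilization ratio. The function below is plain arithmetic over the exported values (no DCGM client involved), the same shape commonly used in Prometheus alert expressions.

```python
# Framebuffer gauges -> memory utilization ratio (plain arithmetic).

def fb_utilization(fb_used_mb, fb_free_mb):
    total = fb_used_mb + fb_free_mb
    if total == 0:
        raise ValueError("no framebuffer reported")
    return fb_used_mb / total

print(round(fb_utilization(77000, 4000), 3))  # 0.951, i.e. above a 95% threshold
```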

10.2 Installing the Prometheus Stack

Deploy Prometheus and Grafana together using kube-prometheus-stack.

# add the Prometheus community Helm repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

Write a values file for GPU metric collection.

# kube-prometheus-stack-values.yaml
prometheus:
  service:
    type: NodePort
    nodePort: 30090
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    additionalScrapeConfigs:
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - gpu-operator
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node

grafana:
  service:
    type: NodePort
    nodePort: 32322

helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --values kube-prometheus-stack-values.yaml

10.3 Building the Grafana Dashboard

Import the official NVIDIA DCGM Exporter dashboard.

  1. Open Grafana (http://<node-ip>:32322).
  2. Go to Dashboards > Import.
  3. Enter Dashboard ID 12239 (NVIDIA DCGM Exporter Dashboard).
  4. Select Prometheus as the Data Source.

This dashboard visualizes GPU utilization, memory usage, temperature, and power draw in real time. For a dashboard that supports both MIG and non-MIG GPUs, use ID 23382.

10.4 Alert Rule Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: prometheus
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU temperature is high on {{ $labels.kubernetes_node }}'
            description: 'GPU {{ $labels.gpu }} temperature is {{ $value }}C'
        - alert: GPUMemoryAlmostFull
          expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'GPU memory usage above 95% on {{ $labels.kubernetes_node }}'
        - alert: GPUHighUtilization
          expr: DCGM_FI_DEV_GPU_UTIL > 95
          for: 30m
          labels:
            severity: info
          annotations:
            summary: 'Sustained high GPU utilization on {{ $labels.kubernetes_node }}'

11. Working with GPU Feature Discovery

11.1 Overview

GPU Feature Discovery (GFD) automatically detects the properties of a node's GPUs and generates Kubernetes Node Labels. It builds on Node Feature Discovery (NFD) and is deployed automatically as part of the GPU Operator.

11.2 Generated Labels

  • nvidia.com/gpu.product: GPU model name (e.g. Tesla-T4, A100-SXM4-80GB)
  • nvidia.com/gpu.memory: GPU memory capacity in MB (e.g. 40960)
  • nvidia.com/gpu.count: number of GPUs (e.g. 4)
  • nvidia.com/gpu.family: GPU architecture family (e.g. ampere, hopper)
  • nvidia.com/gpu.compute.major: CUDA compute capability, major (e.g. 8)
  • nvidia.com/gpu.compute.minor: CUDA compute capability, minor (e.g. 0)
  • nvidia.com/cuda.driver.major: CUDA driver major version (e.g. 535)
  • nvidia.com/cuda.driver.minor: CUDA driver minor version (e.g. 129)
  • nvidia.com/cuda.runtime.major: CUDA runtime major version (e.g. 12)
  • nvidia.com/cuda.runtime.minor: CUDA runtime minor version (e.g. 2)
  • nvidia.com/gpu.machine: machine type (the instance type in the cloud, e.g. p4d.24xlarge)
  • nvidia.com/gpu.replicas: number of GPU replicas when sharing (e.g. 4)
  • nvidia.com/mig.strategy: MIG strategy (single, mixed)
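
One subtlety worth noting: these label values are strings, and the Gt/Lt operators in a nodeAffinity matchExpression compare them as parsed integers. The hypothetical evaluator below (not scheduler code) makes that explicit.

```python
# Hypothetical Gt matchExpression evaluator against GFD labels.

def matches_gt(node_labels, key, value):
    raw = node_labels.get(key)
    if raw is None:
        return False                 # a missing label never matches
    return int(raw) > int(value)     # Gt parses the string values as integers

labels = {"nvidia.com/gpu.memory": "81920", "nvidia.com/gpu.count": "8"}
print(matches_gt(labels, "nvidia.com/gpu.memory", "79000"))  # True
print(matches_gt(labels, "nvidia.com/gpu.count", "8"))       # False
```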

11.3 Using the Labels

GFD Labels enable fine-grained scheduling based on GPU specifications.

# run only on Hopper-architecture GPUs
apiVersion: batch/v1
kind: Job
metadata:
  name: h100-inference
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.family: hopper
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never

# training a large model that needs 80GB+ of GPU memory
apiVersion: batch/v1
kind: Job
metadata:
  name: large-model-training
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.memory
                    operator: Gt
                    values:
                      - '79000'
                  - key: nvidia.com/gpu.count
                    operator: Gt
                    values:
                      - '7'
      containers:
        - name: training
          image: nvcr.io/nvidia/pytorch:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 8
      restartPolicy: Never

12. Real-World YAML Examples and Troubleshooting

12.1 PyTorch Distributed Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 2
  completions: 2
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Equal'
          value: 'present'
          effect: 'NoSchedule'
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.product
                    operator: In
                    values:
                      - 'A100-SXM4-80GB'
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - distributed-training
                topologyKey: kubernetes.io/hostname
      containers:
        - name: pytorch-trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3
          command:
            - torchrun
            - --nproc_per_node=4
            - --nnodes=2
            - train.py
          resources:
            limits:
              nvidia.com/gpu: 4
              memory: '64Gi'
              cpu: '16'
            requests:
              memory: '64Gi'
              cpu: '16'
          env:
            - name: NCCL_DEBUG
              value: 'INFO'
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: training-data
              mountPath: /data
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: '16Gi'
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
      restartPolicy: Never
  backoffLimit: 3

Caveat: in PyTorch distributed training, an undersized /dev/shm (shared memory) can crash the DataLoader when num_workers > 0. Mount an emptyDir with medium: Memory to allocate enough shared memory.

12.2 Deploying Triton Inference Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-inference
  template:
    metadata:
      labels:
        app: triton-inference
    spec:
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Equal'
          value: 'present'
          effect: 'NoSchedule'
      nodeSelector:
        nvidia.com/gpu.family: ampere
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=/models
            - --strict-model-config=false
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 15
          volumeMounts:
            - name: model-store
              mountPath: /models
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: triton-inference
spec:
  selector:
    app: triton-inference
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002

12.3 Troubleshooting Guide

When GPUs are not recognized on a node

# 1. check GPU Operator Pod status
kubectl get pods -n gpu-operator

# 2. check driver Pod logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. check Device Plugin Pod logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# 4. check the node's allocatable resources
kubectl describe node gpu-node-01 | grep -A 5 "Allocatable"

# 5. check NFD Labels
kubectl get node gpu-node-01 -o json | \
  jq '.metadata.labels | to_entries[] | select(.key | startswith("nvidia"))'

When Pods are not scheduled onto GPU nodes

# 1. check Pod events
kubectl describe pod <pod-name>
# look for "Insufficient nvidia.com/gpu"

# 2. check the node's GPU allocation
kubectl describe node gpu-node-01 | grep -A 3 "Allocated resources"

# 3. list Pods currently using GPUs
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.containers[].resources.limits["nvidia.com/gpu"] != null) | .metadata.name'

When the MIG configuration is not applied

# 1. check MIG Manager logs
kubectl logs -n gpu-operator -l app=nvidia-mig-manager

# 2. check the MIG configuration state Label
kubectl get node gpu-node-01 -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
# anything other than "success" means the configuration failed

# 3. check the GPU mode (on the node)
nvidia-smi -L
nvidia-smi --query-gpu=mig.mode.current --format=csv

When the GPU count does not change after enabling Time-Slicing

# 1. inspect the ConfigMap
kubectl get configmap -n gpu-operator time-slicing-config -o yaml

# 2. restart the Device Plugin
kubectl rollout restart -n gpu-operator \
  daemonset/nvidia-device-plugin-daemonset

# 3. check node resources after the restart
kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'

References

Kubernetes GPU Workload Management: The Complete NVIDIA GPU Operator Guide

1. Why You Need GPUs in Kubernetes

With the explosive growth of AI/ML workloads, GPUs are no longer optional but essential infrastructure. GPU-accelerated computing is demanded across diverse domains including LLM training, inference serving, computer vision, and scientific simulations. To operate these workloads at scale, Kubernetes-based orchestration is inevitable.

Using GPUs in Kubernetes provides the following benefits:

  • Automated Resource Scheduling: GPUs can be declaratively requested and allocated on a per-Pod basis through the nvidia.com/gpu resource type.
  • Multi-Tenancy: Namespace, ResourceQuota, and LimitRange can be leveraged to isolate and fairly distribute GPU resources across teams.
  • GPU Sharing: Technologies such as MIG, Time-Slicing, and MPS allow multiple workloads to share a single GPU, maximizing cost efficiency.
  • Auto Scaling: Combined with tools like HPA (Horizontal Pod Autoscaler) or Karpenter, GPU workloads can be automatically scaled based on demand.
  • Operational Standardization: Operational tasks such as GPU driver installation, monitoring, and incident response can be automated using the Operator pattern.

However, to properly leverage GPUs in Kubernetes, multiple software components including GPU drivers, Container Toolkit, Device Plugin, and monitoring tools must be accurately installed and managed. To address this complexity, NVIDIA provides the GPU Operator.


2. NVIDIA GPU Operator Architecture

The NVIDIA GPU Operator uses the Kubernetes Operator Framework to automatically provision and manage all NVIDIA software components needed on GPU nodes. According to the official documentation, the GPU Operator includes the following components:

2.1 Core Components

ComponentRole
NVIDIA GPU DriverProvides the interface between the GPU and the operating system, enabling CUDA.
NVIDIA Container ToolkitEnables container runtimes (containerd, CRI-O) to access GPUs.
NVIDIA Device PluginExposes GPUs as Kubernetes resources through the kubelet API.
GPU Feature Discovery (GFD)Detects GPU properties (model, memory, CUDA version, etc.) on nodes and automatically generates Node Labels.
DCGM ExporterExposes GPU metrics in Prometheus format based on NVIDIA DCGM (Data Center GPU Manager).
MIG ManagerManages Multi-Instance GPU configurations using the Kubernetes controller pattern.
GPU Direct Storage (GDS)Enables direct data transfer between storage devices and GPU memory.

2.2 How It Works

When the GPU Operator is deployed, initialization proceeds in the following order:

  1. gpu-operator Pod starts: Acts as the controller that monitors and reconciles the state of all components.
  2. Node Feature Discovery (NFD) deployment: Deploys Pods to each node in the cluster to detect GPU presence and add relevant Labels.
  3. Driver and Toolkit installation: Deploys NVIDIA drivers and Container Toolkit as DaemonSets on nodes where GPUs are detected.
  4. Device Plugin deployment: Registers GPUs as nvidia.com/gpu resources with kubelet.
  5. Auxiliary component deployment: Deploys GFD, DCGM Exporter, MIG Manager, etc.

All components are declaratively managed through a Custom Resource called ClusterPolicy. The Operator continuously compares the desired state and actual state of the ClusterPolicy and automatically corrects any discrepancies.


3. GPU Operator Installation

The GPU Operator is installed via Helm Chart. Let's walk through the installation step by step based on the official installation guide.

3.1 Prerequisites

  • kubectl and helm CLI must be installed.
  • containerd or CRI-O must be used as the container runtime.
  • All Worker Nodes running GPU workloads should have the same OS version (unless drivers are pre-installed separately).
  • If Pod Security Admission (PSA) is used, the Namespace must be configured with the privileged level.

3.2 Installation Procedure

# 1. Create Namespace and configure PSA
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator \
  pod-security.kubernetes.io/enforce=privileged

# 2. Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 3. Install GPU Operator (default configuration)
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1

3.3 Key Installation Options

ParameterPurposeDefault
driver.enabledWhether to deploy the NVIDIA drivertrue
toolkit.enabledWhether to deploy Container Toolkittrue
nfd.enabledWhether to deploy Node Feature Discoverytrue
dcgmExporter.enabledWhether to enable GPU telemetrytrue
cdi.enabledWhether to use Container Device Interfacetrue
driver.versionSpecify a particular driver versionVaries by release
mig.strategyMIG strategy setting (none, single, mixed)none

When the NVIDIA driver is already installed on the host:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set driver.enabled=false

When both the driver and Toolkit are pre-installed:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set driver.enabled=false \
  --set toolkit.enabled=false

3.4 Installation Verification

After installation is complete, you can verify GPU operation with a simple CUDA sample:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: 'nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04'
      resources:
        limits:
          nvidia.com/gpu: 1
kubectl apply -f cuda-vectoradd.yaml
kubectl logs pod/cuda-vectoradd
# [Vector addition of 50000 elements]
# Test PASSED

4. NVIDIA Device Plugin In-Depth Analysis

The NVIDIA Device Plugin implements the Kubernetes Device Plugin Framework and is a core component of the GPU Operator.

4.1 Key Features

  • GPU Enumeration: Detects all GPUs installed on a node and reports the count to kubelet.
  • GPU Health Monitoring: Continuously checks GPU status and excludes unhealthy GPUs from scheduling.
  • GPU Allocation: Allocates available GPUs when a Pod requests nvidia.com/gpu resources.
  • Resource Sharing: Supports GPU sharing strategies such as Time-Slicing and MIG.

4.2 How It Works

The Device Plugin is deployed as a DaemonSet and runs on each GPU node. The operation flow is as follows:

  1. The Plugin starts a gRPC server and registers with kubelet's Device Plugin socket.
  2. kubelet reflects the capacity of nvidia.com/gpu resources in the Node object.
  3. When a Pod requests those resources, the kube-scheduler schedules it to an appropriate node.
  4. kubelet requests GPU allocation from the Device Plugin, and the Plugin configures the necessary device files and environment variables for the container.

4.3 Resource Request/Limit Configuration

GPU resources can only be requested through resources.limits. Unlike CPU or memory, requests cannot be set separately -- setting limits automatically applies the same value as requests.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 2 # Request 2 GPUs

Important Constraints:

  • GPUs can only be requested in integer units (nvidia.com/gpu: 0.5 is not allowed).
  • GPUs are not shared across nodes (a single Pod cannot simultaneously use GPUs from multiple nodes).
  • If limits is not specified, no GPU will be allocated.

5. GPU Resource Requests/Limits In Detail

5.1 Basic Resource Model

In Kubernetes, GPUs are treated as Extended Resources. They are requested using the resource name nvidia.com/gpu, which is registered with kubelet by the NVIDIA Device Plugin.

resources:
  limits:
    nvidia.com/gpu: 1 # Exclusively allocate 1 GPU
    memory: '16Gi' # System memory is set separately from the GPU
    cpu: '4'

5.2 GPU Limitation via ResourceQuota

In multi-tenant environments, GPU usage per Namespace can be restricted:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml
spec:
  hard:
    requests.nvidia.com/gpu: '4' # Maximum 4 GPUs
    limits.nvidia.com/gpu: '4'

5.3 Default Values via LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limit-range
  namespace: team-ml
spec:
  limits:
    - type: Container
      default:
        nvidia.com/gpu: '1'
      defaultRequest:
        nvidia.com/gpu: '1'
      max:
        nvidia.com/gpu: '2'
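Assuming LimitRange defaulting applies to this extended resource the same way it does for CPU and memory, a container that declares no GPU resources would receive the default of one GPU at admission (the Pod name below is hypothetical):

```yaml
# Illustrative Pod with no GPU limit declared; the LimitRange above
# would inject nvidia.com/gpu: "1" as both request and limit.
apiVersion: v1
kind: Pod
metadata:
  name: defaulted-gpu-pod
  namespace: team-ml
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/pytorch:24.01-py3
```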

6. GPU Sharing Strategy: MIG (Multi-Instance GPU)

6.1 MIG Overview

Multi-Instance GPU (MIG) is a hardware-level GPU partitioning technology supported on NVIDIA Ampere architecture and above (A100, A30, H100, etc.). A single GPU can be divided into up to 7 independent GPU instances, each with dedicated computing resources, memory, and cache.

The key advantage of MIG is hardware-level isolation. Each MIG instance provides independent memory space and fault isolation, so a failure in one instance does not affect others.

6.2 MIG Strategies

The GPU Operator supports two MIG strategies:

  • Single Strategy: All GPUs on a node have the same MIG configuration. The resource name remains nvidia.com/gpu.
  • Mixed Strategy: Some GPUs can operate in MIG mode while others run in Full GPU mode. MIG instances are exposed as separate resources in the form nvidia.com/mig-<slice>.

6.3 MIG Installation and Configuration

Install the GPU Operator with MIG enabled:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set mig.strategy=single

Apply a predefined MIG profile to a node:

# Apply MIG configuration Label to the node
kubectl label nodes gpu-node-01 \
  nvidia.com/mig.config=all-1g.10gb --overwrite

Key predefined profiles:

Profile | Description
all-1g.10gb | Partition all GPUs into 1g.10gb instances
all-3g.40gb | Partition all GPUs into 3g.40gb instances
all-balanced | Mix instances of various sizes
all-disabled | Disable MIG mode

6.4 Custom MIG Configuration

Custom configurations beyond the default profiles are also possible:

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      custom-profile:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 5
            "2g.20gb": 1

Link the custom ConfigMap to the ClusterPolicy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/migManager/config/name","value":"custom-mig-config"}]'

Verify the configuration status:

kubectl get node gpu-node-01 -o=jsonpath='{.metadata.labels}' | jq .
# Verify "nvidia.com/mig.config.state": "success"

6.5 Using MIG Resources

Pods using MIG instances request resources as follows:

apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1 # Request 1 MIG 1g.10gb instance

7. GPU Time-Slicing Configuration

7.1 Time-Slicing Overview

Time-Slicing is a method that uses NVIDIA GPU's time-division scheduler to allow multiple Pods to share a single GPU over time. Unlike MIG, it does not provide memory isolation or fault isolation, but it can be used on older GPUs (T4, V100, etc.) that do not support MIG and allows sharing GPUs with a larger number of users/workloads.

7.2 ConfigMap Configuration

Time-Slicing is configured through a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

Key Configuration Fields:

Field | Type | Description
renameByDefault | boolean | When true, advertises the resource as nvidia.com/gpu.shared. When false, adds a -SHARED suffix to the product label.
failRequestsGreaterThanOne | boolean | Rejects Pods requesting more than one GPU replica. Setting this to true is recommended.
resources.name | string | Resource name (e.g. nvidia.com/gpu)
resources.replicas | integer | Number of Time-Slice replicas per GPU
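As a sketch of the renameByDefault behavior (the ConfigMap and Pod names here are illustrative), setting it to true advertises the shared replicas under a distinct resource name, which Pods then request explicitly:

```yaml
# Sketch: renameByDefault: true exposes replicas as nvidia.com/gpu.shared.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-renamed
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        resources:
          - name: nvidia.com/gpu
            replicas: 4
---
# A Pod then requests the renamed resource:
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu.shared: 1
```

Renaming makes sharing visible to users, avoiding the surprise of an nvidia.com/gpu request landing on a time-sliced device.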

7.3 Cluster Application

Apply cluster-wide:

kubectl apply -f time-slicing-config.yaml

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

Apply to specific nodes only:

# When per-GPU-model configurations are defined in the ConfigMap
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config"}}}}'

# Apply Label to the target node
kubectl label node gpu-node-01 \
  nvidia.com/device-plugin.config=tesla-t4
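The per-model ConfigMap referenced by that Label might look like the following sketch (the a100 entry and replica counts are illustrative); each data key is a config name that nodes select via the nvidia.com/device-plugin.config Label:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
  a100: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
```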

7.4 Verifying the Result

When Time-Slicing is applied, the node's Labels and allocatable resources will change:

kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable}' | jq .
# "nvidia.com/gpu": "4"  (1 physical GPU * 4 replicas)

kubectl get node gpu-node-01 --show-labels | grep nvidia
# nvidia.com/gpu.replicas=4
# nvidia.com/gpu.product=Tesla-T4-SHARED

7.5 Key Limitations

  • No memory/fault isolation: GPU memory is shared between Time-Slice replicas, and a failure in one process can affect others.
  • No proportional compute guarantee: Requesting multiple GPU replicas does not guarantee proportionally more computing power.
  • DCGM Exporter limitation: When GPU Time-Slicing is enabled, associating metrics with individual containers is not supported.
  • Manual restart required on ConfigMap changes: The Operator does not automatically detect ConfigMap changes, so the DaemonSet must be restarted manually.

kubectl rollout restart -n gpu-operator \
  daemonset/nvidia-device-plugin-daemonset

8. MPS (Multi-Process Service) Configuration

8.1 MPS Overview

NVIDIA MPS (Multi-Process Service) is a client-server architecture that allows multiple CUDA processes to run simultaneously on a single GPU. It reduces the overhead of traditional CUDA context switching and multiplexes CUDA kernels from multiple processes into a single GPU context, improving GPU utilization.

8.2 MIG vs Time-Slicing vs MPS Comparison

Feature | MIG | Time-Slicing | MPS
Memory Isolation | Yes (hardware) | No | Partial (software)
Fault Isolation | Yes | No | No
Supported GPUs | Ampere and above | All NVIDIA GPUs | Volta and above recommended
Max Partitions | 7 | Unlimited | 48 (pre-Volta: 16)
Concurrent Kernels | Yes | No (time-divided) | Yes
Partition Flexibility | Fixed profiles | Equal division | Arbitrary sizes

8.3 MPS Pros and Cons

Pros:

  • Unlike MIG, arbitrary-sized GPU slices can be created.
  • Unlike Time-Slicing, memory allocation limits can be enforced, reducing OOM errors.
  • CUDA kernels from multiple processes actually execute concurrently, resulting in higher GPU utilization.

Cons:

  • No fault isolation is provided. A crash in one client process can trigger a GPU reset, affecting all other processes.
  • Memory protection is not complete.

8.4 Using MPS in Kubernetes

Currently, the NVIDIA Device Plugin does not directly support MPS, but MPS can be configured through NVIDIA's DRA (Dynamic Resource Allocation) driver. Alternatively, a separate Device Plugin that supports MPS can be installed.

Some managed Kubernetes services such as GKE (Google Kubernetes Engine) natively support MPS.

# MPS usage example on GKE -- the sharing strategy is configured on the
# node pool, so the Pod requests nvidia.com/gpu as usual
apiVersion: v1
kind: Pod
metadata:
  name: mps-workload
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

In typical on-premises Kubernetes environments, to use MPS you need to deploy the MPS Control Daemon as a sidecar or DaemonSet and configure the CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY environment variables.
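A minimal sketch of such a deployment follows; the DaemonSet name, image, and paths are assumptions, not an official manifest. The control daemon publishes its pipe directory on the host so that CUDA client containers mounting the same path, with the same CUDA_MPS_PIPE_DIRECTORY value, attach to it:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: mps-control-daemon   # hypothetical name
spec:
  selector:
    matchLabels:
      app: mps-control-daemon
  template:
    metadata:
      labels:
        app: mps-control-daemon
    spec:
      containers:
        - name: mps
          # Assumed image; nvidia-cuda-mps-control must be available in the
          # container (it ships with the NVIDIA driver installation).
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
          command: ["nvidia-cuda-mps-control", "-f"]  # run in foreground
          env:
            - name: CUDA_MPS_PIPE_DIRECTORY
              value: /mps/pipe
            - name: CUDA_MPS_LOG_DIRECTORY
              value: /mps/log
            - name: NVIDIA_VISIBLE_DEVICES  # expose GPUs via the Container Toolkit
              value: all
          volumeMounts:
            - name: mps-dir
              mountPath: /mps
      volumes:
        - name: mps-dir
          hostPath:
            path: /var/run/mps
            type: DirectoryOrCreate
```

Client Pods would mount the same hostPath at /mps and set CUDA_MPS_PIPE_DIRECTORY=/mps/pipe so their CUDA runtime connects to the daemon instead of creating a private GPU context.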


9. GPU Node Isolation with Node Affinity and Taint/Toleration

GPU nodes are expensive resources, so workloads that do not require GPUs should be prevented from being scheduled on GPU nodes. This is achieved by combining Kubernetes Taint/Toleration and Node Affinity.

9.1 Applying Taints to GPU Nodes

# Add Taint to GPU nodes
kubectl taint nodes gpu-node-01 \
  nvidia.com/gpu=present:NoSchedule

kubectl taint nodes gpu-node-02 \
  nvidia.com/gpu=present:NoSchedule

Once this Taint is applied, Pods without the corresponding Toleration will not be scheduled on GPU nodes.

9.2 Adding Tolerations to GPU Workloads

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'present'
      effect: 'NoSchedule'
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

9.3 Selecting Specific GPU Models with Node Affinity

Using Labels generated by GPU Feature Discovery, workloads can be scheduled only on nodes with specific GPU models:

apiVersion: v1
kind: Pod
metadata:
  name: a100-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - 'A100-SXM4-80GB'
                  - 'A100-SXM4-40GB'
              - key: nvidia.com/gpu.memory
                operator: Gt
                values:
                  - '40000'
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'present'
      effect: 'NoSchedule'
  containers:
    - name: llm-training
      image: nvcr.io/nvidia/pytorch:24.01-py3
      command: ['python', 'train.py']
      resources:
        limits:
          nvidia.com/gpu: 4

9.4 Production Strategy: Combining Taint + Affinity

The best practice for GPU node management is a dual strategy of blocking non-GPU workloads with Taints and directing GPU workloads to appropriate nodes with Node Affinity.

  • Taint: A "hard" constraint that restricts which Pods can access GPU nodes
  • Node Affinity: A "soft" or "hard" preference that places Pods on nodes with desired GPU specifications

Tolerations only allow a Pod to be scheduled on a Tainted node; they do not force the Pod to that node. Therefore, Node Affinity must be used together to ensure GPU workloads are placed exclusively on GPU nodes.


10. Monitoring with DCGM Exporter + Prometheus + Grafana

10.1 DCGM Exporter Overview

DCGM Exporter is a component that collects GPU telemetry data using the Go API of NVIDIA Data Center GPU Manager (DCGM) and exposes it in Prometheus format. When the GPU Operator is installed, DCGM Exporter is deployed as a DaemonSet by default.

Key collected metrics:

Metric | Description
DCGM_FI_DEV_GPU_UTIL | GPU utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization (%)
DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB)
DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MB)
DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C)
DCGM_FI_DEV_POWER_USAGE | Power usage (W)
DCGM_FI_DEV_SM_CLOCK | SM clock frequency (MHz)
DCGM_FI_DEV_PCIE_TX_THROUGHPUT | PCIe TX throughput
DCGM_FI_DEV_PCIE_RX_THROUGHPUT | PCIe RX throughput

10.2 Installing the Prometheus Stack

Deploy Prometheus and Grafana together using kube-prometheus-stack:

# Add Prometheus community Helm repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

Create a values file for GPU metric collection:

# kube-prometheus-stack-values.yaml
prometheus:
  service:
    type: NodePort
    nodePort: 30090
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    additionalScrapeConfigs:
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - gpu-operator
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node

grafana:
  service:
    type: NodePort
    nodePort: 32322
Install the chart with the values file:

helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --values kube-prometheus-stack-values.yaml

10.3 Grafana Dashboard Setup

Import the official NVIDIA DCGM Exporter dashboard:

  1. Access Grafana (http://<node-ip>:32322).
  2. Navigate to Dashboards > Import.
  3. Enter Dashboard ID 12239 (NVIDIA DCGM Exporter Dashboard).
  4. Select Prometheus as the Data Source.

This dashboard visualizes GPU utilization, memory usage, temperature, and power consumption in real time. For a dashboard that supports both MIG and Non-MIG GPUs, use ID 23382.

10.4 Alert Configuration Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: prometheus
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU temperature is high on {{ $labels.kubernetes_node }}'
            description: 'GPU {{ $labels.gpu }} temperature is {{ $value }}C'
        - alert: GPUMemoryAlmostFull
          expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'GPU memory usage above 95% on {{ $labels.kubernetes_node }}'
        - alert: GPUHighUtilization
          expr: DCGM_FI_DEV_GPU_UTIL > 95
          for: 30m
          labels:
            severity: info
          annotations:
            summary: 'Sustained high GPU utilization on {{ $labels.kubernetes_node }}'

11. Leveraging GPU Feature Discovery

11.1 Overview

GPU Feature Discovery (GFD) is a component that automatically detects GPU properties installed on nodes and generates Kubernetes Node Labels. It operates on top of Node Feature Discovery (NFD) and is automatically deployed as part of the GPU Operator.

11.2 Generated Labels

Label | Description | Example Value
nvidia.com/gpu.product | GPU model name | Tesla-T4, A100-SXM4-80GB
nvidia.com/gpu.memory | GPU memory capacity (MB) | 40960
nvidia.com/gpu.count | GPU count | 4
nvidia.com/gpu.family | GPU architecture family | ampere, hopper
nvidia.com/gpu.compute.major | CUDA Compute Capability (major) | 8
nvidia.com/gpu.compute.minor | CUDA Compute Capability (minor) | 0
nvidia.com/cuda.driver.major | CUDA driver major version | 535
nvidia.com/cuda.driver.minor | CUDA driver minor version | 129
nvidia.com/cuda.runtime.major | CUDA runtime major version | 12
nvidia.com/cuda.runtime.minor | CUDA runtime minor version | 2
nvidia.com/gpu.machine | Machine type (instance type in cloud) | p4d.24xlarge
nvidia.com/gpu.replicas | GPU replica count (when sharing) | 4
nvidia.com/mig.strategy | MIG strategy | single, mixed

11.3 Label Usage Examples

GFD Labels enable fine-grained scheduling based on GPU specifications:

# Run only on Hopper architecture GPUs
apiVersion: batch/v1
kind: Job
metadata:
  name: h100-inference
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.family: hopper
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never

# Large model training requiring 80GB+ GPU memory
apiVersion: batch/v1
kind: Job
metadata:
  name: large-model-training
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.memory
                    operator: Gt
                    values:
                      - '79000'
                  - key: nvidia.com/gpu.count
                    operator: Gt
                    values:
                      - '7'
      containers:
        - name: training
          image: nvcr.io/nvidia/pytorch:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 8
      restartPolicy: Never

12. Production YAML Examples and Troubleshooting

12.1 PyTorch Distributed Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 2
  completions: 2
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Equal'
          value: 'present'
          effect: 'NoSchedule'
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.product
                    operator: In
                    values:
                      - 'A100-SXM4-80GB'
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - distributed-training
                topologyKey: kubernetes.io/hostname
      containers:
        - name: pytorch-trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3
          command:
            - torchrun
            - --nproc_per_node=4
            - --nnodes=2
            - train.py
          resources:
            limits:
              nvidia.com/gpu: 4
              memory: '64Gi'
              cpu: '16'
            requests:
              memory: '64Gi'
              cpu: '16'
          env:
            - name: NCCL_DEBUG
              value: 'INFO'
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: training-data
              mountPath: /data
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: '16Gi'
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
      restartPolicy: Never
  backoffLimit: 3

Note: In PyTorch distributed training, if /dev/shm (shared memory) is too small, crashes can occur when the DataLoader runs with num_workers > 0. Mount an emptyDir with the Memory medium and allocate a sufficient size.

12.2 Triton Inference Server Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-inference
  template:
    metadata:
      labels:
        app: triton-inference
    spec:
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Equal'
          value: 'present'
          effect: 'NoSchedule'
      nodeSelector:
        nvidia.com/gpu.family: ampere
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=/models
            - --strict-model-config=false
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 15
          volumeMounts:
            - name: model-store
              mountPath: /models
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: triton-inference
spec:
  selector:
    app: triton-inference
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002

12.3 Troubleshooting Guide

GPU Not Recognized on Node

# 1. Check GPU Operator Pod status
kubectl get pods -n gpu-operator

# 2. Check Driver Pod logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Device Plugin Pod logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# 4. Check node's allocatable resources
kubectl describe node gpu-node-01 | grep -A 5 "Allocatable"

# 5. Check NFD Labels
kubectl get node gpu-node-01 -o json | \
  jq '.metadata.labels | to_entries[] | select(.key | startswith("nvidia"))'

Pod Not Being Scheduled on GPU Node

# 1. Check Pod events
kubectl describe pod <pod-name>
# Look for "Insufficient nvidia.com/gpu" message

# 2. Check node's GPU allocation status
kubectl describe node gpu-node-01 | grep -A 3 "Allocated resources"

# 3. List Pods currently using GPUs
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.containers[].resources.limits["nvidia.com/gpu"] != null) | .metadata.name'

MIG Configuration Not Applied

# 1. Check MIG Manager logs
kubectl logs -n gpu-operator -l app=nvidia-mig-manager

# 2. Check MIG configuration state Label
kubectl get node gpu-node-01 -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
# If not "success", configuration failed

# 3. Check GPU mode (on the node)
nvidia-smi -L
nvidia-smi --query-gpu=mig.mode.current --format=csv

GPU Count Not Changed After Time-Slicing

# 1. Check ConfigMap content
kubectl get configmap -n gpu-operator time-slicing-config -o yaml

# 2. Restart Device Plugin
kubectl rollout restart -n gpu-operator \
  daemonset/nvidia-device-plugin-daemonset

# 3. Check node resources after restart
kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
