Kubernetes GPU Workload Management: The Complete NVIDIA GPU Operator Guide


1. Why You Need GPUs in Kubernetes

With the explosive growth of AI/ML workloads, GPUs are no longer optional but essential infrastructure. GPU-accelerated computing is in demand across domains including LLM training, inference serving, computer vision, and scientific simulation. Operating these workloads at scale makes Kubernetes-based orchestration the natural choice.

Using GPUs in Kubernetes provides the following benefits:

  • Automated Resource Scheduling: GPUs can be declaratively requested and allocated on a per-Pod basis through the nvidia.com/gpu resource type.
  • Multi-Tenancy: Namespace, ResourceQuota, and LimitRange can be leveraged to isolate and fairly distribute GPU resources across teams.
  • GPU Sharing: Technologies such as MIG, Time-Slicing, and MPS allow multiple workloads to share a single GPU, maximizing cost efficiency.
  • Auto Scaling: Combined with tools like HPA (Horizontal Pod Autoscaler) or Karpenter, GPU workloads can be automatically scaled based on demand.
  • Operational Standardization: Operational tasks such as GPU driver installation, monitoring, and incident response can be automated using the Operator pattern.

However, to properly leverage GPUs in Kubernetes, multiple software components including GPU drivers, Container Toolkit, Device Plugin, and monitoring tools must be accurately installed and managed. To address this complexity, NVIDIA provides the GPU Operator.


2. NVIDIA GPU Operator Architecture

The NVIDIA GPU Operator uses the Kubernetes Operator Framework to automatically provision and manage all NVIDIA software components needed on GPU nodes. According to the official documentation, the GPU Operator includes the following components:

2.1 Core Components

| Component | Role |
| --- | --- |
| NVIDIA GPU Driver | Provides the interface between the GPU and the operating system, enabling CUDA. |
| NVIDIA Container Toolkit | Enables container runtimes (containerd, CRI-O) to access GPUs. |
| NVIDIA Device Plugin | Exposes GPUs as Kubernetes resources through the kubelet API. |
| GPU Feature Discovery (GFD) | Detects GPU properties (model, memory, CUDA version, etc.) on nodes and automatically generates Node Labels. |
| DCGM Exporter | Exposes GPU metrics in Prometheus format based on NVIDIA DCGM (Data Center GPU Manager). |
| MIG Manager | Manages Multi-Instance GPU configurations using the Kubernetes controller pattern. |
| GPU Direct Storage (GDS) | Enables direct data transfer between storage devices and GPU memory. |

2.2 How It Works

When the GPU Operator is deployed, initialization proceeds in the following order:

  1. gpu-operator Pod starts: Acts as the controller that monitors and reconciles the state of all components.
  2. Node Feature Discovery (NFD) deployment: Deploys Pods to each node in the cluster to detect GPU presence and add relevant Labels.
  3. Driver and Toolkit installation: Deploys NVIDIA drivers and Container Toolkit as DaemonSets on nodes where GPUs are detected.
  4. Device Plugin deployment: Registers GPUs as nvidia.com/gpu resources with kubelet.
  5. Auxiliary component deployment: Deploys GFD, DCGM Exporter, MIG Manager, etc.

All components are declaratively managed through a Custom Resource called ClusterPolicy. The Operator continuously compares the desired state and actual state of the ClusterPolicy and automatically corrects any discrepancies.
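This reconciliation loop can be observed from the outside. Assuming the default ClusterPolicy name `cluster-policy` (the name used by the Helm chart unless overridden), the aggregate state is exposed in its status:

```shell
# Overall state of all Operator-managed components
kubectl get clusterpolicies.nvidia.com cluster-policy \
  -o jsonpath='{.status.state}'
# Reports "ready" once every component has converged
```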


3. GPU Operator Installation

The GPU Operator is installed via Helm Chart. Let's walk through the installation step by step based on the official installation guide.

3.1 Prerequisites

  • kubectl and helm CLI must be installed.
  • containerd or CRI-O must be used as the container runtime.
  • All Worker Nodes running GPU workloads should have the same OS version (unless drivers are pre-installed separately).
  • If Pod Security Admission (PSA) is used, the Namespace must be configured with the privileged level.
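The prerequisites above can be spot-checked before installing; the container runtime in use appears in the CONTAINER-RUNTIME column of the wide node listing:

```shell
kubectl version --client
helm version --short
# Confirm containerd or CRI-O on every worker node
kubectl get nodes -o wide
```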

3.2 Installation Procedure

# 1. Create Namespace and configure PSA
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator \
  pod-security.kubernetes.io/enforce=privileged

# 2. Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 3. Install GPU Operator (default configuration)
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1

3.3 Key Installation Options

| Parameter | Purpose | Default |
| --- | --- | --- |
| driver.enabled | Whether to deploy the NVIDIA driver | true |
| toolkit.enabled | Whether to deploy Container Toolkit | true |
| nfd.enabled | Whether to deploy Node Feature Discovery | true |
| dcgmExporter.enabled | Whether to enable GPU telemetry | true |
| cdi.enabled | Whether to use Container Device Interface | true |
| driver.version | Specify a particular driver version | Varies by release |
| mig.strategy | MIG strategy setting (none, single, mixed) | none |

When the NVIDIA driver is already installed on the host:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set driver.enabled=false

When both the driver and Toolkit are pre-installed:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set driver.enabled=false \
  --set toolkit.enabled=false

3.4 Installation Verification

After installation is complete, you can verify GPU operation with a simple CUDA sample:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: 'nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04'
      resources:
        limits:
          nvidia.com/gpu: 1

Apply the manifest and check the logs:

kubectl apply -f cuda-vectoradd.yaml
kubectl logs pod/cuda-vectoradd
# [Vector addition of 50000 elements]
# Test PASSED

4. NVIDIA Device Plugin In-Depth Analysis

The NVIDIA Device Plugin implements the Kubernetes Device Plugin Framework and is a core component of the GPU Operator.

4.1 Key Features

  • GPU Enumeration: Detects all GPUs installed on a node and reports the count to kubelet.
  • GPU Health Monitoring: Continuously checks GPU status and excludes unhealthy GPUs from scheduling.
  • GPU Allocation: Allocates available GPUs when a Pod requests nvidia.com/gpu resources.
  • Resource Sharing: Supports GPU sharing strategies such as Time-Slicing and MIG.

4.2 How It Works

The Device Plugin is deployed as a DaemonSet and runs on each GPU node. The operation flow is as follows:

  1. The Plugin starts a gRPC server and registers with kubelet's Device Plugin socket.
  2. kubelet reflects the capacity of nvidia.com/gpu resources in the Node object.
  3. When a Pod requests those resources, the kube-scheduler schedules it to an appropriate node.
  4. kubelet requests GPU allocation from the Device Plugin, and the Plugin configures the necessary device files and environment variables for the container.
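Steps 1–2 are visible on the Node object itself: once the Plugin has registered, the GPU count appears under both capacity and allocatable:

```shell
kubectl get node gpu-node-01 \
  -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
kubectl get node gpu-node-01 \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```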

4.3 Resource Request/Limit Configuration

GPU resources can only be requested through resources.limits. Unlike CPU or memory, a lower requests value cannot be set separately: specifying limits automatically sets requests to the same value, and if both are specified they must be equal.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 2 # Request 2 GPUs

Important Constraints:

  • GPUs can only be requested in integer units (nvidia.com/gpu: 0.5 is not allowed).
  • GPUs are not shared across nodes (a single Pod cannot simultaneously use GPUs from multiple nodes).
  • If limits is not specified, no GPU will be allocated.

5. GPU Resource Requests/Limits In Detail

5.1 Basic Resource Model

In Kubernetes, GPUs are treated as Extended Resources. They are requested using the resource name nvidia.com/gpu, which is registered with kubelet by the NVIDIA Device Plugin.

resources:
  limits:
    nvidia.com/gpu: 1 # Exclusively allocate 1 GPU
    memory: '16Gi' # System memory is set separately from the GPU
    cpu: '4'

5.2 GPU Limitation via ResourceQuota

In multi-tenant environments, GPU usage per Namespace can be restricted:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml
spec:
  hard:
    requests.nvidia.com/gpu: '4' # Maximum 4 GPUs
    limits.nvidia.com/gpu: '4'
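Current consumption against the quota can be checked at any time; the Used versus Hard columns show how many of the namespace's 4 GPUs are claimed:

```shell
kubectl describe resourcequota gpu-quota -n team-ml
```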

5.3 Default Values via LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limit-range
  namespace: team-ml
spec:
  limits:
    - type: Container
      default:
        nvidia.com/gpu: '1'
      defaultRequest:
        nvidia.com/gpu: '1'
      max:
        nvidia.com/gpu: '2'

6. GPU Sharing Strategy: MIG (Multi-Instance GPU)

6.1 MIG Overview

Multi-Instance GPU (MIG) is a hardware-level GPU partitioning technology supported on NVIDIA Ampere architecture and above (A100, A30, H100, etc.). A single GPU can be divided into up to 7 independent GPU instances, each with dedicated computing resources, memory, and cache.

The key advantage of MIG is hardware-level isolation. Each MIG instance provides independent memory space and fault isolation, so a failure in one instance does not affect others.

6.2 MIG Strategies

The GPU Operator supports two MIG strategies:

  • Single Strategy: All GPUs on a node have the same MIG configuration. The resource name remains nvidia.com/gpu.
  • Mixed Strategy: Some GPUs can operate in MIG mode while others run in Full GPU mode. MIG instances are exposed as separate resources in the form nvidia.com/mig-<slice>.

6.3 MIG Installation and Configuration

Install the GPU Operator with MIG enabled:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --set mig.strategy=single

Apply a predefined MIG profile to a node:

# Apply MIG configuration Label to the node
kubectl label nodes gpu-node-01 \
  nvidia.com/mig.config=all-1g.10gb --overwrite

Key predefined profiles:

| Profile | Description |
| --- | --- |
| all-1g.10gb | Partition all GPUs into 1g.10gb instances |
| all-3g.40gb | Partition all GPUs into 3g.40gb instances |
| all-balanced | Mix various sizes of instances |
| all-disabled | Disable MIG mode |

6.4 Custom MIG Configuration

Custom configurations beyond the default profiles are also possible:

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      custom-profile:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 5
            "2g.20gb": 1

Link the custom ConfigMap to the ClusterPolicy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/migManager/config/name","value":"custom-mig-config"}]'

Verify the configuration status:

kubectl get node gpu-node-01 -o=jsonpath='{.metadata.labels}' | jq .
# Verify "nvidia.com/mig.config.state": "success"

6.5 Using MIG Resources

Pods using MIG instances request resources as follows:

apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1 # Request 1 MIG 1g.10gb instance

7. GPU Time-Slicing Configuration

7.1 Time-Slicing Overview

Time-Slicing uses the NVIDIA GPU's time-division scheduler to let multiple Pods share a single GPU in turns. Unlike MIG, it provides no memory or fault isolation, but it works on older GPUs that do not support MIG (T4, V100, etc.) and allows a GPU to be shared across a larger number of users and workloads.

7.2 ConfigMap Configuration

Time-Slicing is configured through a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

Key Configuration Fields:

| Field | Type | Description |
| --- | --- | --- |
| renameByDefault | boolean | When true, advertises the resource as nvidia.com/gpu.shared. When false, keeps the resource name unchanged and adds a -SHARED suffix to the product label. |
| failRequestsGreaterThanOne | boolean | Rejects Pods requesting more than one GPU replica. Setting this to true is recommended. |
| resources.name | string | Resource name (nvidia.com/gpu, etc.) |
| resources.replicas | integer | Number of Time-Slice replicas per GPU |

7.3 Cluster Application

Apply cluster-wide:

kubectl apply -f time-slicing-config.yaml

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

Apply to specific nodes only:

# When per-GPU-model configurations are defined in the ConfigMap
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config"}}}}'

# Apply Label to the target node
kubectl label node gpu-node-01 \
  nvidia.com/device-plugin.config=tesla-t4

7.4 Verifying the Result

When Time-Slicing is applied, the node's Labels and allocatable resources will change:

kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable}' | jq .
# "nvidia.com/gpu": "4"  (1 physical GPU * 4 replicas)

kubectl get node gpu-node-01 --show-labels | grep nvidia
# nvidia.com/gpu.replicas=4
# nvidia.com/gpu.product=Tesla-T4-SHARED

7.5 Key Limitations

  • No memory/fault isolation: GPU memory is shared between Time-Slice replicas, and a failure in one process can affect others.
  • No proportional compute guarantee: Requesting multiple GPU replicas does not guarantee proportionally more computing power.
  • DCGM Exporter limitation: When GPU Time-Slicing is enabled, associating metrics with individual containers is not supported.
  • Manual restart required on ConfigMap changes: The Operator does not automatically detect ConfigMap changes, so the Device Plugin DaemonSet must be restarted manually:

kubectl rollout restart -n gpu-operator \
  daemonset/nvidia-device-plugin-daemonset

8. MPS (Multi-Process Service) Configuration

8.1 MPS Overview

NVIDIA MPS (Multi-Process Service) is a client-server architecture that allows multiple CUDA processes to run simultaneously on a single GPU. It reduces the overhead of traditional CUDA context switching and multiplexes CUDA kernels from multiple processes into a single GPU context, improving GPU utilization.

8.2 MIG vs Time-Slicing vs MPS Comparison

| Feature | MIG | Time-Slicing | MPS |
| --- | --- | --- | --- |
| Memory Isolation | Yes (hardware) | No | Partial (software) |
| Fault Isolation | Yes | No | No |
| Supported GPUs | Ampere and above | All NVIDIA GPUs | Volta and above recommended |
| Max Partitions | 7 | Unlimited | 48 (pre-Volta: 16) |
| Concurrent Kernels | Yes | No (time-divided) | Yes |
| Partition Flexibility | Fixed profiles | Equal division | Arbitrary sizes |

8.3 MPS Pros and Cons

Pros:

  • Unlike MIG, arbitrary-sized GPU slices can be created.
  • Unlike Time-Slicing, memory allocation limits can be enforced, reducing OOM errors.
  • CUDA kernels from multiple processes actually execute concurrently, resulting in higher GPU utilization.

Cons:

  • No fault isolation is provided. A crash in one client process can trigger a GPU reset, affecting all other processes.
  • Memory protection is not complete.

8.4 Using MPS in Kubernetes

For a long time the NVIDIA Device Plugin did not support MPS directly; recent releases (v0.15.0 and later) add MPS sharing as an opt-in feature through the same device-plugin configuration used for Time-Slicing (a sharing.mps section), and MPS can also be configured through NVIDIA's DRA (Dynamic Resource Allocation) driver.

Some managed Kubernetes services such as GKE (Google Kubernetes Engine) natively support MPS.

# MPS usage example on GKE
apiVersion: v1
kind: Pod
metadata:
  name: mps-workload
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

In typical on-premises Kubernetes environments, to use MPS you need to deploy the MPS Control Daemon as a sidecar or DaemonSet and configure the CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY environment variables.
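As a rough sketch of that on-premises setup (the directory paths, Pod name, and volume name here are illustrative assumptions, not an official layout), an MPS client Pod would point the CUDA runtime at the control daemon's pipe directory, typically shared via a hostPath volume:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-client            # hypothetical example
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/pytorch:24.01-py3
      env:
        - name: CUDA_MPS_PIPE_DIRECTORY   # must match the control daemon's setting
          value: /tmp/nvidia-mps
        - name: CUDA_MPS_LOG_DIRECTORY
          value: /tmp/nvidia-mps-log
      volumeMounts:
        - name: mps-pipe
          mountPath: /tmp/nvidia-mps
  volumes:
    - name: mps-pipe
      hostPath:
        path: /tmp/nvidia-mps   # shared with the MPS Control Daemon on the host
```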


9. GPU Node Isolation with Node Affinity and Taint/Toleration

GPU nodes are expensive resources, so workloads that do not require GPUs should be prevented from being scheduled on GPU nodes. This is achieved by combining Kubernetes Taint/Toleration and Node Affinity.

9.1 Applying Taints to GPU Nodes

# Add Taint to GPU nodes
kubectl taint nodes gpu-node-01 \
  nvidia.com/gpu=present:NoSchedule

kubectl taint nodes gpu-node-02 \
  nvidia.com/gpu=present:NoSchedule

Once this Taint is applied, Pods without the corresponding Toleration will not be scheduled on GPU nodes.
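Whether the Taint landed can be verified from the node description:

```shell
kubectl describe node gpu-node-01 | grep Taints
# Taints: nvidia.com/gpu=present:NoSchedule
```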

9.2 Adding Tolerations to GPU Workloads

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'present'
      effect: 'NoSchedule'
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1

9.3 Selecting Specific GPU Models with Node Affinity

Using Labels generated by GPU Feature Discovery, workloads can be scheduled only on nodes with specific GPU models:

apiVersion: v1
kind: Pod
metadata:
  name: a100-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - 'A100-SXM4-80GB'
                  - 'A100-SXM4-40GB'
              - key: nvidia.com/gpu.memory
                operator: Gt
                values:
                  - '40000'
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'present'
      effect: 'NoSchedule'
  containers:
    - name: llm-training
      image: nvcr.io/nvidia/pytorch:24.01-py3
      command: ['python', 'train.py']
      resources:
        limits:
          nvidia.com/gpu: 4

9.4 Production Strategy: Combining Taint + Affinity

The best practice for GPU node management is a dual strategy of blocking non-GPU workloads with Taints and directing GPU workloads to appropriate nodes with Node Affinity.

  • Taint: A "hard" constraint that restricts which Pods can access GPU nodes
  • Node Affinity: A "soft" or "hard" preference that places Pods on nodes with desired GPU specifications

Tolerations only allow a Pod to be scheduled on a Tainted node; they do not force the Pod to that node. Therefore, Node Affinity must be used together to ensure GPU workloads are placed exclusively on GPU nodes.


10. Monitoring with DCGM Exporter + Prometheus + Grafana

10.1 DCGM Exporter Overview

DCGM Exporter is a component that collects GPU telemetry data using the Go API of NVIDIA Data Center GPU Manager (DCGM) and exposes it in Prometheus format. When the GPU Operator is installed, DCGM Exporter is deployed as a DaemonSet by default.

Key collected metrics:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization (%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
| DCGM_FI_DEV_POWER_USAGE | Power usage (W) |
| DCGM_FI_DEV_SM_CLOCK | SM clock frequency (MHz) |
| DCGM_FI_DEV_PCIE_TX_THROUGHPUT | PCIe TX throughput |
| DCGM_FI_DEV_PCIE_RX_THROUGHPUT | PCIe RX throughput |
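These raw metrics compose into useful PromQL queries; the examples below assume the kubernetes_node label added by the relabel configuration in section 10.2:

```promql
# Average GPU utilization per node
avg by (kubernetes_node) (DCGM_FI_DEV_GPU_UTIL)

# Fraction of framebuffer memory in use, per GPU
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```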

10.2 Installing the Prometheus Stack

Deploy Prometheus and Grafana together using kube-prometheus-stack:

# Add Prometheus community Helm repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

Create a values file for GPU metric collection:

# kube-prometheus-stack-values.yaml
prometheus:
  service:
    type: NodePort
    nodePort: 30090
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    additionalScrapeConfigs:
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - gpu-operator
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node

grafana:
  service:
    type: NodePort
    nodePort: 32322

Install the stack with these values:

helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --values kube-prometheus-stack-values.yaml

10.3 Grafana Dashboard Setup

Import the official NVIDIA DCGM Exporter dashboard:

  1. Access Grafana (http://<node-ip>:32322).
  2. Navigate to Dashboards > Import.
  3. Enter Dashboard ID 12239 (NVIDIA DCGM Exporter Dashboard).
  4. Select Prometheus as the Data Source.

This dashboard visualizes GPU utilization, memory usage, temperature, and power consumption in real time. For a dashboard that supports both MIG and Non-MIG GPUs, use ID 23382.

10.4 Alert Configuration Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: prometheus
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU temperature is high on {{ $labels.kubernetes_node }}'
            description: 'GPU {{ $labels.gpu }} temperature is {{ $value }}C'
        - alert: GPUMemoryAlmostFull
          expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'GPU memory usage above 95% on {{ $labels.kubernetes_node }}'
        - alert: GPUHighUtilization
          expr: DCGM_FI_DEV_GPU_UTIL > 95
          for: 30m
          labels:
            severity: info
          annotations:
            summary: 'Sustained high GPU utilization on {{ $labels.kubernetes_node }}'

11. Leveraging GPU Feature Discovery

11.1 Overview

GPU Feature Discovery (GFD) is a component that automatically detects GPU properties installed on nodes and generates Kubernetes Node Labels. It operates on top of Node Feature Discovery (NFD) and is automatically deployed as part of the GPU Operator.

11.2 Generated Labels

| Label | Description | Example Value |
| --- | --- | --- |
| nvidia.com/gpu.product | GPU model name | Tesla-T4, A100-SXM4-80GB |
| nvidia.com/gpu.memory | GPU memory capacity (MB) | 40960 |
| nvidia.com/gpu.count | GPU count | 4 |
| nvidia.com/gpu.family | GPU architecture family | ampere, hopper |
| nvidia.com/gpu.compute.major | CUDA Compute Capability (major) | 8 |
| nvidia.com/gpu.compute.minor | CUDA Compute Capability (minor) | 0 |
| nvidia.com/cuda.driver.major | CUDA driver major version | 535 |
| nvidia.com/cuda.driver.minor | CUDA driver minor version | 129 |
| nvidia.com/cuda.runtime.major | CUDA runtime major version | 12 |
| nvidia.com/cuda.runtime.minor | CUDA runtime minor version | 2 |
| nvidia.com/gpu.machine | Machine type (instance type in cloud) | p4d.24xlarge |
| nvidia.com/gpu.replicas | GPU replica count (when sharing) | 4 |
| nvidia.com/mig.strategy | MIG strategy | single, mixed |

11.3 Label Usage Examples

GFD Labels enable fine-grained scheduling based on GPU specifications:

# Run only on Hopper architecture GPUs
apiVersion: batch/v1
kind: Job
metadata:
  name: h100-inference
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.family: hopper
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never

# Large model training requiring 80GB+ GPU memory
apiVersion: batch/v1
kind: Job
metadata:
  name: large-model-training
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.memory
                    operator: Gt
                    values:
                      - '79000'
                  - key: nvidia.com/gpu.count
                    operator: Gt
                    values:
                      - '7'
      containers:
        - name: training
          image: nvcr.io/nvidia/pytorch:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 8
      restartPolicy: Never

12. Production YAML Examples and Troubleshooting

12.1 PyTorch Distributed Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 2
  completions: 2
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Equal'
          value: 'present'
          effect: 'NoSchedule'
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.product
                    operator: In
                    values:
                      - 'A100-SXM4-80GB'
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - distributed-training
                topologyKey: kubernetes.io/hostname
      containers:
        - name: pytorch-trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3
          command:
            - torchrun
            - --nproc_per_node=4
            - --nnodes=2
            - train.py
          resources:
            limits:
              nvidia.com/gpu: 4
              memory: '64Gi'
              cpu: '16'
            requests:
              memory: '64Gi'
              cpu: '16'
          env:
            - name: NCCL_DEBUG
              value: 'INFO'
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: training-data
              mountPath: /data
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: '16Gi'
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
      restartPolicy: Never
  backoffLimit: 3

Note: In PyTorch distributed training, if /dev/shm (shared memory) is too small, DataLoader workers (num_workers > 0) can crash. Mount an emptyDir with the Memory medium and give it a sufficient sizeLimit.

12.2 Triton Inference Server Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-inference
  template:
    metadata:
      labels:
        app: triton-inference
    spec:
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Equal'
          value: 'present'
          effect: 'NoSchedule'
      nodeSelector:
        nvidia.com/gpu.family: ampere
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=/models
            - --strict-model-config=false
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 15
          volumeMounts:
            - name: model-store
              mountPath: /models
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: triton-inference
spec:
  selector:
    app: triton-inference
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002

12.3 Troubleshooting Guide

GPU Not Recognized on Node

# 1. Check GPU Operator Pod status
kubectl get pods -n gpu-operator

# 2. Check Driver Pod logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Device Plugin Pod logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# 4. Check node's allocatable resources
kubectl describe node gpu-node-01 | grep -A 5 "Allocatable"

# 5. Check NFD Labels
kubectl get node gpu-node-01 -o json | \
  jq '.metadata.labels | to_entries[] | select(.key | startswith("nvidia"))'

Pod Not Being Scheduled on GPU Node

# 1. Check Pod events
kubectl describe pod <pod-name>
# Look for "Insufficient nvidia.com/gpu" message

# 2. Check node's GPU allocation status
kubectl describe node gpu-node-01 | grep -A 3 "Allocated resources"

# 3. List Pods currently using GPUs
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.containers[].resources.limits["nvidia.com/gpu"] != null) | .metadata.name'

MIG Configuration Not Applied

# 1. Check MIG Manager logs
kubectl logs -n gpu-operator -l app=nvidia-mig-manager

# 2. Check MIG configuration state Label
kubectl get node gpu-node-01 -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
# If not "success", configuration failed

# 3. Check GPU mode (on the node)
nvidia-smi -L
nvidia-smi --query-gpu=mig.mode.current --format=csv

GPU Count Not Changed After Time-Slicing

# 1. Check ConfigMap content
kubectl get configmap -n gpu-operator time-slicing-config -o yaml

# 2. Restart Device Plugin
kubectl rollout restart -n gpu-operator \
  daemonset/nvidia-device-plugin-daemonset

# 3. Check node resources after restart
kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
