- 1. Why You Need GPUs in Kubernetes
- 2. NVIDIA GPU Operator Architecture
- 3. GPU Operator Installation
- 4. NVIDIA Device Plugin In-Depth Analysis
- 5. GPU Resource Requests/Limits In Detail
- 6. GPU Sharing Strategy: MIG (Multi-Instance GPU)
- 7. GPU Time-Slicing Configuration
- 8. MPS (Multi-Process Service) Configuration
- 9. GPU Node Isolation with Node Affinity and Taint/Toleration
- 10. Monitoring with DCGM Exporter + Prometheus + Grafana
- 11. Leveraging GPU Feature Discovery
- 12. Production YAML Examples and Troubleshooting
- References
1. Why You Need GPUs in Kubernetes
With the explosive growth of AI/ML workloads, GPUs have become essential infrastructure rather than an optional extra. GPU-accelerated computing is in demand across diverse domains including LLM training, inference serving, computer vision, and scientific simulation. To operate these workloads at scale, Kubernetes-based orchestration is the natural choice.
Using GPUs in Kubernetes provides the following benefits:
- Automated Resource Scheduling: GPUs can be declaratively requested and allocated on a per-Pod basis through the nvidia.com/gpu resource type.
- Multi-Tenancy: Namespace, ResourceQuota, and LimitRange can be leveraged to isolate and fairly distribute GPU resources across teams.
- GPU Sharing: Technologies such as MIG, Time-Slicing, and MPS allow multiple workloads to share a single GPU, maximizing cost efficiency.
- Auto Scaling: Combined with tools like HPA (Horizontal Pod Autoscaler) or Karpenter, GPU workloads can be automatically scaled based on demand.
- Operational Standardization: Operational tasks such as GPU driver installation, monitoring, and incident response can be automated using the Operator pattern.
However, to properly leverage GPUs in Kubernetes, multiple software components including GPU drivers, Container Toolkit, Device Plugin, and monitoring tools must be accurately installed and managed. To address this complexity, NVIDIA provides the GPU Operator.
2. NVIDIA GPU Operator Architecture
The NVIDIA GPU Operator uses the Kubernetes Operator Framework to automatically provision and manage all NVIDIA software components needed on GPU nodes. According to the official documentation, the GPU Operator includes the following components:
2.1 Core Components
| Component | Role |
|---|---|
| NVIDIA GPU Driver | Provides the interface between the GPU and the operating system, enabling CUDA. |
| NVIDIA Container Toolkit | Enables container runtimes (containerd, CRI-O) to access GPUs. |
| NVIDIA Device Plugin | Exposes GPUs as Kubernetes resources through the kubelet API. |
| GPU Feature Discovery (GFD) | Detects GPU properties (model, memory, CUDA version, etc.) on nodes and automatically generates Node Labels. |
| DCGM Exporter | Exposes GPU metrics in Prometheus format based on NVIDIA DCGM (Data Center GPU Manager). |
| MIG Manager | Manages Multi-Instance GPU configurations using the Kubernetes controller pattern. |
| GPU Direct Storage (GDS) | Enables direct data transfer between storage devices and GPU memory. |
2.2 How It Works
When the GPU Operator is deployed, initialization proceeds in the following order:
- gpu-operator Pod starts: Acts as the controller that monitors and reconciles the state of all components.
- Node Feature Discovery (NFD) deployment: Deploys Pods to each node in the cluster to detect GPU presence and add relevant Labels.
- Driver and Toolkit installation: Deploys NVIDIA drivers and Container Toolkit as DaemonSets on nodes where GPUs are detected.
- Device Plugin deployment: Registers GPUs as nvidia.com/gpu resources with kubelet.
- Auxiliary component deployment: Deploys GFD, DCGM Exporter, MIG Manager, etc.
All components are declaratively managed through a Custom Resource called ClusterPolicy. The Operator continuously compares the desired state and actual state of the ClusterPolicy and automatically corrects any discrepancies.
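The shape of this Custom Resource mirrors the Helm chart values. The abbreviated excerpt below is illustrative only (the per-component sections follow the ClusterPolicy CRD; consult the CRD reference for the full field list):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true        # deploy the NVIDIA driver DaemonSet
  toolkit:
    enabled: true        # deploy the Container Toolkit
  devicePlugin:
    enabled: true        # expose nvidia.com/gpu to kubelet
  dcgmExporter:
    enabled: true        # GPU telemetry
  migManager:
    enabled: true        # MIG reconfiguration controller
```

Editing the ClusterPolicy (or the Helm values that render it) is how component toggles are changed after installation; the Operator reconciles the cluster toward whatever this resource declares.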
3. GPU Operator Installation
The GPU Operator is installed via Helm Chart. Let's walk through the installation step by step based on the official installation guide.
3.1 Prerequisites
- kubectl and helm CLIs must be installed.
- containerd or CRI-O must be used as the container runtime.
- All Worker Nodes running GPU workloads should have the same OS version (unless drivers are pre-installed separately).
- If Pod Security Admission (PSA) is used, the Namespace must be configured with the privileged level.
3.2 Installation Procedure
# 1. Create Namespace and configure PSA
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator \
pod-security.kubernetes.io/enforce=privileged
# 2. Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# 3. Install GPU Operator (default configuration)
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.10.1
3.3 Key Installation Options
| Parameter | Purpose | Default |
|---|---|---|
| driver.enabled | Whether to deploy the NVIDIA driver | true |
| toolkit.enabled | Whether to deploy Container Toolkit | true |
| nfd.enabled | Whether to deploy Node Feature Discovery | true |
| dcgmExporter.enabled | Whether to enable GPU telemetry | true |
| cdi.enabled | Whether to use Container Device Interface | true |
| driver.version | Specify a particular driver version | Varies by release |
| mig.strategy | MIG strategy setting (none, single, mixed) | none |
When the NVIDIA driver is already installed on the host:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.10.1 \
--set driver.enabled=false
When both the driver and Toolkit are pre-installed:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.10.1 \
--set driver.enabled=false \
--set toolkit.enabled=false
3.4 Installation Verification
After installation is complete, you can verify GPU operation with a simple CUDA sample:
apiVersion: v1
kind: Pod
metadata:
name: cuda-vectoradd
spec:
restartPolicy: OnFailure
containers:
- name: cuda-vectoradd
image: 'nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04'
resources:
limits:
nvidia.com/gpu: 1
kubectl apply -f cuda-vectoradd.yaml
kubectl logs pod/cuda-vectoradd
# [Vector addition of 50000 elements]
# Test PASSED
4. NVIDIA Device Plugin In-Depth Analysis
The NVIDIA Device Plugin implements the Kubernetes Device Plugin Framework and is a core component of the GPU Operator.
4.1 Key Features
- GPU Enumeration: Detects all GPUs installed on a node and reports the count to kubelet.
- GPU Health Monitoring: Continuously checks GPU status and excludes unhealthy GPUs from scheduling.
- GPU Allocation: Allocates available GPUs when a Pod requests
nvidia.com/gpuresources. - Resource Sharing: Supports GPU sharing strategies such as Time-Slicing and MIG.
4.2 How It Works
The Device Plugin is deployed as a DaemonSet and runs on each GPU node. The operation flow is as follows:
- The Plugin starts a gRPC server and registers with kubelet's Device Plugin socket.
- kubelet reflects the capacity of
nvidia.com/gpuresources in the Node object. - When a Pod requests those resources, the kube-scheduler schedules it to an appropriate node.
- kubelet requests GPU allocation from the Device Plugin, and the Plugin configures the necessary device files and environment variables for the container.
4.3 Resource Request/Limit Configuration
GPU resources can only be requested through resources.limits. Unlike CPU or memory, requests cannot be set independently: you may specify limits alone (Kubernetes then uses the same value as the request), or set both to identical values, but never a request that differs from the limit.
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: gpu-container
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
limits:
nvidia.com/gpu: 2 # Request 2 GPUs
Important Constraints:
- GPUs can only be requested in integer units (nvidia.com/gpu: 0.5 is not allowed).
- GPUs are not shared across nodes (a single Pod cannot simultaneously use GPUs from multiple nodes).
- If limits is not specified, no GPU will be allocated.
5. GPU Resource Requests/Limits In Detail
5.1 Basic Resource Model
In Kubernetes, GPUs are treated as Extended Resources. They are requested using the resource name nvidia.com/gpu, which is registered with kubelet by the NVIDIA Device Plugin.
resources:
limits:
nvidia.com/gpu: 1 # Exclusively allocate 1 GPU
memory: '16Gi' # System memory is set separately from the GPU
cpu: '4'
5.2 GPU Limitation via ResourceQuota
In multi-tenant environments, GPU usage per Namespace can be restricted:
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-quota
namespace: team-ml
spec:
hard:
requests.nvidia.com/gpu: '4' # Maximum 4 GPUs
limits.nvidia.com/gpu: '4'
5.3 Default Values via LimitRange
apiVersion: v1
kind: LimitRange
metadata:
name: gpu-limit-range
namespace: team-ml
spec:
limits:
- type: Container
default:
nvidia.com/gpu: '1'
defaultRequest:
nvidia.com/gpu: '1'
max:
nvidia.com/gpu: '2'
6. GPU Sharing Strategy: MIG (Multi-Instance GPU)
6.1 MIG Overview
Multi-Instance GPU (MIG) is a hardware-level GPU partitioning technology supported on NVIDIA Ampere architecture and above (A100, A30, H100, etc.). A single GPU can be divided into up to 7 independent GPU instances, each with dedicated computing resources, memory, and cache.
The key advantage of MIG is hardware-level isolation. Each MIG instance provides independent memory space and fault isolation, so a failure in one instance does not affect others.
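To build intuition for why "up to 7" holds, GPU instances are carved from a fixed per-GPU budget of compute and memory slices. The sketch below checks whether a proposed instance mix fits on a single A100 80GB; the slice costs are taken from NVIDIA's MIG documentation for that model and should be verified for other GPUs.

```python
# Sketch: validate a proposed MIG layout against one A100 80GB.
# (compute slices, memory slices) per profile -- A100 80GB values from
# NVIDIA's MIG documentation; other GPU models have different budgets.
PROFILE_COST = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
COMPUTE_SLICES, MEMORY_SLICES = 7, 8  # A100 80GB totals

def fits(layout: dict) -> bool:
    """Return True if the requested instance mix fits on a single GPU."""
    compute = sum(PROFILE_COST[p][0] * n for p, n in layout.items())
    memory = sum(PROFILE_COST[p][1] * n for p, n in layout.items())
    return compute <= COMPUTE_SLICES and memory <= MEMORY_SLICES

print(fits({"1g.10gb": 2, "2g.20gb": 1, "3g.40gb": 1}))  # True: 7/7 compute, 8/8 memory
print(fits({"3g.40gb": 2, "1g.10gb": 2}))                # False: needs 8 compute slices
```

The same accounting is what MIG Manager performs when it applies a profile: a configuration that oversubscribes either slice budget is rejected.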
6.2 MIG Strategies
The GPU Operator supports two MIG strategies:
- Single Strategy: All GPUs on a node have the same MIG configuration. The resource name remains nvidia.com/gpu.
- Mixed Strategy: Some GPUs can operate in MIG mode while others run in Full GPU mode. MIG instances are exposed as separate resources in the form nvidia.com/mig-<slice>.
6.3 MIG Installation and Configuration
Install the GPU Operator with MIG enabled:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.10.1 \
--set mig.strategy=single
Apply a predefined MIG profile to a node:
# Apply MIG configuration Label to the node
kubectl label nodes gpu-node-01 \
nvidia.com/mig.config=all-1g.10gb --overwrite
Key predefined profiles:
| Profile | Description |
|---|---|
| all-1g.10gb | Partition all GPUs into 1g.10gb instances |
| all-3g.40gb | Partition all GPUs into 3g.40gb instances |
| all-balanced | Mix various sizes of instances |
| all-disabled | Disable MIG mode |
6.4 Custom MIG Configuration
Custom configurations beyond the default profiles are also possible:
apiVersion: v1
kind: ConfigMap
metadata:
name: custom-mig-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
mig-configs:
custom-profile:
- devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 5
"2g.20gb": 1
Link the custom ConfigMap to the ClusterPolicy:
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
--type='json' \
-p='[{"op":"replace","path":"/spec/migManager/config/name","value":"custom-mig-config"}]'
Verify the configuration status:
kubectl get node gpu-node-01 -o=jsonpath='{.metadata.labels}' | jq .
# Verify "nvidia.com/mig.config.state": "success"
6.5 Using MIG Resources
Pods using MIG instances request resources as follows:
apiVersion: v1
kind: Pod
metadata:
name: mig-workload
spec:
containers:
- name: cuda-app
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
limits:
nvidia.com/mig-1g.10gb: 1 # Request 1 MIG 1g.10gb instance
7. GPU Time-Slicing Configuration
7.1 Time-Slicing Overview
Time-Slicing uses the GPU's time-division scheduler to let multiple Pods share a single GPU in turns. Unlike MIG, it provides no memory or fault isolation, but it works on older GPUs (T4, V100, etc.) that do not support MIG and allows a larger number of users and workloads to share each GPU.
7.2 ConfigMap Configuration
Time-Slicing is configured through a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: time-slicing-config
namespace: gpu-operator
data:
any: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
resources:
- name: nvidia.com/gpu
replicas: 4
Key Configuration Fields:
| Field | Type | Description |
|---|---|---|
| renameByDefault | boolean | When true, advertises the resource as nvidia.com/gpu.shared. When false, adds a -SHARED suffix to the product label. |
| failRequestsGreaterThanOne | boolean | Rejects Pods requesting more than 1 GPU replica. Setting to true is recommended. |
| resources.name | string | Resource name (nvidia.com/gpu, etc.) |
| resources.replicas | integer | Number of Time-Slice replicas per GPU |
7.3 Cluster Application
Apply cluster-wide:
kubectl apply -f time-slicing-config.yaml
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
-n gpu-operator --type merge \
-p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
Apply to specific nodes only:
# When per-GPU-model configurations are defined in the ConfigMap
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
-n gpu-operator --type merge \
-p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config"}}}}'
# Apply Label to the target node
kubectl label node gpu-node-01 \
nvidia.com/device-plugin.config=tesla-t4
7.4 Verifying the Result
When Time-Slicing is applied, the node's Labels and allocatable resources will change:
kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable}' | jq .
# "nvidia.com/gpu": "4" (1 physical GPU * 4 replicas)
kubectl get node gpu-node-01 --show-labels | grep nvidia
# nvidia.com/gpu.replicas=4
# nvidia.com/gpu.product=Tesla-T4-SHARED
7.5 Key Limitations
- No memory/fault isolation: GPU memory is shared between Time-Slice replicas, and a failure in one process can affect others.
- No proportional compute guarantee: Requesting multiple GPU replicas does not guarantee proportionally more computing power.
- DCGM Exporter limitation: When GPU Time-Slicing is enabled, associating metrics with individual containers is not supported.
- Manual restart required on ConfigMap changes: The Operator does not automatically detect ConfigMap changes, so the DaemonSet must be restarted manually.
kubectl rollout restart -n gpu-operator \
daemonset/nvidia-device-plugin-daemonset
8. MPS (Multi-Process Service) Configuration
8.1 MPS Overview
NVIDIA MPS (Multi-Process Service) is a client-server architecture that allows multiple CUDA processes to run simultaneously on a single GPU. It reduces the overhead of traditional CUDA context switching and multiplexes CUDA kernels from multiple processes into a single GPU context, improving GPU utilization.
8.2 MIG vs Time-Slicing vs MPS Comparison
| Feature | MIG | Time-Slicing | MPS |
|---|---|---|---|
| Memory Isolation | Yes (Hardware) | No | Partial (Software) |
| Fault Isolation | Yes | No | No |
| Supported GPUs | Ampere and above | All NVIDIA GPUs | Volta and above recommended |
| Max Partitions | 7 | Unlimited | 48 (Pre-Volta: 16) |
| Concurrent Kernels | Yes | No (Time-divided) | Yes |
| Partition Flexibility | Fixed profiles | Equal division | Arbitrary sizes |
8.3 MPS Pros and Cons
Pros:
- Unlike MIG, arbitrary-sized GPU slices can be created.
- Unlike Time-Slicing, memory allocation limits can be enforced, reducing OOM errors.
- CUDA kernels from multiple processes actually execute concurrently, resulting in higher GPU utilization.
Cons:
- No fault isolation is provided. A crash in one client process can trigger a GPU reset, affecting all other processes.
- Memory protection is not complete.
8.4 Using MPS in Kubernetes
Currently, the NVIDIA Device Plugin does not directly support MPS, but MPS can be configured through NVIDIA's DRA (Dynamic Resource Allocation) driver. Alternatively, a separate Device Plugin that supports MPS can be installed.
Some managed Kubernetes services such as GKE (Google Kubernetes Engine) natively support MPS.
# MPS usage example on GKE
apiVersion: v1
kind: Pod
metadata:
name: mps-workload
spec:
containers:
- name: cuda-app
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
limits:
nvidia.com/gpu: 1
In typical on-premises Kubernetes environments, to use MPS you need to deploy the MPS Control Daemon as a sidecar or DaemonSet and configure the CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY environment variables.
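A minimal sketch of such a DaemonSet is shown below. The image, hostPath location, and sleep-based keep-alive are illustrative assumptions; nvidia-cuda-mps-control -d and the two environment variables are the standard MPS controls, and workload Pods must mount the same pipe directory to reach the daemon.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: mps-control-daemon          # hypothetical name
spec:
  selector:
    matchLabels:
      app: mps-control-daemon
  template:
    metadata:
      labels:
        app: mps-control-daemon
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"    # GFD label on GPU nodes
      containers:
        - name: mps-control
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # assumed base image
          command: ["sh", "-c"]
          # -d daemonizes the control process, so keep the container alive
          args: ["nvidia-cuda-mps-control -d && sleep infinity"]
          env:
            - name: CUDA_MPS_PIPE_DIRECTORY
              value: /mps/pipe
            - name: CUDA_MPS_LOG_DIRECTORY
              value: /mps/log
          volumeMounts:
            - name: mps-dir
              mountPath: /mps
          resources:
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: mps-dir
          hostPath:
            path: /var/run/mps     # assumed shared location for client Pods
```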
9. GPU Node Isolation with Node Affinity and Taint/Toleration
GPU nodes are expensive resources, so workloads that do not require GPUs should be prevented from being scheduled on GPU nodes. This is achieved by combining Kubernetes Taint/Toleration and Node Affinity.
9.1 Applying Taints to GPU Nodes
# Add Taint to GPU nodes
kubectl taint nodes gpu-node-01 \
nvidia.com/gpu=present:NoSchedule
kubectl taint nodes gpu-node-02 \
nvidia.com/gpu=present:NoSchedule
Once this Taint is applied, Pods without the corresponding Toleration will not be scheduled on GPU nodes.
9.2 Adding Tolerations to GPU Workloads
apiVersion: v1
kind: Pod
metadata:
name: gpu-training-job
spec:
tolerations:
- key: 'nvidia.com/gpu'
operator: 'Equal'
value: 'present'
effect: 'NoSchedule'
containers:
- name: training
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
limits:
nvidia.com/gpu: 1
9.3 Selecting Specific GPU Models with Node Affinity
Using Labels generated by GPU Feature Discovery, workloads can be scheduled only on nodes with specific GPU models:
apiVersion: v1
kind: Pod
metadata:
name: a100-training
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- 'A100-SXM4-80GB'
- 'A100-SXM4-40GB'
- key: nvidia.com/gpu.memory
operator: Gt
values:
- '40000'
tolerations:
- key: 'nvidia.com/gpu'
operator: 'Equal'
value: 'present'
effect: 'NoSchedule'
containers:
- name: llm-training
image: nvcr.io/nvidia/pytorch:24.01-py3
command: ['python', 'train.py']
resources:
limits:
nvidia.com/gpu: 4
9.4 Production Strategy: Combining Taint + Affinity
The best practice for GPU node management is a dual strategy of blocking non-GPU workloads with Taints and directing GPU workloads to appropriate nodes with Node Affinity.
- Taint: A "hard" constraint that restricts which Pods can access GPU nodes
- Node Affinity: A "soft" or "hard" preference that places Pods on nodes with desired GPU specifications
Tolerations only allow a Pod to be scheduled on a Tainted node; they do not force the Pod to that node. Therefore, Node Affinity must be used together to ensure GPU workloads are placed exclusively on GPU nodes.
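In its minimal form, the dual strategy reduces to a Pod spec fragment like the following (assuming the nvidia.com/gpu.present label is set on GPU nodes by NFD/GFD, and the Taint from section 9.1 is in place):

```yaml
spec:
  tolerations:                     # pass the GPU node Taint
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:                    # and actively target GPU nodes
    nvidia.com/gpu.present: "true"
```

The toleration alone would still let the scheduler place the Pod on a non-GPU node; the nodeSelector (or an equivalent nodeAffinity rule) is what pins it to GPU nodes.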
10. Monitoring with DCGM Exporter + Prometheus + Grafana
10.1 DCGM Exporter Overview
DCGM Exporter is a component that collects GPU telemetry data using the Go API of NVIDIA Data Center GPU Manager (DCGM) and exposes it in Prometheus format. When the GPU Operator is installed, DCGM Exporter is deployed as a DaemonSet by default.
Key collected metrics:
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization (%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MiB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MiB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
| DCGM_FI_DEV_POWER_USAGE | Power usage (W) |
| DCGM_FI_DEV_SM_CLOCK | SM clock frequency (MHz) |
| DCGM_FI_DEV_PCIE_TX_THROUGHPUT | PCIe TX throughput |
| DCGM_FI_DEV_PCIE_RX_THROUGHPUT | PCIe RX throughput |
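These metrics combine naturally into PromQL expressions. Two illustrative queries (the kubernetes_node label is the one attached by the scrape relabeling used in this setup):

```promql
# Average GPU utilization per node over the last 5 minutes
avg by (kubernetes_node) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# GPU memory usage ratio per GPU (0.0 - 1.0)
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```

The second expression is the same ratio the GPUMemoryAlmostFull alert in section 10.4 thresholds at 0.95.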
10.2 Installing the Prometheus Stack
Deploy Prometheus and Grafana together using kube-prometheus-stack:
# Add Prometheus community Helm repository
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm repo update
Create a values file for GPU metric collection:
# kube-prometheus-stack-values.yaml
prometheus:
service:
type: NodePort
nodePort: 30090
prometheusSpec:
serviceMonitorSelectorNilUsesHelmValues: false
additionalScrapeConfigs:
- job_name: gpu-metrics
scrape_interval: 1s
metrics_path: /metrics
scheme: http
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- gpu-operator
relabel_configs:
- source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: kubernetes_node
grafana:
service:
type: NodePort
nodePort: 32322
helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--values kube-prometheus-stack-values.yaml
10.3 Grafana Dashboard Setup
Import the official NVIDIA DCGM Exporter dashboard:
- Access Grafana (http://<node-ip>:32322).
- Navigate to Dashboards > Import.
- Enter Dashboard ID 12239 (NVIDIA DCGM Exporter Dashboard).
- Select Prometheus as the Data Source.
This dashboard visualizes GPU utilization, memory usage, temperature, and power consumption in real time. For a dashboard that supports both MIG and Non-MIG GPUs, use ID 23382.
10.4 Alert Configuration Example
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: gpu-alerts
namespace: prometheus
spec:
groups:
- name: gpu.rules
rules:
- alert: GPUHighTemperature
expr: DCGM_FI_DEV_GPU_TEMP > 85
for: 5m
labels:
severity: warning
annotations:
summary: 'GPU temperature is high on {{ $labels.kubernetes_node }}'
description: 'GPU {{ $labels.gpu }} temperature is {{ $value }}C'
- alert: GPUMemoryAlmostFull
expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
for: 5m
labels:
severity: critical
annotations:
summary: 'GPU memory usage above 95% on {{ $labels.kubernetes_node }}'
- alert: GPUHighUtilization
expr: DCGM_FI_DEV_GPU_UTIL > 95
for: 30m
labels:
severity: info
annotations:
summary: 'Sustained high GPU utilization on {{ $labels.kubernetes_node }}'
11. Leveraging GPU Feature Discovery
11.1 Overview
GPU Feature Discovery (GFD) is a component that automatically detects GPU properties installed on nodes and generates Kubernetes Node Labels. It operates on top of Node Feature Discovery (NFD) and is automatically deployed as part of the GPU Operator.
11.2 Generated Labels
| Label | Description | Example Value |
|---|---|---|
| nvidia.com/gpu.product | GPU model name | Tesla-T4, A100-SXM4-80GB |
| nvidia.com/gpu.memory | GPU memory capacity (MB) | 40960 |
| nvidia.com/gpu.count | GPU count | 4 |
| nvidia.com/gpu.family | GPU architecture family | ampere, hopper |
| nvidia.com/gpu.compute.major | CUDA Compute Capability (Major) | 8 |
| nvidia.com/gpu.compute.minor | CUDA Compute Capability (Minor) | 0 |
| nvidia.com/cuda.driver.major | CUDA Driver Major version | 535 |
| nvidia.com/cuda.driver.minor | CUDA Driver Minor version | 129 |
| nvidia.com/cuda.runtime.major | CUDA Runtime Major version | 12 |
| nvidia.com/cuda.runtime.minor | CUDA Runtime Minor version | 2 |
| nvidia.com/gpu.machine | Machine type (instance type in cloud) | p4d.24xlarge |
| nvidia.com/gpu.replicas | GPU replica count (when sharing) | 4 |
| nvidia.com/mig.strategy | MIG strategy | single, mixed |
11.3 Label Usage Examples
GFD Labels enable fine-grained scheduling based on GPU specifications:
# Run only on Hopper architecture GPUs
apiVersion: batch/v1
kind: Job
metadata:
name: h100-inference
spec:
template:
spec:
nodeSelector:
nvidia.com/gpu.family: hopper
containers:
- name: inference
image: nvcr.io/nvidia/tritonserver:24.01-py3
resources:
limits:
nvidia.com/gpu: 1
restartPolicy: Never
# Large model training requiring 80GB+ GPU memory
apiVersion: batch/v1
kind: Job
metadata:
name: large-model-training
spec:
template:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.memory
operator: Gt
values:
- '79000'
- key: nvidia.com/gpu.count
operator: Gt
values:
- '7'
containers:
- name: training
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
limits:
nvidia.com/gpu: 8
restartPolicy: Never
12. Production YAML Examples and Troubleshooting
12.1 PyTorch Distributed Training Job
apiVersion: batch/v1
kind: Job
metadata:
name: distributed-training
spec:
parallelism: 2
completions: 2
template:
metadata:
labels:
app: distributed-training
spec:
tolerations:
- key: 'nvidia.com/gpu'
operator: 'Equal'
value: 'present'
effect: 'NoSchedule'
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- 'A100-SXM4-80GB'
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- distributed-training
topologyKey: kubernetes.io/hostname
containers:
- name: pytorch-trainer
image: nvcr.io/nvidia/pytorch:24.01-py3
command:
- torchrun
- --nproc_per_node=4
- --nnodes=2
- train.py
resources:
limits:
nvidia.com/gpu: 4
memory: '64Gi'
cpu: '16'
requests:
memory: '64Gi'
cpu: '16'
env:
- name: NCCL_DEBUG
value: 'INFO'
volumeMounts:
- name: dshm
mountPath: /dev/shm
- name: training-data
mountPath: /data
volumes:
- name: dshm
emptyDir:
medium: Memory
sizeLimit: '16Gi'
- name: training-data
persistentVolumeClaim:
claimName: training-data-pvc
restartPolicy: Never
backoffLimit: 3
Note: In PyTorch distributed training, if the size of /dev/shm (shared memory) is insufficient, crashes can occur when DataLoader's num_workers > 0 is set. Mount an emptyDir with Memory medium and allocate a sufficient size.
12.2 Triton Inference Server Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: triton-inference
spec:
replicas: 2
selector:
matchLabels:
app: triton-inference
template:
metadata:
labels:
app: triton-inference
spec:
tolerations:
- key: 'nvidia.com/gpu'
operator: 'Equal'
value: 'present'
effect: 'NoSchedule'
nodeSelector:
nvidia.com/gpu.family: ampere
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.01-py3
args:
- tritonserver
- --model-repository=/models
- --strict-model-config=false
ports:
- containerPort: 8000
name: http
- containerPort: 8001
name: grpc
- containerPort: 8002
name: metrics
resources:
limits:
nvidia.com/gpu: 1
memory: '16Gi'
cpu: '4'
readinessProbe:
httpGet:
path: /v2/health/ready
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
livenessProbe:
httpGet:
path: /v2/health/live
port: 8000
initialDelaySeconds: 30
periodSeconds: 15
volumeMounts:
- name: model-store
mountPath: /models
volumes:
- name: model-store
persistentVolumeClaim:
claimName: model-store-pvc
---
apiVersion: v1
kind: Service
metadata:
name: triton-inference
spec:
selector:
app: triton-inference
ports:
- name: http
port: 8000
targetPort: 8000
- name: grpc
port: 8001
targetPort: 8001
- name: metrics
port: 8002
targetPort: 8002
12.3 Troubleshooting Guide
GPU Not Recognized on Node
# 1. Check GPU Operator Pod status
kubectl get pods -n gpu-operator
# 2. Check Driver Pod logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
# 3. Check Device Plugin Pod logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
# 4. Check node's allocatable resources
kubectl describe node gpu-node-01 | grep -A 5 "Allocatable"
# 5. Check NFD Labels
kubectl get node gpu-node-01 -o json | \
jq '.metadata.labels | to_entries[] | select(.key | startswith("nvidia"))'
Pod Not Being Scheduled on GPU Node
# 1. Check Pod events
kubectl describe pod <pod-name>
# Look for "Insufficient nvidia.com/gpu" message
# 2. Check node's GPU allocation status
kubectl describe node gpu-node-01 | grep -A 3 "Allocated resources"
# 3. List Pods currently using GPUs
kubectl get pods --all-namespaces -o json | \
jq '.items[] | select(.spec.containers[].resources.limits["nvidia.com/gpu"] != null) | .metadata.name'
MIG Configuration Not Applied
# 1. Check MIG Manager logs
kubectl logs -n gpu-operator -l app=nvidia-mig-manager
# 2. Check MIG configuration state Label
kubectl get node gpu-node-01 -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
# If not "success", configuration failed
# 3. Check GPU mode (on the node)
nvidia-smi -L
nvidia-smi --query-gpu=mig.mode.current --format=csv
GPU Count Not Changed After Time-Slicing
# 1. Check ConfigMap content
kubectl get configmap -n gpu-operator time-slicing-config -o yaml
# 2. Restart Device Plugin
kubectl rollout restart -n gpu-operator \
daemonset/nvidia-device-plugin-daemonset
# 3. Check node resources after restart
kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
References
- NVIDIA GPU Operator - Official Documentation
- NVIDIA GPU Operator - Installation Guide
- NVIDIA GPU Operator - GPU Sharing (Time-Slicing)
- NVIDIA GPU Operator - MIG Configuration
- NVIDIA Device Plugin for Kubernetes - GitHub
- NVIDIA GPU Feature Discovery - Documentation
- NVIDIA Multi-Instance GPU User Guide
- NVIDIA MPS (Multi-Process Service) Official Documentation
- NVIDIA DCGM Exporter - GitHub
- NVIDIA DCGM Exporter - Official Documentation
- NVIDIA DCGM Exporter Grafana Dashboard
- Prometheus GPU Telemetry Configuration Guide
- Kubernetes Device Plugins - Official Documentation
- NVIDIA Cloud Native Technologies
- NVIDIA GPU Operator - GitHub
- NVIDIA GPU Operator Helm Chart - NGC Catalog