- 1. Introduction to GPU Operator
- 2. Detailed Analysis of the 7 Core Components
- 3. Installation Guide (Helm-based)
- 4. GPU Sharing Strategies
- 5. KubeVirt and GPU Operator Integration
- 6. Monitoring and Observability
- 7. Production Deployment Best Practices
- 8. Troubleshooting Guide
- 9. References
1. Introduction to GPU Operator
1.1 What is GPU Operator?
NVIDIA GPU Operator is a Kubernetes Operator that automates the installation, configuration, and management of all software components required to utilize GPUs in a Kubernetes cluster. Built on the Kubernetes Operator Framework, it consistently manages drivers, runtimes, device plugins, monitoring tools, and more from Day-0 provisioning through Day-2 operations.
Traditionally, using GPUs in Kubernetes required manually performing the following tasks on each node:
- Installing NVIDIA drivers (exact version matching the kernel version)
- Installing NVIDIA Container Toolkit and configuring Container Runtime
- Deploying NVIDIA Device Plugin
- Installing monitoring tools (DCGM Exporter)
- Deploying Node Feature Discovery
GPU Operator manages all these processes declaratively through a single Custom Resource Definition (CRD) called ClusterPolicy.
1.2 Why is GPU Operator Needed?
Manual GPU driver installation has the following serious operational issues:
| Issue | Manual Management | GPU Operator |
|---|---|---|
| Driver Installation | SSH into each node for manual install | Auto-deploy via DaemonSet |
| Kernel Update Response | Driver recompile needed on kernel update | Auto-match pre-compiled drivers |
| Version Consistency | Risk of driver version mismatch per node | Unified version via ClusterPolicy |
| New Node Addition | Provisioning script maintenance required | Label-based auto-detect and install |
| Monitoring | Separate setup required | Auto-deploy DCGM Exporter |
| GPU Sharing | Manual MIG/Time-Slicing setup | Declarative ConfigMap-based setup |
| Upgrades | Manual update after node drain | Rolling Update support |
1.3 Supported Environments
Supported OS:
| OS | Version |
|---|---|
| Ubuntu | 20.04, 22.04, 24.04 LTS |
| Red Hat Enterprise Linux | 8.x, 9.x |
| CentOS Stream | 8, 9 |
| Rocky Linux | 8.x, 9.x |
| SUSE Linux Enterprise | 15 SP4+ |
Supported GPUs:
| GPU Generation | Example Models | MIG Support |
|---|---|---|
| Turing | T4, RTX 2080 | No |
| Ampere | A100, A30, A10, A2 | Yes (A100, A30 only) |
| Hopper | H100, H200 | Yes |
| Ada Lovelace | L40, L40S, L4 | No |
| Blackwell | B100, B200, GB200 | Yes |
Supported Kubernetes Version: v1.25+ (recommended: v1.28+)
Supported Container Runtime: containerd, CRI-O
1.4 Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│ Kubernetes Control Plane                                            │
│ ┌──────────────────────────────────────────────────────────────┐   │
│ │ GPU Operator Controller                                      │   │
│ │ (Deployment in gpu-operator namespace)                       │   │
│ │                                                              │   │
│ │ ┌────────────────┐    ┌─────────────────────────────────┐    │   │
│ │ │ ClusterPolicy  │───▶│ Reconciliation Loop             │    │   │
│ │ │ (CRD)          │    │ - Watch GPU Nodes               │    │   │
│ │ └────────────────┘    │ - Deploy DaemonSets             │    │   │
│ │                       │ - Manage Lifecycle              │    │   │
│ │                       └─────────────────────────────────┘    │   │
│ └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
                              │
               ┌──────────────┼──────────────┐
               ▼              ▼              ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────────┐
│ GPU Worker #1    │ │ GPU Worker #2│ │ GPU Worker #3    │
│                  │ │              │ │                  │
│ ┌──────────────┐ │ │ ┌──────────┐ │ │ ┌──────────────┐ │
│ │ NFD          │ │ │ │ NFD      │ │ │ │ NFD          │ │
│ │ (Node Feature│ │ │ │          │ │ │ │              │ │
│ │  Discovery)  │ │ │ └──────────┘ │ │ └──────────────┘ │
│ └──────────────┘ │ │ ┌──────────┐ │ │ ┌──────────────┐ │
│ ┌──────────────┐ │ │ │ GFD      │ │ │ │ GFD          │ │
│ │ GFD          │ │ │ │          │ │ │ │              │ │
│ │ (GPU Feature │ │ │ └──────────┘ │ │ └──────────────┘ │
│ │  Discovery)  │ │ │ ┌──────────┐ │ │ ┌──────────────┐ │
│ └──────────────┘ │ │ │ Driver   │ │ │ │ Driver       │ │
│ ┌──────────────┐ │ │ │          │ │ │ │              │ │
│ │ NVIDIA Driver│ │ │ └──────────┘ │ │ └──────────────┘ │
│ │ (DaemonSet)  │ │ │ ┌──────────┐ │ │ ┌──────────────┐ │
│ └──────────────┘ │ │ │ Toolkit  │ │ │ │ Toolkit      │ │
│ ┌──────────────┐ │ │ │          │ │ │ │              │ │
│ │ Container    │ │ │ └──────────┘ │ │ └──────────────┘ │
│ │ Toolkit      │ │ │ ┌──────────┐ │ │ ┌──────────────┐ │
│ └──────────────┘ │ │ │ Device   │ │ │ │ Device       │ │
│ ┌──────────────┐ │ │ │ Plugin   │ │ │ │ Plugin       │ │
│ │ Device Plugin│ │ │ └──────────┘ │ │ └──────────────┘ │
│ └──────────────┘ │ │ ┌──────────┐ │ │ ┌──────────────┐ │
│ ┌──────────────┐ │ │ │ DCGM     │ │ │ │ DCGM         │ │
│ │ DCGM Exporter│ │ │ │ Exporter │ │ │ │ Exporter     │ │
│ └──────────────┘ │ │ └──────────┘ │ │ └──────────────┘ │
│ ┌──────────────┐ │ │ ┌──────────┐ │ │ ┌──────────────┐ │
│ │ MIG Manager  │ │ │ │ MIG Mgr  │ │ │ │ MIG Manager  │ │
│ └──────────────┘ │ │ └──────────┘ │ │ └──────────────┘ │
│                  │ │              │ │                  │
│ ╔════════════╗   │ │ ╔══════════╗ │ │ ╔════════════╗   │
│ ║ NVIDIA GPU ║   │ │ ║ NVIDIA   ║ │ │ ║ NVIDIA GPU ║   │
│ ║ (A100)     ║   │ │ ║ GPU(H100)║ │ │ ║ (T4)       ║   │
│ ╚════════════╝   │ │ ╚══════════╝ │ │ ╚════════════╝   │
└──────────────────┘ └──────────────┘ └──────────────────┘
```
The GPU Operator Controller watches the ClusterPolicy CRD, automatically detects worker nodes with GPUs, and deploys all necessary components as DaemonSets. A node is recognized as having an NVIDIA GPU when it has the label feature.node.kubernetes.io/pci-10de.present=true.
2. Detailed Analysis of the 7 Core Components
GPU Operator consists of 7 core components that together form the complete software stack required for GPU workloads. Each component is deployed as a DaemonSet, and the components depend on one another in a fixed order.
2.1 NVIDIA Driver (nvidia-driver-daemonset)
The NVIDIA Driver DaemonSet automatically installs the kernel module required to communicate with GPU hardware on each GPU node. This is the most fundamental and essential component of GPU Operator.
Key Features:
- Auto-build and load of NVIDIA kernel modules (`nvidia.ko`, `nvidia-modeset.ko`, `nvidia-uvm.ko`)
- Automatic matching of the driver to the host kernel version
- Support for both pre-compiled and run-compiled drivers
- Driver version management and upgrades
Pre-compiled vs Run-compiled:
| Property | Pre-compiled | Run-compiled |
|---|---|---|
| Build Time | None (already built) | 10-20 min on node |
| Kernel Compatibility | Matched to specific kernel version | Supports all kernel versions |
| Node Boot Time | Fast | Slow |
| Supported OS | Ubuntu (primarily) | Most Linux |
| Setting | kernelModuleType: precompiled | kernelModuleType: runcompiled |
Driver Configuration Example in ClusterPolicy:
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true
    repository: nvcr.io/nvidia
    image: driver
    version: "560.35.03"
    kernelModuleType: auto   # auto | precompiled | runcompiled
    manager:
      env:
        - name: ENABLE_AUTO_DRAIN
          value: "true"
    rdma:
      enabled: false
    licensingConfig:
      nlsEnabled: false
      configMapName: ""
```
Note: If NVIDIA drivers are already installed on the host, set `driver.enabled=false`. Conflicting drivers will prevent the GPU from functioning properly.
2.2 NVIDIA Container Toolkit (nvidia-container-toolkit)
NVIDIA Container Toolkit acts as a bridge enabling Container Runtimes (containerd, CRI-O) to access GPUs from within containers.
Core Operation Principle:
```
┌─────────────────────────────────────────────────────┐
│ Container Creation Flow                             │
│                                                     │
│ kubelet                                             │
│   │                                                 │
│   ▼                                                 │
│ containerd/CRI-O                                    │
│   │                                                 │
│   ▼                                                 │
│ ┌─────────────────────────────────────┐             │
│ │ NVIDIA Container Runtime Hook       │             │
│ │ (nvidia-container-runtime-hook)     │             │
│ │                                     │             │
│ │ 1. Modify OCI spec                  │             │
│ │ 2. Mount GPU devices                │             │
│ │ 3. Inject NVIDIA libraries          │             │
│ │ 4. Set environment variables        │             │
│ └─────────────────────────────────────┘             │
│   │                                                 │
│   ▼                                                 │
│ ┌─────────────────────────────────────┐             │
│ │ libnvidia-container                 │             │
│ │ - GPU device node binding           │             │
│ │ - CUDA library mount                │             │
│ │ - Driver library mount              │             │
│ └─────────────────────────────────────┘             │
│   │                                                 │
│   ▼                                                 │
│ Container (with GPU access)                         │
│ /dev/nvidia0, /dev/nvidiactl, ...                   │
└─────────────────────────────────────────────────────┘
```
CDI (Container Device Interface) Support:
Starting from GPU Operator v25.x, CDI is enabled by default, allowing the container runtime to recognize GPU devices in a standardized manner.
```yaml
spec:
  toolkit:
    enabled: true
    version: v1.17.3-ubuntu22.04
  cdi:
    enabled: true    # Enable CDI (default: true)
    default: true    # Use CDI as the default device allocation method
```
2.3 NVIDIA Device Plugin (nvidia-device-plugin)
NVIDIA Device Plugin registers GPUs as Kubernetes Extended Resources with kubelet, allowing Pods to request nvidia.com/gpu resources.
Operation Flow:
- Device Plugin registers with kubelet's Device Plugin gRPC interface
- It reports the number of GPUs on the node to kubelet
- The Kubernetes scheduler places Pods requesting `nvidia.com/gpu` on an appropriate node
- When a Pod starts, Device Plugin passes the GPU device information to kubelet
GPU Request Example in Pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # Request 1 GPU
```
Resource Types Managed by Device Plugin:
| Resource Name | Description | Use Case |
|---|---|---|
| `nvidia.com/gpu` | Full GPU (or Time-Slicing shared) | Default GPU workloads |
| `nvidia.com/mig-1g.5gb` | MIG 1g.5gb instance | A100 MIG partition |
| `nvidia.com/mig-2g.10gb` | MIG 2g.10gb instance | A100 MIG partition |
| `nvidia.com/mig-3g.20gb` | MIG 3g.20gb instance | A100 MIG partition |
| `nvidia.com/mig-7g.40gb` | MIG 7g.40gb full instance | A100 full MIG |
| `nvidia.com/gpu.shared` | Time-Slicing shared GPU | When `renameByDefault` is enabled |
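As a sketch, a Pod requests one of the MIG resources above exactly the way it requests a full GPU; the Pod name below is illustrative, and the profile (`1g.5gb`) must match a partition that MIG Manager has actually created on some node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example          # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG instance instead of a full GPU
```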
2.4 NVIDIA DCGM & DCGM Exporter
DCGM (Data Center GPU Manager) is a tool for monitoring NVIDIA GPU health and performance, while DCGM Exporter exposes these metrics in Prometheus format.
Key Collected Metrics:
| Metric Name | Description | Unit |
|---|---|---|
| `DCGM_FI_DEV_GPU_UTIL` | GPU core utilization | % |
| `DCGM_FI_DEV_MEM_COPY_UTIL` | Memory bandwidth utilization | % |
| `DCGM_FI_DEV_GPU_TEMP` | GPU temperature | °C |
| `DCGM_FI_DEV_POWER_USAGE` | Power consumption | W |
| `DCGM_FI_DEV_FB_FREE` | Available framebuffer memory | MiB |
| `DCGM_FI_DEV_FB_USED` | Used framebuffer memory | MiB |
| `DCGM_FI_DEV_XID_ERRORS` | XID error count | count |
| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | Single-bit ECC errors (volatile) | count |
| `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | Double-bit ECC errors (volatile) | count |
2.5 NVIDIA MIG Manager
MIG (Multi-Instance GPU) Manager manages the partitioning of a single physical GPU into multiple independent GPU instances on NVIDIA Ampere and newer architectures (A100, A30, H100, etc.).
Unlike Time-Slicing, MIG provides hardware-level memory isolation and compute isolation. Each MIG instance has its own memory, cache, and SM (Streaming Multiprocessor).
A100 40GB MIG Profiles:
| Profile | GPU Slice | Memory | Max Instances |
|---|---|---|---|
| `1g.5gb` | 1/7 | 5 GB | 7 |
| `2g.10gb` | 2/7 | 10 GB | 3 |
| `3g.20gb` | 3/7 | 20 GB | 2 |
| `7g.40gb` | 7/7 | 40 GB | 1 |
2.6 Node Feature Discovery (NFD)
Node Feature Discovery (NFD) automatically detects hardware characteristics of Kubernetes nodes and registers them as node labels. As a dependency of GPU Operator, it plays a key role in identifying GPU nodes.
GPU Operator uses the feature.node.kubernetes.io/pci-10de.present=true label to automatically identify nodes with NVIDIA GPUs and deploy GPU-related DaemonSets only to those nodes.
Note: If NFD is already installed in the cluster, set `nfd.enabled=false` during GPU Operator installation to prevent duplicate deployment.
2.7 GPU Feature Discovery (GFD)
GPU Feature Discovery (GFD) registers GPU-specific detailed information as node labels, similar to NFD but focused on GPU details. GFD communicates with the NVIDIA Driver to collect GPU model, driver version, CUDA version, and other information.
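The exact label set depends on the GPU model and driver, but GFD labels of roughly the following shape appear on a GPU node (the values below are illustrative, not taken from a real node):

```
nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB
nvidia.com/gpu.count=8
nvidia.com/gpu.memory=40960
nvidia.com/mig.capable=true
nvidia.com/cuda.driver.major=560
```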
Component Deployment Order (Dependency Chain):
```
NFD (Node Feature Discovery)
 └──▶ GPU Feature Discovery (GFD)
       └──▶ NVIDIA Driver
             └──▶ NVIDIA Container Toolkit
                   └──▶ NVIDIA Device Plugin
                         ├──▶ DCGM / DCGM Exporter
                         ├──▶ MIG Manager (MIG-capable GPUs only)
                         └──▶ Validator (Verification)
```
Component Role Summary Table:
| Component | DaemonSet Name | Role | Namespace |
|---|---|---|---|
| NFD | node-feature-discovery-worker | Node hardware labeling | gpu-operator |
| GFD | gpu-feature-discovery | GPU detail labeling | gpu-operator |
| Driver | nvidia-driver-daemonset | GPU kernel module install | gpu-operator |
| Toolkit | nvidia-container-toolkit-daemonset | Container Runtime GPU integration | gpu-operator |
| Device Plugin | nvidia-device-plugin-daemonset | GPU resource kubelet registration | gpu-operator |
| DCGM | nvidia-dcgm | GPU metrics collection daemon | gpu-operator |
| DCGM Exporter | nvidia-dcgm-exporter | Prometheus metrics exposure | gpu-operator |
| MIG Manager | nvidia-mig-manager | MIG partition management | gpu-operator |
| Validator | nvidia-operator-validator | Full stack verification | gpu-operator |
3. Installation Guide (Helm-based)
3.1 Prerequisites
3.1.1 Hardware Requirements
- Worker nodes with NVIDIA GPUs (Turing architecture or newer recommended)
- PCIe 3.0+ slots
- IOMMU support (when using KubeVirt GPU Passthrough)
3.1.2 Software Requirements
```shell
# Verify kubectl and helm installation
kubectl version --client
helm version

# Verify Kubernetes cluster access
kubectl get nodes

# Check Container Runtime (containerd or CRI-O)
kubectl get nodes -o wide
```
3.1.3 Disable Nouveau Driver
The GPU Operator Driver DaemonSet requires that the open-source Nouveau driver be disabled before the NVIDIA kernel modules can load.
```shell
# Check if the nouveau module is loaded
lsmod | grep nouveau

# Disable nouveau (if needed)
cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Regenerate initramfs
sudo update-initramfs -u

# Reboot
sudo reboot
```
3.2 Helm Repo Addition and Basic Installation
```shell
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Check available versions
helm search repo nvidia/gpu-operator --versions

# Basic installation (pinned version)
helm install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 \
  --wait
```
3.3 Verification
```shell
# nvidia-smi execution test
kubectl run nvidia-smi \
  --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04 \
  --overrides='{"spec":{"restartPolicy":"Never","containers":[{"name":"nvidia-smi","image":"nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

# Wait for the Pod to finish before reading its logs
kubectl wait pod/nvidia-smi --for=jsonpath='{.status.phase}'=Succeeded --timeout=120s
kubectl logs nvidia-smi

# Cleanup
kubectl delete pod nvidia-smi
```
4. GPU Sharing Strategies
4.1 Time-Slicing
Time-Slicing shares a single physical GPU among multiple containers by time-based division. Supported on all NVIDIA GPUs.
Characteristics:
- No hardware-level isolation (no memory isolation)
- Risk of GPU memory OOM
- All NVIDIA GPUs supported (Turing+)
- Simple and flexible configuration
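A minimal Time-Slicing setup uses the device plugin's sharing config format: create a ConfigMap holding the sharing section, then point the ClusterPolicy at it. The ConfigMap name, key, and replica count below are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # illustrative name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4        # each physical GPU is advertised as 4 schedulable GPUs
```

The ConfigMap is then referenced from ClusterPolicy via `devicePlugin.config.name` (with the default key selected via `devicePlugin.config.default`).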
4.2 MIG (Multi-Instance GPU)
MIG is a hardware-level GPU partitioning technology supported on NVIDIA Ampere and newer architectures (A100, A30, H100, H200).
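MIG Manager applies partition layouts described in a `mig-parted`-style configuration, and a layout is selected per node by labeling it. A sketch with illustrative ConfigMap and layout names:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-parted-config   # illustrative name
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      all-1g.5gb:                  # seven 1g.5gb instances per A100 40GB
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
```

A node is then switched to a layout with `kubectl label node <node> nvidia.com/mig.config=all-1g.5gb --overwrite`, and MIG Manager reconfigures the GPU accordingly.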
4.3 vGPU (Virtual GPU)
NVIDIA vGPU virtualizes GPUs for sharing across multiple VMs. Primarily used with KubeVirt VMs and requires a separate NVIDIA license.
4.4 GPU Sharing Strategy Comparison
| Property | Time-Slicing | MIG | vGPU |
|---|---|---|---|
| Isolation Level | None (time-sharing only) | Hardware (memory+compute) | Hardware (virtualization) |
| Memory Isolation | No | Yes | Yes |
| Fault Isolation | No | Yes | Yes |
| Supported GPUs | All NVIDIA GPUs | A100, A30, H100, H200 | vGPU-capable GPUs |
| Max Partitions | Unlimited by config | Varies by GPU (max 7) | Varies by GPU |
| Performance Overhead | Context Switching | Minimal | Virtualization overhead |
| License | Free | Free | Paid (NVIDIA AI Enterprise) |
| Use Cases | Dev/test, light inference | Production inference, multi-tenant | VM-based workloads |
5. KubeVirt and GPU Operator Integration
5.1 KubeVirt Introduction
KubeVirt is a project that enables managing virtual machines (VMs) on Kubernetes. By integrating KubeVirt with GPU Operator, you can assign physical GPUs directly to VMs (Passthrough) or allocate virtual GPUs (vGPU).
Important: A single GPU worker node can only run one type of GPU workload. You cannot mix container, vm-passthrough, and vm-vgpu on the same node.
5.2 GPU Passthrough (PCI Passthrough)
GPU Passthrough assigns the entire physical GPU directly to a KubeVirt VM. The VM can use the GPU at native performance, but GPU sharing is not possible.
Key Steps:
- Enable IOMMU/VT-d in the BIOS
- Add kernel parameters (`intel_iommu=on iommu=pt`)
- Load the VFIO driver module
- Install GPU Operator with `sandboxWorkloads.enabled=true`
- Label the GPU node with `nvidia.com/gpu.workload.config=vm-passthrough`
- Register the GPU device in the KubeVirt CR
- Assign the GPU in the VirtualMachine spec
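The last two steps can be sketched as follows; the PCI vendor:device ID and resource name are illustrative and must match what the Sandbox Device Plugin actually advertises on your nodes:

```yaml
# KubeVirt CR fragment: permit the GPU as a host device
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:20B0"           # illustrative NVIDIA vendor:device ID
          resourceName: nvidia.com/GA100_A100_SXM4_40GB
          externalResourceProvider: true            # resource is advertised by GPU Operator
---
# VirtualMachine spec fragment: attach the GPU to the VM
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            - deviceName: nvidia.com/GA100_A100_SXM4_40GB
              name: gpu1
```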
5.3 vGPU with KubeVirt
vGPU splits a single physical GPU into multiple virtual GPUs that can be shared across multiple VMs. Unlike GPU Passthrough, a single GPU can be used by multiple VMs simultaneously.
Prerequisites:
- NVIDIA AI Enterprise license or vGPU license
- GPU supporting vGPU (A100, A10, A30, A16, L40, L4, T4, etc.)
5.4 Live Migration with GPU Constraints
| Constraint | GPU Passthrough | vGPU |
|---|---|---|
| Live Migration | Not supported | Limited support (NVIDIA vGPU Migration) |
| Reason | PCIe direct assignment cannot be migrated | vGPU state transfer needed |
| Alternative | Cold Migration (stop VM then migrate) | NVIDIA vGPU Migration license required |
6. Monitoring and Observability
6.1 DCGM Exporter -> Prometheus -> Grafana Pipeline
GPU Operator's DCGM Exporter collects metrics from each GPU node and exposes them in Prometheus format on port 9400. Prometheus scrapes these metrics, and Grafana visualizes them using Dashboard ID 12239 (NVIDIA DCGM Exporter Dashboard).
6.2 Alert Rule Examples
Key alerts include GPU temperature warning (above 85°C), GPU temperature critical (above 90°C), GPU memory nearly full (above 95% usage), double-bit ECC errors, sustained high GPU utilization (above 95% for 30+ minutes), and XID errors.
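Assuming Prometheus Operator is installed, two of these alerts might be expressed as a PrometheusRule; the rule names, namespace, and thresholds below are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts            # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU temperature above 85°C"
        - alert: GPUDoubleBitECCErrors
          expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Double-bit ECC errors detected on a GPU"
```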
7. Production Deployment Best Practices
7.1 Node Labeling Strategy
In production, GPU nodes and CPU nodes should be clearly distinguished.
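For example, GFD labels can be used directly in a nodeSelector so a workload lands only on a specific GPU model (the product string below is an illustrative GFD label value):

```yaml
# Pod/Deployment spec fragment
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # illustrative value
```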
7.2 Taints & Tolerations
Set Taints on GPU nodes so only GPU workloads are scheduled.
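A common pattern, sketched here with an illustrative taint key: taint the GPU nodes, then add a matching toleration only to GPU workloads so CPU-only Pods are kept off expensive nodes:

```yaml
# Taint applied with: kubectl taint nodes <node> nvidia.com/gpu=present:NoSchedule
# Pod spec fragment for GPU workloads:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```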
7.3 ResourceQuota for GPU Allocation Limits
Limit GPU usage per namespace.
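GPU requests are ordinary extended resources, so a standard ResourceQuota applies; the namespace and limit below are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # at most 4 GPUs requested in this namespace
```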
7.4 PriorityClass Configuration
Configure PriorityClass to preempt lower-priority workloads when GPU resources are scarce.
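A sketch of such a PriorityClass (the name and value are illustrative); Pods opt in via `priorityClassName` in their spec:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-inference-high    # illustrative name
value: 1000000                # higher value wins when GPUs are scarce
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Production GPU inference; may preempt lower-priority GPU jobs"
```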
7.5 Driver Upgrade Strategy
GPU Operator supports Rolling Updates for driver upgrades. Set ENABLE_AUTO_DRAIN=true for automatic node drain before driver upgrade, and maxUnavailable: 1 to upgrade one node at a time in production.
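In ClusterPolicy this might look like the following fragment; treat the `upgradePolicy` field names as a sketch to validate against your operator version's CRD:

```yaml
spec:
  driver:
    manager:
      env:
        - name: ENABLE_AUTO_DRAIN
          value: "true"
    upgradePolicy:
      autoUpgrade: true
      maxUnavailable: 1        # upgrade one node at a time
      drain:
        enable: true
        timeoutSeconds: 300
        deleteEmptyDir: true
```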
8. Troubleshooting Guide
8.1 GPU not detected (Driver Loading Failure)
Symptoms: nvidia-driver-daemonset Pod in CrashLoopBackOff or Error state
Check driver Pod logs, verify Nouveau module conflicts, check for missing kernel headers, check dmesg for GPU-related errors, and verify GPU hardware recognition.
8.2 nvidia-smi not found in container
Check Container Toolkit Pod status, Device Plugin Pod status, node GPU resources, and CDI configuration.
8.3 CUDA version mismatch
Verify the driver's CUDA support version is greater than or equal to the application's CUDA version.
8.4 MIG partition failure
Check MIG Manager logs, verify MIG status on the node, check for running GPU processes, and retry MIG configuration.
8.5 KubeVirt VM GPU not recognized
Verify node workload label, VFIO/vGPU Manager status, Sandbox Device Plugin status, GPU resources on the node, KubeVirt CR permittedHostDevices, and IOMMU group configuration.
9. References
Official Documentation
| Resource | URL |
|---|---|
| NVIDIA GPU Operator Official Docs | https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/ |
| GPU Operator Installation Guide | https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html |
| GPU Operator with KubeVirt | https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html |
| GPU Time-Slicing Guide | https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html |
| GPU Operator MIG Guide | https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html |
| GPU Operator Troubleshooting | https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html |
| NVIDIA DCGM Exporter | https://github.com/NVIDIA/dcgm-exporter |
| NVIDIA MIG User Guide | https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html |
KubeVirt Related
| Resource | URL |
|---|---|
| KubeVirt Official Docs | https://kubevirt.io/user-guide/ |
| KubeVirt Host Devices | https://kubevirt.io/user-guide/compute/host-devices/ |
Additional Resources
| Resource | URL |
|---|---|
| GPU Operator GitHub | https://github.com/NVIDIA/gpu-operator |
| DCGM Exporter Grafana Dashboard | https://grafana.com/grafana/dashboards/12239 |