- Published on
GPU Software Engineer Complete Guide: From CUDA Architecture to vGPU/MIG, InfiniBand, and K8s GPU Scheduling — System Optimization Mastery
- Author: Youngju Kim (@fjvbn20031)
- 1. The Rare Career of GPU Software Engineer
- 2. JD Line-by-Line Dissection
- 3. GPU Architecture Deep Dive
- 4. GPU Virtualization Technology
- 5. High-Speed Networking: InfiniBand and RDMA
- 6. Kubernetes GPU Management
- 7. AI Workload Optimization
- 8. Linux System Troubleshooting
- 9. Interview Questions: Top 30
- 10. 10-Month Study Roadmap
- 11. Portfolio Projects (3)
- 12. Quiz
- 13. References
1. The Rare Career of GPU Software Engineer
"People Who USE GPUs" vs "People Who MAKE GPUs Work"
Since 2024, the keyword dominating the AI industry has been "GPU." Every company is fighting to acquire GPUs, but the number of engineers who can properly operate the GPUs they acquire is vanishingly small.
A critical distinction is needed here:
| Aspect | People Who USE GPUs | People Who MAKE GPUs Work |
|---|---|---|
| Role | ML Engineer, Researcher | GPU Software Engineer |
| Focus | Model accuracy, training algorithms | GPU utilization, memory bandwidth, scheduling |
| Tools | PyTorch, TensorFlow | nvidia-smi, Nsight, DCGM, NCCL |
| Key Question | "Why isn't this model performing?" | "Why is this GPU only 70% utilized?" |
| Abstraction Level | Python API | CUDA kernels, drivers, hypervisors |
| Response Area | Model architecture changes | XID error analysis, PCIe bottleneck resolution, MIG config |
When an ML Engineer calls model.to('cuda') in PyTorch, a GPU Software Engineer understands and optimizes the path that call takes: which driver layers it traverses and which memory region the allocation lands in.
Market Value of This Role
The supply-demand imbalance for GPU Software Engineers is extreme:
- Demand side: As of 2025, over 5 million NVIDIA GPUs are installed in datacenters worldwide. Millions more are deployed annually, and systems engineers are needed to operate them.
- Supply side: Engineers with deep GPU system software knowledge have traditionally existed only within NVIDIA itself, HPC research labs, and major cloud providers. In the Korean market, this talent pool is extremely limited.
- Salary premium: In the US, GPU/CUDA Engineers typically command base salaries of USD 250K-400K. In Korea, companies are increasingly offering exceptional compensation for experts in this field.
LG Uplus GPU Technology TF Mission
Understanding why LG Uplus established the GPU Technology TF is essential:
- Telecom AI Infrastructure Business: LG Uplus is pursuing not only its own AI services but also providing GPU cloud services to enterprise customers.
- GPU Multi-tenancy: Sharing a single GPU among multiple customers requires vGPU/MIG technology.
- Leveraging Network Strengths: As a telecom company, they have competency in InfiniBand/RoCE high-speed network design.
- End-to-End Optimization: This team is responsible for the entire pipeline from hardware selection to virtualization, containers, and AI workload onboarding.
Related Role Comparison
| Role | Core Competency | GPU Depth | Infra Depth |
|---|---|---|---|
| ML Engineer | Model development, training pipelines | Low (API level) | Low |
| MLOps Engineer | CI/CD, model serving, pipeline automation | Medium | Medium |
| GPU SW Engineer | GPU architecture, virtualization, drivers | Very High | High |
| Infra SRE | Server/network availability, monitoring | Medium | Very High |
| HPC Engineer | Parallel computing, MPI, schedulers | High | High |
GPU Software Engineer sits at the intersection of all these roles, specifically responsible for the software layer closest to hardware.
2. JD Line-by-Line Dissection
Let's analyze what each item in the LG Uplus GPU Software Engineer JD actually means.
Responsibilities
"GPU resource management and performance optimization"
This is not simply monitoring nvidia-smi. Specifically:
- Finding and resolving root causes when GPU utilization or SM occupancy falls below expectations
- Analyzing HBM memory bandwidth utilization and kernel-level optimization
- Managing power capping vs performance tradeoffs
- Determining RMA decisions when ECC errors occur
- Designing cluster-level GPU allocation policies
"GPU virtualization technology development and optimization"
This is the core differentiator for this position:
- Designing vGPU profiles: determining which virtual GPU size to allocate to which customer
- MIG partitioning strategy: configuring A100/H100 MIG profiles to match workloads
- Establishing technical selection criteria between PCI Passthrough vs vGPU vs MIG
- Building pipelines for GPU allocation to VMs in KubeVirt environments
"AI/ML workload GPU onboarding and performance optimization"
Efficiently deploying customer AI models onto GPU infrastructure:
- Model profiling: analyzing GPU memory requirements, compute requirements
- Matching appropriate GPU type/size (A100-40GB vs A100-80GB vs H100)
- Distributed training setup: NCCL communication optimization, InfiniBand utilization
- Inference serving: Triton Inference Server configuration, batch size optimization
Qualification Analysis
"BS+ in CS (MS preferred in systems/network/OS)"
The reason for MS preference is clear. Knowledge in this field is mostly covered in graduate-level courses:
- Operating Systems: memory management, scheduling, device drivers
- Computer Architecture: cache hierarchy, memory models, parallel processing
- Networking: RDMA, high-performance protocols
"GPU or system software practical experience"
The key phrase is "system software." This means:
- Kernel module development/debugging experience
- Interaction with device drivers
- Low-level performance profiling (perf, ftrace, eBPF)
- C/C++ level systems programming
Required Skills Analysis
The following sections will cover each required skill in depth.
3. GPU Architecture Deep Dive
3-1. GPU Compute Architecture
SM (Streaming Multiprocessor) Architecture
The core compute unit of a GPU is the SM (Streaming Multiprocessor). Understanding the SM structure of modern NVIDIA GPUs is the starting point for all GPU optimization.
SM Internal Components:
SM (Streaming Multiprocessor)
├── CUDA Cores (INT32 + FP32)
│ └── H100: 128 FP32 cores per SM
├── Tensor Cores
│ └── H100: 4th-gen Tensor Cores (FP8 support)
├── RT Cores (ray tracing; present on consumer/workstation GPUs, absent on A100/H100)
├── Warp Schedulers (4)
│ └── Each scheduler independently dispatches warps
├── Register File (256KB per SM in H100)
├── Shared Memory / L1 Cache (unified, up to 228KB)
├── Load/Store Units
├── Special Function Units (SFU)
│ └── Transcendental functions: sin, cos, exp
└── Texture Units
Warps and the SIMT Model:
The fundamental unit of GPU execution is a Warp (32 threads). All threads in the same Warp execute the same instruction simultaneously. This is the SIMT (Single Instruction, Multiple Threads) model.
Grid (entire workload)
├── Block 0
│ ├── Warp 0 (Thread 0~31) → Same instruction, simultaneous execution
│ ├── Warp 1 (Thread 32~63) → Same instruction, simultaneous execution
│ └── Warp N ...
├── Block 1
│ ├── Warp 0
│ └── ...
└── Block M
Warp Divergence problem: When threads within a Warp take different branches (if/else), both branches are executed sequentially, degrading performance. This is called "Warp Divergence" and is a pattern that must be avoided in GPU programming.
```cuda
// Bad example: Warp Divergence
__global__ void kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0) {
        data[idx] = expensive_operation_A(data[idx]);
    } else {
        data[idx] = expensive_operation_B(data[idx]);
    }
    // Even/odd threads in the same Warp take different branches -> sequential execution
}

// Good example: Threads in the same Warp take the same branch
__global__ void kernel_optimized(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = idx / 32;
    if (warp_id % 2 == 0) {
        data[idx] = expensive_operation_A(data[idx]);
    } else {
        data[idx] = expensive_operation_B(data[idx]);
    }
    // All threads in the same Warp take the same branch
}
```
Architecture Generation Comparison
| Feature | Ampere (A100) | Hopper (H100) | Blackwell (B200) |
|---|---|---|---|
| SM Count | 108 | 132 | 192 |
| CUDA Cores | 6,912 | 16,896 | 21,760+ |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) | 768 (5th gen) |
| Memory | HBM2e 80GB | HBM3 80GB | HBM3e 192GB |
| Memory Bandwidth | 2.0 TB/s | 3.35 TB/s | 8.0 TB/s |
| NVLink | 3.0 (600GB/s) | 4.0 (900GB/s) | 5.0 (1.8TB/s) |
| Transformer Engine | None | FP8 support | FP4 support |
| MIG Support | Up to 7 instances | Up to 7 instances | Up to 7 instances |
| TDP | 400W | 700W | 1000W |
| FP16 Tensor Perf | 312 TFLOPS | 989 TFLOPS | 2,250+ TFLOPS |
Transformer Engine: Introduced with H100, the Transformer Engine provides hardware-level FP8 precision support. It dynamically switches tensors between FP8/FP16 per layer during training, halving memory usage while minimizing accuracy loss.
NVLink and NVSwitch: High-speed direct communication paths between GPUs.
NVLink Topology (DGX H100):
GPU 0 <-- NVLink 4.0 (900GB/s) --> GPU 1
| |
NVSwitch (fully connected) NVSwitch
| |
GPU 2 <-- NVLink 4.0 (900GB/s) --> GPU 3
| |
... 8-GPU fully connected ...
GPU 6 <-- NVLink 4.0 (900GB/s) --> GPU 7
Total bandwidth: 8 GPUs x 900GB/s = 7.2TB/s (bidirectional)
3-2. GPU Memory Hierarchy (Critical!)
Understanding the GPU memory hierarchy accounts for 80% of GPU performance optimization. All GPU performance problems ultimately reduce to memory problems.
Memory Hierarchy (Fast -> Slow):
1. Register (Fastest)
├── Capacity: Up to 255 per thread (32-bit)
├── Latency: ~1 cycle
├── Bandwidth: Infinite (directly connected to ALU)
└── Note: Compiler auto-allocates; spills to local memory when exceeded
2. Shared Memory
├── Capacity: 48KB ~ 228KB per SM (configurable)
├── Latency: ~20-30 cycles (roughly an order of magnitude slower than registers)
├── Bandwidth: ~19TB/s (H100)
├── Feature: Shared among threads in same Block
└── Warning: Bank Conflicts possible
3. L1 Cache
├── Unified with Shared Memory (ratio configurable)
├── H100: Shared Memory + L1 = 228KB per SM
└── Auto-cached, not directly programmable
4. L2 Cache
├── Capacity: H100 = 50MB (shared across all SMs)
├── Latency: ~200 cycles
└── A100: 40MB, Blackwell: up to 128MB
5. Global Memory (HBM)
├── Capacity: 40GB ~ 192GB
├── Latency: ~600 cycles (~600x slower than registers)
├── Bandwidth: 2.0 ~ 8.0 TB/s (by generation)
└── Accessible from all threads
Memory Bandwidth Utilization Calculation:
Determining whether GPU performance is memory-bound or compute-bound is the core skill.
Arithmetic Intensity = FLOPs / Bytes Accessed
H100:
- Peak Compute: 989 TFLOPS (FP16 Tensor)
- Peak Memory BW: 3.35 TB/s
Balance Point (Roofline Analysis):
989 TFLOPS / 3.35 TB/s = 295 FLOPs/Byte
-> Arithmetic Intensity < 295: Memory-Bound
-> Arithmetic Intensity > 295: Compute-Bound
Examples:
- Vector addition: 1 FLOP / 12 Bytes = 0.08 -> Extremely Memory-Bound
- Matrix multiply (NxN): 2N FLOPs / 8 Bytes = O(N) -> Compute-Bound for large N
- Transformer Attention: Typically Memory-Bound (especially during inference)
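The roofline arithmetic above can be sketched in a few lines of Python. The H100 peak figures come from this section; the helper names are illustrative.

```python
# Minimal roofline model: classify a kernel as memory- or compute-bound
# from its arithmetic intensity. H100 peaks from the table above.

def arithmetic_intensity(flops, bytes_accessed):
    """FLOPs per byte of DRAM traffic."""
    return flops / bytes_accessed

def roofline_bound(ai, peak_tflops=989.0, peak_bw_tbs=3.35):
    """Return (attainable TFLOPS, limiting resource)."""
    ridge = peak_tflops / peak_bw_tbs          # ~295 FLOPs/Byte for H100
    attainable = min(peak_tflops, ai * peak_bw_tbs)
    return attainable, ("memory-bound" if ai < ridge else "compute-bound")

# FP32 vector add: 1 FLOP per element, 12 bytes moved (2 reads + 1 write)
ai_vadd = arithmetic_intensity(1, 12)
print(roofline_bound(ai_vadd))   # far below the ridge point -> memory-bound
```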
Memory Coalescing:
When 32 threads in a Warp access contiguous memory, hardware merges this into a single large transaction. Non-contiguous access splits into multiple transactions, wasting bandwidth.
```cuda
// Good: Coalesced Access (contiguous)
// Thread 0 -> data[0], Thread 1 -> data[1], ..., Thread 31 -> data[31]
__global__ void coalesced(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = data[idx];  // Single 128-byte transaction
}

// Bad: Strided Access (non-contiguous)
// Thread 0 -> data[0], Thread 1 -> data[32], ..., Thread 31 -> data[31*32]
__global__ void strided(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    float val = data[idx];  // 32 separate transactions!
}
```
Bank Conflicts:
Shared Memory is divided into 32 banks. When 2+ threads in the same Warp access the same bank, accesses are serialized, degrading performance.
Shared Memory Bank Layout (4-byte granularity):
Bank 0: addr 0, 128, 256, ...
Bank 1: addr 4, 132, 260, ...
Bank 2: addr 8, 136, 264, ...
...
Bank 31: addr 124, 252, 380, ...
Bank Conflict Example:
Thread 0 -> Bank 0 (addr 0)
Thread 1 -> Bank 0 (addr 128) <- Same bank! Conflict
-> 2-way bank conflict: 2x slower
Avoidance: add padding

```cuda
__shared__ float tile[32][33];  // Pad the row to 33 floats (instead of 32)
// Column accesses now hit distinct banks -> no bank conflicts
```
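The padding trick can be checked numerically: with 4-byte banks, element (row, col) of a tile with ncols columns lands in bank ((row * ncols + col) mod 32). A quick sketch with hypothetical helper names:

```python
# Which bank does each 4-byte shared-memory element fall in?
# Bank = (word index) mod 32, matching the bank layout above.

def bank_of(row, col, ncols):
    return (row * ncols + col) % 32

def conflicts_on_column_access(ncols, col=0):
    """Extra serialized accesses when 32 threads each read tile[t][col]."""
    banks = [bank_of(t, col, ncols) for t in range(32)]
    return 32 - len(set(banks))   # 0 means conflict-free

print(conflicts_on_column_access(32))  # tile[32][32]: every thread hits one bank
print(conflicts_on_column_access(33))  # tile[32][33]: padding spreads the banks
```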
3-3. CUDA Programming Fundamentals
Grid, Block, Thread Hierarchy
CUDA Execution Model:
Grid (1)
├── Block (0,0) -- Block (1,0) -- Block (2,0)
├── Block (0,1) -- Block (1,1) -- Block (2,1)
└── Block (0,2) -- Block (1,2) -- Block (2,2)
Each Block:
├── Thread (0,0) ... Thread (15,0)
├── Thread (0,1) ... Thread (15,1)
└── Thread (0,15) ... Thread (15,15)
Constraints:
- Max 1024 threads per Block
- Block dimensions: max (1024, 1024, 64)
- Grid dimensions: max (2^31-1, 65535, 65535)
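A minimal launch-configuration helper reflecting these limits (illustrative; it mirrors the standard ceiling-division idiom used when sizing a 1D grid):

```python
# Round the grid up so every element gets a thread, and sanity-check
# the per-block and grid limits quoted above.

def launch_config(n, block_size=256):
    assert 1 <= block_size <= 1024, "max 1024 threads per block"
    grid_size = (n + block_size - 1) // block_size   # ceiling division
    assert grid_size <= 2**31 - 1, "grid x-dimension limit"
    return grid_size, block_size

print(launch_config(1 << 20))  # 1M elements, 256 threads/block -> (4096, 256)
```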
CUDA Code Example: Vector Addition
```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// GPU kernel function
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t size = n * sizeof(float);

    // Host memory allocation
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize
    for (int i = 0; i < n; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Host -> Device copy
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel launch
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);

    // Device -> Host copy
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify
    printf("C[0] = %f (expected 3.0)\n", h_C[0]);

    // Free memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```
CUDA Code Example: Matrix Multiplication (Shared Memory Tiling)
```cuda
#define TILE_SIZE 32

__global__ void matMul(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load from Global Memory to Shared Memory
        if (row < N && t * TILE_SIZE + threadIdx.x < N)
            tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // Synchronize all threads in the Block

        for (int k = 0; k < TILE_SIZE; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];

        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}
```
Key CUDA Libraries
| Library | Purpose | Key API |
|---|---|---|
| cuBLAS | Linear algebra (matrix ops) | cublasSgemm, cublasGemmEx |
| cuDNN | Deep learning primitives | cudnnConvolutionForward |
| cuFFT | Fast Fourier Transform | cufftExecC2C |
| cuSPARSE | Sparse matrix operations | cusparseSpMV |
| Thrust | C++ parallel algorithms (STL-like) | thrust::sort, thrust::reduce |
| CUTLASS | GEMM customization | Template-based GEMM |
3-4. GPU Profiling and Performance Analysis
nvidia-smi Detailed Usage
```shell
# Basic status check
nvidia-smi

# Monitor at 1-second intervals
nvidia-smi dmon -s pucvmet -d 1

# Detailed GPU process info
nvidia-smi pmon -d 1

# Query format (for scripts)
nvidia-smi --query-gpu=timestamp,gpu_bus_id,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 1

# MIG status check
nvidia-smi mig -lgi
nvidia-smi mig -lci

# GPU topology check (NVLink connections)
nvidia-smi topo -m

# ECC error check
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
```
NVIDIA Nsight Systems (System-Level Profiling)
```shell
# Full application profiling
nsys profile --stats=true -o report python train.py

# Trace CUDA API calls + GPU kernels + memory transfers
nsys profile --trace=cuda,nvtx,osrt -o detailed_report python train.py

# Visualize results (GUI)
nsys-ui report.nsys-rep
```
Nsight Systems provides a timeline view for:
- Identifying CPU-GPU synchronization points
- Checking kernel execution and memory transfer overlap
- Identifying CPU bottlenecks (data loading, preprocessing)
- Measuring NCCL communication time (distributed training)
NVIDIA Nsight Compute (Kernel-Level Analysis)
```shell
# Detailed analysis of a specific kernel
ncu --target-processes all --set full -o kernel_report python train.py

# Profile a specific kernel only
ncu --kernel-name "volta_sgemm" --launch-count 10 -o sgemm_report ./my_app

# Check key metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_active,dram__throughput.avg.pct_of_peak_sustained_active ./my_app
```
Key Metrics:
- SM Occupancy: Ratio of active warps in SM (higher is better, typically aim for 50%+)
- Compute Throughput: Compute utilization (% of peak)
- Memory Throughput: Memory bandwidth utilization (% of peak)
- Warp Stall Reasons: Why warps are waiting (memory, synchronization, etc.)
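SM occupancy can be estimated from the same per-SM resource limits Nsight Compute reports. A back-of-the-envelope sketch, assuming H100-like limits (64 resident warps and 65,536 registers per SM, 228KB shared memory); the helper name is hypothetical:

```python
# Rough SM occupancy: how many blocks fit per SM under each resource
# limit, and what fraction of the warp slots they fill.

def occupancy(regs_per_thread, smem_per_block, threads_per_block,
              max_warps=64, reg_file=65536, smem_bytes=228 * 1024):
    warps_per_block = (threads_per_block + 31) // 32
    by_regs = reg_file // (regs_per_thread * threads_per_block)
    by_smem = smem_bytes // smem_per_block if smem_per_block else 10**9
    by_warps = max_warps // warps_per_block
    blocks = min(by_regs, by_smem, by_warps)
    return blocks * warps_per_block / max_warps

# 256 threads/block, 32 registers/thread, 8 KB shared memory per block:
print(f"{occupancy(32, 8192, 256):.0%}")
```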
DCGM (Data Center GPU Manager)
For large-scale cluster monitoring, nvidia-smi alone is insufficient. DCGM provides:
```shell
# Start DCGM
sudo systemctl start nvidia-dcgm

# Health check
dcgmi health -g 0 -c

# Run diagnostics (Level 3: most detailed)
dcgmi diag -r 3 -g 0

# Metric collection (Prometheus integration)
dcgm-exporter &
# Prometheus scrapes http://localhost:9400/metrics
```
GPU Utilization Low: Analysis Pattern
GPU Utilization Low (<50%)
├── CPU bottleneck?
│ ├── Slow data loading -> Increase num_workers, prefetch
│ ├── Heavy preprocessing -> Use DALI (GPU preprocessing)
│ └── Python GIL -> Multiprocessing
├── Memory transfer bottleneck?
│ ├── PCIe bandwidth saturated -> Use GPU Direct
│ └── Unnecessary CPU-GPU copies -> Use pinned memory
├── Small kernels + large overhead?
│ ├── Kernel launch overhead -> Use CUDA Graphs
│ └── Excessive synchronization -> Optimize async execution
├── Batch size too small?
│ └── Not enough work to fill GPU -> Increase batch or use Gradient Accumulation
└── Communication overhead? (distributed training)
├── AllReduce taking too long -> NCCL tuning
└── Network bottleneck -> Check InfiniBand
4. GPU Virtualization Technology
4-1. Virtualization Fundamentals
Type 1 vs Type 2 Hypervisor
Type 1 (Bare-metal): Type 2 (Hosted):
+-----------------+ +-----------------+
| VM1 VM2 | | VM1 VM2 |
| +-----++-----+ | | +-----++-----+ |
| |Guest||Guest| | | |Guest||Guest| |
| | OS || OS | | | | OS || OS | |
| +-----++-----+ | | +-----++-----+ |
| Hypervisor | | Hypervisor |
| (ESXi, KVM) | | (VirtualBox) |
| Hardware | | Host OS |
+-----------------+ | Hardware |
+-----------------+
In the LG Uplus environment, KVM is the core technology. KVM is a Type 1 hypervisor that operates as a Linux kernel module, using QEMU as the userspace emulator.
IOMMU (Intel VT-d / AMD-Vi)
IOMMU is an essential hardware feature for GPU virtualization:
Without IOMMU (unsafe):
VM -> (virtual address) -> Physical Memory (direct access -> can access other VM memory)
With IOMMU (safe):
VM -> (virtual address) -> IOMMU translation -> (physical address, isolated)
└── DMA requests also isolated!
Verifying IOMMU is enabled:
```shell
# Check IOMMU groups
find /sys/kernel/iommu_groups/ -type l

# Check kernel boot parameters
cat /proc/cmdline | grep iommu
# Should contain intel_iommu=on or amd_iommu=on

# Check devices per IOMMU group
for g in /sys/kernel/iommu_groups/*/devices/*; do
    echo "IOMMU Group $(basename $(dirname $(dirname $g))):"
    lspci -nns $(basename $g)
done
```
4-2. PCI Passthrough
PCI Passthrough is the most basic method of directly assigning a physical GPU to a VM.
PCI Passthrough Architecture:
Host (Linux + KVM)
├── GPU 0 -> VFIO driver binding -> VM1 (direct access)
├── GPU 1 -> VFIO driver binding -> VM2 (direct access)
├── GPU 2 -> NVIDIA driver -> Host use
└── GPU 3 -> NVIDIA driver -> Host use
Setup procedure:
```shell
# 1. Enable IOMMU (GRUB)
# Add to /etc/default/grub:
# GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

# 2. Find GPU PCI ID
lspci -nn | grep NVIDIA
# 41:00.0 3D controller [0302]: NVIDIA Corporation A100 [10de:20b2]

# 3. Bind to VFIO driver
echo "10de 20b2" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "0000:41:00.0" > /sys/bus/pci/devices/0000:41:00.0/driver/unbind
echo "0000:41:00.0" > /sys/bus/pci/drivers/vfio-pci/bind

# 4. Add device when starting VM with QEMU/KVM
# -device vfio-pci,host=41:00.0
```
Pros and Cons:
| Pros | Cons |
|---|---|
| Native performance (nearly zero overhead) | 1 GPU = 1 VM (no sharing) |
| Simple setup | No live migration |
| All CUDA features supported | Potential GPU resource waste |
4-3. vGPU (NVIDIA Virtual GPU)
vGPU uses time-slicing to share a single physical GPU across multiple VMs.
vGPU Architecture:
Physical GPU (A100-80GB)
├── vGPU Instance 1 (A100-4C, 4GB) -> VM1
├── vGPU Instance 2 (A100-4C, 4GB) -> VM2
├── vGPU Instance 3 (A100-8C, 8GB) -> VM3
└── ... (as many as remaining capacity allows)
Time-slicing:
t=0ms [VM1 runs] -> t=16ms [VM2 runs] -> t=32ms [VM3 runs] -> ...
vGPU Profile Types
| Series | Purpose | Example |
|---|---|---|
| A-series | Virtual Application | A100-1-5A (5GB, VDI apps) |
| B-series | Virtual PC | A100-2-10B (10GB, VDI desktop) |
| C-series | Compute | A100-4-20C (20GB, AI compute) |
| Q-series | Quadro | A100-8-40Q (40GB, professional graphics) |
For LG Uplus GPU Technology TF, C-series (Compute) will be the primary focus.
vGPU Scheduler
The vGPU scheduler determines how GPU time is divided among instances:
- Equal Share: every vGPU gets an equal time slice. Fair, but priorities cannot be set.
- Fixed Share: time is allocated in proportion to vGPU profile size (a 4GB vGPU and an 8GB vGPU split time 1:2).
- Best Effort: idle vGPU time is redistributed to active vGPUs. Most efficient, but performance is less predictable.
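The Equal Share and Fixed Share policies reduce to simple time-share functions. A toy model, not NVIDIA's scheduler:

```python
# Time shares under two vGPU scheduler policies: equal slices vs
# slices proportional to profile size.

def equal_share(profiles):
    return {name: 1 / len(profiles) for name in profiles}

def fixed_share(profiles):
    total = sum(profiles.values())
    return {name: size / total for name, size in profiles.items()}

vms = {"vm1": 4, "vm2": 4, "vm3": 8}   # vGPU profile sizes in GB
print(equal_share(vms))   # every VM gets 1/3 of GPU time
print(fixed_share(vms))   # vm3's 8GB profile earns a 2x time share
```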
4-4. MIG (Multi-Instance GPU)
MIG is a hardware partitioning technology available on Ampere-and-later datacenter GPUs (A100, A30, H100) that physically partitions the GPU. Unlike vGPU's time-slicing, MIG completely isolates SMs and memory.
MIG Architecture (A100-80GB):
Full GPU: 108 SM, 80GB HBM2e
├── MIG Instance 1 (7g.80gb): 98 SM, 80GB <- Nearly full (solo use)
or
├── MIG Instance 1 (4g.40gb): 56 SM, 40GB
├── MIG Instance 2 (3g.40gb): 42 SM, 40GB
or
├── MIG Instance 1 (3g.40gb): 42 SM, 40GB
├── MIG Instance 2 (2g.20gb): 28 SM, 20GB
├── MIG Instance 3 (1g.10gb): 14 SM, 10GB
├── MIG Instance 4 (1g.10gb): 14 SM, 10GB
or (maximum partition)
├── MIG Instance 1~7 (1g.10gb): 14 SM each, 10GB each (x7)
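A proposed layout must respect the GPU's slice budget: an A100-80GB exposes 7 compute slices and 80GB of memory to distribute. A small validator sketch (profile table limited to the profiles shown above; function names are illustrative):

```python
# Validate a proposed A100-80GB MIG layout against its compute-slice
# and memory budget.

PROFILES = {             # name: (compute slices, memory GB)
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}

def layout_fits(instances, max_slices=7, max_mem_gb=80):
    slices = sum(PROFILES[p][0] for p in instances)
    mem = sum(PROFILES[p][1] for p in instances)
    return slices <= max_slices and mem <= max_mem_gb

print(layout_fits(["3g.40gb", "1g.10gb", "1g.10gb", "1g.10gb"]))  # fits
print(layout_fits(["4g.40gb", "3g.40gb", "1g.10gb"]))             # 8 slices: no
```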
MIG Configuration Commands
```shell
# Enable MIG
sudo nvidia-smi -i 0 -mig 1

# Check available MIG profiles
nvidia-smi mig -lgip

# Create GPU Instances (profile IDs come from -lgip; on A100-80GB, 9 = 3g.40gb, 19 = 1g.10gb)
sudo nvidia-smi mig -i 0 -cgi 9,19,19,19  # 3g.40gb + 1g.10gb x3

# Create Compute Instances
sudo nvidia-smi mig -i 0 -cci

# Check current MIG status
nvidia-smi mig -lgi
nvidia-smi mig -lci

# Delete MIG instances
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi

# Disable MIG
sudo nvidia-smi -i 0 -mig 0
```
MIG vs vGPU Comparison
| Feature | MIG | vGPU |
|---|---|---|
| Isolation Level | Physical (SM + memory fully separated) | Time-slicing (software isolation) |
| Performance Predictability | Consistent (dedicated resources) | Variable (affected by other VMs) |
| Max Instances | 7 (A100/H100) | Many (within GPU memory limits) |
| Supported GPUs | A100, H100, A30 | Most datacenter GPUs |
| Flexibility | Fixed profiles (reconfiguration needed) | Dynamic allocation possible |
| Licensing | No additional license needed | vGPU license required |
| Use Cases | Inference serving, small-scale training | VDI, mixed workloads |
MIG on K8s: NVIDIA MIG Manager
# MIG Configuration ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: mig-parted-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
mig-configs:
all-1g.10gb:
- device-filter: ["0x20B210DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 7
mixed-config:
- device-filter: ["0x20B210DE"]
devices: all
mig-enabled: true
mig-devices:
"3g.40gb": 1
"1g.10gb": 4
4-5. SR-IOV (NIC Virtualization)
SR-IOV virtualizes NICs for direct VM assignment. This is important when combined with GPU Direct RDMA.
SR-IOV Structure:
Physical NIC (ConnectX-7)
├── PF (Physical Function): Managed by host driver
├── VF 0 (Virtual Function) -> VM1 (direct assignment, native performance)
├── VF 1 -> VM2
├── VF 2 -> VM3
└── ... (up to 128 VFs)
Advantages:
- VMs access NIC directly without virtual bridge
- Near-native network performance
- Minimal CPU overhead
GPU Direct RDMA Combination:
GPU in VM <-> VF(SR-IOV NIC) <-> InfiniBand <-> Remote GPU
(PCIe direct) (SR-IOV bypass) (RDMA)
4-6. KubeVirt
KubeVirt manages VMs as first-class resources on Kubernetes. It is a core technology when containers and VMs need to run on the same platform.
```yaml
# KubeVirt VM with GPU PCI Passthrough
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - name: gpu
              deviceName: nvidia.com/A100
        resources:
          requests:
            memory: '32Gi'
            cpu: '8'
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/ubuntu:22.04
```
KubeVirt + vGPU:
```yaml
# KubeVirt VM with vGPU allocation
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vgpu-vm
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            - name: vgpu
              deviceName: nvidia.com/NVIDIA_A100-4C
        resources:
          requests:
            memory: '16Gi'
```
Use Cases:
- Legacy VM workloads: Migrating existing VM-based AI workloads to K8s
- Mixed environments: Running containers + VMs simultaneously on the same K8s cluster
- GPU sharing: Flexibly allocating GPUs to VMs and containers through vGPU
5. High-Speed Networking: InfiniBand and RDMA
5-1. InfiniBand Architecture
Distributed GPU training performance is determined by the network. No matter how fast the GPUs are, slow inter-GPU communication degrades overall performance.
InfiniBand vs Ethernet Comparison
| Feature | InfiniBand NDR | RoCE v2 (100GbE) | TCP/IP (100GbE) |
|---|---|---|---|
| Bandwidth | 400 Gbps | 100 Gbps | 100 Gbps |
| Latency | 0.5us | 1~2us | 10~50us |
| RDMA Support | Native | RoCE v2 | None (kernel path) |
| CPU Overhead | Nearly zero | Low | High |
| Congestion Control | Credit-based | PFC/ECN | TCP congestion control |
| Cost | Very high | Medium | Low |
| Use Case | HPC, AI training | AI training (cloud) | General workloads |
InfiniBand Generations
InfiniBand Speed Evolution:
SDR (2001): 10 Gbps
DDR (2005): 20 Gbps
QDR (2008): 40 Gbps
FDR (2011): 56 Gbps
EDR (2014): 100 Gbps
HDR (2018): 200 Gbps
NDR (2022): 400 Gbps
XDR (2024): 800 Gbps
GDR (2026): 1.6 Tbps (planned)
InfiniBand Network Components
InfiniBand Fabric Structure:
Leaf Switch (ToR)
├── HCA (Host Channel Adapter) -- Server 1 [GPU 0~7]
├── HCA -- Server 2 [GPU 0~7]
├── HCA -- Server 3 [GPU 0~7]
└── HCA -- Server 4 [GPU 0~7]
Spine Switch
├── Leaf Switch 1
├── Leaf Switch 2
├── Leaf Switch 3
└── Leaf Switch 4
Management Components:
- Subnet Manager (OpenSM): LID assignment, routing table management
- LID (Local ID): Subnet address (16-bit)
- GID (Global ID): Global address (128-bit, IPv6-like)
- GUID (Globally Unique ID): Hardware unique identifier
5-2. RDMA (Remote Direct Memory Access)
RDMA accesses remote memory directly without CPU involvement. It is the backbone of distributed GPU training.
TCP/IP Transfer (Traditional):
App -> Socket API -> TCP/IP Stack (Kernel) -> NIC Driver -> NIC -> Network
^ CPU involved (copy, checksum, segmentation)
RDMA Transfer:
App -> RDMA Verbs -> NIC (direct) -> Network
^ Zero-copy, CPU bypass
RDMA Transport Types
| Transport | Description | Use Case |
|---|---|---|
| InfiniBand | Native RDMA | HPC, AI clusters |
| RoCE v2 | RDMA over UDP/IP | Cloud environments |
| iWARP | RDMA over TCP/IP | Legacy environments |
RDMA Programming Basics
```c
// ibverbs-based RDMA Write example (simplified; error checks and the
// scatter/gather list setup are omitted for brevity)
#include <infiniband/verbs.h>
#include <string.h>

// 1. Open device
struct ibv_context *ctx = ibv_open_device(dev);

// 2. Create Protection Domain
struct ibv_pd *pd = ibv_alloc_pd(ctx);

// 3. Register memory (so the NIC can access it directly)
struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

// 4. Create Queue Pair
struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr);

// 5. RDMA Write (write directly to remote memory)
struct ibv_send_wr wr, *bad_wr = NULL;
memset(&wr, 0, sizeof(wr));
wr.opcode = IBV_WR_RDMA_WRITE;
wr.wr.rdma.remote_addr = remote_addr;
wr.wr.rdma.rkey = remote_key;
ibv_post_send(qp, &wr, &bad_wr);
```
5-3. GPU Direct
GPU Direct RDMA
GPU Direct RDMA enables direct data transfer from GPU memory to remote GPU memory.
Normal Path (without GPU Direct):
GPU0 -> PCIe -> Host Memory -> NIC -> Network -> NIC -> Host Memory -> PCIe -> GPU1
(copy1) (copy2) (copy3) (copy4)
GPU Direct RDMA:
GPU0 -> PCIe -> NIC -> Network -> NIC -> PCIe -> GPU1
(direct) (direct)
CPU bypass; the two host-memory staging copies are eliminated
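The benefit of removing staging hops can be felt with a toy path-latency model. The bandwidth figures below are illustrative assumptions, not measurements:

```python
# Toy model: each hop in the transfer path (GB/s) adds its own traversal
# time, so cutting the host-memory bounce shortens the path.

def transfer_time_ms(bytes_moved, hops_gbs):
    """Sum the time spent crossing each hop in the path."""
    return sum(bytes_moved / (bw * 1e9) for bw in hops_gbs) * 1e3

msg = 1 << 30  # 1 GiB gradient bucket
# Without GPU Direct: GPU->host copy, NIC send, NIC recv, host->GPU copy
staged = transfer_time_ms(msg, [25, 50, 50, 25])
# GPU Direct RDMA: NIC reads/writes GPU memory over PCIe directly
direct = transfer_time_ms(msg, [50, 50])
print(f"staged: {staged:.1f} ms, GPU Direct: {direct:.1f} ms")
```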
GPU Direct Storage (GDS)
Normal Storage Access:
NVMe -> Host Memory (bounce buffer) -> GPU Memory
CPU involved, 2 copies
GPU Direct Storage:
NVMe -> GPU Memory (direct)
CPU bypass, 1 copy
Use Cases: Large dataset loading (checkpoint recovery, data preprocessing)
NCCL + InfiniBand Combination
# NCCL environment variables (distributed training)
export NCCL_IB_HCA=mlx5_0,mlx5_1 # Specify InfiniBand HCAs
export NCCL_IB_GID_INDEX=3 # RoCE v2 GID index
export NCCL_SOCKET_IFNAME=eth0 # Control channel interface
export NCCL_DEBUG=INFO # Debug logging
# NCCL topology file (GPU-NIC mapping optimization)
export NCCL_TOPO_FILE=/path/to/topo.xml
# NCCL AllReduce benchmark
/usr/local/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
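The numbers all_reduce_perf prints can be reproduced by hand: algorithm bandwidth is bytes over time, and bus bandwidth applies the ring all-reduce correction factor 2(n-1)/n so results are comparable across GPU counts. The timing figure below is illustrative:

```python
# How nccl-tests derives algbw and busbw for all-reduce.

def allreduce_bandwidth(bytes_reduced, seconds, n_gpus):
    algbw = bytes_reduced / seconds / 1e9          # GB/s
    busbw = algbw * 2 * (n_gpus - 1) / n_gpus      # ring all-reduce factor
    return algbw, busbw

# 1 GiB reduced across 8 GPUs in 3 ms (illustrative):
algbw, busbw = allreduce_bandwidth(1 << 30, 3e-3, 8)
print(f"algbw {algbw:.0f} GB/s, busbw {busbw:.0f} GB/s")
```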
5-4. Network Performance Tuning
InfiniBand Benchmarks
```shell
# Bandwidth test
# Server: ib_write_bw --size=65536
# Client: ib_write_bw --size=65536 <server_ip>

# Latency test
# Server: ib_write_lat
# Client: ib_write_lat <server_ip>

# Example results (NDR 400Gbps):
# Bandwidth: ~48 GB/s (theoretical 50 GB/s)
# Latency: ~0.6 us
```
PFC (Priority Flow Control) Configuration
PFC is essential in RoCE v2 environments:
# Mellanox NIC PFC configuration
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0
# Enable PFC only on Priority 3 (RoCE traffic)
# DSCP -> Priority mapping
mlnx_qos -i eth0 --trust dscp
ECMP (Equal-Cost Multi-Path) Routing
ECMP in Large-Scale InfiniBand Fabrics:
Server A --- Leaf 1 -+- Spine 1 -+- Leaf 3 --- Server C
+- Spine 2 -+
+- Spine 3 -+
+- Spine 4 -+
-> Load-balance across 4 equal-cost paths
-> Hash-based (source/destination LID) distribution
-> Adaptive Routing (AR): Dynamic path selection based on congestion
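Hash-based path selection as sketched above can be modeled in a few lines (a hypothetical helper; real fabrics hash in switch hardware):

```python
# ECMP: a flow's (src, dst) pair deterministically picks one of the
# equal-cost spine paths, spreading many flows across all of them.
import hashlib

def ecmp_path(src_lid, dst_lid, n_paths=4):
    key = f"{src_lid}-{dst_lid}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths

# Many flows spread across the 4 spines; any single flow is stable:
paths = [ecmp_path(s, 77) for s in range(1000)]
assert ecmp_path(1, 77) == ecmp_path(1, 77)    # deterministic per flow
print({p: paths.count(p) for p in range(4)})   # roughly even split
```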
6. Kubernetes GPU Management
6-1. NVIDIA GPU Operator
GPU Operator automatically deploys the GPU software stack to K8s clusters.
GPU Operator Components:
GPU Operator
├── NVIDIA Driver (DaemonSet)
│ └── Auto-builds/installs kernel module
├── NVIDIA Container Toolkit
│ └── Adds GPU support to container runtime
├── NVIDIA Device Plugin
│ └── Registers GPU resources with K8s
├── DCGM Exporter
│ └── GPU metrics -> Prometheus
├── MIG Manager
│ └── Auto-applies MIG profiles
├── GPU Feature Discovery (GFD)
│ └── Auto-adds GPU labels to nodes
└── NVIDIA Validator
└── Validates installation state
Installation:
```shell
# Install GPU Operator via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --set driver.enabled=true \
    --set mig.strategy=mixed \
    --set dcgmExporter.enabled=true
```
6-2. GPU Device Plugin
```yaml
# Allocating GPUs to a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.3.0-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 2  # Request 2 GPUs
      command: ['nvidia-smi']
```
Time-Slicing Configuration (GPU Sharing)
For GPUs that don't support MIG, multiple Pods can share a GPU:
```yaml
# GPU Time-Slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Split 1 GPU into 4 (time-slicing)
```
6-3. GPU Scheduling
Basic Scheduling
K8s default GPU scheduling is simple: place Pods on nodes with sufficient available GPUs. However, large-scale GPU clusters require more sophisticated scheduling.
Topology-Aware Scheduling
Example 8-GPU topology (PCIe-based server without NVSwitch):
GPU0 - NVLink - GPU1 (same NVLink island)
GPU2 - NVLink - GPU3 (same NVLink island)
GPU4 - NVLink - GPU5 (same NVLink island)
GPU6 - NVLink - GPU7 (same NVLink island)
GPU0 - PCIe - GPU4 (different island, PCIe connection)
-> 4-GPU training: GPU0,1,2,3 (NVLink) >> GPU0,2,4,6 (PCIe)
(On DGX/HGX H100 systems, NVSwitch fully connects all 8 GPUs; topology-aware placement matters most on PCIe-based servers like the one above.)
# Topology-aware scheduling with NodeSelector
apiVersion: v1
kind: Pod
metadata:
name: distributed-training
spec:
nodeSelector:
nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
nvidia.com/gpu.count: '8'
containers:
- name: trainer
resources:
limits:
nvidia.com/gpu: 4
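A topology-aware allocator's core decision can be sketched as scoring candidate GPU sets by how many of their pairs are NVLink-connected. The adjacency set below mirrors the hypothetical pairwise topology diagrammed above; a real scheduler would derive it from `nvidia-smi topo -m`.

```python
# Hedged sketch: score candidate GPU sets by counting NVLink-connected pairs.
# NVLINK mirrors the illustrative island topology above (0-1, 2-3, 4-5, 6-7).
from itertools import combinations

NVLINK = {frozenset(p) for p in [(0, 1), (2, 3), (4, 5), (6, 7)]}

def nvlink_pairs(gpus: tuple[int, ...]) -> int:
    """Number of GPU pairs in the set connected by NVLink (the rest use PCIe)."""
    return sum(frozenset(p) in NVLINK for p in combinations(gpus, 2))

assert nvlink_pairs((0, 1, 2, 3)) == 2   # two NVLink pairs -> preferred placement
assert nvlink_pairs((0, 2, 4, 6)) == 0   # all inter-GPU traffic over PCIe
```

A scheduler would pick the candidate set maximizing this score before binding the Pod.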
Gang Scheduling
In distributed training, all GPUs must be allocated simultaneously. Partial allocation wastes resources as allocated GPUs wait for the rest.
# Gang Scheduling with Volcano
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: distributed-training
spec:
minAvailable: 4 # Schedule minimum 4 Pods simultaneously
schedulerName: volcano
tasks:
- replicas: 4
name: worker
template:
spec:
containers:
- name: trainer
image: training-image:latest
resources:
limits:
nvidia.com/gpu: 8 # 8 GPUs per node
Bin-packing vs Spread Strategy
Bin-packing (resource consolidation):
Node1: [GPU0 used, GPU1 used, GPU2 used, GPU3 free]
Node2: [GPU0 free, GPU1 free, GPU2 free, GPU3 free]
-> Pros: Power savings on idle nodes, resource efficiency
-> Cons: Potential hot spots
Spread (distribution):
Node1: [GPU0 used, GPU1 free, GPU2 used, GPU3 free]
Node2: [GPU0 used, GPU1 free, GPU2 used, GPU3 free]
-> Pros: Load distribution, fault isolation
-> Cons: Resource fragmentation
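The two placement strategies reduce to opposite node-selection rules. A minimal sketch, with nodes modeled as free-GPU counters (node names and counts are illustrative):

```python
# Hedged sketch: placing single-GPU pods under bin-packing vs spread.

def place(free: dict[str, int], strategy: str) -> str:
    candidates = [n for n, f in free.items() if f > 0]
    if strategy == "binpack":                       # fill the fullest node first
        node = min(candidates, key=lambda n: free[n])
    else:                                           # "spread": pick the emptiest node
        node = max(candidates, key=lambda n: free[n])
    free[node] -= 1
    return node

assert place({"node1": 1, "node2": 4}, "binpack") == "node1"  # keeps node2 drainable
assert place({"node1": 1, "node2": 4}, "spread") == "node2"   # distributes load
```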
GPU Feature Discovery (GFD)
# Example node labels added by GFD
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.count=8
nvidia.com/gpu.memory=81920
nvidia.com/cuda.driver.major=535
nvidia.com/mig.strategy=mixed
nvidia.com/gpu.family=ampere
nvidia.com/mig-1g.10gb.count=4
nvidia.com/mig-3g.40gb.count=1
6-4. GPU Monitoring on K8s
DCGM Exporter + Prometheus + Grafana
# DCGM Exporter DaemonSet (included in GPU Operator)
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: dcgm-exporter
spec:
selector:
matchLabels:
app: nvidia-dcgm-exporter
endpoints:
- port: metrics
interval: 15s
Key Prometheus Metrics:
# GPU Utilization
DCGM_FI_DEV_GPU_UTIL # SM utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL # Memory bandwidth utilization (%)
# Memory
DCGM_FI_DEV_FB_USED # Used framebuffer (MB)
DCGM_FI_DEV_FB_FREE # Free framebuffer (MB)
# Temperature/Power
DCGM_FI_DEV_GPU_TEMP # GPU temperature (C)
DCGM_FI_DEV_POWER_USAGE # Power usage (W)
# Errors
DCGM_FI_DEV_XID_ERRORS # XID error codes
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL # Single-bit ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL # Double-bit ECC errors
# PCIe
DCGM_FI_DEV_PCIE_TX_THROUGHPUT # PCIe transmit throughput
DCGM_FI_DEV_PCIE_RX_THROUGHPUT # PCIe receive throughput
Alert Configuration Examples:
# Prometheus Alert Rules
groups:
- name: gpu-alerts
rules:
- alert: GPUMemoryAlmostFull
expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
for: 5m
labels:
severity: warning
annotations:
summary: 'GPU memory usage above 95%'
- alert: GPUThermalThrottling
expr: DCGM_FI_DEV_GPU_TEMP > 85
for: 2m
labels:
severity: critical
annotations:
summary: 'GPU temperature exceeds 85C'
- alert: GPUXIDError
expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
labels:
severity: critical
annotations:
summary: 'GPU XID error detected'
7. AI Workload Optimization
7-1. Training Optimization
Mixed Precision Training
# PyTorch Automatic Mixed Precision (AMP)
import torch
from torch.cuda.amp import autocast, GradScaler
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()
for data, target in dataloader:
optimizer.zero_grad()
# Forward Pass in FP16
with autocast():
output = model(data.cuda())
loss = criterion(output, target.cuda())
# Loss Scaling + Backward Pass
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Precision Comparison:
| Precision | Bits | Memory Savings | Tensor Core Support | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | No (runs on CUDA cores; use TF32 path) | Default training |
| TF32 | 19 | - | Yes (A100+) | Auto-applied |
| FP16 | 16 | 2x | Yes | Mixed Precision |
| BF16 | 16 | 2x | Yes (A100+) | LLM training (wider range) |
| FP8 (E4M3) | 8 | 4x | Yes (H100+) | Transformer Engine |
| INT8 | 8 | 4x | Yes | Inference quantization |
DeepSpeed ZeRO
# DeepSpeed ZeRO Stage 3 Configuration
# ds_config.json
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9
},
"bf16": {
"enabled": true
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}
ZeRO Memory Partitioning:
ZeRO Stage 0 (Default):
GPU0: [Model] + [Gradient] + [Optimizer State]
GPU1: [Model] + [Gradient] + [Optimizer State]
-> Full replication on all GPUs
ZeRO Stage 1 (Optimizer partitioning):
GPU0: [Model] + [Gradient] + [Optimizer 1/2]
GPU1: [Model] + [Gradient] + [Optimizer 2/2]
-> ~1.5x memory savings
ZeRO Stage 2 (+ Gradient partitioning):
GPU0: [Model] + [Gradient 1/2] + [Optimizer 1/2]
GPU1: [Model] + [Gradient 2/2] + [Optimizer 2/2]
-> ~2x memory savings
ZeRO Stage 3 (+ Model partitioning):
GPU0: [Model 1/2] + [Gradient 1/2] + [Optimizer 1/2]
GPU1: [Model 2/2] + [Gradient 2/2] + [Optimizer 2/2]
-> ~Nx memory savings (N = number of GPUs)
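The savings in the diagram above can be reproduced with the ZeRO paper's mixed-precision Adam accounting: 2 bytes of FP16 parameters, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimizer state (master weights + two moments) per parameter, with each stage partitioning one more term across N GPUs. A sketch under those assumptions:

```python
# Hedged sketch: per-GPU memory under each ZeRO stage (mixed-precision Adam,
# 2B params + 2B grads + 12B optimizer states per parameter, per the ZeRO paper).

def zero_bytes_per_gpu(params: int, n_gpus: int, stage: int) -> float:
    p, g, o = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        o /= n_gpus          # Stage 1: partition optimizer states
    if stage >= 2:
        g /= n_gpus          # Stage 2: also partition gradients
    if stage >= 3:
        p /= n_gpus          # Stage 3: also partition parameters
    return p + g + o

baseline = zero_bytes_per_gpu(1_000_000, 2, 0)                       # 16 MB
assert baseline / zero_bytes_per_gpu(1_000_000, 2, 1) == 1.6         # ~1.5x on 2 GPUs
assert round(baseline / zero_bytes_per_gpu(1_000_000, 2, 2), 2) == 1.78  # ~2x
assert zero_bytes_per_gpu(1_000_000, 2, 3) == baseline / 2           # Nx
```

Note the stage-1/2 ratios grow with GPU count: on large N, Stage 1 approaches 4x and Stage 2 approaches 8x savings, since the optimizer state (12 of 16 bytes) dominates.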
Parallelism Strategy Comparison
Data Parallelism:
Split input data into N parts across N GPUs with identical models
-> AllReduce for gradient synchronization
-> Communication volume: O(model_size)
Tensor Parallelism:
Split a single layer (matrix) into N parts across N GPUs
-> Communication needed at each layer in Forward/Backward
-> Requires high-speed inter-GPU communication (NVLink)
Pipeline Parallelism:
Place model layers sequentially across N GPUs
-> Process micro-batches in pipeline fashion
-> Minimizing bubbles (idle time) is key
3D Parallelism (LLM Training):
Data Parallel x Tensor Parallel x Pipeline Parallel
Example: 256 GPUs = 32 DP x 4 TP x 2 PP
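The 256-GPU decomposition above implies a mapping from each global rank to a (DP, TP, PP) coordinate. A minimal sketch; the ordering (TP varying fastest, so TP peers land on NVLink-close GPUs within a node) is a common convention, not the only one:

```python
# Hedged sketch: decompose a global rank into (DP, TP, PP) coordinates for
# the 256-GPU = 32 DP x 4 TP x 2 PP example above.
DP, TP, PP = 32, 4, 2
assert DP * TP * PP == 256

def coords(rank: int) -> tuple[int, int, int]:
    tp = rank % TP                 # TP varies fastest: peers need NVLink-speed links
    pp = (rank // TP) % PP         # PP next: adjacent pipeline stages
    dp = rank // (TP * PP)         # DP slowest: AllReduce can cross nodes
    return dp, tp, pp

assert coords(0) == (0, 0, 0)
assert coords(5) == (0, 1, 1)
assert coords(255) == (31, 3, 1)
```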
7-2. Inference Optimization
TensorRT Optimization
# Model optimization via TensorRT (Python API)
import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
# Parse ONNX model
with open("model.onnx", "rb") as f:
parser.parse(f.read())
# Build configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30) # 1GB
config.set_flag(trt.BuilderFlag.FP16) # Enable FP16 precision
# Build engine
engine = builder.build_serialized_network(network, config)
Triton Inference Server
Triton Architecture:
Client -> HTTP/gRPC -> Triton Server
├── Model Repository
│ ├── model_a/ (TensorRT)
│ ├── model_b/ (ONNX Runtime)
│ └── model_c/ (Python Backend)
├── Scheduler
│ ├── Dynamic Batching
│ └── Sequence Batching
├── Model Ensemble
│ └── Preprocessing -> Model -> Postprocessing pipeline
└── Metrics (Prometheus)
# Triton Model Configuration (config.pbtxt)
name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 64
input [
{
name: "input"
data_type: TYPE_FP16
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP16
dims: [ 1000 ]
}
]
dynamic_batching {
preferred_batch_size: [ 16, 32, 64 ]
max_queue_delay_microseconds: 100
}
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
}
]
vLLM: LLM Inference Optimization
from vllm import LLM, SamplingParams
# Start vLLM server
llm = LLM(
model="meta-llama/Llama-3-70B",
tensor_parallel_size=4, # 4 GPU Tensor Parallel
gpu_memory_utilization=0.9, # Use 90% GPU memory
max_model_len=8192,
dtype="bfloat16",
)
# Key optimization techniques:
# 1. PagedAttention: Manages KV Cache in pages (memory efficiency)
# 2. Continuous Batching: Dynamically adds/removes requests from batches
# 3. Prefix Caching: Reuses KV Cache for common prefixes
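The PagedAttention idea in point 1 can be sketched as a per-sequence block table that maps logical token positions onto fixed-size physical KV-cache blocks, so sequences grow on demand instead of preallocating contiguous memory. The block size and the toy sequential allocator are illustrative, not vLLM's internals:

```python
# Hedged sketch: PagedAttention-style block table. Each logical block of
# BLOCK_SIZE tokens maps to one physical KV-cache block, allocated lazily.
BLOCK_SIZE = 16

class BlockTable:
    def __init__(self) -> None:
        self.blocks: list[int] = []   # physical block ids for this sequence
        self.next_free = 0            # toy allocator: hands out sequential ids

    def append_token(self, pos: int) -> None:
        if pos % BLOCK_SIZE == 0:     # first token of a new logical block
            self.blocks.append(self.next_free)
            self.next_free += 1

    def physical_slot(self, pos: int) -> tuple[int, int]:
        """Translate a token position to (physical block, offset within block)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

seq = BlockTable()
for pos in range(40):                 # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token(pos)
assert len(seq.blocks) == 3
assert seq.physical_slot(17) == (1, 1)
```

Because blocks are small and non-contiguous, freed blocks from finished requests are immediately reusable, which is what enables continuous batching without fragmentation.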
7-3. Performance Bottleneck Analysis Patterns
[Performance Diagnosis Flowchart]
1. Check nvidia-smi
├── GPU Util < 30%
│ ├── Likely CPU/IO bottleneck
│ │ ├── Check top/htop -> CPU at 100%? -> Optimize data loading
│ │ └── Check iostat -> Disk I/O? -> Use NVMe/GDS
│ └── Kernels too small -> CUDA Graphs, increase batch
├── GPU Util > 90%, performance still low
│ ├── Possibly Memory-Bound
│ │ ├── Nsight Compute -> Check Memory Throughput
│ │ └── Check Memory Coalescing patterns
│ └── Possibly Warp Divergence
│ └── Nsight Compute -> Check Warp Stall Reasons
└── GPU Util irregular (fluctuating)
├── Synchronization bottleneck -> Optimize async execution
└── Communication bottleneck (distributed) -> NCCL profiling
2. Distributed Training Bottleneck
├── Check NCCL AllReduce time
│ ├── Check NCCL regions in Nsight Systems
│ └── Analyze communication/compute ratio
├── Check InfiniBand bandwidth
│ └── ib_write_bw benchmark
└── Check GPU topology
└── nvidia-smi topo -m
8. Linux System Troubleshooting
8-1. GPU-Related Linux Commands
# GPU device information
lspci -vv -s $(lspci | grep NVIDIA | head -1 | awk '{print $1}')
# GPU driver version
cat /proc/driver/nvidia/version
# CUDA version
nvcc --version
# GPU memory usage detail
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv
# GPU memory per process
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
# PCIe bandwidth check
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
# GPU clock information
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory,clocks.max.graphics,clocks.max.memory --format=csv
# GPU-related messages in dmesg
dmesg | grep -i -E "nvidia|nvrm|gpu|xid"
# InfiniBand status check
ibstat
ibstatus
ibv_devinfo
# RDMA device check
rdma link show
rdma resource show
# NIC status
ethtool -i eth0   # driver info for the Mellanox netdev (driver: mlx5_core)
mlxlink -d /dev/mst/mt4125_pciconf0 -m
# Kernel module status
lsmod | grep nvidia
lsmod | grep mlx
lsmod | grep vfio
# NUMA topology (GPU-CPU affinity)
numactl --hardware
lstopo --of ascii
nvidia-smi topo -m
8-2. Common GPU Issues and Solutions
XID Error Interpretation
XID Errors are error codes reported by the NVIDIA GPU driver. They appear in dmesg.
| XID Code | Meaning | Severity | Response |
|---|---|---|---|
| XID 13 | Graphics Engine Exception | High | Possible CUDA kernel bug, update driver |
| XID 31 | GPU Memory Page Fault | High | Memory access error, check code |
| XID 43 | GPU stopped processing | High | GPU hang, reset needed |
| XID 45 | Preemptive cleanup | Medium | Timeout, check workload |
| XID 48 | Double Bit ECC Error | Critical | Hardware defect, RMA |
| XID 63 | ECC page retirement | Medium | Page retired, RMA if accumulated |
| XID 64 | ECC page retirement (DBE) | High | Double-bit error, consider RMA |
| XID 79 | GPU has fallen off the bus | Critical | PCIe disconnection, check hardware |
| XID 94 | Contained ECC error | Medium | ECC error within MIG instance |
| XID 95 | Uncontained ECC error | Critical | MIG isolation failure, GPU reset needed |
# Monitor XID errors
dmesg -w | grep "NVRM: Xid"
# Example output:
# NVRM: Xid (PCI:0000:41:00): 79, pid=0, GPU has fallen off the bus
# NVRM: Xid (PCI:0000:41:00): 48, pid=12345, DBE (double bit error)
GPU Reset / Driver Reload
# Attempt GPU reset on hang
nvidia-smi --gpu-reset -i 0
# Driver reload (all GPU processes must be terminated)
# 1. Check GPU-using processes
fuser -v /dev/nvidia*
# 2. Kill processes
kill -9 <pid>
# 3. Unload/load driver
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
# If that doesn't work
sudo systemctl restart nvidia-persistenced
CUDA OOM Debugging
# Debugging OOM in PyTorch
# Check memory usage
import torch
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
# Memory snapshot (detailed analysis)
torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run training code ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
# Visualize at https://pytorch.org/memory_viz
OOM Mitigation:
- Reduce batch size
- Use Gradient Accumulation
- Enable Mixed Precision (FP16/BF16)
- Gradient Checkpointing (Activation Recomputation)
- Apply DeepSpeed ZeRO Stage 2/3
- Offload model parameters to CPU/NVMe
ECC Errors and RMA Procedure
# Check ECC errors
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv
# Check Retired Pages
nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
# RMA Criteria:
# - Recurring Uncorrected (Double-bit) ECC errors
# - Retired Pages exceeding threshold (typically 60+ pages)
# - Multiple XID 48 occurrences
# - GPU fell off bus (XID 79)
Thermal Throttling Response
# Temperature monitoring
nvidia-smi --query-gpu=temperature.gpu,temperature.memory --format=csv -l 1
# Check/set power limit
nvidia-smi --query-gpu=power.limit,power.default_limit,power.max_limit --format=csv
sudo nvidia-smi -pl 300 # Set power limit to 300W
# Check clock speeds (decrease during throttling)
nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics --format=csv
# Check throttling reason
nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv
Thermal Throttling Prevention:
- Verify server room cooling capacity (400-1000W heat per GPU)
- Optimize airflow (Hot/Cold Aisle separation)
- Consider liquid cooling (DGX H100 supports liquid cooling option)
- Set power limits (performance vs temperature tradeoff)
9. Interview Questions: Top 30
GPU Architecture and CUDA (10 Questions)
Q1. Explain the internal structure of an SM (Streaming Multiprocessor) and describe how Warp Divergence affects performance.
Key Answer Points: SM consists of CUDA Cores, Tensor Cores, Warp Scheduler, Register File, Shared Memory/L1 Cache. A Warp (32 threads) executes the same instruction under the SIMT model; if/else branches cause both paths to execute sequentially, resulting in up to 2x performance degradation.
Q2. Explain the GPU memory hierarchy from Register to Global Memory, including latency and optimization strategy for each level.
Key Answer Points: Register (1 cycle) to Shared Memory (5 cycles) to L1/L2 Cache to Global Memory (HBM, 600 cycles). Use Shared Memory tiling to reduce Global Memory accesses; maximize bandwidth utilization through Memory Coalescing.
Q3. What is Memory Coalescing and why is it important?
Key Answer Points: When 32 threads in a Warp access contiguous memory, hardware merges into a single 128-byte transaction. Non-contiguous (strided) access splits into 32 separate transactions, utilizing only 1/32 of bandwidth.
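The transaction count in that answer follows from a simple model: one transaction per distinct 128-byte segment touched by the warp. A sketch of that model (the segment size matches the answer; real hardware sectors requests more finely, so this is a simplification):

```python
# Hedged sketch: count 128-byte memory transactions for a warp's 4-byte loads.
# Simplified model: one transaction per distinct 128-byte segment touched.

def transactions(addresses: list[int]) -> int:
    return len({addr // 128 for addr in addresses})

warp = range(32)
coalesced = [tid * 4 for tid in warp]     # contiguous: 32 x 4B = one 128B segment
strided = [tid * 128 for tid in warp]     # 128B stride: every lane its own segment
assert transactions(coalesced) == 1
assert transactions(strided) == 32        # only 1/32 of the fetched bytes are used
```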
Q4. Explain the Roofline Model and how to determine whether a given kernel is Memory-Bound or Compute-Bound.
Key Answer Points: Calculate Arithmetic Intensity (FLOPs/Byte) and compare with hardware's Balance Point (Peak FLOPS / Peak BW). For H100, the balance point is 295 FLOPs/Byte; below that is Memory-Bound.
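That answer's arithmetic can be checked directly. Using the H100 numbers quoted in this guide (~3.35 TB/s HBM3 and a ~295 FLOPs/Byte balance point, implying ~989 TFLOPS FP16 Tensor Core peak):

```python
# Hedged sketch: roofline classification with the H100 figures cited above.
PEAK_FLOPS = 989e12              # ~989 TFLOPS (FP16 Tensor Core, dense)
PEAK_BW = 3.35e12                # 3.35 TB/s HBM3
BALANCE = PEAK_FLOPS / PEAK_BW   # balance point in FLOPs/Byte

def classify(flops: float, bytes_moved: float) -> str:
    return "compute-bound" if flops / bytes_moved > BALANCE else "memory-bound"

assert round(BALANCE) == 295
# Elementwise FP32 add: 1 FLOP per 12 bytes (2 loads + 1 store) -> memory-bound.
assert classify(1, 12) == "memory-bound"
# Large FP16 GEMM (N=8192): 2*N^3 FLOPs over ~3*2*N^2 bytes -> compute-bound.
assert classify(2 * 8192**3, 3 * 2 * 8192**2) == "compute-bound"
```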
Q5. Explain the key differences between A100 and H100, and how H100's Transformer Engine impacts training performance.
Key Answer Points: H100 offers 4th-gen Tensor Cores (FP8), NVLink 4.0 (900GB/s), HBM3 (3.35TB/s), Transformer Engine. Transformer Engine dynamically switches between FP8 and FP16, achieving 2x throughput improvement.
Q6. Explain the CUDA Grid-Block-Thread hierarchy and the criteria for choosing Block size.
Key Answer Points: Grid is a collection of Blocks; Blocks contain Threads and execute on the same SM. Block size should be a multiple of 32 (Warp size), maximize SM Occupancy, and account for Shared Memory and Register usage. 128 or 256 is typically a good starting point.
Q7. Explain the difference between NVIDIA Nsight Systems and Nsight Compute, and their respective use scenarios.
Key Answer Points: Nsight Systems shows system-level timeline (CPU-GPU interactions, kernel launches, memory transfers). Nsight Compute provides detailed analysis of individual kernels including SM Occupancy, Memory Throughput, and Warp Stall Reasons.
Q8. What are Shared Memory Bank Conflicts and how do you avoid them?
Key Answer Points: Shared Memory has 32 banks; when threads in the same Warp access the same bank simultaneously, accesses are serialized. Avoid by adding padding (array width of 33 instead of 32) or redesigning access patterns.
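The padding trick in that answer can be verified numerically: with 4-byte words, the bank is `(linear word index) mod 32`, so a column walk through a 32-wide array hits one bank, while width 33 spreads rows across all banks:

```python
# Hedged sketch: shared-memory bank index for 4-byte words across 32 banks.

def bank(row: int, col: int, width: int) -> int:
    return (row * width + col) % 32

column0_w32 = {bank(r, 0, 32) for r in range(32)}
column0_w33 = {bank(r, 0, 33) for r in range(32)}
assert len(column0_w32) == 1    # 32-way bank conflict: every row maps to bank 0
assert len(column0_w33) == 32   # padded width 33: each row lands in its own bank
```

The cost of the padding is one wasted word per row, traded for fully parallel column access.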
Q9. Explain methods to reduce Host-Device memory transfer overhead in CUDA.
Key Answer Points: Pinned Memory (cudaMallocHost), async transfers (cudaMemcpyAsync + CUDA Streams), Zero-copy Memory (Unified Memory), overlapping transfers with computation, CUDA Graphs.
Q10. If GPU utilization is only 30%, what debugging steps would you follow?
Key Answer Points: (1) nvidia-smi for basic memory/utilization check -> (2) Nsight Systems for CPU vs GPU time ratio -> (3) Check data loading bottleneck (num_workers, prefetch) -> (4) Check kernel size (increase batch) -> (5) Check synchronization bottleneck (CUDA Graphs) -> (6) Check PCIe bottleneck.
Virtualization and Networking (10 Questions)
Q11. Compare PCI Passthrough, vGPU, and MIG, describing appropriate use scenarios for each.
Key Answer Points: PCI Passthrough = 1:1 assignment (max performance), vGPU = time-slicing (flexibility), MIG = physical partition (isolation + predictability). Large training: Passthrough; multi-tenant inference: MIG; mixed VDI: vGPU.
Q12. Explain the role of IOMMU and its importance in GPU virtualization.
Key Answer Points: IOMMU (Intel VT-d) translates device DMA requests to virtual addresses, isolating VM memory. Without it, GPUs could access other VMs' memory, creating security vulnerabilities.
Q13. Explain MIG profile configuration. When maximally partitioning an A100-80GB, what are each instance's specifications?
Key Answer Points: Up to 7 1g.10gb instances. Each has about 14 SMs, 10GB HBM2e, independent L2 Cache, separate memory controller. Must create GPU Instance (GI) first, then Compute Instance (CI) within it for CUDA use.
Q14. Explain the differences between InfiniBand and Ethernet, and why InfiniBand is important for distributed training.
Key Answer Points: InfiniBand natively supports RDMA with 0.5us latency and CPU-bypass transfer. NDR provides 400Gbps bandwidth. Distributed training's AllReduce communication exchanges hundreds of GBs, so bandwidth and latency directly impact performance.
Q15. Explain how RDMA's Zero-copy transfer works.
Key Answer Points: Application registers memory via ibverbs API, allowing NIC to directly DMA access that memory. Data transfers directly from user-space memory to NIC without kernel buffer intermediary.
Q16. What advantages does GPU Direct RDMA provide for distributed training?
Key Answer Points: Transfers directly from GPU memory to NIC, eliminating Host Memory bounce buffer. Doubles effective PCIe bandwidth utilization and reduces transfer latency.
Q17. Explain the differences between RoCE v2 and InfiniBand, and why PFC configuration is critical in RoCE v2 environments.
Key Answer Points: RoCE v2 runs RDMA over UDP/IP, leveraging existing Ethernet infrastructure. However, Ethernet is lossy by design, so PFC (Priority Flow Control) is needed to pause transmission during congestion to maintain RDMA's lossless requirement.
Q18. Explain the role of NCCL and how it works in distributed training.
Key Answer Points: NCCL optimizes collective communication (AllReduce, Broadcast, AllGather) across multiple GPUs. It auto-detects NVLink, NVSwitch, InfiniBand for optimal communication paths, using Ring-AllReduce or Tree-AllReduce algorithms.
Q19. What is SR-IOV and how is it used in GPU clusters?
Key Answer Points: SR-IOV partitions physical NICs into multiple VFs (Virtual Functions) for direct VM assignment. In GPU clusters, InfiniBand/RoCE NICs are partitioned via SR-IOV in VM environments to maintain GPU Direct RDMA performance.
Q20. Compare two methods for assigning GPUs to VMs in KubeVirt.
Key Answer Points: (1) PCI Passthrough: hostDevices for direct physical GPU assignment, max performance, 1:1 mapping. (2) vGPU: mediated devices for virtual GPU assignment, GPU sharing possible but with overhead. MIG + KubeVirt combination is also possible.
K8s and Performance Optimization (10 Questions)
Q21. Explain the components of NVIDIA GPU Operator and each component's role.
Key Answer Points: Driver (kernel module), Container Toolkit (runtime integration), Device Plugin (K8s resource registration), DCGM Exporter (metrics), MIG Manager (auto MIG configuration), GFD (node labeling), Validator (verification).
Q22. Why is Topology-aware scheduling important for GPU scheduling in K8s?
Key Answer Points: NVLink-connected GPUs communicate 6-10x faster than PCIe-connected GPUs. Random GPU allocation in distributed training forces communication through PCIe instead of NVLink, significantly degrading performance.
Q23. Explain why Gang Scheduling is essential for distributed training.
Key Answer Points: AllReduce communication requires all workers to participate. If only 3 of 4 GPUs are allocated, the 3 sit idle waiting for the 4th. All-or-nothing allocation via schedulers like Volcano/Kueue is necessary.
Q24. Explain the difference between GPU Time-Slicing and MIG from a K8s perspective.
Key Answer Points: Time-Slicing is software time-sharing with no performance isolation but works on all GPUs. MIG is physical partitioning with full isolation but only supports A100/H100. In K8s, they are requested as nvidia.com/gpu replicas and nvidia.com/mig-Xg.XXgb respectively.
Q25. Name 5 key metrics to monitor with DCGM Exporter and explain each.
Key Answer Points: GPU_UTIL (SM utilization), MEM_COPY_UTIL (memory bandwidth), FB_USED (memory usage), GPU_TEMP (temperature), XID_ERRORS (hardware errors). POWER_USAGE, ECC_SBE/DBE are also important.
Q26. Explain how Mixed Precision Training works and why Loss Scaling is necessary.
Key Answer Points: Forward Pass runs in FP16 for Tensor Core utilization, gradients computed in FP16 then applied to FP32 master weights. FP16's narrow range causes small gradients to round to zero; Loss Scaling prevents this.
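The underflow that answer describes can be demonstrated with only the standard library, since `struct`'s `'e'` format performs an IEEE binary16 round-trip. The gradient value and scale factor are illustrative:

```python
# Hedged sketch: FP16 gradient underflow and the loss-scaling fix, emulating
# FP16 storage via struct's IEEE binary16 ('e') round-trip.
import struct

def to_fp16(x: float) -> float:
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                      # below FP16's smallest subnormal (~6e-8)
assert to_fp16(grad) == 0.0      # stored gradient silently flushes to zero

scale = 1024.0                   # loss scaling: multiply loss (hence grads) up front
scaled = to_fp16(grad * scale)   # 1.024e-5 is representable in FP16
assert scaled != 0.0
assert abs(scaled / scale - grad) / grad < 0.05  # unscaling recovers the gradient
```

This is why `GradScaler` multiplies the loss before `backward()` and divides gradients back out before the optimizer step against FP32 master weights.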
Q27. Explain DeepSpeed ZeRO's 3 Stages and compare memory savings for each.
Key Answer Points: Stage 1 (Optimizer State partition, ~1.5x), Stage 2 (+Gradient partition, ~2x), Stage 3 (+Model Parameter partition, ~Nx). Stage 3 has the highest communication overhead, requiring high-speed networks like InfiniBand.
Q28. Explain how Triton Inference Server's Dynamic Batching improves inference efficiency.
Key Answer Points: Individual requests are queued and batched within a configured maximum wait time for combined processing. GPUs show higher throughput with larger batches, so dynamic batching balances latency and throughput.
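The flush policy in that answer reduces to two triggers: a preferred batch size is reached, or the oldest queued request hits the delay budget. A sketch using the `preferred_batch_size` and `max_queue_delay_microseconds` values from the config.pbtxt example earlier (the decision function itself is illustrative, not Triton's implementation):

```python
# Hedged sketch: dynamic-batching flush decision. Constants mirror the earlier
# config.pbtxt example; the policy function is a simplification of Triton's.
PREFERRED = [16, 32, 64]
MAX_DELAY_US = 100

def should_flush(queue_len: int, oldest_wait_us: float) -> bool:
    return (queue_len in PREFERRED
            or queue_len >= max(PREFERRED)
            or oldest_wait_us >= MAX_DELAY_US)

assert should_flush(32, 10)        # preferred size reached: flush immediately
assert should_flush(5, 150)        # delay budget hit: flush a small batch
assert not should_flush(5, 10)     # keep waiting for a fuller batch
```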
Q29. Describe the debugging procedure when XID 79 ("GPU has fallen off the bus") occurs.
Key Answer Points: (1) Check dmesg for surrounding logs -> (2) Check PCIe link status (lspci) -> (3) Attempt GPU reset (nvidia-smi --gpu-reset) -> (4) Check physical connection (reseat) -> (5) Test different slot -> (6) RMA if recurring.
Q30. How would you design the GPU software stack for a new 1000 H100 GPU cluster?
Key Answer Points: (1) OS: Ubuntu 22.04 + latest kernel -> (2) Driver: NVIDIA Driver 535+ -> (3) Network: InfiniBand NDR + NCCL -> (4) Container: K8s + GPU Operator + DCGM -> (5) Scheduling: Volcano (Gang) + GFD (Topology-aware) -> (6) Monitoring: DCGM Exporter + Prometheus + Grafana -> (7) Storage: GPU Direct Storage + distributed filesystem -> (8) MIG/vGPU: per-workload partitioning strategy.
10. 10-Month Study Roadmap
Month 1-2: GPU Fundamentals and CUDA Programming
Goal: Understand GPU architecture + develop CUDA programming skills
| Week | Topic | Activity |
|---|---|---|
| 1 | GPU Architecture | Read NVIDIA GPU architecture whitepapers (Ampere, Hopper) |
| 2 | CUDA Basics | Implement vector addition, matrix multiplication |
| 3 | CUDA Optimization | Shared Memory tiling, Memory Coalescing exercises |
| 4 | Advanced CUDA | Warp-level primitives, CUDA Streams |
| 5-6 | cuBLAS/cuDNN | Library usage, performance comparison |
| 7-8 | Profiling | Analyze real kernels with Nsight Systems/Compute |
Resources:
- NVIDIA CUDA Programming Guide
- "Programming Massively Parallel Processors" (David Kirk, Wen-mei Hwu)
- NVIDIA DLI (Deep Learning Institute) CUDA courses
Month 3-4: Linux Systems + GPU Drivers
Goal: Kernel/driver level understanding + GPU troubleshooting
| Week | Topic | Activity |
|---|---|---|
| 1-2 | Linux Kernel Basics | Memory management, device drivers, PCIe |
| 3-4 | GPU Drivers | NVIDIA driver installation/configuration, module structure |
| 5-6 | Troubleshooting | XID error analysis, ECC error response, dmesg analysis |
| 7-8 | Performance Tools | System analysis using perf, strace, eBPF |
Month 5-6: Virtualization (Core!)
Goal: KVM/QEMU + PCI Passthrough + vGPU + MIG hands-on
| Week | Topic | Activity |
|---|---|---|
| 1-2 | KVM/QEMU | VM creation, IOMMU setup, basic virtualization |
| 3-4 | PCI Passthrough | GPU VFIO binding, GPU assignment to VM |
| 5-6 | MIG | MIG profile configuration, performance testing |
| 7-8 | vGPU | vGPU license setup, scheduler comparison |
Month 7-8: Networking (InfiniBand/RDMA)
Goal: InfiniBand architecture + RDMA programming + NCCL tuning
| Week | Topic | Activity |
|---|---|---|
| 1-2 | InfiniBand Basics | Architecture, Subnet Manager, basic commands |
| 3-4 | RDMA | ibverbs programming, benchmarks |
| 5-6 | GPU Direct | GPU Direct RDMA/Storage hands-on |
| 7-8 | NCCL Tuning | Distributed training NCCL benchmarks, env var optimization |
Month 9-10: Kubernetes + Integration Project
Goal: K8s GPU management + large-scale cluster operations + portfolio completion
| Week | Topic | Activity |
|---|---|---|
| 1-2 | GPU Operator | Installation, configuration, MIG Manager |
| 3-4 | Scheduling | Topology-aware, Gang Scheduling (Volcano) |
| 5-6 | Monitoring | DCGM + Prometheus + Grafana dashboards |
| 7-8 | Integration Project | Complete portfolio projects + interview prep |
11. Portfolio Projects (3)
Project 1: CUDA Kernel Optimization (Matrix Multiplication Benchmark)
Goal: Naive CUDA to Shared Memory Tiling to Tensor Core utilization to cuBLAS comparison
Project Structure:
cuda-matmul-benchmark/
├── src/
│ ├── naive_matmul.cu # Naive implementation
│ ├── tiled_matmul.cu # Shared Memory tiling
│ ├── wmma_matmul.cu # Tensor Core (WMMA API)
│ └── cublas_matmul.cu # cuBLAS wrapper
├── benchmark/
│ ├── run_benchmarks.sh
│ └── plot_results.py # Result visualization
├── profiles/
│ ├── nsight_systems/
│ └── nsight_compute/
└── README.md
Key Deliverables:
- GFLOPS comparison table for each implementation
- Nsight Compute profiling results (SM Occupancy, Memory Throughput)
- Achievement rate vs cuBLAS (typically Naive: 1~5%, Tiled: 20~40%, Tensor Core: 60~80%)
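The benchmark numbers for this project come from two small formulas: square matmul performs 2*N^3 FLOPs (N^3 multiply-adds), and since every implementation does the same FLOPs, the %-of-cuBLAS figure is just the inverse time ratio. The timings below are illustrative placeholders, not measured results:

```python
# Hedged sketch: GFLOPS and %-of-cuBLAS computation for the benchmark tables.

def gflops(n: int, seconds: float) -> float:
    """Square N x N matmul: 2*N^3 floating-point operations."""
    return 2 * n**3 / seconds / 1e9

def pct_of_cublas(impl_s: float, cublas_s: float) -> float:
    return 100 * cublas_s / impl_s   # same FLOPs, so rate ratio = inverse time ratio

# Illustrative timings for N=4096 (not measured results):
assert round(gflops(4096, 0.1)) == 1374
assert pct_of_cublas(impl_s=0.5, cublas_s=0.1) == 20.0
```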
Project 2: MIG + K8s Multi-Tenant GPU Cluster
Goal: Build a GPU sharing cluster using MIG
Project Structure:
mig-k8s-multitenant/
├── infra/
│ ├── gpu-operator-values.yaml
│ ├── mig-config.yaml
│ └── monitoring/
│ ├── dcgm-dashboard.json # Grafana dashboard
│ └── alert-rules.yaml # Prometheus alerts
├── workloads/
│ ├── inference-deployment.yaml # MIG 1g.10gb inference
│ ├── training-job.yaml # MIG 3g.40gb training
│ └── notebook-statefulset.yaml # MIG 2g.20gb Jupyter
├── scheduler/
│ ├── gang-scheduling.yaml # Volcano config
│ └── priority-classes.yaml
└── docs/
├── architecture.md
└── benchmark-results.md
Key Deliverables:
- 3 workloads running simultaneously on 1 A100-80GB via MIG partitioning
- Performance isolation verification (noisy neighbor testing)
- Grafana dashboard: per-instance GPU utilization, memory, temperature
Project 3: Distributed Training Performance Profiling (NCCL + InfiniBand)
Goal: Analyze and optimize communication bottlenecks in distributed training
Project Structure:
distributed-training-profiler/
├── benchmarks/
│ ├── nccl_allreduce.sh # NCCL benchmarks
│ ├── ib_bandwidth.sh # InfiniBand bandwidth
│ └── multi_node_training.py # Actual training script
├── profiling/
│ ├── nsight_distributed.sh # Distributed env profiling
│ └── nccl_debug_analysis.py # NCCL log analysis
├── optimization/
│ ├── nccl_env_tuning.sh # NCCL env var optimization
│ └── topology_optimization.py # GPU-NIC topology optimization
└── results/
├── scaling_efficiency.png # Scaling efficiency graph
└── communication_breakdown.png # Communication time analysis
Key Deliverables:
- 2-node, 4-node, 8-node scaling efficiency measurements
- NCCL AllReduce time vs compute time ratio analysis
- Before/after comparison of env var tuning (NCCL_IB_HCA, NCCL_ALGO, etc.)
- Nsight Systems timeline showing communication/compute overlap
12. Quiz
Q1. How many FP32 CUDA Cores are in a single H100 SM, and how many total SMs does the H100 have?
Answer: A single SM has 128 FP32 CUDA Cores, and the H100 has a total of 132 SMs. Therefore, the total CUDA Core count is 128 x 132 = 16,896. For comparison, the A100 has 64 FP32 Cores per SM x 108 SMs = 6,912 total.
Q2. When maximally partitioning an A100-80GB with 1g.10gb MIG profiles, how many instances are created and how many SMs does each have?
Answer: A maximum of 7 1g.10gb instances are created. Each instance has approximately 14 SMs and 10GB HBM2e memory. Each instance has independent L2 Cache and separate memory controllers, so performance is physically isolated. To use CUDA, you must first create a GPU Instance (GI) and then create a Compute Instance (CI) within it.
Q3. What are three key differences between RDMA over InfiniBand and RoCE v2?
Answer: (1) Transport layer: InfiniBand uses its own transport protocol, while RoCE v2 operates over UDP/IP. (2) Congestion control: InfiniBand uses credit-based flow control for native lossless behavior, while RoCE v2 is Ethernet-based and requires PFC (Priority Flow Control) configuration for lossless operation. (3) Infrastructure: InfiniBand requires dedicated switches/cables, while RoCE v2 can leverage existing Ethernet switches at lower cost. InfiniBand NDR (400Gbps) generally offers higher bandwidth than RoCE (100-200Gbps).
Q4. Explain why Gang Scheduling is necessary in Kubernetes and name at least two schedulers that support it.
Answer: In distributed training, AllReduce communication requires all workers to participate for completion. If only some GPUs are allocated out of the needed total, the allocated ones sit idle waiting for the rest, wasting resources. Gang Scheduling provides all-or-nothing allocation. Schedulers supporting this include Volcano, Kueue (K8s SIG Scheduling), and YuniKorn (Apache). The default K8s scheduler (kube-scheduler) does not support Gang Scheduling.
Q5. When GPU utilization (SM Utilization) exceeds 90% but training speed is slow, name three possible causes and how to diagnose each.
Answer: (1) Memory-Bound: SMs are active but memory bandwidth is saturated. Check Memory Throughput in Nsight Compute; review Shared Memory usage and Memory Coalescing patterns. (2) Warp Divergence: Conditional branches cause sequential execution within Warps. Check Branch Efficiency metric in Nsight Compute. (3) Low SM Occupancy + high compute: Few Warps executing with high arithmetic intensity. Check Active Warps per SM and adjust Block size and Register usage. Additionally, underutilization of Tensor Cores (not using FP16/BF16) can also be a cause.
13. References
Official Documentation
- NVIDIA CUDA Programming Guide - NVIDIA Developer
- NVIDIA A100 Whitepaper - NVIDIA
- NVIDIA H100 Whitepaper - NVIDIA
- NVIDIA MIG User Guide - NVIDIA Developer
- NVIDIA Virtual GPU Software Documentation - NVIDIA
- NVIDIA GPU Operator Documentation - NVIDIA
- NVIDIA DCGM Documentation - NVIDIA
- NVIDIA NCCL Documentation - NVIDIA
Networking
- InfiniBand Architecture Specification - IBTA
- RDMA Aware Programming User Manual - NVIDIA Networking (Mellanox)
- RoCE v2 Deployment Guide - NVIDIA Networking
Kubernetes
- NVIDIA Device Plugin for Kubernetes - GitHub
- Volcano: Kubernetes Native Batch System - volcano.sh
- Kueue: Kubernetes-native Job Queueing - K8s SIG Scheduling
Training/Inference Optimization
- DeepSpeed Documentation - Microsoft
- Triton Inference Server Documentation - NVIDIA
- vLLM Documentation - vLLM Project
- Flash Attention Paper - Tri Dao et al.
Books
- "Programming Massively Parallel Processors" - David Kirk, Wen-mei Hwu
- "CUDA by Example" - Jason Sanders, Edward Kandrot
- "Computer Architecture: A Quantitative Approach" - Hennessy, Patterson
- "Understanding Linux Kernel" - Daniel P. Bovet, Marco Cesati
Community/Blogs
- NVIDIA Developer Blog - developer.nvidia.com/blog
- NVIDIA GTC Sessions (Free) - nvidia.com/gtc
- Horace He's "Making Deep Learning Go Brrrr" Blog Series
- GPU MODE Community - Discord