- Published on
GPU Software Engineer Complete Guide: From CUDA Architecture to vGPU/MIG, InfiniBand, and K8s GPU Scheduling — System Optimization Mastery
- Author: Youngju Kim (@fjvbn20031)
- 1. The Rare Career of GPU Software Engineer
- 2. JD Line-by-Line Dissection
- 3. GPU Architecture Deep Dive
- 4. GPU Virtualization Technology
- 5. High-Speed Networking: InfiniBand and RDMA
- 6. Kubernetes GPU Management
- 7. AI Workload Optimization
- 8. Linux System Troubleshooting
- 9. Interview Questions: Top 30
- 10. 10-Month Study Roadmap
- 11. Portfolio Projects (3)
- 12. Quiz
- 13. References
1. The Rare Career of GPU Software Engineer
"People Who USE GPUs" vs "People Who MAKE GPUs Work"
Since 2024, the keyword dominating the AI industry has been "GPU." Every company is fighting to acquire GPUs, but the number of engineers who can properly operate the GPUs they acquire is vanishingly small.
A critical distinction is needed here:
| Aspect | People Who USE GPUs | People Who MAKE GPUs Work |
|---|---|---|
| Role | ML Engineer, Researcher | GPU Software Engineer |
| Focus | Model accuracy, training algorithms | GPU utilization, memory bandwidth, scheduling |
| Tools | PyTorch, TensorFlow | nvidia-smi, Nsight, DCGM, NCCL |
| Key Question | "Why isn't this model performing?" | "Why is this GPU only 70% utilized?" |
| Abstraction Level | Python API | CUDA kernels, drivers, hypervisors |
| Response Area | Model architecture changes | XID error analysis, PCIe bottleneck resolution, MIG config |
When an ML Engineer calls model.to('cuda') in PyTorch, a GPU Software Engineer understands and optimizes the path that call takes: which driver layers it traverses and which memory region the allocation lands in.
Market Value of This Role
The supply-demand imbalance for GPU Software Engineers is extreme:
- Demand side: As of 2025, over 5 million NVIDIA GPUs are installed in datacenters worldwide. Millions more are deployed annually, and systems engineers are needed to operate them.
- Supply side: Engineers with deep GPU system software knowledge have traditionally existed only within NVIDIA itself, HPC research labs, and major cloud providers. In the Korean market, this talent pool is extremely limited.
- Salary premium: In the US, GPU/CUDA Engineers typically command base salaries of USD 250K-400K. In Korea, companies are increasingly offering exceptional compensation for experts in this field.
LG Uplus GPU Technology TF Mission
Understanding why LG Uplus established the GPU Technology TF is essential:
- Telecom AI Infrastructure Business: LG Uplus is pursuing not only its own AI services but also providing GPU cloud services to enterprise customers.
- GPU Multi-tenancy: Sharing a single GPU among multiple customers requires vGPU/MIG technology.
- Leveraging Network Strengths: As a telecom company, they have competency in InfiniBand/RoCE high-speed network design.
- End-to-End Optimization: This team is responsible for the entire pipeline from hardware selection to virtualization, containers, and AI workload onboarding.
Related Role Comparison
| Role | Core Competency | GPU Depth | Infra Depth |
|---|---|---|---|
| ML Engineer | Model development, training pipelines | Low (API level) | Low |
| MLOps Engineer | CI/CD, model serving, pipeline automation | Medium | Medium |
| GPU SW Engineer | GPU architecture, virtualization, drivers | Very High | High |
| Infra SRE | Server/network availability, monitoring | Medium | Very High |
| HPC Engineer | Parallel computing, MPI, schedulers | High | High |
GPU Software Engineer sits at the intersection of all these roles, specifically responsible for the software layer closest to hardware.
2. JD Line-by-Line Dissection
Let's analyze what each item in the LG Uplus GPU Software Engineer JD actually means.
Responsibilities
"GPU resource management and performance optimization"
This is not simply monitoring nvidia-smi. Specifically:
- Finding and resolving root causes when GPU utilization or SM occupancy falls below expectations
- Analyzing HBM memory bandwidth utilization and kernel-level optimization
- Managing power capping vs performance tradeoffs
- Determining RMA decisions when ECC errors occur
- Designing cluster-level GPU allocation policies
"GPU virtualization technology development and optimization"
This is the core differentiator for this position:
- Designing vGPU profiles: determining which virtual GPU size to allocate to which customer
- MIG partitioning strategy: configuring A100/H100 MIG profiles to match workloads
- Establishing technical selection criteria between PCI Passthrough vs vGPU vs MIG
- Building pipelines for GPU allocation to VMs in KubeVirt environments
"AI/ML workload GPU onboarding and performance optimization"
Efficiently deploying customer AI models onto GPU infrastructure:
- Model profiling: analyzing GPU memory requirements, compute requirements
- Matching appropriate GPU type/size (A100-40GB vs A100-80GB vs H100)
- Distributed training setup: NCCL communication optimization, InfiniBand utilization
- Inference serving: Triton Inference Server configuration, batch size optimization
Qualification Analysis
"BS+ in CS (MS preferred in systems/network/OS)"
The reason for MS preference is clear. Knowledge in this field is mostly covered in graduate-level courses:
- Operating Systems: memory management, scheduling, device drivers
- Computer Architecture: cache hierarchy, memory models, parallel processing
- Networking: RDMA, high-performance protocols
"GPU or system software practical experience"
The key phrase is "system software." This means:
- Kernel module development/debugging experience
- Interaction with device drivers
- Low-level performance profiling (perf, ftrace, eBPF)
- C/C++ level systems programming
Required Skills Analysis
The following sections will cover each required skill in depth.
3. GPU Architecture Deep Dive
3-1. GPU Compute Architecture
SM (Streaming Multiprocessor) Architecture
The core compute unit of a GPU is the SM (Streaming Multiprocessor). Understanding the SM structure of modern NVIDIA GPUs is the starting point for all GPU optimization.
SM Internal Components:
SM (Streaming Multiprocessor)
├── CUDA Cores (INT32 + FP32)
│ └── H100: 128 FP32 cores per SM
├── Tensor Cores
│ └── H100: 4th-gen Tensor Cores (FP8 support)
├── RT Cores (ray tracing; present on consumer/workstation GPUs, absent on A100/H100)
├── Warp Schedulers (4)
│ └── Each scheduler independently dispatches warps
├── Register File (256KB per SM in H100)
├── Shared Memory / L1 Cache (unified, up to 228KB)
├── Load/Store Units
├── Special Function Units (SFU)
│ └── Transcendental functions: sin, cos, exp
└── Texture Units
Warps and the SIMT Model:
The fundamental unit of GPU execution is a Warp (32 threads). All threads in the same Warp execute the same instruction simultaneously. This is the SIMT (Single Instruction, Multiple Threads) model.
Grid (entire workload)
├── Block 0
│ ├── Warp 0 (Thread 0~31) → Same instruction, simultaneous execution
│ ├── Warp 1 (Thread 32~63) → Same instruction, simultaneous execution
│ └── Warp N ...
├── Block 1
│ ├── Warp 0
│ └── ...
└── Block M
Warp Divergence problem: When threads within a Warp take different branches (if/else), both branches are executed sequentially, degrading performance. This is called "Warp Divergence" and is a pattern that must be avoided in GPU programming.
```cuda
// Bad example: Warp Divergence
__global__ void kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0) {
        data[idx] = expensive_operation_A(data[idx]);
    } else {
        data[idx] = expensive_operation_B(data[idx]);
    }
    // Even/odd threads in the same Warp take different branches -> sequential execution
}

// Good example: Threads in the same Warp take the same branch
__global__ void kernel_optimized(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = idx / 32;
    if (warp_id % 2 == 0) {
        data[idx] = expensive_operation_A(data[idx]);
    } else {
        data[idx] = expensive_operation_B(data[idx]);
    }
    // All threads in the same Warp take the same branch
}
```
Architecture Generation Comparison
| Feature | Ampere (A100) | Hopper (H100) | Blackwell (B200) |
|---|---|---|---|
| SM Count | 108 | 132 | 192 |
| CUDA Cores | 6,912 | 16,896 | 21,760+ |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) | 768 (5th gen) |
| Memory | HBM2e 80GB | HBM3 80GB | HBM3e 192GB |
| Memory Bandwidth | 2.0 TB/s | 3.35 TB/s | 8.0 TB/s |
| NVLink | 3.0 (600GB/s) | 4.0 (900GB/s) | 5.0 (1.8TB/s) |
| Transformer Engine | None | FP8 support | FP4 support |
| MIG Support | Up to 7 instances | Up to 7 instances | Up to 7 instances |
| TDP | 400W | 700W | 1000W |
| FP16 Tensor Perf | 312 TFLOPS | 989 TFLOPS | 2,250+ TFLOPS |
Transformer Engine: Introduced with H100, the Transformer Engine provides hardware-level FP8 precision support. It dynamically switches tensors between FP8/FP16 per layer during training, halving memory usage while minimizing accuracy loss.
NVLink and NVSwitch: High-speed direct communication paths between GPUs.
NVLink Topology (DGX H100):
GPU 0 <-- NVLink 4.0 (900GB/s) --> GPU 1
| |
NVSwitch (fully connected) NVSwitch
| |
GPU 2 <-- NVLink 4.0 (900GB/s) --> GPU 3
| |
... 8-GPU fully connected ...
GPU 6 <-- NVLink 4.0 (900GB/s) --> GPU 7
Total bandwidth: 8 GPUs x 900GB/s = 7.2TB/s (bidirectional)
3-2. GPU Memory Hierarchy (Critical!)
Understanding the GPU memory hierarchy accounts for 80% of GPU performance optimization. All GPU performance problems ultimately reduce to memory problems.
Memory Hierarchy (Fast -> Slow):
1. Register (Fastest)
├── Capacity: Up to 255 per thread (32-bit)
├── Latency: ~1 cycle
├── Bandwidth: Infinite (directly connected to ALU)
└── Note: Compiler auto-allocates; spills to local memory when exceeded
2. Shared Memory
├── Capacity: 48KB ~ 228KB per SM (configurable)
├── Latency: ~20-30 cycles (roughly an order of magnitude slower than registers)
├── Bandwidth: ~19TB/s (H100)
├── Feature: Shared among threads in same Block
└── Warning: Bank Conflicts possible
3. L1 Cache
├── Unified with Shared Memory (ratio configurable)
├── H100: Shared Memory + L1 = 228KB per SM
└── Auto-cached, not directly programmable
4. L2 Cache
├── Capacity: H100 = 50MB (shared across all SMs)
├── Latency: ~200 cycles
└── A100: 40MB, Blackwell: up to 128MB
5. Global Memory (HBM)
├── Capacity: 40GB ~ 192GB
├── Latency: ~600 cycles (~600x slower than registers)
├── Bandwidth: 2.0 ~ 8.0 TB/s (by generation)
└── Accessible from all threads
Memory Bandwidth Utilization Calculation:
Determining whether GPU performance is memory-bound or compute-bound is the core skill.
Arithmetic Intensity = FLOPs / Bytes Accessed
H100:
- Peak Compute: 989 TFLOPS (FP16 Tensor)
- Peak Memory BW: 3.35 TB/s
Balance Point (Roofline Analysis):
989 TFLOPS / 3.35 TB/s = 295 FLOPs/Byte
-> Arithmetic Intensity < 295: Memory-Bound
-> Arithmetic Intensity > 295: Compute-Bound
Examples:
- Vector addition: 1 FLOP / 12 Bytes = 0.08 -> Extremely Memory-Bound
- Matrix multiply (NxN): 2N FLOPs / 8 Bytes = O(N) -> Compute-Bound for large N
- Transformer Attention: Typically Memory-Bound (especially during inference)
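The roofline arithmetic above can be sketched in a few lines of Python. The H100 peak figures come from this section; the helper names are illustrative.

```python
# Minimal roofline model: classify a kernel as memory- or compute-bound
# from its arithmetic intensity. H100 peaks from the table above.

def arithmetic_intensity(flops, bytes_accessed):
    """FLOPs per byte of DRAM traffic."""
    return flops / bytes_accessed

def roofline_bound(ai, peak_tflops=989.0, peak_bw_tbs=3.35):
    """Return (attainable TFLOPS, limiting resource)."""
    ridge = peak_tflops / peak_bw_tbs          # ~295 FLOPs/Byte for H100
    attainable = min(peak_tflops, ai * peak_bw_tbs)
    return attainable, ("memory-bound" if ai < ridge else "compute-bound")

# FP32 vector add: 1 FLOP per element, 12 bytes moved (2 reads + 1 write)
ai_vadd = arithmetic_intensity(1, 12)
print(roofline_bound(ai_vadd))   # far below the ridge point -> memory-bound
```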
Memory Coalescing:
When 32 threads in a Warp access contiguous memory, hardware merges this into a single large transaction. Non-contiguous access splits into multiple transactions, wasting bandwidth.
```cuda
// Good: Coalesced Access (contiguous)
// Thread 0 -> data[0], Thread 1 -> data[1], ..., Thread 31 -> data[31]
__global__ void coalesced(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = data[idx];  // Single 128-byte transaction
}

// Bad: Strided Access (non-contiguous)
// Thread 0 -> data[0], Thread 1 -> data[32], ..., Thread 31 -> data[31*32]
__global__ void strided(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    float val = data[idx];  // 32 separate transactions!
}
```
Bank Conflicts:
Shared Memory is divided into 32 banks. When 2+ threads in the same Warp access the same bank, accesses are serialized, degrading performance.
Shared Memory Bank Layout (4-byte granularity):
Bank 0: addr 0, 128, 256, ...
Bank 1: addr 4, 132, 260, ...
Bank 2: addr 8, 136, 264, ...
...
Bank 31: addr 124, 252, 380, ...
Bank Conflict Example:
Thread 0 -> Bank 0 (addr 0)
Thread 1 -> Bank 0 (addr 128) <- Same bank! Conflict
-> 2-way bank conflict: 2x slower
Avoidance: add padding

```cuda
__shared__ float tile[32][33];  // Pad the row to 33 floats (instead of 32)
// Column accesses now hit distinct banks -> no bank conflicts
```
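The padding trick can be checked numerically: with 4-byte banks, element (row, col) of a tile with ncols columns lands in bank ((row * ncols + col) mod 32). A quick sketch with hypothetical helper names:

```python
# Which bank does each 4-byte shared-memory element fall in?
# Bank = (word index) mod 32, matching the bank layout above.

def bank_of(row, col, ncols):
    return (row * ncols + col) % 32

def conflicts_on_column_access(ncols, col=0):
    """Extra serialized accesses when 32 threads each read tile[t][col]."""
    banks = [bank_of(t, col, ncols) for t in range(32)]
    return 32 - len(set(banks))   # 0 means conflict-free

print(conflicts_on_column_access(32))  # tile[32][32]: every thread hits one bank
print(conflicts_on_column_access(33))  # tile[32][33]: padding spreads the banks
```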
3-3. CUDA Programming Fundamentals
Grid, Block, Thread Hierarchy
CUDA Execution Model:
Grid (1)
├── Block (0,0) -- Block (1,0) -- Block (2,0)
├── Block (0,1) -- Block (1,1) -- Block (2,1)
└── Block (0,2) -- Block (1,2) -- Block (2,2)
Each Block:
├── Thread (0,0) ... Thread (15,0)
├── Thread (0,1) ... Thread (15,1)
└── Thread (0,15) ... Thread (15,15)
Constraints:
- Max 1024 threads per Block
- Block dimensions: max (1024, 1024, 64)
- Grid dimensions: max (2^31-1, 65535, 65535)
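A minimal launch-configuration helper reflecting these limits (illustrative; it mirrors the standard ceiling-division idiom used when sizing a 1D grid):

```python
# Round the grid up so every element gets a thread, and sanity-check
# the per-block and grid limits quoted above.

def launch_config(n, block_size=256):
    assert 1 <= block_size <= 1024, "max 1024 threads per block"
    grid_size = (n + block_size - 1) // block_size   # ceiling division
    assert grid_size <= 2**31 - 1, "grid x-dimension limit"
    return grid_size, block_size

print(launch_config(1 << 20))  # 1M elements, 256 threads/block -> (4096, 256)
```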
CUDA Code Example: Vector Addition
```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// GPU kernel function
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t size = n * sizeof(float);

    // Host memory allocation
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize
    for (int i = 0; i < n; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Host -> Device copy
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel launch
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);

    // Device -> Host copy
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify
    printf("C[0] = %f (expected 3.0)\n", h_C[0]);

    // Free memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```
CUDA Code Example: Matrix Multiplication (Shared Memory Tiling)
```cuda
#define TILE_SIZE 32

__global__ void matMul(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load from Global Memory to Shared Memory
        if (row < N && t * TILE_SIZE + threadIdx.x < N)
            tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // Synchronize all threads in the Block

        for (int k = 0; k < TILE_SIZE; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];

        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}
```
Key CUDA Libraries
| Library | Purpose | Key API |
|---|---|---|
| cuBLAS | Linear algebra (matrix ops) | cublasSgemm, cublasGemmEx |
| cuDNN | Deep learning primitives | cudnnConvolutionForward |
| cuFFT | Fast Fourier Transform | cufftExecC2C |
| cuSPARSE | Sparse matrix operations | cusparseSpMV |
| Thrust | C++ parallel algorithms (STL-like) | thrust::sort, thrust::reduce |
| CUTLASS | GEMM customization | Template-based GEMM |
3-4. GPU Profiling and Performance Analysis
nvidia-smi Detailed Usage
```shell
# Basic status check
nvidia-smi

# Monitor at 1-second intervals
nvidia-smi dmon -s pucvmet -d 1

# Detailed GPU process info
nvidia-smi pmon -d 1

# Query format (for scripts)
nvidia-smi --query-gpu=timestamp,gpu_bus_id,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 1

# MIG status check
nvidia-smi mig -lgi
nvidia-smi mig -lci

# GPU topology check (NVLink connections)
nvidia-smi topo -m

# ECC error check
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
```
NVIDIA Nsight Systems (System-Level Profiling)
```shell
# Full application profiling
nsys profile --stats=true -o report python train.py

# Trace CUDA API calls + GPU kernels + memory transfers
nsys profile --trace=cuda,nvtx,osrt -o detailed_report python train.py

# Visualize results (GUI)
nsys-ui report.nsys-rep
```
Nsight Systems provides a timeline view for:
- Identifying CPU-GPU synchronization points
- Checking kernel execution and memory transfer overlap
- Identifying CPU bottlenecks (data loading, preprocessing)
- Measuring NCCL communication time (distributed training)
NVIDIA Nsight Compute (Kernel-Level Analysis)
```shell
# Detailed analysis of a specific kernel
ncu --target-processes all --set full -o kernel_report python train.py

# Profile a specific kernel only
ncu --kernel-name "volta_sgemm" --launch-count 10 -o sgemm_report ./my_app

# Check key metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_active,dram__throughput.avg.pct_of_peak_sustained_active ./my_app
```
Key Metrics:
- SM Occupancy: Ratio of active warps in SM (higher is better, typically aim for 50%+)
- Compute Throughput: Compute utilization (% of peak)
- Memory Throughput: Memory bandwidth utilization (% of peak)
- Warp Stall Reasons: Why warps are waiting (memory, synchronization, etc.)
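SM occupancy can be estimated from the same per-SM resource limits Nsight Compute reports. A back-of-the-envelope sketch, assuming H100-like limits (64 resident warps and 65,536 registers per SM, 228KB shared memory); the helper name is hypothetical:

```python
# Rough SM occupancy: how many blocks fit per SM under each resource
# limit, and what fraction of the warp slots they fill.

def occupancy(regs_per_thread, smem_per_block, threads_per_block,
              max_warps=64, reg_file=65536, smem_bytes=228 * 1024):
    warps_per_block = (threads_per_block + 31) // 32
    by_regs = reg_file // (regs_per_thread * threads_per_block)
    by_smem = smem_bytes // smem_per_block if smem_per_block else 10**9
    by_warps = max_warps // warps_per_block
    blocks = min(by_regs, by_smem, by_warps)
    return blocks * warps_per_block / max_warps

# 256 threads/block, 32 registers/thread, 8 KB shared memory per block:
print(f"{occupancy(32, 8192, 256):.0%}")
```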
DCGM (Data Center GPU Manager)
For large-scale cluster monitoring, nvidia-smi alone is insufficient. DCGM provides:
```shell
# Start DCGM
sudo systemctl start nvidia-dcgm

# Health check
dcgmi health -g 0 -c

# Run diagnostics (Level 3: most detailed)
dcgmi diag -r 3 -g 0

# Metric collection (Prometheus integration)
dcgm-exporter &
# Prometheus scrapes http://localhost:9400/metrics
```
GPU Utilization Low: Analysis Pattern
GPU Utilization Low (<50%)
├── CPU bottleneck?
│ ├── Slow data loading -> Increase num_workers, prefetch
│ ├── Heavy preprocessing -> Use DALI (GPU preprocessing)
│ └── Python GIL -> Multiprocessing
├── Memory transfer bottleneck?
│ ├── PCIe bandwidth saturated -> Use GPU Direct
│ └── Unnecessary CPU-GPU copies -> Use pinned memory
├── Small kernels + large overhead?
│ ├── Kernel launch overhead -> Use CUDA Graphs
│ └── Excessive synchronization -> Optimize async execution
├── Batch size too small?
│ └── Not enough work to fill GPU -> Increase batch or use Gradient Accumulation
└── Communication overhead? (distributed training)
├── AllReduce taking too long -> NCCL tuning
└── Network bottleneck -> Check InfiniBand
4. GPU Virtualization Technology
4-1. Virtualization Fundamentals
Type 1 vs Type 2 Hypervisor
Type 1 (Bare-metal): Type 2 (Hosted):
+-----------------+ +-----------------+
| VM1 VM2 | | VM1 VM2 |
| +-----++-----+ | | +-----++-----+ |
| |Guest||Guest| | | |Guest||Guest| |
| | OS || OS | | | | OS || OS | |
| +-----++-----+ | | +-----++-----+ |
| Hypervisor | | Hypervisor |
| (ESXi, KVM) | | (VirtualBox) |
| Hardware | | Host OS |
+-----------------+ | Hardware |
+-----------------+
In the LG Uplus environment, KVM is the core technology. KVM is a Type 1 hypervisor that operates as a Linux kernel module, using QEMU as the userspace emulator.
IOMMU (Intel VT-d / AMD-Vi)
IOMMU is an essential hardware feature for GPU virtualization:
Without IOMMU (unsafe):
VM -> (virtual address) -> Physical Memory (direct access -> can access other VM memory)
With IOMMU (safe):
VM -> (virtual address) -> IOMMU translation -> (physical address, isolated)
└── DMA requests also isolated!
Verifying IOMMU is enabled:
```shell
# Check IOMMU groups
find /sys/kernel/iommu_groups/ -type l

# Check kernel boot parameters
cat /proc/cmdline | grep iommu
# Should contain intel_iommu=on or amd_iommu=on

# Check devices per IOMMU group
for g in /sys/kernel/iommu_groups/*/devices/*; do
    echo "IOMMU Group $(basename $(dirname $(dirname $g))):"
    lspci -nns $(basename $g)
done
```
4-2. PCI Passthrough
PCI Passthrough is the most basic method of directly assigning a physical GPU to a VM.
PCI Passthrough Architecture:
Host (Linux + KVM)
├── GPU 0 -> VFIO driver binding -> VM1 (direct access)
├── GPU 1 -> VFIO driver binding -> VM2 (direct access)
├── GPU 2 -> NVIDIA driver -> Host use
└── GPU 3 -> NVIDIA driver -> Host use
Setup procedure:
```shell
# 1. Enable IOMMU (GRUB)
# Add to /etc/default/grub:
# GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

# 2. Find GPU PCI ID
lspci -nn | grep NVIDIA
# 41:00.0 3D controller [0302]: NVIDIA Corporation A100 [10de:20b2]

# 3. Bind to VFIO driver
echo "10de 20b2" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "0000:41:00.0" > /sys/bus/pci/devices/0000:41:00.0/driver/unbind
echo "0000:41:00.0" > /sys/bus/pci/drivers/vfio-pci/bind

# 4. Add device when starting VM with QEMU/KVM
# -device vfio-pci,host=41:00.0
```
Pros and Cons:
| Pros | Cons |
|---|---|
| Native performance (nearly zero overhead) | 1 GPU = 1 VM (no sharing) |
| Simple setup | No live migration |
| All CUDA features supported | Potential GPU resource waste |
4-3. vGPU (NVIDIA Virtual GPU)
vGPU uses time-slicing to share a single physical GPU across multiple VMs.
vGPU Architecture:
Physical GPU (A100-80GB)
├── vGPU Instance 1 (A100-4C, 4GB) -> VM1
├── vGPU Instance 2 (A100-4C, 4GB) -> VM2
├── vGPU Instance 3 (A100-8C, 8GB) -> VM3
└── ... (as many as remaining capacity allows)
Time-slicing:
t=0ms [VM1 runs] -> t=16ms [VM2 runs] -> t=32ms [VM3 runs] -> ...
vGPU Profile Types
| Series | Purpose | Example |
|---|---|---|
| A-series | Virtual Application | A100-1-5A (5GB, VDI apps) |
| B-series | Virtual PC | A100-2-10B (10GB, VDI desktop) |
| C-series | Compute | A100-4-20C (20GB, AI compute) |
| Q-series | Quadro | A100-8-40Q (40GB, professional graphics) |
For LG Uplus GPU Technology TF, C-series (Compute) will be the primary focus.
vGPU Scheduler
The vGPU scheduler determines how GPU time is divided among instances:
- Equal Share: every vGPU gets an equal time slice. Fair, but priorities cannot be set.
- Fixed Share: time is allocated in proportion to vGPU profile size (a 4GB vGPU and an 8GB vGPU split time 1:2).
- Best Effort: idle vGPU time is redistributed to active vGPUs. Most efficient, but performance is less predictable.
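The Equal Share and Fixed Share policies reduce to simple time-share functions. A toy model, not NVIDIA's scheduler:

```python
# Time shares under two vGPU scheduler policies: equal slices vs
# slices proportional to profile size.

def equal_share(profiles):
    return {name: 1 / len(profiles) for name in profiles}

def fixed_share(profiles):
    total = sum(profiles.values())
    return {name: size / total for name, size in profiles.items()}

vms = {"vm1": 4, "vm2": 4, "vm3": 8}   # vGPU profile sizes in GB
print(equal_share(vms))   # every VM gets 1/3 of GPU time
print(fixed_share(vms))   # vm3's 8GB profile earns a 2x time share
```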
4-4. MIG (Multi-Instance GPU)
MIG is a hardware partitioning technology available on Ampere-and-later datacenter GPUs (A100, A30, H100) that physically partitions the GPU. Unlike vGPU's time-slicing, MIG completely isolates SMs and memory.
MIG Architecture (A100-80GB):
Full GPU: 108 SM, 80GB HBM2e
├── MIG Instance 1 (7g.80gb): 98 SM, 80GB <- Nearly full (solo use)
or
├── MIG Instance 1 (4g.40gb): 56 SM, 40GB
├── MIG Instance 2 (3g.40gb): 42 SM, 40GB
or
├── MIG Instance 1 (3g.40gb): 42 SM, 40GB
├── MIG Instance 2 (2g.20gb): 28 SM, 20GB
├── MIG Instance 3 (1g.10gb): 14 SM, 10GB
├── MIG Instance 4 (1g.10gb): 14 SM, 10GB
or (maximum partition)
├── MIG Instance 1~7 (1g.10gb): 14 SM each, 10GB each (x7)
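A proposed layout must respect the GPU's slice budget: an A100-80GB exposes 7 compute slices and 80GB of memory to distribute. A small validator sketch (profile table limited to the profiles shown above; function names are illustrative):

```python
# Validate a proposed A100-80GB MIG layout against its compute-slice
# and memory budget.

PROFILES = {             # name: (compute slices, memory GB)
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}

def layout_fits(instances, max_slices=7, max_mem_gb=80):
    slices = sum(PROFILES[p][0] for p in instances)
    mem = sum(PROFILES[p][1] for p in instances)
    return slices <= max_slices and mem <= max_mem_gb

print(layout_fits(["3g.40gb", "1g.10gb", "1g.10gb", "1g.10gb"]))  # fits
print(layout_fits(["4g.40gb", "3g.40gb", "1g.10gb"]))             # 8 slices: no
```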
MIG Configuration Commands
```shell
# Enable MIG
sudo nvidia-smi -i 0 -mig 1

# Check available MIG profiles
nvidia-smi mig -lgip

# Create GPU Instances (profile IDs come from -lgip; on A100-80GB, 9 = 3g.40gb, 19 = 1g.10gb)
sudo nvidia-smi mig -i 0 -cgi 9,19,19,19  # 3g.40gb + 1g.10gb x3

# Create Compute Instances
sudo nvidia-smi mig -i 0 -cci

# Check current MIG status
nvidia-smi mig -lgi
nvidia-smi mig -lci

# Delete MIG instances
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi

# Disable MIG
sudo nvidia-smi -i 0 -mig 0
```
MIG vs vGPU Comparison
| Feature | MIG | vGPU |
|---|---|---|
| Isolation Level | Physical (SM + memory fully separated) | Time-slicing (software isolation) |
| Performance Predictability | Consistent (dedicated resources) | Variable (affected by other VMs) |
| Max Instances | 7 (A100/H100) | Many (within GPU memory limits) |
| Supported GPUs | A100, H100, A30 | Most datacenter GPUs |
| Flexibility | Fixed profiles (reconfiguration needed) | Dynamic allocation possible |
| Licensing | No additional license needed | vGPU license required |
| Use Cases | Inference serving, small-scale training | VDI, mixed workloads |
MIG on K8s: NVIDIA MIG Manager
# MIG Configuration ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: mig-parted-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
mig-configs:
all-1g.10gb:
- device-filter: ["0x20B210DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 7
mixed-config:
- device-filter: ["0x20B210DE"]
devices: all
mig-enabled: true
mig-devices:
"3g.40gb": 1
"1g.10gb": 4
4-5. SR-IOV (NIC Virtualization)
SR-IOV virtualizes NICs for direct VM assignment. This is important when combined with GPU Direct RDMA.
SR-IOV Structure:
Physical NIC (ConnectX-7)
├── PF (Physical Function): Managed by host driver
├── VF 0 (Virtual Function) -> VM1 (direct assignment, native performance)
├── VF 1 -> VM2
├── VF 2 -> VM3
└── ... (up to 128 VFs)
Advantages:
- VMs access NIC directly without virtual bridge
- Near-native network performance
- Minimal CPU overhead
GPU Direct RDMA Combination:
GPU in VM <-> VF(SR-IOV NIC) <-> InfiniBand <-> Remote GPU
(PCIe direct) (SR-IOV bypass) (RDMA)
4-6. KubeVirt
KubeVirt manages VMs as first-class resources on Kubernetes. It is a core technology when containers and VMs need to run on the same platform.
```yaml
# KubeVirt VM with GPU PCI Passthrough
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - name: gpu
              deviceName: nvidia.com/A100
        resources:
          requests:
            memory: '32Gi'
            cpu: '8'
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/ubuntu:22.04
```
KubeVirt + vGPU:
```yaml
# KubeVirt VM with vGPU allocation
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vgpu-vm
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            - name: vgpu
              deviceName: nvidia.com/NVIDIA_A100-4C
        resources:
          requests:
            memory: '16Gi'
```
Use Cases:
- Legacy VM workloads: Migrating existing VM-based AI workloads to K8s
- Mixed environments: Running containers + VMs simultaneously on the same K8s cluster
- GPU sharing: Flexibly allocating GPUs to VMs and containers through vGPU
5. High-Speed Networking: InfiniBand and RDMA
5-1. InfiniBand Architecture
Distributed GPU training performance is determined by the network. No matter how fast the GPUs are, slow inter-GPU communication degrades overall performance.
InfiniBand vs Ethernet Comparison
| Feature | InfiniBand NDR | RoCE v2 (100GbE) | TCP/IP (100GbE) |
|---|---|---|---|
| Bandwidth | 400 Gbps | 100 Gbps | 100 Gbps |
| Latency | 0.5us | 1~2us | 10~50us |
| RDMA Support | Native | RoCE v2 | None (kernel path) |
| CPU Overhead | Nearly zero | Low | High |
| Congestion Control | Credit-based | PFC/ECN | TCP congestion control |
| Cost | Very high | Medium | Low |
| Use Case | HPC, AI training | AI training (cloud) | General workloads |
InfiniBand Generations
InfiniBand Speed Evolution:
SDR (2001): 10 Gbps
DDR (2005): 20 Gbps
QDR (2008): 40 Gbps
FDR (2011): 56 Gbps
EDR (2014): 100 Gbps
HDR (2018): 200 Gbps
NDR (2022): 400 Gbps
XDR (2024): 800 Gbps
GDR (2026): 1.6 Tbps (planned)
InfiniBand Network Components
InfiniBand Fabric Structure:
Leaf Switch (ToR)
├── HCA (Host Channel Adapter) -- Server 1 [GPU 0~7]
├── HCA -- Server 2 [GPU 0~7]
├── HCA -- Server 3 [GPU 0~7]
└── HCA -- Server 4 [GPU 0~7]
Spine Switch
├── Leaf Switch 1
├── Leaf Switch 2
├── Leaf Switch 3
└── Leaf Switch 4
Management Components:
- Subnet Manager (OpenSM): LID assignment, routing table management
- LID (Local ID): Subnet address (16-bit)
- GID (Global ID): Global address (128-bit, IPv6-like)
- GUID (Globally Unique ID): Hardware unique identifier
5-2. RDMA (Remote Direct Memory Access)
RDMA accesses remote memory directly without CPU involvement. It is the backbone of distributed GPU training.
TCP/IP Transfer (Traditional):
App -> Socket API -> TCP/IP Stack (Kernel) -> NIC Driver -> NIC -> Network
^ CPU involved (copy, checksum, segmentation)
RDMA Transfer:
App -> RDMA Verbs -> NIC (direct) -> Network
^ Zero-copy, CPU bypass
RDMA Transport Types
| Transport | Description | Use Case |
|---|---|---|
| InfiniBand | Native RDMA | HPC, AI clusters |
| RoCE v2 | RDMA over UDP/IP | Cloud environments |
| iWARP | RDMA over TCP/IP | Legacy environments |
RDMA Programming Basics
```c
// ibverbs-based RDMA Write example (simplified; error checks and the
// scatter/gather list setup are omitted for brevity)
#include <infiniband/verbs.h>
#include <string.h>

// 1. Open device
struct ibv_context *ctx = ibv_open_device(dev);

// 2. Create Protection Domain
struct ibv_pd *pd = ibv_alloc_pd(ctx);

// 3. Register memory (so the NIC can access it directly)
struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

// 4. Create Queue Pair
struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr);

// 5. RDMA Write (write directly to remote memory)
struct ibv_send_wr wr, *bad_wr = NULL;
memset(&wr, 0, sizeof(wr));
wr.opcode = IBV_WR_RDMA_WRITE;
wr.wr.rdma.remote_addr = remote_addr;
wr.wr.rdma.rkey = remote_key;
ibv_post_send(qp, &wr, &bad_wr);
```
5-3. GPU Direct
GPU Direct RDMA
GPU Direct RDMA enables direct data transfer from GPU memory to remote GPU memory.
Normal Path (without GPU Direct):
GPU0 -> PCIe -> Host Memory -> NIC -> Network -> NIC -> Host Memory -> PCIe -> GPU1
(copy1) (copy2) (copy3) (copy4)
GPU Direct RDMA:
GPU0 -> PCIe -> NIC -> Network -> NIC -> PCIe -> GPU1
(direct) (direct)
CPU bypass; the two host-memory staging copies are eliminated
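The benefit of removing staging hops can be felt with a toy path-latency model. The bandwidth figures below are illustrative assumptions, not measurements:

```python
# Toy model: each hop in the transfer path (GB/s) adds its own traversal
# time, so cutting the host-memory bounce shortens the path.

def transfer_time_ms(bytes_moved, hops_gbs):
    """Sum the time spent crossing each hop in the path."""
    return sum(bytes_moved / (bw * 1e9) for bw in hops_gbs) * 1e3

msg = 1 << 30  # 1 GiB gradient bucket
# Without GPU Direct: GPU->host copy, NIC send, NIC recv, host->GPU copy
staged = transfer_time_ms(msg, [25, 50, 50, 25])
# GPU Direct RDMA: NIC reads/writes GPU memory over PCIe directly
direct = transfer_time_ms(msg, [50, 50])
print(f"staged: {staged:.1f} ms, GPU Direct: {direct:.1f} ms")
```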
GPU Direct Storage (GDS)
Normal Storage Access:
NVMe -> Host Memory (bounce buffer) -> GPU Memory
CPU involved, 2 copies
GPU Direct Storage:
NVMe -> GPU Memory (direct)
CPU bypass, 1 copy
Use Cases: Large dataset loading (checkpoint recovery, data preprocessing)
NCCL + InfiniBand Combination
# NCCL environment variables (distributed training)
export NCCL_IB_HCA=mlx5_0,mlx5_1 # Specify InfiniBand HCAs
export NCCL_IB_GID_INDEX=3 # RoCE v2 GID index
export NCCL_SOCKET_IFNAME=eth0 # Control channel interface
export NCCL_DEBUG=INFO # Debug logging
# NCCL topology file (GPU-NIC mapping optimization)
export NCCL_TOPO_FILE=/path/to/topo.xml
# NCCL AllReduce benchmark
/usr/local/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
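The numbers all_reduce_perf prints can be reproduced by hand: algorithm bandwidth is bytes over time, and bus bandwidth applies the ring all-reduce correction factor 2(n-1)/n so results are comparable across GPU counts. The timing figure below is illustrative:

```python
# How nccl-tests derives algbw and busbw for all-reduce.

def allreduce_bandwidth(bytes_reduced, seconds, n_gpus):
    algbw = bytes_reduced / seconds / 1e9          # GB/s
    busbw = algbw * 2 * (n_gpus - 1) / n_gpus      # ring all-reduce factor
    return algbw, busbw

# 1 GiB reduced across 8 GPUs in 3 ms (illustrative):
algbw, busbw = allreduce_bandwidth(1 << 30, 3e-3, 8)
print(f"algbw {algbw:.0f} GB/s, busbw {busbw:.0f} GB/s")
```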
5-4. Network Performance Tuning
InfiniBand Benchmarks
```shell
# Bandwidth test
# Server: ib_write_bw --size=65536
# Client: ib_write_bw --size=65536 <server_ip>

# Latency test
# Server: ib_write_lat
# Client: ib_write_lat <server_ip>

# Example results (NDR 400Gbps):
# Bandwidth: ~48 GB/s (theoretical 50 GB/s)
# Latency: ~0.6 us
```
PFC (Priority Flow Control) Configuration
PFC is essential in RoCE v2 environments:
# Mellanox NIC PFC configuration
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0
# Enable PFC only on Priority 3 (RoCE traffic)
# DSCP -> Priority mapping
mlnx_qos -i eth0 --trust dscp
ECMP (Equal-Cost Multi-Path) Routing
ECMP in Large-Scale InfiniBand Fabrics:
Server A --- Leaf 1 -+- Spine 1 -+- Leaf 3 --- Server C
+- Spine 2 -+
+- Spine 3 -+
+- Spine 4 -+
-> Load-balance across 4 equal-cost paths
-> Hash-based (source/destination LID) distribution
-> Adaptive Routing (AR): Dynamic path selection based on congestion
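Hash-based path selection as sketched above can be modeled in a few lines (a hypothetical helper; real fabrics hash in switch hardware):

```python
# ECMP: a flow's (src, dst) pair deterministically picks one of the
# equal-cost spine paths, spreading many flows across all of them.
import hashlib

def ecmp_path(src_lid, dst_lid, n_paths=4):
    key = f"{src_lid}-{dst_lid}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths

# Many flows spread across the 4 spines; any single flow is stable:
paths = [ecmp_path(s, 77) for s in range(1000)]
assert ecmp_path(1, 77) == ecmp_path(1, 77)    # deterministic per flow
print({p: paths.count(p) for p in range(4)})   # roughly even split
```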
6. Kubernetes GPU Management
6-1. NVIDIA GPU Operator
GPU Operator automatically deploys the GPU software stack to K8s clusters.
GPU Operator Components:
GPU Operator
├── NVIDIA Driver (DaemonSet)
│ └── Auto-builds/installs kernel module
├── NVIDIA Container Toolkit
│ └── Adds GPU support to container runtime
├── NVIDIA Device Plugin
│ └── Registers GPU resources with K8s
├── DCGM Exporter
│ └── GPU metrics -> Prometheus
├── MIG Manager
│ └── Auto-applies MIG profiles
├── GPU Feature Discovery (GFD)
│ └── Auto-adds GPU labels to nodes
└── NVIDIA Validator
└── Validates installation state
Installation:
```shell
# Install GPU Operator via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --set driver.enabled=true \
    --set mig.strategy=mixed \
    --set dcgmExporter.enabled=true
```
6-2. GPU Device Plugin
```yaml
# Allocating GPUs to a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.3.0-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 2  # Request 2 GPUs
      command: ['nvidia-smi']
```
Time-Slicing Configuration (GPU Sharing)
For GPUs that don't support MIG, multiple Pods can share a GPU:
```yaml
# GPU Time-Slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Split 1 GPU into 4 (time-slicing)
```
6-3. GPU Scheduling
Basic Scheduling
K8s default GPU scheduling is simple: place Pods on nodes with sufficient available GPUs. However, large-scale GPU clusters require more sophisticated scheduling.
Topology-Aware Scheduling
Example 8-GPU topology (PCIe-based server without NVSwitch):
GPU0 - NVLink - GPU1 (same NVLink island)
GPU2 - NVLink - GPU3 (same NVLink island)
GPU4 - NVLink - GPU5 (same NVLink island)
GPU6 - NVLink - GPU7 (same NVLink island)
GPU0 - PCIe - GPU4 (different island, PCIe connection)
-> 4-GPU training: GPU0,1,2,3 (NVLink) >> GPU0,2,4,6 (PCIe)
(On DGX/HGX H100 systems, NVSwitch fully connects all 8 GPUs; topology-aware placement matters most on PCIe-based servers like the one above.)
# Topology-aware scheduling with NodeSelector
apiVersion: v1
kind: Pod
metadata:
name: distributed-training
spec:
nodeSelector:
nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
nvidia.com/gpu.count: '8'
containers:
- name: trainer
resources:
limits:
nvidia.com/gpu: 4
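A topology-aware allocator's core decision can be sketched as scoring candidate GPU sets by how many of their pairs are NVLink-connected. The adjacency set below mirrors the hypothetical pairwise topology diagrammed above; a real scheduler would derive it from `nvidia-smi topo -m`.

```python
# Hedged sketch: score candidate GPU sets by counting NVLink-connected pairs.
# NVLINK mirrors the illustrative island topology above (0-1, 2-3, 4-5, 6-7).
from itertools import combinations

NVLINK = {frozenset(p) for p in [(0, 1), (2, 3), (4, 5), (6, 7)]}

def nvlink_pairs(gpus: tuple[int, ...]) -> int:
    """Number of GPU pairs in the set connected by NVLink (the rest use PCIe)."""
    return sum(frozenset(p) in NVLINK for p in combinations(gpus, 2))

assert nvlink_pairs((0, 1, 2, 3)) == 2   # two NVLink pairs -> preferred placement
assert nvlink_pairs((0, 2, 4, 6)) == 0   # all inter-GPU traffic over PCIe
```

A scheduler would pick the candidate set maximizing this score before binding the Pod.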
Gang Scheduling
In distributed training, all GPUs must be allocated simultaneously. Partial allocation wastes resources as allocated GPUs wait for the rest.
# Gang Scheduling with Volcano
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: distributed-training
spec:
minAvailable: 4 # Schedule minimum 4 Pods simultaneously
schedulerName: volcano
tasks:
- replicas: 4
name: worker
template:
spec:
containers:
- name: trainer
image: training-image:latest
resources:
limits:
nvidia.com/gpu: 8 # 8 GPUs per node
Bin-packing vs Spread Strategy
Bin-packing (resource consolidation):
Node1: [GPU0 used, GPU1 used, GPU2 used, GPU3 free]
Node2: [GPU0 free, GPU1 free, GPU2 free, GPU3 free]
-> Pros: Power savings on idle nodes, resource efficiency
-> Cons: Potential hot spots
Spread (distribution):
Node1: [GPU0 used, GPU1 free, GPU2 used, GPU3 free]
Node2: [GPU0 used, GPU1 free, GPU2 used, GPU3 free]
-> Pros: Load distribution, fault isolation
-> Cons: Resource fragmentation
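The two placement strategies reduce to opposite node-selection rules. A minimal sketch, with nodes modeled as free-GPU counters (node names and counts are illustrative):

```python
# Hedged sketch: placing single-GPU pods under bin-packing vs spread.

def place(free: dict[str, int], strategy: str) -> str:
    candidates = [n for n, f in free.items() if f > 0]
    if strategy == "binpack":                       # fill the fullest node first
        node = min(candidates, key=lambda n: free[n])
    else:                                           # "spread": pick the emptiest node
        node = max(candidates, key=lambda n: free[n])
    free[node] -= 1
    return node

assert place({"node1": 1, "node2": 4}, "binpack") == "node1"  # keeps node2 drainable
assert place({"node1": 1, "node2": 4}, "spread") == "node2"   # distributes load
```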
GPU Feature Discovery (GFD)
# Example node labels added by GFD
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.count=8
nvidia.com/gpu.memory=81920
nvidia.com/cuda.driver.major=535
nvidia.com/mig.strategy=mixed
nvidia.com/gpu.family=ampere
nvidia.com/mig-1g.10gb.count=4
nvidia.com/mig-3g.40gb.count=1
6-4. GPU Monitoring on K8s
DCGM Exporter + Prometheus + Grafana
# DCGM Exporter DaemonSet (included in GPU Operator)
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: dcgm-exporter
spec:
selector:
matchLabels:
app: nvidia-dcgm-exporter
endpoints:
- port: metrics
interval: 15s
Key Prometheus Metrics:
# GPU Utilization
DCGM_FI_DEV_GPU_UTIL # SM utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL # Memory bandwidth utilization (%)
# Memory
DCGM_FI_DEV_FB_USED # Used framebuffer (MB)
DCGM_FI_DEV_FB_FREE # Free framebuffer (MB)
# Temperature/Power
DCGM_FI_DEV_GPU_TEMP # GPU temperature (C)
DCGM_FI_DEV_POWER_USAGE # Power usage (W)
# Errors
DCGM_FI_DEV_XID_ERRORS # XID error codes
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL # Single-bit ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL # Double-bit ECC errors
# PCIe
DCGM_FI_DEV_PCIE_TX_THROUGHPUT # PCIe transmit throughput
DCGM_FI_DEV_PCIE_RX_THROUGHPUT # PCIe receive throughput
Alert Configuration Examples:
# Prometheus Alert Rules
groups:
- name: gpu-alerts
rules:
- alert: GPUMemoryAlmostFull
expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
for: 5m
labels:
severity: warning
annotations:
summary: 'GPU memory usage above 95%'
- alert: GPUThermalThrottling
expr: DCGM_FI_DEV_GPU_TEMP > 85
for: 2m
labels:
severity: critical
annotations:
summary: 'GPU temperature exceeds 85C'
- alert: GPUXIDError
expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
labels:
severity: critical
annotations:
summary: 'GPU XID error detected'
7. AI Workload Optimization
7-1. Training Optimization
Mixed Precision Training
# PyTorch Automatic Mixed Precision (AMP)
import torch
from torch.cuda.amp import autocast, GradScaler
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()
for data, target in dataloader:
optimizer.zero_grad()
# Forward Pass in FP16
with autocast():
output = model(data.cuda())
loss = criterion(output, target.cuda())
# Loss Scaling + Backward Pass
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Precision Comparison:
| Precision | Bits | Memory Savings | Tensor Core Support | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | No (runs on CUDA cores; use TF32 path) | Default training |
| TF32 | 19 | - | Yes (A100+) | Auto-applied |
| FP16 | 16 | 2x | Yes | Mixed Precision |
| BF16 | 16 | 2x | Yes (A100+) | LLM training (wider range) |
| FP8 (E4M3) | 8 | 4x | Yes (H100+) | Transformer Engine |
| INT8 | 8 | 4x | Yes | Inference quantization |
DeepSpeed ZeRO
# DeepSpeed ZeRO Stage 3 Configuration
# ds_config.json
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9
},
"bf16": {
"enabled": true
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}
ZeRO Memory Partitioning:
ZeRO Stage 0 (Default):
GPU0: [Model] + [Gradient] + [Optimizer State]
GPU1: [Model] + [Gradient] + [Optimizer State]
-> Full replication on all GPUs
ZeRO Stage 1 (Optimizer partitioning):
GPU0: [Model] + [Gradient] + [Optimizer 1/2]
GPU1: [Model] + [Gradient] + [Optimizer 2/2]
-> ~1.5x memory savings
ZeRO Stage 2 (+ Gradient partitioning):
GPU0: [Model] + [Gradient 1/2] + [Optimizer 1/2]
GPU1: [Model] + [Gradient 2/2] + [Optimizer 2/2]
-> ~2x memory savings
ZeRO Stage 3 (+ Model partitioning):
GPU0: [Model 1/2] + [Gradient 1/2] + [Optimizer 1/2]
GPU1: [Model 2/2] + [Gradient 2/2] + [Optimizer 2/2]
-> ~Nx memory savings (N = number of GPUs)
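The savings in the diagram above can be reproduced with the ZeRO paper's mixed-precision Adam accounting: 2 bytes of FP16 parameters, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimizer state (master weights + two moments) per parameter, with each stage partitioning one more term across N GPUs. A sketch under those assumptions:

```python
# Hedged sketch: per-GPU memory under each ZeRO stage (mixed-precision Adam,
# 2B params + 2B grads + 12B optimizer states per parameter, per the ZeRO paper).

def zero_bytes_per_gpu(params: int, n_gpus: int, stage: int) -> float:
    p, g, o = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        o /= n_gpus          # Stage 1: partition optimizer states
    if stage >= 2:
        g /= n_gpus          # Stage 2: also partition gradients
    if stage >= 3:
        p /= n_gpus          # Stage 3: also partition parameters
    return p + g + o

baseline = zero_bytes_per_gpu(1_000_000, 2, 0)                       # 16 MB
assert baseline / zero_bytes_per_gpu(1_000_000, 2, 1) == 1.6         # ~1.5x on 2 GPUs
assert round(baseline / zero_bytes_per_gpu(1_000_000, 2, 2), 2) == 1.78  # ~2x
assert zero_bytes_per_gpu(1_000_000, 2, 3) == baseline / 2           # Nx
```

Note the stage-1/2 ratios grow with GPU count: on large N, Stage 1 approaches 4x and Stage 2 approaches 8x savings, since the optimizer state (12 of 16 bytes) dominates.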
Parallelism Strategy Comparison
Data Parallelism:
Split input data into N parts across N GPUs with identical models
-> AllReduce for gradient synchronization
-> Communication volume: O(model_size)
Tensor Parallelism:
Split a single layer (matrix) into N parts across N GPUs
-> Communication needed at each layer in Forward/Backward
-> Requires high-speed inter-GPU communication (NVLink)
Pipeline Parallelism:
Place model layers sequentially across N GPUs
-> Process micro-batches in pipeline fashion
-> Minimizing bubbles (idle time) is key
3D Parallelism (LLM Training):
Data Parallel x Tensor Parallel x Pipeline Parallel
Example: 256 GPUs = 32 DP x 4 TP x 2 PP
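The 256-GPU decomposition above implies a mapping from each global rank to a (DP, TP, PP) coordinate. A minimal sketch; the ordering (TP varying fastest, so TP peers land on NVLink-close GPUs within a node) is a common convention, not the only one:

```python
# Hedged sketch: decompose a global rank into (DP, TP, PP) coordinates for
# the 256-GPU = 32 DP x 4 TP x 2 PP example above.
DP, TP, PP = 32, 4, 2
assert DP * TP * PP == 256

def coords(rank: int) -> tuple[int, int, int]:
    tp = rank % TP                 # TP varies fastest: peers need NVLink-speed links
    pp = (rank // TP) % PP         # PP next: adjacent pipeline stages
    dp = rank // (TP * PP)         # DP slowest: AllReduce can cross nodes
    return dp, tp, pp

assert coords(0) == (0, 0, 0)
assert coords(5) == (0, 1, 1)
assert coords(255) == (31, 3, 1)
```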
7-2. Inference Optimization
TensorRT Optimization
# Model optimization via TensorRT (Python API)
import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
# Parse ONNX model
with open("model.onnx", "rb") as f:
parser.parse(f.read())
# Build configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30) # 1GB
config.set_flag(trt.BuilderFlag.FP16) # Enable FP16 precision
# Build engine
engine = builder.build_serialized_network(network, config)
Triton Inference Server
Triton Architecture:
Client -> HTTP/gRPC -> Triton Server
├── Model Repository
│ ├── model_a/ (TensorRT)
│ ├── model_b/ (ONNX Runtime)
│ └── model_c/ (Python Backend)
├── Scheduler
│ ├── Dynamic Batching
│ └── Sequence Batching
├── Model Ensemble
│ └── Preprocessing -> Model -> Postprocessing pipeline
└── Metrics (Prometheus)
# Triton Model Configuration (config.pbtxt)
name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 64
input [
{
name: "input"
data_type: TYPE_FP16
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP16
dims: [ 1000 ]
}
]
dynamic_batching {
preferred_batch_size: [ 16, 32, 64 ]
max_queue_delay_microseconds: 100
}
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
}
]
vLLM: LLM Inference Optimization
from vllm import LLM, SamplingParams
# Start vLLM server
llm = LLM(
model="meta-llama/Llama-3-70B",
tensor_parallel_size=4, # 4 GPU Tensor Parallel
gpu_memory_utilization=0.9, # Use 90% GPU memory
max_model_len=8192,
dtype="bfloat16",
)
# Key optimization techniques:
# 1. PagedAttention: Manages KV Cache in pages (memory efficiency)
# 2. Continuous Batching: Dynamically adds/removes requests from batches
# 3. Prefix Caching: Reuses KV Cache for common prefixes
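The PagedAttention idea in point 1 can be sketched as a per-sequence block table that maps logical token positions onto fixed-size physical KV-cache blocks, so sequences grow on demand instead of preallocating contiguous memory. The block size and the toy sequential allocator are illustrative, not vLLM's internals:

```python
# Hedged sketch: PagedAttention-style block table. Each logical block of
# BLOCK_SIZE tokens maps to one physical KV-cache block, allocated lazily.
BLOCK_SIZE = 16

class BlockTable:
    def __init__(self) -> None:
        self.blocks: list[int] = []   # physical block ids for this sequence
        self.next_free = 0            # toy allocator: hands out sequential ids

    def append_token(self, pos: int) -> None:
        if pos % BLOCK_SIZE == 0:     # first token of a new logical block
            self.blocks.append(self.next_free)
            self.next_free += 1

    def physical_slot(self, pos: int) -> tuple[int, int]:
        """Translate a token position to (physical block, offset within block)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

seq = BlockTable()
for pos in range(40):                 # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token(pos)
assert len(seq.blocks) == 3
assert seq.physical_slot(17) == (1, 1)
```

Because blocks are small and non-contiguous, freed blocks from finished requests are immediately reusable, which is what enables continuous batching without fragmentation.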
7-3. Performance Bottleneck Analysis Patterns
[Performance Diagnosis Flowchart]
1. Check nvidia-smi
├── GPU Util < 30%
│ ├── Likely CPU/IO bottleneck
│ │ ├── Check top/htop -> CPU at 100%? -> Optimize data loading
│ │ └── Check iostat -> Disk I/O? -> Use NVMe/GDS
│ └── Kernels too small -> CUDA Graphs, increase batch
├── GPU Util > 90%, performance still low
│ ├── Possibly Memory-Bound
│ │ ├── Nsight Compute -> Check Memory Throughput
│ │ └── Check Memory Coalescing patterns
│ └── Possibly Warp Divergence
│ └── Nsight Compute -> Check Warp Stall Reasons
└── GPU Util irregular (fluctuating)
├── Synchronization bottleneck -> Optimize async execution
└── Communication bottleneck (distributed) -> NCCL profiling
2. Distributed Training Bottleneck
├── Check NCCL AllReduce time
│ ├── Check NCCL regions in Nsight Systems
│ └── Analyze communication/compute ratio
├── Check InfiniBand bandwidth
│ └── ib_write_bw benchmark
└── Check GPU topology
└── nvidia-smi topo -m
8. Linux System Troubleshooting
8-1. GPU-Related Linux Commands
# GPU device information
lspci -vv -s $(lspci | grep NVIDIA | head -1 | awk '{print $1}')
# GPU driver version
cat /proc/driver/nvidia/version
# CUDA version
nvcc --version
# GPU memory usage detail
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv
# GPU memory per process
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
# PCIe bandwidth check
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
# GPU clock information
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory,clocks.max.graphics,clocks.max.memory --format=csv
# GPU-related messages in dmesg
dmesg | grep -i -E "nvidia|nvrm|gpu|xid"
# InfiniBand status check
ibstat
ibstatus
ibv_devinfo
# RDMA device check
rdma link show
rdma resource show
# NIC status
ethtool -i eth0   # driver info for the Mellanox netdev (driver: mlx5_core)
mlxlink -d /dev/mst/mt4125_pciconf0 -m
# Kernel module status
lsmod | grep nvidia
lsmod | grep mlx
lsmod | grep vfio
# NUMA topology (GPU-CPU affinity)
numactl --hardware
lstopo --of ascii
nvidia-smi topo -m
8-2. Common GPU Issues and Solutions
XID Error Interpretation
XID Errors are error codes reported by the NVIDIA GPU driver. They appear in dmesg.
| XID Code | Meaning | Severity | Response |
|---|---|---|---|
| XID 13 | Graphics Engine Exception | High | Possible CUDA kernel bug, update driver |
| XID 31 | GPU Memory Page Fault | High | Memory access error, check code |
| XID 43 | GPU stopped processing | High | GPU hang, reset needed |
| XID 45 | Preemptive cleanup | Medium | Timeout, check workload |
| XID 48 | Double Bit ECC Error | Critical | Hardware defect, RMA |
| XID 63 | ECC page retirement | Medium | Page retired, RMA if accumulated |
| XID 64 | ECC page retirement (DBE) | High | Double-bit error, consider RMA |
| XID 79 | GPU has fallen off the bus | Critical | PCIe disconnection, check hardware |
| XID 94 | Contained ECC error | Medium | ECC error within MIG instance |
| XID 95 | Uncontained ECC error | Critical | MIG isolation failure, GPU reset needed |
# Monitor XID errors
dmesg -w | grep "NVRM: Xid"
# Example output:
# NVRM: Xid (PCI:0000:41:00): 79, pid=0, GPU has fallen off the bus
# NVRM: Xid (PCI:0000:41:00): 48, pid=12345, DBE (double bit error)
GPU Reset / Driver Reload
# Attempt GPU reset on hang
nvidia-smi --gpu-reset -i 0
# Driver reload (all GPU processes must be terminated)
# 1. Check GPU-using processes
fuser -v /dev/nvidia*
# 2. Kill processes
kill -9 <pid>
# 3. Unload/load driver
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
# If that doesn't work
sudo systemctl restart nvidia-persistenced
CUDA OOM Debugging
# Debugging OOM in PyTorch
# Check memory usage
import torch
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
# Memory snapshot (detailed analysis)
torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run training code ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
# Visualize at https://pytorch.org/memory_viz
OOM Mitigation:
- Reduce batch size
- Use Gradient Accumulation
- Enable Mixed Precision (FP16/BF16)
- Gradient Checkpointing (Activation Recomputation)
- Apply DeepSpeed ZeRO Stage 2/3
- Offload model parameters to CPU/NVMe
ECC Errors and RMA Procedure
# Check ECC errors
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv
# Check Retired Pages
nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
# RMA Criteria:
# - Recurring Uncorrected (Double-bit) ECC errors
# - Retired Pages exceeding threshold (typically 60+ pages)
# - Multiple XID 48 occurrences
# - GPU fell off bus (XID 79)
Thermal Throttling Response
# Temperature monitoring
nvidia-smi --query-gpu=temperature.gpu,temperature.memory --format=csv -l 1
# Check/set power limit
nvidia-smi --query-gpu=power.limit,power.default_limit,power.max_limit --format=csv
sudo nvidia-smi -pl 300 # Set power limit to 300W
# Check clock speeds (decrease during throttling)
nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics --format=csv
# Check throttling reason
nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv
Thermal Throttling Prevention:
- Verify server room cooling capacity (400-1000W heat per GPU)
- Optimize airflow (Hot/Cold Aisle separation)
- Consider liquid cooling (DGX H100 supports liquid cooling option)
- Set power limits (performance vs temperature tradeoff)
9. Interview Questions: Top 30
GPU Architecture and CUDA (10 Questions)
Q1. Explain the internal structure of an SM (Streaming Multiprocessor) and describe how Warp Divergence affects performance.
Key Answer Points: SM consists of CUDA Cores, Tensor Cores, Warp Scheduler, Register File, Shared Memory/L1 Cache. A Warp (32 threads) executes the same instruction under the SIMT model; if/else branches cause both paths to execute sequentially, resulting in up to 2x performance degradation.
Q2. Explain the GPU memory hierarchy from Register to Global Memory, including latency and optimization strategy for each level.
Key Answer Points: Register (1 cycle) to Shared Memory (5 cycles) to L1/L2 Cache to Global Memory (HBM, 600 cycles). Use Shared Memory tiling to reduce Global Memory accesses; maximize bandwidth utilization through Memory Coalescing.
Q3. What is Memory Coalescing and why is it important?
Key Answer Points: When 32 threads in a Warp access contiguous memory, hardware merges into a single 128-byte transaction. Non-contiguous (strided) access splits into 32 separate transactions, utilizing only 1/32 of bandwidth.
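The transaction count in that answer follows from a simple model: one transaction per distinct 128-byte segment touched by the warp. A sketch of that model (the segment size matches the answer; real hardware sectors requests more finely, so this is a simplification):

```python
# Hedged sketch: count 128-byte memory transactions for a warp's 4-byte loads.
# Simplified model: one transaction per distinct 128-byte segment touched.

def transactions(addresses: list[int]) -> int:
    return len({addr // 128 for addr in addresses})

warp = range(32)
coalesced = [tid * 4 for tid in warp]     # contiguous: 32 x 4B = one 128B segment
strided = [tid * 128 for tid in warp]     # 128B stride: every lane its own segment
assert transactions(coalesced) == 1
assert transactions(strided) == 32        # only 1/32 of the fetched bytes are used
```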
Q4. Explain the Roofline Model and how to determine whether a given kernel is Memory-Bound or Compute-Bound.
Key Answer Points: Calculate Arithmetic Intensity (FLOPs/Byte) and compare with hardware's Balance Point (Peak FLOPS / Peak BW). For H100, the balance point is 295 FLOPs/Byte; below that is Memory-Bound.
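That answer's arithmetic can be checked directly. Using the H100 numbers quoted in this guide (~3.35 TB/s HBM3 and a ~295 FLOPs/Byte balance point, implying ~989 TFLOPS FP16 Tensor Core peak):

```python
# Hedged sketch: roofline classification with the H100 figures cited above.
PEAK_FLOPS = 989e12              # ~989 TFLOPS (FP16 Tensor Core, dense)
PEAK_BW = 3.35e12                # 3.35 TB/s HBM3
BALANCE = PEAK_FLOPS / PEAK_BW   # balance point in FLOPs/Byte

def classify(flops: float, bytes_moved: float) -> str:
    return "compute-bound" if flops / bytes_moved > BALANCE else "memory-bound"

assert round(BALANCE) == 295
# Elementwise FP32 add: 1 FLOP per 12 bytes (2 loads + 1 store) -> memory-bound.
assert classify(1, 12) == "memory-bound"
# Large FP16 GEMM (N=8192): 2*N^3 FLOPs over ~3*2*N^2 bytes -> compute-bound.
assert classify(2 * 8192**3, 3 * 2 * 8192**2) == "compute-bound"
```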
Q5. Explain the key differences between A100 and H100, and how H100's Transformer Engine impacts training performance.
Key Answer Points: H100 offers 4th-gen Tensor Cores (FP8), NVLink 4.0 (900GB/s), HBM3 (3.35TB/s), Transformer Engine. Transformer Engine dynamically switches between FP8 and FP16, achieving 2x throughput improvement.
Q6. Explain the CUDA Grid-Block-Thread hierarchy and the criteria for choosing Block size.
Key Answer Points: Grid is a collection of Blocks; Blocks contain Threads and execute on the same SM. Block size should be a multiple of 32 (Warp size), maximize SM Occupancy, and account for Shared Memory and Register usage. 128 or 256 is typically a good starting point.
Q7. Explain the difference between NVIDIA Nsight Systems and Nsight Compute, and their respective use scenarios.
Key Answer Points: Nsight Systems shows system-level timeline (CPU-GPU interactions, kernel launches, memory transfers). Nsight Compute provides detailed analysis of individual kernels including SM Occupancy, Memory Throughput, and Warp Stall Reasons.
Q8. What are Shared Memory Bank Conflicts and how do you avoid them?
Key Answer Points: Shared Memory has 32 banks; when threads in the same Warp access the same bank simultaneously, accesses are serialized. Avoid by adding padding (array width of 33 instead of 32) or redesigning access patterns.
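The padding trick in that answer can be verified numerically: with 4-byte words, the bank is `(linear word index) mod 32`, so a column walk through a 32-wide array hits one bank, while width 33 spreads rows across all banks:

```python
# Hedged sketch: shared-memory bank index for 4-byte words across 32 banks.

def bank(row: int, col: int, width: int) -> int:
    return (row * width + col) % 32

column0_w32 = {bank(r, 0, 32) for r in range(32)}
column0_w33 = {bank(r, 0, 33) for r in range(32)}
assert len(column0_w32) == 1    # 32-way bank conflict: every row maps to bank 0
assert len(column0_w33) == 32   # padded width 33: each row lands in its own bank
```

The cost of the padding is one wasted word per row, traded for fully parallel column access.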
Q9. Explain methods to reduce Host-Device memory transfer overhead in CUDA.
Key Answer Points: Pinned Memory (cudaMallocHost), async transfers (cudaMemcpyAsync + CUDA Streams), Zero-copy Memory (Unified Memory), overlapping transfers with computation, CUDA Graphs.
Q10. If GPU utilization is only 30%, what debugging steps would you follow?
Key Answer Points: (1) nvidia-smi for basic memory/utilization check -> (2) Nsight Systems for CPU vs GPU time ratio -> (3) Check data loading bottleneck (num_workers, prefetch) -> (4) Check kernel size (increase batch) -> (5) Check synchronization bottleneck (CUDA Graphs) -> (6) Check PCIe bottleneck.
Virtualization and Networking (10 Questions)
Q11. Compare PCI Passthrough, vGPU, and MIG, describing appropriate use scenarios for each.
Key Answer Points: PCI Passthrough = 1:1 assignment (max performance), vGPU = time-slicing (flexibility), MIG = physical partition (isolation + predictability). Large training: Passthrough; multi-tenant inference: MIG; mixed VDI: vGPU.
Q12. Explain the role of IOMMU and its importance in GPU virtualization.
Key Answer Points: IOMMU (Intel VT-d) translates device DMA requests to virtual addresses, isolating VM memory. Without it, GPUs could access other VMs' memory, creating security vulnerabilities.
Q13. Explain MIG profile configuration. When maximally partitioning an A100-80GB, what are each instance's specifications?
Key Answer Points: Up to 7 1g.10gb instances. Each has about 14 SMs, 10GB HBM2e, independent L2 Cache, separate memory controller. Must create GPU Instance (GI) first, then Compute Instance (CI) within it for CUDA use.
Q14. Explain the differences between InfiniBand and Ethernet, and why InfiniBand is important for distributed training.
Key Answer Points: InfiniBand natively supports RDMA with 0.5us latency and CPU-bypass transfer. NDR provides 400Gbps bandwidth. Distributed training's AllReduce communication exchanges hundreds of GBs, so bandwidth and latency directly impact performance.
Q15. Explain how RDMA's Zero-copy transfer works.
Key Answer Points: Application registers memory via ibverbs API, allowing NIC to directly DMA access that memory. Data transfers directly from user-space memory to NIC without kernel buffer intermediary.
Q16. What advantages does GPU Direct RDMA provide for distributed training?
Key Answer Points: Transfers directly from GPU memory to NIC, eliminating Host Memory bounce buffer. Doubles effective PCIe bandwidth utilization and reduces transfer latency.
Q17. Explain the differences between RoCE v2 and InfiniBand, and why PFC configuration is critical in RoCE v2 environments.
Key Answer Points: RoCE v2 runs RDMA over UDP/IP, leveraging existing Ethernet infrastructure. However, Ethernet is lossy by design, so PFC (Priority Flow Control) is needed to pause transmission during congestion to maintain RDMA's lossless requirement.
Q18. Explain the role of NCCL and how it works in distributed training.
Key Answer Points: NCCL optimizes collective communication (AllReduce, Broadcast, AllGather) across multiple GPUs. It auto-detects NVLink, NVSwitch, InfiniBand for optimal communication paths, using Ring-AllReduce or Tree-AllReduce algorithms.
Q19. What is SR-IOV and how is it used in GPU clusters?
Key Answer Points: SR-IOV partitions physical NICs into multiple VFs (Virtual Functions) for direct VM assignment. In GPU clusters, InfiniBand/RoCE NICs are partitioned via SR-IOV in VM environments to maintain GPU Direct RDMA performance.
Q20. Compare two methods for assigning GPUs to VMs in KubeVirt.
Key Answer Points: (1) PCI Passthrough: hostDevices for direct physical GPU assignment, max performance, 1:1 mapping. (2) vGPU: mediated devices for virtual GPU assignment, GPU sharing possible but with overhead. MIG + KubeVirt combination is also possible.
K8s and Performance Optimization (10 Questions)
Q21. Explain the components of NVIDIA GPU Operator and each component's role.
Key Answer Points: Driver (kernel module), Container Toolkit (runtime integration), Device Plugin (K8s resource registration), DCGM Exporter (metrics), MIG Manager (auto MIG configuration), GFD (node labeling), Validator (verification).
Q22. Why is Topology-aware scheduling important for GPU scheduling in K8s?
Key Answer Points: NVLink-connected GPUs communicate 6-10x faster than PCIe-connected GPUs. Random GPU allocation in distributed training forces communication through PCIe instead of NVLink, significantly degrading performance.
Q23. Explain why Gang Scheduling is essential for distributed training.
Key Answer Points: AllReduce communication requires all workers to participate. If only 3 of 4 GPUs are allocated, the 3 sit idle waiting for the 4th. All-or-nothing allocation via schedulers like Volcano/Kueue is necessary.
Q24. Explain the difference between GPU Time-Slicing and MIG from a K8s perspective.
Key Answer Points: Time-Slicing is software time-sharing with no performance isolation but works on all GPUs. MIG is physical partitioning with full isolation but only supports A100/H100. In K8s, they are requested as nvidia.com/gpu replicas and nvidia.com/mig-Xg.XXgb respectively.
Q25. Name 5 key metrics to monitor with DCGM Exporter and explain each.
Key Answer Points: GPU_UTIL (SM utilization), MEM_COPY_UTIL (memory bandwidth), FB_USED (memory usage), GPU_TEMP (temperature), XID_ERRORS (hardware errors). POWER_USAGE, ECC_SBE/DBE are also important.
Q26. Explain how Mixed Precision Training works and why Loss Scaling is necessary.
Key Answer Points: Forward Pass runs in FP16 for Tensor Core utilization, gradients computed in FP16 then applied to FP32 master weights. FP16's narrow range causes small gradients to round to zero; Loss Scaling prevents this.
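The underflow that answer describes can be demonstrated with only the standard library, since `struct`'s `'e'` format performs an IEEE binary16 round-trip. The gradient value and scale factor are illustrative:

```python
# Hedged sketch: FP16 gradient underflow and the loss-scaling fix, emulating
# FP16 storage via struct's IEEE binary16 ('e') round-trip.
import struct

def to_fp16(x: float) -> float:
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                      # below FP16's smallest subnormal (~6e-8)
assert to_fp16(grad) == 0.0      # stored gradient silently flushes to zero

scale = 1024.0                   # loss scaling: multiply loss (hence grads) up front
scaled = to_fp16(grad * scale)   # 1.024e-5 is representable in FP16
assert scaled != 0.0
assert abs(scaled / scale - grad) / grad < 0.05  # unscaling recovers the gradient
```

This is why `GradScaler` multiplies the loss before `backward()` and divides gradients back out before the optimizer step against FP32 master weights.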
Q27. Explain DeepSpeed ZeRO's 3 Stages and compare memory savings for each.
Key Answer Points: Stage 1 (Optimizer State partition, ~1.5x), Stage 2 (+Gradient partition, ~2x), Stage 3 (+Model Parameter partition, ~Nx). Stage 3 has the highest communication overhead, requiring high-speed networks like InfiniBand.
Q28. Explain how Triton Inference Server's Dynamic Batching improves inference efficiency.
Key Answer Points: Individual requests are queued and batched within a configured maximum wait time for combined processing. GPUs show higher throughput with larger batches, so dynamic batching balances latency and throughput.
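The flush policy in that answer reduces to two triggers: a preferred batch size is reached, or the oldest queued request hits the delay budget. A sketch using the `preferred_batch_size` and `max_queue_delay_microseconds` values from the config.pbtxt example earlier (the decision function itself is illustrative, not Triton's implementation):

```python
# Hedged sketch: dynamic-batching flush decision. Constants mirror the earlier
# config.pbtxt example; the policy function is a simplification of Triton's.
PREFERRED = [16, 32, 64]
MAX_DELAY_US = 100

def should_flush(queue_len: int, oldest_wait_us: float) -> bool:
    return (queue_len in PREFERRED
            or queue_len >= max(PREFERRED)
            or oldest_wait_us >= MAX_DELAY_US)

assert should_flush(32, 10)        # preferred size reached: flush immediately
assert should_flush(5, 150)        # delay budget hit: flush a small batch
assert not should_flush(5, 10)     # keep waiting for a fuller batch
```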
Q29. Describe the debugging procedure when XID 79 ("GPU has fallen off the bus") occurs.
Key Answer Points: (1) Check dmesg for surrounding logs -> (2) Check PCIe link status (lspci) -> (3) Attempt GPU reset (nvidia-smi --gpu-reset) -> (4) Check physical connection (reseat) -> (5) Test different slot -> (6) RMA if recurring.
Q30. How would you design the GPU software stack for a new 1000 H100 GPU cluster?
Key Answer Points: (1) OS: Ubuntu 22.04 + latest kernel -> (2) Driver: NVIDIA Driver 535+ -> (3) Network: InfiniBand NDR + NCCL -> (4) Container: K8s + GPU Operator + DCGM -> (5) Scheduling: Volcano (Gang) + GFD (Topology-aware) -> (6) Monitoring: DCGM Exporter + Prometheus + Grafana -> (7) Storage: GPU Direct Storage + distributed filesystem -> (8) MIG/vGPU: per-workload partitioning strategy.
10. 10-Month Study Roadmap
Month 1-2: GPU Fundamentals and CUDA Programming
Goal: Understand GPU architecture + develop CUDA programming skills
| Week | Topic | Activity |
|---|---|---|
| 1 | GPU Architecture | Read NVIDIA GPU architecture whitepapers (Ampere, Hopper) |
| 2 | CUDA Basics | Implement vector addition, matrix multiplication |
| 3 | CUDA Optimization | Shared Memory tiling, Memory Coalescing exercises |
| 4 | Advanced CUDA | Warp-level primitives, CUDA Streams |
| 5-6 | cuBLAS/cuDNN | Library usage, performance comparison |
| 7-8 | Profiling | Analyze real kernels with Nsight Systems/Compute |
Resources:
- NVIDIA CUDA Programming Guide
- "Programming Massively Parallel Processors" (David Kirk, Wen-mei Hwu)
- NVIDIA DLI (Deep Learning Institute) CUDA courses
Month 3-4: Linux Systems + GPU Drivers
Goal: Kernel/driver level understanding + GPU troubleshooting
| Week | Topic | Activity |
|---|---|---|
| 1-2 | Linux Kernel Basics | Memory management, device drivers, PCIe |
| 3-4 | GPU Drivers | NVIDIA driver installation/configuration, module structure |
| 5-6 | Troubleshooting | XID error analysis, ECC error response, dmesg analysis |
| 7-8 | Performance Tools | System analysis using perf, strace, eBPF |
Month 5-6: Virtualization (Core!)
Goal: KVM/QEMU + PCI Passthrough + vGPU + MIG hands-on
| Week | Topic | Activity |
|---|---|---|
| 1-2 | KVM/QEMU | VM creation, IOMMU setup, basic virtualization |
| 3-4 | PCI Passthrough | GPU VFIO binding, GPU assignment to VM |
| 5-6 | MIG | MIG profile configuration, performance testing |
| 7-8 | vGPU | vGPU license setup, scheduler comparison |
Month 7-8: Networking (InfiniBand/RDMA)
Goal: InfiniBand architecture + RDMA programming + NCCL tuning
| Week | Topic | Activity |
|---|---|---|
| 1-2 | InfiniBand Basics | Architecture, Subnet Manager, basic commands |
| 3-4 | RDMA | ibverbs programming, benchmarks |
| 5-6 | GPU Direct | GPU Direct RDMA/Storage hands-on |
| 7-8 | NCCL Tuning | Distributed training NCCL benchmarks, env var optimization |
Month 9-10: Kubernetes + Integration Project
Goal: K8s GPU management + large-scale cluster operations + portfolio completion
| Week | Topic | Activity |
|---|---|---|
| 1-2 | GPU Operator | Installation, configuration, MIG Manager |
| 3-4 | Scheduling | Topology-aware, Gang Scheduling (Volcano) |
| 5-6 | Monitoring | DCGM + Prometheus + Grafana dashboards |
| 7-8 | Integration Project | Complete portfolio projects + interview prep |
11. Portfolio Projects (3)
Project 1: CUDA Kernel Optimization (Matrix Multiplication Benchmark)
Goal: Naive CUDA to Shared Memory Tiling to Tensor Core utilization to cuBLAS comparison
Project Structure:
cuda-matmul-benchmark/
├── src/
│ ├── naive_matmul.cu # Naive implementation
│ ├── tiled_matmul.cu # Shared Memory tiling
│ ├── wmma_matmul.cu # Tensor Core (WMMA API)
│ └── cublas_matmul.cu # cuBLAS wrapper
├── benchmark/
│ ├── run_benchmarks.sh
│ └── plot_results.py # Result visualization
├── profiles/
│ ├── nsight_systems/
│ └── nsight_compute/
└── README.md
Key Deliverables:
- GFLOPS comparison table for each implementation
- Nsight Compute profiling results (SM Occupancy, Memory Throughput)
- Achievement rate vs cuBLAS (typically Naive: 1~5%, Tiled: 20~40%, Tensor Core: 60~80%)
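The benchmark numbers for this project come from two small formulas: square matmul performs 2*N^3 FLOPs (N^3 multiply-adds), and since every implementation does the same FLOPs, the %-of-cuBLAS figure is just the inverse time ratio. The timings below are illustrative placeholders, not measured results:

```python
# Hedged sketch: GFLOPS and %-of-cuBLAS computation for the benchmark tables.

def gflops(n: int, seconds: float) -> float:
    """Square N x N matmul: 2*N^3 floating-point operations."""
    return 2 * n**3 / seconds / 1e9

def pct_of_cublas(impl_s: float, cublas_s: float) -> float:
    return 100 * cublas_s / impl_s   # same FLOPs, so rate ratio = inverse time ratio

# Illustrative timings for N=4096 (not measured results):
assert round(gflops(4096, 0.1)) == 1374
assert pct_of_cublas(impl_s=0.5, cublas_s=0.1) == 20.0
```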
Project 2: MIG + K8s Multi-Tenant GPU Cluster
Goal: Build a GPU sharing cluster using MIG
Project Structure:
mig-k8s-multitenant/
├── infra/
│ ├── gpu-operator-values.yaml
│ ├── mig-config.yaml
│ └── monitoring/
│ ├── dcgm-dashboard.json # Grafana dashboard
│ └── alert-rules.yaml # Prometheus alerts
├── workloads/
│ ├── inference-deployment.yaml # MIG 1g.10gb inference
│ ├── training-job.yaml # MIG 3g.40gb training
│ └── notebook-statefulset.yaml # MIG 2g.20gb Jupyter
├── scheduler/
│ ├── gang-scheduling.yaml # Volcano config
│ └── priority-classes.yaml
└── docs/
├── architecture.md
└── benchmark-results.md
Key Deliverables:
- 3 workloads running simultaneously on 1 A100-80GB via MIG partitioning
- Performance isolation verification (noisy neighbor testing)
- Grafana dashboard: per-instance GPU utilization, memory, temperature
Project 3: Distributed Training Performance Profiling (NCCL + InfiniBand)
Goal: Analyze and optimize communication bottlenecks in distributed training
Project Structure:
distributed-training-profiler/
├── benchmarks/
│ ├── nccl_allreduce.sh # NCCL benchmarks
│ ├── ib_bandwidth.sh # InfiniBand bandwidth
│ └── multi_node_training.py # Actual training script
├── profiling/
│ ├── nsight_distributed.sh # Distributed env profiling
│ └── nccl_debug_analysis.py # NCCL log analysis
├── optimization/
│ ├── nccl_env_tuning.sh # NCCL env var optimization
│ └── topology_optimization.py # GPU-NIC topology optimization
└── results/
├── scaling_efficiency.png # Scaling efficiency graph
└── communication_breakdown.png # Communication time analysis
Key Deliverables:
- 2-node, 4-node, 8-node scaling efficiency measurements
- NCCL AllReduce time vs compute time ratio analysis
- Before/after comparison of env var tuning (NCCL_IB_HCA, NCCL_ALGO, etc.)
- Nsight Systems timeline showing communication/compute overlap
12. Quiz
Q1. How many FP32 CUDA Cores are in a single H100 SM, and how many total SMs does the H100 have?
Answer: A single SM has 128 FP32 CUDA Cores, and the H100 has a total of 132 SMs. Therefore, the total CUDA Core count is 128 x 132 = 16,896. For comparison, the A100 has 64 FP32 Cores per SM x 108 SMs = 6,912 total.
Q2. When maximally partitioning an A100-80GB with 1g.10gb MIG profiles, how many instances are created and how many SMs does each have?
Answer: A maximum of 7 1g.10gb instances are created. Each instance has approximately 14 SMs and 10GB HBM2e memory. Each instance has independent L2 Cache and separate memory controllers, so performance is physically isolated. To use CUDA, you must first create a GPU Instance (GI) and then create a Compute Instance (CI) within it.
Q3. What are three key differences between RDMA over InfiniBand and RoCE v2?
Answer: (1) Transport layer: InfiniBand uses its own transport protocol, while RoCE v2 operates over UDP/IP. (2) Congestion control: InfiniBand uses credit-based flow control for native lossless behavior, while RoCE v2 is Ethernet-based and requires PFC (Priority Flow Control) configuration for lossless operation. (3) Infrastructure: InfiniBand requires dedicated switches/cables, while RoCE v2 can leverage existing Ethernet switches at lower cost. InfiniBand NDR (400Gbps) generally offers higher bandwidth than RoCE (100-200Gbps).
Q4. Explain why Gang Scheduling is necessary in Kubernetes and name at least two schedulers that support it.
Answer: In distributed training, AllReduce communication requires all workers to participate for completion. If only some GPUs are allocated out of the needed total, the allocated ones sit idle waiting for the rest, wasting resources. Gang Scheduling provides all-or-nothing allocation. Schedulers supporting this include Volcano, Kueue (K8s SIG Scheduling), and YuniKorn (Apache). The default K8s scheduler (kube-scheduler) does not support Gang Scheduling.
Q5. When GPU utilization (SM Utilization) exceeds 90% but training speed is slow, name three possible causes and how to diagnose each.
Answer: (1) Memory-Bound: SMs are active but memory bandwidth is saturated. Check Memory Throughput in Nsight Compute; review Shared Memory usage and Memory Coalescing patterns. (2) Warp Divergence: Conditional branches cause sequential execution within Warps. Check Branch Efficiency metric in Nsight Compute. (3) Low SM Occupancy + high compute: Few Warps executing with high arithmetic intensity. Check Active Warps per SM and adjust Block size and Register usage. Additionally, underutilization of Tensor Cores (not using FP16/BF16) can also be a cause.
13. References
Official Documentation
- NVIDIA CUDA Programming Guide - NVIDIA Developer
- NVIDIA A100 Whitepaper - NVIDIA
- NVIDIA H100 Whitepaper - NVIDIA
- NVIDIA MIG User Guide - NVIDIA Developer
- NVIDIA Virtual GPU Software Documentation - NVIDIA
- NVIDIA GPU Operator Documentation - NVIDIA
- NVIDIA DCGM Documentation - NVIDIA
- NVIDIA NCCL Documentation - NVIDIA
Networking
- InfiniBand Architecture Specification - IBTA
- RDMA Aware Programming User Manual - NVIDIA Networking (Mellanox)
- RoCE v2 Deployment Guide - NVIDIA Networking
Kubernetes
- NVIDIA Device Plugin for Kubernetes - GitHub
- Volcano: Kubernetes Native Batch System - volcano.sh
- Kueue: Kubernetes-native Job Queueing - K8s SIG Scheduling
Training/Inference Optimization
- DeepSpeed Documentation - Microsoft
- Triton Inference Server Documentation - NVIDIA
- vLLM Documentation - vLLM Project
- Flash Attention Paper - Tri Dao et al.
Books
- "Programming Massively Parallel Processors" - David Kirk, Wen-mei Hwu
- "CUDA by Example" - Jason Sanders, Edward Kandrot
- "Computer Architecture: A Quantitative Approach" - Hennessy, Patterson
- "Understanding Linux Kernel" - Daniel P. Bovet, Marco Cesati
Community/Blogs
- NVIDIA Developer Blog - developer.nvidia.com/blog
- NVIDIA GTC Sessions (Free) - nvidia.com/gtc
- Horace He's "Making Deep Learning Go Brrrr" Blog Series
- GPU MODE Community - Discord