GPU Software Engineer Hiring Guide: Mastering System Optimization, from CUDA Architecture to vGPU/MIG, InfiniBand, and K8s GPU Scheduling

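1. GPU Software Engineer: A Rare Career
2. The JD, Line by Line
3. GPU Architecture in Depth
4. GPU Virtualization
5. High-Speed Networking: InfiniBand and RDMA
6. Kubernetes GPU Management
7. AI Workload Optimization
8. Linux System Troubleshooting
- 8-1. GPU-Related Linux Commands
- 8-2. Common GPU Issues and Fixes
9. 30 Expected Interview Questions
10. 10-Month Study Roadmap
11. Three Portfolio Projects
12. Quiz
13. References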

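1. GPU Software Engineer: A Rare Career

"People who use GPUs" vs. "people who make GPUs work"

Since 2024, the keyword running through the AI industry has been "GPU." Every company is fighting to secure GPUs, yet very few engineers can properly operate the GPUs once they have them.

One distinction matters here:

Category	People who use GPUs	People who make GPUs work
Role	ML Engineer, Researcher	GPU Software Engineer
Focus	Model accuracy, training algorithms	GPU utilization, memory bandwidth, scheduling
Tools	PyTorch, TensorFlow	nvidia-smi, Nsight, DCGM, NCCL
Typical question	"Why is this model underperforming?"	"Why is this GPU only 70% utilized?"
Abstraction level	Python API	CUDA kernels, drivers, hypervisor
Scope of action	Changing the model architecture	XID error analysis, PCIe bottleneck removal, MIG configuration

When an ML Engineer calls model.to('cuda') in PyTorch, the GPU Software Engineer understands and optimizes the path that call takes, the driver calls it passes through, and the memory region where the allocation lands.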

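Market value of this role

The supply-demand imbalance for GPU Software Engineers is severe:

Demand: As of 2025, roughly 5 million or more NVIDIA GPUs are installed in data centers worldwide. Millions more are deployed every year, and they all need system engineers to run them.
Supply: Engineers with a deep understanding of GPU system software have traditionally existed only inside NVIDIA, HPC labs, and large cloud providers. In the Korean market this talent pool is extremely limited.
Salary premium: In the US, a base salary of 250K-400K USD is common for GPU/CUDA Engineers, and in Korea a growing number of companies offer exceptional packages to specialists in this field.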

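The mission of the LG U+ GPU Technology TF

It helps to understand why LG U+ created a GPU Technology TF:

AI infrastructure business of a telco: Beyond its own AI services, LG U+ is pursuing a business that offers GPU cloud to enterprise customers.
GPU multi-tenancy: Letting several customers share a single GPU requires vGPU/MIG technology.
Leveraging network strength: As a telco, it has the capability to design InfiniBand/RoCE high-speed networks.
End-to-end optimization: The team owns the whole path, from hardware selection through virtualization and containers to AI workload onboarding.

Role	Core skills	GPU depth	Infra depth
ML Engineer	Model development, training pipelines	Low (API level)	Low
MLOps Engineer	CI/CD, model serving, pipeline automation	Medium	Medium
GPU SW Engineer	GPU architecture, virtualization, drivers	Very high	High
Infra SRE	Server/network availability, monitoring	Medium	Very high
HPC Engineer	Parallel computing, MPI, schedulers	High	High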

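2. The JD, Line by Line

This section analyzes what each item of the LG U+ GPU Software Engineer JD actually means.

Responsibilities

"GPU resource management and performance optimization"

This is more than watching nvidia-smi. Concretely:

Finding and fixing the root cause when GPU utilization or SM occupancy falls below expectations
Analyzing HBM memory bandwidth utilization and optimizing at the kernel level
Managing the trade-off between power capping and performance
Deciding on RMA and responding when ECC errors occur
Designing cluster-wide GPU allocation policies

"GPU virtualization technology development and optimization"

This is the key differentiator of the position:

vGPU profile design: which size of virtual GPU to assign to which customer
MIG partitioning strategy: configuring A100/H100 MIG profiles to fit the workload
Establishing selection criteria among PCI Passthrough, vGPU, and MIG
Building a pipeline that assigns GPUs to VMs in a KubeVirt environment

"AI/ML workload GPU onboarding and performance optimization"

Deploying the AI models that customers bring onto the GPU infrastructure efficiently:

Model profiling: analyzing GPU memory and compute requirements
Matching the right GPU type/size (A100-40GB vs A100-80GB vs H100)
Distributed training setup: NCCL communication tuning, InfiniBand usage
Inference serving: Triton Inference Server configuration, batch size tuning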

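Qualifications analysis

"Bachelor's degree or higher in Computer Science (Master's in systems/networking/OS preferred)"

The reason a Master's is preferred is clear. Most of the knowledge in this field is covered in graduate-level courses:

Operating systems: memory management, scheduling, device drivers
Computer architecture: cache hierarchy, memory models, parallel processing
Networking: RDMA, high-performance protocols

"Hands-on experience with GPU or system software"

The key phrase is "system software." That means:

Experience developing/debugging kernel modules
Interacting with device drivers
Low-level performance profiling (perf, ftrace, eBPF)
Systems programming at the C/C++ level

Required skills analysis

The following sections cover each required skill in depth.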

3. GPU Architecture in Depth

3-1. GPU Compute Structure

SM (Streaming Multiprocessor) architecture

The core compute unit of a GPU is the SM (Streaming Multiprocessor). Understanding the SM structure of modern NVIDIA GPUs is the starting point for all GPU optimization.

Components inside an SM:

SM (Streaming Multiprocessor)
├── CUDA Cores (INT32 + FP32)
│   └── H100: 128 FP32 cores per SM
├── Tensor Cores
│   └── H100: 4 fourth-generation Tensor Cores per SM (FP8 support)
├── RT Cores (ray tracing; not present on data center GPUs such as A100/H100)
├── Warp Scheduler (4)
│   └── each scheduler dispatches warps independently
├── Register File (256KB per SM in H100)
├── Shared Memory / L1 Cache (unified, 256KB per SM in H100; up to 228KB usable as shared memory)
├── Load/Store Units
├── Special Function Units (SFU)
│   └── transcendental functions such as sin, cos, exp
└── Texture Units

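Warp and the SIMT model:

The basic unit of GPU execution is the warp (32 threads). All threads in a warp execute the same instruction at the same time. This is the SIMT (Single Instruction, Multiple Threads) model.

Grid (the whole job)
├── Block 0
│   ├── Warp 0  (Thread 0~31)   → same instruction, executed together
│   ├── Warp 1  (Thread 32~63)  → same instruction, executed together
│   └── Warp N  ...
├── Block 1
│   ├── Warp 0
│   └── ...
└── Block M

Warp divergence: When threads within a warp take different branches (if/else), the hardware executes the two branches one after the other with the inactive threads masked off, which lowers throughput. This is called "warp divergence," and it is a pattern to minimize in GPU programming, especially when both branches are expensive.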

// expensive_operation_A/B are placeholder __device__ functions defined elsewhere.

// Bad: causes warp divergence
__global__ void kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // bounds check
    if (idx % 2 == 0) {
        data[idx] = expensive_operation_A(data[idx]);
    } else {
        data[idx] = expensive_operation_B(data[idx]);
    }
    // Even and odd threads in a warp take different branches → serialized
}

// Good: threads in the same warp take the same branch
// (assumes a 1D block whose size is a multiple of 32)
__global__ void kernel_optimized(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // bounds check
    int warp_id = idx / 32;
    if (warp_id % 2 == 0) {
        data[idx] = expensive_operation_A(data[idx]);
    } else {
        data[idx] = expensive_operation_B(data[idx]);
    }
    // All threads of a warp take the same branch
}
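
To make the difference between the two kernels concrete, the following Python sketch counts how many warps contain threads on both sides of a branch. It is only a host-side model, not GPU code: `divergent_warps` and its `branch_of` callback are hypothetical names introduced here, and it assumes a 1D launch where thread indices map to warps in groups of 32.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def divergent_warps(num_threads, branch_of):
    """Count warps whose threads do not all take the same branch."""
    count = 0
    for start in range(0, num_threads, WARP_SIZE):
        lanes = range(start, min(start + WARP_SIZE, num_threads))
        # More than one distinct branch outcome inside a warp means divergence.
        if len({branch_of(idx) for idx in lanes}) > 1:
            count += 1
    return count


# Branch on the thread index (like `kernel`): every warp diverges.
per_thread = divergent_warps(256, lambda idx: idx % 2 == 0)
# Branch on the warp index (like `kernel_optimized`): no warp diverges.
per_warp = divergent_warps(256, lambda idx: (idx // WARP_SIZE) % 2 == 0)
print(per_thread, per_warp)  # 8 0
```

With 256 threads (8 warps), the per-thread branch makes all 8 warps execute both paths, while the per-warp branch leaves every warp on a single path.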

Comparison across architecture generations

Feature	Ampere (A100)	Hopper (H100)	Blackwell (B200)
SMs	108	132	148
CUDA Cores	6,912	16,896	18,944
Tensor Cores	432 (3rd gen)	528 (4th gen)	592 (5th gen)
Memory	HBM2e 80GB	HBM3 80GB	HBM3e 192GB
Memory bandwidth	2.0 TB/s	3.35 TB/s	8.0 TB/s
NVLink	3.0 (600GB/s)	4.0 (900GB/s)	5.0 (1.8TB/s)
Transformer Engine	None	FP8 support	FP4 support
MIG support	Up to 7 instances	Up to 7 instances	Up to 7 instances
TDP	400W	700W	1000W
FP16 Tensor performance (dense)	312 TFLOPS	989 TFLOPS	2,250+ TFLOPS

Transformer Engine: Introduced with H100, the Transformer Engine supports FP8 precision at the hardware level. During training it dynamically chooses between FP8 and FP16 for the tensors of each layer, so FP8 tensors take half the memory of their FP16 counterparts while accuracy loss is kept to a minimum.

NVLink and NVSwitch: high-speed direct communication paths between GPUs.

NVLink topology (DGX H100):
GPU 0 ←── NVLink 4.0 (900GB/s) ──→ GPU 1
  │                                    │
  NVSwitch (fully connected)         NVSwitch
  │                                    │
GPU 2 ←── NVLink 4.0 (900GB/s) ──→ GPU 3
  │                                    │
  ...      8 GPUs, fully connected     ...
GPU 6 ←── NVLink 4.0 (900GB/s) ──→ GPU 7

Aggregate bandwidth: 8 GPUs x 900GB/s = 7.2TB/s (bidirectional)

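3-2. GPU Memory Hierarchy (Key!)

As a rule of thumb, understanding the GPU memory hierarchy is about 80% of GPU performance optimization. In practice, most GPU performance problems trace back to memory.

Memory hierarchy (fast → slow):

1. Register (fastest)
   ├── Capacity: up to 255 registers per thread (32-bit each)
   ├── Latency: ~1 cycle
   ├── Bandwidth: not a practical limit (feeds the ALUs directly)
   └── Notes: allocated automatically by the compiler; overflow spills to local memory

2. Shared Memory
   ├── Capacity: 48KB ~ 228KB per SM (configurable; 228KB is the H100 maximum)
   ├── Latency: roughly 20-30 cycles
   ├── Bandwidth: ~19TB/s (H100)
   ├── Notes: shared among threads of the same block
   └── Caution: bank conflicts are possible

3. L1 Cache
   ├── Unified with shared memory (the split is configurable)
   ├── H100: Shared Memory + L1 = 256KB per SM (up to 228KB as shared memory)
   └── Hardware-managed cache, not directly controlled by the programmer

4. L2 Cache
   ├── Capacity: H100 = 50MB (shared by all SMs)
   ├── Latency: ~200 cycles
   └── A100: 40MB, Blackwell: up to 128MB

5. Global Memory (HBM)
   ├── Capacity: 40GB ~ 192GB
   ├── Latency: roughly 600 cycles (hundreds of times slower than a register)
   ├── Bandwidth: 2.0 ~ 8.0 TB/s (depending on generation)
   └── Accessible from every thread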

Computing memory bandwidth utilization:

The key is to determine whether a workload is memory-bound or compute-bound.

Arithmetic Intensity = FLOPs / Bytes Accessed

H100:
- Peak Compute: 989 TFLOPS (FP16 Tensor)
- Peak Memory BW: 3.35 TB/s

Balance point (Roofline analysis):
  989 TFLOPS / 3.35 TB/s = 295 FLOPs/Byte

→ Arithmetic intensity below 295: memory-bound
→ Arithmetic intensity above 295: compute-bound

Examples:
- Vector addition (FP32): 1 FLOP / 12 Bytes (two 4-byte loads + one 4-byte store) = 0.08 → extremely memory-bound
- Matrix multiplication (NxN, FP32): 2N^3 FLOPs / 12N^2 Bytes = N/6 FLOPs/Byte → grows as O(N), so it becomes compute-bound for large N
- Transformer Attention: usually memory-bound (especially during inference)
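
The roofline check above is simple enough to script. The sketch below is a minimal illustration that uses the H100 peak numbers quoted in this section; the function names (`ridge_point`, `classify`, `matmul_cost`) are made up for this example, and the matrix-multiply cost assumes the ideal case where each matrix is read or written exactly once.

```python
H100_PEAK_FLOPS = 989e12  # FP16 Tensor peak, FLOP/s (figure quoted above)
H100_PEAK_BW = 3.35e12    # HBM3 peak bandwidth, bytes/s (figure quoted above)


def ridge_point(peak_flops=H100_PEAK_FLOPS, peak_bw=H100_PEAK_BW):
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_flops / peak_bw


def classify(flops, bytes_moved, peak_flops=H100_PEAK_FLOPS, peak_bw=H100_PEAK_BW):
    """Return (intensity, bound, attainable FLOP/s) for a kernel."""
    intensity = flops / bytes_moved
    attainable = min(peak_flops, intensity * peak_bw)
    bound = "memory-bound" if intensity < peak_flops / peak_bw else "compute-bound"
    return intensity, bound, attainable


def matmul_cost(n, elem_bytes=2):
    """Ideal FLOPs and bytes for an n x n matrix multiply (FP16 by default)."""
    return 2 * n**3, 3 * n * n * elem_bytes


print(round(ridge_point(), 1))            # 295.2
print(classify(1, 12)[1])                 # memory-bound (vector addition)
print(classify(*matmul_cost(4096))[1])    # compute-bound
```

A small matrix such as N = 512 in FP16 has an intensity of about 171 FLOPs/byte and still lands on the memory-bound side of the ridge.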

Memory Coalescing:

When the 32 threads of a warp access contiguous memory, the hardware merges the accesses into one large transaction. Non-contiguous accesses are split into multiple transactions and waste bandwidth.

// Good: coalesced access (contiguous)
// Thread 0 → data[0], Thread 1 → data[1], ..., Thread 31 → data[31]
__global__ void coalesced(const float* data, float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = data[idx];  // one 128-byte transaction per warp
}

// Bad: strided access (non-contiguous)
// With stride = 32: Thread 0 → data[0], Thread 1 → data[32], ..., Thread 31 → data[31*32]
__global__ void strided(const float* data, float* out, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[idx] = data[idx];  // up to 32 separate transactions per warp
}
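
The transaction counts in the comments above can be reproduced with a small model. This sketch assumes 4-byte elements and 128-byte segments, as in the example; real hardware tracks 32-byte sectors as well, so treat it as a simplification. `transactions_per_warp` is a hypothetical helper name.

```python
def transactions_per_warp(stride, elem_bytes=4, segment_bytes=128, warp_size=32):
    """Number of distinct memory segments touched by one warp-wide load."""
    addresses = [lane * stride * elem_bytes for lane in range(warp_size)]
    return len({addr // segment_bytes for addr in addresses})


print(transactions_per_warp(1))   # 1  (coalesced)
print(transactions_per_warp(2))   # 2
print(transactions_per_warp(32))  # 32 (every lane hits its own segment)
```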

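Bank Conflict:

Shared memory is divided into 32 banks. When two or more threads of the same warp access different addresses in the same bank, the accesses are serialized and performance drops. (Threads reading the same address are served by a broadcast and do not conflict.)

Shared memory bank layout (4-byte words):
Bank 0: addr 0, 128, 256, ...
Bank 1: addr 4, 132, 260, ...
Bank 2: addr 8, 136, 264, ...
...
Bank 31: addr 124, 252, 380, ...

Bank conflict example:
Thread 0 → Bank 0 (addr 0)
Thread 1 → Bank 0 (addr 128)  ← same bank, conflict
→ 2-way bank conflict: twice as slow

Workaround: add padding
__shared__ float tile[32][33];  // padded to 33 columns (instead of 32)
// Avoids bank conflicts on column access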
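
Why padding to 33 columns works can be checked by computing which bank each lane hits when a warp reads one column of the tile. The following is a minimal sketch assuming 32 banks of 4-byte words and a row-major float tile; `conflict_degree` is a hypothetical name for this illustration.

```python
from collections import Counter


def conflict_degree(row_width, column=0, num_banks=32, warp_size=32):
    """Worst-case number of lanes hitting one bank when lane i reads tile[i][column]."""
    # Word index of tile[i][column] in a row-major array is i * row_width + column.
    banks = Counter((lane * row_width + column) % num_banks for lane in range(warp_size))
    return max(banks.values())


print(conflict_degree(32))  # 32 -> all lanes hit the same bank (32-way conflict)
print(conflict_degree(33))  # 1  -> each lane hits a different bank
```

With a row width of 32, consecutive rows of the same column are exactly 32 words apart and fall into the same bank; one word of padding shifts each row by one bank.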

3-3. CUDA Programming Basics

Grid, Block, Thread hierarchy

CUDA execution model:

Grid (1)
├── Block (0,0)  ──  Block (1,0)  ──  Block (2,0)
├── Block (0,1)  ──  Block (1,1)  ──  Block (2,1)
└── Block (0,2)  ──  Block (1,2)  ──  Block (2,2)

Each Block:
├── Thread (0,0) ... Thread (15,0)
├── Thread (0,1) ... Thread (15,1)
└── Thread (0,15) ... Thread (15,15)

Limits:
- Up to 1024 threads per block
- Block dimensions: up to (1024, 1024, 64)
- Grid dimensions: up to (2^31-1, 65535, 65535)
CUDA code example: vector addition

#include <stdio.h>
#include <stdlib.h>        // malloc, free
#include <cuda_runtime.h>

// GPU kernel: each thread adds one element
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t size = n * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize inputs
    for (int i = 0; i < n; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy host → device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel (round the grid size up so every element is covered)
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);

    // Catch launch errors (bad configuration, missing device, ...)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy device → host (this also waits for the kernel to finish)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify
    printf("C[0] = %f (expected 3.0)\n", h_C[0]);

    // Free memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

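CUDA code example: matrix multiplication (using shared memory)

// Launch with blockDim = (TILE_SIZE, TILE_SIZE), i.e. 1024 threads per block
#define TILE_SIZE 32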

__global__ void matMul(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load one tile from global memory into shared memory
        if (row < N && t * TILE_SIZE + threadIdx.x < N)
            tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // wait until every thread in the block has loaded its element

        for (int k = 0; k < TILE_SIZE; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];

        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}

Key CUDA libraries

Library	Purpose	Key API
cuBLAS	Linear algebra (matrix operations)	cublasSgemm, cublasGemmEx
cuDNN	Deep learning primitives	cudnnConvolutionForward
cuFFT	Fast Fourier Transform	cufftExecC2C
cuSPARSE	Sparse matrix operations	cusparseSpMV
Thrust	C++ parallel algorithms (STL-like)	thrust::sort, thrust::reduce
CUTLASS	GEMM customization	Template-based GEMM

3-4. GPU Profiling and Performance Analysis

Using nvidia-smi in detail

# Basic status
nvidia-smi

# Monitor at 1-second intervals
nvidia-smi dmon -s pucvmet -d 1

# Per-process GPU details
nvidia-smi pmon -d 1

# Query format (useful in scripts)
nvidia-smi --query-gpu=timestamp,gpu_bus_id,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 1

# MIG status
nvidia-smi mig -lgi
nvidia-smi mig -lci

# GPU topology (NVLink connectivity)
nvidia-smi topo -m

# ECC errors
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv

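NVIDIA Nsight Systems (system-level profiling)

# Profile the whole application
nsys profile --stats=true -o report python train.py

# Trace CUDA API calls + GPU kernels + memory transfers
nsys profile --trace=cuda,nvtx,osrt -o detailed_report python train.py

# Visualize the result (GUI)
nsys-ui report.nsys-rep

Nsight Systems provides a timeline view that lets you:

Locate synchronization points between CPU and GPU
Check whether kernel execution overlaps with memory transfers
Identify CPU bottlenecks (data loading, preprocessing)
Measure NCCL communication time (distributed training)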

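NVIDIA Nsight Compute (kernel-level analysis)

# Detailed analysis of every kernel, including child processes
ncu --target-processes all --set full -o kernel_report python train.py

# Profile only selected kernels (regex: matches kernels whose name contains the pattern)
ncu --kernel-name regex:volta_sgemm --launch-count 10 -o sgemm_report ./my_app

# Collect specific metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_active,dram__throughput.avg.pct_of_peak_sustained_active ./my_app

Key metrics:

SM Occupancy: ratio of active warps to the maximum an SM can hold (higher usually helps hide latency; 50% or more is a common target)
Compute Throughput: compute throughput (% of peak)
Memory Throughput: memory bandwidth utilization (% of peak)
Warp Stall Reasons: why warps are waiting (memory, synchronization, and so on)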
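
Occupancy can be estimated before profiling from a kernel's resource usage. The sketch below is a simplified model, not the CUDA occupancy calculator: it assumes H100-like per-SM limits (65,536 registers, 228KB of shared memory, 64 resident warps, 32 resident blocks) and ignores allocation granularity. `occupancy` and its parameters are names introduced for this example.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              regs_per_sm=65536, smem_per_sm=228 * 1024,
              max_warps=64, max_blocks=32):
    """Estimate theoretical occupancy (active warps / max warps) for one SM."""
    warps_per_block = -(-threads_per_block // 32)  # ceiling division
    # How many blocks fit under each per-SM resource limit.
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * 32)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    by_warps = max_warps // warps_per_block
    blocks = min(by_regs, by_smem, by_warps, max_blocks)
    return blocks * warps_per_block / max_warps


print(occupancy(256, 32, 0))          # 1.0   (no resource is limiting)
print(occupancy(256, 128, 0))         # 0.25  (register-limited: 2 blocks per SM)
print(occupancy(256, 32, 64 * 1024))  # 0.375 (shared-memory-limited: 3 blocks per SM)
```

Nsight Compute reports both this theoretical value and the achieved occupancy, so a large gap between the two points at scheduling or load-balance issues rather than resource limits.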

DCGM (Data Center GPU Manager)

nvidia-smi alone is not enough to monitor a large cluster. DCGM fills that gap:

# Start DCGM
sudo systemctl start nvidia-dcgm

# Health check
dcgmi health -g 0 -c

# Run diagnostics (level 3: long-running, detailed tests)
dcgmi diag -r 3 -g 0

# Export metrics (Prometheus integration)
dcgm-exporter &
# Prometheus scrapes http://localhost:9400/metrics

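Analysis pattern for low GPU utilization

Low GPU utilization (<50%)
├── CPU bottleneck?
│   ├── Slow data loading → increase num_workers, prefetch
│   ├── Heavy preprocessing → use DALI (GPU preprocessing)
│   └── Python GIL → multiprocessing
├── Memory transfer bottleneck?
│   ├── PCIe bandwidth saturated → use GPU Direct
│   └── Unnecessary CPU-GPU copies → use pinned memory
├── Small kernels + large overhead?
│   ├── Kernel launch overhead → use CUDA Graphs
│   └── Too much synchronization → optimize asynchronous execution
├── Batch size too small?
│   └── GPU is not filled → increase batch size or use gradient accumulation
└── Communication overhead? (distributed training)
    ├── Long AllReduce time → NCCL tuning
    └── Network bottleneck → check InfiniBand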

4. GPU Virtualization

4-1. Virtualization Basics

Type 1 vs Type 2 Hypervisor

Type 1 (Bare-metal):           Type 2 (Hosted):
┌─────────────────┐            ┌─────────────────┐
│   VM1    VM2    │            │   VM1    VM2    │
│ ┌─────┐┌─────┐ │            │ ┌─────┐┌─────┐ │
│ │Guest││Guest│ │            │ │Guest││Guest│ │
│ │ OS  ││ OS  │ │            │ │ OS  ││ OS  │ │
│ └─────┘└─────┘ │            │ └─────┘└─────┘ │
│  Hypervisor     │            │  Hypervisor     │
│  (ESXi, KVM)   │            │  (VirtualBox)   │
│  Hardware       │            │  Host OS        │
└─────────────────┘            │  Hardware       │
                               └─────────────────┘

In the LG U+ environment, KVM is the key component. KVM runs as a Linux kernel module and is generally classified as a Type 1 hypervisor; it uses QEMU as the user-space emulator.

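IOMMU (Intel VT-d / AMD-Vi)

The IOMMU is a hardware feature that GPU virtualization depends on:

Without IOMMU (unsafe):
Device assigned to VM → DMA with raw addresses → physical memory (direct access → can reach another VM's memory)

With IOMMU (safe):
Device assigned to VM → DMA with I/O virtual addresses → IOMMU translation → (physical address, isolated)
                     └── DMA requests are isolated too!

Checking that the IOMMU is enabled:

# List IOMMU groups
find /sys/kernel/iommu_groups/ -type l

# Check kernel boot parameters
grep iommu /proc/cmdline
# Expect intel_iommu=on on Intel hosts (on AMD hosts the IOMMU is usually active once enabled in firmware)

# List devices per IOMMU group
for g in /sys/kernel/iommu_groups/*/devices/*; do
    echo "IOMMU Group $(basename "$(dirname "$(dirname "$g")")"):"
    lspci -nns "$(basename "$g")"
done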

4-2. PCI Passthrough

PCI Passthrough is the most basic approach: a physical GPU is assigned directly to a VM.

PCI Passthrough architecture:

Host (Linux + KVM)
├── GPU 0 → bound to VFIO driver → VM1 (direct access)
├── GPU 1 → bound to VFIO driver → VM2 (direct access)
├── GPU 2 → NVIDIA driver → used by host
└── GPU 3 → NVIDIA driver → used by host

Setup steps:

# 1. Enable the IOMMU (GRUB), then regenerate the GRUB config and reboot
# Add to /etc/default/grub:
# GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

# 2. Find the GPU's PCI ID
lspci -nn | grep NVIDIA
# 41:00.0 3D controller [0302]: NVIDIA Corporation A100 [10de:20b2]

# 3. Bind the device to the VFIO driver (run as root)
modprobe vfio-pci
echo vfio-pci > /sys/bus/pci/devices/0000:41:00.0/driver_override
echo 0000:41:00.0 > /sys/bus/pci/devices/0000:41:00.0/driver/unbind
echo 0000:41:00.0 > /sys/bus/pci/drivers_probe

# 4. Add the device when starting the VM with QEMU/KVM
# -device vfio-pci,host=41:00.0

Pros and cons:

Pros	Cons
Native performance (almost no overhead)	1 GPU = 1 VM (no sharing)
Simple setup	No live migration
All CUDA features supported	GPU resources may be wasted

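4-3. vGPU (NVIDIA Virtual GPU)

vGPU lets several VMs share one physical GPU. Framebuffer memory is statically partitioned per vGPU, while compute is shared by time-slicing.

vGPU architecture:

Physical GPU (A100-80GB)
├── vGPU Instance 1 (A100D-4C, 4GB) → VM1
├── vGPU Instance 2 (A100D-4C, 4GB) → VM2
├── vGPU Instance 3 (A100D-8C, 8GB) → VM3
└── ... (more can be added while capacity remains; mixing profile sizes on one GPU requires a vGPU release with mixed-size mode)

Time-slicing (illustrative slice lengths):
t=0ms  [VM1 runs] → t=16ms [VM2 runs] → t=32ms [VM3 runs] → ...

vGPU profile types

Series	Purpose	Example
A-series	Virtual Applications (vApps)	A40-8A (8GB, application streaming)
B-series	Virtual PC (vPC)	A40-2B (2GB, VDI desktop)
C-series	Compute	A100-4-20C (20GB, AI compute)
Q-series	RTX Virtual Workstation (formerly Quadro vDWS)	A40-24Q (24GB, professional graphics)

Compute-only GPUs such as A100 and H100 support only C-series profiles, so C-series (Compute) will be the mainstay for the LG U+ GPU Technology TF.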

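vGPU Scheduler

# vGPU scheduler types
Equal Share:
  - Every running vGPU gets the same share of GPU time
  - Fair, but a vGPU's share shrinks as more vGPUs start on the same GPU

Fixed Share:
  - Each vGPU gets a fixed share determined by its vGPU type (the maximum number of that type per physical GPU)
  - The share does not change as other vGPUs start or stop, so performance is the most predictable

Best Effort:
  - Time slices of idle vGPUs are redistributed to active vGPUs
  - Most efficient, but performance is hard to predict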

4-4. MIG (Multi-Instance GPU)

MIG is available on data center GPUs such as A100, A30, and H100 and partitions the GPU at the hardware level. Unlike vGPU time-slicing, MIG gives each instance its own dedicated SMs and memory.

MIG architecture (A100-80GB):

Whole GPU: 108 SM, 80GB HBM2e
├── MIG Instance 1 (7g.80gb): 98 SM, 80GB  ← almost the whole GPU (single tenant)
or
├── MIG Instance 1 (4g.40gb): 56 SM, 40GB
├── MIG Instance 2 (3g.40gb): 42 SM, 40GB
or
├── MIG Instance 1 (3g.40gb): 42 SM, 40GB
├── MIG Instance 2 (2g.20gb): 28 SM, 20GB
├── MIG Instance 3 (1g.10gb): 14 SM, 10GB
├── MIG Instance 4 (1g.10gb): 14 SM, 10GB
or (maximum split)
├── MIG Instance 1~7 (1g.10gb): 14 SM and 10GB each (x7)
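
Each layout above stays within a budget of 7 compute slices and 8 memory slices. The sketch below checks that budget for a proposed A100-80GB configuration. It is a simplification: the profile table covers only the profiles named in this section, and real MIG also enforces placement rules that this check does not model. `fits_a100_80gb` is a hypothetical helper.

```python
# (compute slices, memory slices) per profile on A100-80GB.
A100_80GB_PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}


def fits_a100_80gb(profiles):
    """Return True if the requested profiles fit the 7-compute / 8-memory slice budget."""
    compute = sum(A100_80GB_PROFILES[p][0] for p in profiles)
    memory = sum(A100_80GB_PROFILES[p][1] for p in profiles)
    return compute <= 7 and memory <= 8


print(fits_a100_80gb(["4g.40gb", "3g.40gb"]))                        # True
print(fits_a100_80gb(["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]))  # True
print(fits_a100_80gb(["3g.40gb", "2g.20gb", "2g.20gb", "2g.20gb"]))  # False (9 compute slices)
```

A check like this is useful as a first filter in an admission webhook or provisioning script; the authoritative answer still comes from `nvidia-smi mig -lgip` and the driver itself.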

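MIG configuration commands

# Enable MIG mode
sudo nvidia-smi -i 0 -mig 1

# List available MIG profiles (and their profile IDs)
nvidia-smi mig -lgip

# Create GPU instances (profile IDs on A100-80GB: 9 = 3g.40gb, 19 = 1g.10gb)
sudo nvidia-smi mig -i 0 -cgi 9,19,19,19  # 3g.40gb + 1g.10gb x3

# Create compute instances
sudo nvidia-smi mig -i 0 -cci

# Check the current MIG state
nvidia-smi mig -lgi
nvidia-smi mig -lci

# Delete MIG instances (compute instances first, then GPU instances)
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi

# Disable MIG mode
sudo nvidia-smi -i 0 -mig 0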

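MIG vs vGPU

Feature	MIG	vGPU
Isolation level	Hardware (dedicated SMs + memory)	Time-slicing (software isolation)
Performance predictability	Consistent (dedicated resources)	Variable (affected by other VMs)
Max instances	7 (A100/H100)	Many, within the GPU memory limit
Supported GPUs	A100, H100, A30	Most data center GPUs
Flexibility	Fixed profiles (reconfigure to change)	Dynamic allocation possible
Licensing	No additional license required	vGPU license required
Use cases	Inference serving, small-scale training	VDI, mixed workloads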

Using MIG on K8s: NVIDIA MIG Manager

# MIG configuration ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - device-filter: ["0x20B210DE"]
          devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      mixed-config:
        - device-filter: ["0x20B210DE"]
          devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 4

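4-5. SR-IOV (NIC Virtualization)

SR-IOV virtualizes a NIC so that it can be assigned directly to VMs. It matters when combined with GPU Direct RDMA.

SR-IOV structure:

Physical NIC (ConnectX-7)
├── PF (Physical Function): managed by the host driver
├── VF 0 (Virtual Function) → VM1 (direct assignment, native performance)
├── VF 1 → VM2
├── VF 2 → VM3
└── ... (up to 128 VFs, depending on NIC model and firmware settings)

Advantages:
- VMs access the NIC directly, without a virtual bridge
- Near-native network performance
- Minimal CPU overhead

Combined with GPU Direct RDMA:
GPU in VM ←→ VF (SR-IOV NIC) ←→ InfiniBand ←→ remote GPU
  (PCIe direct)  (SR-IOV bypass)   (RDMA)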

4-6. KubeVirt

KubeVirt manages VMs as first-class resources on top of Kubernetes. It is the key technology when containers and VMs must run on the same platform.

# GPU PCI Passthrough to a KubeVirt VM
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - name: gpu
              deviceName: nvidia.com/A100
        resources:
          requests:
            memory: '32Gi'
            cpu: '8'
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/ubuntu:22.04

KubeVirt + vGPU:

# Assigning a vGPU to a KubeVirt VM
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vgpu-vm
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            - name: vgpu
              deviceName: nvidia.com/NVIDIA_A100-4C
        resources:
          requests:
            memory: '16Gi'

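Use cases:

Legacy VM workloads: migrating existing VM-based AI workloads to K8s
Mixed environments: running containers and VMs side by side in the same K8s cluster
GPU sharing: assigning GPUs flexibly to VMs and containers through vGPU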

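5. High-Speed Networking: InfiniBand and RDMA

5-1. InfiniBand Architecture

The performance of distributed GPU training is often bounded by the network. However fast the GPUs are, slow GPU-to-GPU communication drags down overall performance.

InfiniBand vs Ethernet

Feature	InfiniBand NDR	RoCE v2 (100GbE)	TCP/IP (100GbE)
Bandwidth	400 Gbps	100 Gbps	100 Gbps
Latency	0.5us	1~2us	10~50us
RDMA support	Native	RoCE v2	None (goes through the kernel)
CPU overhead	Almost none	Low	High
Congestion control	Credit-based	PFC/ECN	TCP congestion control
Cost	Very high	Medium	Low
Typical use	HPC, AI training	AI training (cloud)	General workloads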

InfiniBand generations

InfiniBand speed evolution (4x link rates):
SDR  (2001):  10 Gbps
DDR  (2005):  20 Gbps
QDR  (2008):  40 Gbps
FDR  (2011):  56 Gbps
EDR  (2014): 100 Gbps
HDR  (2018): 200 Gbps
NDR  (2022): 400 Gbps
XDR  (2024): 800 Gbps
GDR  (2026): 1.6 Tbps (planned)

InfiniBand network components

InfiniBand fabric structure:

Leaf Switch (ToR)
├── HCA (Host Channel Adapter) ─ Server 1 [GPU 0~7]
├── HCA ─ Server 2 [GPU 0~7]
├── HCA ─ Server 3 [GPU 0~7]
└── HCA ─ Server 4 [GPU 0~7]

Spine Switch
├── Leaf Switch 1
├── Leaf Switch 2
├── Leaf Switch 3
└── Leaf Switch 4

Management elements:
- Subnet Manager (OpenSM): assigns LIDs, manages routing tables
- LID (Local ID): address within a subnet (16-bit)
- GID (Global ID): global address (128-bit, similar to IPv6)
- GUID (Globally Unique ID): unique hardware identifier (64-bit)

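5-2. RDMA (Remote Direct Memory Access)

RDMA lets a host read or write remote memory directly, without involving the remote CPU or the kernel network stack. It is central to distributed GPU training.

TCP/IP transfer (traditional):
App → Socket API → TCP/IP Stack (kernel) → NIC Driver → NIC → network
                   ↑ CPU involved (copies, checksums, segmentation)

RDMA transfer:
App → RDMA Verbs → NIC (direct) → network
       ↑ Zero-copy, CPU bypass

RDMA transport types

Transport	Description	Use case
InfiniBand	Native RDMA	HPC, AI clusters
RoCE v2	RDMA over UDP/IP	Cloud environments
iWARP	RDMA over TCP/IP	Legacy environments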

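RDMA programming basics

// RDMA Write with ibverbs (simplified: error handling and QP state transitions omitted)
// buf/size describe the local buffer; remote_addr/remote_key are exchanged out of band (e.g. over TCP)
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

// 1. Open the device (first device in the list)
struct ibv_device **dev_list = ibv_get_device_list(NULL);
struct ibv_context *ctx = ibv_open_device(dev_list[0]);

// 2. Create a Protection Domain
struct ibv_pd *pd = ibv_alloc_pd(ctx);

// 3. Register memory (so the NIC can access it directly)
struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

// 4. Create a Completion Queue and a Queue Pair
struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
struct ibv_qp_init_attr qp_init_attr = {
    .send_cq = cq, .recv_cq = cq, .qp_type = IBV_QPT_RC,
    .cap = { .max_send_wr = 16, .max_recv_wr = 16, .max_send_sge = 1, .max_recv_sge = 1 },
};
struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr);
// The QP must be moved INIT → RTR → RTS with ibv_modify_qp() before posting work.

// 5. RDMA Write (write directly into remote memory)
struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)size, .lkey = mr->lkey };
struct ibv_send_wr wr, *bad_wr = NULL;
memset(&wr, 0, sizeof(wr));
wr.sg_list = &sge;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_addr;
wr.wr.rdma.rkey = remote_key;
ibv_post_send(qp, &wr, &bad_wr);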

5-3. GPU Direct

GPU Direct RDMA

With GPU Direct RDMA, data can be sent directly from GPU memory to the memory of a remote GPU.

Regular path (without GPU Direct):
GPU0 → PCIe → Host Memory → NIC → network → NIC → Host Memory → PCIe → GPU1
       (copy 1)            (copy 2)        (copy 3)             (copy 4)

GPU Direct RDMA:
GPU0 → PCIe → NIC → network → NIC → PCIe → GPU1
       (direct)                     (direct)
CPU bypass, half as many copies

GPU Direct Storage (GDS)

Regular storage access:
NVMe → Host Memory (bounce buffer) → GPU Memory
       CPU involved, 2 copies

GPU Direct Storage:
NVMe → GPU Memory (direct)
       CPU bypass, 1 copy

Use cases: loading large datasets (checkpoint restore, data preprocessing)

NCCL + InfiniBand

# NCCL environment variables (distributed training)
export NCCL_IB_HCA=mlx5_0,mlx5_1  # InfiniBand HCAs to use
export NCCL_IB_GID_INDEX=3         # GID index for RoCE v2 (not needed on native InfiniBand)
export NCCL_SOCKET_IFNAME=eth0     # interface for the control channel
export NCCL_DEBUG=INFO             # debug logging

# NCCL topology file (optimizes GPU-NIC mapping)
export NCCL_TOPO_FILE=/path/to/topo.xml

# NCCL AllReduce benchmark
/usr/local/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8

5-4. Network Performance Tuning

InfiniBand benchmarks

# Bandwidth test
# Server: ib_write_bw --size=65536
# Client: ib_write_bw --size=65536 <server_ip>

# Latency test
# Server: ib_write_lat
# Client: ib_write_lat <server_ip>

# Example results (NDR 400Gbps):
# Bandwidth: ~48 GB/s (theoretical 50 GB/s)
# Latency: ~0.6 us

PFC (Priority Flow Control) configuration

Lossless RoCE v2 deployments generally require PFC:

# PFC on a Mellanox NIC
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0
# Enable PFC on priority 3 only (RoCE traffic)

# DSCP → priority mapping
mlnx_qos -i eth0 --trust dscp
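ECMP (Equal-Cost Multi-Path) routing

Multipath routing in a large leaf-spine fabric (Ethernet/RoCE fabrics use ECMP; InfiniBand fabrics get the equivalent effect from Subnet Manager routing and Adaptive Routing):

Server A ─── Leaf 1 ─┬─ Spine 1 ─┬─ Leaf 3 ─── Server C
                      ├─ Spine 2 ─┤
                      ├─ Spine 3 ─┤
                      └─ Spine 4 ─┘

→ Load is balanced across 4 equal-cost paths
→ Hash-based distribution (on source/destination identifiers such as LIDs)
→ Adaptive Routing (AR): paths change dynamically based on congestion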
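
Hash-based path selection can be sketched in a few lines. The example below is only an illustration of the idea: it hashes a hypothetical (source, destination, flow id) key with CRC32 and maps it onto one of the spines, so a given flow always follows the same path while different flows can land on different spines. Real switches use their own hash functions and fields.

```python
import zlib

SPINES = ["spine1", "spine2", "spine3", "spine4"]


def pick_path(src, dst, flow_id, paths=SPINES):
    """Map a flow to one equal-cost path using a deterministic hash."""
    key = f"{src}-{dst}-{flow_id}".encode()
    return paths[zlib.crc32(key) % len(paths)]


# The same flow always takes the same path (no packet reordering within a flow).
first = pick_path("serverA", "serverC", 7)
second = pick_path("serverA", "serverC", 7)
print(first == second)  # True
```

The weakness of static hashing is visible here as well: a few large flows (typical for AllReduce traffic) can hash onto the same spine, which is the case Adaptive Routing is meant to handle.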

6. Kubernetes GPU Management

6-1. NVIDIA GPU Operator

The GPU Operator automatically deploys the GPU software stack onto a K8s cluster.

GPU Operator components:

GPU Operator
├── NVIDIA Driver (DaemonSet)
│   └── builds/installs the kernel module automatically
├── NVIDIA Container Toolkit
│   └── adds GPU support to the container runtime
├── NVIDIA Device Plugin
│   └── registers GPU resources with K8s
├── DCGM Exporter
│   └── GPU metrics → Prometheus
├── MIG Manager
│   └── applies MIG profiles automatically
├── GPU Feature Discovery (GFD)
│   └── adds GPU labels to nodes automatically
└── NVIDIA Validator
    └── validates the installation

Installation:

# Install the GPU Operator with Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set mig.strategy=mixed \
  --set dcgmExporter.enabled=true

6-2. GPU Device Plugin

# Assigning GPUs to a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.3.0-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 2 # request 2 GPUs
      command: ['nvidia-smi']

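Time-Slicing configuration (GPU sharing)

On GPUs that do not support MIG, multiple Pods can still share a GPU. Note that time-slicing provides no memory or fault isolation between the Pods:

# GPU Time-Slicing ConfigMap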
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # expose 1 GPU as 4 schedulable replicas (time-sliced)

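6-3. GPU Scheduling

Basic scheduling

Default GPU scheduling in K8s is simple: a Pod is placed on a node that has as many free GPUs as it requests. Large GPU clusters need more sophisticated scheduling.

Topology-Aware Scheduling

GPU topology example (PCIe server with NVLink bridges between GPU pairs):
GPU0 ─ NVLink ─ GPU1 (same NVLink domain)
GPU2 ─ NVLink ─ GPU3 (same NVLink domain)
GPU4 ─ NVLink ─ GPU5 (same NVLink domain)
GPU6 ─ NVLink ─ GPU7 (same NVLink domain)

GPU0 ─ PCIe ─ GPU4 (different domains, PCIe only)

→ For 4-GPU training: GPU0,1,2,3 (two NVLink pairs) >> GPU0,2,4,6 (PCIe only)
→ On DGX H100, all 8 GPUs are fully connected through NVSwitch, so placement inside a node matters less there than on bridged PCIe servers.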
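
The preference for NVLink-connected GPUs can be expressed as a scoring function. This sketch assumes the pairwise topology drawn above, where only (0,1), (2,3), (4,5), and (6,7) are linked by NVLink; `nvlink_links` and `best_allocation` are hypothetical helpers, and production schedulers use richer topology data (for example from `nvidia-smi topo -m`).

```python
from itertools import combinations

# NVLink-connected pairs in the example topology (an assumption of this sketch).
NVLINK_PAIRS = {(0, 1), (2, 3), (4, 5), (6, 7)}


def nvlink_links(gpus):
    """Count GPU pairs inside the set that are directly connected by NVLink."""
    return sum(1 for pair in combinations(sorted(gpus), 2) if pair in NVLINK_PAIRS)


def best_allocation(free_gpus, k):
    """Pick the k free GPUs with the most NVLink links among them."""
    return max(combinations(sorted(free_gpus), k), key=nvlink_links)


print(nvlink_links([0, 1, 2, 3]))        # 2
print(nvlink_links([0, 2, 4, 6]))        # 0
print(best_allocation([0, 2, 3, 5], 2))  # (2, 3)
```

Brute-force enumeration is fine for 8 GPUs per node; the point is that the scheduler ranks candidate sets by interconnect quality instead of taking the first free GPUs.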

# NodeSelector for topology-aware scheduling (selects the node type; which GPUs are used inside the node is decided by the device plugin)
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
spec:
  nodeSelector:
    nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
    nvidia.com/gpu.count: '8'
  containers:
    - name: trainer
      resources:
        limits:
          nvidia.com/gpu: 4

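Gang Scheduling

In distributed training, all GPUs must be allocated at the same time. If only some are allocated, they sit idle while waiting for the rest, and resources are wasted.

# Gang scheduling with Volcano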
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 4 # schedule at least 4 Pods together, or none
  schedulerName: volcano
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: training-image:latest
              resources:
                limits:
                  nvidia.com/gpu: 8 # 8 GPUs per node
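
The all-or-nothing rule behind gang scheduling can be illustrated with a small placement function. This is a toy model, not Volcano's algorithm: `gang_schedule` is a hypothetical name, and it simply places each Pod on the first node with enough free GPUs, returning nothing at all if any Pod of the gang cannot be placed.

```python
def gang_schedule(free_gpus_per_node, pods, gpus_per_pod):
    """Return a node name per Pod, or None if the whole gang cannot be placed."""
    remaining = dict(free_gpus_per_node)  # work on a copy; commit only on success
    placement = []
    for _ in range(pods):
        node = next((n for n, free in remaining.items() if free >= gpus_per_pod), None)
        if node is None:
            return None  # all-or-nothing: do not hold a partial allocation
        remaining[node] -= gpus_per_pod
        placement.append(node)
    return placement


print(gang_schedule({"n1": 8, "n2": 8, "n3": 8}, pods=4, gpus_per_pod=8))
# None
print(gang_schedule({"n1": 8, "n2": 8, "n3": 8, "n4": 8}, pods=4, gpus_per_pod=8))
# ['n1', 'n2', 'n3', 'n4']
```

With only three free nodes, a default scheduler would start three workers that then block 24 GPUs while waiting for the fourth; the gang scheduler leaves those GPUs available to other jobs.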

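Bin-packing vs Spread strategies

Bin-packing (dense placement):
Node1: [GPU0 used, GPU1 used, GPU2 used, GPU3 free]
Node2: [GPU0 free, GPU1 free, GPU2 free, GPU3 free]
→ Pros: idle nodes can be powered down, better resource efficiency
→ Cons: hotspots may form

Spread (distributed placement):
Node1: [GPU0 used, GPU1 free, GPU2 used, GPU3 free]
Node2: [GPU0 used, GPU1 free, GPU2 used, GPU3 free]
→ Pros: load balancing, fault isolation
→ Cons: resource fragmentation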
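
The two strategies differ only in how candidate nodes are ranked. The sketch below is a minimal illustration with a hypothetical `choose_node` function: bin-packing prefers the node with the fewest free GPUs that still fits the request, while spread prefers the node with the most free GPUs.

```python
def choose_node(free_gpus, request, strategy):
    """Pick a node for a GPU request using 'binpack' or 'spread' ranking."""
    candidates = {node: free for node, free in free_gpus.items() if free >= request}
    if not candidates:
        return None  # no node can satisfy the request
    if strategy == "binpack":
        return min(candidates, key=candidates.get)  # fill the fullest node first
    return max(candidates, key=candidates.get)      # use the emptiest node first


free = {"node1": 1, "node2": 4}
print(choose_node(free, 1, "binpack"))  # node1
print(choose_node(free, 1, "spread"))   # node2
print(choose_node(free, 2, "binpack"))  # node2 (node1 no longer fits)
```

Bin-packing keeps node2 whole for a future 4-GPU job, which is why training clusters that run large multi-GPU jobs usually favor it.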

GPU Feature Discovery (GFD)

# Examples of node labels added by GFD
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.count=8
nvidia.com/gpu.memory=81920
nvidia.com/cuda.driver.major=535
nvidia.com/mig.strategy=mixed
nvidia.com/gpu.family=ampere
nvidia.com/mig-1g.10gb.count=4
nvidia.com/mig-3g.40gb.count=1

6-4. GPU Monitoring on K8s

DCGM Exporter + Prometheus + Grafana

# DCGM Exporter DaemonSet (included in the GPU Operator)
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Key Prometheus metrics:

# GPU utilization
DCGM_FI_DEV_GPU_UTIL             # GPU utilization (% of time a kernel was running)
DCGM_FI_DEV_MEM_COPY_UTIL        # Memory utilization (% of time memory was being read or written)

# Memory
DCGM_FI_DEV_FB_USED              # Framebuffer in use (MB)
DCGM_FI_DEV_FB_FREE              # Framebuffer free (MB)

# Temperature/power
DCGM_FI_DEV_GPU_TEMP             # GPU temperature (C)
DCGM_FI_DEV_POWER_USAGE          # Power draw (W)

# Errors
DCGM_FI_DEV_XID_ERRORS           # Code of the last XID error
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL    # Single-bit ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL    # Double-bit ECC errors

# PCIe
DCGM_FI_DEV_PCIE_TX_THROUGHPUT   # PCIe transmit throughput
DCGM_FI_DEV_PCIE_RX_THROUGHPUT   # PCIe receive throughput

Alert configuration example:

# Prometheus Alert Rules
groups:
  - name: gpu-alerts
    rules:
      - alert: GPUMemoryAlmostFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'GPU memory usage above 95%'

      - alert: GPUThermalThrottling
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'GPU temperature exceeds 85C'

      - alert: GPUXIDError
        # The metric is a gauge holding the last XID code, so compare it directly instead of using increase()
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels:
          severity: critical
        annotations:
          summary: 'GPU XID error detected'

7. AI Workload Optimization

7-1. Training Optimization

Mixed Precision Training

# PyTorch Automatic Mixed Precision (AMP)
# MyModel, criterion, and dataloader are assumed to be defined elsewhere.
import torch

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# PyTorch 2.3+; on older releases use torch.cuda.amp.GradScaler()
scaler = torch.amp.GradScaler("cuda")

for data, target in dataloader:
    optimizer.zero_grad()

    # Forward pass under autocast (FP16 where it is numerically safe)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(data.cuda())
        loss = criterion(output, target.cuda())

    # Loss scaling + backward pass
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

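Comparison by precision:

Precision	Bits	Memory savings	Tensor Core support	Typical use
FP32	32	Baseline	X (runs on CUDA cores; Tensor Cores use the TF32 path)	Default training
TF32	19	- (stored as 32-bit)	O (A100+)	Applied automatically
FP16	16	2x	O	Mixed Precision
BF16	16	2x	O (A100+)	LLM training (wider dynamic range)
FP8 (E4M3)	8	4x	O (H100+)	Transformer Engine
INT8	8	4x	O	Inference quantization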

DeepSpeed ZeRO

ZeRO Stage 3 configuration (ds_config.json):
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9
    },
    "bf16": {
        "enabled": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

ZeRO memory partitioning:

ZeRO Stage 0 (baseline):
GPU0: [Model] + [Gradient] + [Optimizer State]
GPU1: [Model] + [Gradient] + [Optimizer State]
→ Everything is fully replicated on every GPU

ZeRO Stage 1 (optimizer state partitioned):
GPU0: [Model] + [Gradient] + [Optimizer 1/2]
GPU1: [Model] + [Gradient] + [Optimizer 2/2]
→ Up to ~4x memory savings (mixed-precision Adam, many GPUs)

ZeRO Stage 2 (+ gradients partitioned):
GPU0: [Model] + [Gradient 1/2] + [Optimizer 1/2]
GPU1: [Model] + [Gradient 2/2] + [Optimizer 2/2]
→ Up to ~8x memory savings

ZeRO Stage 3 (+ model parameters partitioned):
GPU0: [Model 1/2] + [Gradient 1/2] + [Optimizer 1/2]
GPU1: [Model 2/2] + [Gradient 2/2] + [Optimizer 2/2]
→ ~N x memory savings (N = number of GPUs)
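
The per-GPU footprint of each ZeRO stage follows from simple accounting. The sketch below uses the mixed-precision Adam assumption from the ZeRO paper: 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for optimizer state (FP32 master weights, momentum, variance). Activations and buffers are ignored, and `zero_bytes_per_gpu` is a name introduced for this example.

```python
def zero_bytes_per_gpu(params, n_gpus, stage):
    """Model-state bytes per GPU under ZeRO stage 0-3 (mixed-precision Adam)."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        optim /= n_gpus    # stage 1: partition optimizer state
    if stage >= 2:
        grads /= n_gpus    # stage 2: also partition gradients
    if stage >= 3:
        weights /= n_gpus  # stage 3: also partition parameters
    return weights + grads + optim


# 7.5B parameters on 64 GPUs
for stage in range(4):
    print(stage, round(zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9, 1), "GB")
# 0 120.0 GB
# 1 31.4 GB
# 2 16.6 GB
# 3 1.9 GB
```

With only 2 GPUs, as in the diagrams above, the savings are much smaller (stage 1 drops from 16 to 10 bytes per parameter); the 4x and 8x figures are the limits for large GPU counts.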

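Comparison of parallelization strategies

Data Parallelism:
Split the input data N ways and train the same model on N GPUs
→ Gradients are synchronized with AllReduce
→ Communication volume: O(model_size)

Tensor Parallelism:
Split a single layer (matrix) N ways across N GPUs
→ Communication is needed at every layer in forward and backward
→ Requires fast GPU-to-GPU links (NVLink)

Pipeline Parallelism:
Split the model's layers into N stages placed sequentially on N GPUs
→ Micro-batches flow through the pipeline
→ Minimizing bubbles (idle time) is the key

3D Parallelism (LLM training):
Data Parallel x Tensor Parallel x Pipeline Parallel
Example: 256 GPUs = 32 DP x 4 TP x 2 PP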

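7-2. Inference Optimization

TensorRT optimization

# Model optimization with TensorRT (Python API)
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit batch is the default (and the flag deprecated) in recent TensorRT releases
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model and report parser errors
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse model.onnx")

# Build configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 (reduced-precision) kernels

# Build the serialized engine
engine = builder.build_serialized_network(network, config)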

Triton Inference Server

Triton architecture:
Client → HTTP/gRPC → Triton Server
                      ├── Model Repository
                      │   ├── model_a/ (TensorRT)
                      │   ├── model_b/ (ONNX Runtime)
                      │   └── model_c/ (Python Backend)
                      ├── Scheduler
                      │   ├── Dynamic Batching
                      │   └── Sequence Batching
                      ├── Model Ensemble
                      │   └── Preprocessing → model → postprocessing pipeline
                      └── Metrics (Prometheus)

# Triton model configuration (config.pbtxt)
name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 64

input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP16
    dims: [ 1000 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
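The effect of max_queue_delay_microseconds is easier to see with a toy model. The sketch below is a simplified simulation, not Triton's actual scheduler: it closes a batch when it reaches max_batch_size or when the oldest queued request has waited longer than the delay, and it ignores preferred_batch_size and GPU busy time. The function name and arrival times are hypothetical.

```python
def form_batches(arrivals_us, max_batch_size=64, max_queue_delay_us=100):
    """Group request arrival times (microseconds) into batches.

    A batch closes when it is full or when the next request arrives after the
    oldest queued request has already waited max_queue_delay_us.
    """
    batches = []
    current = []
    for t in sorted(arrivals_us):
        if current and (
            len(current) == max_batch_size
            or t - current[0] > max_queue_delay_us
        ):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [0, 10, 20, 500, 510]
    print(form_batches(arrivals))                     # bursts are grouped together
    print(form_batches(arrivals, max_batch_size=2))   # the size cap splits a burst
```

A longer delay produces larger batches and higher throughput at the cost of added latency for the first request in each batch, which is the trade-off the config above tunes.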

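vLLM: LLM Inference Optimization

from vllm import LLM, SamplingParams

# Create an offline inference engine (this does not start an HTTP server)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,      # Tensor parallelism across 4 GPUs
    gpu_memory_utilization=0.9,  # Let vLLM use up to 90% of each GPU's memory
    max_model_len=8192,
    dtype="bfloat16",
)

# Generate text with explicit sampling parameters
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)

# Key optimization techniques:
# 1. PagedAttention: manages the KV cache in fixed-size pages (memory efficiency)
# 2. Continuous Batching: adds and removes requests from the running batch dynamically
# 3. Prefix Caching: reuses the KV cache of shared prompt prefixes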
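PagedAttention matters because the KV cache is large. The sketch below estimates its size from the model shape. The Llama-3-70B figures (80 layers, 8 KV heads, head dimension 128) and the 16-token block size are assumptions taken from commonly published configurations, so check them against the model's config.json and your vLLM settings.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """Number of fixed-size KV blocks (pages) a sequence occupies."""
    return math.ceil(seq_len / block_size)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    seq_len = 8192  # matches max_model_len in the example above
    total_gib = per_token * seq_len / 2**30
    print(f"{per_token / 1024:.0f} KiB per token, {total_gib:.2f} GiB per full sequence")
    print(f"{blocks_needed(seq_len)} blocks of 16 tokens")
```

Because blocks are allocated on demand, a request that generates only a few hundred tokens holds a few dozen blocks rather than reserving the full 8192-token window up front.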

7-3. Performance Bottleneck Analysis Patterns

[Performance diagnosis flowchart]

1. Check nvidia-smi
   ├── GPU Util < 30%
   │   ├── Likely a CPU or I/O bottleneck
   │   │   ├── Check top/htop → CPU at 100%? → Optimize data loading
   │   │   └── Check iostat → Disk I/O bound? → Use NVMe/GDS
   │   └── Kernels too small → CUDA Graphs, larger batches
   ├── GPU Util > 90% but throughput is low
   │   ├── Possibly memory-bound
   │   │   ├── Nsight Compute → check Memory Throughput
   │   │   └── Review memory coalescing patterns
   │   └── Possibly warp divergence
   │       └── Nsight Compute → Warp Stall Reasons
   └── GPU Util irregular (fluctuating)
       ├── Synchronization bottleneck → optimize asynchronous execution
       └── Communication bottleneck (distributed) → profile NCCL

2. Distributed training bottlenecks
   ├── Check NCCL AllReduce time
   │   ├── Inspect the NCCL ranges in Nsight Systems
   │   └── Analyze the communication-to-compute ratio
   ├── Check InfiniBand bandwidth
   │   └── ib_write_bw benchmark
   └── Check GPU topology
       └── nvidia-smi topo -m
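The first level of the flowchart can be encoded as a small triage helper that runs over sampled utilization values. This is a rough sketch: the 30% and 90% thresholds come from the flowchart, while the standard-deviation cutoff used to detect "irregular" utilization is an assumption chosen for illustration.

```python
from statistics import mean, pstdev


def triage(util_samples, slow=True, jitter_threshold=20.0):
    """Return a first diagnostic hint from GPU utilization samples (percent)."""
    avg = mean(util_samples)
    if pstdev(util_samples) > jitter_threshold:
        return "irregular: check synchronization and NCCL communication"
    if avg < 30:
        return "low: check CPU, data loading, disk I/O, and kernel size"
    if avg > 90 and slow:
        return "high but slow: profile memory throughput and warp stalls"
    return "no obvious pattern: profile with Nsight Systems"


if __name__ == "__main__":
    print(triage([10, 15, 20]))
    print(triage([95, 96, 97]))
    print(triage([5, 95, 5, 95]))
```

The samples could come from nvidia-smi polling or DCGM; the helper only narrows down where to look and does not replace profiling.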

8. Linux System Troubleshooting

8-1. GPU-Related Linux Commands

# GPU device information
lspci -vv -s $(lspci | grep NVIDIA | head -1 | awk '{print $1}')

# GPU driver version
cat /proc/driver/nvidia/version

# CUDA toolkit version
nvcc --version

# Detailed GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv

# GPU memory per process
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv

# PCIe link generation and width (determines available bandwidth)
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv

# GPU clock information
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory,clocks.max.graphics,clocks.max.memory --format=csv

# GPU-related messages in dmesg
dmesg | grep -i -E "nvidia|nvrm|gpu|xid"

# InfiniBand status
ibstat
ibstatus
ibv_devinfo

# RDMA devices
rdma link show
rdma resource show

# NIC status (the interface and MST device names are examples; substitute your own)
ethtool -i mlx5_core0
mlxlink -d /dev/mst/mt4125_pciconf0 -m

# Kernel module status
lsmod | grep nvidia
lsmod | grep mlx
lsmod | grep vfio

# NUMA topology (GPU-CPU affinity)
numactl --hardware
lstopo --of ascii
nvidia-smi topo -m

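8-2. Common GPU Issues and Solutions

XID Error Interpretation

An XID error is an error code reported by the NVIDIA GPU driver. It appears in the kernel log (dmesg).

XID Code	Meaning	Severity	Response
XID 13	Graphics Engine Exception	High	Often an application fault such as an out-of-bounds access; check the CUDA kernel, then update the driver
XID 31	GPU Memory Page Fault	High	Illegal memory access; review the application code
XID 43	GPU stopped processing	High	Usually follows an application fault; restart the job and reset the GPU if it stays unresponsive
XID 45	Preemptive cleanup	Medium	Cleanup after an earlier error or a killed process; look for the preceding XID
XID 48	Double Bit ECC Error	Critical	Uncorrectable memory error; reset the GPU, RMA if it recurs
XID 63	ECC page retirement / row remapping event	Medium	A page was retired or a row remapped; RMA if events accumulate
XID 64	ECC page retirement / row remapping failure	High	The retirement or remapping could not be recorded; consider RMA
XID 79	GPU has fallen off the bus	Critical	PCIe link lost; inspect the hardware
XID 94	Contained ECC error	Medium	Uncorrectable error contained to the affected application; restart that application
XID 95	Uncontained ECC error	Critical	The error could not be contained; GPU reset required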

# Monitor XID errors
dmesg -w | grep "NVRM: Xid"

# Example output:
# NVRM: Xid (PCI:0000:41:00): 79, pid=0, GPU has fallen off the bus
# NVRM: Xid (PCI:0000:41:00): 48, pid=12345, DBE (double bit error)
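Log lines in this format are easy to parse for alerting. The sketch below extracts the PCI address and XID code with a regular expression and flags the codes the table above marks as most severe. The pattern is written against the two example lines shown here, so treat it as an assumption to validate against the logs your driver version actually emits.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:41:00): 79, pid=0, GPU has fallen off the bus
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[^)]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)

# Codes treated as urgent in the table above; extend to match your own runbook
URGENT_XIDS = {48, 79, 95}


def parse_xid(line: str):
    """Return a dict describing the XID event, or None if the line has no XID."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "pci": match.group("pci"),
        "code": code,
        "detail": match.group("detail"),
        "urgent": code in URGENT_XIDS,
    }


if __name__ == "__main__":
    sample = "[1234.5] NVRM: Xid (PCI:0000:41:00): 79, pid=0, GPU has fallen off the bus"
    print(parse_xid(sample))
```

Feeding this from `dmesg -w` or journald gives a simple per-GPU event stream that can be forwarded to an alerting system.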

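GPU Reset / Driver Reload

# Try a reset when the GPU hangs
sudo nvidia-smi --gpu-reset -i 0

# Driver reload (all GPU processes must be stopped first)
# 1. Find the processes using the GPU
sudo fuser -v /dev/nvidia*

# 2. Stop them (use -9 only if a plain kill does not work)
kill -9 <pid>

# 3. Stop the persistence daemon, which otherwise keeps the modules in use
sudo systemctl stop nvidia-persistenced

# 4. Unload and reload the driver
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia

# 5. Restart the persistence daemon
sudo systemctl start nvidia-persistenced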

CUDA OOM Debugging

# Debugging an out-of-memory error in PyTorch

# Check memory usage
import torch
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# Memory snapshot (detailed analysis)
torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run the training code ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
# Visualize at https://pytorch.org/memory_viz

OOM mitigations:

Reduce the batch size
Use gradient accumulation
Enable mixed precision (FP16/BF16)
Use gradient checkpointing (activation recomputation)
Apply DeepSpeed ZeRO Stage 2/3
Offload model parameters (CPU/NVMe)

ECC Errors and the RMA Procedure

# Check ECC error counters
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv

# Check retired pages
nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
# A100/H100 use row remapping instead of page retirement; inspect it with: nvidia-smi -q -d ROW_REMAPPER

# RMA criteria:
# - Uncorrected (double-bit) ECC errors occur repeatedly
# - Retired pages exceed the threshold (typically 60+ pages)
# - XID 48 occurs multiple times
# - The GPU drops off the bus (XID 79)

Thermal Throttling Response

# Monitor temperature
nvidia-smi --query-gpu=temperature.gpu,temperature.memory --format=csv -l 1

# Check or set the power limit
nvidia-smi --query-gpu=power.limit,power.default_limit,power.max_limit --format=csv
sudo nvidia-smi -pl 300  # Set the power limit to 300 W

# Check clock speeds (they drop under throttling)
nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics --format=csv

# Check the throttling reason
nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv
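The last query returns a hexadecimal bitmask rather than a readable reason. The sketch below decodes it. The bit assignments follow the nvmlClocksThrottleReason* constants in nvml.h as I understand them, which is an assumption to verify against the header shipped with your driver; the reason names are shortened for readability.

```python
# Bit values assumed from the nvmlClocksThrottleReason* constants in nvml.h
THROTTLE_BITS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask_text: str) -> list[str]:
    """Decode a mask such as '0x0000000000000044' into reason names."""
    mask = int(mask_text.strip(), 16)
    return [name for bit, name in sorted(THROTTLE_BITS.items()) if mask & bit]


if __name__ == "__main__":
    print(decode_throttle_reasons("0x0000000000000044"))
```

A mask with a thermal bit set points at cooling, while a power-cap bit alone usually means the configured power limit is the constraint.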

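Preventing thermal throttling:

Verify server room cooling capacity (400~1000 W of heat per GPU)
Optimize airflow (hot/cold aisle separation)
Consider liquid cooling (offered on many HGX H100-based servers; DGX H100 itself is air-cooled)
Set power limits (performance vs. temperature trade-off)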

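9. Interview Questions: Top 30

GPU Architecture and CUDA (10)

Q1. Describe the internal structure of an SM (Streaming Multiprocessor) and explain how warp divergence affects performance.

Key answer points: An SM consists of CUDA cores, Tensor Cores, warp schedulers, a register file, and shared memory/L1 cache. A warp (32 threads) executes the same instruction under the SIMT model, so on an if/else branch the hardware runs both paths one after the other with the inactive threads masked off. A two-way branch costs up to 2x; if every thread takes a different path, the slowdown can approach 32x.

Q2. Describe the GPU memory hierarchy from registers to global memory, including the latency of each level and the matching optimization strategy.

Key answer points: Register (~1 cycle) → Shared Memory (~20-30 cycles) → L1/L2 Cache → Global Memory (HBM, ~600 cycles). Use shared memory for tiling to cut global memory accesses, and use memory coalescing to maximize bandwidth utilization.

Q3. What is memory coalescing, and why does it matter?

Key answer points: When the 32 threads of a warp access consecutive 4-byte words, the hardware merges the accesses into a single 128-byte transaction. Strided (non-contiguous) access splits into up to 32 separate transactions. Each one fetches a full 32-byte sector to deliver 4 useful bytes, so only about 1/8 of the fetched data is used (1/32 on hardware that fetches whole 128-byte lines).

Q4. Explain the Roofline Model and how to tell whether a given kernel is memory-bound or compute-bound.

Key answer points: Compute the arithmetic intensity (FLOPs/Byte) and compare it with the hardware balance point (Peak FLOPS / Peak BW). For H100 the balance point is about 295 FLOPs/Byte (989 TFLOPS FP16 Tensor / 3.35 TB/s); a kernel below it is memory-bound.

Q5. Explain the main differences between A100 and H100, and how the H100 Transformer Engine affects training performance.

Key answer points: H100 provides 4th-generation Tensor Cores with FP8 support, NVLink 4.0 (900 GB/s), HBM3 (3.35 TB/s), and the Transformer Engine. The Transformer Engine switches dynamically between FP8 and FP16, giving up to roughly 2x the throughput of FP16.

Q6. Explain the CUDA Grid-Block-Thread hierarchy and the criteria for choosing a block size.

Key answer points: A grid is a collection of blocks and a block is a collection of threads; all threads of a block run on the same SM. Choose a block size that is a multiple of 32 (the warp size) while weighing SM occupancy, shared memory usage, and register usage. 128 or 256 is usually a good starting point.

Q7. Explain the difference between NVIDIA Nsight Systems and Nsight Compute, and when to use each.

Key answer points: Nsight Systems shows a system-level timeline (CPU-GPU interaction, kernel launches, memory transfers). Nsight Compute analyzes an individual kernel in detail: SM occupancy, memory throughput, warp stall reasons, and so on.

Q8. What is a shared memory bank conflict, and how do you avoid it?

Key answer points: Shared memory is organized into 32 banks. When threads of the same warp access different addresses in the same bank at the same time, the accesses are serialized. Avoid it by adding padding (for example an array width of 33) or by redesigning the access pattern.

Q9. Describe ways to reduce the overhead of host-device memory transfers in CUDA.

Key answer points: Pinned memory (cudaMallocHost), asynchronous transfers (cudaMemcpyAsync + CUDA streams), zero-copy mapped host memory, Unified Memory with prefetching, overlapping transfers with compute, and CUDA Graphs. Note that zero-copy memory and Unified Memory are separate mechanisms.

Q10. GPU utilization is stuck at 30%. In what order would you debug it?

Key answer points: (1) Check memory and utilization with nvidia-smi → (2) Analyze the CPU vs. GPU time split with Nsight Systems → (3) Check for a data loading bottleneck (num_workers, prefetch) → (4) Check kernel size (increase the batch) → (5) Check for synchronization bottlenecks (CUDA Graphs) → (6) Check for a PCIe bottleneck.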

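Virtualization and Networking (10)

Q11. Compare PCI Passthrough, vGPU, and MIG, and describe the appropriate use case for each.

Key answer points: PCI Passthrough assigns a GPU 1:1 (maximum performance), vGPU time-slices it (flexibility), and MIG partitions it physically (isolation plus predictability). Use Passthrough for large-scale training, MIG for multi-tenant inference serving, and vGPU for mixed VDI environments.

Q12. Explain the role of the IOMMU and why it matters for GPU virtualization.

Key answer points: The IOMMU (Intel VT-d, AMD-Vi) translates the addresses in a device's DMA requests (I/O virtual or guest-physical addresses) into host physical addresses and restricts each device to the memory mapped for its VM. Without it, a GPU could DMA into another VM's memory, which is a security problem.

Q13. Explain MIG profiles and give the specification of each instance when an A100-80GB is split as far as possible.

Key answer points: It can be split into at most seven 1g.10gb instances. Each has 14 SMs, 10 GB of HBM2e, and its own L2 cache slice and memory controllers. A CI (Compute Instance) must be created inside a GI (GPU Instance) before CUDA can use it.

Q14. Explain the difference between InfiniBand and Ethernet, and why InfiniBand matters for distributed training.

Key answer points: InfiniBand supports RDMA natively, giving sub-microsecond latency (around 0.5 us) and CPU-bypass transfers. NDR provides 400 Gbps of bandwidth. AllReduce traffic in distributed training moves hundreds of GB, so bandwidth and latency directly affect performance.

Q15. Explain how RDMA zero-copy transfer works.

Key answer points: The application registers memory through the ibverbs API, and the NIC then accesses that memory directly via DMA. Data moves straight from user-space memory to the NIC without passing through kernel buffers.

Q16. What benefit does GPU Direct RDMA provide in distributed training?

Key answer points: It transfers data directly between GPU memory and the NIC, removing the bounce buffer in host memory. Because the data crosses PCIe once instead of twice, it frees PCIe bandwidth and reduces transfer latency.

Q17. Explain the difference between RoCE v2 and InfiniBand, and why PFC configuration matters in a RoCE v2 environment.

Key answer points: RoCE v2 runs RDMA over UDP/IP and reuses existing Ethernet infrastructure. Ethernet, however, is designed to tolerate packet loss, so to meet RDMA's lossless assumption PFC (Priority Flow Control) must pause transmission under congestion.

Q18. Explain the role of NCCL and how it operates in distributed training.

Key answer points: NCCL optimizes collective communication among multiple GPUs, such as AllReduce, Broadcast, and AllGather. It detects NVLink, NVSwitch, and InfiniBand automatically, selects the best communication path, and uses Ring or Tree AllReduce algorithms.

Q19. What is SR-IOV, and how is it used in a GPU cluster?

Key answer points: SR-IOV splits a physical NIC into multiple VFs (Virtual Functions) that are assigned directly to VMs. In a GPU cluster with VMs, the InfiniBand/RoCE NIC is split with SR-IOV so that GPU Direct RDMA performance is preserved.

Q20. Compare the two ways of assigning a GPU to a VM in KubeVirt.

Key answer points: (1) PCI Passthrough: assign the physical GPU directly through hostDevices, maximum performance, 1:1 mapping. (2) vGPU: assign a virtual GPU through mediated devices, which allows GPU sharing at the cost of some overhead. Combining MIG with KubeVirt is also possible.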

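K8s and Performance Optimization (10)

Q21. List the components of the NVIDIA GPU Operator and explain the role of each.

Key answer points: Driver (kernel modules), Container Toolkit (runtime integration), Device Plugin (registers K8s resources), DCGM Exporter (metrics), MIG Manager (automatic MIG configuration), GFD (GPU Feature Discovery, node labeling), Validator (verification).

Q22. Why is topology-aware scheduling important for GPU scheduling in K8s?

Key answer points: GPUs connected by NVLink communicate 6~10x faster than GPUs connected only by PCIe. If GPUs are allocated arbitrarily for distributed training, traffic goes over PCIe instead of NVLink and performance drops sharply.

Q23. Explain why gang scheduling is essential for distributed training.

Key answer points: AllReduce completes only when every worker participates. If only 3 of 4 workers are scheduled, the 3 allocated GPUs sit idle waiting for the fourth. All-or-nothing allocation through a scheduler such as Volcano or Kueue is required.

Q24. Explain the difference between GPU time-slicing and MIG from a K8s perspective.

Key answer points: Time-slicing divides GPU time in software; it offers no performance isolation but works on any GPU. MIG partitions the GPU physically with full isolation, but only on supported datacenter GPUs from the Ampere generation onward, such as A100, A30, and H100. In K8s they are requested as nvidia.com/gpu (with replicas configured in the device plugin) and nvidia.com/mig-Xg.XXgb respectively.

Q25. Name five key metrics to monitor with DCGM Exporter and explain what each means.

Key answer points: GPU_UTIL (fraction of time a kernel was running; for real SM utilization use the profiling metric SM_ACTIVE), MEM_COPY_UTIL (memory bandwidth utilization), FB_USED (framebuffer memory in use), GPU_TEMP (temperature), XID_ERRORS (hardware errors). POWER_USAGE and ECC_SBE/DBE are also important.

Q26. Explain how mixed precision training works and why loss scaling is needed.

Key answer points: The forward pass runs in FP16 to use Tensor Cores, gradients are also computed in FP16, and the update is applied to FP32 master weights. FP16 has a narrow dynamic range, so small gradients underflow to zero; loss scaling prevents this.

Q27. Describe the three stages of DeepSpeed ZeRO and compare their memory savings.

Key answer points: Stage 1 (optimizer state partitioned, up to ~4x), Stage 2 (+ gradients partitioned, up to ~8x), Stage 3 (+ model parameters partitioned, ~Nx), for mixed-precision Adam with a large GPU count. Stage 3 has the highest communication overhead, so a fast network such as InfiniBand is effectively required.

Q28. Explain how Triton Inference Server's dynamic batching improves inference efficiency.

Key answer points: Individual requests are collected in a queue and formed into a batch within the configured maximum wait time, then processed together. GPUs deliver higher throughput with larger batches, so dynamic batching balances latency against throughput.

Q29. Describe the debugging procedure when XID 79 ("GPU has fallen off the bus") occurs.

Key answer points: (1) Check the surrounding log lines in dmesg → (2) Check the PCIe link state (lspci) → (3) Try a GPU reset (nvidia-smi --gpu-reset) → (4) Check the physical connection (reseat the card) → (5) Test in a different slot → (6) RMA if it recurs.

Q30. You are building a new cluster of 1000 H100 GPUs. How would you design the GPU software stack?

Key answer points: (1) OS: Ubuntu 22.04 + a recent kernel → (2) Driver: NVIDIA Driver 535+ → (3) Network: InfiniBand NDR + NCCL → (4) Containers: K8s + GPU Operator + DCGM → (5) Scheduling: Volcano (gang scheduling) + GFD (topology-aware) → (6) Monitoring: DCGM Exporter + Prometheus + Grafana → (7) Storage: GPU Direct Storage + a distributed file system → (8) MIG/vGPU: define a partitioning strategy per workload.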

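10. 10-Month Study Roadmap

Month 1-2: GPU Fundamentals and CUDA Programming

Goal: Understand GPU architecture and build CUDA programming skills

Week	Topic	Activities
Week 1	GPU architecture	Read the NVIDIA GPU architecture whitepapers (Ampere, Hopper)
Week 2	CUDA basics	Implement vector addition and matrix multiplication
Week 3	CUDA optimization	Practice shared memory tiling and memory coalescing
Week 4	Advanced CUDA	Warp-level primitives, CUDA streams
Weeks 5-6	cuBLAS/cuDNN	Use the libraries and compare performance
Weeks 7-8	Profiling	Analyze real kernels with Nsight Systems/Compute

Resources:

NVIDIA CUDA Programming Guide
"Programming Massively Parallel Processors" (David Kirk, Wen-mei Hwu)
NVIDIA DLI (Deep Learning Institute) CUDA courses

Month 3-4: Linux Systems + GPU Drivers

Goal: Understand the Linux kernel/driver level and learn GPU troubleshooting

Week	Topic	Activities
Weeks 1-2	Linux kernel basics	Memory management, device drivers, PCIe
Weeks 3-4	GPU drivers	NVIDIA driver installation/configuration, module structure
Weeks 5-6	Troubleshooting	XID error analysis, ECC error handling, dmesg analysis
Weeks 7-8	Performance tools	System analysis with perf, strace, and eBPF

Month 5-6: Virtualization (the core topic)

Goal: Hands-on practice with KVM/QEMU + PCI Passthrough + vGPU + MIG

Week	Topic	Activities
Weeks 1-2	KVM/QEMU	VM creation, IOMMU setup, basic virtualization
Weeks 3-4	PCI Passthrough	GPU VFIO binding, assigning a GPU to a VM
Weeks 5-6	MIG	MIG profile configuration, performance testing
Weeks 7-8	vGPU	vGPU license setup, scheduler comparison

Month 7-8: Networking (InfiniBand/RDMA)

Goal: Understand InfiniBand architecture, program RDMA, and tune NCCL

Week	Topic	Activities
Weeks 1-2	InfiniBand basics	Architecture, Subnet Manager, basic commands
Weeks 3-4	RDMA	ibverbs programming, benchmarks
Weeks 5-6	GPU Direct	GPU Direct RDMA/Storage practice
Weeks 7-8	NCCL tuning	NCCL benchmarks for distributed training, environment variable tuning

Month 9-10: Kubernetes + Integration Project

Goal: K8s GPU management, large-cluster operations, and a finished portfolio

Week	Topic	Activities
Weeks 1-2	GPU Operator	Installation, configuration, MIG Manager
Weeks 3-4	Scheduling	Topology-aware scheduling, gang scheduling (Volcano)
Weeks 5-6	Monitoring	DCGM + Prometheus + Grafana dashboards
Weeks 7-8	Integration project	Finish the portfolio projects and prepare for interviews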

11. Portfolio Projects (3)

Project 1: CUDA Kernel Optimization (Matrix Multiplication Benchmark)

Goal: Naive CUDA → Shared Memory Tiling → Tensor Cores → comparison with cuBLAS

Project layout:
cuda-matmul-benchmark/
├── src/
│   ├── naive_matmul.cu          # Naive implementation
│   ├── tiled_matmul.cu          # Shared memory tiling
│   ├── wmma_matmul.cu           # Tensor Cores (WMMA API)
│   └── cublas_matmul.cu         # cuBLAS wrapper
├── benchmark/
│   ├── run_benchmarks.sh
│   └── plot_results.py          # Result visualization
├── profiles/
│   ├── nsight_systems/
│   └── nsight_compute/
└── README.md

Key deliverables:

GFLOPS comparison table for each implementation
Nsight Compute profiling results (SM occupancy, memory throughput)
Achieved fraction of cuBLAS performance (typically Naive: 1~5%, Tiled: 20~40%, Tensor Core: 60~80%)

Project 2: MIG + K8s Multi-Tenant GPU Cluster

Goal: Build a shared GPU cluster using MIG

Project layout:
mig-k8s-multitenant/
├── infra/
│   ├── gpu-operator-values.yaml
│   ├── mig-config.yaml
│   └── monitoring/
│       ├── dcgm-dashboard.json    # Grafana dashboard
│       └── alert-rules.yaml       # Prometheus alerts
├── workloads/
│   ├── inference-deployment.yaml  # Inference on MIG 1g.10gb
│   ├── training-job.yaml          # Training on MIG 3g.40gb
│   └── notebook-statefulset.yaml  # Jupyter on MIG 2g.20gb
├── scheduler/
│   ├── gang-scheduling.yaml       # Volcano configuration
│   └── priority-classes.yaml
└── docs/
    ├── architecture.md
    └── benchmark-results.md

Key deliverables:

One A100-80GB partitioned with MIG to run 3 workloads concurrently
Verification of performance isolation between MIG instances (noisy neighbor test)
Grafana dashboard: per-instance GPU utilization, memory, and temperature

Project 3: Distributed Training Performance Profiling (NCCL + InfiniBand)

Goal: Analyze and optimize communication bottlenecks in distributed training

Project layout:
distributed-training-profiler/
├── benchmarks/
│   ├── nccl_allreduce.sh          # NCCL benchmark
│   ├── ib_bandwidth.sh            # InfiniBand bandwidth
│   └── multi_node_training.py     # Actual training script
├── profiling/
│   ├── nsight_distributed.sh      # Profiling in a distributed environment
│   └── nccl_debug_analysis.py     # NCCL log analysis
├── optimization/
│   ├── nccl_env_tuning.sh         # NCCL environment variable tuning
│   └── topology_optimization.py   # GPU-NIC topology optimization
└── results/
    ├── scaling_efficiency.png     # Scaling efficiency graph
    └── communication_breakdown.png # Communication time breakdown

Key deliverables:

Scaling efficiency measured at 2, 4, and 8 nodes
Ratio of NCCL AllReduce time to compute time
Before/after comparison of environment variable tuning (NCCL_IB_HCA, NCCL_ALGO, etc.)
Analysis of communication/compute overlap in the Nsight Systems timeline

12. Quiz

Q1. How many FP32 CUDA cores does one SM of an H100 GPU have, and how many SMs are there in total?

Answer: Each SM has 128 FP32 CUDA cores, and H100 has 132 SMs in total, so the full CUDA core count is 128 x 132 = 16,896. For reference, A100 has 64 FP32 cores per SM x 108 SMs = 6,912.

Q2. If an A100-80GB is split as far as possible with the 1g.10gb MIG profile, how many instances are created and how many SMs does each have?

Answer: At most seven 1g.10gb instances are created. Each has about 14 SMs and 10 GB of HBM2e memory. Each instance has its own L2 cache slice and memory controllers, so performance is physically isolated. To use MIG, first create a GPU Instance (GI) and then create a Compute Instance (CI) inside it; only then can CUDA use it.

Q3. What are three main differences between RDMA over InfiniBand and RoCE v2?

Answer: (1) Transport layer: InfiniBand uses its own transport protocol, while RoCE v2 runs over UDP/IP. (2) Congestion control: InfiniBand is natively lossless thanks to credit-based flow control, while RoCE v2 is Ethernet-based and therefore requires PFC (Priority Flow Control). (3) Infrastructure: InfiniBand needs dedicated switches and cables, while RoCE v2 can reuse existing Ethernet switches at lower cost. In bandwidth, InfiniBand NDR (400 Gbps) is generally ahead of commonly deployed RoCE (100~200 Gbps).

Q4. Explain why gang scheduling is needed in Kubernetes and name at least two schedulers that support it.

Answer: In distributed training, AllReduce completes only when every worker participates. If only some of the N GPUs are allocated, the allocated GPUs sit idle waiting for the rest and resources are wasted. Gang scheduling allocates all required resources at once, all-or-nothing. Schedulers that support it include Volcano, Kueue (K8s SIG Scheduling), and YuniKorn (Apache). The default K8s scheduler (kube-scheduler) does not support gang scheduling.

Q5. GPU utilization (SM utilization) is above 90% but training is still slow. Give three possible causes and how to diagnose each.

Answer: (1) Memory-bound: the SMs are busy but memory bandwidth is saturated. Check in Nsight Compute whether Memory Throughput is near peak, then review shared memory usage and memory coalescing patterns. (2) Warp divergence: conditional branches force threads within a warp to run sequentially. Check the Branch Efficiency metric in Nsight Compute. (3) Low SM occupancy with high compute: a small number of warps are running at high arithmetic intensity. Check Active Warps per SM and adjust the block size and register usage. Not using Tensor Cores (no FP16/BF16) can be an additional cause.

13. References

Official Documentation

NVIDIA CUDA Programming Guide - NVIDIA Developer
NVIDIA A100 Whitepaper - NVIDIA
NVIDIA H100 Whitepaper - NVIDIA
NVIDIA MIG User Guide - NVIDIA Developer
NVIDIA Virtual GPU Software Documentation - NVIDIA
NVIDIA GPU Operator Documentation - NVIDIA
NVIDIA DCGM Documentation - NVIDIA
NVIDIA NCCL Documentation - NVIDIA

Networking

InfiniBand Architecture Specification - IBTA
RDMA Aware Programming User Manual - NVIDIA Networking (Mellanox)
RoCE v2 Deployment Guide - NVIDIA Networking

Kubernetes

NVIDIA Device Plugin for Kubernetes - GitHub
Volcano: Kubernetes Native Batch System - volcano.sh
Kueue: Kubernetes-native Job Queueing - K8s SIG Scheduling

Training/Inference Optimization

DeepSpeed Documentation - Microsoft
Triton Inference Server Documentation - NVIDIA
vLLM Documentation - vLLM Project
Flash Attention Paper - Tri Dao et al.

Books

"Programming Massively Parallel Processors" - David Kirk, Wen-mei Hwu
"CUDA by Example" - Jason Sanders, Edward Kandrot
"Computer Architecture: A Quantitative Approach" - Hennessy, Patterson
"Understanding the Linux Kernel" - Daniel P. Bovet, Marco Cesati

Communities/Blogs

NVIDIA Developer Blog - developer.nvidia.com/blog
NVIDIA GTC Sessions (free) - nvidia.com/gtc
Horace He's "Making Deep Learning Go Brrrr From First Principles" blog post
GPU MODE Community - Discord

GPU Software Engineer Complete Guide: From CUDA Architecture to vGPU/MIG, InfiniBand, and K8s GPU Scheduling — System Optimization Mastery

1. The Rare Career of GPU Software Engineer
2. JD Line-by-Line Dissection
3. GPU Architecture Deep Dive
4. GPU Virtualization Technology
5. High-Speed Networking: InfiniBand and RDMA
6. Kubernetes GPU Management
7. AI Workload Optimization
8. Linux System Troubleshooting
- 8-1. GPU-Related Linux Commands
- 8-2. Common GPU Issues and Solutions
9. Interview Questions: Top 30
10. 10-Month Study Roadmap
11. Portfolio Projects (3)
12. Quiz
13. References

1. The Rare Career of GPU Software Engineer

"People Who USE GPUs" vs "People Who MAKE GPUs Work"

Since 2024, the keyword dominating the AI industry has been "GPU." Every company is fighting to acquire GPUs, but the number of engineers who can properly operate acquired GPUs is vanishingly small.

A critical distinction is needed here:

Aspect	People Who USE GPUs	People Who MAKE GPUs Work
Role	ML Engineer, Researcher	GPU Software Engineer
Focus	Model accuracy, training algorithms	GPU utilization, memory bandwidth, scheduling
Tools	PyTorch, TensorFlow	nvidia-smi, Nsight, DCGM, NCCL
Key Question	"Why isn't this model performing?"	"Why is this GPU only 70% utilized?"
Abstraction Level	Python API	CUDA kernels, drivers, hypervisors
Response Area	Model architecture changes	XID error analysis, PCIe bottleneck resolution, MIG config

When an ML Engineer calls model.to('cuda') in PyTorch, a GPU Software Engineer understands and optimizes which path that call takes, which driver calls it traverses, and which memory region the allocation lands in.

Market Value of This Role

The supply-demand imbalance for GPU Software Engineers is extreme:

Demand side: As of 2025, over 5 million NVIDIA GPUs are installed in datacenters worldwide. Millions more are deployed annually, and systems engineers are needed to operate them.
Supply side: Engineers with deep GPU system software knowledge have traditionally existed only within NVIDIA itself, HPC research labs, and major cloud providers. In the Korean market, this talent pool is extremely limited.
Salary premium: In the US, GPU/CUDA Engineers typically command base salaries of USD 250K-400K. In Korea, companies are increasingly offering exceptional compensation for experts in this field.

LG Uplus GPU Technology TF Mission

Understanding why LG Uplus established the GPU Technology TF is essential:

Telecom AI Infrastructure Business: LG Uplus is pursuing not only its own AI services but also providing GPU cloud services to enterprise customers.
GPU Multi-tenancy: Sharing a single GPU among multiple customers requires vGPU/MIG technology.
Leveraging Network Strengths: As a telecom company, they have competency in InfiniBand/RoCE high-speed network design.
End-to-End Optimization: This team is responsible for the entire pipeline from hardware selection to virtualization, containers, and AI workload onboarding.

Role	Core Competency	GPU Depth	Infra Depth
ML Engineer	Model development, training pipelines	Low (API level)	Low
MLOps Engineer	CI/CD, model serving, pipeline automation	Medium	Medium
GPU SW Engineer	GPU architecture, virtualization, drivers	Very High	High
Infra SRE	Server/network availability, monitoring	Medium	Very High
HPC Engineer	Parallel computing, MPI, schedulers	High	High

GPU Software Engineer sits at the intersection of all these roles, specifically responsible for the software layer closest to hardware.

2. JD Line-by-Line Dissection

Let's analyze what each item in the LG Uplus GPU Software Engineer JD actually means.

Responsibilities

"GPU resource management and performance optimization"

This is not simply monitoring nvidia-smi. Specifically:

Finding and resolving causes when GPU utilization or SM occupancy is below expectations
Analyzing HBM memory bandwidth utilization and kernel-level optimization
Managing power capping vs performance tradeoffs
Determining RMA decisions when ECC errors occur
Designing cluster-level GPU allocation policies

"GPU virtualization technology development and optimization"

This is the core differentiator for this position:

Designing vGPU profiles: determining which virtual GPU size to allocate to which customer
MIG partitioning strategy: configuring A100/H100 MIG profiles to match workloads
Establishing technical selection criteria between PCI Passthrough vs vGPU vs MIG
Building pipelines for GPU allocation to VMs in KubeVirt environments

"AI/ML workload GPU onboarding and performance optimization"

Efficiently deploying customer AI models onto GPU infrastructure:

Model profiling: analyzing GPU memory requirements, compute requirements
Matching appropriate GPU type/size (A100-40GB vs A100-80GB vs H100)
Distributed training setup: NCCL communication optimization, InfiniBand utilization
Inference serving: Triton Inference Server configuration, batch size optimization

Qualification Analysis

"BS+ in CS (MS preferred in systems/network/OS)"

The reason for MS preference is clear. Knowledge in this field is mostly covered in graduate-level courses:

Operating Systems: memory management, scheduling, device drivers
Computer Architecture: cache hierarchy, memory models, parallel processing
Networking: RDMA, high-performance protocols

"GPU or system software practical experience"

The key phrase is "system software." This means:

Kernel module development/debugging experience
Interaction with device drivers
Low-level performance profiling (perf, ftrace, eBPF)
C/C++ level systems programming

Required Skills Analysis

The following sections will cover each required skill in depth.

3. GPU Architecture Deep Dive

3-1. GPU Compute Architecture

SM (Streaming Multiprocessor) Architecture

The core compute unit of a GPU is the SM (Streaming Multiprocessor). Understanding the SM structure of modern NVIDIA GPUs is the starting point for all GPU optimization.

SM Internal Components:

SM (Streaming Multiprocessor)
├── CUDA Cores (INT32 + FP32)
│   └── H100: 128 FP32 cores per SM
├── Tensor Cores
│   └── H100: 4th-gen Tensor Cores (FP8 support)
├── RT Cores (ray tracing; not present on compute GPUs such as A100/H100)
├── Warp Schedulers (4)
│   └── Each scheduler independently dispatches warps
├── Register File (256KB per SM in H100)
├── Shared Memory / L1 Cache (unified, up to 228KB)
├── Load/Store Units
├── Special Function Units (SFU)
│   └── Transcendental functions: sin, cos, exp
└── Texture Units

Warps and the SIMT Model:

The fundamental unit of GPU execution is a Warp (32 threads). All threads in the same Warp execute the same instruction simultaneously. This is the SIMT (Single Instruction, Multiple Threads) model.

Grid (entire workload)
├── Block 0
│   ├── Warp 0  (Thread 0~31)   → Same instruction, simultaneous execution
│   ├── Warp 1  (Thread 32~63)  → Same instruction, simultaneous execution
│   └── Warp N  ...
├── Block 1
│   ├── Warp 0
│   └── ...
└── Block M

Warp Divergence problem: When threads within a Warp take different branches (if/else), both branches are executed sequentially, degrading performance. This is called "Warp Divergence" and is a pattern that must be avoided in GPU programming.

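// Placeholder device functions; substitute the real work here
__device__ float expensive_operation_A(float x) { return sinf(x) * x; }
__device__ float expensive_operation_B(float x) { return cosf(x) + x; }

// Bad example: Warp Divergence
__global__ void kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // Guard against out-of-range threads
    if (idx % 2 == 0) {
        data[idx] = expensive_operation_A(data[idx]);
    } else {
        data[idx] = expensive_operation_B(data[idx]);
    }
    // Even/odd threads in same Warp take different branches -> sequential execution
}

// Good example: Threads in same Warp take same branch
// Note: work is now assigned per warp, so different elements receive A vs. B than above;
// this applies only when the data layout can be arranged to match.
__global__ void kernel_optimized(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // Guard against out-of-range threads
    int warp_id = idx / 32;
    if (warp_id % 2 == 0) {
        data[idx] = expensive_operation_A(data[idx]);
    } else {
        data[idx] = expensive_operation_B(data[idx]);
    }
    // All threads in same Warp take same branch
}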
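The difference between the two kernels can be checked without a GPU by counting how many distinct branch paths each warp has to execute. The sketch below is a plain Python model of that idea, assuming a warp size of 32; the function name is illustrative.

```python
def paths_per_warp(predicate, num_threads=64, warp_size=32):
    """For each warp, count the distinct branch outcomes among its threads.

    A count of 2 means the hardware must run both sides of the if/else
    one after the other; a count of 1 means no divergence.
    """
    counts = []
    for start in range(0, num_threads, warp_size):
        outcomes = {predicate(idx) for idx in range(start, start + warp_size)}
        counts.append(len(outcomes))
    return counts


if __name__ == "__main__":
    # Branch on the thread index: every warp contains both even and odd threads
    print(paths_per_warp(lambda idx: idx % 2 == 0))
    # Branch on the warp index: all threads of a warp agree
    print(paths_per_warp(lambda idx: (idx // 32) % 2 == 0))
```

The same check is a quick way to reason about any branch condition before writing the kernel: if the condition is constant across each group of 32 consecutive threads, the warp does not diverge.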

Architecture Generation Comparison

Feature	Ampere (A100)	Hopper (H100)	Blackwell (B200)
SM Count	108	132	192
CUDA Cores	6,912	16,896	21,760+
Tensor Cores	432 (3rd gen)	528 (4th gen)	768 (5th gen)
Memory	HBM2e 80GB	HBM3 80GB	HBM3e 192GB
Memory Bandwidth	2.0 TB/s	3.35 TB/s	8.0 TB/s
NVLink	3.0 (600GB/s)	4.0 (900GB/s)	5.0 (1.8TB/s)
Transformer Engine	None	FP8 support	FP4 support
MIG Support	Up to 7 instances	Up to 7 instances	Up to 7 instances
TDP	400W	700W	1000W
FP16 Tensor Perf	312 TFLOPS	989 TFLOPS	2,250+ TFLOPS

Transformer Engine: Introduced with H100, the Transformer Engine provides hardware-level FP8 precision support. It dynamically switches tensors between FP8/FP16 per layer during training, halving memory usage while minimizing accuracy loss.

NVLink and NVSwitch: High-speed direct communication paths between GPUs.

NVLink Topology (DGX H100):
GPU 0 <-- NVLink 4.0 (900GB/s) --> GPU 1
  |                                    |
  NVSwitch (fully connected)        NVSwitch
  |                                    |
GPU 2 <-- NVLink 4.0 (900GB/s) --> GPU 3
  |                                    |
  ...        8-GPU fully connected     ...
GPU 6 <-- NVLink 4.0 (900GB/s) --> GPU 7

Total bandwidth: 8 GPUs x 900GB/s = 7.2TB/s (bidirectional)

3-2. GPU Memory Hierarchy (Critical!)

Understanding the GPU memory hierarchy is the single most important part of GPU performance optimization. In practice, most GPU performance problems come down to how data moves through memory.

Memory Hierarchy (Fast -> Slow):

1. Register (Fastest)
   ├── Capacity: Up to 255 per thread (32-bit)
   ├── Latency: ~1 cycle
   ├── Bandwidth: Highest of any level (operands feed the ALUs directly)
   └── Note: Compiler auto-allocates; spills to local memory when exceeded

2. Shared Memory
   ├── Capacity: 48KB ~ 228KB per SM (configurable)
   ├── Latency: ~20-30 cycles (roughly an order of magnitude slower than registers)
   ├── Bandwidth: ~19TB/s (H100)
   ├── Feature: Shared among threads in same Block
   └── Warning: Bank Conflicts possible

3. L1 Cache
   ├── Unified with Shared Memory (ratio configurable)
   ├── H100: Shared Memory + L1 = 228KB per SM
   └── Auto-cached, not directly programmable

4. L2 Cache
   ├── Capacity: H100 = 50MB (shared across all SMs)
   ├── Latency: ~200 cycles
   └── A100: 40MB, Blackwell: up to 128MB

5. Global Memory (HBM)
   ├── Capacity: 40GB ~ 192GB
   ├── Latency: ~600 cycles (~600x slower than registers)
   ├── Bandwidth: 2.0 ~ 8.0 TB/s (by generation)
   └── Accessible from all threads

Memory Bandwidth Utilization Calculation:

Determining whether GPU performance is memory-bound or compute-bound is the core skill.

Arithmetic Intensity = FLOPs / Bytes Accessed

H100:
- Peak Compute: 989 TFLOPS (FP16 Tensor)
- Peak Memory BW: 3.35 TB/s

Balance Point (Roofline Analysis):
  989 TFLOPS / 3.35 TB/s = 295 FLOPs/Byte

-> Arithmetic Intensity < 295: Memory-Bound
-> Arithmetic Intensity > 295: Compute-Bound

Examples:
- Vector addition: 1 FLOP / 12 Bytes = 0.08 -> Extremely Memory-Bound
- Matrix multiply (NxN, FP16): 2N^3 FLOPs / (3N^2 x 2 Bytes) = N/3 FLOPs/Byte -> Compute-Bound once N/3 > 295 (N > ~885), assuming each matrix is moved once
- Transformer Attention: Typically Memory-Bound (especially during inference)
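The roofline calculation above fits in a few lines. The sketch below uses the H100 figures quoted in this section (989 TFLOPS FP16 Tensor, 3.35 TB/s) and assumes an idealized matrix multiply in which each N x N FP16 matrix crosses the memory bus exactly once; real kernels achieve less reuse. The function names are illustrative.

```python
H100_PEAK_FLOPS = 989e12  # FP16 Tensor, as quoted above
H100_PEAK_BW = 3.35e12    # bytes per second


def balance_point(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOPs/Byte) at which the two roofs meet."""
    return peak_flops / peak_bw


def attainable_flops(intensity: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline bound: limited by either compute or memory traffic."""
    return min(peak_flops, intensity * peak_bw)


def gemm_intensity(n: int, dtype_bytes: int = 2) -> float:
    """2*n^3 FLOPs over three n x n matrices, each moved once."""
    return (2 * n**3) / (3 * n**2 * dtype_bytes)


if __name__ == "__main__":
    ridge = balance_point(H100_PEAK_FLOPS, H100_PEAK_BW)
    print(f"balance point: {ridge:.0f} FLOPs/Byte")
    cases = {"vector add": 1 / 12, "gemm n=512": gemm_intensity(512),
             "gemm n=4096": gemm_intensity(4096)}
    for name, ai in cases.items():
        bound = "memory-bound" if ai < ridge else "compute-bound"
        tflops = attainable_flops(ai, H100_PEAK_FLOPS, H100_PEAK_BW) / 1e12
        print(f"{name:12s} AI={ai:8.2f} -> {bound}, at most {tflops:.2f} TFLOPS")
```

Vector addition is capped at a fraction of a TFLOPS no matter how fast the cores are, which is why fusing elementwise operations matters more than raw compute for such kernels.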

Memory Coalescing:

When 32 threads in a Warp access contiguous memory, hardware merges this into a single large transaction. Non-contiguous access splits into multiple transactions, wasting bandwidth.

// Good: Coalesced Access (contiguous)
// Thread 0 -> data[0], Thread 1 -> data[1], ..., Thread 31 -> data[31]
__global__ void coalesced(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = data[idx];  // Single 128-byte transaction
}

// Bad: Strided Access (non-contiguous)
// Thread 0 -> data[0], Thread 1 -> data[32], ..., Thread 31 -> data[31*32]
__global__ void strided(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    float val = data[idx];  // 32 separate transactions!
}
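The cost of a strided pattern can be quantified by counting how many memory sectors one warp touches. The sketch below assumes 4-byte elements and 32-byte sectors, which is how recent NVIDIA GPUs are generally described as servicing global memory; treat the sector size as an assumption for your architecture.

```python
def sectors_touched(stride_elems: int, elem_bytes: int = 4,
                    warp_size: int = 32, sector_bytes: int = 32) -> int:
    """Count the distinct memory sectors accessed by one warp."""
    addresses = [lane * stride_elems * elem_bytes for lane in range(warp_size)]
    return len({addr // sector_bytes for addr in addresses})


def load_efficiency(stride_elems: int, elem_bytes: int = 4,
                    warp_size: int = 32, sector_bytes: int = 32) -> float:
    """Useful bytes divided by bytes actually fetched."""
    useful = warp_size * elem_bytes
    fetched = sectors_touched(stride_elems, elem_bytes, warp_size, sector_bytes) * sector_bytes
    return useful / fetched


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(f"stride={stride:2d}: {sectors_touched(stride):2d} sectors, "
              f"{load_efficiency(stride):.1%} efficiency")
```

A stride of 1 touches 4 sectors (one 128-byte request), while a stride of 32 touches 32 sectors to deliver the same 128 useful bytes.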

Bank Conflicts:

Shared Memory is divided into 32 banks. When 2+ threads in the same Warp access the same bank, accesses are serialized, degrading performance.

Shared Memory Bank Layout (4-byte granularity):
Bank 0: addr 0, 128, 256, ...
Bank 1: addr 4, 132, 260, ...
Bank 2: addr 8, 136, 264, ...
...
Bank 31: addr 124, 252, 380, ...

Bank Conflict Example:
Thread 0 -> Bank 0 (addr 0)
Thread 1 -> Bank 0 (addr 128)  <- Same bank! Conflict
-> 2-way bank conflict: 2x slower

Avoidance: Add padding
__shared__ float tile[32][33];  // Pad to 33 (instead of 32)
// Avoids bank conflicts on column access
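
The effect of the padding can be verified by mapping each lane's word index to a bank. The sketch below assumes the 32-bank, 4-byte layout shown above; the helper names are illustrative.

```python
NUM_BANKS = 32  # 4-byte-wide banks, as in the layout above


def conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Worst-case number of distinct 4-byte words mapped to a single bank.

    1 means conflict-free. Lanes that read the same word are counted once,
    because the hardware serves them with a broadcast.
    """
    words_per_bank = {}
    for word in set(word_indices):
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


def column_access(row_width, column=0, lanes=32):
    """Word index touched by each lane when lane i reads tile[i][column]."""
    return [lane * row_width + column for lane in range(lanes)]


print("tile[32][32] column read:", conflict_degree(column_access(32)), "-way")
print("tile[32][33] column read:", conflict_degree(column_access(33)), "-way")
```

With a row width of 33 words, lane i lands in bank (33 * i) mod 32 = i, so every lane gets its own bank.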

3-3. CUDA Programming Fundamentals

Grid, Block, Thread Hierarchy

CUDA Execution Model:

Grid (1)
├── Block (0,0)  --  Block (1,0)  --  Block (2,0)
├── Block (0,1)  --  Block (1,1)  --  Block (2,1)
└── Block (0,2)  --  Block (1,2)  --  Block (2,2)

Each Block:
├── Thread (0,0) ... Thread (15,0)
├── Thread (0,1) ... Thread (15,1)
└── Thread (0,15) ... Thread (15,15)

Constraints:
- Max 1024 threads per Block
- Block dimensions: max (1024, 1024, 64)
- Grid dimensions: max (2^31-1, 65535, 65535)
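
As a quick sanity check of these limits, the following sketch validates a block shape and computes a 1-D grid size with the usual ceiling division. The constants are the limits listed above; the function names are illustrative.

```python
from math import prod

MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)


def is_valid_block(block_dim):
    """True if each axis is within its limit and the total thread count is <= 1024."""
    within_axes = all(1 <= d <= limit for d, limit in zip(block_dim, MAX_BLOCK_DIM))
    return within_axes and prod(block_dim) <= MAX_THREADS_PER_BLOCK


def launch_config_1d(n, block_size=256):
    """Return (grid_size, block_size) covering n elements with one thread each."""
    if not is_valid_block((block_size, 1, 1)):
        raise ValueError(f"invalid block size: {block_size}")
    grid_size = (n + block_size - 1) // block_size  # ceiling division
    if grid_size > MAX_GRID_DIM[0]:
        raise ValueError("grid.x limit exceeded; use a grid-stride loop")
    return grid_size, block_size


print(launch_config_1d(1 << 20))      # 1M elements
print(is_valid_block((32, 32, 1)))    # 1024 threads
print(is_valid_block((1024, 2, 1)))   # 2048 threads
```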

CUDA Code Example: Vector Addition

#include <stdio.h>
#include <stdlib.h>        // malloc, free
#include <cuda_runtime.h>

// GPU kernel function
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t size = n * sizeof(float);

    // Host memory allocation
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize
    for (int i = 0; i < n; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Host -> Device copy
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel launch
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
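    vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);
    cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }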

    // Device -> Host copy
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify
    printf("C[0] = %f (expected 3.0)\n", h_C[0]);

    // Free memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

CUDA Code Example: Matrix Multiplication (Shared Memory Tiling)

#define TILE_SIZE 32

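// Launch with dim3 block(TILE_SIZE, TILE_SIZE) and a grid of ceil(N / TILE_SIZE) blocks per axis.
// A, B, and C are row-major N x N matrices.
__global__ void matMul(const float* A, const float* B, float* C, int N) {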
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load from Global Memory to Shared Memory
        if (row < N && t * TILE_SIZE + threadIdx.x < N)
            tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // Synchronize all threads in Block

        for (int k = 0; k < TILE_SIZE; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];

        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}

Key CUDA Libraries

Library	Purpose	Key API
cuBLAS	Linear algebra (matrix ops)	cublasSgemm, cublasGemmEx
cuDNN	Deep learning primitives	cudnnConvolutionForward
cuFFT	Fast Fourier Transform	cufftExecC2C
cuSPARSE	Sparse matrix operations	cusparseSpMV
Thrust	C++ parallel algorithms (STL-like)	thrust::sort, thrust::reduce
CUTLASS	GEMM customization	Template-based GEMM

3-4. GPU Profiling and Performance Analysis

nvidia-smi Detailed Usage

# Basic status check
nvidia-smi

# Monitor at 1-second intervals
nvidia-smi dmon -s pucvmet -d 1

# Detailed GPU process info
nvidia-smi pmon -d 1

# Query format (for scripts)
nvidia-smi --query-gpu=timestamp,gpu_bus_id,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 1

# MIG status check
nvidia-smi mig -lgi
nvidia-smi mig -lci

# GPU topology check (NVLink connections)
nvidia-smi topo -m

# ECC error check
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
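
For scripting, the CSV query format is easy to post-process. The sketch below parses output of the form produced by `--format=csv,noheader,nounits`; the sample values are hypothetical, and the helper names are illustrative. Fields that report `[N/A]` on some GPUs would need extra handling.

```python
import csv
import io

# Hypothetical output of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
SAMPLE_OUTPUT = """\
0, 97, 40960, 81920
1, 12, 8192, 81920
"""


def parse_gpu_csv(text):
    """Turn the CSV text into a list of per-GPU dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, util, used, total = (int(field) for field in record)
        rows.append({
            "index": index,
            "util_pct": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
            "mem_pct": 100.0 * used / total,
        })
    return rows


def underutilized(rows, threshold_pct=50):
    """Indices of GPUs whose utilization is below the threshold."""
    return [row["index"] for row in rows if row["util_pct"] < threshold_pct]


gpus = parse_gpu_csv(SAMPLE_OUTPUT)
print(gpus)
print("underutilized GPUs:", underutilized(gpus))
```

On a real host, the text would come from `subprocess.run([...], capture_output=True, text=True).stdout` instead of the sample string.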

NVIDIA Nsight Systems (System-Level Profiling)

# Full application profiling
nsys profile --stats=true -o report python train.py

# Trace CUDA API calls + GPU kernels + memory transfers
nsys profile --trace=cuda,nvtx,osrt -o detailed_report python train.py

# Visualize results (GUI)
nsys-ui report.nsys-rep

Nsight Systems provides a timeline view for:

Identifying CPU-GPU synchronization points
Checking kernel execution and memory transfer overlap
Identifying CPU bottlenecks (data loading, preprocessing)
Measuring NCCL communication time (distributed training)

NVIDIA Nsight Compute (Kernel-Level Analysis)

# Detailed analysis of all kernels (full metric set; expect a large slowdown)
ncu --target-processes all --set full -o kernel_report python train.py

# Profile only kernels whose name matches a pattern (first 10 launches)
ncu --kernel-name regex:sgemm --launch-count 10 -o sgemm_report ./my_app

# Check key metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_active,dram__throughput.avg.pct_of_peak_sustained_active ./my_app

Key Metrics:

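SM Occupancy: Ratio of resident (active) warps to the maximum number of warps an SM supports (higher occupancy helps hide latency; 50%+ is a common rule of thumb, not a guarantee of performance)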
Compute Throughput: Compute utilization (% of peak)
Memory Throughput: Memory bandwidth utilization (% of peak)
Warp Stall Reasons: Why warps are waiting (memory, synchronization, etc.)
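
Theoretical occupancy can be estimated from a kernel's resource usage. The sketch below is a simplified model: the default per-SM limits are assumptions taken from A100-class hardware (64 warps, 65,536 registers, 164 KB shared memory, 32 blocks), and it ignores allocation granularity, so Nsight Compute may report slightly different values.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, regs_per_sm=65536,
                          smem_per_sm=164 * 1024, max_blocks_per_sm=32,
                          warp_size=32):
    """Fraction of the SM's warp slots that can be resident at once.

    Per-SM limits default to assumed A100-class values; adjust per architecture.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    # Each resource caps how many blocks fit on one SM; the tightest cap wins.
    limits = [max_blocks_per_sm, max_warps_per_sm // warps_per_block]
    if regs_per_thread:
        limits.append(regs_per_sm // (regs_per_thread * warps_per_block * warp_size))
    if smem_per_block:
        limits.append(smem_per_sm // smem_per_block)
    resident_blocks = min(limits)
    return resident_blocks * warps_per_block / max_warps_per_sm


print(theoretical_occupancy(256, 32, 0))          # limited by warp slots and registers
print(theoretical_occupancy(256, 64, 0))          # register pressure halves occupancy
print(theoretical_occupancy(256, 32, 48 * 1024))  # shared memory is the limiter
```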

DCGM (Data Center GPU Manager)

For large-scale cluster monitoring, nvidia-smi alone is insufficient. DCGM provides:

# Start DCGM
sudo systemctl start nvidia-dcgm

# Health check
dcgmi health -g 0 -c

# Run diagnostics (Level 3: long-running extended tests; Level 4 runs even longer)
dcgmi diag -r 3 -g 0

# Metric collection (Prometheus integration)
dcgm-exporter &
# Prometheus scrapes http://localhost:9400/metrics

GPU Utilization Low: Analysis Pattern

GPU Utilization Low (<50%)
├── CPU bottleneck?
│   ├── Slow data loading -> Increase num_workers, prefetch
│   ├── Heavy preprocessing -> Use DALI (GPU preprocessing)
│   └── Python GIL -> Multiprocessing
├── Memory transfer bottleneck?
│   ├── PCIe bandwidth saturated -> Use GPU Direct
│   └── Unnecessary CPU-GPU copies -> Use pinned memory
├── Small kernels + large overhead?
│   ├── Kernel launch overhead -> Use CUDA Graphs
│   └── Excessive synchronization -> Optimize async execution
├── Batch size too small?
│   └── Not enough work to fill GPU -> Increase batch or use Gradient Accumulation
└── Communication overhead? (distributed training)
    ├── AllReduce taking too long -> NCCL tuning
    └── Network bottleneck -> Check InfiniBand

4. GPU Virtualization Technology

4-1. Virtualization Fundamentals

Type 1 vs Type 2 Hypervisor

Type 1 (Bare-metal):           Type 2 (Hosted):
+-----------------+            +-----------------+
|   VM1    VM2    |            |   VM1    VM2    |
| +-----++-----+ |            | +-----++-----+ |
| |Guest||Guest| |            | |Guest||Guest| |
| | OS  || OS  | |            | | OS  || OS  | |
| +-----++-----+ |            | +-----++-----+ |
|  Hypervisor     |            |  Hypervisor     |
|  (ESXi, KVM)   |            |  (VirtualBox)   |
|  Hardware       |            |  Host OS        |
+-----------------+            |  Hardware       |
                               +-----------------+

In the LG Uplus environment, KVM is the core technology. KVM is a Type 1 hypervisor that operates as a Linux kernel module, using QEMU as the userspace emulator.

IOMMU (Intel VT-d / AMD-Vi)

IOMMU is an essential hardware feature for GPU virtualization:

Without IOMMU (unsafe):
VM -> (virtual address) -> Physical Memory (direct access -> can access other VM memory)

With IOMMU (safe):
VM -> (virtual address) -> IOMMU translation -> (physical address, isolated)
                            └── DMA requests also isolated!

Verifying IOMMU is enabled:

# Check IOMMU groups
find /sys/kernel/iommu_groups/ -type l

# Check kernel boot parameters
cat /proc/cmdline | grep iommu
# Should contain intel_iommu=on or amd_iommu=on

# Check devices per IOMMU group
for g in /sys/kernel/iommu_groups/*/devices/*; do
    echo "IOMMU Group $(basename $(dirname $(dirname $g))):"
    lspci -nns $(basename $g)
done

4-2. PCI Passthrough

PCI Passthrough is the most basic method of directly assigning a physical GPU to a VM.

PCI Passthrough Architecture:

Host (Linux + KVM)
├── GPU 0 -> VFIO driver binding -> VM1 (direct access)
├── GPU 1 -> VFIO driver binding -> VM2 (direct access)
├── GPU 2 -> NVIDIA driver -> Host use
└── GPU 3 -> NVIDIA driver -> Host use

Setup procedure:

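# 1. Enable IOMMU (GRUB)
# Add to /etc/default/grub (use amd_iommu=on on AMD hosts):
# GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
# Then regenerate the GRUB config (update-grub or grub2-mkconfig) and reboot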

# 2. Find GPU PCI ID
lspci -nn | grep NVIDIA
# 41:00.0 3D controller [0302]: NVIDIA Corporation A100 [10de:20b2]

# 3. Bind to VFIO driver
echo "10de 20b2" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "0000:41:00.0" > /sys/bus/pci/devices/0000:41:00.0/driver/unbind
echo "0000:41:00.0" > /sys/bus/pci/drivers/vfio-pci/bind

# 4. Add device when starting VM with QEMU/KVM
# -device vfio-pci,host=41:00.0

Pros and Cons:

Pros	Cons
Native performance (nearly zero overhead)	1 GPU = 1 VM (no sharing)
Simple setup	No live migration
All CUDA features supported	Potential GPU resource waste

4-3. vGPU (NVIDIA Virtual GPU)

vGPU uses time-slicing to share a single physical GPU across multiple VMs.

vGPU Architecture:

Physical GPU (A100-80GB)
├── vGPU Instance 1 (A100-4C, 4GB) -> VM1
├── vGPU Instance 2 (A100-4C, 4GB) -> VM2
├── vGPU Instance 3 (A100-8C, 8GB) -> VM3
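└── ... (as many as remaining capacity allows)

Note: In the default equal-size mode, all time-sliced vGPUs on one physical GPU must use the same profile. Mixing sizes such as 4C and 8C on one GPU requires the mixed-size mode of recent vGPU releases on supported GPUs.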

Time-slicing:
t=0ms  [VM1 runs] -> t=16ms [VM2 runs] -> t=32ms [VM3 runs] -> ...

vGPU Profile Types

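Series	Purpose	Example
A-series	Virtual Applications (vApps)	A40-4A (4GB, application streaming)
B-series	Virtual PC (vPC)	A40-2B (2GB, VDI desktop)
C-series	Compute (vCS / NVIDIA AI Enterprise)	A100-4C (4GB, AI compute)
Q-series	RTX Virtual Workstation (formerly Quadro vDWS)	A40-8Q (8GB, professional graphics)

Compute-only boards such as the A100 and H100 expose only C-series profiles, so for LG Uplus GPU Technology TF, C-series (Compute) will be the primary focus.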

vGPU Scheduler

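# vGPU Scheduler Types
Best Effort (default):
  - Round-robin; idle vGPU time is redistributed to active vGPUs
  - Most efficient, but performance is less predictable (noisy neighbors)

Equal Share:
  - GPU time is divided equally among the vGPUs that are currently running
  - Each vGPU's share grows or shrinks as VMs start and stop

Fixed Share:
  - Each vGPU gets a fixed share based on the maximum number of vGPUs its profile allows per physical GPU
  - Example: a profile that allows 4 vGPUs per GPU gives each vGPU 25%, even when the others are idle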

4-4. MIG (Multi-Instance GPU)

MIG is available on NVIDIA Ampere and later datacenter GPUs (for example A100, A30, and H100) and physically partitions the GPU. Unlike vGPU's time-slicing, MIG gives each instance its own SMs, memory, and memory-bandwidth slice.

MIG Architecture (A100-80GB):

Full GPU: 108 SM, 80GB HBM2e
├── MIG Instance 1 (7g.80gb): 98 SM, 80GB  <- Nearly full (solo use)
or
├── MIG Instance 1 (4g.40gb): 56 SM, 40GB
├── MIG Instance 2 (3g.40gb): 42 SM, 40GB
or
├── MIG Instance 1 (3g.40gb): 42 SM, 40GB
├── MIG Instance 2 (2g.20gb): 28 SM, 20GB
├── MIG Instance 3 (1g.10gb): 14 SM, 10GB
├── MIG Instance 4 (1g.10gb): 14 SM, 10GB
or (maximum partition)
├── MIG Instance 1~7 (1g.10gb): 14 SM each, 10GB each (x7)
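
The partition examples above follow a simple slice budget: an A100-80GB has 7 compute slices and 8 memory slices. The sketch below checks a proposed layout against that budget. It is a necessary condition only; the driver also enforces placement rules, so the authoritative answer is what `nvidia-smi mig -lgip` and instance creation report. The table and function names are illustrative.

```python
# (compute slices, memory slices) per profile on an A100-80GB,
# matching the SM and memory figures in the layout above.
PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8
SMS_PER_COMPUTE_SLICE = 14


def check_layout(profiles):
    """Sum slice usage for a list of profile names and report whether it fits."""
    compute = sum(PROFILES[name][0] for name in profiles)
    memory = sum(PROFILES[name][1] for name in profiles)
    return {
        "compute_slices": compute,
        "memory_slices": memory,
        "sms": compute * SMS_PER_COMPUTE_SLICE,
        "fits": compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES,
    }


print(check_layout(["4g.40gb", "3g.40gb"]))
print(check_layout(["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]))
print(check_layout(["4g.40gb", "4g.40gb"]))  # exceeds the compute budget
```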

MIG Configuration Commands

# Enable MIG
sudo nvidia-smi -i 0 -mig 1

# Check available MIG profiles
nvidia-smi mig -lgip

# Create GPU Instances (profile IDs come from the -lgip output; on A100, 9 = 3g and 19 = 1g)
sudo nvidia-smi mig -i 0 -cgi 9,19,19,19  # 3g.40gb + 1g.10gb x3

# Create Compute Instances
sudo nvidia-smi mig -i 0 -cci

# Check current MIG status
nvidia-smi mig -lgi
nvidia-smi mig -lci

# Delete MIG instances
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi

# Disable MIG
sudo nvidia-smi -i 0 -mig 0

MIG vs vGPU Comparison

Feature	MIG	vGPU
Isolation Level	Physical (SM + memory fully separated)	Time-slicing (software isolation)
Performance Predictability	Consistent (dedicated resources)	Variable (affected by other VMs)
Max Instances	7 (A100/H100)	Many (within GPU memory limits)
Supported GPUs	A100, H100, A30	Most datacenter GPUs
Flexibility	Fixed profiles (reconfiguration needed)	Dynamic allocation possible
Licensing	No additional license needed	vGPU license required
Use Cases	Inference serving, small-scale training	VDI, mixed workloads

MIG on K8s: NVIDIA MIG Manager

# MIG Configuration ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - device-filter: ["0x20B210DE"]
          devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      mixed-config:
        - device-filter: ["0x20B210DE"]
          devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 4

4-5. SR-IOV (NIC Virtualization)

SR-IOV virtualizes NICs for direct VM assignment. This is important when combined with GPU Direct RDMA.

SR-IOV Structure:

Physical NIC (ConnectX-7)
├── PF (Physical Function): Managed by host driver
├── VF 0 (Virtual Function) -> VM1 (direct assignment, native performance)
├── VF 1 -> VM2
├── VF 2 -> VM3
└── ... (up to 128 VFs)

Advantages:
- VMs access NIC directly without virtual bridge
- Near-native network performance
- Minimal CPU overhead

GPU Direct RDMA Combination:
GPU in VM <-> VF(SR-IOV NIC) <-> InfiniBand <-> Remote GPU
  (PCIe direct)  (SR-IOV bypass)    (RDMA)

4-6. KubeVirt

KubeVirt manages VMs as first-class resources on Kubernetes. It is a core technology when containers and VMs need to run on the same platform.

# KubeVirt VM with GPU PCI Passthrough
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - name: gpu
              deviceName: nvidia.com/A100
        resources:
          requests:
            memory: '32Gi'
            cpu: '8'
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/ubuntu:22.04

KubeVirt + vGPU:

# KubeVirt VM with vGPU allocation
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vgpu-vm
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            - name: vgpu
              deviceName: nvidia.com/NVIDIA_A100-4C
        resources:
          requests:
            memory: '16Gi'

Use Cases:

Legacy VM workloads: Migrating existing VM-based AI workloads to K8s
Mixed environments: Running containers + VMs simultaneously on the same K8s cluster
GPU sharing: Flexibly allocating GPUs to VMs and containers through vGPU

5. High-Speed Networking: InfiniBand and RDMA

5-1. InfiniBand Architecture

Distributed GPU training performance is determined by the network. No matter how fast the GPUs are, slow inter-GPU communication degrades overall performance.

InfiniBand vs Ethernet Comparison

Feature	InfiniBand NDR	RoCE v2 (100GbE)	TCP/IP (100GbE)
Bandwidth	400 Gbps	100 Gbps	100 Gbps
Latency	0.5us	1~2us	10~50us
RDMA Support	Native	RoCE v2	None (kernel path)
CPU Overhead	Nearly zero	Low	High
Congestion Control	Credit-based	PFC/ECN	TCP congestion control
Cost	Very high	Medium	Low
Use Case	HPC, AI training	AI training (cloud)	General workloads

InfiniBand Generations

InfiniBand Speed Evolution:
SDR  (2001):   10 Gbps
DDR  (2005):   20 Gbps
QDR  (2008):   40 Gbps
FDR  (2011):  56 Gbps
EDR  (2014): 100 Gbps
HDR  (2018): 200 Gbps
NDR  (2022): 400 Gbps
XDR  (2024): 800 Gbps
GDR  (2026): 1.6 Tbps (planned)

InfiniBand Network Components

InfiniBand Fabric Structure:

Leaf Switch (ToR)
├── HCA (Host Channel Adapter) -- Server 1 [GPU 0~7]
├── HCA -- Server 2 [GPU 0~7]
├── HCA -- Server 3 [GPU 0~7]
└── HCA -- Server 4 [GPU 0~7]

Spine Switch
├── Leaf Switch 1
├── Leaf Switch 2
├── Leaf Switch 3
└── Leaf Switch 4

Management Components:
- Subnet Manager (OpenSM): LID assignment, routing table management
- LID (Local ID): Subnet address (16-bit)
- GID (Global ID): Global address (128-bit, IPv6-like)
- GUID (Globally Unique ID): Hardware unique identifier

5-2. RDMA (Remote Direct Memory Access)

RDMA accesses remote memory directly without CPU involvement. It is the backbone of distributed GPU training.

TCP/IP Transfer (Traditional):
App -> Socket API -> TCP/IP Stack (Kernel) -> NIC Driver -> NIC -> Network
                     ^ CPU involved (copy, checksum, segmentation)

RDMA Transfer:
App -> RDMA Verbs -> NIC (direct) -> Network
       ^ Zero-copy, CPU bypass

RDMA Transport Types

Transport	Description	Use Case
InfiniBand	Native RDMA	HPC, AI clusters
RoCE v2	RDMA over UDP/IP	Cloud environments
iWARP	RDMA over TCP/IP	Legacy environments

RDMA Programming Basics

// ibverbs-based RDMA Write example (simplified)
#include <infiniband/verbs.h>

// 1. Open device
struct ibv_context *ctx = ibv_open_device(dev);

// 2. Create Protection Domain
struct ibv_pd *pd = ibv_alloc_pd(ctx);

// 3. Register memory (for NIC direct access)
struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

// 4. Create Queue Pair
struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr);

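// 5. RDMA Write (write directly to remote memory; the QP must already be in the RTS state)
struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = size, .lkey = mr->lkey };
struct ibv_send_wr wr = {0}, *bad_wr = NULL;
wr.sg_list = &sge;                     // local source buffer
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;     // request a completion entry
wr.wr.rdma.remote_addr = remote_addr;  // peer address and rkey are exchanged out of band
wr.wr.rdma.rkey = remote_key;
ibv_post_send(qp, &wr, &bad_wr);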

5-3. GPU Direct

GPU Direct RDMA

GPU Direct RDMA enables direct data transfer from GPU memory to remote GPU memory.

Normal Path (without GPU Direct):
GPU0 -> PCIe -> Host Memory -> NIC -> Network -> NIC -> Host Memory -> PCIe -> GPU1
       (copy1)                (copy2)          (copy3)               (copy4)

GPU Direct RDMA:
GPU0 -> PCIe -> NIC -> Network -> NIC -> PCIe -> GPU1
       (direct)                         (direct)
CPU bypass: the host-memory staging copies on both sides are eliminated (4 copies -> 2)

GPU Direct Storage (GDS)

Normal Storage Access:
NVMe -> Host Memory (bounce buffer) -> GPU Memory
        CPU involved, 2 copies

GPU Direct Storage:
NVMe -> GPU Memory (direct)
        CPU bypass, 1 copy

Use Cases: Large dataset loading (checkpoint recovery, data preprocessing)

NCCL + InfiniBand Combination

# NCCL environment variables (distributed training)
export NCCL_IB_HCA=mlx5_0,mlx5_1  # Specify InfiniBand HCAs
export NCCL_IB_GID_INDEX=3         # GID index (RoCE v2 only; usually unnecessary on native InfiniBand)
export NCCL_SOCKET_IFNAME=eth0     # Control channel interface
export NCCL_DEBUG=INFO             # Debug logging

# NCCL topology file (GPU-NIC mapping optimization)
export NCCL_TOPO_FILE=/path/to/topo.xml

# NCCL AllReduce benchmark
/usr/local/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
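
When reading `all_reduce_perf` output, it helps to know how the reported bus bandwidth relates to the measured time. The sketch below uses the AllReduce correction factor 2(n-1)/n that nccl-tests documents, plus a bandwidth-only ring estimate that ignores latency. The function names are illustrative.

```python
def allreduce_algbw(size_bytes, time_s):
    """Algorithm bandwidth: message size divided by elapsed time."""
    return size_bytes / time_s


def allreduce_busbw(size_bytes, time_s, n_ranks):
    """Bus bandwidth: algbw scaled by 2(n-1)/n, comparable to the hardware link speed."""
    return allreduce_algbw(size_bytes, time_s) * 2 * (n_ranks - 1) / n_ranks


def ring_allreduce_time(size_bytes, n_ranks, link_bw_bytes_per_s):
    """Bandwidth-only lower bound for a ring AllReduce (latency ignored)."""
    return 2 * (n_ranks - 1) / n_ranks * size_bytes / link_bw_bytes_per_s


# 1 GB AllReduce over 8 ranks that took 10 ms
print(f"busbw = {allreduce_busbw(1e9, 0.010, 8) / 1e9:.1f} GB/s")
# Expected time for the same message on a 50 GB/s link
print(f"ring estimate = {ring_allreduce_time(1e9, 8, 50e9) * 1e3:.1f} ms")
```

If the measured bus bandwidth is far below the link speed for large messages, the usual suspects are GPU-NIC affinity, a fallback to TCP sockets, or fabric congestion.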

5-4. Network Performance Tuning

InfiniBand Benchmarks

# Bandwidth test
# Server: ib_write_bw --size=65536
# Client: ib_write_bw --size=65536 <server_ip>

# Latency test
# Server: ib_write_lat
# Client: ib_write_lat <server_ip>

# Example results (NDR 400Gbps):
# Bandwidth: ~48 GB/s (theoretical 50 GB/s)
# Latency: ~0.6 us
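
The relationship between the line rate and the measured figure is simple unit conversion, sketched below with illustrative helper names.

```python
def gbps_to_gigabytes_per_s(line_rate_gbps):
    """Convert a line rate in Gbit/s to GByte/s (8 bits per byte)."""
    return line_rate_gbps / 8


def link_efficiency(measured_gigabytes_per_s, line_rate_gbps):
    """Measured bandwidth as a fraction of the theoretical line rate."""
    return measured_gigabytes_per_s / gbps_to_gigabytes_per_s(line_rate_gbps)


print(gbps_to_gigabytes_per_s(400))        # NDR theoretical bandwidth in GB/s
print(f"{link_efficiency(48, 400):.0%}")   # the example result above
```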

PFC (Priority Flow Control) Configuration

PFC is essential in RoCE v2 environments:

# Mellanox NIC PFC configuration
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0
# Enable PFC only on Priority 3 (RoCE traffic)

# DSCP -> Priority mapping
mlnx_qos -i eth0 --trust dscp

ECMP (Equal-Cost Multi-Path) Routing

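Multipath in Large-Scale Leaf-Spine Fabrics:

Server A --- Leaf 1 -+- Spine 1 -+- Leaf 3 --- Server C
                      +- Spine 2 -+
                      +- Spine 3 -+
                      +- Spine 4 -+

-> Load-balance across 4 equal-cost paths
-> RoCE/Ethernet: ECMP hashes each flow (5-tuple) onto one path, so a few large RDMA flows can collide on the same link
-> InfiniBand: the Subnet Manager programs per-destination-LID forwarding tables (e.g., fat-tree routing)
-> Adaptive Routing (AR): Dynamic path selection based on congestion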

6. Kubernetes GPU Management

6-1. NVIDIA GPU Operator

GPU Operator automatically deploys the GPU software stack to K8s clusters.

GPU Operator Components:

GPU Operator
├── NVIDIA Driver (DaemonSet)
│   └── Auto-builds/installs kernel module
├── NVIDIA Container Toolkit
│   └── Adds GPU support to container runtime
├── NVIDIA Device Plugin
│   └── Registers GPU resources with K8s
├── DCGM Exporter
│   └── GPU metrics -> Prometheus
├── MIG Manager
│   └── Auto-applies MIG profiles
├── GPU Feature Discovery (GFD)
│   └── Auto-adds GPU labels to nodes
└── NVIDIA Validator
    └── Validates installation state

Installation:

# Install GPU Operator via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set mig.strategy=mixed \
  --set dcgmExporter.enabled=true

6-2. GPU Device Plugin

# Allocating GPUs to a Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.3.0-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 2 # Request 2 GPUs
      command: ['nvidia-smi']

For GPUs that don't support MIG, multiple Pods can share a GPU through time-slicing (note that this provides no memory or fault isolation between Pods):

# GPU Time-Slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Split 1 GPU into 4 (time-slicing)

6-3. GPU Scheduling

Basic Scheduling

K8s default GPU scheduling is simple: place Pods on nodes with sufficient available GPUs. However, large-scale GPU clusters require more sophisticated scheduling.

Topology-Aware Scheduling

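Example GPU Topology (8-GPU PCIe server with NVLink bridges):
GPU0 - NVLink - GPU1 (bridged pair)
GPU2 - NVLink - GPU3 (bridged pair)
GPU4 - NVLink - GPU5 (bridged pair)
GPU6 - NVLink - GPU7 (bridged pair)

GPU0 - PCIe - GPU4 (different pair, PCIe connection)

-> 4-GPU training: GPU0,1,2,3 (two NVLink pairs) >> GPU0,2,4,6 (PCIe only)

Note: On DGX H100, all 8 GPUs are connected all-to-all through NVSwitch, so every GPU pair has full NVLink bandwidth. There, GPU-NIC and NUMA affinity matter more than which GPUs are picked.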
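
A scheduler or launcher can pick the best-connected GPUs by scoring candidate subsets. The sketch below assumes the pairwise NVLink layout from the topology above, encoded as a hypothetical link table; in practice the table would be derived from `nvidia-smi topo -m`. Exhaustive search is fine for 8 GPUs.

```python
from itertools import combinations

# Hypothetical link table: GPU pairs joined by NVLink; every other pair uses PCIe.
NVLINK_PAIRS = {(0, 1), (2, 3), (4, 5), (6, 7)}


def nvlink_edges(gpus, nvlink_pairs=NVLINK_PAIRS):
    """Count how many GPU pairs inside the subset are connected by NVLink."""
    return sum(1 for pair in combinations(sorted(gpus), 2) if pair in nvlink_pairs)


def best_subset(free_gpus, k, nvlink_pairs=NVLINK_PAIRS):
    """Pick k free GPUs with the most internal NVLink connections (first best wins)."""
    candidates = combinations(sorted(free_gpus), k)
    return max(candidates, key=lambda s: nvlink_edges(s, nvlink_pairs), default=None)


print(best_subset(range(8), 4))            # prefers complete NVLink pairs
print(nvlink_edges((0, 2, 4, 6)))          # no NVLink inside this subset
print(best_subset([0, 2, 3, 5, 6, 7], 2))  # only some GPUs are free
```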

# Node selection with GFD labels (chooses the node type; GPU placement inside the node is left to the device plugin)
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
spec:
  nodeSelector:
    nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
    nvidia.com/gpu.count: '8'
  containers:
    - name: trainer
      image: training-image:latest # placeholder image name
      resources:
        limits:
          nvidia.com/gpu: 4

Gang Scheduling

In distributed training, all GPUs must be allocated simultaneously. Partial allocation wastes resources as allocated GPUs wait for the rest.

# Gang Scheduling with Volcano
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 4 # Schedule minimum 4 Pods simultaneously
  schedulerName: volcano
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: training-image:latest
              resources:
                limits:
                  nvidia.com/gpu: 8 # 8 GPUs per node
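
The all-or-nothing behavior behind `minAvailable` can be illustrated with a toy admission check. This is a simplified model, not Volcano's actual algorithm: it places Pods first-fit and admits the job only if at least `min_available` Pods fit. All names are illustrative.

```python
def gang_schedule(free_gpus_per_node, replicas, gpus_per_pod, min_available):
    """Return {pod_index: node} if at least min_available Pods fit, else None.

    Nothing is reserved when the gang does not fit, so partially placed
    jobs never hold GPUs while waiting for the rest.
    """
    free = dict(free_gpus_per_node)  # work on a copy; commit only on success
    placement = {}
    for pod in range(replicas):
        node = next((n for n, gpus in free.items() if gpus >= gpus_per_pod), None)
        if node is None:
            break
        free[node] -= gpus_per_pod
        placement[pod] = node
    return placement if len(placement) >= min_available else None


three_nodes = {"node-a": 8, "node-b": 8, "node-c": 8}
four_nodes = {**three_nodes, "node-d": 8}

print(gang_schedule(three_nodes, replicas=4, gpus_per_pod=8, min_available=4))
print(gang_schedule(four_nodes, replicas=4, gpus_per_pod=8, min_available=4))
```

With only three free nodes the job stays pending and holds no GPUs, which is the point of gang scheduling.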

Bin-packing vs Spread Strategy

Bin-packing (resource consolidation):
Node1: [GPU0 used, GPU1 used, GPU2 used, GPU3 free]
Node2: [GPU0 free, GPU1 free, GPU2 free, GPU3 free]
-> Pros: Power savings on idle nodes, resource efficiency
-> Cons: Potential hot spots

Spread (distribution):
Node1: [GPU0 used, GPU1 free, GPU2 used, GPU3 free]
Node2: [GPU0 used, GPU1 free, GPU2 used, GPU3 free]
-> Pros: Load distribution, fault isolation
-> Cons: Resource fragmentation
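
The two strategies differ only in how candidate nodes are ranked. The sketch below is a toy scorer with illustrative names: bin-packing prefers the node with the fewest free GPUs that still fits, and spread prefers the node with the most.

```python
def pick_node(free_gpus_per_node, request, strategy):
    """Choose a node for a Pod requesting `request` GPUs, or None if nothing fits."""
    candidates = {n: g for n, g in free_gpus_per_node.items() if g >= request}
    if not candidates:
        return None
    if strategy == "binpack":
        ranked = sorted(candidates, key=lambda n: (candidates[n], n))   # fewest free first
    elif strategy == "spread":
        ranked = sorted(candidates, key=lambda n: (-candidates[n], n))  # most free first
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[0]


def place_pods(free_gpus_per_node, num_pods, strategy, request=1):
    """Place Pods one by one and return the chosen node for each."""
    free = dict(free_gpus_per_node)
    chosen = []
    for _ in range(num_pods):
        node = pick_node(free, request, strategy)
        if node is not None:
            free[node] -= request
        chosen.append(node)
    return chosen


cluster = {"node1": 4, "node2": 4}
print(place_pods(cluster, 4, "binpack"))
print(place_pods(cluster, 4, "spread"))
```

Bin-packing keeps whole nodes free for large multi-GPU jobs, which is why it is often preferred on training clusters despite the hot-spot risk.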

GPU Feature Discovery (GFD)

# Example node labels added by GFD
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.count=8
nvidia.com/gpu.memory=81920
nvidia.com/cuda.driver.major=535
nvidia.com/mig.strategy=mixed
nvidia.com/gpu.family=ampere
nvidia.com/mig-1g.10gb.count=4
nvidia.com/mig-3g.40gb.count=1

6-4. GPU Monitoring on K8s

DCGM Exporter + Prometheus + Grafana

# DCGM Exporter DaemonSet (included in GPU Operator)
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Key Prometheus Metrics:

# GPU Utilization
DCGM_FI_DEV_GPU_UTIL             # GPU utilization: % of time at least one kernel was running (not SM occupancy)
DCGM_FI_DEV_MEM_COPY_UTIL        # Memory utilization: % of time device memory was being read or written

# Memory
DCGM_FI_DEV_FB_USED              # Used framebuffer (MB)
DCGM_FI_DEV_FB_FREE              # Free framebuffer (MB)

# Temperature/Power
DCGM_FI_DEV_GPU_TEMP             # GPU temperature (C)
DCGM_FI_DEV_POWER_USAGE          # Power usage (W)

# Errors
DCGM_FI_DEV_XID_ERRORS           # XID error codes
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL    # Single-bit ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL    # Double-bit ECC errors

# PCIe
DCGM_FI_DEV_PCIE_TX_THROUGHPUT   # PCIe transmit throughput
DCGM_FI_DEV_PCIE_RX_THROUGHPUT   # PCIe receive throughput

Alert Configuration Examples:

# Prometheus Alert Rules
groups:
  - name: gpu-alerts
    rules:
      - alert: GPUMemoryAlmostFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'GPU memory usage above 95%'

      - alert: GPUThermalThrottling
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'GPU temperature exceeds 85C'

      - alert: GPUXIDError
        expr: DCGM_FI_DEV_XID_ERRORS > 0 # gauge holding the most recent XID code (0 = none), so increase() does not apply
        labels:
          severity: critical
        annotations:
          summary: 'GPU XID error detected'

7. AI Workload Optimization

7-1. Training Optimization

Mixed Precision Training

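# PyTorch Automatic Mixed Precision (AMP)
# (torch.amp API, PyTorch 2.3+; older releases expose the same classes under torch.cuda.amp)
import torch
from torch.amp import autocast, GradScaler

model = MyModel().cuda()
criterion = torch.nn.CrossEntropyLoss()  # example loss; use the one your task needs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler("cuda")

for data, target in dataloader:
    optimizer.zero_grad()

    # Forward pass: eligible ops run in FP16
    with autocast(device_type="cuda", dtype=torch.float16):
        output = model(data.cuda())
        loss = criterion(output, target.cuda())

    # Loss scaling + backward pass (guards against FP16 gradient underflow)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()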

Precision Comparison:

Precision	Bits	Memory Savings	Tensor Core Support	Use Case
FP32	32	Baseline	No (runs on CUDA cores)	Default training
TF32	19	-	Yes (A100+)	FP32 matmul/conv on Tensor Cores (default depends on the framework)
FP16	16	2x	Yes	Mixed Precision
BF16	16	2x	Yes (A100+)	LLM training (wider range)
FP8 (E4M3)	8	4x	Yes (H100+)	Transformer Engine
INT8	8	4x	Yes	Inference quantization

DeepSpeed ZeRO

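# DeepSpeed ZeRO Stage 3 Configuration
# ds_config.json ("auto" values are filled in by the Hugging Face Trainer integration; set explicit numbers when using DeepSpeed directly)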
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9
    },
    "bf16": {
        "enabled": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

ZeRO Memory Partitioning:

ZeRO Stage 0 (Default):
GPU0: [Model] + [Gradient] + [Optimizer State]
GPU1: [Model] + [Gradient] + [Optimizer State]
-> Full replication on all GPUs

ZeRO Stage 1 (Optimizer partitioning):
GPU0: [Model] + [Gradient] + [Optimizer 1/2]
GPU1: [Model] + [Gradient] + [Optimizer 2/2]
-> Up to ~4x memory savings as the GPU count grows (mixed-precision Adam)

ZeRO Stage 2 (+ Gradient partitioning):
GPU0: [Model] + [Gradient 1/2] + [Optimizer 1/2]
GPU1: [Model] + [Gradient 2/2] + [Optimizer 2/2]
-> Up to ~8x memory savings as the GPU count grows

ZeRO Stage 3 (+ Model partitioning):
GPU0: [Model 1/2] + [Gradient 1/2] + [Optimizer 1/2]
GPU1: [Model 2/2] + [Gradient 2/2] + [Optimizer 2/2]
-> ~Nx memory savings (N = number of GPUs)
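
The per-stage savings can be computed with the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and K = 12 for optimizer state (FP32 weights, momentum, variance). The sketch below covers model state only; activations and temporary buffers are excluded. The function name is illustrative.

```python
def zero_memory_per_gpu(num_params, n_gpus, stage, k=12):
    """Bytes of model state per GPU under ZeRO stage 0-3 (mixed-precision Adam)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params  # FP16 weights
    grads = 2 * num_params   # FP16 gradients
    optim = k * num_params   # FP32 master weights + Adam moments
    if stage >= 1:
        optim /= n_gpus      # partition optimizer state
    if stage >= 2:
        grads /= n_gpus      # also partition gradients
    if stage >= 3:
        params /= n_gpus     # also partition parameters
    return params + grads + optim


# 7.5B-parameter model on 64 GPUs
for stage in range(4):
    gigabytes = zero_memory_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gigabytes:.1f} GB per GPU")
```

With only 2 GPUs, as in the diagrams above, the savings are much smaller (roughly 1.6x, 1.8x, and 2x); the large factors appear only as the GPU count grows.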

Parallelism Strategy Comparison

Data Parallelism:
Split input data into N parts across N GPUs with identical models
-> AllReduce for gradient synchronization
-> Communication volume: O(model_size)

Tensor Parallelism:
Split a single layer (matrix) into N parts across N GPUs
-> Communication needed at each layer in Forward/Backward
-> Requires high-speed inter-GPU communication (NVLink)

Pipeline Parallelism:
Place model layers sequentially across N GPUs
-> Process micro-batches in pipeline fashion
-> Minimizing bubbles (idle time) is key

3D Parallelism (LLM Training):
Data Parallel x Tensor Parallel x Pipeline Parallel
Example: 256 GPUs = 32 DP x 4 TP x 2 PP
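
To make the 3D layout concrete, the sketch below maps a global rank to its data, pipeline, and tensor coordinates. The ordering (tensor-parallel index varies fastest, so each TP group sits on adjacent ranks and can share NVLink) is an assumption for illustration; frameworks such as Megatron-LM and DeepSpeed define their own orderings. Function names are illustrative.

```python
def validate_layout(world_size, dp, tp, pp):
    """The three parallel degrees must multiply to the total GPU count."""
    return dp * tp * pp == world_size


def rank_to_coords(rank, dp, tp, pp):
    """Map a global rank to (dp, pp, tp) coordinates, TP index varying fastest."""
    if not 0 <= rank < dp * tp * pp:
        raise ValueError("rank out of range")
    return {
        "tp": rank % tp,
        "pp": (rank // tp) % pp,
        "dp": rank // (tp * pp),
    }


def tensor_parallel_group(rank, tp):
    """Global ranks that share a tensor-parallel group with `rank`."""
    start = rank - rank % tp
    return list(range(start, start + tp))


print(validate_layout(256, dp=32, tp=4, pp=2))
print(rank_to_coords(5, dp=32, tp=4, pp=2))
print(tensor_parallel_group(5, tp=4))
```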

7-2. Inference Optimization

TensorRT Optimization

# Model optimization via TensorRT (Python API)
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse ONNX model
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

# Build configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB
config.set_flag(trt.BuilderFlag.FP16)  # Allow FP16 kernels (reduced precision, not quantization)

# Build engine
engine = builder.build_serialized_network(network, config)

Triton Inference Server

Triton Architecture:
Client -> HTTP/gRPC -> Triton Server
                       ├── Model Repository
                       │   ├── model_a/ (TensorRT)
                       │   ├── model_b/ (ONNX Runtime)
                       │   └── model_c/ (Python Backend)
                       ├── Scheduler
                       │   ├── Dynamic Batching
                       │   └── Sequence Batching
                       ├── Model Ensemble
                       │   └── Preprocessing -> Model -> Postprocessing pipeline
                       └── Metrics (Prometheus)

# Triton Model Configuration (config.pbtxt)
name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 64

input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP16
    dims: [ 1000 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

vLLM: LLM Inference Optimization

from vllm import LLM, SamplingParams

# Create the engine for offline batch inference (this does not start an HTTP server)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,      # 4 GPU Tensor Parallel
    gpu_memory_utilization=0.9,  # Use 90% GPU memory
    max_model_len=8192,
    dtype="bfloat16",
)

# Key optimization techniques:
# 1. PagedAttention: Manages KV Cache in pages (memory efficiency)
# 2. Continuous Batching: Dynamically adds/removes requests from batches
# 3. Prefix Caching: Reuses KV Cache for common prefixes
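
KV Cache size is what PagedAttention manages, and it can be estimated by hand. The sketch below assumes Llama-3-70B's published shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and a 16-token block size; treat these as assumptions to verify against the model's config and the vLLM version in use. Function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV Cache per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def blocks_needed(num_tokens, block_size=16):
    """Number of fixed-size cache blocks a sequence occupies (ceiling division)."""
    return -(-num_tokens // block_size)


def max_cached_tokens(free_bytes, bytes_per_token, block_size=16):
    """Tokens that fit in the memory left for the cache, in whole blocks."""
    return (free_bytes // (bytes_per_token * block_size)) * block_size


# Assumed model shape for Llama-3-70B with BF16 cache entries
LLAMA3_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**LLAMA3_70B)
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_token * 8192 / 2**30:.1f} GiB for one 8192-token sequence")
print(f"{blocks_needed(100)} blocks for a 100-token sequence")
print(f"{max_cached_tokens(10 * 2**30, per_token)} tokens fit in 10 GiB")
```

With `tensor_parallel_size=4`, the KV heads are split across the four GPUs, so each GPU holds roughly a quarter of these totals.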

7-3. Performance Bottleneck Analysis Patterns

[Performance Diagnosis Flowchart]

1. Check nvidia-smi
   ├── GPU Util < 30%
   │   ├── Likely CPU/IO bottleneck
   │   │   ├── Check top/htop -> CPU at 100%? -> Optimize data loading
   │   │   └── Check iostat -> Disk I/O? -> Use NVMe/GDS
   │   └── Kernels too small -> CUDA Graphs, increase batch
   ├── GPU Util > 90%, performance still low
   │   ├── Possibly Memory-Bound
   │   │   ├── Nsight Compute -> Check Memory Throughput
   │   │   └── Check Memory Coalescing patterns
   │   └── Possibly Warp Divergence
   │       └── Nsight Compute -> Check Warp Stall Reasons
   └── GPU Util irregular (fluctuating)
       ├── Synchronization bottleneck -> Optimize async execution
       └── Communication bottleneck (distributed) -> NCCL profiling

2. Distributed Training Bottleneck
   ├── Check NCCL AllReduce time
   │   ├── Check NCCL regions in Nsight Systems
   │   └── Analyze communication/compute ratio
   ├── Check InfiniBand bandwidth
   │   └── ib_write_bw benchmark
   └── Check GPU topology
       └── nvidia-smi topo -m

8. Linux System Troubleshooting

8-1. GPU-Related Linux Commands

# GPU device information
lspci -vv -s $(lspci | grep NVIDIA | head -1 | awk '{print $1}')

# GPU driver version
cat /proc/driver/nvidia/version

# CUDA version
nvcc --version

# GPU memory usage detail
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv

# GPU memory per process
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv

# PCIe bandwidth check
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv

# GPU clock information
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory,clocks.max.graphics,clocks.max.memory --format=csv

# GPU-related messages in dmesg
dmesg | grep -i -E "nvidia|nvrm|gpu|xid"

# InfiniBand status check
ibstat
ibstatus
ibv_devinfo

# RDMA device check
rdma link show
rdma resource show

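# NIC status (ethtool takes a network interface name from `ip link`, not a driver name)
ethtool -i <interface>
# Link/module details (requires NVIDIA MFT; run `mst start` first, device path varies by adapter)
mlxlink -d /dev/mst/mt4125_pciconf0 -m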

# Kernel module status
lsmod | grep nvidia
lsmod | grep mlx
lsmod | grep vfio

# NUMA topology (GPU-CPU affinity)
numactl --hardware
lstopo --of ascii
nvidia-smi topo -m

8-2. Common GPU Issues and Solutions

XID Error Interpretation

XID Errors are error codes reported by the NVIDIA GPU driver. They appear in dmesg.

XID Code	Meaning	Severity	Response
XID 13	Graphics Engine Exception	High	Possible CUDA kernel bug, update driver
XID 31	GPU Memory Page Fault	High	Memory access error, check code
XID 43	GPU stopped processing	High	GPU hang, reset needed
XID 45	Preemptive cleanup	Medium	Usually follows another error or a killed process; check the preceding XIDs
XID 48	Double Bit ECC Error	Critical	Hardware defect, RMA
XID 63	ECC page retirement	Medium	Page retired, RMA if accumulated
XID 64	ECC page retirement or row remapper recording failure	High	Retirement/remap could not be recorded; reset the GPU and consider RMA
XID 79	GPU has fallen off the bus	Critical	PCIe disconnection, check hardware
XID 94	Contained ECC error	Medium	Uncorrectable error contained to the affected process; restart that application
XID 95	Uncontained ECC error	Critical	Error may affect other processes; drain the node, GPU reset needed

# Monitor XID errors
dmesg -w | grep "NVRM: Xid"

# Example output:
# NVRM: Xid (PCI:0000:41:00): 79, pid=0, GPU has fallen off the bus
# NVRM: Xid (PCI:0000:41:00): 48, pid=12345, DBE (double bit error)
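
For alerting, the log lines above can be parsed with a short script. The sketch below assumes the line format shown in the example output; the regular expression, the function name, and the choice of which codes count as critical (taken from the Severity column of the table) are illustrative assumptions.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:41:00): 79, pid=0, GPU has fallen off the bus"
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")

# Codes marked Critical in the table above.
CRITICAL_XIDS = {48, 79, 95}


def parse_xid(line):
    """Return (pci_bus_id, xid_code) for an XID log line, or None otherwise."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group("bus"), int(match.group("xid"))


log = [
    "NVRM: Xid (PCI:0000:41:00): 79, pid=0, GPU has fallen off the bus",
    "NVRM: Xid (PCI:0000:41:00): 48, pid=12345, DBE (double bit error)",
    "usb 1-1: new high-speed USB device",
]
for line in log:
    parsed = parse_xid(line)
    if parsed and parsed[1] in CRITICAL_XIDS:
        print(f"CRITICAL XID {parsed[1]} on GPU {parsed[0]}")
```

Because `search` is used rather than `match`, the kernel timestamp prefix that real `dmesg` output carries does not need special handling.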

GPU Reset / Driver Reload

# Attempt GPU reset on hang (requires root; no process may be using the GPU)
sudo nvidia-smi --gpu-reset -i 0

# Driver reload (all GPU processes must be terminated)
# 1. Check GPU-using processes
fuser -v /dev/nvidia*

# 2. Kill processes
kill -9 <pid>

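# 3. Stop the persistence daemon (it keeps the device open), then unload/load the driver
sudo systemctl stop nvidia-persistenced
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
sudo modprobe nvidia_uvm
sudo systemctl start nvidia-persistenced

# If the modules still cannot be unloaded, reboot the node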

CUDA OOM Debugging

# Debugging OOM in PyTorch

# Check memory usage
import torch
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Max Allocated: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# Memory snapshot (detailed analysis)
torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run training code ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
# Visualize at https://pytorch.org/memory_viz

OOM Mitigation:

Reduce batch size
Use Gradient Accumulation
Enable Mixed Precision (FP16/BF16)
Gradient Checkpointing (Activation Recomputation)
Apply DeepSpeed ZeRO Stage 2/3
Offload model parameters to CPU/NVMe

ECC Errors and RMA Procedure

# Check ECC errors
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv

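# Check Retired Pages (older GPUs; A100/H100 use row remapping instead)
nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.correctable,remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure --format=csv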

# RMA Criteria:
# - Recurring Uncorrected (Double-bit) ECC errors
# - Retired Pages exceeding threshold (typically 60+ pages)
# - Multiple XID 48 occurrences
# - GPU fell off bus (XID 79)

Thermal Throttling Response

# Temperature monitoring
nvidia-smi --query-gpu=temperature.gpu,temperature.memory --format=csv -l 1

# Check/set power limit
nvidia-smi --query-gpu=power.limit,power.default_limit,power.max_limit --format=csv
sudo nvidia-smi -pl 300  # Set power limit to 300W

# Check clock speeds (decrease during throttling)
nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics --format=csv

# Check throttling reason
nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv
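
The throttle-reason query returns a hexadecimal bitmask, which is easier to read once decoded. The sketch below assumes the bit assignments of NVML's `nvmlClocksThrottleReason*` constants as I understand them; verify them against `nvml.h` for your driver version. The short reason names and the function name are hypothetical.

```python
# Bit values assumed from NVML's nvmlClocksThrottleReason* constants.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask_hex):
    """Turn a bitmask such as '0x0000000000000044' into a list of reason names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


print(decode_throttle_reasons("0x0000000000000044"))
```

A mask that includes a thermal slowdown bit points at cooling, whereas `sw_power_cap` alone usually means the power limit is doing its job.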

Thermal Throttling Prevention:

Verify server room cooling capacity (400-1000W heat per GPU)
Optimize airflow (Hot/Cold Aisle separation)
Consider liquid cooling (liquid-cooled H100 systems are available from server vendors)
Set power limits (performance vs temperature tradeoff)

9. Interview Questions: Top 30

GPU Architecture and CUDA (10 Questions)

Q1. Explain the internal structure of an SM (Streaming Multiprocessor) and describe how Warp Divergence affects performance.

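Key Answer Points: An SM consists of CUDA Cores, Tensor Cores, Warp Schedulers, a Register File, and Shared Memory/L1 Cache. A Warp (32 threads) executes the same instruction under the SIMT model; when threads in a Warp take different sides of an if/else, both paths execute one after the other with the inactive threads masked off. A two-way branch can therefore cost up to 2x, and code with more divergent paths (up to 32 per Warp) degrades further.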

Q2. Explain the GPU memory hierarchy from Register to Global Memory, including latency and optimization strategy for each level.

Key Answer Points: Register (about 1 cycle) to Shared Memory (tens of cycles) to L1/L2 Cache to Global Memory (HBM, several hundred cycles); exact latencies are approximate and vary by architecture. Use Shared Memory tiling to reduce Global Memory accesses; maximize bandwidth utilization through Memory Coalescing.

Q3. What is Memory Coalescing and why is it important?

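Key Answer Points: When the 32 threads in a Warp access contiguous 4-byte elements, the hardware merges the accesses into a single 128-byte request. With non-contiguous (strided) access, each thread can land in a different memory segment, so the Warp issues up to 32 separate transactions and uses only a small fraction of the bytes it fetches.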
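
The effect of stride can be made concrete by counting how many memory segments one Warp touches. This sketch assumes 4-byte elements and a 32-byte transaction granularity (the sector size on recent NVIDIA GPUs); both are parameters, and the function name is hypothetical.

```python
def sectors_touched(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Count distinct memory sectors one warp touches for a strided access."""
    addresses = (t * stride_elems * elem_bytes for t in range(warp_size))
    return len({addr // sector_bytes for addr in addresses})


useful_bytes = 32 * 4  # what the warp actually needs
for stride in (1, 2, 8, 32):
    n = sectors_touched(stride)
    print(f"stride {stride:2d}: {n:2d} sectors, "
          f"efficiency {useful_bytes / (n * 32):.1%}")
```

Under these assumptions a unit stride needs 4 sectors, while any stride of 8 elements or more needs 32, so most of the fetched bytes are discarded.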

Q4. Explain the Roofline Model and how to determine whether a given kernel is Memory-Bound or Compute-Bound.

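Key Answer Points: Calculate Arithmetic Intensity (FLOPs/Byte) and compare it with the hardware's Balance Point (Peak FLOPS / Peak BW). For H100 SXM, using the dense FP16/BF16 Tensor Core peak (about 989 TFLOPS) and 3.35 TB/s of HBM3 bandwidth, the balance point is about 295 FLOPs/Byte; kernels below that are Memory-Bound, kernels above it are Compute-Bound.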
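
The Roofline comparison is simple arithmetic, sketched below. The peak figures are assumptions for an H100 SXM (dense FP16/BF16 Tensor Core throughput and HBM3 bandwidth); substitute the peaks for the precision and GPU you are analyzing. Function names are hypothetical.

```python
def balance_point(peak_flops, peak_bw):
    """FLOPs per byte at which compute and memory limits meet."""
    return peak_flops / peak_bw


def classify_kernel(flops, bytes_moved, peak_flops, peak_bw):
    """Compare a kernel's arithmetic intensity with the balance point."""
    intensity = flops / bytes_moved
    if intensity < balance_point(peak_flops, peak_bw):
        return "memory-bound"
    return "compute-bound"


# Assumed H100 SXM peaks: ~989 TFLOPS dense FP16/BF16, 3.35 TB/s HBM3.
H100_FLOPS = 989e12
H100_BW = 3.35e12

print(f"balance point: {balance_point(H100_FLOPS, H100_BW):.0f} FLOPs/Byte")

# FP32 vector add: 1 FLOP per element, 12 bytes moved (2 loads + 1 store).
print(classify_kernel(1, 12, H100_FLOPS, H100_BW))

# FP16 GEMM with N=8192: 2*N^3 FLOPs, 3 matrices of N^2 2-byte elements.
n = 8192
print(classify_kernel(2 * n**3, 3 * n * n * 2, H100_FLOPS, H100_BW))
```

The GEMM estimate assumes each matrix crosses HBM exactly once, which is the ideal case that tiling tries to approach.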

Q5. Explain the key differences between A100 and H100, and how H100's Transformer Engine impacts training performance.

Key Answer Points: H100 offers 4th-gen Tensor Cores (FP8), NVLink 4.0 (900GB/s), HBM3 (3.35TB/s), and the Transformer Engine. The Transformer Engine dynamically chooses between FP8 and FP16 per layer, giving up to about 2x the Tensor Core throughput of FP16.

Q6. Explain the CUDA Grid-Block-Thread hierarchy and the criteria for choosing Block size.

Key Answer Points: Grid is a collection of Blocks; Blocks contain Threads and execute on the same SM. Block size should be a multiple of 32 (Warp size), maximize SM Occupancy, and account for Shared Memory and Register usage. 128 or 256 is typically a good starting point.

Q7. Explain the difference between NVIDIA Nsight Systems and Nsight Compute, and their respective use scenarios.

Key Answer Points: Nsight Systems shows system-level timeline (CPU-GPU interactions, kernel launches, memory transfers). Nsight Compute provides detailed analysis of individual kernels including SM Occupancy, Memory Throughput, and Warp Stall Reasons.

Q8. What are Shared Memory Bank Conflicts and how do you avoid them?

Key Answer Points: Shared Memory has 32 banks; when threads in the same Warp access different addresses that map to the same bank, the accesses are serialized (reads of the same address are broadcast and do not conflict). Avoid conflicts by adding padding (array width of 33 instead of 32) or redesigning access patterns.
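
Why a width of 33 helps can be checked by computing which bank each thread hits during a column access. The sketch assumes 4-byte elements with consecutive words mapped to consecutive banks, and thread `t` reading element `[t][column]`; the function name is hypothetical.

```python
from collections import Counter


def conflict_degree(row_width, column=0, warp_size=32, num_banks=32):
    """Largest number of threads in one warp that hit the same bank
    when thread t reads element [t][column] of a 4-byte 2D array."""
    banks = Counter((t * row_width + column) % num_banks
                    for t in range(warp_size))
    return max(banks.values())


print("width 32:", conflict_degree(32), "-way conflict")
print("width 33:", conflict_degree(33), "-way (conflict-free)")
```

With width 32 every thread maps to the same bank, so the access is serialized 32 times; the one-word padding shifts each row by one bank and spreads the threads across all 32.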

Q9. Explain methods to reduce Host-Device memory transfer overhead in CUDA.

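Key Answer Points: Pinned Memory (cudaMallocHost), async transfers (cudaMemcpyAsync + CUDA Streams), Zero-copy Memory (mapped pinned host memory via cudaHostAlloc with cudaHostAllocMapped), Unified Memory (cudaMallocManaged) with prefetching, overlapping transfers with computation, and batching many small copies into fewer large ones.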

Q10. If GPU utilization is only 30%, what debugging steps would you follow?

Key Answer Points: (1) nvidia-smi for basic memory/utilization check -> (2) Nsight Systems for CPU vs GPU time ratio -> (3) Check data loading bottleneck (num_workers, prefetch) -> (4) Check kernel size (increase batch) -> (5) Check synchronization bottleneck (CUDA Graphs) -> (6) Check PCIe bottleneck.

Virtualization and Networking (10 Questions)

Q11. Compare PCI Passthrough, vGPU, and MIG, describing appropriate use scenarios for each.

Key Answer Points: PCI Passthrough = 1:1 assignment (max performance), vGPU = time-slicing (flexibility), MIG = physical partition (isolation + predictability). Large training: Passthrough; multi-tenant inference: MIG; mixed VDI: vGPU.

Q12. Explain the role of IOMMU and its importance in GPU virtualization.

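Key Answer Points: IOMMU (Intel VT-d / AMD-Vi) translates the addresses used in device DMA requests (I/O virtual or guest-physical addresses) to host physical addresses, restricting each device to the memory of the VM it is assigned to. Without it, a passed-through GPU could DMA into other VMs' or the host's memory, creating security vulnerabilities.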

Q13. Explain MIG profile configuration. When maximally partitioning an A100-80GB, what are each instance's specifications?

Key Answer Points: Up to 7 1g.10gb instances. Each has about 14 SMs, 10GB HBM2e, independent L2 Cache, separate memory controller. Must create GPU Instance (GI) first, then Compute Instance (CI) within it for CUDA use.

Q14. Explain the differences between InfiniBand and Ethernet, and why InfiniBand is important for distributed training.

Key Answer Points: InfiniBand natively supports RDMA with 0.5us latency and CPU-bypass transfer. NDR provides 400Gbps bandwidth. Distributed training's AllReduce communication exchanges hundreds of GBs, so bandwidth and latency directly impact performance.

Q15. Explain how RDMA's Zero-copy transfer works.

Key Answer Points: Application registers memory via ibverbs API, allowing NIC to directly DMA access that memory. Data transfers directly from user-space memory to NIC without kernel buffer intermediary.

Q16. What advantages does GPU Direct RDMA provide for distributed training?

Key Answer Points: Transfers directly between GPU memory and the NIC, eliminating the Host Memory bounce buffer. This removes one copy and one pass through host memory per transfer, which lowers latency and frees host memory bandwidth and CPU cycles.

Q17. Explain the differences between RoCE v2 and InfiniBand, and why PFC configuration is critical in RoCE v2 environments.

Key Answer Points: RoCE v2 runs RDMA over UDP/IP, leveraging existing Ethernet infrastructure. However, Ethernet is lossy by design, so PFC (Priority Flow Control) is needed to pause transmission during congestion to maintain RDMA's lossless requirement.

Q18. Explain the role of NCCL and how it works in distributed training.

Key Answer Points: NCCL optimizes collective communication (AllReduce, Broadcast, AllGather) across multiple GPUs. It auto-detects NVLink, NVSwitch, InfiniBand for optimal communication paths, using Ring-AllReduce or Tree-AllReduce algorithms.
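
The Ring-AllReduce pattern mentioned above can be simulated in a few lines. This is a toy model of the algorithm, not NCCL's implementation: each rank holds a vector with exactly one element per rank (one element standing in for one chunk), and "sending" is a list update.

```python
def ring_allreduce(rank_data):
    """Sum-reduce vectors across ranks with a ring; returns every rank's result.
    rank_data[r] must have exactly len(rank_data) elements (one per chunk)."""
    n = len(rank_data)
    bufs = [list(v) for v in rank_data]

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        values = [bufs[r][c] for r, c in sends]  # snapshot: sends are simultaneous
        for (r, c), v in zip(sends, values):
            bufs[(r + 1) % n][c] += v

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        values = [bufs[r][c] for r, c in sends]
        for (r, c), v in zip(sends, values):
            bufs[(r + 1) % n][c] = v
    return bufs


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each rank sends 2(n-1) chunks of size S/n, so the per-rank traffic approaches 2S regardless of the number of ranks, which is why the ring is bandwidth-efficient for large messages.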

Q19. What is SR-IOV and how is it used in GPU clusters?

Key Answer Points: SR-IOV partitions physical NICs into multiple VFs (Virtual Functions) for direct VM assignment. In GPU clusters, InfiniBand/RoCE NICs are partitioned via SR-IOV in VM environments to maintain GPU Direct RDMA performance.

Q20. Compare two methods for assigning GPUs to VMs in KubeVirt.

Key Answer Points: (1) PCI Passthrough: hostDevices for direct physical GPU assignment, max performance, 1:1 mapping. (2) vGPU: mediated devices for virtual GPU assignment, GPU sharing possible but with overhead. MIG + KubeVirt combination is also possible.

K8s and Performance Optimization (10 Questions)

Q21. Explain the components of NVIDIA GPU Operator and each component's role.

Key Answer Points: Driver (kernel module), Container Toolkit (runtime integration), Device Plugin (K8s resource registration), DCGM Exporter (metrics), MIG Manager (auto MIG configuration), GFD (node labeling), Validator (verification).

Q22. Why is Topology-aware scheduling important for GPU scheduling in K8s?

Key Answer Points: NVLink-connected GPUs communicate 6-10x faster than PCIe-connected GPUs. Random GPU allocation in distributed training forces communication through PCIe instead of NVLink, significantly degrading performance.

Q23. Explain why Gang Scheduling is essential for distributed training.

Key Answer Points: AllReduce communication requires all workers to participate. If only 3 of 4 GPUs are allocated, the 3 sit idle waiting for the 4th. All-or-nothing allocation via schedulers like Volcano/Kueue is necessary.

Q24. Explain the difference between GPU Time-Slicing and MIG from a K8s perspective.

Key Answer Points: Time-Slicing is software time-sharing with no memory or performance isolation, but it works on all GPUs. MIG is physical partitioning with full isolation, but it is only available on MIG-capable data center GPUs such as A100, A30, and H100. In K8s, they are requested as nvidia.com/gpu replicas and nvidia.com/mig-Xg.XXgb respectively.

Q25. Name 5 key metrics to monitor with DCGM Exporter and explain each.

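Key Answer Points: DCGM_FI_DEV_GPU_UTIL (GPU utilization: fraction of time at least one kernel was running, not how many SMs were busy), DCGM_FI_DEV_MEM_COPY_UTIL (memory bandwidth utilization), DCGM_FI_DEV_FB_USED (framebuffer memory used), DCGM_FI_DEV_GPU_TEMP (temperature), DCGM_FI_DEV_XID_ERRORS (last XID error). DCGM_FI_DEV_POWER_USAGE and the ECC SBE/DBE counters are also important.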

Q26. Explain how Mixed Precision Training works and why Loss Scaling is necessary.

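Key Answer Points: The forward and backward passes run in FP16 to use Tensor Cores, and the resulting gradients are applied to FP32 master weights. FP16's narrow dynamic range causes small gradients to underflow to zero; Loss Scaling multiplies the loss by a large factor before the backward pass and divides the gradients by the same factor before the optimizer step. BF16 keeps FP32's exponent range and usually does not need Loss Scaling.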
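
The underflow that Loss Scaling prevents can be reproduced without a GPU by rounding values to IEEE 754 half precision. The sketch uses the standard library's `struct` half-float format; the gradient value and the scale of 2**16 are illustrative assumptions.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8        # a small gradient, below FP16's smallest subnormal (~6e-8)
scale = 65536.0    # assumed loss scale (2**16)

unscaled = to_fp16(grad)                # stored directly in FP16
scaled = to_fp16(grad * scale) / scale  # stored scaled, unscaled in higher precision

print("without scaling:", unscaled)
print("with scaling:   ", scaled)
```

Frameworks typically adjust the scale dynamically, lowering it when scaled gradients overflow to infinity and raising it again after a run of stable steps.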

Q27. Explain DeepSpeed ZeRO's 3 Stages and compare memory savings for each.

Key Answer Points: Stage 1 (Optimizer State partition, up to ~4x), Stage 2 (+Gradient partition, up to ~8x), Stage 3 (+Model Parameter partition, up to ~Nx for N GPUs), measured as the reduction in per-GPU model-state memory with mixed-precision Adam. Stage 3 has the highest communication overhead, requiring high-speed networks like InfiniBand.
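
The memory savings per stage follow from simple accounting. The sketch below assumes mixed-precision Adam with 16 bytes of model state per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state), the accounting used in the ZeRO paper; activations and buffers are ignored, and the function name is hypothetical.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Model-state bytes per GPU for ZeRO stage 0-3 under mixed-precision Adam."""
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus     # Stage 1 partitions optimizer state
    if stage >= 2:
        grads /= num_gpus     # Stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus    # Stage 3 also partitions parameters
    return params + grads + optim


# Example: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:6.1f} GB per GPU")
```

The reductions approach 4x, 8x, and Nx only as the GPU count grows, because the unpartitioned terms stay fixed.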

Q28. Explain how Triton Inference Server's Dynamic Batching improves inference efficiency.

Key Answer Points: Individual requests are queued and batched within a configured maximum wait time for combined processing. GPUs show higher throughput with larger batches, so dynamic batching balances latency and throughput.

Q29. Describe the debugging procedure when XID 79 ("GPU has fallen off the bus") occurs.

Key Answer Points: (1) Check dmesg for surrounding logs -> (2) Check PCIe link status (lspci) -> (3) Attempt GPU reset (nvidia-smi --gpu-reset); this often fails once the GPU is off the bus, in which case reboot or power-cycle the node -> (4) Check physical connection (reseat) -> (5) Test different slot -> (6) RMA if recurring.

Q30. How would you design the GPU software stack for a new 1000 H100 GPU cluster?

Key Answer Points: (1) OS: Ubuntu 22.04 + latest kernel -> (2) Driver: NVIDIA Driver 535+ -> (3) Network: InfiniBand NDR + NCCL -> (4) Container: K8s + GPU Operator + DCGM -> (5) Scheduling: Volcano (Gang) + GFD (Topology-aware) -> (6) Monitoring: DCGM Exporter + Prometheus + Grafana -> (7) Storage: GPU Direct Storage + distributed filesystem -> (8) MIG/vGPU: per-workload partitioning strategy.

10. 10-Month Study Roadmap

Month 1-2: GPU Fundamentals and CUDA Programming

Goal: Understand GPU architecture + develop CUDA programming skills

Week	Topic	Activity
1	GPU Architecture	Read NVIDIA GPU architecture whitepapers (Ampere, Hopper)
2	CUDA Basics	Implement vector addition, matrix multiplication
3	CUDA Optimization	Shared Memory tiling, Memory Coalescing exercises
4	Advanced CUDA	Warp-level primitives, CUDA Streams
5-6	cuBLAS/cuDNN	Library usage, performance comparison
7-8	Profiling	Analyze real kernels with Nsight Systems/Compute

Resources:

NVIDIA CUDA Programming Guide
"Programming Massively Parallel Processors" (David Kirk, Wen-mei Hwu)
NVIDIA DLI (Deep Learning Institute) CUDA courses

Month 3-4: Linux Systems + GPU Drivers

Goal: Kernel/driver level understanding + GPU troubleshooting

Week	Topic	Activity
1-2	Linux Kernel Basics	Memory management, device drivers, PCIe
3-4	GPU Drivers	NVIDIA driver installation/configuration, module structure
5-6	Troubleshooting	XID error analysis, ECC error response, dmesg analysis
7-8	Performance Tools	System analysis using perf, strace, eBPF

Month 5-6: Virtualization (Core!)

Goal: KVM/QEMU + PCI Passthrough + vGPU + MIG hands-on

Week	Topic	Activity
1-2	KVM/QEMU	VM creation, IOMMU setup, basic virtualization
3-4	PCI Passthrough	GPU VFIO binding, GPU assignment to VM
5-6	MIG	MIG profile configuration, performance testing
7-8	vGPU	vGPU license setup, scheduler comparison

Month 7-8: Networking (InfiniBand/RDMA)

Goal: InfiniBand architecture + RDMA programming + NCCL tuning

Week	Topic	Activity
1-2	InfiniBand Basics	Architecture, Subnet Manager, basic commands
3-4	RDMA	ibverbs programming, benchmarks
5-6	GPU Direct	GPU Direct RDMA/Storage hands-on
7-8	NCCL Tuning	Distributed training NCCL benchmarks, env var optimization

Month 9-10: Kubernetes + Integration Project

Goal: K8s GPU management + large-scale cluster operations + portfolio completion

Week	Topic	Activity
1-2	GPU Operator	Installation, configuration, MIG Manager
3-4	Scheduling	Topology-aware, Gang Scheduling (Volcano)
5-6	Monitoring	DCGM + Prometheus + Grafana dashboards
7-8	Integration Project	Complete portfolio projects + interview prep

11. Portfolio Projects (3)

Project 1: CUDA Kernel Optimization (Matrix Multiplication Benchmark)

Goal: Naive CUDA to Shared Memory Tiling to Tensor Core utilization to cuBLAS comparison

Project Structure:
cuda-matmul-benchmark/
├── src/
│   ├── naive_matmul.cu          # Naive implementation
│   ├── tiled_matmul.cu          # Shared Memory tiling
│   ├── wmma_matmul.cu           # Tensor Core (WMMA API)
│   └── cublas_matmul.cu         # cuBLAS wrapper
├── benchmark/
│   ├── run_benchmarks.sh
│   └── plot_results.py          # Result visualization
├── profiles/
│   ├── nsight_systems/
│   └── nsight_compute/
└── README.md

Key Deliverables:

GFLOPS comparison table for each implementation
Nsight Compute profiling results (SM Occupancy, Memory Throughput)
Achievement rate vs cuBLAS (typically Naive: 1-5%, Tiled: 20-40%, Tensor Core: 60-80%)
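
The GFLOPS and achievement-rate figures in these deliverables come from a standard formula: a matrix multiplication of an M x K by a K x N matrix performs 2*M*N*K floating-point operations. A minimal sketch follows; the function names and the timings are made up for illustration.

```python
def matmul_gflops(m, n, k, seconds):
    """Achieved GFLOPS for an (m x k) @ (k x n) matmul that took `seconds`."""
    return 2.0 * m * n * k / seconds / 1e9


def achievement_rate(kernel_gflops, cublas_gflops):
    """Fraction of cuBLAS throughput a custom kernel reaches."""
    return kernel_gflops / cublas_gflops


# Hypothetical timings for a 4096^3 problem (not measured results).
tiled = matmul_gflops(4096, 4096, 4096, seconds=0.040)
cublas = matmul_gflops(4096, 4096, 4096, seconds=0.010)
print(f"tiled: {tiled:.0f} GFLOPS, {achievement_rate(tiled, cublas):.0%} of cuBLAS")
```

Timing should exclude host-device transfers and use the average of several runs after a warm-up iteration, so the comparison reflects kernel throughput only.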

Project 2: MIG + K8s Multi-Tenant GPU Cluster

Goal: Build a GPU sharing cluster using MIG

Project Structure:
mig-k8s-multitenant/
├── infra/
│   ├── gpu-operator-values.yaml
│   ├── mig-config.yaml
│   └── monitoring/
│       ├── dcgm-dashboard.json    # Grafana dashboard
│       └── alert-rules.yaml       # Prometheus alerts
├── workloads/
│   ├── inference-deployment.yaml  # MIG 1g.10gb inference
│   ├── training-job.yaml          # MIG 3g.40gb training
│   └── notebook-statefulset.yaml  # MIG 2g.20gb Jupyter
├── scheduler/
│   ├── gang-scheduling.yaml       # Volcano config
│   └── priority-classes.yaml
└── docs/
    ├── architecture.md
    └── benchmark-results.md

Key Deliverables:

3 workloads running simultaneously on 1 A100-80GB via MIG partitioning
Performance isolation verification (noisy neighbor testing)
Grafana dashboard: per-instance GPU utilization, memory, temperature

Project 3: Distributed Training Performance Profiling (NCCL + InfiniBand)

Goal: Analyze and optimize communication bottlenecks in distributed training

Project Structure:
distributed-training-profiler/
├── benchmarks/
│   ├── nccl_allreduce.sh          # NCCL benchmarks
│   ├── ib_bandwidth.sh            # InfiniBand bandwidth
│   └── multi_node_training.py     # Actual training script
├── profiling/
│   ├── nsight_distributed.sh      # Distributed env profiling
│   └── nccl_debug_analysis.py     # NCCL log analysis
├── optimization/
│   ├── nccl_env_tuning.sh         # NCCL env var optimization
│   └── topology_optimization.py   # GPU-NIC topology optimization
└── results/
    ├── scaling_efficiency.png     # Scaling efficiency graph
    └── communication_breakdown.png # Communication time analysis

Key Deliverables:

2-node, 4-node, 8-node scaling efficiency measurements
NCCL AllReduce time vs compute time ratio analysis
Before/after comparison of env var tuning (NCCL_IB_HCA, NCCL_ALGO, etc.)
Nsight Systems timeline showing communication/compute overlap
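
Scaling efficiency for the first deliverable can be computed as measured throughput divided by ideal linear throughput extrapolated from the smallest run. The sketch below uses made-up throughput numbers; the function name and the choice of the smallest node count as baseline are assumptions.

```python
def scaling_efficiency(throughput_by_nodes):
    """Map node count -> efficiency relative to linear scaling
    from the smallest measured node count."""
    base_nodes = min(throughput_by_nodes)
    per_node = throughput_by_nodes[base_nodes] / base_nodes
    return {n: t / (n * per_node)
            for n, t in sorted(throughput_by_nodes.items())}


# Hypothetical samples/second measurements (not real results).
measured = {2: 2000.0, 4: 3800.0, 8: 7000.0}
for nodes, eff in scaling_efficiency(measured).items():
    print(f"{nodes} nodes: {eff:.1%}")
```

A falling efficiency curve points back to the communication/compute ratio analysis: the gap from 100% is roughly the share of step time spent in communication that is not overlapped with compute.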

12. Quiz

Q1. How many FP32 CUDA Cores are in a single H100 SM, and how many total SMs does the H100 have?

Answer: A single SM has 128 FP32 CUDA Cores, and the H100 (SXM5) has a total of 132 SMs enabled. Therefore, the total CUDA Core count is 128 x 132 = 16,896. For comparison, the A100 has 64 FP32 Cores per SM x 108 SMs = 6,912 total.

Q2. When maximally partitioning an A100-80GB with 1g.10gb MIG profiles, how many instances are created and how many SMs does each have?

Answer: A maximum of 7 1g.10gb instances are created. Each instance has approximately 14 SMs and 10GB HBM2e memory. Each instance has independent L2 Cache and separate memory controllers, so performance is physically isolated. To use CUDA, you must first create a GPU Instance (GI) and then create a Compute Instance (CI) within it.

Q3. What are three key differences between RDMA over InfiniBand and RoCE v2?

Answer: (1) Transport layer: InfiniBand uses its own transport protocol, while RoCE v2 operates over UDP/IP. (2) Flow control: InfiniBand uses link-level credit-based flow control for native lossless behavior, while RoCE v2 is Ethernet-based and requires PFC (Priority Flow Control) configuration for lossless operation. (3) Infrastructure: InfiniBand requires dedicated switches/cables and a Subnet Manager, while RoCE v2 can leverage existing Ethernet switches at lower cost. Both are available at up to 400Gbps per port in current generations (InfiniBand NDR, 400GbE), so the practical difference lies in latency, congestion behavior, and operational complexity rather than raw link speed.

Q4. Explain why Gang Scheduling is necessary in Kubernetes and name at least two schedulers that support it.

Answer: In distributed training, AllReduce communication requires all workers to participate for completion. If only some GPUs are allocated out of the needed total, the allocated ones sit idle waiting for the rest, wasting resources. Gang Scheduling provides all-or-nothing allocation. Schedulers supporting this include Volcano, Kueue (K8s SIG Scheduling), and YuniKorn (Apache). The default K8s scheduler (kube-scheduler) does not support Gang Scheduling.

Q5. When GPU utilization (SM Utilization) exceeds 90% but training speed is slow, name three possible causes and how to diagnose each.

Answer: (1) Memory-Bound: SMs are active but memory bandwidth is saturated. Check Memory Throughput in Nsight Compute; review Shared Memory usage and Memory Coalescing patterns. (2) Warp Divergence: Conditional branches cause sequential execution within Warps. Check Branch Efficiency metric in Nsight Compute. (3) Low SM Occupancy + high compute: Few Warps executing with high arithmetic intensity. Check Active Warps per SM and adjust Block size and Register usage. Additionally, underutilization of Tensor Cores (not using FP16/BF16) can also be a cause.

13. References

Official Documentation

NVIDIA CUDA Programming Guide - NVIDIA Developer
NVIDIA A100 Whitepaper - NVIDIA
NVIDIA H100 Whitepaper - NVIDIA
NVIDIA MIG User Guide - NVIDIA Developer
NVIDIA Virtual GPU Software Documentation - NVIDIA
NVIDIA GPU Operator Documentation - NVIDIA
NVIDIA DCGM Documentation - NVIDIA
NVIDIA NCCL Documentation - NVIDIA

Networking

InfiniBand Architecture Specification - IBTA
RDMA Aware Programming User Manual - NVIDIA Networking (Mellanox)
RoCE v2 Deployment Guide - NVIDIA Networking

Kubernetes

NVIDIA Device Plugin for Kubernetes - GitHub
Volcano: Kubernetes Native Batch System - volcano.sh
Kueue: Kubernetes-native Job Queueing - K8s SIG Scheduling

Training/Inference Optimization

DeepSpeed Documentation - Microsoft
Triton Inference Server Documentation - NVIDIA
vLLM Documentation - vLLM Project
Flash Attention Paper - Tri Dao et al.

Books

"Programming Massively Parallel Processors" - David Kirk, Wen-mei Hwu
"CUDA by Example" - Jason Sanders, Edward Kandrot
"Computer Architecture: A Quantitative Approach" - Hennessy, Patterson
"Understanding the Linux Kernel" - Daniel P. Bovet, Marco Cesati

Community/Blogs

NVIDIA Developer Blog - developer.nvidia.com/blog
NVIDIA GTC Sessions (Free) - nvidia.com/gtc
Horace He's "Making Deep Learning Go Brrrr From First Principles" Blog Post
GPU MODE Community (formerly CUDA MODE) - Discord

GPU Software Engineer 합격 가이드: CUDA 아키텍처부터 vGPU/MIG, InfiniBand, K8s GPU 스케줄링까지 시스템 최적화 완전 정복

1. GPU Software Engineer라는 희소한 커리어

"GPU를 쓰는 사람" vs "GPU가 일하게 만드는 사람"

이 역할의 시장 가치

LG유플러스 GPU기술 TF의 미션

관련 직군 비교

2. JD 라인 바이 라인 해부

담당 업무

지원 자격 분석

필수 기술 분석

3. GPU 아키텍처 심화

3-1. GPU 연산 구조

SM (Streaming Multiprocessor) 아키텍처

아키텍처 세대별 비교

3-2. GPU 메모리 계층 (핵심!)

3-3. CUDA 프로그래밍 기초

Grid, Block, Thread 계층

CUDA 코드 예제: 벡터 덧셈

CUDA 코드 예제: 행렬 곱셈 (Shared Memory 활용)

주요 CUDA 라이브러리

3-4. GPU 프로파일링과 성능 분석

nvidia-smi 상세 활용

NVIDIA Nsight Systems (시스템 레벨 프로파일링)

NVIDIA Nsight Compute (커널 레벨 분석)

DCGM (Data Center GPU Manager)

GPU 활용률이 낮은 이유 분석 패턴

4. GPU 가상화 기술

4-1. 가상화 기초

Type 1 vs Type 2 Hypervisor

IOMMU (Intel VT-d / AMD-Vi)

4-2. PCI Passthrough

4-3. vGPU (NVIDIA Virtual GPU)

vGPU 프로필 유형

vGPU Scheduler

4-4. MIG (Multi-Instance GPU)

MIG 설정 명령어

MIG vs vGPU 비교

K8s에서 MIG 사용: NVIDIA MIG Manager

4-5. SR-IOV (NIC 가상화)

4-6. KubeVirt

5. 고속 네트워크: InfiniBand와 RDMA

5-1. InfiniBand 아키텍처

InfiniBand vs Ethernet 비교

InfiniBand 세대

InfiniBand 네트워크 구성요소

5-2. RDMA (Remote Direct Memory Access)

RDMA 전송 유형

RDMA 프로그래밍 기초

5-3. GPU Direct

GPU Direct RDMA

GPU Direct Storage (GDS)

NCCL + InfiniBand 조합

5-4. 네트워크 성능 튜닝

InfiniBand 벤치마크

PFC (Priority Flow Control) 설정

ECMP (Equal-Cost Multi-Path) 라우팅

6. Kubernetes GPU 관리

6-1. NVIDIA GPU Operator

6-2. GPU Device Plugin

Time-Slicing 설정 (GPU 공유)

6-3. GPU Scheduling

기본 스케줄링

Topology-Aware Scheduling

Gang Scheduling

Bin-packing vs Spread 전략

GPU Feature Discovery (GFD)

6-4. GPU 모니터링 on K8s

DCGM Exporter + Prometheus + Grafana

7. AI 워크로드 최적화

7-1. 학습 최적화

Mixed Precision Training

DeepSpeed ZeRO

병렬화 전략 비교

7-2. 추론 최적화

TensorRT 최적화

Triton Inference Server

vLLM: LLM 추론 최적화

7-3. 성능 병목 분석 패턴

8. Linux 시스템 트러블슈팅

8-1. GPU 관련 Linux 명령어