Split View: CUDA GPU 프로그래밍 모델 Deep Dive — SIMT, 메모리 계층, Tensor Core, 커널 최적화 완전 정복 (2025)

CUDA GPU 프로그래밍 모델 Deep Dive — SIMT, 메모리 계층, Tensor Core, 커널 최적화 완전 정복 (2025)

TL;DR

CUDA는 NVIDIA의 GPU 프로그래밍 플랫폼. 2007년 출시 이후 AI/HPC의 사실상 표준. ChatGPT, Stable Diffusion 같은 모든 주요 AI 워크로드가 CUDA 위에서.
SIMT (Single Instruction, Multiple Threads): 프로그래머는 스칼라 코드를 쓰고, 하드웨어가 32 threads = 1 warp로 묶어 병렬 실행. SIMD의 효율 + 스칼라 프로그래밍의 생산성.
스레드 계층: Grid → Block → Warp → Thread. 각 block은 같은 SM(Streaming Multiprocessor)에서 실행. Warp 단위 스케줄링.
메모리 계층: Register(가장 빠름) → Shared Memory(L1 크기) → L2 Cache → Global Memory(느리지만 큼). 각 레벨이 수십~수백 배 차이.
Memory Coalescing: Warp의 32 스레드가 연속 주소를 접근하면 1 transaction. 랜덤 접근은 32 transactions. 32배 대역폭 차이 → CUDA 최적화의 1순위.
Warp Divergence: Warp 내 스레드가 다른 경로를 타면 순차 실행 → 병렬성 손실.
Tensor Core (Volta, 2017): 행렬 곱을 하나의 명령어로. FP16 기준 125 TFLOPS. AI 가속의 열쇠.
현대적 흐름: Triton, cutlass, FlashAttention 같은 고수준 추상화가 핸드-튜닝 CUDA를 대체.
대안: ROCm/HIP (AMD), SYCL/oneAPI (Intel), Metal (Apple). 생태계 격차로 CUDA가 여전히 지배.

1. GPU가 왜 AI를 지배하는가

1.1 CPU vs GPU

항목	CPU	GPU
코어 수	8-128	수천-수만
코어당 성능	매우 빠름	상대적으로 느림
캐시	큰 L1/L2/L3	작은 공유 캐시
병렬성	ILP + MIMD	SIMT (massive)
용도	범용, 분기 많은 로직	수학, 데이터 병렬

CPU: "적은 수의 영리한 코어가 복잡한 일을". GPU: "수많은 단순 코어가 같은 일을 병렬로".

1.2 AI 워크로드의 특성

Transformer 추론 한 번:

MatMul × 수천 번
Element-wise 연산 (GELU, LayerNorm)
Attention (QK^T, softmax, @V)

모두 대규모 행렬 연산. 수십억 개의 MAC (multiply-accumulate) 연산. CPU로는 초당 100-1000 GFLOPS, GPU는 100,000+ TFLOPS (H100 기준).

1000배 성능 차이. "AI 시대의 CPU는 GPU"가 맞는 말.

1.3 NVIDIA의 독점

왜 NVIDIA만? 세 가지 요인:

CUDA 생태계: 2007년부터 쌓인 라이브러리, 문서, 커뮤니티.
하드웨어 투자: Tensor Core, HBM, NVLink, NVSwitch.
Jensen Huang의 장기 베팅: 2012년 AlexNet 이후 AI 올인.

AMD (MI300), Intel (Gaudi), Google (TPU), Apple (M-series), 중국 (Huawei Ascend) 모두 경쟁 중이지만 2025년 기준 CUDA 호환성과 성숙도가 여전히 NVIDIA 독점.

1.4 이 글의 초점

"CUDA를 모르고 AI 시스템을 이해할 수 없다". 이 글은 프로그래밍 모델부터 하드웨어까지 해부한다.

2. GPU 하드웨어 아키텍처

2.1 Streaming Multiprocessor (SM)

GPU의 기본 단위는 SM (Streaming Multiprocessor). 각 SM은 독립된 실행 유닛.

H100:   132 SM
A100:   108 SM
RTX 4090: 128 SM

각 SM 내부:

┌─────────────────────────────────┐
│           SM                     │
│  ┌─────────┐  ┌─────────┐       │
│  │ Warp    │  │ Warp    │ × 4  │
│  │ Sched   │  │ Sched   │       │
│  └─────────┘  └─────────┘       │
│                                  │
│  CUDA Cores (INT32, FP32, FP64) │
│  Tensor Cores                    │
│  Special Function Units          │
│  Load/Store Units                │
│                                  │
│  Register File (수만 개)          │
│  Shared Memory / L1 (수백 KB)    │
└─────────────────────────────────┘

CUDA Cores: 일반 ALU. FP32 연산.
Tensor Cores: 행렬 곱 전용.
Register File: 매우 큼 (수만 개). 스레드가 많아서.
Shared Memory: L1 캐시와 같은 공간. 사용자가 직접 관리.

2.2 Warp

Warp = 32 threads가 lockstep으로 실행되는 그룹. NVIDIA GPU의 근본 단위.

SM은 여러 warp를 동시에 실행:

A100: SM당 최대 64 warps (= 2048 threads) 활성.
H100: 더 많음.

Warp scheduler가 매 cycle 실행 가능한 warp를 선택 → 지연 시간 숨김.

2.3 SIMT — 프로그래밍 모델

SIMT (Single Instruction, Multiple Threads):

프로그래머는 하나의 스레드처럼 코드를 쓴다.
하드웨어가 32 threads를 묶어 같은 명령어를 다른 데이터에 실행.

SIMD와의 차이:

SIMD (e.g., AVX)	SIMT (CUDA)
명시적 vector	스칼라 코드
8×float 한 번에	32 threads가 32×float
분기 어려움	분기 가능 (divergence 대가)
개발 어려움	자연스러움

SIMT는 SIMD의 효율 + 스칼라 프로그래밍의 생산성. CUDA의 킬러 피처.

2.4 GPU 전체 구조

Host (CPU + DRAM)
     │ PCIe / NVLink
GPU
├── L2 Cache (수십 MB, 전체 공유)
├── Global Memory (HBM, 수십-수백 GB)
└── SM × N
    ├── L1 / Shared Memory
    ├── Register File
    └── Cores

Global Memory (= HBM, High Bandwidth Memory):

GPU 전체에서 공유.
크다 (40-192 GB).
느리다 (수백 cycle 지연).
하지만 HBM 대역폭은 엄청나다 (3 TB/s on H100).

L2 Cache: 모든 SM 공유. 수십 MB.

L1 / Shared Memory: SM 안에. 매우 빠름. Shared Memory는 프로그래머가 직접 관리.

Register: 가장 빠름. SM 내부. 스레드별로 자기 공간.

2.5 계층별 접근 시간

대략적 지연 시간 (A100 기준):

Register:       1 cycle
Shared Memory:  ~20 cycles
L1 cache:       ~30 cycles
L2 cache:       ~200 cycles
HBM:            ~500 cycles

PCIe로 host DRAM: ~10,000+ cycles.

데이터 locality가 성능의 전부. 가능한 한 낮은 레벨에 데이터를 유지.

3. CUDA 프로그래밍 모델

3.1 Hello World

#include <stdio.h>

__global__ void hello_kernel() {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Hello from thread %d\n", tid);
}

int main() {
    hello_kernel<<<2, 32>>>();  // 2 blocks × 32 threads = 64 threads
    cudaDeviceSynchronize();
    return 0;
}

__global__: GPU에서 실행되는 함수 (kernel).
<<<grid, block>>>: 실행 구성 — grid 크기 × block 크기.
threadIdx.x: block 내 thread 인덱스.
blockIdx.x: grid 내 block 인덱스.
blockDim.x: block 크기.

3.2 스레드 계층

Grid
├── Block 0
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
├── Block 1
│   └── ...
└── ...

Grid: 여러 block. 커널 실행당 하나. Block: 여러 thread. 같은 SM에서 실행, shared memory 공유. Thread: 기본 단위.

이차원/삼차원도 가능:

dim3 grid(16, 16);
dim3 block(32, 32);
kernel<<<grid, block>>>();
// 총 16×16×32×32 = 262,144 threads

threadIdx.y, blockIdx.z 등으로 접근.

3.3 전역 인덱스 계산

__global__ void add_kernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// 호출
int n = 1000000;
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;  // ceil(n / blockSize)
add_kernel<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

각 thread가 배열의 한 원소 처리.
n이 blockSize의 배수가 아니면 마지막 block에 일부 thread가 범위 초과 → if 체크.

3.4 메모리 관리

// Host (CPU) 메모리
float *h_a = (float*)malloc(n * sizeof(float));

// Device (GPU) 메모리
float *d_a;
cudaMalloc(&d_a, n * sizeof(float));

// Host → Device 복사
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);

// Kernel 실행 (d_a 사용)
kernel<<<grid, block>>>(d_a);

// Device → Host 복사
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

// 해제
cudaFree(d_a);
free(h_a);

명시적으로 host/device 메모리 관리. PCIe 복사가 비쌈 → 최소화.

3.5 Unified Memory

CUDA 6+에서 managed memory:

float *data;
cudaMallocManaged(&data, n * sizeof(float));

// CPU에서 접근
for (int i = 0; i < n; i++) data[i] = i;

// GPU에서 접근
kernel<<<grid, block>>>(data);

cudaDeviceSynchronize();

런타임이 페이지 폴트를 감지해 자동으로 host↔device 이동. 개발 편의성 ↑, 수동 관리보다 약간 느림.

4. 메모리 계층 상세

4.1 Global Memory

가장 큼 (40-192 GB).
GPU 전체 공유.
느림 (수백 cycle).
모든 kernel에서 접근 가능.

__global__ void kernel(float *data) {
    data[threadIdx.x] = threadIdx.x * 2.0;  // global memory 쓰기
}

4.2 Shared Memory

SM 내부 (L1과 동일 물리).
매우 빠름 (수십 cycle).
Block 내 thread들이 공유.
크기 제한 (~48-228 KB per SM).

__global__ void kernel() {
    __shared__ float shared_data[256];
    
    shared_data[threadIdx.x] = threadIdx.x;
    __syncthreads();  // block 내 모든 thread 동기화
    
    // 모든 thread가 shared_data 읽을 수 있음
    float sum = 0;
    for (int i = 0; i < 256; i++) {
        sum += shared_data[i];
    }
}

Use case: 같은 데이터를 여러 thread가 반복 접근 → global에서 shared로 가져와서 빠르게.

4.3 Register

가장 빠름 (1 cycle).
Thread 로컬.
컴파일러가 자동 할당.

__global__ void kernel() {
    float x = 1.0;  // register
    float y = 2.0;  // register
    float z = x + y;
}

각 SM의 register file이 크다 (수만 개). 그래도 한 thread가 너무 많이 쓰면 register spill → global memory로 넘어감 → 느림.

4.4 Constant Memory

__constant__ float coefficients[256];

__global__ void kernel(float *input) {
    float x = input[threadIdx.x];
    float y = 0;
    for (int i = 0; i < 256; i++) {
        y += coefficients[i] * pow(x, i);
    }
}

// Host에서
cudaMemcpyToSymbol(coefficients, host_data, 256 * sizeof(float));

64 KB 크기.
Read-only.
빠른 broadcast (warp의 모든 thread가 같은 값 읽을 때).

4.5 Texture Memory

원래 그래픽 텍스처용이지만 compute에서도 사용:

하드웨어 캐싱.
2D 공간 지역성 최적화.
보간 하드웨어.

이미지 처리에 유용. 현대에서는 __ldg() (Load Global Read-only cache)로 부분 대체.

5. Memory Coalescing — 최우선 최적화

5.1 원리

Warp (32 threads)가 메모리에 접근할 때:

Coalesced: 32 threads가 연속된 128 바이트 접근.

Thread 0 → address 0-3
Thread 1 → address 4-7
...
Thread 31 → address 124-127

→ 1 memory transaction (128 bytes).

Uncoalesced: 32 threads가 흩어진 주소 접근.

Thread 0 → address 0
Thread 1 → address 1024
Thread 2 → address 2048
...

→ 32 memory transactions, 각 128 bytes 중 4 bytes만 사용.

대역폭 차이 32배. 실제 GPU 성능의 성패가 여기서 결정된다.

5.2 예제

BAD:

__global__ void transpose_bad(float *in, float *out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[x * N + y] = in[y * N + x];
    //     ↑ 스트라이드 N
}

Warp의 thread들이 out[0], out[N], out[2N], ...에 접근 → stride 접근 → uncoalesced.

GOOD (shared memory 중간 단계):

__global__ void transpose_good(float *in, float *out, int N) {
    __shared__ float tile[32][33];  // +1 for bank conflict 방지
    
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    
    // Coalesced read from input
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];
    __syncthreads();
    
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    
    // Coalesced write to output
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];
}

Read와 write 모두 coalesced. Shared memory에서 "transpose" 수행.

결과: 20-30배 성능 차이.

5.3 패턴 암기

"Thread i가 array[i]를 접근하라". 이것이 coalescing의 기본.

행렬을 2D 접근할 때:

Row-major: thread tid가 A[blockIdx.y * N + tid] → coalesced.
Column-major: A[tid * N + blockIdx.x] → uncoalesced.

6. Warp Divergence

6.1 문제

Warp 내 32 threads가 다른 경로를 타면?

__global__ void divergent_kernel(int *data) {
    if (threadIdx.x < 16) {
        data[threadIdx.x] = compute_a();
    } else {
        data[threadIdx.x] = compute_b();
    }
}

Hardware는 양쪽 경로 모두 순차 실행한다:

Threads 0-15: compute_a() 실행, 16-31은 idle.
Threads 16-31: compute_b() 실행, 0-15는 idle.

→ 병렬성 절반. Warp divergence의 대가.

6.2 완화 패턴

패턴 1: Warp 단위로 분기

if (blockIdx.x % 2 == 0) { /* path A */ }
else { /* path B */ }

같은 block의 모든 thread가 같은 경로 → warp 전체가 같은 경로 → divergence 없음.

패턴 2: 분기 대신 산술

// Bad
if (x > 0) y = x; else y = -x;

// Good (no branch)
y = fabs(x);

패턴 3: 양쪽 계산 후 선택

float a = compute_a();
float b = compute_b();
float result = (threadIdx.x < 16) ? a : b;

두 결과 모두 계산하지만 분기 없음.

6.3 Independent Thread Scheduling (Volta+)

Volta 아키텍처 (2017)부터 warp 내 thread들이 독립적 스케줄. 더 유연한 divergence 처리 가능.

하지만 성능 관점에선 여전히 divergence 피하는 것이 좋음.

7. Shared Memory Bank Conflict

7.1 Shared Memory 구조

Shared memory는 32개 bank로 나뉘어 있다. 각 bank는 32-bit word 단위.

Bank 0: words 0, 32, 64, ...
Bank 1: words 1, 33, 65, ...
...
Bank 31: words 31, 63, 95, ...

Warp의 32 threads가 서로 다른 bank에 접근 → 1 cycle. 같은 bank에 동시 접근 → bank conflict → 순차 실행.

7.2 예제

__shared__ float data[32][32];

// Row access (bank conflict 없음)
float x = data[threadIdx.x][0];  // thread i → bank i

// Column access (bank conflict!)
float y = data[0][threadIdx.x];  // 모든 thread → bank 0 → 32-way conflict

Column access는 모든 thread가 같은 bank.

해결: 패딩

__shared__ float data[32][33];  // 33 대신 32

Row 크기를 33으로 → column access가 각자 다른 bank에 도착.

7.3 Broadcast

같은 word를 여러 thread가 읽는 것은 OK (broadcast). Hardware가 한 번 읽고 모두에게 분배.

Bank conflict는 같은 bank의 다른 word일 때만 문제.

8. Occupancy

8.1 개념

Occupancy = "SM의 최대 warp 수 대비 현재 활성 warp 수".

A100 SM: 최대 64 warps 동시. 내 커널이 32 warps만 쓴다면 occupancy = 50%.

8.2 왜 높은 occupancy가 좋은가

GPU는 warp switching으로 지연 시간 숨김. 한 warp가 메모리 대기 중이면 다른 warp 실행.

Warps가 많을수록 더 많은 지연 시간을 숨길 수 있다.

8.3 Occupancy 제한 요인

Register per thread: 너무 많이 쓰면 register file 부족.
Shared memory per block: 너무 많이 쓰면 block 수 제한.
Threads per block: 블록 크기가 작으면 warp 적음.

__global__ __launch_bounds__(256, 4)
void kernel() {
    // ...
}

__launch_bounds__로 컴파일러에 힌트. "최대 256 thread/block, 4 blocks/SM 보장해줘".

8.4 Occupancy가 전부 아님

Latency hiding이 이미 충분하면 occupancy 증가가 도움 안 됨. 100% occupancy 추구가 항상 옳은 것 아님. Tensor core 사용 코드는 낮은 occupancy에서도 최고 성능.

NVIDIA Nsight Compute로 측정.

9. Tensor Core — AI 가속의 열쇠

9.1 등장 (Volta, 2017)

기존 GPU는 FP32 MAC이 기본 연산. AI 워크로드는 대부분 **행렬 곱 (MatMul)**인데, GEMM을 FP32로 돌리면 이론 TFLOPS의 50% 정도만.

NVIDIA의 답: Tensor Core — 행렬 곱 전용 유닛.

9.2 구조

V100 Tensor Core: 4×4 × 4×4 행렬 곱을 한 cycle에.

D = A × B + C

A: 4×4 FP16
B: 4×4 FP16
C: 4×4 FP32
D: 4×4 FP32

16 MAC per cycle (vs CUDA core의 1 MAC).

결과: V100 FP16 = 125 TFLOPS (vs FP32의 15 TFLOPS).

9.3 진화

V100 (Volta, 2017): FP16 → FP32 matmul.
A100 (Ampere, 2020): TF32, BF16, INT8, 2:4 sparsity. 312 TFLOPS (BF16).
H100 (Hopper, 2022): FP8, Transformer Engine. 2000 TFLOPS (FP8).
B100/B200 (Blackwell, 2024): FP4, 20,000+ TFLOPS.

세대마다 새 정밀도 지원. Tensor core가 가속기의 진짜 주역.

9.4 사용 방법

직접 사용: wmma API

#include <mma.h>
using namespace nvcuda;

__global__ void matmul_wmma(half *a, half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

복잡하다. 직접 쓰는 사람 드물다.

일반적 접근:

cuBLAS: cublasGemmEx()이 Tensor Core 자동 사용.
cuDNN: Convolution/RNN 등 자동.
PyTorch / TensorFlow: 프레임워크가 내부적으로 cuBLAS/cuDNN 호출.
Triton / cutlass: 고수준 라이브러리.

대부분 AI 개발자는 Tensor Core 사용을 전혀 모른 채 효과를 누린다.

9.5 Mixed Precision

Tensor Core는 낮은 정밀도 입력 + 높은 정밀도 누산을 쓴다.

A, B: FP16 (2 bytes)
C, D: FP32 (4 bytes)

이점:

메모리: 절반.
대역폭: 절반.
속도: Tensor Core 사용.
정확도: 누산이 FP32라 수치 안전.

PyTorch의 torch.cuda.amp.autocast()가 이를 자동화. "AMP (Automatic Mixed Precision)".

10. Streams와 비동기 실행

10.1 Stream

CUDA stream은 순서 있는 연산의 큐. 다른 stream은 병렬 가능.

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block, 0, stream2>>>(...);
// kernel1과 kernel2 병렬 가능

10.2 겹침 최적화

일반적 워크플로:

Host → GPU (copy)  →  Kernel  →  GPU → Host (copy)
      [wait]           [wait]       [wait]

Stream으로 겹치기:

Stream 1: copy1  kernel1  copy_back1
Stream 2:         copy2   kernel2   copy_back2
Stream 3:                   copy3  kernel3  copy_back3

각 단계가 병렬. 시간 절약.

10.3 Pinned Memory

비동기 복사는 pinned host memory 필요:

float *h_data;
cudaMallocHost(&h_data, n * sizeof(float));  // pinned
// ...
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);

malloc으로 할당된 메모리는 pageable이라 복사 중 OS가 이동 가능 → 비동기 복사 불가.

10.4 이벤트

cudaEvent_t event;
cudaEventCreate(&event);

kernel1<<<grid, block, 0, stream1>>>(...);
cudaEventRecord(event, stream1);
cudaStreamWaitEvent(stream2, event);  // stream2는 event 대기
kernel2<<<grid, block, 0, stream2>>>(...);

Stream 간 동기화.

11. CUDA Graph — Launch Overhead 제거

11.1 문제

매 커널 호출마다 launch overhead 존재 (수십 μs). 작은 커널이 많으면 이 오버헤드가 지배.

예: Transformer 추론 = 수백 개의 작은 커널 호출 → overhead만 수 ms.

11.2 Graph

CUDA Graph는 "커널 호출 시퀀스를 미리 기록"해서 한 번에 submit.

cudaGraph_t graph;
cudaGraphExec_t graphExec;

// 기록 시작
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel1<<<..., stream>>>(...);
kernel2<<<..., stream>>>(...);
kernel3<<<..., stream>>>(...);
cudaStreamEndCapture(stream, &graph);

// Graph 인스턴스화
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);

// 여러 번 실행 (빠름!)
for (int i = 0; i < N; i++) {
    cudaGraphLaunch(graphExec, stream);
}

Launch overhead가 하나의 graph launch로 감축. 수백 μs → 수 μs.

11.3 실전

PyTorch 2.0+의 CUDAGraphs가 자동. Transformer 추론 단일 토큰 생성에서 2-3배 속도 향상.

12. Multi-GPU와 NCCL

12.1 확장성 문제

LLM 훈련은 모델이 단일 GPU 메모리를 초과. 수백~수천 GPU 병렬 필요.

패러다임:

Data Parallel: 같은 모델을 여러 GPU에 복제, 다른 데이터 학습 → gradient 합산.
Model Parallel: 모델을 여러 GPU에 분할.
Pipeline Parallel: 모델의 layer를 GPU별로 나눔.

모두 GPU 간 통신이 필요.

12.2 NVLink와 NVSwitch

NVIDIA의 고속 GPU 간 연결:

NVLink: GPU 쌍 사이 직접 연결. H100: 900 GB/s.
NVSwitch: GPU 간 fully-connected switch. DGX H100이 이걸 포함.

PCIe는 ~64 GB/s. NVLink는 훨씬 빠르다.

12.3 NCCL

NCCL (NVIDIA Collective Communications Library). MPI 유사 API, GPU 간 통신:

AllReduce: 모든 GPU의 값을 합산, 결과를 모두 공유 → gradient 동기화.
AllGather: 각 GPU의 데이터를 모든 GPU에 복사.
Broadcast: 한 GPU의 데이터를 모든 GPU에.
Reduce: 한 GPU로 합산.
AllToAll: 모든 GPU 간 교차.

NCCL은 NVLink, PCIe, InfiniBand, Ethernet을 자동으로 선택. PyTorch의 torch.distributed 기본 backend.

12.4 P2P

두 GPU가 NVLink로 직접 연결되어 있으면 peer-to-peer memcpy 가능. Host 거치지 않음.

cudaDeviceEnablePeerAccess(dev1, 0);
cudaMemcpyPeer(dst, dev1, src, dev0, size);

12.5 RDMA

다른 노드의 GPU와 직접 통신 (InfiniBand + GPUDirect RDMA). CPU 거치지 않음.

대규모 훈련의 핵심. 이 세션의 RDMA 포스트 참고.

13. 고수준 추상화

직접 CUDA 쓰는 개발자는 적다. 대부분 라이브러리 사용.

13.1 cuBLAS / cuDNN / cuSPARSE

NVIDIA의 수학 라이브러리:

cuBLAS: 행렬 연산 (GEMM, GEMV 등). BLAS 호환.
cuDNN: 딥러닝 primitive (convolution, pooling, RNN).
cuSPARSE: sparse 연산.
cuSOLVER: linear algebra solver.

PyTorch/TF 내부에서 호출. "표준 GPU 라이브러리".

13.2 Thrust

C++ STL 유사 GPU 라이브러리:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

thrust::device_vector<float> a(n), b(n), c(n);
thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<float>());

STL과 유사한 API로 GPU 프로그래밍. 낮은 러닝 커브.

13.3 CUB

Low-level 빌딩 블록. Reduction, scan, sort 같은 operation의 최적화된 구현.

13.4 cutlass

GEMM 빌딩 블록 라이브러리. 커스텀 GEMM 커널 작성 시 사용. FlashAttention도 cutlass 기반.

13.5 Triton

OpenAI의 Python-like 언어:

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

CUDA보다 훨씬 단순. 내부적으로 MLIR로 컴파일 → NVIDIA 최적화 커널 생성. 퍼포먼스가 핸드-튜닝 CUDA에 필적.

2024+ Triton이 AI 연구자의 주류 선택.

13.6 FlashAttention

Tri Dao의 attention 최적화 구현. Tensor Core + Shared Memory + fused kernel로 attention 연산을 메모리 경쟁 없이 수행. GPT/LLaMA 훈련에 필수.

CUDA로 작성 + cutlass 활용. Triton 버전도 존재.

14. 프로파일링과 최적화

14.1 Nsight Systems

전체 파이프라인 프로파일. Timeline 뷰. 커널 실행, 메모리 복사, CPU/GPU 활동.

nsys profile --stats=true ./my_app

"전체적으로 어디가 병목"인지 파악.

14.2 Nsight Compute

단일 커널 심층 분석:

Occupancy.
Memory throughput.
Warp efficiency.
Instruction mix.
Roofline analysis.

ncu --set full ./my_app

Nsight UI에서 각 커널의 상세 메트릭 + 개선 제안.

14.3 PyTorch Profiler

with torch.profiler.profile() as prof:
    model(input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

PyTorch 레벨에서 어느 연산이 시간 소비.

14.4 Roofline

각 커널이 compute-bound인지 memory-bound인지 판단:

Arithmetic Intensity = FLOPs / bytes 접근.
GPU의 peak FLOPS와 peak bandwidth로 roof.
커널이 roof의 어디에 위치?

Compute-bound: Tensor core 활용, occupancy. Memory-bound: Coalescing, shared memory, 데이터 reuse.

15. 경쟁자

15.1 AMD ROCm / HIP

AMD의 CUDA 대응. HIP (Heterogeneous-Compute Interface for Portability):

// CUDA
__global__ void kernel() { ... }

// HIP (거의 동일)
__global__ void kernel() { ... }

hipify 도구가 CUDA 코드를 자동 변환. MI300은 H100 대비 유사한 하드웨어 사양.

문제: 생태계. 라이브러리, 툴링, 커뮤니티가 NVIDIA에 한참 밀림.

15.2 Intel oneAPI / SYCL

Intel의 대응. SYCL = 표준 기반 (Khronos). DPC++ (Data Parallel C++).

queue q;
q.submit([&](handler& h) {
    h.parallel_for(range<1>(n), [=](id<1> i) {
        c[i] = a[i] + b[i];
    });
});

NVIDIA, AMD, Intel 모두 백엔드. 이론적으로 portable. 현실은 각 벤더 성능 차이.

15.3 Apple Metal

Apple Silicon의 GPU. Metal Performance Shaders (MPS).

PyTorch MPS backend로 macOS에서 GPU 훈련 가능. M-series 칩의 유니파이드 메모리로 일부 워크로드에 효율적.

15.4 WebGPU

브라우저에서 GPU 접근. Compute shader 지원. WebGL보다 훨씬 강력.

WebGPU + Wasm + LLM: 브라우저에서 소형 LLM 실행 가능.

15.5 Google TPU

Google 자체 ASIC. JAX/XLA 통합. 특정 워크로드(transformer)에 매우 효율적. Google Cloud 전용.

16. 학습 로드맵

1단계: 기본

CUDA C++ Programming Guide (NVIDIA 공식).
간단한 커널 작성: vector add, matrix mul.
nvcc 컴파일, cudaMemcpy, <<<grid, block>>>.

2단계: 최적화

Memory coalescing 실습.
Shared memory 활용.
Bank conflict 찾기.
Nsight Compute 사용.

3단계: 고급

Warp-level primitive (__shfl_sync).
Tensor Core (wmma).
CUDA Graph.
Multi-GPU (NCCL).

4단계: 생태계

cuBLAS, cuDNN 사용.
Triton으로 커스텀 커널.
PyTorch extension (torch.utils.cpp_extension).
FlashAttention, cutlass 읽기.

책:

"Programming Massively Parallel Processors" — Hwu, Kirk.
"CUDA by Example" — Sanders, Kandrot.
"Professional CUDA C Programming" — Cheng et al.

온라인:

NVIDIA Developer Blog.
"GPU MODE" Discord 및 YouTube.
PMPP 강의 (University of Illinois).

샘플:

NVIDIA cuda-samples GitHub.
Triton tutorial.
FlashAttention 소스.

17. 요약 — 한 장 정리

┌─────────────────────────────────────────────────────┐
│              CUDA Cheat Sheet                        │
├─────────────────────────────────────────────────────┤
│ 하드웨어:                                              │
│   GPU = N × SM                                       │
│   SM = Cores + Tensor Cores + RegFile + SharedMem    │
│   H100: 132 SM                                       │
│                                                       │
│ 실행 모델 (SIMT):                                      │
│   Grid → Block → Warp (32) → Thread                  │
│   프로그래머는 스칼라 코드                              │
│   HW가 32 threads를 묶어 실행                          │
│                                                       │
│ 메모리 계층:                                           │
│   Register (가장 빠름)                                │
│   Shared Memory / L1 (수십 cycles)                   │
│   L2 (~200 cycles)                                    │
│   HBM (수백 cycles, 수십 GB)                          │
│   Host (PCIe, 느림)                                   │
│                                                       │
│ 최우선 최적화:                                          │
│   Memory coalescing!                                  │
│   Thread i → array[i]                                │
│   Warp 32 threads → 연속 128 bytes                   │
│                                                       │
│ 함정:                                                  │
│   Warp divergence (분기)                             │
│   Bank conflict (shared memory)                      │
│   Register spill                                     │
│   Uncoalesced access                                 │
│                                                       │
│ Tensor Core:                                          │
│   Volta 2017+                                        │
│   Matrix multiply 전용                                │
│   FP16/BF16/FP8/FP4                                  │
│   Mixed precision (autocast)                         │
│   cuBLAS/cuDNN 자동 사용                              │
│                                                       │
│ Streams:                                              │
│   독립적 연산 시퀀스                                   │
│   copy/compute 겹침                                   │
│   pinned memory 필수                                  │
│                                                       │
│ CUDA Graph:                                           │
│   Launch overhead 제거                                │
│   소형 커널 많을 때 필수                               │
│                                                       │
│ Multi-GPU:                                            │
│   NVLink (900 GB/s H100)                             │
│   NCCL (AllReduce, etc)                              │
│   P2P, RDMA                                          │
│                                                       │
│ 라이브러리:                                             │
│   cuBLAS/cuDNN (행렬/DL primitive)                   │
│   Thrust (STL-like)                                  │
│   cutlass (GEMM 빌딩)                                 │
│   Triton (Python-like 커널)                          │
│   FlashAttention                                      │
│                                                       │
│ 도구:                                                  │
│   nvcc (컴파일러)                                     │
│   Nsight Systems (timeline)                          │
│   Nsight Compute (kernel deep)                       │
│   cuda-gdb                                            │
│                                                       │
│ 대안:                                                  │
│   ROCm/HIP (AMD)                                    │
│   SYCL/oneAPI (Intel)                                │
│   Metal (Apple)                                      │
│   WebGPU                                             │
│   TPU (Google)                                       │
└─────────────────────────────────────────────────────┘

18. 퀴즈

Q1. SIMT와 SIMD의 차이는?

A. 프로그래밍 모델 수준의 차이. SIMD(예: AVX)는 프로그래머가 명시적으로 vector 타입(__m256)을 사용해 "8개 float을 한 번에"를 써야 한다. 분기 처리가 어렵고 컴파일러 도움 없이는 최적화가 복잡. SIMT(CUDA)는 프로그래머가 하나의 스레드처럼 스칼라 코드를 쓰고, 하드웨어가 32 threads를 warp로 묶어 같은 명령어를 다른 데이터에 실행. "작성은 스칼라, 실행은 vector". 분기도 자연스럽게 쓸 수 있고(단, divergence 대가), GPU의 수천 코어를 활용하기 훨씬 쉽다. SIMT는 SIMD의 효율 + 스칼라 프로그래밍의 생산성을 결합 — CUDA가 OpenCL과 pure SIMD를 이긴 결정적 설계.

Q2. Memory coalescing이 왜 CUDA의 1순위 최적화인가?

A. 메모리 대역폭의 32배 차이. Warp의 32 threads가 연속된 128 바이트에 접근하면 하드웨어가 이를 1 memory transaction으로 묶는다. 같은 32 threads가 흩어진 주소에 접근하면 32 transactions가 발생하고 각 transaction은 128 바이트 중 실제 사용하는 4 바이트만 가져옴 → 나머지 124 바이트 낭비. 결과: 32배 적은 effective bandwidth. HBM은 이미 TB/s 수준이지만 uncoalesced 접근하면 수십 GB/s로 떨어짐. 해결은 간단하지만 엄격: "thread i는 array[i]에 접근"하라. 2D 배열은 row-major 순회. 이것이 왜 matrix transpose가 CUDA의 교과서 예제인지의 이유 — naive 구현은 uncoalesced고 최적화 버전은 shared memory를 중간 단계로 써서 read/write 모두 coalesced.

Q3. Warp divergence가 왜 성능을 해치는가?

A. Warp의 32 threads가 lockstep으로 실행되기 때문. Hardware는 32 threads에게 동시에 같은 명령어를 준다. if (tid < 16) A() else B() 같은 분기가 있으면 warp가 양쪽 경로를 모두 순차 실행한다: 먼저 threads 0-15이 A() 실행하는 동안 16-31은 idle, 그 다음 16-31이 B() 실행하는 동안 0-15가 idle. 결과: 병렬성 절반. 최악의 경우 32-way divergence는 32배 느려진다. 완화: (1) warp boundary에 맞춰 분기 (blockIdx.x % 2), (2) 분기 대신 산술(fabs, ternary), (3) 두 경로 모두 계산 후 select. Volta(2017) 이후 independent thread scheduling이 더 유연한 처리를 허용하지만 여전히 divergence 최소화가 원칙.

Q4. Shared memory bank conflict가 무엇인가?

A. Shared memory가 32개 bank로 나뉘어 있어 같은 bank 동시 접근 시 순차 처리. Shared memory는 32-bit word 단위로 stripe되어 bank 0이 words 0, 32, 64, ..., bank 1이 words 1, 33, 65, ...를 담당. Warp의 32 threads가 서로 다른 bank에 접근하면 1 cycle. 같은 bank의 다른 word에 접근하면 N-way conflict로 N cycle 걸림. 전형적 예: data[32][32] 2D 배열에서 column access. 모든 thread가 data[i][0]에 접근 → 모두 bank 0 → 32-way conflict. 해결: 패딩 — data[32][33]으로 크기를 33으로 해서 column access가 각자 다른 bank에 떨어지게. Broadcast(같은 word를 여러 thread가 읽기)는 OK — hardware가 한 번 읽고 분배.

Q5. Tensor Core가 AI 워크로드를 왜 10-100배 빠르게 하는가?

A. 행렬 곱을 전용 유닛으로 하드웨어화. 기존 CUDA core는 1 FMA(Fused Multiply-Add) per cycle. Tensor core(V100+)는 4×4 × 4×4 행렬 곱을 한 명령어로 수행 — 16 MACs per cycle. 더 중요한 것은 낮은 정밀도를 지원: FP16(V100), BF16/TF32/INT8(A100), FP8(H100), FP4(B200). 정밀도를 낮추면 메모리 대역폭 절반 + 더 많은 연산 처리. 결과: H100 기준 FP32 = 67 TFLOPS vs Tensor Core FP8 = 2000 TFLOPS. 30배 차이. Transformer 추론/훈련이 거의 완전히 matmul이므로 이 속도가 그대로 실제 성능에 반영. "CPU는 왜 못 쫓아오는가"의 답 — CPU에 비슷한 유닛을 만들 수 없는 건 아니지만 workload와 투자 차이. NVIDIA가 2017년 한 이 결정이 AI 가속기 시장을 정의.

Q6. CUDA Graph가 해결하는 문제는?

A. Kernel launch overhead. 매 kernel<<<>>>() 호출마다 CPU↔GPU 통신, driver 오버헤드로 수십 μs 걸림. 큰 커널에선 무시 가능하지만 Transformer 추론처럼 수백 개의 작은 커널(layer마다 여러 연산)을 연속 호출하면 overhead만 수 ms — 실제 compute time에 근접. CUDA Graph는 "커널 호출 시퀀스를 미리 기록해서 하나의 graph launch로 submit". 수백 개의 kernel launch가 하나의 launch로 감축. Transformer 추론 token generation에서 2-3배 속도 향상. PyTorch 2.0+의 torch.compile + CUDAGraphs 조합이 자동 적용. 핵심: "launch 자체가 병목인가"를 판단해야 효과적 — compute-heavy workload는 큰 효과 없지만 latency-sensitive small kernel workload는 극적.

Q7. AMD ROCm이 CUDA를 대체 못 하는 이유는?

A. 기술보다 생태계. 하드웨어는 AMD MI300이 H100에 근접하다(일부 지표는 더 우수). HIP은 CUDA 코드를 자동 변환해주는 도구(hipify)까지 제공. 그런데도 실제 채택이 느린 이유: (1) 라이브러리 성숙도 — cuBLAS/cuDNN/NCCL에 해당하는 rocBLAS/MIOpen/RCCL이 기능 제한 또는 성능 차이, (2) 프레임워크 지원 — PyTorch가 공식 지원하지만 일부 연산만 최적화, 최신 기능은 CUDA 먼저, (3) 개발자 커뮤니티 — 책, 튜토리얼, Stack Overflow 답변, 블로그 압도적으로 CUDA 우세, (4) 핵심 라이브러리 저자 — Triton, FlashAttention, cutlass 같은 혁신이 CUDA 타겟, AMD는 항상 뒤따름, (5) 기업 관성 — 엔지니어 훈련 비용과 이미 작성된 CUDA 코드베이스의 sunk cost. "Hardware가 같아도 생태계가 다르면 사용자가 움직이지 않는다"는 교훈. AMD가 따라잡으려면 수년간의 투자가 필요.

이 글이 도움이 됐다면 다음 포스트도 확인해 보세요:

"Transformer Architecture Deep Dive" — GPU에서 실제 돌아가는 것.
"Diffusion Models Deep Dive" — 또 다른 GPU-heavy 워크로드.
"RDMA & NCCL" — 멀티 GPU 통신.
"LLVM Compiler Infrastructure" — Triton이 MLIR로 컴파일되는 배경.

CUDA GPU Programming Model Deep Dive — SIMT, Memory Hierarchy, Tensor Core, Kernel Optimization (2025)

TL;DR

CUDA is NVIDIA's GPU programming platform. Since 2007, the de facto standard for AI/HPC. All major AI workloads (ChatGPT, Stable Diffusion) run on CUDA.
SIMT (Single Instruction, Multiple Threads): programmers write scalar code, hardware bundles 32 threads = 1 warp for parallel execution. SIMD efficiency plus scalar programming productivity.
Thread hierarchy: Grid, Block, Warp, Thread. Each block executes on the same SM (Streaming Multiprocessor). Warp-level scheduling.
Memory hierarchy: Register (fastest), Shared Memory (L1 size), L2 Cache, Global Memory (slow but big). Each level differs by tens to hundreds of times.
Memory Coalescing: if a warp's 32 threads access contiguous addresses, it becomes 1 transaction. Random access becomes 32 transactions. 32x bandwidth difference — the #1 CUDA optimization.
Warp Divergence: threads within a warp taking different paths are executed serially, losing parallelism.
Tensor Core (Volta, 2017): matrix multiply as a single instruction. 125 TFLOPS at FP16. The key to AI acceleration.
Modern trend: higher-level abstractions like Triton, cutlass, and FlashAttention replace hand-tuned CUDA.
Alternatives: ROCm/HIP (AMD), SYCL/oneAPI (Intel), Metal (Apple). Ecosystem gap keeps CUDA dominant.

1. Why GPUs Dominate AI

1.1 CPU vs GPU

Item	CPU	GPU
Core count	8-128	Thousands to tens of thousands
Per-core perf	Very fast	Relatively slow
Cache	Large L1/L2/L3	Small shared cache
Parallelism	ILP + MIMD	SIMT (massive)
Purpose	General, branchy logic	Math, data-parallel

CPU: "A few smart cores handle complex work." GPU: "Many simple cores do the same work in parallel."

1.2 Nature of AI Workloads

One Transformer inference pass:

MatMul x thousands
Element-wise ops (GELU, LayerNorm)
Attention (QK^T, softmax, @V)

All are large-scale matrix operations — billions of MAC (multiply-accumulate) ops. CPUs do 100-1000 GFLOPS, GPUs do 100,000+ TFLOPS (H100). A 1000x gap.

1.3 NVIDIA's Monopoly

Three factors:

CUDA ecosystem: libraries, docs, community accumulated since 2007.
Hardware investment: Tensor Core, HBM, NVLink, NVSwitch.
Jensen Huang's long bet: all-in on AI since AlexNet (2012).

AMD (MI300), Intel (Gaudi), Google (TPU), Apple (M-series), China (Huawei Ascend) all compete, but as of 2025 CUDA compatibility and maturity keep NVIDIA dominant.

2. GPU Hardware Architecture

2.1 Streaming Multiprocessor (SM)

The basic GPU unit is the SM. Each SM is an independent execution unit.

H100:   132 SM
A100:   108 SM
RTX 4090: 128 SM

Inside each SM:

+---------------------------------+
|           SM                    |
|  +---------+  +---------+       |
|  | Warp    |  | Warp    | x 4   |
|  | Sched   |  | Sched   |       |
|  +---------+  +---------+       |
|                                 |
|  CUDA Cores (INT32, FP32, FP64) |
|  Tensor Cores                   |
|  Special Function Units         |
|  Load/Store Units               |
|                                 |
|  Register File (tens of thousands) |
|  Shared Memory / L1 (hundreds of KB) |
+---------------------------------+

2.2 Warp

A warp = 32 threads executed in lockstep. The fundamental NVIDIA GPU unit. A100 runs up to 64 warps (= 2048 threads) per SM. Warp schedulers pick runnable warps each cycle to hide latency.

2.3 SIMT — The Programming Model

SIMT (Single Instruction, Multiple Threads): programmers write code as if for a single thread; hardware bundles 32 threads and issues one instruction over different data.

SIMT differs from SIMD:

SIMD (e.g., AVX)	SIMT (CUDA)
Explicit vector	Scalar code
8xfloat at once	32 threads x 32xfloat
Hard to branch	Branches allowed (divergence cost)
Hard to develop	Natural

2.4 Full GPU Structure

Host (CPU + DRAM)
     | PCIe / NVLink
GPU
+-- L2 Cache (tens of MB, global)
+-- Global Memory (HBM, tens-hundreds of GB)
+-- SM x N
    +-- L1 / Shared Memory
    +-- Register File
    +-- Cores

Global Memory (HBM): shared across GPU, 40-192 GB, ~500-cycle latency, but huge bandwidth (3 TB/s on H100).

2.5 Access Latency

Approximate (A100):

Register:       1 cycle
Shared Memory:  ~20 cycles
L1 cache:       ~30 cycles
L2 cache:       ~200 cycles
HBM:            ~500 cycles

Host DRAM via PCIe: ~10,000+ cycles. Data locality is everything.

3. CUDA Programming Model

3.1 Hello World

#include <stdio.h>

__global__ void hello_kernel() {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Hello from thread %d\n", tid);
}

int main() {
    hello_kernel<<<2, 32>>>();  // 2 blocks x 32 threads = 64 threads
    cudaDeviceSynchronize();
    return 0;
}

__global__: a kernel executed on the GPU.
<<<grid, block>>>: launch config — grid size x block size.
threadIdx.x, blockIdx.x, blockDim.x: index/size intrinsics.

3.2 Thread Hierarchy

Grid contains Blocks; Blocks contain Threads. Blocks on the same SM share shared memory. 2D/3D is supported:

dim3 grid(16, 16);
dim3 block(32, 32);
kernel<<<grid, block>>>();
// 16x16x32x32 = 262,144 threads

3.3 Global Index

__global__ void add_kernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

3.4 Memory Management

float *d_a;
cudaMalloc(&d_a, n * sizeof(float));
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(d_a);
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_a);

PCIe copies are expensive — minimize them.

3.5 Unified Memory

float *data;
cudaMallocManaged(&data, n * sizeof(float));

Runtime handles page faults, moving pages between host and device. Easier, slightly slower than manual.

4. Memory Hierarchy

4.1 Global Memory

Largest (40-192 GB), shared GPU-wide, ~hundreds of cycles.

4.2 Shared Memory

Inside SM (same physical as L1). Very fast (tens of cycles). Shared across threads in a block. Limited (~48-228 KB per SM).

__global__ void kernel() {
    __shared__ float shared_data[256];
    shared_data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    // All threads can read shared_data
}

Use case: reuse data that multiple threads access repeatedly.

4.3 Register

Fastest (1 cycle), thread-local, compiler-allocated. Too many registers per thread cause register spill to global memory.

4.4 Constant Memory

__constant__ float coefficients[256];

64 KB, read-only, fast broadcast when all warp threads read the same value.

4.5 Texture Memory

Originally for graphics; offers 2D spatial locality and interpolation. Modern code often uses __ldg() instead.

5. Memory Coalescing — The Top Optimization

5.1 Principle

Coalesced: 32 threads access contiguous 128 bytes -> 1 memory transaction.

Uncoalesced: threads hit scattered addresses -> 32 transactions, only 4 of 128 bytes per transaction used.

32x bandwidth difference. Real GPU performance is decided here.

5.2 Example

BAD:

__global__ void transpose_bad(float *in, float *out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[x * N + y] = in[y * N + x];
}

Stride N access -> uncoalesced.

GOOD (via shared memory):

__global__ void transpose_good(float *in, float *out, int N) {
    __shared__ float tile[32][33];  // +1 to avoid bank conflict
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];
}

20-30x speedup.

5.3 Pattern

"Thread i accesses array[i]." That is the coalescing rule.

6. Warp Divergence

If threads within a warp take different paths, hardware executes both paths serially:

if (threadIdx.x < 16) compute_a();
else compute_b();

-> 50% parallelism lost.

Mitigation:

Branch at warp boundaries (e.g., blockIdx.x % 2).
Replace branches with arithmetic (fabs, ternary).
Compute both paths, select result.

Volta (2017) introduced Independent Thread Scheduling, but avoiding divergence remains the guideline.

7. Shared Memory Bank Conflict

Shared memory is split into 32 banks of 32-bit words.

Different threads hit different banks -> 1 cycle.
Same bank, different words -> N-way conflict -> N cycles.

__shared__ float data[32][32];
float y = data[0][threadIdx.x];  // all threads -> bank 0 -> 32-way conflict

Fix with padding: __shared__ float data[32][33];

Broadcast (same word, many threads) is free.

8. Occupancy

Occupancy = active warps / max warps per SM. A100 max is 64 warps.

High occupancy hides latency via warp switching. Limits:

Registers per thread.
Shared memory per block.
Threads per block.

__global__ __launch_bounds__(256, 4)
void kernel() { /* ... */ }

High occupancy is not always best — Tensor Core code can peak at low occupancy. Measure with Nsight Compute.

9. Tensor Core — The Key to AI Acceleration

9.1 Volta (2017)

V100 Tensor Core does 4x4 x 4x4 matrix multiply per cycle: D = A x B + C. 16 MACs per cycle (vs CUDA core's 1 MAC). V100 FP16 = 125 TFLOPS.

9.2 Evolution

V100 (Volta, 2017): FP16 -> FP32 matmul.
A100 (Ampere, 2020): TF32, BF16, INT8, 2:4 sparsity. 312 TFLOPS (BF16).
H100 (Hopper, 2022): FP8, Transformer Engine. 2000 TFLOPS (FP8).
B100/B200 (Blackwell, 2024): FP4, 20,000+ TFLOPS.

9.3 Usage

Direct via wmma API:

#include <mma.h>
using namespace nvcuda;

__global__ void matmul_wmma(half *a, half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

Most developers use cuBLAS/cuDNN/PyTorch/Triton instead.

9.4 Mixed Precision

Low-precision input (FP16) + high-precision accumulation (FP32). Half memory, half bandwidth, Tensor Core speed, numerical safety. Automated by PyTorch's torch.cuda.amp.autocast() (AMP).

10. Streams and Asynchronous Execution

A CUDA stream is an ordered queue of operations. Different streams can run in parallel.

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block, 0, stream2>>>(...);

Overlap copy and compute across streams. Async copies need pinned memory:

cudaMallocHost(&h_data, n * sizeof(float));
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);

Events synchronize across streams.

11. CUDA Graph — Eliminate Launch Overhead

Each kernel launch has ~tens of microseconds overhead. Transformer inference issues hundreds of small kernels -> overhead dominates.

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel1<<<..., stream>>>(...);
kernel2<<<..., stream>>>(...);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
cudaGraphLaunch(graphExec, stream);

Hundreds of launches reduce to one graph launch. 2-3x speedup on Transformer token generation. PyTorch 2.0+ CUDAGraphs automates this.

12. Multi-GPU and NCCL

LLM training exceeds a single GPU's memory. Parallel patterns: Data Parallel, Model Parallel, Pipeline Parallel — all need GPU-to-GPU communication.

NVLink: H100 = 900 GB/s.
NVSwitch: fully-connected GPU switch in DGX.
PCIe: ~64 GB/s.

NCCL provides AllReduce, AllGather, Broadcast, Reduce, AllToAll. Selects NVLink/PCIe/IB/Ethernet automatically. Default PyTorch distributed backend.

P2P memcpy via NVLink:

cudaDeviceEnablePeerAccess(dev1, 0);
cudaMemcpyPeer(dst, dev1, src, dev0, size);

GPUDirect RDMA enables direct inter-node GPU communication.

13. High-Level Abstractions

cuBLAS / cuDNN / cuSPARSE / cuSOLVER: NVIDIA's standard math libraries.
Thrust: STL-like C++ GPU library.
CUB: optimized reduction/scan/sort building blocks.
cutlass: GEMM building blocks (used by FlashAttention).
Triton: OpenAI's Python-like kernel language.

Triton example:

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)

Triton is the mainstream choice for AI researchers post-2024. FlashAttention (Tri Dao) uses Tensor Core + shared memory + fused kernels for memory-efficient attention — critical for GPT/LLaMA training.

14. Profiling and Optimization

Nsight Systems: full pipeline timeline (nsys profile --stats=true ./my_app).
Nsight Compute: per-kernel deep analysis (occupancy, memory throughput, warp efficiency, roofline) (ncu --set full ./my_app).
PyTorch Profiler: framework-level view.

Roofline tells you if a kernel is compute-bound or memory-bound. Compute-bound -> Tensor Core, occupancy. Memory-bound -> coalescing, shared memory, reuse.

15. Competitors

AMD ROCm / HIP: near-CUDA source compatibility with hipify. MI300 hardware close to H100. Ecosystem still lags.
Intel oneAPI / SYCL: Khronos-based, cross-vendor. DPC++ portable but uneven performance.
Apple Metal: PyTorch MPS backend on macOS; unified memory on M-series.
WebGPU: compute shaders in the browser; small LLMs via WebGPU + Wasm.
Google TPU: JAX/XLA, Google Cloud-only, Transformer-optimized.

16. Learning Path

Basics: CUDA C++ Programming Guide, simple kernels (vector add, matmul), nvcc, cudaMemcpy.
Optimization: coalescing, shared memory, bank conflicts, Nsight Compute.
Advanced: warp-level primitives (__shfl_sync), Tensor Core (wmma), CUDA Graph, multi-GPU (NCCL).
Ecosystem: cuBLAS/cuDNN, Triton, PyTorch extensions, FlashAttention/cutlass reading.

Books: PMPP (Hwu/Kirk), CUDA by Example, Professional CUDA C Programming. Online: NVIDIA Developer Blog, GPU MODE, PMPP lectures.

17. Cheat Sheet

+------------------------------------------+
|          CUDA Cheat Sheet                |
+------------------------------------------+
| Hardware:                                 |
|   GPU = N x SM                            |
|   SM = Cores + Tensor Cores + RegFile     |
|   H100: 132 SM                            |
|                                           |
| Execution (SIMT):                         |
|   Grid -> Block -> Warp (32) -> Thread    |
|   Scalar code; HW bundles 32 threads      |
|                                           |
| Memory Hierarchy:                         |
|   Register (fastest)                      |
|   Shared / L1 (tens of cycles)            |
|   L2 (~200 cycles)                        |
|   HBM (~500 cycles)                       |
|                                           |
| #1 Optimization: Memory Coalescing        |
|   Thread i -> array[i]                    |
|                                           |
| Pitfalls:                                 |
|   Warp divergence, bank conflict,         |
|   register spill, uncoalesced access      |
|                                           |
| Tensor Core: Volta+, FP16/BF16/FP8/FP4    |
| Streams: overlap copy/compute             |
| CUDA Graph: remove launch overhead        |
| Multi-GPU: NVLink + NCCL + P2P + RDMA     |
| Libraries: cuBLAS/cuDNN/Thrust/cutlass/   |
|            Triton/FlashAttention          |
| Tools: nvcc, Nsight Systems/Compute       |
| Alternatives: ROCm/SYCL/Metal/WebGPU/TPU  |
+------------------------------------------+

18. Quiz

Q1. What is the difference between SIMT and SIMD?

A. SIMD (e.g., AVX) requires explicit vector types like __m256; branches are hard, optimization needs compiler help. SIMT (CUDA) lets you write scalar code; hardware bundles 32 threads into a warp issuing one instruction over different data. "Write scalar, execute vector." Branches work naturally (at the cost of divergence). SIMT provides SIMD efficiency with scalar productivity — the decisive design that beat OpenCL and pure SIMD.

Q2. Why is memory coalescing the #1 CUDA optimization?

A. 32x bandwidth difference. A warp accessing a contiguous 128 bytes turns into 1 transaction; scattered access becomes 32 transactions, each wasting 124 of 128 bytes. HBM peaks at TB/s but drops to tens of GB/s when uncoalesced. Rule: thread i accesses array[i]. This is why matrix transpose is the textbook CUDA example — naive is uncoalesced; the optimized version uses shared memory as an intermediate so both read and write are coalesced.

Q3. Why does warp divergence hurt performance?

A. Warps run 32 threads in lockstep. With if (tid < 16) A() else B(), hardware serializes both paths: threads 0-15 run A() while 16-31 idle, then 16-31 run B() while 0-15 idle. Parallelism halves. Worst case is 32-way divergence, 32x slower. Mitigations: branch at warp boundaries (blockIdx.x % 2), replace with arithmetic (fabs, ternary), or compute both paths and select. Volta (2017)+ adds independent thread scheduling for flexibility, but minimizing divergence remains best.

Q4. What is a shared memory bank conflict?

A. Shared memory is split into 32 banks of 32-bit words (bank 0: words 0, 32, 64, ...; bank 1: words 1, 33, 65, ...). If 32 threads hit different banks -> 1 cycle. Same bank, different words -> N-way conflict. Classic example: data[32][32] column access puts all threads in bank 0 -> 32-way. Fix: pad to data[32][33] so columns fall across different banks. Broadcast (same word across threads) is free.

Q5. Why do Tensor Cores make AI 10-100x faster?

A. Matrix multiply is hardware-specialized. CUDA core: 1 FMA/cycle. Tensor Core (V100+): a 4x4 x 4x4 matmul per cycle = 16 MACs. Lower precisions help further — FP16, BF16/TF32/INT8 (A100), FP8 (H100), FP4 (B200). H100: FP32 = 67 TFLOPS vs Tensor Core FP8 = 2000 TFLOPS (~30x). Transformer training/inference is nearly pure matmul, so this speedup hits real workloads. NVIDIA's 2017 decision defined the AI accelerator market.

Q6. What problem does CUDA Graph solve?

A. Kernel launch overhead (~tens of microseconds per launch). Negligible for big kernels, but Transformer inference issues hundreds of small kernels — launch overhead approaches actual compute. CUDA Graph records a kernel sequence as a single graph, launched once instead of N times. 2-3x speedup on token generation. PyTorch 2.0+ torch.compile with CUDAGraphs automates this. Best when launch is the bottleneck; less effective for compute-heavy workloads.

Q7. Why can AMD ROCm not replace CUDA?

A. Ecosystem, not hardware. AMD MI300 is close to H100 (sometimes better). HIP with hipify converts CUDA code. Yet adoption lags because: (1) library maturity — rocBLAS/MIOpen/RCCL trail cuBLAS/cuDNN/NCCL in features and performance; (2) framework support — PyTorch ships CUDA features first; (3) developer community — books, tutorials, Stack Overflow overwhelmingly CUDA; (4) innovation originates on CUDA (Triton, FlashAttention, cutlass); (5) corporate inertia — engineer training costs and existing CUDA codebases. "Same hardware, different ecosystem, users stay put." Catching up takes years of investment.

"Transformer Architecture Deep Dive" — what actually runs on GPUs.
"Diffusion Models Deep Dive" — another GPU-heavy workload.
"RDMA & NCCL" — multi-GPU communication.
"LLVM Compiler Infrastructure" — the MLIR backing Triton.