Split View: NVIDIA GPU와 CUDA 완전 해부: 왜 GPU가 AI를 지배하는가

NVIDIA GPU와 CUDA 완전 해부: 왜 GPU가 AI를 지배하는가

왜 지금 GPU를 깊이 이해해야 하는가
1. CPU vs GPU: 근본적 설계 철학의 차이
2. GPU 내부 계층 구조: Device부터 CUDA Core까지
3. SIMT: Single Instruction, Multiple Threads
- Warp: GPU의 기본 실행 단위
- Warp Divergence: 성능 킬러
4. CUDA 메모리 계층: 레이턴시와 대역폭의 트레이드오프
5. 행렬 곱셈의 병렬화: CUDA 커널의 핵심
- Naive 구현: 글로벌 메모리 폭격
- Tiled 구현: 공유 메모리로 글로벌 메모리 접근 최소화
6. Tensor Core: 행렬 곱셈을 위한 전용 하드웨어
7. Grid, Block, Thread 계층: CUDA 프로그래밍 모델
8. CUDA의 역사: 왜 NVIDIA가 AI를 독점했는가
9. 실전 최적화: H100에서 GEMM 최대 성능 뽑기
10. 정리: GPU가 AI를 지배하는 이유

왜 지금 GPU를 깊이 이해해야 하는가

GPT-4를 추론하는 데 A100 GPU 수백 장이 동시에 돌아간다. LLaMA-3를 파인튜닝하려면 H100 클러스터가 필요하다. 이 모든 계산의 심장부에 NVIDIA GPU와 CUDA가 있다. 그런데 정작 "왜 GPU가 이렇게 빠른가"를 정확히 설명할 수 있는 엔지니어는 드물다. 이 글에서는 H100 스펙, SIMT 실행 모델, Warp divergence, 공유 메모리 타일링, Tensor Core까지 실제 CUDA 코드와 함께 완전히 해부한다.

1. CPU vs GPU: 근본적 설계 철학의 차이

CPU와 GPU의 차이는 단순히 "코어 수"가 아니다. 설계 철학 자체가 다르다.

CPU: 레이턴시 최적화

CPU는 단일 스레드를 최대한 빠르게 실행하도록 설계되었다. 이를 위해 다음 장치들을 탑재한다:

분기 예측(Branch Prediction): if/else 분기를 미리 예측해 파이프라인 스톨을 줄임
비순서 실행(Out-of-Order Execution): 의존성이 없는 명령어를 순서와 무관하게 병렬 실행
투기적 실행(Speculative Execution): 분기 결과를 예측하고 미리 계산 (Spectre 취약점의 원인이기도 함)
거대한 L1/L2/L3 캐시: 단일 스레드의 메모리 레이턴시를 숨기기 위한 수십 MB 캐시

GPU: 처리량 최적화

GPU는 수천 개의 스레드가 동시에 실행될 때 전체 처리량을 극대화하도록 설계되었다. 개별 스레드의 레이턴시는 포기하는 대신, 레이턴시가 발생할 때 다른 스레드로 즉시 컨텍스트 스위칭한다.

CPU 아키텍처 (레이턴시 최적화):
┌────────────────────────────────────────────┐
│  코어 0  │  코어 1  │  코어 2  │  코어 3  │
│ (강력)   │ (강력)   │ (강력)   │ (강력)   │
│ OOO실행  │ OOO실행  │ OOO실행  │ OOO실행  │
├──────────┴──────────┴──────────┴───────────┤
│           L3 캐시 (32MB+)                  │
├────────────────────────────────────────────┤
│           메인 메모리 (DDR5, ~50 GB/s)     │
└────────────────────────────────────────────┘

GPU 아키텍처 (처리량 최적화) — H100 기준:
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│SM│SM│SM│SM│SM│SM│SM│SM│SM│SM│...│  132개 SM
│  │  │  │  │  │  │  │  │  │  │   │  각 SM = 128 CUDA 코어
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴───┘
       L2 캐시 (50MB)
       HBM3 (80GB, 3.35 TB/s)

핵심 차이: CPU는 1개 스레드를 1ns 만에 처리하려 하고, GPU는 1개 스레드가 700ns 걸려도 10만 개 스레드가 동시에 나아가므로 전체 처리량이 압도적이다.

2. GPU 내부 계층 구조: Device부터 CUDA Core까지

H100의 내부는 계층적으로 구성된다. 최상위 Device에서 최하위 CUDA Core까지 계층을 내려가 보자.

GPU Device (H100 SXM5)
│
├── SM 0 (Streaming Multiprocessor)
│   ├── 4개 Warp Scheduler
│   ├── 4개 Dispatch Unit
│   ├── 128개 CUDA Core (FP32)
│   ├── 64개 FP64 Core
│   ├── 4개 Tensor Core (4세대, FP8/FP16/BF16/TF32 지원)
│   ├── 1개 특수함수 유닛 (SFU: sin, cos, sqrt 등)
│   ├── LD/ST 유닛 (Load/Store)
│   ├── L1 캐시 + 공유 메모리 = 228KB (비율 설정 가능)
│   └── 레지스터 파일 (65536개 x 32비트 레지스터)
│
├── SM 1 ...
├── SM 2 ...
│   ...
├── SM 131 ...
│
├── L2 캐시 (50MB)
└── HBM3 메모리 (80GB, 3.35 TB/s 대역폭)

핵심 수치 (H100 SXM5):

항목	수치
SM 개수	132개
SM당 CUDA Core	128개 (FP32)
총 CUDA Core	16,896개
SM당 Tensor Core	4개 (4세대)
총 Tensor Core	528개
L2 캐시	50MB
HBM3 메모리	80GB
메모리 대역폭	3.35 TB/s
FP16 Tensor Core 성능	989 TFLOPS

3. SIMT: Single Instruction, Multiple Threads

GPU의 핵심 실행 모델은 **SIMT(Single Instruction Multiple Threads)**다. CPU의 SIMD(Single Instruction Multiple Data)와 유사하지만, 핵심 차이가 있다.

Warp: GPU의 기본 실행 단위

CUDA에서 Warp는 32개 스레드의 묶음으로, 이 32개 스레드가 항상 동일한 명령어를 동시에 실행한다. 즉, SM의 Warp Scheduler는 Warp 단위로 명령어를 디스패치한다.

Thread Block (예: 256개 스레드)
├── Warp 0: Thread 0~31   → 동일 명령어 실행 (동기화됨)
├── Warp 1: Thread 32~63  → 동일 명령어 실행 (동기화됨)
├── Warp 2: Thread 64~95  → 동일 명령어 실행 (동기화됨)
├── ...
└── Warp 7: Thread 224~255

Warp Divergence: 성능 킬러

Warp 내 스레드들이 서로 다른 분기를 취하면 Warp Divergence가 발생한다. GPU는 SIMT이기 때문에, 분기를 직렬화한다.

// 이 코드는 Warp Divergence를 유발한다
__global__ void divergent_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0) {
        // Warp 내 짝수 스레드만 실행
        data[idx] = data[idx] * 2.0f;  // Step 1
    } else {
        // Warp 내 홀수 스레드만 실행
        data[idx] = data[idx] + 1.0f;  // Step 2
    }
    // 결과: Step 1 실행 시 홀수 스레드는 idle
    //       Step 2 실행 시 짝수 스레드는 idle
    // 실제 처리량: 이론값의 50%
}

반면, 모든 스레드가 같은 분기를 취하면 divergence가 없다:

// Divergence 없음: 모든 스레드가 같은 경로
__global__ void coherent_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // blockIdx.x 전체가 짝수이거나 홀수이면 divergence 없음
    if (blockIdx.x % 2 == 0) {
        data[idx] = data[idx] * 2.0f;
    } else {
        data[idx] = data[idx] + 1.0f;
    }
}

실무 팁: 분기 조건을 threadIdx 기반이 아닌 blockIdx 기반으로 설계하면 divergence를 줄일 수 있다.

4. CUDA 메모리 계층: 레이턴시와 대역폭의 트레이드오프

CUDA 커널 성능의 대부분은 메모리 접근 패턴에서 결정된다. 계층별 특성을 정확히 알아야 한다.

메모리 계층 (빠름 → 느림):

┌─────────────────────────────────────────────────────┐
│  레지스터 (32-bit, 65536개/SM)                      │
│  레이턴시: ~1 사이클    대역폭: 극대 (로컬 접근)     │
│  범위: 단일 스레드만 접근 가능                       │
├─────────────────────────────────────────────────────┤
│  L1 캐시 / 공유 메모리 (228KB/SM on H100)           │
│  레이턴시: ~32 사이클   대역폭: ~19 TB/s             │
│  범위: Thread Block 내 모든 스레드                   │
├─────────────────────────────────────────────────────┤
│  L2 캐시 (50MB on H100)                             │
│  레이턴시: ~200 사이클  대역폭: ~12 TB/s             │
│  범위: 모든 SM 공유                                  │
├─────────────────────────────────────────────────────┤
│  글로벌 메모리 HBM3 (80GB on H100)                  │
│  레이턴시: ~700 사이클  대역폭: 3.35 TB/s            │
│  범위: 모든 스레드 접근 가능                         │
└─────────────────────────────────────────────────────┘

메모리 종류	레이턴시	대역폭	크기	범위
레지스터	~1 사이클	극대	256KB/SM	단일 스레드
공유 메모리	~32 사이클	~19 TB/s	228KB/SM	Thread Block
L2 캐시	~200 사이클	~12 TB/s	50MB	전체 GPU
HBM3 (글로벌)	~700 사이클	3.35 TB/s	80GB	전체 GPU

공유 메모리(Shared Memory)는 프로그래머가 직접 제어하는 캐시다. __shared__ 키워드로 선언하면 해당 Thread Block의 모든 스레드가 접근할 수 있고, HBM보다 20배 이상 빠르다.

5. 행렬 곱셈의 병렬화: CUDA 커널의 핵심

딥러닝의 핵심 연산은 행렬 곱셈(GEMM: General Matrix Multiply)이다. 이를 GPU에서 어떻게 병렬화하는지 단계별로 살펴본다.

Naive 구현: 글로벌 메모리 폭격

// Naive 행렬 곱셈 커널 — 느린 이유를 이해하기 위한 버전
__global__ void matmul_naive(float* A, float* B, float* C, int N) {
    // 각 스레드가 출력 행렬의 한 원소를 담당
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // 매 반복마다 글로벌 메모리 접근 2번 → 레이턴시 ~700 사이클
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// 호출 예시
dim3 blockDim(16, 16);  // 256 스레드/블록
dim3 gridDim((N + 15) / 16, (N + 15) / 16);
matmul_naive<<<gridDim, blockDim>>>(A, B, C, N);

이 구현의 문제: N=4096인 행렬이면 각 스레드가 4096번 글로벌 메모리에서 읽는다. 총 글로벌 메모리 읽기 횟수 = N^3 = 68 billion번. HBM 레이턴시 700 사이클 × 68B = 처참한 성능.

Tiled 구현: 공유 메모리로 글로벌 메모리 접근 최소화

#define TILE_SIZE 16

// 타일링된 행렬 곱셈 — 공유 메모리 활용
__global__ void matmul_tiled(float* A, float* B, float* C, int N) {
    // 공유 메모리에 타일 선언 (L1 수준 레이턴시)
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;
    float sum = 0.0f;

    // K 차원을 타일 단위로 순회
    for (int t = 0; t < N / TILE_SIZE; t++) {
        // 단계 1: 각 스레드가 타일 하나씩 공유 메모리에 로드
        tile_A[ty][tx] = A[row * N + t * TILE_SIZE + tx];
        tile_B[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];

        // 단계 2: 모든 스레드가 로드 완료할 때까지 동기화
        __syncthreads();

        // 단계 3: 공유 메모리에서 타일 내 내적 계산
        // 이 루프는 전부 공유 메모리(~32 사이클) 접근
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tile_A[ty][k] * tile_B[k][tx];
        }

        // 단계 4: 다음 타일 로드 전에 동기화
        __syncthreads();
    }

    C[row * N + col] = sum;
}

왜 빠른가:

타일 크기 16×16 기준:

Naive: 글로벌 메모리 접근 = 2 × N^3 번
Tiled: 글로벌 메모리 접근 = 2 × N^3 / TILE_SIZE 번 = 16배 감소

타일을 공유 메모리에 한 번 올려두면, 타일 내 16개 스레드가 같은 데이터를 공유해서 재사용한다. 메모리 대역폭 효율이 극적으로 개선된다.

타일링 원리:

A 행렬:         B 행렬:
┌──┬──┬──┬──┐  ┌──┬──┬──┬──┐
│T0│T1│T2│T3│  │T0│T1│T2│T3│
├──┼──┼──┼──┤  ├──┼──┼──┼──┤
│  │  │  │  │  │  │  │  │  │
└──┴──┴──┴──┘  └──┴──┴──┴──┘

각 Thread Block은 C의 16×16 타일 하나를 담당.
A의 해당 행 타일들과 B의 해당 열 타일들을 순서대로
공유 메모리에 올려서 계산 → 글로벌 메모리 접근 최소화.

6. Tensor Core: 행렬 곱셈을 위한 전용 하드웨어

CUDA Core가 스칼라 FP32 연산을 1개씩 처리한다면, Tensor Core는 행렬 전체를 단 하나의 명령어로 처리한다.

4세대 Tensor Core (H100):

연산	Tensor Core	CUDA Core	배속
FP16 행렬 곱	1 사이클에 256 ops	1 사이클에 2 ops	128배
BF16 행렬 곱	1 사이클에 256 ops	지원 안 함	—
FP8 행렬 곱	1 사이클에 512 ops	지원 안 함	—
TF32 행렬 곱	1 사이클에 128 ops	지원 안 함	—

4세대 Tensor Core는 한 명령어로 16×16 행렬 곱셈을 수행한다. WMMA(Warp Matrix Multiply-Accumulate) API로 직접 호출할 수 있다:

// WMMA API를 이용한 Tensor Core 직접 호출
#include <mma.h>
using namespace nvcuda::wmma;

__global__ void tensor_core_matmul(half* a_ptr, half* b_ptr,
                                    float* c_ptr, int M, int N, int K) {
    // 각 Warp가 16x16 출력 타일 하나를 담당
    // fragment = Tensor Core가 다루는 행렬 조각
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;

    // 누산기 초기화
    fill_fragment(c_frag, 0.0f);

    // K 차원 순회하며 타일 로드 후 곱셈
    for (int k = 0; k < K; k += 16) {
        // 글로벌 메모리에서 fragment 로드
        load_matrix_sync(a_frag, a_ptr + /* offset */, K);
        load_matrix_sync(b_frag, b_ptr + /* offset */, N);

        // 단 하나의 명령어로 16x16x16 행렬 곱 누산
        // 내부에서 Tensor Core 하드웨어 직접 호출
        mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // 결과를 글로벌 메모리에 저장
    store_matrix_sync(c_ptr + /* offset */, c_frag, N, mem_row_major);
}

실제로는 WMMA API보다 cuBLAS, cuBLASLt, 또는 CUTLASS가 Tensor Core를 더 효율적으로 활용하지만, 이 코드는 Tensor Core가 어떻게 동작하는지 이해하는 데 적합하다.

7. Grid, Block, Thread 계층: CUDA 프로그래밍 모델

CUDA 커널을 호출할 때 프로그래머는 **실행 구성(Execution Configuration)**을 지정한다.

CUDA 실행 계층:

Grid (전체 커널 실행)
├── Block (0,0): 256 스레드
│   ├── Warp 0: Thread 0~31 ── SM에서 동시 실행
│   ├── Warp 1: Thread 32~63
│   └── ...
├── Block (1,0): 256 스레드
├── Block (2,0): 256 스레드
└── ...

각 Block은 하나의 SM에 할당됨.
SM은 동시에 여러 Block을 처리할 수 있음 (occupancy).

PyTorch는 이 모든 것을 감춘다:

import torch

A = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
B = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)

# 이 한 줄 내부에서 일어나는 일:
# 1. cuBLAS 또는 CUTLASS 커널 선택
# 2. Grid: (256, 256), Block: (16, 16) 등으로 구성
# 3. Tensor Core 활용한 tiled GEMM 실행
# 4. 총 ~4096개 Thread Block, 각 16*16=256 스레드
result = torch.matmul(A, B)  # → 내부적으로 cublasSgemmEx 호출

Occupancy 최적화: SM 하나가 동시에 처리할 수 있는 Warp 수는 레지스터와 공유 메모리 사용량에 따라 결정된다. Block 크기가 너무 크면 한 SM에 한 Block만 올라가 SM 활용률이 낮아진다.

SM 리소스 한계 (H100 기준):
- 최대 동시 Thread Block: 32개
- 최대 동시 스레드: 2048개
- 레지스터: 65536개 (스레드 수 × 스레드당 레지스터로 나눔)
- 공유 메모리: 228KB (Block들이 나눠 씀)

Occupancy = 실제 활성 Warp 수 / 최대 가능 Warp 수
목표: 50% 이상 (메모리 레이턴시 숨기기 위해)

8. CUDA의 역사: 왜 NVIDIA가 AI를 독점했는가

CUDA의 탄생과 성장을 이해하면, 왜 2024년에도 NVIDIA가 AI 칩 시장의 80% 이상을 차지하는지 알 수 있다.

2006년 이전: OpenGL 셰이더 해킹의 시대

GPU가 빠르다는 것은 알았지만, GPU를 범용 계산에 쓰려면 OpenGL 텍스처 샘플링이나 셰이더 언어로 수식을 "위장"해야 했다. 행렬 곱셈을 이미지 필터로 속여서 GPU에 올리는 방식이었다.

2006년: CUDA 1.0 출시

Jensen Huang은 "GPU를 C 언어로 프로그래밍할 수 있게 한다"는 도박을 했다. 당시 NVIDIA 이사회는 게임 GPU에 컴퓨팅 하드웨어를 추가하는 결정을 승인해야 했다.

초기 CUDA 사용자: 물리 시뮬레이션 연구자, 유체역학 HPC 팀, 분자 동역학 연구자. 딥러닝은 아직 등장 전이었다.

2012년: AlexNet의 순간

Alex Krizhevsky가 GTX 580 두 장으로 AlexNet을 학습시켜 ImageNet에서 압도적 1위를 차지했다. 이 사건이 딥러닝 혁명의 시작이었다. GTX 580은 소비자용 게임 GPU였지만, CUDA 덕분에 병렬 컴퓨팅 플랫폼이 되었다.

CUDA vs OpenCL: 생태계 전쟁

Khronos Group의 OpenCL은 AMD, Intel, NVIDIA 모두를 지원하는 개방 표준이었다. 그럼에도 CUDA가 이긴 이유:

개발자 도구: cuDNN, cuBLAS, Nsight 프로파일러 등 NVIDIA의 생태계가 압도적으로 풍부했다
선점 효과: 연구자들이 CUDA 코드를 공유하면서 논문 코드가 전부 CUDA 기반이 됨
하드웨어 최적화: Tensor Core, NVLink 등 NVIDIA 전용 기능은 CUDA로만 접근 가능

결과: PyTorch, TensorFlow, JAX 모두 기본적으로 CUDA를 사용한다. AMD의 ROCm, Intel의 oneAPI가 따라오고 있지만 아직 격차가 크다.

9. 실전 최적화: H100에서 GEMM 최대 성능 뽑기

H100의 이론상 FP16 Tensor Core 성능은 989 TFLOPS이다. 실제로 접근하려면:

import torch
import time

# 1. dtype 선택: FP16 또는 BF16 사용 (Tensor Core 활성화)
A = torch.randn(8192, 8192, device='cuda', dtype=torch.bfloat16)
B = torch.randn(8192, 8192, device='cuda', dtype=torch.bfloat16)

# 2. cuBLAS를 위한 워밍업 (첫 실행은 커널 컴파일/선택에 시간 소요)
for _ in range(5):
    _ = torch.matmul(A, B)
torch.cuda.synchronize()

# 3. 측정
start = time.perf_counter()
for _ in range(100):
    C = torch.matmul(A, B)
torch.cuda.synchronize()
end = time.perf_counter()

# 4. TFLOPS 계산
# 8192x8192 행렬 곱: 2 * N^3 = 2 * 8192^3 = 1.1e12 FLOP
flops = 2 * 8192**3 * 100
elapsed = end - start
tflops = flops / elapsed / 1e12
print(f"달성 성능: {tflops:.1f} TFLOPS")
# 목표: 700+ TFLOPS (이론값 989의 70%+ 달성 가능)

성능을 좌우하는 요인들:

메모리 정렬: 행렬이 128바이트 경계에 정렬되어야 최적 메모리 접근
행렬 크기: 16의 배수여야 Tensor Core 패딩 없이 사용 가능
torch.backends.cudnn.benchmark = True: cuDNN이 최적 알고리즘을 자동 선택
NVLink: 멀티 GPU 환경에서 GPU 간 통신 대역폭 (H100 NVLink = 900 GB/s)

10. 정리: GPU가 AI를 지배하는 이유

GPU가 AI 계산을 지배하는 이유는 단순히 "코어가 많아서"가 아니다. 세 가지가 결합된 결과다:

하드웨어 설계: 수천 개의 단순 코어 + HBM 고대역폭 메모리 + Tensor Core 행렬 전용 하드웨어
프로그래밍 모델: CUDA의 SIMT 모델 + 계층적 메모리(공유 메모리, 레지스터) + 정밀한 제어
생태계: 18년간 쌓인 cuDNN, cuBLAS, NCCL, Nsight, PyTorch/TF 통합

딥러닝 연산의 95%는 행렬 곱셈으로 귀결되고, H100의 Tensor Core는 이를 위해 완전히 최적화된 하드웨어다. 이것이 바로 NVIDIA가 AI 인프라의 핵심을 쥐고 있는 이유다.

다음 포스트에서는 cuDNN이 컨볼루션을 내부에서 어떻게 처리하는지, Winograd 알고리즘과 im2col 변환, FlashAttention의 타일링 트릭을 해부한다.

NVIDIA GPU and CUDA Architecture Deep Dive: Why GPUs Dominate AI

Why Every AI Engineer Needs to Understand GPU Internals
1. CPU vs GPU: Fundamentally Different Design Philosophies
2. GPU Internal Hierarchy: Device Down to CUDA Core
3. SIMT: Single Instruction, Multiple Threads
- Warp: The Fundamental Unit of GPU Execution
- Warp Divergence: The Performance Killer
4. CUDA Memory Hierarchy: Latency vs Bandwidth Tradeoffs
5. Parallelizing Matrix Multiplication: The Core of CUDA
- Naive Implementation: Hammering Global Memory
- Tiled Implementation: Using Shared Memory to Minimize Global Accesses
6. Tensor Cores: Dedicated Matrix Multiply Hardware
7. Grid, Block, Thread: The CUDA Execution Hierarchy
8. The History of CUDA: Why NVIDIA Owns AI Infrastructure
9. Practical Optimization: Extracting Peak GEMM Performance on H100
10. Summary: Why GPUs Dominate AI

Why Every AI Engineer Needs to Understand GPU Internals

Running GPT-4 inference requires hundreds of A100 GPUs operating simultaneously. Fine-tuning LLaMA-3 demands an H100 cluster. At the heart of all that computation sits NVIDIA's GPU architecture and the CUDA programming model. Yet surprisingly few engineers can precisely explain why GPUs are so fast. This post dissects H100 specifications, the SIMT execution model, warp divergence, shared memory tiling, and Tensor Cores — all backed by real CUDA code.

1. CPU vs GPU: Fundamentally Different Design Philosophies

The difference between a CPU and GPU is not merely "more cores." The underlying design philosophy is fundamentally different.

CPU: Latency Optimization

A CPU is designed to execute a single thread as fast as possible. To achieve this, it incorporates:

Branch Prediction: Predicts if/else outcomes ahead of time to prevent pipeline stalls
Out-of-Order Execution: Executes independent instructions out of program order to keep the pipeline full
Speculative Execution: Computes down predicted branches before knowing if they're taken (the mechanism behind Spectre/Meltdown)
Massive Caches: Tens of megabytes of L1/L2/L3 cache to hide memory latency for a single thread

GPU: Throughput Optimization

A GPU is designed to maximize total throughput when thousands of threads run simultaneously. Individual thread latency is traded away — instead, when one thread stalls waiting for memory, the hardware instantly switches to another thread. Latency is hidden by parallelism.

CPU Architecture (latency-optimized):
┌────────────────────────────────────────────┐
│  Core 0  │  Core 1  │  Core 2  │  Core 3  │
│ (powerful)│(powerful)│(powerful)│(powerful)│
│ OOO exec │ OOO exec │ OOO exec │ OOO exec │
├──────────┴──────────┴──────────┴───────────┤
│           L3 Cache (32MB+)                 │
├────────────────────────────────────────────┤
│           Main Memory (DDR5, ~50 GB/s)     │
└────────────────────────────────────────────┘

GPU Architecture (throughput-optimized) — H100:
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│SM│SM│SM│SM│SM│SM│SM│SM│SM│SM│...│  132 SMs
│  │  │  │  │  │  │  │  │  │  │   │  each SM = 128 CUDA cores
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴───┘
       L2 Cache (50MB)
       HBM3 (80GB, 3.35 TB/s)

The key insight: a CPU tries to finish 1 thread in 1 ns. A GPU accepts that each thread might take 700 ns, but runs 100,000 threads simultaneously — total throughput wins overwhelmingly.

2. GPU Internal Hierarchy: Device Down to CUDA Core

The H100's interior is organized hierarchically. Let's descend from the top-level Device all the way to individual CUDA Cores.

GPU Device (H100 SXM5)
│
├── SM 0 (Streaming Multiprocessor)
│   ├── 4x Warp Schedulers
│   ├── 4x Dispatch Units
│   ├── 128x CUDA Cores (FP32)
│   ├── 64x FP64 Cores
│   ├── 4x Tensor Cores (4th gen: FP8/FP16/BF16/TF32)
│   ├── 1x Special Function Unit (SFU: sin, cos, sqrt)
│   ├── Load/Store Units (LD/ST)
│   ├── L1 Cache + Shared Memory = 228KB (ratio configurable)
│   └── Register File (65536 x 32-bit registers)
│
├── SM 1 ...
├── SM 2 ...
│   ...
├── SM 131 ...
│
├── L2 Cache (50MB)
└── HBM3 Memory (80GB, 3.35 TB/s bandwidth)

Key H100 SXM5 Numbers:

Spec	Value
SM Count	132
CUDA Cores per SM	128 (FP32)
Total CUDA Cores	16,896
Tensor Cores per SM	4 (4th gen)
Total Tensor Cores	528
L2 Cache	50MB
HBM3 Memory	80GB
Memory Bandwidth	3.35 TB/s
FP16 Tensor Core Perf	989 TFLOPS

3. SIMT: Single Instruction, Multiple Threads

The GPU's core execution model is SIMT (Single Instruction Multiple Threads). It resembles the CPU's SIMD (Single Instruction Multiple Data) but has a crucial difference: threads in SIMT have their own program counter and registers, giving the illusion of independent threads while sharing execution hardware.

Warp: The Fundamental Unit of GPU Execution

In CUDA, a Warp is a group of 32 threads that always execute the same instruction simultaneously. The Warp Scheduler inside each SM dispatches instructions at the warp granularity.

Thread Block (example: 256 threads)
├── Warp 0: Threads  0-31  → execute same instruction in lockstep
├── Warp 1: Threads 32-63  → execute same instruction in lockstep
├── Warp 2: Threads 64-95  → execute same instruction in lockstep
├── ...
└── Warp 7: Threads 224-255

Warp Divergence: The Performance Killer

When threads within a warp take different branches, Warp Divergence occurs. Because SIMT requires all 32 threads to execute the same instruction, the GPU serializes the divergent paths:

// This code causes warp divergence
__global__ void divergent_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0) {
        // Even-indexed threads in the warp execute this
        data[idx] = data[idx] * 2.0f;  // Step 1
    } else {
        // Odd-indexed threads in the warp execute this
        data[idx] = data[idx] + 1.0f;  // Step 2
    }
    // Result: During Step 1, odd threads are masked off (idle)
    //         During Step 2, even threads are masked off (idle)
    // Effective throughput: 50% of theoretical
}

Compare with a divergence-free version:

// No divergence: all threads in a warp take the same path
__global__ void coherent_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // All threads in a block share the same blockIdx.x,
    // so all 32 threads in a warp take the same branch
    if (blockIdx.x % 2 == 0) {
        data[idx] = data[idx] * 2.0f;
    } else {
        data[idx] = data[idx] + 1.0f;
    }
}

Practical rule: Design branch conditions based on blockIdx rather than threadIdx to avoid divergence within warps.

4. CUDA Memory Hierarchy: Latency vs Bandwidth Tradeoffs

Most CUDA kernel performance comes down to memory access patterns. Understanding the characteristics of each memory level is essential.

Memory Hierarchy (fast → slow):

┌─────────────────────────────────────────────────────┐
│  Registers (32-bit, 65536 per SM)                   │
│  Latency: ~1 cycle     BW: enormous (local access)  │
│  Scope: single thread only                          │
├─────────────────────────────────────────────────────┤
│  L1 Cache / Shared Memory (228KB/SM on H100)        │
│  Latency: ~32 cycles   BW: ~19 TB/s                 │
│  Scope: all threads in a Thread Block               │
├─────────────────────────────────────────────────────┤
│  L2 Cache (50MB on H100)                            │
│  Latency: ~200 cycles  BW: ~12 TB/s                 │
│  Scope: all SMs share                               │
├─────────────────────────────────────────────────────┤
│  Global Memory HBM3 (80GB on H100)                  │
│  Latency: ~700 cycles  BW: 3.35 TB/s                │
│  Scope: all threads on the device                   │
└─────────────────────────────────────────────────────┘

Memory Type	Latency	Bandwidth	Size	Scope
Registers	~1 cycle	enormous	256KB/SM	Single thread
Shared Memory	~32 cycles	~19 TB/s	228KB/SM	Thread Block
L2 Cache	~200 cycles	~12 TB/s	50MB	Whole GPU
HBM3 (Global)	~700 cycles	3.35 TB/s	80GB	Whole GPU

Shared memory is a programmer-managed cache. Declaring variables with __shared__ puts them in shared memory, accessible by all threads in the same Thread Block, with over 20x lower latency than HBM.

5. Parallelizing Matrix Multiplication: The Core of CUDA

Matrix multiplication (GEMM: General Matrix Multiply) is the dominant operation in deep learning. Let's walk through how to parallelize it on GPU step by step.

Naive Implementation: Hammering Global Memory

// Naive matrix multiply kernel — understand WHY it's slow
__global__ void matmul_naive(float* A, float* B, float* C, int N) {
    // Each thread computes one element of the output matrix
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // Two global memory reads per iteration → ~700 cycles latency each
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Launch configuration
dim3 blockDim(16, 16);  // 256 threads per block
dim3 gridDim((N + 15) / 16, (N + 15) / 16);
matmul_naive<<<gridDim, blockDim>>>(A, B, C, N);

The problem: for N=4096, each thread reads from global memory 4096 times. Total global memory reads = N^3 = 68 billion. With 700-cycle HBM latency, performance is terrible.

Tiled Implementation: Using Shared Memory to Minimize Global Accesses

#define TILE_SIZE 16

// Tiled matrix multiply using shared memory
__global__ void matmul_tiled(float* A, float* B, float* C, int N) {
    // Declare tiles in shared memory (~32-cycle latency vs 700 for HBM)
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;
    float sum = 0.0f;

    // Iterate over the K dimension in tile-sized chunks
    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Step 1: Each thread loads one element into shared memory
        tile_A[ty][tx] = A[row * N + t * TILE_SIZE + tx];
        tile_B[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];

        // Step 2: Wait for ALL threads to finish loading
        __syncthreads();

        // Step 3: Compute partial dot product from this tile
        // All accesses here hit shared memory (~32 cycles)
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tile_A[ty][k] * tile_B[k][tx];
        }

        // Step 4: Sync before loading the next tile
        __syncthreads();
    }

    C[row * N + col] = sum;
}

Why is it faster?

With tile size 16×16:

Naive: global memory accesses = 2 × N^3
Tiled: global memory accesses = 2 × N^3 / TILE_SIZE = 16x reduction

By loading a tile into shared memory once, all 16 threads along the K dimension reuse the same data. Memory bandwidth efficiency improves dramatically.

Tiling principle:

Matrix A:         Matrix B:
┌──┬──┬──┬──┐  ┌──┬──┬──┬──┐
│T0│T1│T2│T3│  │T0│T1│T2│T3│
├──┼──┼──┼──┤  ├──┼──┼──┼──┤
│  │  │  │  │  │  │  │  │  │
└──┴──┴──┴──┘  └──┴──┴──┴──┘

Each Thread Block owns one 16×16 output tile of C.
It iterates over A's row tiles and B's column tiles,
loading each pair into shared memory before computing.
Result: global memory traffic reduced by TILE_SIZE factor.

6. Tensor Cores: Dedicated Matrix Multiply Hardware

Where a CUDA core processes one FP32 scalar multiply-add per cycle, a Tensor Core processes an entire 16×16 matrix multiply in a single instruction.

4th Generation Tensor Core (H100):

Operation	Tensor Core	CUDA Core	Speedup
FP16 matrix multiply	256 ops/cycle	2 ops/cycle	128x
BF16 matrix multiply	256 ops/cycle	not supported	—
FP8 matrix multiply	512 ops/cycle	not supported	—
TF32 matrix multiply	128 ops/cycle	not supported	—

The WMMA (Warp Matrix Multiply-Accumulate) API exposes Tensor Cores directly:

// Direct Tensor Core access via WMMA API
#include <mma.h>
using namespace nvcuda::wmma;

__global__ void tensor_core_matmul(half* a_ptr, half* b_ptr,
                                    float* c_ptr, int M, int N, int K) {
    // Each warp handles one 16x16 output tile
    // fragment = a matrix tile that the Tensor Core operates on
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;

    // Initialize accumulator to zero
    fill_fragment(c_frag, 0.0f);

    // Iterate over K dimension
    for (int k = 0; k < K; k += 16) {
        // Load fragments from global memory
        load_matrix_sync(a_frag, a_ptr + /* offset */, K);
        load_matrix_sync(b_frag, b_ptr + /* offset */, N);

        // ONE instruction: 16x16x16 matrix multiply-accumulate
        // Internally dispatches to Tensor Core hardware
        mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Store result back to global memory
    store_matrix_sync(c_ptr + /* offset */, c_frag, N, mem_row_major);
}

In practice, cuBLAS, cuBLASLt, and CUTLASS exploit Tensor Cores far more efficiently than manual WMMA code, but this demonstrates exactly what Tensor Cores do.

7. Grid, Block, Thread: The CUDA Execution Hierarchy

When launching a CUDA kernel, the programmer specifies an execution configuration describing the grid and block dimensions.

CUDA Execution Hierarchy:

Grid (entire kernel launch)
├── Block (0,0): 256 threads
│   ├── Warp 0: Threads  0-31 ── execute in lockstep on SM
│   ├── Warp 1: Threads 32-63
│   └── ...
├── Block (1,0): 256 threads
├── Block (2,0): 256 threads
└── ...

Each Block is assigned to exactly one SM.
One SM can host multiple Blocks simultaneously (occupancy).

PyTorch hides all of this behind a single operator:

import torch

A = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
B = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)

# What happens inside this one line:
# 1. cuBLAS or CUTLASS selects the optimal kernel
# 2. Execution config: e.g., Grid(256,256), Block(16,16)
# 3. Tiled GEMM using Tensor Cores
# 4. ~4096 Thread Blocks, each with 256 threads
result = torch.matmul(A, B)  # internally calls cublasSgemmEx

Occupancy Optimization: The number of concurrent warps an SM can host depends on register and shared memory usage per block. If a block uses too many resources, only one block fits per SM, leaving execution units underutilized.

SM Resource Limits (H100):
- Max concurrent Thread Blocks: 32
- Max concurrent threads: 2048
- Registers: 65536 total (divided among all threads in resident warps)
- Shared Memory: 228KB (divided among resident blocks)

Occupancy = Active Warps / Maximum Possible Warps
Goal: 50%+ (needed to hide memory latency through warp switching)

8. The History of CUDA: Why NVIDIA Owns AI Infrastructure

Understanding CUDA's rise explains why, even in 2024, NVIDIA holds over 80% of the AI accelerator market.

Pre-2006: The OpenGL Shader Hacking Era

Researchers knew GPUs were fast, but using them for general computation required encoding calculations as OpenGL texture operations or shader programs. Matrix multiplication had to be disguised as an image filter to run on GPU.

2006: CUDA 1.0 Launch

Jensen Huang made a bet: "We'll let programmers write GPU code in C." At the time, this meant adding general-purpose computing logic to hardware designed for gaming. The NVIDIA board had to approve dedicating silicon area to this experiment.

Early CUDA users: physics simulation researchers, CFD teams, molecular dynamics groups. Deep learning hadn't emerged yet.

2012: The AlexNet Moment

Alex Krizhevsky used two GTX 580 consumer gaming GPUs to train AlexNet, which crushed ImageNet competition with a massive accuracy lead. This single event triggered the deep learning revolution. GTX 580 was a gaming GPU, but CUDA made it a parallel computing platform.

CUDA vs OpenCL: The Ecosystem War

The Khronos Group's OpenCL supported AMD, Intel, and NVIDIA through an open standard. CUDA won despite being proprietary, for three reasons:

Developer tools: cuDNN, cuBLAS, Nsight profiler — NVIDIA's ecosystem was far richer
First-mover effect: Research code published in papers was all CUDA, so labs standardized on it
Hardware exclusivity: Tensor Cores, NVLink, and other features are only accessible through CUDA

Result: PyTorch, TensorFlow, and JAX all use CUDA by default. AMD's ROCm and Intel's oneAPI are catching up, but the gap remains substantial.

9. Practical Optimization: Extracting Peak GEMM Performance on H100

The H100's theoretical FP16 Tensor Core performance is 989 TFLOPS. Here's how to approach that number:

import torch
import time

# 1. Use FP16 or BF16 to activate Tensor Cores
A = torch.randn(8192, 8192, device='cuda', dtype=torch.bfloat16)
B = torch.randn(8192, 8192, device='cuda', dtype=torch.bfloat16)

# 2. Warmup: first runs include kernel compilation/selection
for _ in range(5):
    _ = torch.matmul(A, B)
torch.cuda.synchronize()

# 3. Measure
start = time.perf_counter()
for _ in range(100):
    C = torch.matmul(A, B)
torch.cuda.synchronize()
end = time.perf_counter()

# 4. Compute TFLOPS
# 8192x8192 matmul: 2 * N^3 = 2 * 8192^3 = 1.1e12 FLOPs per call
flops = 2 * 8192**3 * 100
elapsed = end - start
tflops = flops / elapsed / 1e12
print(f"Achieved: {tflops:.1f} TFLOPS")
# Target: 700+ TFLOPS (70%+ of 989 TFLOPS theoretical is achievable)

Factors that determine performance:

Memory alignment: Matrices must be 128-byte aligned for optimal memory coalescing
Matrix dimensions: Must be multiples of 16 for Tensor Cores to operate without padding
torch.backends.cudnn.benchmark = True: Lets cuDNN auto-select the fastest algorithm on first run
NVLink: In multi-GPU setups, GPU-to-GPU communication bandwidth (H100 NVLink = 900 GB/s) matters for model parallelism

10. Summary: Why GPUs Dominate AI

GPU's dominance over AI computation is not simply "more cores." It's the combination of three factors:

Hardware design: Thousands of simple cores + HBM high-bandwidth memory + Tensor Cores as dedicated matrix multiply hardware
Programming model: CUDA's SIMT model + hierarchical memory (shared memory, registers) + fine-grained control
Ecosystem: 18 years of accumulated cuDNN, cuBLAS, NCCL, Nsight, PyTorch/TF integration

95% of deep learning operations reduce to matrix multiplication, and the H100's Tensor Cores are purpose-built hardware for exactly that. This is why NVIDIA holds the critical infrastructure position for the AI era.

The next post dissects cuDNN's internals: how it handles convolutions using Winograd's algorithm and im2col transformation, the FlashAttention tiling trick, and why torch.backends.cudnn.benchmark = True can double your training throughput.