CUDA GPU Programming Model Deep Dive — SIMT, Memory Hierarchy, Tensor Core, Kernel Optimization (2025)

TL;DR

  • CUDA is NVIDIA's GPU programming platform. Since 2007, the de facto standard for AI/HPC. All major AI workloads (ChatGPT, Stable Diffusion) run on CUDA.
  • SIMT (Single Instruction, Multiple Threads): programmers write scalar code, hardware bundles 32 threads = 1 warp for parallel execution. SIMD efficiency plus scalar programming productivity.
  • Thread hierarchy: Grid, Block, Warp, Thread. Each block executes on the same SM (Streaming Multiprocessor). Warp-level scheduling.
  • Memory hierarchy: Register (fastest), Shared Memory (on-chip, same silicon as L1), L2 Cache, Global Memory (slow but big). Adjacent levels differ in latency by factors of tens to hundreds.
  • Memory Coalescing: if a warp's 32 threads access contiguous addresses, it becomes 1 transaction. Random access becomes 32 transactions. 32x bandwidth difference — the #1 CUDA optimization.
  • Warp Divergence: threads within a warp taking different paths are executed serially, losing parallelism.
  • Tensor Core (Volta, 2017): matrix multiply as a single instruction. 125 TFLOPS at FP16. The key to AI acceleration.
  • Modern trend: higher-level abstractions like Triton, cutlass, and FlashAttention replace hand-tuned CUDA.
  • Alternatives: ROCm/HIP (AMD), SYCL/oneAPI (Intel), Metal (Apple). Ecosystem gap keeps CUDA dominant.

1. Why GPUs Dominate AI

1.1 CPU vs GPU

Item            CPU                      GPU
--------------  -----------------------  ------------------------------
Core count      8-128                    Thousands to tens of thousands
Per-core perf   Very fast                Relatively slow
Cache           Large L1/L2/L3           Small shared cache
Parallelism     ILP + MIMD               SIMT (massive)
Purpose         General, branchy logic   Math, data-parallel

CPU: "A few smart cores handle complex work." GPU: "Many simple cores do the same work in parallel."

1.2 Nature of AI Workloads

One Transformer inference pass:

MatMul x thousands
Element-wise ops (GELU, LayerNorm)
Attention (QK^T, softmax, @V)

All are large-scale matrix operations — billions of MAC (multiply-accumulate) ops per pass. CPUs deliver roughly 100-1,000 GFLOPS; an H100 delivers on the order of 1,000-2,000 TFLOPS with Tensor Cores. A ~1,000x gap.

1.3 NVIDIA's Monopoly

Three factors:

  1. CUDA ecosystem: libraries, docs, community accumulated since 2007.
  2. Hardware investment: Tensor Core, HBM, NVLink, NVSwitch.
  3. Jensen Huang's long bet: all-in on AI since AlexNet (2012).

AMD (MI300), Intel (Gaudi), Google (TPU), Apple (M-series), China (Huawei Ascend) all compete, but as of 2025 CUDA compatibility and maturity keep NVIDIA dominant.


2. GPU Hardware Architecture

2.1 Streaming Multiprocessor (SM)

The basic GPU unit is the SM. Each SM is an independent execution unit.

H100:   132 SM
A100:   108 SM
RTX 4090: 128 SM

Inside each SM:

+---------------------------------+
|           SM                    |
|  +---------+  +---------+       |
|  | Warp    |  | Warp    | x 4   |
|  | Sched   |  | Sched   |       |
|  +---------+  +---------+       |
|                                 |
|  CUDA Cores (INT32, FP32, FP64) |
|  Tensor Cores                   |
|  Special Function Units         |
|  Load/Store Units               |
|                                 |
|  Register File (tens of thousands) |
|  Shared Memory / L1 (hundreds of KB) |
+---------------------------------+

2.2 Warp

A warp = 32 threads executed in lockstep. The fundamental NVIDIA GPU unit. A100 runs up to 64 warps (= 2048 threads) per SM. Warp schedulers pick runnable warps each cycle to hide latency.

2.3 SIMT — The Programming Model

SIMT (Single Instruction, Multiple Threads): programmers write code as if for a single thread; hardware bundles 32 threads and issues one instruction over different data.

SIMT differs from SIMD:

SIMD (e.g., AVX)                 SIMT (CUDA)
-------------------------------  ----------------------------------
Explicit vector types            Scalar code
8 floats per instruction         32 threads x 1 float each per warp
Branching is hard                Branches allowed (divergence cost)
Harder to develop                Natural to write

2.4 Full GPU Structure

Host (CPU + DRAM)
     | PCIe / NVLink
GPU
+-- L2 Cache (tens of MB, global)
+-- Global Memory (HBM, tens-hundreds of GB)
+-- SM x N
    +-- L1 / Shared Memory
    +-- Register File
    +-- Cores

Global Memory (HBM): shared across GPU, 40-192 GB, ~500-cycle latency, but huge bandwidth (3 TB/s on H100).

2.5 Access Latency

Approximate (A100):

Register:       1 cycle
Shared Memory:  ~20 cycles
L1 cache:       ~30 cycles
L2 cache:       ~200 cycles
HBM:            ~500 cycles

Host DRAM via PCIe: ~10,000+ cycles. Data locality is everything.


3. CUDA Programming Model

3.1 Hello World

#include <stdio.h>

__global__ void hello_kernel() {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Hello from thread %d\n", tid);
}

int main() {
    hello_kernel<<<2, 32>>>();  // 2 blocks x 32 threads = 64 threads
    cudaDeviceSynchronize();
    return 0;
}
  • __global__: a kernel executed on the GPU.
  • <<<grid, block>>>: launch config — grid size x block size.
  • threadIdx.x, blockIdx.x, blockDim.x: index/size intrinsics.

3.2 Thread Hierarchy

Grid contains Blocks; Blocks contain Threads. Each block executes entirely on one SM, and threads within a block share that SM's shared memory. 2D/3D launches are supported:

dim3 grid(16, 16);
dim3 block(32, 32);
kernel<<<grid, block>>>();
// 16x16x32x32 = 262,144 threads

3.3 Global Index

__global__ void add_kernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}
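
Launching this kernel means rounding the grid size up so every element gets a thread. A minimal host-side sketch (d_a, d_b, d_c are assumed device allocations):

int threads = 256;
int blocks = (n + threads - 1) / threads;        // ceil(n / threads)
add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);
// The idx < n guard in the kernel discards the padding threads of the last block.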

3.4 Memory Management

float *d_a;
cudaMalloc(&d_a, n * sizeof(float));
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(d_a);
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_a);

PCIe copies are expensive — minimize them.
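
Every CUDA runtime call also returns an error code, and silently ignoring it hides failures. A checking macro is common practice — a sketch (the CUDA_CHECK name is ours, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_a, n * sizeof(float)));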

3.5 Unified Memory

float *data;
cudaMallocManaged(&data, n * sizeof(float));

Runtime handles page faults, moving pages between host and device. Easier, slightly slower than manual.
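
If the access pattern is known ahead of time, pages can be prefetched to avoid fault-driven migration — a sketch (device 0 and the default stream assumed):

// Prefetch to the GPU before the kernel touches the data...
cudaMemPrefetchAsync(data, n * sizeof(float), 0 /* device 0 */, 0);
kernel<<<grid, block>>>(data);
// ...and back to the host before CPU code reads it.
cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId, 0);
cudaDeviceSynchronize();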


4. Memory Hierarchy

4.1 Global Memory

Largest (40-192 GB), shared GPU-wide, ~hundreds of cycles.

4.2 Shared Memory

Inside SM (same physical as L1). Very fast (tens of cycles). Shared across threads in a block. Limited (~48-228 KB per SM).

__global__ void kernel() {
    __shared__ float shared_data[256];
    shared_data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    // All threads can read shared_data
}

Use case: reuse data that multiple threads access repeatedly.
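
A classic example is a block-level reduction: each thread loads one element into shared memory, then the block sums it in log2(blockDim.x) steps. A sketch assuming blockDim.x == 256:

__global__ void block_sum(const float *in, float *partial, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (idx < n) ? in[idx] : 0.0f;        // stage data in shared memory
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];   // one partial sum per block
}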

4.3 Register

Fastest (1 cycle), thread-local, compiler-allocated. Using too many registers per thread causes register spilling to local memory, which is backed by slow global memory.

4.4 Constant Memory

__constant__ float coefficients[256];

64 KB, read-only, fast broadcast when all warp threads read the same value.
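
A sketch of typical usage — the host fills the constant array once (h_coeffs is an assumed host array), and since every thread in a warp reads the same coefficient per step, each read is a single broadcast:

// Host: copy 4 polynomial coefficients into constant memory
cudaMemcpyToSymbol(coefficients, h_coeffs, 4 * sizeof(float));

__global__ void eval_poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 4; ++k)                  // Horner's scheme
            acc = acc * x[i] + coefficients[k];      // same address across the warp -> broadcast
        y[i] = acc;
    }
}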

4.5 Texture Memory

Originally for graphics; offers 2D spatial locality and interpolation. Modern code often uses __ldg() instead.
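
A minimal sketch of the __ldg() path — the load is routed through the read-only data cache (the kernel must not write to in):

__global__ void scale(const float * __restrict__ in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * __ldg(&in[i]);  // read-only cache load
}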


5. Memory Coalescing — The Top Optimization

5.1 Principle

Coalesced: 32 threads access contiguous 128 bytes -> 1 memory transaction.

Uncoalesced: threads hit scattered addresses -> 32 transactions, only 4 of 128 bytes per transaction used.

32x bandwidth difference. Real GPU performance is decided here.

5.2 Example

BAD:

__global__ void transpose_bad(float *in, float *out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[x * N + y] = in[y * N + x];
}

Stride N access -> uncoalesced.

GOOD (via shared memory):

__global__ void transpose_good(float *in, float *out, int N) {
    __shared__ float tile[32][33];  // +1 to avoid bank conflict
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];
}

20-30x speedup.

5.3 Pattern

"Thread i accesses array[i]." That is the coalescing rule.


6. Warp Divergence

If threads within a warp take different paths, hardware executes both paths serially:

if (threadIdx.x < 16) compute_a();
else compute_b();

-> 50% parallelism lost.

Mitigation:

  1. Branch at warp or block granularity (e.g., on threadIdx.x / 32 or blockIdx.x) so all 32 threads of a warp take the same path.
  2. Replace branches with arithmetic (fabs, ternary select).
  3. Compute both paths, select the result (see the sketch after this list).
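
A sketch of option 3 (the two path computations are illustrative): every thread executes both paths, and the ternary typically lowers to a predicated select instead of a branch, so the warp never serializes:

__global__ void select_kernel(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = x[i] * 2.0f;          // path A
        float b = x[i] * 0.5f;          // path B
        y[i] = (x[i] > 0.0f) ? a : b;   // select, no divergent branch
    }
}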

Volta (2017) introduced Independent Thread Scheduling, but avoiding divergence remains the guideline.


7. Shared Memory Bank Conflict

Shared memory is split into 32 banks of 32-bit words.

  • Different threads hit different banks -> 1 cycle.
  • Same bank, different words -> N-way conflict -> N cycles.

__shared__ float data[32][32];
float y = data[threadIdx.x][0];  // column access: all 32 threads hit bank 0 -> 32-way conflict

Fix with padding: __shared__ float data[32][33]; — the extra column shifts each row by one bank, so column accesses spread across all 32 banks.

Broadcast (same word, many threads) is free.


8. Occupancy

Occupancy = active warps / max warps per SM. A100 max is 64 warps.

High occupancy hides latency via warp switching. Limits:

  1. Registers per thread.
  2. Shared memory per block.
  3. Threads per block.

__launch_bounds__ hints the compiler to cap register usage so at least the requested number of blocks fits per SM:

__global__ __launch_bounds__(256, 4)
void kernel() { /* ... */ }
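
Occupancy can also be queried programmatically — a sketch using the runtime occupancy API (my_kernel and the 256-thread block are illustrative assumptions):

int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, 256, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (float)(blocksPerSM * 256) / prop.maxThreadsPerMultiProcessor;
// e.g. 8 blocks x 256 threads = 2048 active threads -> 100% on A100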

High occupancy is not always best — Tensor Core code can peak at low occupancy. Measure with Nsight Compute.


9. Tensor Core — The Key to AI Acceleration

9.1 Volta (2017)

A V100 Tensor Core performs a 4x4x4 matrix multiply-accumulate per cycle: D = A x B + C — 64 fused multiply-adds per cycle, versus a CUDA core's 1 FMA. V100 FP16 Tensor Core throughput: 125 TFLOPS.

9.2 Evolution

  • V100 (Volta, 2017): FP16 -> FP32 matmul.
  • A100 (Ampere, 2020): TF32, BF16, INT8, 2:4 sparsity. 312 TFLOPS (BF16).
  • H100 (Hopper, 2022): FP8, Transformer Engine. ~2,000 TFLOPS (FP8, with sparsity).
  • B100/B200 (Blackwell, 2024): FP4, on the order of 10,000-20,000 TFLOPS (FP4).

9.3 Usage

Direct via wmma API:

#include <mma.h>
using namespace nvcuda;

__global__ void matmul_wmma(half *a, half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

Most developers use cuBLAS/cuDNN/PyTorch/Triton instead.

9.4 Mixed Precision

Low-precision input (FP16) + high-precision accumulation (FP32). Half memory, half bandwidth, Tensor Core speed, numerical safety. Automated by PyTorch's torch.cuda.amp.autocast() (AMP).
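
The pattern in plain CUDA — FP16 storage, FP32 accumulation — looks like the sketch below (a grid-stride dot product; out is assumed zero-initialized, and Tensor Cores apply the same idea in hardware for matmul):

#include <cuda_fp16.h>

__global__ void dot_mixed(const half *a, const half *b, float *out, int n) {
    float acc = 0.0f;                                    // FP32 accumulator
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x)
        acc += __half2float(a[i]) * __half2float(b[i]);  // FP16 loads, FP32 math
    atomicAdd(out, acc);                                 // combine partial sums
}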


10. Streams and Asynchronous Execution

A CUDA stream is an ordered queue of operations. Different streams can run in parallel.

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block, 0, stream2>>>(...);

Overlap copy and compute across streams. Async copies need pinned memory:

cudaMallocHost(&h_data, n * sizeof(float));
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);

Events synchronize across streams.
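
A sketch of event-based ordering between two streams (producer_kernel / consumer_kernel are placeholders):

cudaEvent_t ev;
cudaEventCreate(&ev);

producer_kernel<<<grid, block, 0, stream1>>>(...);
cudaEventRecord(ev, stream1);          // mark a completion point in stream1
cudaStreamWaitEvent(stream2, ev, 0);   // stream2 waits for that point
consumer_kernel<<<grid, block, 0, stream2>>>(...);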


11. CUDA Graph — Eliminate Launch Overhead

Each kernel launch has ~tens of microseconds overhead. Transformer inference issues hundreds of small kernels -> overhead dominates.

cudaGraph_t graph;
cudaGraphExec_t graphExec;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel1<<<..., stream>>>(...);
kernel2<<<..., stream>>>(...);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
cudaGraphLaunch(graphExec, stream);

Hundreds of launches reduce to one graph launch. 2-3x speedup on Transformer token generation. PyTorch 2.0+ CUDAGraphs automates this.


12. Multi-GPU and NCCL

LLM training exceeds a single GPU's memory. Parallel patterns: Data Parallel, Model Parallel, Pipeline Parallel — all need GPU-to-GPU communication.

  • NVLink: H100 = 900 GB/s.
  • NVSwitch: fully-connected GPU switch in DGX.
  • PCIe: ~64 GB/s.

NCCL provides AllReduce, AllGather, Broadcast, Reduce, AllToAll. Selects NVLink/PCIe/IB/Ethernet automatically. Default PyTorch distributed backend.
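
A single-process sketch of an AllReduce over two GPUs (buffer and stream setup elided; sendbuf, recvbuf, streams, and count are assumed per-device arrays/values):

#include <nccl.h>

ncclComm_t comms[2];
int devs[2] = {0, 1};
ncclCommInitAll(comms, 2, devs);               // one communicator per GPU

ncclGroupStart();                              // batch the per-GPU calls
for (int i = 0; i < 2; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);       // sum across both GPUs
ncclGroupEnd();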

P2P memcpy via NVLink:

cudaSetDevice(dev0);                   // run from dev0's context
cudaDeviceEnablePeerAccess(dev1, 0);   // allow dev0 to access dev1's memory
cudaMemcpyPeer(dst, dev1, src, dev0, size);

GPUDirect RDMA enables direct inter-node GPU communication.


13. High-Level Abstractions

  • cuBLAS / cuDNN / cuSPARSE / cuSOLVER: NVIDIA's standard math libraries.
  • Thrust: STL-like C++ GPU library.
  • CUB: optimized reduction/scan/sort building blocks.
  • cutlass: GEMM building blocks (used by FlashAttention).
  • Triton: OpenAI's Python-like kernel language.

Triton example:

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)

Triton is the mainstream choice for AI researchers post-2024. FlashAttention (Tri Dao) uses Tensor Core + shared memory + fused kernels for memory-efficient attention — critical for GPT/LLaMA training.


14. Profiling and Optimization

  • Nsight Systems: full pipeline timeline (nsys profile --stats=true ./my_app).
  • Nsight Compute: per-kernel deep analysis (occupancy, memory throughput, warp efficiency, roofline) (ncu --set full ./my_app).
  • PyTorch Profiler: framework-level view.

Roofline tells you if a kernel is compute-bound or memory-bound. Compute-bound -> Tensor Core, occupancy. Memory-bound -> coalescing, shared memory, reuse.


15. Competitors

  • AMD ROCm / HIP: near-CUDA source compatibility with hipify. MI300 hardware close to H100. Ecosystem still lags.
  • Intel oneAPI / SYCL: Khronos-based, cross-vendor. DPC++ portable but uneven performance.
  • Apple Metal: PyTorch MPS backend on macOS; unified memory on M-series.
  • WebGPU: compute shaders in the browser; small LLMs via WebGPU + Wasm.
  • Google TPU: JAX/XLA, Google Cloud-only, Transformer-optimized.

16. Learning Path

  1. Basics: CUDA C++ Programming Guide, simple kernels (vector add, matmul), nvcc, cudaMemcpy.
  2. Optimization: coalescing, shared memory, bank conflicts, Nsight Compute.
  3. Advanced: warp-level primitives (__shfl_sync), Tensor Core (wmma), CUDA Graph, multi-GPU (NCCL).
  4. Ecosystem: cuBLAS/cuDNN, Triton, PyTorch extensions, FlashAttention/cutlass reading.

Books: PMPP (Hwu/Kirk), CUDA by Example, Professional CUDA C Programming. Online: NVIDIA Developer Blog, GPU MODE, PMPP lectures.


17. Cheat Sheet

+------------------------------------------+
|          CUDA Cheat Sheet                |
+------------------------------------------+
| Hardware:                                |
|   GPU = N x SM                           |
|   SM = Cores + Tensor Cores + RegFile    |
|   H100: 132 SM                           |
|                                          |
| Execution (SIMT):                        |
|   Grid -> Block -> Warp (32) -> Thread   |
|   Scalar code; HW bundles 32 threads     |
|                                          |
| Memory Hierarchy:                        |
|   Register (fastest)                     |
|   Shared / L1 (tens of cycles)           |
|   L2 (~200 cycles)                       |
|   HBM (~500 cycles)                      |
|                                          |
| #1 Optimization: Memory Coalescing       |
|   Thread i -> array[i]                   |
|                                          |
| Pitfalls:                                |
|   Warp divergence, bank conflict,        |
|   register spill, uncoalesced access     |
|                                          |
| Tensor Core: Volta+, FP16/BF16/FP8/FP4   |
| Streams: overlap copy/compute            |
| CUDA Graph: remove launch overhead       |
| Multi-GPU: NVLink + NCCL + P2P + RDMA    |
| Libraries: cuBLAS/cuDNN/Thrust/cutlass/  |
|            Triton/FlashAttention         |
| Tools: nvcc, Nsight Systems/Compute      |
| Alternatives: ROCm/SYCL/Metal/WebGPU/TPU |
+------------------------------------------+

18. Quiz

Q1. What is the difference between SIMT and SIMD?

A. SIMD (e.g., AVX) requires explicit vector types like __m256; branches are hard, optimization needs compiler help. SIMT (CUDA) lets you write scalar code; hardware bundles 32 threads into a warp issuing one instruction over different data. "Write scalar, execute vector." Branches work naturally (at the cost of divergence). SIMT provides SIMD efficiency with scalar productivity — the decisive design that beat OpenCL and pure SIMD.

Q2. Why is memory coalescing the #1 CUDA optimization?

A. 32x bandwidth difference. A warp accessing a contiguous 128 bytes turns into 1 transaction; scattered access becomes 32 transactions, each wasting 124 of 128 bytes. HBM peaks at TB/s but drops to tens of GB/s when uncoalesced. Rule: thread i accesses array[i]. This is why matrix transpose is the textbook CUDA example — naive is uncoalesced; the optimized version uses shared memory as an intermediate so both read and write are coalesced.

Q3. Why does warp divergence hurt performance?

A. Warps run 32 threads in lockstep. With if (tid < 16) A() else B(), hardware serializes both paths: threads 0-15 run A() while 16-31 idle, then 16-31 run B() while 0-15 idle. Parallelism halves. Worst case is 32-way divergence, 32x slower. Mitigations: branch at warp boundaries (blockIdx.x % 2), replace with arithmetic (fabs, ternary), or compute both paths and select. Volta (2017)+ adds independent thread scheduling for flexibility, but minimizing divergence remains best.

Q4. What is a shared memory bank conflict?

A. Shared memory is split into 32 banks of 32-bit words (bank 0: words 0, 32, 64, ...; bank 1: words 1, 33, 65, ...). If 32 threads hit different banks -> 1 cycle. Same bank, different words -> N-way conflict. Classic example: data[32][32] column access puts all threads in bank 0 -> 32-way. Fix: pad to data[32][33] so columns fall across different banks. Broadcast (same word across threads) is free.

Q5. Why do Tensor Cores make AI 10-100x faster?

A. Matrix multiply is hardware-specialized. CUDA core: 1 FMA/cycle. Tensor Core (V100+): a 4x4x4 matrix multiply-accumulate per cycle = 64 MACs. Lower precisions help further — FP16, BF16/TF32/INT8 (A100), FP8 (H100), FP4 (B200). H100: FP32 = 67 TFLOPS vs Tensor Core FP8 ~ 2,000 TFLOPS (~30x). Transformer training/inference is nearly pure matmul, so this speedup hits real workloads. NVIDIA's 2017 decision defined the AI accelerator market.

Q6. What problem does CUDA Graph solve?

A. Kernel launch overhead (~tens of microseconds per launch). Negligible for big kernels, but Transformer inference issues hundreds of small kernels — launch overhead approaches actual compute. CUDA Graph records a kernel sequence as a single graph, launched once instead of N times. 2-3x speedup on token generation. PyTorch 2.0+ torch.compile with CUDAGraphs automates this. Best when launch is the bottleneck; less effective for compute-heavy workloads.

Q7. Why can AMD ROCm not replace CUDA?

A. Ecosystem, not hardware. AMD MI300 is close to H100 (sometimes better). HIP with hipify converts CUDA code. Yet adoption lags because: (1) library maturity — rocBLAS/MIOpen/RCCL trail cuBLAS/cuDNN/NCCL in features and performance; (2) framework support — PyTorch ships CUDA features first; (3) developer community — books, tutorials, Stack Overflow overwhelmingly CUDA; (4) innovation originates on CUDA (Triton, FlashAttention, cutlass); (5) corporate inertia — engineer training costs and existing CUDA codebases. "Same hardware, different ecosystem, users stay put." Catching up takes years of investment.


Related posts:

  • "Transformer Architecture Deep Dive" — what actually runs on GPUs.
  • "Diffusion Models Deep Dive" — another GPU-heavy workload.
  • "RDMA & NCCL" — multi-GPU communication.
  • "LLVM Compiler Infrastructure" — the MLIR backing Triton.