- Author: Youngju Kim (@fjvbn20031)
- 1. CPU vs GPU: Fundamentally Different Design Philosophies
- 2. GPU Internal Hierarchy: Device Down to CUDA Core
- 3. SIMT: Single Instruction, Multiple Threads
- 4. CUDA Memory Hierarchy: Latency vs Bandwidth Tradeoffs
- 5. Parallelizing Matrix Multiplication: The Core of CUDA
- 6. Tensor Cores: Dedicated Matrix Multiply Hardware
- 7. Grid, Block, Thread: The CUDA Execution Hierarchy
- 8. The History of CUDA: Why NVIDIA Owns AI Infrastructure
- 9. Practical Optimization: Extracting Peak GEMM Performance on H100
- 10. Summary: Why GPUs Dominate AI
Why Every AI Engineer Needs to Understand GPU Internals
Running GPT-4 inference requires hundreds of A100 GPUs operating simultaneously. Fine-tuning LLaMA-3 demands an H100 cluster. At the heart of all that computation sits NVIDIA's GPU architecture and the CUDA programming model. Yet surprisingly few engineers can precisely explain why GPUs are so fast. This post dissects H100 specifications, the SIMT execution model, warp divergence, shared memory tiling, and Tensor Cores — all backed by real CUDA code.
1. CPU vs GPU: Fundamentally Different Design Philosophies
The difference between a CPU and GPU is not merely "more cores." The underlying design philosophy is fundamentally different.
CPU: Latency Optimization
A CPU is designed to execute a single thread as fast as possible. To achieve this, it incorporates:
- Branch Prediction: Predicts if/else outcomes ahead of time to prevent pipeline stalls
- Out-of-Order Execution: Executes independent instructions out of program order to keep the pipeline full
- Speculative Execution: Computes down predicted branches before knowing if they're taken (the mechanism behind Spectre/Meltdown)
- Massive Caches: Tens of megabytes of L1/L2/L3 cache to hide memory latency for a single thread
GPU: Throughput Optimization
A GPU is designed to maximize total throughput when thousands of threads run simultaneously. Individual thread latency is traded away — instead, when one thread stalls waiting for memory, the hardware instantly switches to another thread. Latency is hidden by parallelism.
CPU Architecture (latency-optimized):
┌────────────────────────────────────────────┐
│ Core 0 │ Core 1 │ Core 2 │ Core 3 │
│ (powerful)│(powerful)│(powerful)│(powerful)│
│ OOO exec │ OOO exec │ OOO exec │ OOO exec │
├──────────┴──────────┴──────────┴───────────┤
│ L3 Cache (32MB+) │
├────────────────────────────────────────────┤
│ Main Memory (DDR5, ~50 GB/s) │
└────────────────────────────────────────────┘
GPU Architecture (throughput-optimized) — H100:
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│SM│SM│SM│SM│SM│SM│SM│SM│SM│SM│...│ 132 SMs
│ │ │ │ │ │ │ │ │ │ │ │ each SM = 128 CUDA cores
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴───┘
L2 Cache (50MB)
HBM3 (80GB, 3.35 TB/s)
The key insight: a CPU tries to finish 1 thread in 1 ns. A GPU accepts that each thread might take 700 ns, but runs 100,000 threads simultaneously — total throughput wins overwhelmingly.
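The tradeoff is easy to quantify with back-of-envelope arithmetic, using the illustrative numbers from this paragraph (not measured values):

```python
# Throughput comparison: a CPU finishing one thread per nanosecond vs.
# a GPU where each thread takes 700 ns but 100,000 run concurrently.
cpu_threads_per_sec = 1 / 1e-9           # 1 thread per ns -> 1e9 threads/s
gpu_thread_latency = 700e-9              # 700 ns per thread
gpu_threads_in_flight = 100_000
gpu_threads_per_sec = gpu_threads_in_flight / gpu_thread_latency

print(f"CPU: {cpu_threads_per_sec:.2e} threads/s")
print(f"GPU: {gpu_threads_per_sec:.2e} threads/s")
print(f"GPU advantage: {gpu_threads_per_sec / cpu_threads_per_sec:.0f}x")  # ~143x
```

Even with each thread 700x slower, aggregate throughput is two orders of magnitude higher — the whole design philosophy in three lines of arithmetic.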
2. GPU Internal Hierarchy: Device Down to CUDA Core
The H100's interior is organized hierarchically. Let's descend from the top-level Device all the way to individual CUDA Cores.
GPU Device (H100 SXM5)
│
├── SM 0 (Streaming Multiprocessor)
│ ├── 4x Warp Schedulers
│ ├── 4x Dispatch Units
│ ├── 128x CUDA Cores (FP32)
│ ├── 64x FP64 Cores
│ ├── 4x Tensor Cores (4th gen: FP8/FP16/BF16/TF32)
│ ├── 4x Special Function Units (SFU: sin, cos, sqrt)
│ ├── Load/Store Units (LD/ST)
│ ├── L1 Cache + Shared Memory = 228KB (ratio configurable)
│ └── Register File (65536 x 32-bit registers)
│
├── SM 1 ...
├── SM 2 ...
│ ...
├── SM 131 ...
│
├── L2 Cache (50MB)
└── HBM3 Memory (80GB, 3.35 TB/s bandwidth)
Key H100 SXM5 Numbers:
| Spec | Value |
|---|---|
| SM Count | 132 |
| CUDA Cores per SM | 128 (FP32) |
| Total CUDA Cores | 16,896 |
| Tensor Cores per SM | 4 (4th gen) |
| Total Tensor Cores | 528 |
| L2 Cache | 50MB |
| HBM3 Memory | 80GB |
| Memory Bandwidth | 3.35 TB/s |
| FP16 Tensor Core Perf | 989 TFLOPS |
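A quick roofline-style calculation with the table's numbers shows when a kernel can actually use those 989 TFLOPS. The 6·N² byte count below assumes each FP16 matrix crosses HBM exactly once — an idealization that tiling (Section 5) makes approximately true:

```python
# Ridge point of a simple roofline model for the H100 numbers above:
# FLOPs per byte a kernel must perform before compute, rather than
# HBM bandwidth, becomes the bottleneck.
peak_flops = 989e12      # FP16 Tensor Core, FLOP/s
hbm_bw = 3.35e12         # bytes/s

ridge = peak_flops / hbm_bw   # FLOPs/byte needed to saturate compute
print(f"Ridge point: {ridge:.0f} FLOPs/byte")  # ~295

# A square FP16 GEMM does 2*N^3 FLOPs on ~6*N^2 bytes (read A and B,
# write C, 2 bytes per element), so arithmetic intensity grows with N:
for n in (256, 1024, 4096):
    intensity = 2 * n**3 / (6 * n**2)
    bound = "compute-bound" if intensity > ridge else "memory-bound"
    print(f"N={n:5d}: {intensity:.0f} FLOPs/byte ({bound})")
```

Small GEMMs are memory-bound no matter how many Tensor Cores the chip has — one reason batching and operator fusion matter so much in practice.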
3. SIMT: Single Instruction, Multiple Threads
The GPU's core execution model is SIMT (Single Instruction Multiple Threads). It resembles the CPU's SIMD (Single Instruction Multiple Data) but has a crucial difference: threads in SIMT have their own program counter and registers, giving the illusion of independent threads while sharing execution hardware.
Warp: The Fundamental Unit of GPU Execution
In CUDA, a Warp is a group of 32 threads that always execute the same instruction simultaneously. The Warp Scheduler inside each SM dispatches instructions at the warp granularity.
Thread Block (example: 256 threads)
├── Warp 0: Threads 0-31 → execute same instruction in lockstep
├── Warp 1: Threads 32-63 → execute same instruction in lockstep
├── Warp 2: Threads 64-95 → execute same instruction in lockstep
├── ...
└── Warp 7: Threads 224-255
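The warp partitioning above is a fixed function of the linear thread index, easy to model in a few lines of Python:

```python
# How CUDA partitions a thread block into warps: warp id and lane id
# are pure functions of the linear thread index within the block.
WARP_SIZE = 32

def warp_of(thread_idx: int) -> tuple:
    """Return (warp_id, lane_id) for a thread index within a block."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE

# The 256-thread block from the diagram above:
print(warp_of(0))    # (0, 0)  -> Warp 0, lane 0
print(warp_of(63))   # (1, 31) -> last lane of Warp 1
print(warp_of(255))  # (7, 31) -> last lane of Warp 7
```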
Warp Divergence: The Performance Killer
When threads within a warp take different branches, Warp Divergence occurs. Because SIMT requires all 32 threads to execute the same instruction, the GPU serializes the divergent paths:
// This code causes warp divergence
__global__ void divergent_kernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx % 2 == 0) {
// Even-indexed threads in the warp execute this
data[idx] = data[idx] * 2.0f; // Step 1
} else {
// Odd-indexed threads in the warp execute this
data[idx] = data[idx] + 1.0f; // Step 2
}
// Result: During Step 1, odd threads are masked off (idle)
// During Step 2, even threads are masked off (idle)
// Effective throughput: 50% of theoretical
}
Compare with a divergence-free version:
// No divergence: all threads in a warp take the same path
__global__ void coherent_kernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// All threads in a block share the same blockIdx.x,
// so all 32 threads in a warp take the same branch
if (blockIdx.x % 2 == 0) {
data[idx] = data[idx] * 2.0f;
} else {
data[idx] = data[idx] + 1.0f;
}
}
Practical rule: make branch conditions uniform within a warp — base them on blockIdx, or on warp-aligned expressions like threadIdx.x / 32, rather than on per-thread values.
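The serialization cost can be captured in a toy Python model (a simplification, not real hardware behavior): a warp needs one execution pass per distinct path taken by any of its 32 lanes.

```python
WARP_SIZE = 32

def warp_passes(branch_of_lane):
    """Serialized passes a warp needs: one per distinct branch path
    taken by any of its lanes (toy model of SIMT divergence)."""
    return len(set(branch_of_lane))

# divergent_kernel above: lanes alternate between the if and else paths
divergent = [lane % 2 for lane in range(WARP_SIZE)]
# coherent_kernel: the branch depends on blockIdx, identical for all lanes
coherent = [0] * WARP_SIZE

print(warp_passes(divergent))  # 2 passes -> ~50% effective throughput
print(warp_passes(coherent))   # 1 pass  -> full throughput
```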
4. CUDA Memory Hierarchy: Latency vs Bandwidth Tradeoffs
Most CUDA kernel performance comes down to memory access patterns. Understanding the characteristics of each memory level is essential.
Memory Hierarchy (fast → slow):
┌─────────────────────────────────────────────────────┐
│ Registers (32-bit, 65536 per SM) │
│ Latency: ~1 cycle BW: enormous (local access) │
│ Scope: single thread only │
├─────────────────────────────────────────────────────┤
│ L1 Cache / Shared Memory (228KB/SM on H100) │
│ Latency: ~32 cycles BW: ~19 TB/s │
│ Scope: all threads in a Thread Block │
├─────────────────────────────────────────────────────┤
│ L2 Cache (50MB on H100) │
│ Latency: ~200 cycles BW: ~12 TB/s │
│ Scope: all SMs share │
├─────────────────────────────────────────────────────┤
│ Global Memory HBM3 (80GB on H100) │
│ Latency: ~700 cycles BW: 3.35 TB/s │
│ Scope: all threads on the device │
└─────────────────────────────────────────────────────┘
| Memory Type | Latency | Bandwidth | Size | Scope |
|---|---|---|---|---|
| Registers | ~1 cycle | enormous | 256KB/SM | Single thread |
| Shared Memory | ~32 cycles | ~19 TB/s | 228KB/SM | Thread Block |
| L2 Cache | ~200 cycles | ~12 TB/s | 50MB | Whole GPU |
| HBM3 (Global) | ~700 cycles | 3.35 TB/s | 80GB | Whole GPU |
Shared memory is a programmer-managed cache. Declaring variables with __shared__ puts them in shared memory, accessible by all threads in the same Thread Block, with over 20x lower latency than HBM.
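Little's law ties these numbers together: to keep HBM saturated despite ~700-cycle latency, a large volume of memory requests must be outstanding at all times. A quick estimate (the ~1.6 GHz clock is an assumption for illustration; H100 clocks vary by SKU):

```python
# Little's law: concurrency = bandwidth * latency.
# How many bytes of requests must be "in flight" to saturate HBM3?
bandwidth = 3.35e12          # bytes/s (HBM3, from the table above)
clock_hz = 1.6e9             # assumed SM clock (illustrative)
latency_s = 700 / clock_hz   # ~700-cycle HBM latency in seconds

bytes_in_flight = bandwidth * latency_s
print(f"{bytes_in_flight / 1e6:.2f} MB of requests must be outstanding")
```

Roughly 1.5 MB of outstanding loads at any instant — no single thread can generate that, which is exactly why the GPU keeps tens of thousands of threads resident and switches between them.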
5. Parallelizing Matrix Multiplication: The Core of CUDA
Matrix multiplication (GEMM: General Matrix Multiply) is the dominant operation in deep learning. Let's walk through how to parallelize it on GPU step by step.
Naive Implementation: Hammering Global Memory
// Naive matrix multiply kernel — understand WHY it's slow
__global__ void matmul_naive(float* A, float* B, float* C, int N) {
// Each thread computes one element of the output matrix
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < N && col < N) {
float sum = 0.0f;
for (int k = 0; k < N; k++) {
// Two global memory reads per iteration → ~700 cycles latency each
sum += A[row * N + k] * B[k * N + col];
}
C[row * N + col] = sum;
}
}
// Launch configuration
dim3 blockDim(16, 16); // 256 threads per block
dim3 gridDim((N + 15) / 16, (N + 15) / 16);
matmul_naive<<<gridDim, blockDim>>>(A, B, C, N);
The problem: for N=4096, each thread performs 2×4096 global memory reads. Total global memory reads = 2 × N^3 ≈ 137 billion. With 700-cycle HBM latency and almost no data reuse, performance is terrible.
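The traffic arithmetic, worked out in Python:

```python
# Global-memory traffic of the naive kernel for N = 4096:
# each of the N^2 threads issues 2 reads per loop iteration, N iterations.
N = 4096
reads = 2 * N**3                 # ~1.37e11 (137 billion) reads
bytes_read = reads * 4           # float32 = 4 bytes
print(f"Reads:   {reads:.3e}")
print(f"Traffic: {bytes_read / 1e12:.2f} TB")
print(f"Time at 3.35 TB/s: {bytes_read / 3.35e12 * 1e3:.0f} ms")
```

Over half a terabyte of traffic for an output matrix that is only 67 MB — that redundancy is exactly what tiling removes.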
Tiled Implementation: Using Shared Memory to Minimize Global Accesses
#define TILE_SIZE 16
// Tiled matrix multiply using shared memory
__global__ void matmul_tiled(float* A, float* B, float* C, int N) {
// Declare tiles in shared memory (~32-cycle latency vs 700 for HBM)
__shared__ float tile_A[TILE_SIZE][TILE_SIZE];
__shared__ float tile_B[TILE_SIZE][TILE_SIZE];
int tx = threadIdx.x, ty = threadIdx.y;
int row = blockIdx.y * TILE_SIZE + ty;
int col = blockIdx.x * TILE_SIZE + tx;
float sum = 0.0f;
// Iterate over the K dimension in tile-sized chunks
// (assumes N is a multiple of TILE_SIZE; production code also needs
//  a bounds check before the final write to C)
for (int t = 0; t < N / TILE_SIZE; t++) {
// Step 1: Each thread loads one element into shared memory
tile_A[ty][tx] = A[row * N + t * TILE_SIZE + tx];
tile_B[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];
// Step 2: Wait for ALL threads to finish loading
__syncthreads();
// Step 3: Compute partial dot product from this tile
// All accesses here hit shared memory (~32 cycles)
for (int k = 0; k < TILE_SIZE; k++) {
sum += tile_A[ty][k] * tile_B[k][tx];
}
// Step 4: Sync before loading the next tile
__syncthreads();
}
C[row * N + col] = sum;
}
Why is it faster?
With tile size 16×16:
- Naive: global memory accesses = 2 × N^3
- Tiled: global memory accesses = 2 × N^3 / TILE_SIZE = 16x reduction
By loading each tile into shared memory once, every element is reused by 16 threads in the block instead of being re-fetched from global memory. Memory bandwidth efficiency improves dramatically.
Tiling principle:
Matrix A: Matrix B:
┌──┬──┬──┬──┐ ┌──┬──┬──┬──┐
│T0│T1│T2│T3│ │T0│T1│T2│T3│
├──┼──┼──┼──┤ ├──┼──┼──┼──┤
│ │ │ │ │ │ │ │ │ │
└──┴──┴──┴──┘ └──┴──┴──┴──┘
Each Thread Block owns one 16×16 output tile of C.
It iterates over A's row tiles and B's column tiles,
loading each pair into shared memory before computing.
Result: global memory traffic reduced by TILE_SIZE factor.
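To make the loop structure concrete, here is a pure-Python model of the tiled kernel, checked against a plain triple loop. The shared-memory load step is implicit (Python just indexes A and B directly), but the tile loop, thread loops, and accumulation order mirror the CUDA code:

```python
# Pure-Python model of the tiled kernel: block loops stand in for the
# grid, thread loops for the block, and the t loop walks tiles along K.
TILE = 4

def matmul_tiled(A, B, N):
    C = [[0.0] * N for _ in range(N)]
    for by in range(N // TILE):              # blockIdx.y
        for bx in range(N // TILE):          # blockIdx.x
            for t in range(N // TILE):       # tile loop over K
                for ty in range(TILE):       # threadIdx.y
                    for tx in range(TILE):   # threadIdx.x
                        row, col = by * TILE + ty, bx * TILE + tx
                        for k in range(TILE):  # partial dot product
                            C[row][col] += A[row][t * TILE + k] * B[t * TILE + k][col]
    return C

def matmul_naive(A, B, N):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

N = 8
A = [[float(i * N + j) for j in range(N)] for i in range(N)]
B = [[float((i + j) % 5) for j in range(N)] for i in range(N)]
assert matmul_tiled(A, B, N) == matmul_naive(A, B, N)
print("tiled == naive")
```

Same result, same FLOPs — the entire speedup on real hardware comes from where the operands live, not from doing less math.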
6. Tensor Cores: Dedicated Matrix Multiply Hardware
Where a CUDA core executes one FP32 multiply-add per cycle, a Tensor Core is dedicated matrix hardware: a single warp-level instruction performs an entire 16×16×16 matrix multiply-accumulate.
4th Generation Tensor Core (H100):
| Operation | Tensor Core | CUDA Core | Speedup |
|---|---|---|---|
| FP16 matrix multiply | 256 ops/cycle | 2 ops/cycle | 128x |
| BF16 matrix multiply | 256 ops/cycle | not supported | — |
| FP8 matrix multiply | 512 ops/cycle | not supported | — |
| TF32 matrix multiply | 128 ops/cycle | not supported | — |
The WMMA (Warp Matrix Multiply-Accumulate) API exposes Tensor Cores directly:
// Direct Tensor Core access via WMMA API
#include <mma.h>
using namespace nvcuda::wmma;
__global__ void tensor_core_matmul(half* a_ptr, half* b_ptr,
float* c_ptr, int M, int N, int K) {
// Each warp handles one 16x16 output tile
// fragment = a matrix tile that the Tensor Core operates on
fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;
// Initialize accumulator to zero
fill_fragment(c_frag, 0.0f);
// Iterate over K dimension
for (int k = 0; k < K; k += 16) {
// Load fragments from global memory
load_matrix_sync(a_frag, a_ptr + /* offset */, K);
load_matrix_sync(b_frag, b_ptr + /* offset */, N);
// ONE instruction: 16x16x16 matrix multiply-accumulate
// Internally dispatches to Tensor Core hardware
mma_sync(c_frag, a_frag, b_frag, c_frag);
}
// Store result back to global memory
store_matrix_sync(c_ptr + /* offset */, c_frag, N, mem_row_major);
}
In practice, cuBLAS, cuBLASLt, and CUTLASS exploit Tensor Cores far more efficiently than manual WMMA code, but this demonstrates exactly what Tensor Cores do.
7. Grid, Block, Thread: The CUDA Execution Hierarchy
When launching a CUDA kernel, the programmer specifies an execution configuration describing the grid and block dimensions.
CUDA Execution Hierarchy:
Grid (entire kernel launch)
├── Block (0,0): 256 threads
│ ├── Warp 0: Threads 0-31 ── execute in lockstep on SM
│ ├── Warp 1: Threads 32-63
│ └── ...
├── Block (1,0): 256 threads
├── Block (2,0): 256 threads
└── ...
Each Block is assigned to exactly one SM.
One SM can host multiple Blocks simultaneously (occupancy).
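Computing a launch configuration is just ceiling division over the problem size, as a small sketch:

```python
# Launch configuration like the diagram above: ceil-division of the
# problem size by the block dimension, so every element gets a thread.
def launch_config(n: int, block: int = 16):
    grid = (n + block - 1) // block  # ceil(n / block)
    return grid, block

# A 4096x4096 output matrix with 16x16 thread blocks:
grid, block = launch_config(4096)
print(f"Grid: {grid}x{grid} blocks, Block: {block}x{block} threads")
print(f"Total blocks:  {grid * grid}")                    # 65536
print(f"Total threads: {grid * grid * block * block}")    # 16777216
```

The in-kernel bounds check (if (row < N && col < N) in the naive kernel above) exists precisely because the ceiling division can over-provision threads when N is not a multiple of the block size.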
PyTorch hides all of this behind a single operator:
import torch
A = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
B = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
# What happens inside this one line:
# 1. cuBLAS or CUTLASS selects the optimal kernel
# 2. Execution config: e.g., Grid(256,256), Block(16,16)
# 3. Tiled GEMM using Tensor Cores
# 4. 65,536 Thread Blocks, each with 256 threads
result = torch.matmul(A, B)  # dispatches to cuBLAS (e.g., cublasGemmEx for FP16)
Occupancy Optimization: The number of concurrent warps an SM can host depends on register and shared memory usage per block. If a block uses too many resources, only one block fits per SM, leaving execution units underutilized.
SM Resource Limits (H100):
- Max concurrent Thread Blocks: 32
- Max concurrent threads: 2048
- Registers: 65536 total (divided among all threads in resident warps)
- Shared Memory: 228KB (divided among resident blocks)
Occupancy = Active Warps / Maximum Possible Warps
Goal: 50%+ (needed to hide memory latency through warp switching)
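A simplified occupancy calculator using these limits (a sketch: real occupancy also depends on register allocation granularity and other constraints that NVIDIA's occupancy API accounts for):

```python
# Simplified occupancy model for the H100 per-SM limits listed above:
# resident blocks = min over the four resource limits.
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_blocks=32, max_threads=2048,
              total_regs=65536, total_smem=228 * 1024):
    by_blocks = max_blocks
    by_threads = max_threads // threads_per_block
    by_regs = total_regs // (regs_per_thread * threads_per_block)
    by_smem = total_smem // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_blocks, by_threads, by_regs, by_smem)
    active_warps = blocks * threads_per_block // 32
    return active_warps / (max_threads // 32)

# 256 threads/block, 32 registers/thread, 48KB shared memory per block:
print(f"{occupancy(256, 32, 48 * 1024):.0%}")  # 50% -- shared memory limits us
```

Here shared memory is the binding constraint: only 4 blocks fit per SM, capping occupancy at 50% even though registers and thread limits would allow 8.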
8. The History of CUDA: Why NVIDIA Owns AI Infrastructure
Understanding CUDA's rise explains why, even in 2024, NVIDIA holds over 80% of the AI accelerator market.
Pre-2006: The OpenGL Shader Hacking Era
Researchers knew GPUs were fast, but using them for general computation required encoding calculations as OpenGL texture operations or shader programs. Matrix multiplication had to be disguised as an image filter to run on GPU.
2006: CUDA 1.0 Launch
Jensen Huang made a bet: "We'll let programmers write GPU code in C." At the time, this meant adding general-purpose computing logic to hardware designed for gaming. The NVIDIA board had to approve dedicating silicon area to this experiment.
Early CUDA users: physics simulation researchers, CFD teams, molecular dynamics groups. Deep learning hadn't emerged yet.
2012: The AlexNet Moment
Alex Krizhevsky used two GTX 580 consumer gaming GPUs to train AlexNet, which won the 2012 ImageNet challenge by a massive margin (15.3% top-5 error vs. 26.2% for the runner-up). This single event triggered the deep learning revolution. The GTX 580 was a gaming GPU, but CUDA made it a parallel computing platform.
CUDA vs OpenCL: The Ecosystem War
The Khronos Group's OpenCL supported AMD, Intel, and NVIDIA through an open standard. CUDA won despite being proprietary, for three reasons:
- Developer tools: cuDNN, cuBLAS, Nsight profiler — NVIDIA's ecosystem was far richer
- First-mover effect: Research code published in papers was all CUDA, so labs standardized on it
- Hardware exclusivity: Tensor Cores, NVLink, and other features are only accessible through CUDA
Result: PyTorch, TensorFlow, and JAX all use CUDA by default. AMD's ROCm and Intel's oneAPI are catching up, but the gap remains substantial.
9. Practical Optimization: Extracting Peak GEMM Performance on H100
The H100's theoretical FP16 Tensor Core performance is 989 TFLOPS. Here's how to approach that number:
import torch
import time
# 1. Use FP16 or BF16 to activate Tensor Cores
A = torch.randn(8192, 8192, device='cuda', dtype=torch.bfloat16)
B = torch.randn(8192, 8192, device='cuda', dtype=torch.bfloat16)
# 2. Warmup: first runs include kernel compilation/selection
for _ in range(5):
_ = torch.matmul(A, B)
torch.cuda.synchronize()
# 3. Measure
start = time.perf_counter()
for _ in range(100):
C = torch.matmul(A, B)
torch.cuda.synchronize()
end = time.perf_counter()
# 4. Compute TFLOPS
# 8192x8192 matmul: 2 * N^3 = 2 * 8192^3 = 1.1e12 FLOPs per call
flops = 2 * 8192**3 * 100
elapsed = end - start
tflops = flops / elapsed / 1e12
print(f"Achieved: {tflops:.1f} TFLOPS")
# Target: 700+ TFLOPS (70%+ of 989 TFLOPS theoretical is achievable)
Factors that determine performance:
- Memory alignment: Matrices must be 128-byte aligned for optimal memory coalescing
- Matrix dimensions: Must be multiples of 16 for Tensor Cores to operate without padding
- torch.backends.cudnn.benchmark = True: Lets cuDNN auto-select the fastest algorithm on first run
- NVLink: In multi-GPU setups, GPU-to-GPU communication bandwidth (H100 NVLink = 900 GB/s) matters for model parallelism
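The multiple-of-16 rule comes down to a round-up helper (the function name is my own, for illustration):

```python
# Rounding a dimension up to a multiple of 16 so Tensor Core tiles
# run without internal padding.
def round_up(n: int, m: int = 16) -> int:
    """Smallest multiple of m that is >= n."""
    return (n + m - 1) // m * m

print(round_up(1000))   # 1008 -- pad 1000-wide matrices by 8 columns
print(round_up(4096))   # 4096 -- already aligned, no padding needed
```

Libraries like cuBLAS handle odd shapes internally, but keeping hidden dimensions and sequence lengths at multiples of 16 (ideally 64 or 128) avoids the padded fallback paths entirely.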
10. Summary: Why GPUs Dominate AI
The GPU's dominance of AI computation is not simply a matter of "more cores." It's the combination of three factors:
- Hardware design: Thousands of simple cores + HBM high-bandwidth memory + Tensor Cores as dedicated matrix multiply hardware
- Programming model: CUDA's SIMT model + hierarchical memory (shared memory, registers) + fine-grained control
- Ecosystem: 18 years of accumulated cuDNN, cuBLAS, NCCL, Nsight, PyTorch/TF integration
The overwhelming majority of deep learning FLOPs reduce to matrix multiplication, and the H100's Tensor Cores are purpose-built hardware for exactly that. This is why NVIDIA holds the critical infrastructure position for the AI era.
The next post dissects cuDNN's internals: how it handles convolutions using Winograd's algorithm and im2col transformation, the FlashAttention tiling trick, and why torch.backends.cudnn.benchmark = True can double your training throughput.