NVIDIA GPU and CUDA Architecture Deep Dive: Why GPUs Dominate AI


Why Every AI Engineer Needs to Understand GPU Internals

Running GPT-4 inference requires hundreds of A100 GPUs operating simultaneously. Fine-tuning LLaMA-3 demands an H100 cluster. At the heart of all that computation sits NVIDIA's GPU architecture and the CUDA programming model. Yet surprisingly few engineers can precisely explain why GPUs are so fast. This post dissects H100 specifications, the SIMT execution model, warp divergence, shared memory tiling, and Tensor Cores — all backed by real CUDA code.


1. CPU vs GPU: Fundamentally Different Design Philosophies

The difference between a CPU and GPU is not merely "more cores." The underlying design philosophy is fundamentally different.

CPU: Latency Optimization

A CPU is designed to execute a single thread as fast as possible. To achieve this, it incorporates:

  • Branch Prediction: Predicts if/else outcomes ahead of time to prevent pipeline stalls
  • Out-of-Order Execution: Executes independent instructions out of program order to keep the pipeline full
  • Speculative Execution: Computes down predicted branches before knowing if they're taken (the mechanism behind Spectre/Meltdown)
  • Massive Caches: Tens of megabytes of L1/L2/L3 cache to hide memory latency for a single thread

GPU: Throughput Optimization

A GPU is designed to maximize total throughput when thousands of threads run simultaneously. Individual thread latency is traded away — instead, when one thread stalls waiting for memory, the hardware instantly switches to another thread. Latency is hidden by parallelism.

CPU Architecture (latency-optimized):
┌──────────┬──────────┬──────────┬──────────┐
│  Core 0  │  Core 1  │  Core 2  │  Core 3  │
│(powerful)│(powerful)│(powerful)│(powerful)│
│ OOO exec │ OOO exec │ OOO exec │ OOO exec │
├──────────┴──────────┴──────────┴──────────┤
│              L3 Cache (32MB+)              │
├────────────────────────────────────────────┤
│       Main Memory (DDR5, ~50 GB/s)         │
└────────────────────────────────────────────┘

GPU Architecture (throughput-optimized), H100:
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬───┐
│SM│SM│SM│SM│SM│SM│SM│SM│SM│SM│...│  132 SMs,
│  │  │  │  │  │  │  │  │  │  │   │  each SM = 128 CUDA cores
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴───┘
       L2 Cache (50MB)
       HBM3 (80GB, 3.35 TB/s)

The key insight: a CPU tries to finish 1 thread in 1 ns. A GPU accepts that each thread might take 700 ns, but runs 100,000 threads simultaneously — total throughput wins overwhelmingly.
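A back-of-the-envelope model makes this tradeoff concrete. The numbers below are the illustrative figures from the paragraph above (1 ns, 700 ns, 100,000 threads), not measurements:

```python
# Toy latency-vs-throughput model using the illustrative numbers above.
cpu_latency_ns = 1            # CPU finishes one thread's work in ~1 ns
gpu_latency_ns = 700          # each GPU thread takes ~700 ns
gpu_threads_in_flight = 100_000

# Thread-completions per nanosecond, steady state
cpu_throughput = 1 / cpu_latency_ns
gpu_throughput = gpu_threads_in_flight / gpu_latency_ns

print(f"CPU: {cpu_throughput:.0f} completions/ns")
print(f"GPU: {gpu_throughput:.0f} completions/ns")
print(f"GPU advantage: ~{gpu_throughput / cpu_throughput:.0f}x")
```

Even though each GPU thread is 700x slower, having 100,000 of them in flight yields roughly 143x the completion rate.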


2. GPU Internal Hierarchy: Device Down to CUDA Core

The H100's interior is organized hierarchically. Let's descend from the top-level Device all the way to individual CUDA Cores.

GPU Device (H100 SXM5)
├── SM 0 (Streaming Multiprocessor)
│   ├── 4x Warp Schedulers
│   ├── 4x Dispatch Units
│   ├── 128x CUDA Cores (FP32)
│   ├── 64x FP64 Cores
│   ├── 4x Tensor Cores (4th gen: FP8/FP16/BF16/TF32)
│   ├── 4x Special Function Units (SFU: sin, cos, sqrt)
│   ├── Load/Store Units (LD/ST)
│   ├── L1 Cache + Shared Memory = 228KB (ratio configurable)
│   └── Register File (65536 x 32-bit registers)
├── SM 1 ...
├── SM 2 ...
...
├── SM 131 ...
├── L2 Cache (50MB)
└── HBM3 Memory (80GB, 3.35 TB/s bandwidth)

Key H100 SXM5 Numbers:

Spec                     Value
─────────────────────────────────────
SM Count                 132
CUDA Cores per SM        128 (FP32)
Total CUDA Cores         16,896
Tensor Cores per SM      4 (4th gen)
Total Tensor Cores       528
L2 Cache                 50MB
HBM3 Memory              80GB
Memory Bandwidth         3.35 TB/s
FP16 Tensor Core Perf    989 TFLOPS

3. SIMT: Single Instruction, Multiple Threads

The GPU's core execution model is SIMT (Single Instruction Multiple Threads). It resembles the CPU's SIMD (Single Instruction Multiple Data) but has a crucial difference: threads in SIMT have their own program counter and registers, giving the illusion of independent threads while sharing execution hardware.

Warp: The Fundamental Unit of GPU Execution

In CUDA, a Warp is a group of 32 threads that always execute the same instruction simultaneously. The Warp Scheduler inside each SM dispatches instructions at the warp granularity.

Thread Block (example: 256 threads)
├── Warp 0: Threads  0-31  → execute same instruction in lockstep
├── Warp 1: Threads 32-63  → execute same instruction in lockstep
├── Warp 2: Threads 64-95  → execute same instruction in lockstep
├── ...
└── Warp 7: Threads 224-255
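The thread-to-warp mapping is plain integer arithmetic. A quick sketch of what the hardware does (Python, with a hypothetical `warp_of` helper):

```python
WARP_SIZE = 32

def warp_of(thread_id):
    """Return (warp index, lane within warp) for a thread in a block."""
    return thread_id // WARP_SIZE, thread_id % WARP_SIZE

# A 256-thread block is carved into 8 warps of 32 lanes each
block_size = 256
print(f"warps per block: {block_size // WARP_SIZE}")  # 8
print(warp_of(0))    # (0, 0)  -> warp 0, lane 0
print(warp_of(63))   # (1, 31) -> last lane of warp 1
print(warp_of(255))  # (7, 31) -> last lane of warp 7
```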

Warp Divergence: The Performance Killer

When threads within a warp take different branches, Warp Divergence occurs. Because SIMT requires all 32 threads to execute the same instruction, the GPU serializes the divergent paths:

// This code causes warp divergence
__global__ void divergent_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0) {
        // Even-indexed threads in the warp execute this
        data[idx] = data[idx] * 2.0f;  // Step 1
    } else {
        // Odd-indexed threads in the warp execute this
        data[idx] = data[idx] + 1.0f;  // Step 2
    }
    // Result: During Step 1, odd threads are masked off (idle)
    //         During Step 2, even threads are masked off (idle)
    // Effective throughput: 50% of theoretical
}

Compare with a divergence-free version:

// No divergence: all threads in a warp take the same path
__global__ void coherent_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // All threads in a block share the same blockIdx.x,
    // so all 32 threads in a warp take the same branch
    if (blockIdx.x % 2 == 0) {
        data[idx] = data[idx] * 2.0f;
    } else {
        data[idx] = data[idx] + 1.0f;
    }
}

Practical rule: Design branch conditions based on blockIdx rather than threadIdx to avoid divergence within warps.
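A small cost model makes the rule concrete: count the distinct branch outcomes inside each 32-thread warp, since that is how many serialized passes the warp needs. This is an illustrative sketch (real hardware also re-converges and can predicate short branches), with hypothetical helper names:

```python
WARP_SIZE = 32

def serialized_passes(branch, n_threads=256, block_size=256):
    """Per-warp serialized passes = distinct branch outcomes in that warp."""
    passes = []
    for warp_start in range(0, n_threads, WARP_SIZE):
        outcomes = {branch(tid, tid // block_size)  # (threadIdx-like, blockIdx-like)
                    for tid in range(warp_start, warp_start + WARP_SIZE)}
        passes.append(len(outcomes))
    return passes

def even_odd_thread(tid, bid):  # branch on thread index parity
    return tid % 2 == 0

def even_odd_block(tid, bid):   # branch on block index parity
    return bid % 2 == 0

print(serialized_passes(even_odd_thread))  # [2, 2, 2, 2, 2, 2, 2, 2]
print(serialized_passes(even_odd_block))   # [1, 1, 1, 1, 1, 1, 1, 1]
```

Branching on thread parity forces every warp through both paths (2 passes, 50% utilization); branching on block parity keeps each warp uniform (1 pass).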


4. CUDA Memory Hierarchy: Latency vs Bandwidth Tradeoffs

Most CUDA kernel performance comes down to memory access patterns. Understanding the characteristics of each memory level is essential.

Memory Hierarchy (fast → slow):

┌─────────────────────────────────────────────────────┐
│ Registers (32-bit, 65536 per SM)                    │
│ Latency: ~1 cycle     BW: enormous (local access)   │
│ Scope: single thread only                           │
├─────────────────────────────────────────────────────┤
│ L1 Cache / Shared Memory (228KB/SM on H100)         │
│ Latency: ~32 cycles   BW: ~19 TB/s                  │
│ Scope: all threads in a Thread Block                │
├─────────────────────────────────────────────────────┤
│ L2 Cache (50MB on H100)                             │
│ Latency: ~200 cycles  BW: ~12 TB/s                  │
│ Scope: all SMs share                                │
├─────────────────────────────────────────────────────┤
│ Global Memory HBM3 (80GB on H100)                   │
│ Latency: ~700 cycles  BW: 3.35 TB/s                 │
│ Scope: all threads on the device                    │
└─────────────────────────────────────────────────────┘
Memory Type      Latency       Bandwidth    Size       Scope
──────────────────────────────────────────────────────────────
Registers        ~1 cycle      enormous     256KB/SM   Single thread
Shared Memory    ~32 cycles    ~19 TB/s     228KB/SM   Thread Block
L2 Cache         ~200 cycles   ~12 TB/s     50MB       Whole GPU
HBM3 (Global)    ~700 cycles   3.35 TB/s    80GB       Whole GPU

Shared memory is a programmer-managed cache. Declaring variables with __shared__ puts them in shared memory, accessible by all threads in the same Thread Block, with over 20x lower latency than HBM.


5. Parallelizing Matrix Multiplication: The Core of CUDA

Matrix multiplication (GEMM: General Matrix Multiply) is the dominant operation in deep learning. Let's walk through how to parallelize it on GPU step by step.

Naive Implementation: Hammering Global Memory

// Naive matrix multiply kernel — understand WHY it's slow
__global__ void matmul_naive(float* A, float* B, float* C, int N) {
    // Each thread computes one element of the output matrix
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // Two global memory reads per iteration → ~700 cycles latency each
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Launch configuration
dim3 blockDim(16, 16);  // 256 threads per block
dim3 gridDim((N + 15) / 16, (N + 15) / 16);
matmul_naive<<<gridDim, blockDim>>>(A, B, C, N);

The problem: for N=4096, each thread performs 2 × 4096 global memory reads. Total global memory reads = 2 × N^3 ≈ 137 billion. With ~700-cycle HBM latency and essentially no data reuse, performance is terrible.

Tiled Implementation: Using Shared Memory to Minimize Global Accesses

#define TILE_SIZE 16

// Tiled matrix multiply using shared memory
__global__ void matmul_tiled(float* A, float* B, float* C, int N) {
    // Declare tiles in shared memory (~32-cycle latency vs 700 for HBM)
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;
    float sum = 0.0f;

    // Iterate over the K dimension in tile-sized chunks
    // (assumes N is a multiple of TILE_SIZE; production code needs bounds checks)
    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Step 1: Each thread loads one element into shared memory
        tile_A[ty][tx] = A[row * N + t * TILE_SIZE + tx];
        tile_B[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];

        // Step 2: Wait for ALL threads to finish loading
        __syncthreads();

        // Step 3: Compute partial dot product from this tile
        // All accesses here hit shared memory (~32 cycles)
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tile_A[ty][k] * tile_B[k][tx];
        }

        // Step 4: Sync before loading the next tile
        __syncthreads();
    }

    C[row * N + col] = sum;
}

Why is it faster?

With tile size 16×16:

  • Naive: global memory accesses = 2 × N^3
  • Tiled: global memory accesses = 2 × N^3 / TILE_SIZE = 16x reduction

By loading a tile into shared memory once, each loaded element is reused by 16 threads instead of being fetched from global memory 16 separate times. Memory bandwidth efficiency improves dramatically.
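The reduction is easy to check numerically. Pure arithmetic matching the formulas above, with a hypothetical `global_reads` helper:

```python
def global_reads(N, tile=None):
    """Global-memory float reads for an N x N GEMM: 2*N^3 naive,
    divided by the tile size when shared-memory tiling is used."""
    reads = 2 * N**3
    return reads if tile is None else reads // tile

N, TILE = 4096, 16
naive = global_reads(N)
tiled = global_reads(N, TILE)
print(f"naive: {naive:.3e} reads")       # ~1.374e+11 (about 137 billion)
print(f"tiled: {tiled:.3e} reads")
print(f"reduction: {naive // tiled}x")   # 16x
```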

Tiling principle:

Matrix A:          Matrix B:
┌──┬──┬──┬──┐      ┌──┬──┬──┬──┐
│T0│T1│T2│T3│      │T0│T1│T2│T3│
├──┼──┼──┼──┤      ├──┼──┼──┼──┤
│  │  │  │  │      │  │  │  │  │
└──┴──┴──┴──┘      └──┴──┴──┴──┘

Each Thread Block owns one 16×16 output tile of C.
It iterates over A's row tiles and B's column tiles,
loading each pair into shared memory before computing.
Result: global memory traffic reduced by TILE_SIZE factor.

6. Tensor Cores: Dedicated Matrix Multiply Hardware

Where a CUDA core executes one FP32 scalar multiply-add per cycle, a Tensor Core executes an entire small matrix multiply-accumulate (a 16×16×16 tile per warp-level WMMA instruction) in dedicated hardware.

4th Generation Tensor Core (H100):

Operation              Tensor Core     CUDA Core      Speedup
──────────────────────────────────────────────────────────────
FP16 matrix multiply   256 ops/cycle   2 ops/cycle    128x
BF16 matrix multiply   256 ops/cycle   not supported  —
FP8 matrix multiply    512 ops/cycle   not supported  —
TF32 matrix multiply   128 ops/cycle   not supported  —

The WMMA (Warp Matrix Multiply-Accumulate) API exposes Tensor Cores directly:

// Direct Tensor Core access via WMMA API
#include <mma.h>
using namespace nvcuda::wmma;

__global__ void tensor_core_matmul(half* a_ptr, half* b_ptr,
                                    float* c_ptr, int M, int N, int K) {
    // Each warp handles one 16x16 output tile
    // fragment = a matrix tile that the Tensor Core operates on
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;

    // Initialize accumulator to zero
    fill_fragment(c_frag, 0.0f);

    // Iterate over K dimension
    for (int k = 0; k < K; k += 16) {
        // Load fragments from global memory
        load_matrix_sync(a_frag, a_ptr + /* offset */, K);
        load_matrix_sync(b_frag, b_ptr + /* offset */, N);

        // ONE instruction: 16x16x16 matrix multiply-accumulate
        // Internally dispatches to Tensor Core hardware
        mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Store result back to global memory
    store_matrix_sync(c_ptr + /* offset */, c_frag, N, mem_row_major);
}

In practice, cuBLAS, cuBLASLt, and CUTLASS exploit Tensor Cores far more efficiently than manual WMMA code, but this demonstrates exactly what Tensor Cores do.


7. Grid, Block, Thread: The CUDA Execution Hierarchy

When launching a CUDA kernel, the programmer specifies an execution configuration describing the grid and block dimensions.

CUDA Execution Hierarchy:

Grid (entire kernel launch)
├── Block (0,0): 256 threads
│   ├── Warp 0: Threads  0-31 ── execute in lockstep on SM
│   ├── Warp 1: Threads 32-63
│   └── ...
├── Block (1,0): 256 threads
├── Block (2,0): 256 threads
└── ...

Each Block is assigned to exactly one SM.
One SM can host multiple Blocks simultaneously (occupancy).

PyTorch hides all of this behind a single operator:

import torch

A = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
B = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)

# What happens inside this one line:
# 1. cuBLAS or CUTLASS selects the optimal kernel
# 2. Execution config: e.g., Grid(256,256), Block(16,16)
# 3. Tiled GEMM using Tensor Cores
# 4. With that config: 256 x 256 = 65,536 Thread Blocks, each with 256 threads
result = torch.matmul(A, B)  # internally calls cublasGemmEx / cuBLASLt

Occupancy Optimization: The number of concurrent warps an SM can host depends on register and shared memory usage per block. If a block uses too many resources, only one block fits per SM, leaving execution units underutilized.

SM Resource Limits (H100):
- Max concurrent Thread Blocks: 32
- Max concurrent threads: 2048
- Registers: 65536 total (divided among all threads in resident warps)
- Shared Memory: 228KB (divided among resident blocks)

Occupancy = Active Warps / Maximum Possible Warps
Goal: 50%+ (needed to hide memory latency through warp switching)
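The occupancy arithmetic above can be sketched in a few lines. This is a simplified model using the H100 limits just listed (the hypothetical `occupancy` helper ignores the allocation-granularity rounding real hardware applies):

```python
def occupancy(regs_per_thread, smem_per_block, threads_per_block,
              max_blocks=32, max_threads=2048,
              regs_per_sm=65536, smem_per_sm=228 * 1024):
    """Resident blocks per SM = min over each resource limit (simplified)."""
    limits = [
        max_blocks,
        max_threads // threads_per_block,
        regs_per_sm // (regs_per_thread * threads_per_block),
        smem_per_sm // smem_per_block if smem_per_block else max_blocks,
    ]
    blocks = min(limits)
    active_warps = blocks * threads_per_block // 32
    return blocks, active_warps / (max_threads // 32)

# 256-thread blocks, 32 registers/thread, 48KB shared memory per block
blocks, occ = occupancy(regs_per_thread=32, smem_per_block=48 * 1024,
                        threads_per_block=256)
print(f"{blocks} resident blocks, occupancy = {occ:.0%}")
# -> 4 resident blocks, occupancy = 50% (shared memory is the limiter here)
```

In this example shared memory caps residency at 4 blocks (228KB / 48KB), so only 32 of 64 possible warps are active: 50% occupancy, right at the threshold for hiding memory latency.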

8. The History of CUDA: Why NVIDIA Owns AI Infrastructure

Understanding CUDA's rise explains why, even in 2024, NVIDIA holds over 80% of the AI accelerator market.

Pre-2006: The OpenGL Shader Hacking Era

Researchers knew GPUs were fast, but using them for general computation required encoding calculations as OpenGL texture operations or shader programs. Matrix multiplication had to be disguised as an image filter to run on GPU.

2006: CUDA 1.0 Launch

Jensen Huang made a bet: "We'll let programmers write GPU code in C." At the time, this meant adding general-purpose computing logic to hardware designed for gaming. The NVIDIA board had to approve dedicating silicon area to this experiment.

Early CUDA users: physics simulation researchers, CFD teams, molecular dynamics groups. Deep learning hadn't emerged yet.

2012: The AlexNet Moment

Alex Krizhevsky used two GTX 580 consumer gaming GPUs to train AlexNet, which crushed ImageNet competition with a massive accuracy lead. This single event triggered the deep learning revolution. GTX 580 was a gaming GPU, but CUDA made it a parallel computing platform.

CUDA vs OpenCL: The Ecosystem War

The Khronos Group's OpenCL supported AMD, Intel, and NVIDIA through an open standard. CUDA won despite being proprietary, for three reasons:

  1. Developer tools: cuDNN, cuBLAS, Nsight profiler — NVIDIA's ecosystem was far richer
  2. First-mover effect: Research code published in papers was all CUDA, so labs standardized on it
  3. Hardware exclusivity: Tensor Cores, NVLink, and other features are only accessible through CUDA

Result: PyTorch, TensorFlow, and JAX all use CUDA by default. AMD's ROCm and Intel's oneAPI are catching up, but the gap remains substantial.


9. Practical Optimization: Extracting Peak GEMM Performance on H100

The H100's theoretical FP16 Tensor Core performance is 989 TFLOPS. Here's how to approach that number:

import torch
import time

# 1. Use FP16 or BF16 to activate Tensor Cores
A = torch.randn(8192, 8192, device='cuda', dtype=torch.bfloat16)
B = torch.randn(8192, 8192, device='cuda', dtype=torch.bfloat16)

# 2. Warmup: first runs include kernel compilation/selection
for _ in range(5):
    _ = torch.matmul(A, B)
torch.cuda.synchronize()

# 3. Measure
start = time.perf_counter()
for _ in range(100):
    C = torch.matmul(A, B)
torch.cuda.synchronize()
end = time.perf_counter()

# 4. Compute TFLOPS
# 8192x8192 matmul: 2 * N^3 = 2 * 8192^3 = 1.1e12 FLOPs per call
flops = 2 * 8192**3 * 100
elapsed = end - start
tflops = flops / elapsed / 1e12
print(f"Achieved: {tflops:.1f} TFLOPS")
# Target: 700+ TFLOPS (70%+ of 989 TFLOPS theoretical is achievable)

Factors that determine performance:

  1. Memory alignment: Matrices must be 128-byte aligned for optimal memory coalescing
  2. Matrix dimensions: Must be multiples of 16 for Tensor Cores to operate without padding
  3. torch.backends.cudnn.benchmark = True: Lets cuDNN auto-select the fastest algorithm on first run
  4. NVLink: In multi-GPU setups, GPU-to-GPU communication bandwidth (H100 NVLink = 900 GB/s) matters for model parallelism
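For point 2, rounding a dimension up to the next multiple of 16 is one line of arithmetic. A minimal sketch with a hypothetical `pad_to_multiple` helper (in PyTorch you would apply the resulting padding with torch.nn.functional.pad):

```python
def pad_to_multiple(n, m=16):
    """Smallest multiple of m that is >= n."""
    return ((n + m - 1) // m) * m

print(pad_to_multiple(4096))  # 4096 (already aligned, no padding needed)
print(pad_to_multiple(4100))  # 4112
print(pad_to_multiple(1000))  # 1008
```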

10. Summary: Why GPUs Dominate AI

The GPU's dominance in AI computation is not simply a matter of "more cores." It is the combination of three factors:

  1. Hardware design: Thousands of simple cores + HBM high-bandwidth memory + Tensor Cores as dedicated matrix multiply hardware
  2. Programming model: CUDA's SIMT model + hierarchical memory (shared memory, registers) + fine-grained control
  3. Ecosystem: 18 years of accumulated cuDNN, cuBLAS, NCCL, Nsight, PyTorch/TF integration

The overwhelming majority of deep learning FLOPs reduce to matrix multiplication, and the H100's Tensor Cores are purpose-built hardware for exactly that. This is why NVIDIA holds the critical infrastructure position for the AI era.

The next post dissects cuDNN's internals: how it handles convolutions using Winograd's algorithm and im2col transformation, the FlashAttention tiling trick, and why torch.backends.cudnn.benchmark = True can double your training throughput.