Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

When you first approach GPU programming, the most confusing part is understanding how a single line of kernel code fans out into thousands of executing units on real hardware. A CPU is designed to run sequential work fast on a handful of cores, while a GPU is designed to run the same operation across thousands of simple compute units simultaneously. If you miss this distinction, your code may run but perform worse than a CPU.

In this article, we organize everything from CUDA's logical execution model (grid, block, warp, thread) to the actual hardware (SM, memory hierarchy, tensor cores) around diagrams. Once you have translated these abstract concepts into pictures, you can simulate "why is this code slow" in your head during performance tuning.

Here is what we cover:

- Execution model: the grid, block, warp, and thread hierarchy

- Hardware: the internal structure of an SM (Streaming Multiprocessor)

- Memory hierarchy: registers, shared memory, L1/L2 cache, global HBM

- Coalesced memory access patterns

- Warp scheduling and occupancy

- Tensor cores and mixed-precision compute

- Streams and asynchronous execution

- A simple kernel example and common pitfalls

> Concrete numbers for specific architectures (for example, Hopper or Blackwell) vary by generation and chip, so always confirm exact values in the official documentation and each chip's datasheet. This article focuses on the parts that are conceptually consistent across generations.

Execution Model: The Logical Hierarchy

The CUDA execution model is strictly hierarchical. Launching a kernel creates many threads, and those threads are not scattered randomly; they are organized into the following hierarchy.

┌──────────────────────────────┐

│ GRID │

│ (one kernel launch = 1 grid) │

│ │

│ ┌────────┐ ┌────────┐ │

│ │ BLOCK │ │ BLOCK │ │

│ │ (0,0) │ │ (1,0) │ │

│ └────────┘ └────────┘ │

│ ┌────────┐ ┌────────┐ │

│ │ BLOCK │ │ BLOCK │ │

│ │ (0,1) │ │ (1,1) │ │

│ └────────┘ └────────┘ │

└──────────────────────────────┘

│

▼ (zoom into one block)

┌──────────────────────────────┐

│ BLOCK │

│ up to 1024 threads (example) │

│ │

│ Thread Thread Thread ... │

│ (0) (1) (2) │

│ │

│ -- grouped 32 at a time -- │

│ [warp 0] [warp 1] ... │

└──────────────────────────────┘

│

▼ (zoom into one warp)

┌──────────────────────────────┐

│ WARP │

│ 32 threads execute the same │

│ instruction together (SIMT) │

│ │

│ T0 T1 T2 ... T30 T31 │

│ └──── lockstep ────┘ │

└──────────────────────────────┘

Grid

The full set of threads created by one kernel launch forms a single grid. A grid can be 1D, 2D, or 3D, which maps naturally onto matrix or volume data.

Block

A grid is divided into blocks. Threads inside the same block enjoy two privileges:

1. They can exchange data quickly through **shared memory**.

2. They can coordinate progress through **synchronization (`__syncthreads()`)**.

Threads in different blocks cannot synchronize directly by default. This constraint is exactly what guarantees the GPU's scalability. Because blocks are independent, the hardware is free to place them on whatever SMs are available.

Warp

From the hardware's perspective, the real scheduling unit is the **warp**, not the thread. A warp is a group of 32 threads that execute the same instruction at the same time. This model is called SIMT (Single Instruction, Multiple Threads).

SIMT execution (1 instruction -> 32 threads)

PC ──▶ [ ADD r1, r2, r3 ]

│

┌────────┼─────────────────────────────────┐

▼ ▼ ▼ ... ▼ ▼

T0 T1 T2 T30 T31

(each thread runs the same instruction on its own registers)

A key pitfall appears here. If threads within the same warp take different branches (for example, `if`-`else`), the hardware executes both paths sequentially. This is called **warp divergence** and is a major source of slowdowns.

Warp divergence (serialization due to branching)

if (threadIdx.x < 16) { A() } else { B() }

cycles 1..k : T0..T15 -> run A(), T16..T31 -> idle (masked off)

cycles k+1..: T0..T15 -> idle, T16..T31 -> run B()

result: time(A) + time(B) (serial, not parallel)

Index Computation

Each thread computes its own global position as follows. This formula appears at the top of nearly every CUDA kernel.

int idx = blockIdx.x * blockDim.x + threadIdx.x;

Global index when blockDim.x = 4

blockIdx.x: 0 1 2

┌────────┐ ┌────────┐ ┌────────┐

threadIdx.x: │0 1 2 3 │ │0 1 2 3 │ │0 1 2 3 │

global idx : │0 1 2 3 │ │4 5 6 7 │ │8 9 10 11│

└────────┘ └────────┘ └────────┘

Hardware Structure: The SM (Streaming Multiprocessor)

The hardware unit that actually executes the logical model is the **SM (Streaming Multiprocessor)**. A single GPU consists of dozens of SMs, and blocks are assigned to SMs for execution. One SM can usually host several blocks at the same time.

┌─────────────────────────────────────────────────────────────┐

│ SM (Streaming Multiprocessor) │

│ │

│ ┌───────────────┐ ┌───────────────┐ │

│ │ Warp Scheduler│ │ Warp Scheduler│ (often 4 partitions) │

│ │ + Dispatch │ │ + Dispatch │ │

│ └───────┬───────┘ └───────┬───────┘ │

│ ▼ ▼ │

│ ┌──────────────┐ ┌──────────────┐ │

│ │ CUDA Cores │ │ CUDA Cores │ (FP32 / INT math) │

│ │ ████████████│ │ ████████████│ │

│ └──────────────┘ └──────────────┘ │

│ ┌──────────────┐ ┌──────────────┐ │

│ │ Tensor Cores │ │ Tensor Cores │ (matrix-mul accel) │

│ └──────────────┘ └──────────────┘ │

│ ┌──────────────┐ ┌──────────────┐ │

│ │ LD/ST │ │ SFU │ (memory / transcend) │

│ └──────────────┘ └──────────────┘ │

│ │

│ ┌───────────────────────────────────────────────────────┐ │

│ │ Register File (tens of thousands of 32-bit) │ │

│ └───────────────────────────────────────────────────────┘ │

│ ┌───────────────────────────────────────────────────────┐ │

│ │ Shared Memory / L1 Cache (partitioned per block) │ │

│ └───────────────────────────────────────────────────────┘ │

└─────────────────────────────────────────────────────────────┘

The main components inside an SM are:

- **Warp Scheduler**: picks a ready warp and dispatches it to the execution units. An SM is usually divided into partitions, each with its own scheduler.

- **CUDA cores**: ALUs that perform basic arithmetic such as FP32/INT.

- **Tensor cores**: dedicated units that process small matrix multiply-accumulate (MMA) operations in one step.

- **LD/ST units**: handle memory loads and stores.

- **SFU (Special Function Unit)**: computes transcendental functions like `sin`, `sqrt`, `exp` quickly.

- **Register file**: the fastest storage, holding per-thread local variables.

- **Shared memory / L1**: fast on-chip memory used by blocks within the SM.

An important insight here is that the register file and shared memory are **finite resources**. The number of blocks and warps an SM can host simultaneously is determined by how much of these resources each kernel uses. This is precisely the essence of **occupancy**, which we discuss next.

Memory Hierarchy

It is no exaggeration to say that 80% of GPU performance is decided in memory. Compute units are extremely fast, but they sit idle if data is not delivered in time. CUDA memory forms a hierarchy where speed and capacity are inversely related.

Memory hierarchy (faster/smaller toward the top)

▲ fast / small

│ ┌──────────────────────────┐

│ │ Registers (per thread) │ a few cycles, tens-hundreds KB/SM

│ ├──────────────────────────┤

│ │ Shared Memory / L1 (block)│ ~tens of cycles, tens-hundreds KB/SM

│ ├──────────────────────────┤

│ │ L2 Cache (shared by SMs) │ ~hundreds of cycles, a few-tens MB

│ ├──────────────────────────┤

│ │ Global Memory (HBM, DRAM)│ ~hundreds of cycles, tens of GB

│ └──────────────────────────┘

│ slow / large

▼

The characteristics of each memory space are summarized below. (Concrete numbers vary by architecture, so treat these as relative tendencies.)

| --- | --- | --- | --- | --- | --- |

The core strategy is clear: **minimize global memory access, and reuse data once read by keeping it in shared memory or registers**. The tiling technique is exactly an implementation of this principle.

Coalesced Memory Access

Global memory is fetched in fixed-size memory segments (for example, 32, 64, or 128 bytes) per transaction. When the 32 threads of a warp access **contiguous addresses**, the hardware merges them into a small number of transactions. This is coalescing.

Coalesced access (good): the warp hits contiguous addresses

Thread: T0 T1 T2 T3 ... T31

Addr : 0 4 8 12 ... 124

└──────────────────────┘

-> served by 1 memory transaction

Uncoalesced access (bad): threads hit scattered addresses

Thread: T0 T1 T2 T3 ...

Addr : 0 512 1024 1536 ...

└─┐ └─┐ └─┐ └─┐

▼ ▼ ▼ ▼

-> a separate transaction per thread -> wasted bandwidth

The typical causes of broken coalescing are:

- Wrong data layout (for example, row-major / column-major mismatch)

- Access with a large stride

- Failing to use Struct of Arrays (SoA) instead of Array of Structs (AoS)

AoS vs SoA

AoS (Array of Structs): [x0 y0 z0][x1 y1 z1][x2 y2 z2]...

-> reading only x is interleaved with y,z -> non-contiguous

SoA (Struct of Arrays): [x0 x1 x2 ...][y0 y1 y2 ...][z0 z1 z2 ...]

-> the x array is contiguous -> coalescing friendly

Warp Scheduling and Occupancy

The real secret to a GPU's speed is not the number of cores but **latency hiding**. When one warp stalls waiting on global memory, the scheduler immediately switches to another warp that is ready to run. With enough warps in flight, memory latency becomes nearly invisible.

Warp scheduling timeline (latency hiding)

time ───────────────────────────────────────▶

Warp0: [run]──[memory wait ........]──[run]

Warp1: [run]──[memory wait ......]──[run]

Warp2: [run]──[memory wait ....]──

Warp3: [run]──[memory wait]

-> even when one warp stalls, others keep the execution units busy

-> the units never go idle (high utilization)

Occupancy

**Occupancy** is the ratio of active warps on an SM to the maximum number of warps that SM can support. Higher occupancy means more warps available to hide latency. However, 100% occupancy is not required for peak performance. Moderate occupancy can hide latency just fine, and sometimes using more registers per thread makes a kernel faster.

The factors that limit occupancy are summarized below.

| Limiting factor | Description | How to relax |

| --- | --- | --- |

| Registers per thread | More registers means fewer concurrent warps | Simplify variables, cap with compiler flags |

| Shared memory per block | More shared memory means fewer concurrent blocks | Adjust tile size |

| Threads per block | Too few means a warp shortage, too many starves resources | Pick a reasonable value like 128 or 256 |

| Max blocks/warps per SM | Hardware limit | Check architecture specs |

Occupancy intuition

Kernel that uses few registers / little shared memory:

SM resources ████████████████ -> many concurrent warps -> high occupancy

Kernel that uses many registers / lots of shared memory:

SM resources ████ -> few warps fit -> low occupancy

In practice, the standard approach is to find a good block size with NVIDIA's Occupancy Calculator or the `cudaOccupancyMaxPotentialBlockSize` API, then confirm the real bottleneck with Nsight Compute.

Tensor Cores

The core operation of deep learning is matrix multiplication (GEMM). Unlike ordinary CUDA cores that process multiply-add at the scalar level, **tensor cores** process a small matrix block's multiply-accumulate (MMA, Matrix Multiply-Accumulate) in a single instruction. This pushes throughput up by several to dozens of times.

Tensor core MMA operation (conceptual)

A (M×K) B (K×N) C (M×N)

┌──────────┐ ┌──────────┐ ┌──────────┐

│ tile │ × │ tile │ + │ accumul. │ ──▶ D = A·B + C

└──────────┘ └──────────┘ └──────────┘

(low prec.) (low prec.) (high-prec. accumulate)

A warp cooperates to perform one tile MMA

inputs are FP16/BF16/INT8 etc., accumulation in FP32 keeps precision

A defining feature of tensor cores is **mixed precision**. Inputs are taken in lower precision such as FP16, BF16, TF32, INT8, and in the latest generations FP8, while accumulation is done in higher precision like FP32 to limit accuracy loss.

| Input precision | Feature | Main use |

| --- | --- | --- |

| FP16 | Half precision, broad support | Training/inference |

| BF16 | Same exponent range as FP32 | Training stability |

| TF32 | Processes FP32 inputs internally | Minimal code change |

| INT8 | Integer, quantized inference | Low-latency inference |

| FP8 | Supported in latest generations | Large-scale training/inference |

In most cases, rather than driving tensor cores directly, you use them through high-level libraries like cuBLAS, cuDNN, and CUTLASS. To write them by hand, you use warp-level `mma` family APIs.

> Recent generations like Hopper and Blackwell have expanded tensor core throughput, supported precisions (for example, FP8), and asynchronous data movement features each generation. Specific supported precisions and performance numbers differ by generation and chip, so confirm the official specifications.

Streams and Asynchronous Execution

By default, CUDA work runs sequentially on a single default stream. But when you use multiple **streams**, you can overlap independent work. A classic example is running data transfer and kernel computation at the same time to reduce total time.

Single stream (serial): transfer then compute, one after another

H2D[==] Kernel[======] D2H[==]

─────────────────────────────────▶ time

Multiple streams (overlap): chunked and overlapped

S1: H2D[==] Kernel[====] D2H[==]

S2: H2D[==] Kernel[====] D2H[==]

S3: H2D[==] Kernel[====] D2H[==]

──────────────────────────────────────▶ time

-> transfer and compute overlap -> shorter total time

To exploit overlap properly, you need the following conditions:

- Host memory must be **page-locked (pinned)** for asynchronous copies to work.

- You must explicitly specify a stream for `cudaMemcpyAsync` and the kernel launch.

- You can manage cross-stream dependencies with events (`cudaEvent`).

cudaStream_t stream;

cudaStreamCreate(&stream);

cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);

myKernel<<<grid, block, 0, stream>>>(d_in, d_out, n);

cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);

cudaStreamDestroy(stream);

In recent generations, techniques such as capturing a repeated work graph ahead of time with CUDA Graphs to cut launch overhead, or overlapping transfer and compute more tightly with asynchronous memory copy features, are widely used.

Kernel Example: Tiled Matrix Multiplication

As an example that brings the concepts together, let us look at tiled matrix multiplication using shared memory. The key idea is to read a small tile from global memory once into shared memory, then reuse that tile many times to reduce global access.

First, the simplest vector addition.

__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {

int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx < n) {

c[idx] = a[idx] + b[idx];

}

// Host-side launch example

// int threads = 256;

// int blocks = (n + threads - 1) / threads;

// vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

Now matrix multiplication with shared-memory tiling. It computes `C = A * B`, with each block responsible for a 16x16 tile.

#define TILE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {

__shared__ float As[TILE][TILE];

__shared__ float Bs[TILE][TILE];

int row = blockIdx.y * TILE + threadIdx.y;

int col = blockIdx.x * TILE + threadIdx.x;

float acc = 0.0f;

// iterate over the K dimension tile by tile

for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {

// load one tile into shared memory (coalesced access)

int aCol = t * TILE + threadIdx.x;

int bRow = t * TILE + threadIdx.y;

As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;

Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;

__syncthreads(); // ensure the tile is fully loaded

// compute the partial sum from shared memory (no global access)

for (int k = 0; k < TILE; ++k) {

acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];

}

__syncthreads(); // sync before loading the next tile

}

if (row < N && col < N) {

C[row * N + col] = acc;

}

Why this kernel is fast, in diagram form:

Tiling reuses global accesses

Naive version: for each C[i][j], read N of A's row + N of B's col from global

-> the same data is loaded over and over

Tiled version: load a TILE×TILE block into shared memory once ->

every thread in the block reuses that tile

-> global traffic drops by roughly a factor of TILE

┌─────────┐ load ┌──────────────┐ reuse ┌──────────┐

│ Global │ ───────▶ │ Shared (tile) │ ──────▶ │ 256 │

│ (HBM) │ once │ 16×16 │ many │ threads │

└─────────┘ └──────────────┘ └──────────┘

Note why `__syncthreads()` appears twice. The first waits until every thread finishes loading the tile, preventing a data race; the second guarantees the current computation is done before the next tile overwrites it. Omitting either synchronization corrupts results non-deterministically.

Common Pitfalls and Checklist

Finally, here are pitfalls you will frequently run into in practice.

1. Uncoalesced Memory Access

The most common and most damaging. Switch your data layout to SoA, or adjust indexing so access is contiguous. The habit of checking the profiler's memory-efficiency metric first matters.

2. Warp Divergence

When branching within a warp diverges by data, execution serializes. Where possible, align branch conditions with warp boundaries, or consider replacing branches with arithmetic (predication).

3. Shared Memory Bank Conflicts

Shared memory is split into banks; when threads of a warp access the same bank simultaneously, a conflict serializes them. Adding one column of padding (for example, `[TILE][TILE+1]`) to avoid conflicts is a common technique.

4. Excessive Register Usage

Using too many registers lowers occupancy, and exceeding the limit spills to slow local memory. Try splitting the kernel into smaller pieces or reducing variables.

5. Missing or Misused Synchronization

Calling `__syncthreads()` from only some threads inside a branch causes a deadlock. Write code so that all threads pass through the same synchronization point.

6. Skipping Error Checks

If you do not check the return values of CUDA calls and `cudaGetLastError()`, you can silently get wrong results. In debug builds, checking errors after each call is the safe choice.

| Symptom | Suspected cause | Tool to check |

| --- | --- | --- |

| Slow but cores seem idle | Memory bound, uncoalescing | Nsight Compute |

| Low occupancy | Too many registers / shared memory | Occupancy Calculator |

| Branch-heavy kernel is slow | Warp divergence | Nsight Compute |

| Results occasionally wrong | Missing sync, race condition | compute-sanitizer |

| Transfer does not overlap compute | Pinned memory not used | Nsight Systems |

Conclusion

Using CUDA well comes down to two things. First, you must be able to picture **how the logical execution model (grid, block, warp) maps onto real hardware (SM, memory hierarchy)**. Second, **how well you place and reuse data across the memory hierarchy** decides most of the performance.

If you write code while recalling the diagrams in this article, you can explain "why is this kernel slow" structurally rather than by guessing. The next step is to validate your hypotheses with a profiler like Nsight and iterate.

Finally, architectures evolve quickly across generations. Concrete numbers such as tensor core precision support, asynchronous memory movement, and per-SM resources change by chip and version, so during real tuning always confirm the official documentation for your architecture.

References

- [CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/)

- [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/)

- [CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/)

- [Parallel Thread Execution (PTX) ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/)

- [cuBLAS Library Documentation](https://docs.nvidia.com/cuda/cublas/)

- [Matrix Multiplication Performance Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html)

- [NVIDIA Developer Blog](https://developer.nvidia.com/blog/)

- [Nsight Compute Documentation](https://docs.nvidia.com/nsight-compute/)

- [CUTLASS GitHub Repository](https://github.com/NVIDIA/cutlass)