CUDA Architecture Visualized — From Threads to Tensor Cores

Introduction
Execution Model: The Logical Hierarchy
Hardware Structure: The SM (Streaming Multiprocessor)
Memory Hierarchy
- Coalesced Memory Access
Warp Scheduling and Occupancy
- Occupancy
Tensor Cores
Streams and Asynchronous Execution
Kernel Example: Tiled Matrix Multiplication
Common Pitfalls and Checklist
Conclusion
References

Introduction

When you first approach GPU programming, the most confusing part is understanding how a single line of kernel code fans out into thousands of executing units on real hardware. A CPU is designed to run sequential work fast on a handful of cores, while a GPU is designed to run the same operation across thousands of simple compute units simultaneously. If you miss this distinction, your code may run but perform worse than a CPU.

In this article, we organize everything from CUDA's logical execution model (grid, block, warp, thread) to the actual hardware (SM, memory hierarchy, tensor cores) around diagrams. Once you have translated these abstract concepts into pictures, you can simulate "why is this code slow" in your head during performance tuning.

Here is what we cover:

Execution model: the grid, block, warp, and thread hierarchy
Hardware: the internal structure of an SM (Streaming Multiprocessor)
Memory hierarchy: registers, shared memory, L1/L2 cache, global HBM
Coalesced memory access patterns
Warp scheduling and occupancy
Tensor cores and mixed-precision compute
Streams and asynchronous execution
A simple kernel example and common pitfalls

Concrete numbers for specific architectures (for example, Hopper or Blackwell) vary by generation and chip, so always confirm exact values in the official documentation and each chip's datasheet. This article focuses on the parts that are conceptually consistent across generations.

Execution Model: The Logical Hierarchy

The CUDA execution model is strictly hierarchical. Launching a kernel creates many threads, and those threads are not scattered randomly; they are organized into the following hierarchy.

                          ┌──────────────────────────────┐
                          │           GRID               │
                          │  (one kernel launch = 1 grid) │
                          │                              │
                          │   ┌────────┐   ┌────────┐    │
                          │   │ BLOCK  │   │ BLOCK  │    │
                          │   │ (0,0)  │   │ (1,0)  │    │
                          │   └────────┘   └────────┘    │
                          │   ┌────────┐   ┌────────┐    │
                          │   │ BLOCK  │   │ BLOCK  │    │
                          │   │ (0,1)  │   │ (1,1)  │    │
                          │   └────────┘   └────────┘    │
                          └──────────────────────────────┘
                                       │
                                       ▼  (zoom into one block)
                          ┌──────────────────────────────┐
                          │           BLOCK              │
                          │  up to 1024 threads (example) │
                          │                              │
                          │  Thread Thread Thread ...    │
                          │   (0)    (1)    (2)          │
                          │                              │
                          │  -- grouped 32 at a time --  │
                          │   [warp 0] [warp 1] ...      │
                          └──────────────────────────────┘
                                       │
                                       ▼  (zoom into one warp)
                          ┌──────────────────────────────┐
                          │           WARP               │
                          │  32 threads execute the same  │
                          │  instruction together (SIMT)  │
                          │                              │
                          │  T0 T1 T2 ... T30 T31        │
                          │  └──── lockstep ────┘         │
                          └──────────────────────────────┘

Grid

The full set of threads created by one kernel launch forms a single grid. A grid can be 1D, 2D, or 3D, which maps naturally onto matrix or volume data.

Block

A grid is divided into blocks. Threads inside the same block enjoy two privileges:

They can exchange data quickly through shared memory.
They can coordinate progress through synchronization (__syncthreads()).

Threads in different blocks cannot synchronize directly by default. This constraint is exactly what guarantees the GPU's scalability. Because blocks are independent, the hardware is free to place them on whatever SMs are available.

Warp

From the hardware's perspective, the real scheduling unit is the warp, not the thread. A warp is a group of 32 threads that execute the same instruction at the same time. This model is called SIMT (Single Instruction, Multiple Threads).

   SIMT execution (1 instruction -> 32 threads)

   PC ──▶ [ ADD r1, r2, r3 ]
            │
   ┌────────┼─────────────────────────────────┐
   ▼        ▼        ▼      ...      ▼          ▼
  T0       T1       T2             T30         T31
  (each thread runs the same instruction on its own registers)

A key pitfall appears here. If threads within the same warp take different branches (for example, if-else), the hardware executes both paths sequentially. This is called warp divergence and is a major source of slowdowns.

   Warp divergence (serialization due to branching)

   if (threadIdx.x < 16) { A() } else { B() }

   cycles 1..k :  T0..T15 -> run A(),  T16..T31 -> idle (masked off)
   cycles k+1..: T0..T15 -> idle,      T16..T31 -> run B()

   result: time(A) + time(B)  (serial, not parallel)

Index Computation

Each thread computes its own global position as follows. This formula appears at the top of nearly every CUDA kernel.

int idx = blockIdx.x * blockDim.x + threadIdx.x;

   Global index when blockDim.x = 4

   blockIdx.x:      0           1           2
                ┌────────┐  ┌────────┐  ┌────────┐
   threadIdx.x: │0 1 2 3 │  │0 1 2 3 │  │0 1 2 3 │
   global idx : │0 1 2 3 │  │4 5 6 7 │  │8 9 10 11│
                └────────┘  └────────┘  └────────┘

Hardware Structure: The SM (Streaming Multiprocessor)

The hardware unit that actually executes the logical model is the SM (Streaming Multiprocessor). A single GPU consists of dozens of SMs, and blocks are assigned to SMs for execution. One SM can usually host several blocks at the same time.

 ┌─────────────────────────────────────────────────────────────┐
 │                  SM (Streaming Multiprocessor)              │
 │                                                             │
 │  ┌───────────────┐  ┌───────────────┐                       │
 │  │ Warp Scheduler│  │ Warp Scheduler│  (often 4 partitions) │
 │  │  + Dispatch   │  │  + Dispatch   │                       │
 │  └───────┬───────┘  └───────┬───────┘                       │
 │          ▼                  ▼                               │
 │  ┌──────────────┐   ┌──────────────┐                        │
 │  │  CUDA Cores  │   │  CUDA Cores  │   (FP32 / INT math)    │
 │  │  ████████████│   │  ████████████│                        │
 │  └──────────────┘   └──────────────┘                        │
 │  ┌──────────────┐   ┌──────────────┐                        │
 │  │ Tensor Cores │   │ Tensor Cores │   (matrix-mul accel)   │
 │  └──────────────┘   └──────────────┘                        │
 │  ┌──────────────┐   ┌──────────────┐                        │
 │  │     LD/ST     │   │     SFU      │   (memory / transcend) │
 │  └──────────────┘   └──────────────┘                        │
 │                                                             │
 │  ┌───────────────────────────────────────────────────────┐  │
 │  │        Register File (tens of thousands of 32-bit)     │  │
 │  └───────────────────────────────────────────────────────┘  │
 │  ┌───────────────────────────────────────────────────────┐  │
 │  │     Shared Memory / L1 Cache (partitioned per block)   │  │
 │  └───────────────────────────────────────────────────────┘  │
 └─────────────────────────────────────────────────────────────┘

The main components inside an SM are:

Warp Scheduler: picks a ready warp and dispatches it to the execution units. An SM is usually divided into partitions, each with its own scheduler.
CUDA cores: ALUs that perform basic arithmetic such as FP32/INT.
Tensor cores: dedicated units that process small matrix multiply-accumulate (MMA) operations in one step.
LD/ST units: handle memory loads and stores.
SFU (Special Function Unit): computes transcendental functions like sin, sqrt, exp quickly.
Register file: the fastest storage, holding per-thread local variables.
Shared memory / L1: fast on-chip memory used by blocks within the SM.

An important insight here is that the register file and shared memory are finite resources. The number of blocks and warps an SM can host simultaneously is determined by how much of these resources each kernel uses. This is precisely the essence of occupancy, which we discuss next.

Memory Hierarchy

It is no exaggeration to say that 80% of GPU performance is decided in memory. Compute units are extremely fast, but they sit idle if data is not delivered in time. CUDA memory forms a hierarchy where speed and capacity are inversely related.

                  Memory hierarchy (faster/smaller toward the top)

           ▲  fast / small
           │   ┌──────────────────────────┐
           │   │  Registers (per thread)   │  a few cycles, tens-hundreds KB/SM
           │   ├──────────────────────────┤
           │   │ Shared Memory / L1 (block)│  ~tens of cycles, tens-hundreds KB/SM
           │   ├──────────────────────────┤
           │   │  L2 Cache (shared by SMs) │  ~hundreds of cycles, a few-tens MB
           │   ├──────────────────────────┤
           │   │  Global Memory (HBM, DRAM)│  ~hundreds of cycles, tens of GB
           │   └──────────────────────────┘
           │  slow / large
           ▼

The characteristics of each memory space are summarized below. (Concrete numbers vary by architecture, so treat these as relative tendencies.)

Memory space	Scope	Relative latency	Relative bandwidth	Capacity	Notes
Registers	Thread	Lowest	Highest	Very small	Allocated by compiler
Shared memory	Block	Low	High	Small	Explicitly managed, banked
L1 cache	SM	Low	High	Small	Shares resources with shared memory
L2 cache	Whole GPU	Medium	Medium	Medium	Automatic caching
Global (HBM)	Whole GPU	Highest	Large in absolute terms but relatively slow	Large	Exchanges data with host
Constant memory	Whole GPU	Low on cache hit	Favorable for broadcast	Small	Read only
Local memory	Thread	High	Low	Lives in global	Used on register spill

The core strategy is clear: minimize global memory access, and reuse data once read by keeping it in shared memory or registers. The tiling technique is exactly an implementation of this principle.

Coalesced Memory Access

Global memory is fetched in fixed-size memory segments (for example, 32, 64, or 128 bytes) per transaction. When the 32 threads of a warp access contiguous addresses, the hardware merges them into a small number of transactions. This is coalescing.

   Coalesced access (good): the warp hits contiguous addresses
   Thread:  T0  T1  T2  T3  ...  T31
   Addr  :  0   4   8   12  ...  124
            └──────────────────────┘
            -> served by 1 memory transaction

   Uncoalesced access (bad): threads hit scattered addresses
   Thread:  T0     T1       T2        T3   ...
   Addr  :  0     512      1024     1536  ...
            └─┐   └─┐      └─┐      └─┐
              ▼     ▼        ▼        ▼
            -> a separate transaction per thread -> wasted bandwidth

The typical causes of broken coalescing are:

Wrong data layout (for example, row-major / column-major mismatch)
Access with a large stride
Failing to use Struct of Arrays (SoA) instead of Array of Structs (AoS)

   AoS vs SoA

   AoS (Array of Structs):  [x0 y0 z0][x1 y1 z1][x2 y2 z2]...
     -> reading only x is interleaved with y,z -> non-contiguous

   SoA (Struct of Arrays):  [x0 x1 x2 ...][y0 y1 y2 ...][z0 z1 z2 ...]
     -> the x array is contiguous -> coalescing friendly

Warp Scheduling and Occupancy

The real secret to a GPU's speed is not the number of cores but latency hiding. When one warp stalls waiting on global memory, the scheduler immediately switches to another warp that is ready to run. With enough warps in flight, memory latency becomes nearly invisible.

   Warp scheduling timeline (latency hiding)

   time ───────────────────────────────────────▶

   Warp0: [run]──[memory wait ........]──[run]
   Warp1:       [run]──[memory wait ......]──[run]
   Warp2:              [run]──[memory wait ....]──
   Warp3:                     [run]──[memory wait]

   -> even when one warp stalls, others keep the execution units busy
   -> the units never go idle (high utilization)

Occupancy

Occupancy is the ratio of active warps on an SM to the maximum number of warps that SM can support. Higher occupancy means more warps available to hide latency. However, 100% occupancy is not required for peak performance. Moderate occupancy can hide latency just fine, and sometimes using more registers per thread makes a kernel faster.

The factors that limit occupancy are summarized below.

Limiting factor	Description	How to relax
Registers per thread	More registers means fewer concurrent warps	Simplify variables, cap with compiler flags
Shared memory per block	More shared memory means fewer concurrent blocks	Adjust tile size
Threads per block	Too few means a warp shortage, too many starves resources	Pick a reasonable value like 128 or 256
Max blocks/warps per SM	Hardware limit	Check architecture specs

   Occupancy intuition

   Kernel that uses few registers / little shared memory:
     SM resources ████████████████  -> many concurrent warps -> high occupancy

   Kernel that uses many registers / lots of shared memory:
     SM resources ████             -> few warps fit -> low occupancy

In practice, the standard approach is to find a good block size with NVIDIA's Occupancy Calculator or the cudaOccupancyMaxPotentialBlockSize API, then confirm the real bottleneck with Nsight Compute.

Tensor Cores

The core operation of deep learning is matrix multiplication (GEMM). Unlike ordinary CUDA cores that process multiply-add at the scalar level, tensor cores process a small matrix block's multiply-accumulate (MMA, Matrix Multiply-Accumulate) in a single instruction. This pushes throughput up by several to dozens of times.

   Tensor core MMA operation (conceptual)

        A (M×K)        B (K×N)         C (M×N)
     ┌──────────┐   ┌──────────┐    ┌──────────┐
     │  tile    │ × │  tile     │ + │ accumul.  │ ──▶ D = A·B + C
     └──────────┘   └──────────┘    └──────────┘
       (low prec.)    (low prec.)    (high-prec. accumulate)

   A warp cooperates to perform one tile MMA
   inputs are FP16/BF16/INT8 etc., accumulation in FP32 keeps precision

A defining feature of tensor cores is mixed precision. Inputs are taken in lower precision such as FP16, BF16, TF32, INT8, and in the latest generations FP8, while accumulation is done in higher precision like FP32 to limit accuracy loss.

Input precision	Feature	Main use
FP16	Half precision, broad support	Training/inference
BF16	Same exponent range as FP32	Training stability
TF32	Processes FP32 inputs internally	Minimal code change
INT8	Integer, quantized inference	Low-latency inference
FP8	Supported in latest generations	Large-scale training/inference

In most cases, rather than driving tensor cores directly, you use them through high-level libraries like cuBLAS, cuDNN, and CUTLASS. To write them by hand, you use warp-level mma family APIs.

Recent generations like Hopper and Blackwell have expanded tensor core throughput, supported precisions (for example, FP8), and asynchronous data movement features each generation. Specific supported precisions and performance numbers differ by generation and chip, so confirm the official specifications.

Streams and Asynchronous Execution

By default, CUDA work runs sequentially on a single default stream. But when you use multiple streams, you can overlap independent work. A classic example is running data transfer and kernel computation at the same time to reduce total time.

   Single stream (serial): transfer then compute, one after another

   H2D[==] Kernel[======] D2H[==]
   ─────────────────────────────────▶ time

   Multiple streams (overlap): chunked and overlapped

   S1: H2D[==] Kernel[====] D2H[==]
   S2:      H2D[==] Kernel[====] D2H[==]
   S3:           H2D[==] Kernel[====] D2H[==]
   ──────────────────────────────────────▶ time
        -> transfer and compute overlap -> shorter total time

To exploit overlap properly, you need the following conditions:

Host memory must be page-locked (pinned) for asynchronous copies to work.
You must explicitly specify a stream for cudaMemcpyAsync and the kernel launch.
You can manage cross-stream dependencies with events (cudaEvent).

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_in, d_out, n);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);

In recent generations, techniques such as capturing a repeated work graph ahead of time with CUDA Graphs to cut launch overhead, or overlapping transfer and compute more tightly with asynchronous memory copy features, are widely used.

Kernel Example: Tiled Matrix Multiplication

As an example that brings the concepts together, let us look at tiled matrix multiplication using shared memory. The key idea is to read a small tile from global memory once into shared memory, then reuse that tile many times to reduce global access.

First, the simplest vector addition.

__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Host-side launch example
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

Now matrix multiplication with shared-memory tiling. It computes C = A * B, with each block responsible for a 16x16 tile.

#define TILE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;

    // iterate over the K dimension tile by tile
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // load one tile into shared memory (coalesced access)
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;

        __syncthreads();  // ensure the tile is fully loaded

        // compute the partial sum from shared memory (no global access)
        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }

        __syncthreads();  // sync before loading the next tile
    }

    if (row < N && col < N) {
        C[row * N + col] = acc;
    }
}

Why this kernel is fast, in diagram form:

   Tiling reuses global accesses

   Naive version: for each C[i][j], read N of A's row + N of B's col from global
                  -> the same data is loaded over and over

   Tiled version: load a TILE×TILE block into shared memory once ->
                  every thread in the block reuses that tile
                  -> global traffic drops by roughly a factor of TILE

   ┌─────────┐   load   ┌──────────────┐  reuse  ┌──────────┐
   │ Global  │ ───────▶ │ Shared (tile) │ ──────▶ │ 256       │
   │  (HBM)  │  once     │   16×16      │  many   │ threads   │
   └─────────┘          └──────────────┘         └──────────┘

Note why __syncthreads() appears twice. The first waits until every thread finishes loading the tile, preventing a data race; the second guarantees the current computation is done before the next tile overwrites it. Omitting either synchronization corrupts results non-deterministically.

Common Pitfalls and Checklist

Finally, here are pitfalls you will frequently run into in practice.

1. Uncoalesced Memory Access

The most common and most damaging. Switch your data layout to SoA, or adjust indexing so access is contiguous. The habit of checking the profiler's memory-efficiency metric first matters.

2. Warp Divergence

When branching within a warp diverges by data, execution serializes. Where possible, align branch conditions with warp boundaries, or consider replacing branches with arithmetic (predication).

3. Shared Memory Bank Conflicts

Shared memory is split into banks; when threads of a warp access the same bank simultaneously, a conflict serializes them. Adding one column of padding (for example, [TILE][TILE+1]) to avoid conflicts is a common technique.

4. Excessive Register Usage

Using too many registers lowers occupancy, and exceeding the limit spills to slow local memory. Try splitting the kernel into smaller pieces or reducing variables.

5. Missing or Misused Synchronization

Calling __syncthreads() from only some threads inside a branch causes a deadlock. Write code so that all threads pass through the same synchronization point.

6. Skipping Error Checks

If you do not check the return values of CUDA calls and cudaGetLastError(), you can silently get wrong results. In debug builds, checking errors after each call is the safe choice.

Symptom	Suspected cause	Tool to check
Slow but cores seem idle	Memory bound, uncoalescing	Nsight Compute
Low occupancy	Too many registers / shared memory	Occupancy Calculator
Branch-heavy kernel is slow	Warp divergence	Nsight Compute
Results occasionally wrong	Missing sync, race condition	compute-sanitizer
Transfer does not overlap compute	Pinned memory not used	Nsight Systems

Conclusion

Using CUDA well comes down to two things. First, you must be able to picture how the logical execution model (grid, block, warp) maps onto real hardware (SM, memory hierarchy). Second, how well you place and reuse data across the memory hierarchy decides most of the performance.

If you write code while recalling the diagrams in this article, you can explain "why is this kernel slow" structurally rather than by guessing. The next step is to validate your hypotheses with a profiler like Nsight and iterate.

Finally, architectures evolve quickly across generations. Concrete numbers such as tensor core precision support, asynchronous memory movement, and per-SM resources change by chip and version, so during real tuning always confirm the official documentation for your architecture.