Skip to content
Published on

CUDA Architecture Visualized — From Threads to Tensor Cores

Authors

Introduction

When you first approach GPU programming, the most confusing part is understanding how a single line of kernel code fans out into thousands of executing units on real hardware. A CPU is designed to run sequential work fast on a handful of cores, while a GPU is designed to run the same operation across thousands of simple compute units simultaneously. If you miss this distinction, your code may run but perform worse than a CPU.

In this article, we organize everything from CUDA's logical execution model (grid, block, warp, thread) to the actual hardware (SM, memory hierarchy, tensor cores) around diagrams. Once you have translated these abstract concepts into pictures, you can simulate "why is this code slow" in your head during performance tuning.

Here is what we cover:

  • Execution model: the grid, block, warp, and thread hierarchy
  • Hardware: the internal structure of an SM (Streaming Multiprocessor)
  • Memory hierarchy: registers, shared memory, L1/L2 cache, global HBM
  • Coalesced memory access patterns
  • Warp scheduling and occupancy
  • Tensor cores and mixed-precision compute
  • Streams and asynchronous execution
  • A simple kernel example and common pitfalls

Concrete numbers for specific architectures (for example, Hopper or Blackwell) vary by generation and chip, so always confirm exact values in the official documentation and each chip's datasheet. This article focuses on the parts that are conceptually consistent across generations.


Execution Model: The Logical Hierarchy

The CUDA execution model is strictly hierarchical. Launching a kernel creates many threads, and those threads are not scattered randomly; they are organized into the following hierarchy.

                          ┌──────────────────────────────┐
                          │           GRID               │
                          │  (one kernel launch = 1 grid) │
                          │                              │
                          │   ┌────────┐   ┌────────┐    │
                          │   │ BLOCK  │   │ BLOCK  │    │
                          │   │ (0,0)  │   │ (1,0)  │    │
                          │   └────────┘   └────────┘    │
                          │   ┌────────┐   ┌────────┐    │
                          │   │ BLOCK  │   │ BLOCK  │    │
                          │   │ (0,1)  │   │ (1,1)  │    │
                          │   └────────┘   └────────┘    │
                          └──────────────────────────────┘
                                       ▼  (zoom into one block)
                          ┌──────────────────────────────┐
                          │           BLOCK              │
                          │  up to 1024 threads (example) │
                          │                              │
                          │  Thread Thread Thread ...    │
                          │   (0)    (1)    (2)          │
                          │                              │
                          │  -- grouped 32 at a time --  │
                          │   [warp 0] [warp 1] ...      │
                          └──────────────────────────────┘
                                       ▼  (zoom into one warp)
                          ┌──────────────────────────────┐
                          │           WARP               │
                          │  32 threads execute the same  │
                          │  instruction together (SIMT)  │
                          │                              │
                          │  T0 T1 T2 ... T30 T31        │
                          │  └──── lockstep ────┘         │
                          └──────────────────────────────┘

Grid

The full set of threads created by one kernel launch forms a single grid. A grid can be 1D, 2D, or 3D, which maps naturally onto matrix or volume data.

Block

A grid is divided into blocks. Threads inside the same block enjoy two privileges:

  1. They can exchange data quickly through shared memory.
  2. They can coordinate progress through synchronization (__syncthreads()).

Threads in different blocks cannot synchronize directly by default. This constraint is exactly what guarantees the GPU's scalability. Because blocks are independent, the hardware is free to place them on whatever SMs are available.

Warp

From the hardware's perspective, the real scheduling unit is the warp, not the thread. A warp is a group of 32 threads that execute the same instruction at the same time. This model is called SIMT (Single Instruction, Multiple Threads).

   SIMT execution (1 instruction -> 32 threads)

   PC ──▶ [ ADD r1, r2, r3 ]
   ┌────────┼─────────────────────────────────┐
   ▼        ▼        ▼      ...      ▼          ▼
  T0       T1       T2             T30         T31
  (each thread runs the same instruction on its own registers)

A key pitfall appears here. If threads within the same warp take different branches (for example, if-else), the hardware executes both paths sequentially. This is called warp divergence and is a major source of slowdowns.

   Warp divergence (serialization due to branching)

   if (threadIdx.x < 16) { A() } else { B() }

   cycles 1..k :  T0..T15 -> run A(),  T16..T31 -> idle (masked off)
   cycles k+1..: T0..T15 -> idle,      T16..T31 -> run B()

   result: time(A) + time(B)  (serial, not parallel)

Index Computation

Each thread computes its own global position as follows. This formula appears at the top of nearly every CUDA kernel.

int idx = blockIdx.x * blockDim.x + threadIdx.x;
   Global index when blockDim.x = 4

   blockIdx.x:      0           1           2
                ┌────────┐  ┌────────┐  ┌────────┐
   threadIdx.x: │0 1 2 3 │  │0 1 2 3 │  │0 1 2 3 │
   global idx : │0 1 2 3 │  │4 5 6 7 │  │8 9 10 11│
                └────────┘  └────────┘  └────────┘

Hardware Structure: The SM (Streaming Multiprocessor)

The hardware unit that actually executes the logical model is the SM (Streaming Multiprocessor). A single GPU consists of dozens of SMs, and blocks are assigned to SMs for execution. One SM can usually host several blocks at the same time.

 ┌─────────────────────────────────────────────────────────────┐
 │                  SM (Streaming Multiprocessor)              │
 │                                                             │
 │  ┌───────────────┐  ┌───────────────┐                       │
 │  │ Warp Scheduler│  │ Warp Scheduler│  (often 4 partitions) │
 │  │  + Dispatch   │  │  + Dispatch   │                       │
 │  └───────┬───────┘  └───────┬───────┘                       │
 │          ▼                  ▼                               │
 │  ┌──────────────┐   ┌──────────────┐                        │
 │  │  CUDA Cores  │   │  CUDA Cores  │   (FP32 / INT math)    │
 │  │  ████████████│   │  ████████████│                        │
 │  └──────────────┘   └──────────────┘                        │
 │  ┌──────────────┐   ┌──────────────┐                        │
 │  │ Tensor Cores │   │ Tensor Cores │   (matrix-mul accel)   │
 │  └──────────────┘   └──────────────┘                        │
 │  ┌──────────────┐   ┌──────────────┐                        │
 │  │     LD/ST     │   │     SFU      │   (memory / transcend) │
 │  └──────────────┘   └──────────────┘                        │
 │                                                             │
 │  ┌───────────────────────────────────────────────────────┐  │
 │  │        Register File (tens of thousands of 32-bit)     │  │
 │  └───────────────────────────────────────────────────────┘  │
 │  ┌───────────────────────────────────────────────────────┐  │
 │  │     Shared Memory / L1 Cache (partitioned per block)   │  │
 │  └───────────────────────────────────────────────────────┘  │
 └─────────────────────────────────────────────────────────────┘

The main components inside an SM are:

  • Warp Scheduler: picks a ready warp and dispatches it to the execution units. An SM is usually divided into partitions, each with its own scheduler.
  • CUDA cores: ALUs that perform basic arithmetic such as FP32/INT.
  • Tensor cores: dedicated units that process small matrix multiply-accumulate (MMA) operations in one step.
  • LD/ST units: handle memory loads and stores.
  • SFU (Special Function Unit): computes transcendental functions like sin, sqrt, exp quickly.
  • Register file: the fastest storage, holding per-thread local variables.
  • Shared memory / L1: fast on-chip memory used by blocks within the SM.

An important insight here is that the register file and shared memory are finite resources. The number of blocks and warps an SM can host simultaneously is determined by how much of these resources each kernel uses. This is precisely the essence of occupancy, which we discuss next.


Memory Hierarchy

It is no exaggeration to say that 80% of GPU performance is decided in memory. Compute units are extremely fast, but they sit idle if data is not delivered in time. CUDA memory forms a hierarchy where speed and capacity are inversely related.

                  Memory hierarchy (faster/smaller toward the top)

           ▲  fast / small
           │   ┌──────────────────────────┐
           │   │  Registers (per thread)   │  a few cycles, tens-hundreds KB/SM
           │   ├──────────────────────────┤
           │   │ Shared Memory / L1 (block)│  ~tens of cycles, tens-hundreds KB/SM
           │   ├──────────────────────────┤
           │   │  L2 Cache (shared by SMs) │  ~hundreds of cycles, a few-tens MB
           │   ├──────────────────────────┤
           │   │  Global Memory (HBM, DRAM)│  ~hundreds of cycles, tens of GB
           │   └──────────────────────────┘
           │  slow / large

The characteristics of each memory space are summarized below. (Concrete numbers vary by architecture, so treat these as relative tendencies.)

Memory spaceScopeRelative latencyRelative bandwidthCapacityNotes
RegistersThreadLowestHighestVery smallAllocated by compiler
Shared memoryBlockLowHighSmallExplicitly managed, banked
L1 cacheSMLowHighSmallShares resources with shared memory
L2 cacheWhole GPUMediumMediumMediumAutomatic caching
Global (HBM)Whole GPUHighestLarge in absolute terms but relatively slowLargeExchanges data with host
Constant memoryWhole GPULow on cache hitFavorable for broadcastSmallRead only
Local memoryThreadHighLowLives in globalUsed on register spill

The core strategy is clear: minimize global memory access, and reuse data once read by keeping it in shared memory or registers. The tiling technique is exactly an implementation of this principle.

Coalesced Memory Access

Global memory is fetched in fixed-size memory segments (for example, 32, 64, or 128 bytes) per transaction. When the 32 threads of a warp access contiguous addresses, the hardware merges them into a small number of transactions. This is coalescing.

   Coalesced access (good): the warp hits contiguous addresses
   Thread:  T0  T1  T2  T3  ...  T31
   Addr  :  0   4   8   12  ...  124
            └──────────────────────┘
            -> served by 1 memory transaction

   Uncoalesced access (bad): threads hit scattered addresses
   Thread:  T0     T1       T2        T3   ...
   Addr  :  0     512      1024     1536  ...
            └─┐   └─┐      └─┐      └─┐
              ▼     ▼        ▼        ▼
            -> a separate transaction per thread -> wasted bandwidth

The typical causes of broken coalescing are:

  • Wrong data layout (for example, row-major / column-major mismatch)
  • Access with a large stride
  • Failing to use Struct of Arrays (SoA) instead of Array of Structs (AoS)
   AoS vs SoA

   AoS (Array of Structs):  [x0 y0 z0][x1 y1 z1][x2 y2 z2]...
     -> reading only x is interleaved with y,z -> non-contiguous

   SoA (Struct of Arrays):  [x0 x1 x2 ...][y0 y1 y2 ...][z0 z1 z2 ...]
     -> the x array is contiguous -> coalescing friendly

Warp Scheduling and Occupancy

The real secret to a GPU's speed is not the number of cores but latency hiding. When one warp stalls waiting on global memory, the scheduler immediately switches to another warp that is ready to run. With enough warps in flight, memory latency becomes nearly invisible.

   Warp scheduling timeline (latency hiding)

   time ───────────────────────────────────────▶

   Warp0: [run]──[memory wait ........]──[run]
   Warp1:       [run]──[memory wait ......]──[run]
   Warp2:              [run]──[memory wait ....]──
   Warp3:                     [run]──[memory wait]

   -> even when one warp stalls, others keep the execution units busy
   -> the units never go idle (high utilization)

Occupancy

Occupancy is the ratio of active warps on an SM to the maximum number of warps that SM can support. Higher occupancy means more warps available to hide latency. However, 100% occupancy is not required for peak performance. Moderate occupancy can hide latency just fine, and sometimes using more registers per thread makes a kernel faster.

The factors that limit occupancy are summarized below.

Limiting factorDescriptionHow to relax
Registers per threadMore registers means fewer concurrent warpsSimplify variables, cap with compiler flags
Shared memory per blockMore shared memory means fewer concurrent blocksAdjust tile size
Threads per blockToo few means a warp shortage, too many starves resourcesPick a reasonable value like 128 or 256
Max blocks/warps per SMHardware limitCheck architecture specs
   Occupancy intuition

   Kernel that uses few registers / little shared memory:
     SM resources ████████████████  -> many concurrent warps -> high occupancy

   Kernel that uses many registers / lots of shared memory:
     SM resources ████             -> few warps fit -> low occupancy

In practice, the standard approach is to find a good block size with NVIDIA's Occupancy Calculator or the cudaOccupancyMaxPotentialBlockSize API, then confirm the real bottleneck with Nsight Compute.


Tensor Cores

The core operation of deep learning is matrix multiplication (GEMM). Unlike ordinary CUDA cores that process multiply-add at the scalar level, tensor cores process a small matrix block's multiply-accumulate (MMA, Matrix Multiply-Accumulate) in a single instruction. This pushes throughput up by several to dozens of times.

   Tensor core MMA operation (conceptual)

        A (M×K)        B (K×N)         C (M×N)
     ┌──────────┐   ┌──────────┐    ┌──────────┐
     │  tile    │ × │  tile     │ + │ accumul.  │ ──▶ D = A·B + C
     └──────────┘   └──────────┘    └──────────┘
       (low prec.)    (low prec.)    (high-prec. accumulate)

   A warp cooperates to perform one tile MMA
   inputs are FP16/BF16/INT8 etc., accumulation in FP32 keeps precision

A defining feature of tensor cores is mixed precision. Inputs are taken in lower precision such as FP16, BF16, TF32, INT8, and in the latest generations FP8, while accumulation is done in higher precision like FP32 to limit accuracy loss.

Input precisionFeatureMain use
FP16Half precision, broad supportTraining/inference
BF16Same exponent range as FP32Training stability
TF32Processes FP32 inputs internallyMinimal code change
INT8Integer, quantized inferenceLow-latency inference
FP8Supported in latest generationsLarge-scale training/inference

In most cases, rather than driving tensor cores directly, you use them through high-level libraries like cuBLAS, cuDNN, and CUTLASS. To write them by hand, you use warp-level mma family APIs.

Recent generations like Hopper and Blackwell have expanded tensor core throughput, supported precisions (for example, FP8), and asynchronous data movement features each generation. Specific supported precisions and performance numbers differ by generation and chip, so confirm the official specifications.


Streams and Asynchronous Execution

By default, CUDA work runs sequentially on a single default stream. But when you use multiple streams, you can overlap independent work. A classic example is running data transfer and kernel computation at the same time to reduce total time.

   Single stream (serial): transfer then compute, one after another

   H2D[==] Kernel[======] D2H[==]
   ─────────────────────────────────▶ time

   Multiple streams (overlap): chunked and overlapped

   S1: H2D[==] Kernel[====] D2H[==]
   S2:      H2D[==] Kernel[====] D2H[==]
   S3:           H2D[==] Kernel[====] D2H[==]
   ──────────────────────────────────────▶ time
        -> transfer and compute overlap -> shorter total time

To exploit overlap properly, you need the following conditions:

  • Host memory must be page-locked (pinned) for asynchronous copies to work.
  • You must explicitly specify a stream for cudaMemcpyAsync and the kernel launch.
  • You can manage cross-stream dependencies with events (cudaEvent).
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_in, d_out, n);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);

In recent generations, techniques such as capturing a repeated work graph ahead of time with CUDA Graphs to cut launch overhead, or overlapping transfer and compute more tightly with asynchronous memory copy features, are widely used.


Kernel Example: Tiled Matrix Multiplication

As an example that brings the concepts together, let us look at tiled matrix multiplication using shared memory. The key idea is to read a small tile from global memory once into shared memory, then reuse that tile many times to reduce global access.

First, the simplest vector addition.

__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Host-side launch example
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

Now matrix multiplication with shared-memory tiling. It computes C = A * B, with each block responsible for a 16x16 tile.

#define TILE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;

    // iterate over the K dimension tile by tile
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // load one tile into shared memory (coalesced access)
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;

        __syncthreads();  // ensure the tile is fully loaded

        // compute the partial sum from shared memory (no global access)
        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }

        __syncthreads();  // sync before loading the next tile
    }

    if (row < N && col < N) {
        C[row * N + col] = acc;
    }
}

Why this kernel is fast, in diagram form:

   Tiling reuses global accesses

   Naive version: for each C[i][j], read N of A's row + N of B's col from global
                  -> the same data is loaded over and over

   Tiled version: load a TILE×TILE block into shared memory once ->
                  every thread in the block reuses that tile
                  -> global traffic drops by roughly a factor of TILE

   ┌─────────┐   load   ┌──────────────┐  reuse  ┌──────────┐
   │ Global  │ ───────▶ │ Shared (tile) │ ──────▶ │ 256       │
   │  (HBM)  │  once     │   16×16      │  many   │ threads   │
   └─────────┘          └──────────────┘         └──────────┘

Note why __syncthreads() appears twice. The first waits until every thread finishes loading the tile, preventing a data race; the second guarantees the current computation is done before the next tile overwrites it. Omitting either synchronization corrupts results non-deterministically.


Common Pitfalls and Checklist

Finally, here are pitfalls you will frequently run into in practice.

1. Uncoalesced Memory Access

The most common and most damaging. Switch your data layout to SoA, or adjust indexing so access is contiguous. The habit of checking the profiler's memory-efficiency metric first matters.

2. Warp Divergence

When branching within a warp diverges by data, execution serializes. Where possible, align branch conditions with warp boundaries, or consider replacing branches with arithmetic (predication).

3. Shared Memory Bank Conflicts

Shared memory is split into banks; when threads of a warp access the same bank simultaneously, a conflict serializes them. Adding one column of padding (for example, [TILE][TILE+1]) to avoid conflicts is a common technique.

4. Excessive Register Usage

Using too many registers lowers occupancy, and exceeding the limit spills to slow local memory. Try splitting the kernel into smaller pieces or reducing variables.

5. Missing or Misused Synchronization

Calling __syncthreads() from only some threads inside a branch causes a deadlock. Write code so that all threads pass through the same synchronization point.

6. Skipping Error Checks

If you do not check the return values of CUDA calls and cudaGetLastError(), you can silently get wrong results. In debug builds, checking errors after each call is the safe choice.

SymptomSuspected causeTool to check
Slow but cores seem idleMemory bound, uncoalescingNsight Compute
Low occupancyToo many registers / shared memoryOccupancy Calculator
Branch-heavy kernel is slowWarp divergenceNsight Compute
Results occasionally wrongMissing sync, race conditioncompute-sanitizer
Transfer does not overlap computePinned memory not usedNsight Systems

Conclusion

Using CUDA well comes down to two things. First, you must be able to picture how the logical execution model (grid, block, warp) maps onto real hardware (SM, memory hierarchy). Second, how well you place and reuse data across the memory hierarchy decides most of the performance.

If you write code while recalling the diagrams in this article, you can explain "why is this kernel slow" structurally rather than by guessing. The next step is to validate your hypotheses with a profiler like Nsight and iterate.

Finally, architectures evolve quickly across generations. Concrete numbers such as tensor core precision support, asynchronous memory movement, and per-SM resources change by chip and version, so during real tuning always confirm the official documentation for your architecture.


References