Advanced CUDA GPU Programming: Warp Optimization, Tensor Cores, and Triton Kernels

Table of Contents

  1. CUDA Memory Hierarchy
  2. Warp Optimization
  3. Parallel Patterns
  4. Tensor Cores and the WMMA API
  5. Flash Attention: How It Works
  6. CUDA Streams and Async Execution
  7. Multi-GPU and NCCL
  8. OpenAI Triton Custom Kernels
  9. Profiling with Nsight Compute and Systems
  10. Quiz

CUDA Memory Hierarchy

Understanding the memory hierarchy is the single most important factor in GPU performance optimization. CUDA exposes several memory types that differ in location, latency, bandwidth, and scope.

Memory Type Comparison

| Memory Type | Location | Latency | Bandwidth | Capacity | Scope |
|---|---|---|---|---|---|
| Register | On-chip | ~1 cycle | Highest | 255 per thread | Thread |
| Shared Memory | On-chip | ~5 cycles | ~19 TB/s | 48–164 KB per SM | Block |
| L1 Cache | On-chip | ~28 cycles | ~19 TB/s | 128 KB per SM | SM |
| L2 Cache | On-chip | ~193 cycles | ~7 TB/s | Tens of MB | Device |
| Global (HBM) | HBM | ~600 cycles | ~2 TB/s | Tens of GB | Device |
| Constant | Cached | ~5 cycles (hit) | - | 64 KB | Device |
| Texture | Cached (L1) | ~5 cycles (hit) | - | - | Device |

Using Shared Memory: Tiled Matrix Multiplication

Shared memory sits on-chip and is shared among threads in the same block. Loading tiles into shared memory cuts global-memory traffic by a factor of TILE_SIZE: each element is fetched from HBM once per tile and then reused TILE_SIZE times from fast on-chip storage.

#define TILE_SIZE 16

__global__ void tiledMatMul(
    float* A, float* B, float* C,
    int M, int N, int K
) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    int numTiles = (K + TILE_SIZE - 1) / TILE_SIZE;

    for (int t = 0; t < numTiles; t++) {
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;

        // Collaboratively load tiles into shared memory
        tileA[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;

        __syncthreads();

        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}
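
The tiling logic above, including the zero-fill of out-of-range loads, can be sanity-checked with a pure-Python model of the same loop structure. `tile_matmul` below is an illustrative name, not part of any CUDA API; it mirrors the kernel's flattened row-major indexing.

```python
def tile_matmul(A, B, M, N, K, tile=2):
    """Pure-Python model of the tiled kernel: A (MxK) and B (KxN) are
    flattened row-major lists; out-of-range tile loads become 0.0,
    mirroring the guarded shared-memory loads in the kernel."""
    C = [0.0] * (M * N)
    num_tiles = (K + tile - 1) // tile
    for row in range(M):
        for col in range(N):
            acc = 0.0
            for t in range(num_tiles):
                for k in range(tile):
                    a_col = t * tile + k          # also the B row index
                    a = A[row * K + a_col] if a_col < K else 0.0
                    b = B[a_col * N + col] if a_col < K else 0.0
                    acc += a * b
            C[row * N + col] = acc
    return C

# 2x3 @ 3x2, with K=3 deliberately not a multiple of the tile size
A = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]
B = [1.0, 0.0,
     0.0, 1.0,
     1.0, 1.0]
print(tile_matmul(A, B, M=2, N=2, K=3))  # [4.0, 5.0, 10.0, 11.0]
```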

Constant Memory

Constant memory is read-only and cached. When all threads in a warp access the same address, the value is broadcast in a single cycle — ideal for filter weights or configuration data.

__constant__ float convWeights[256];

__global__ void convKernel(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx + 255 < n) { // keep the full 256-tap window in bounds
        float result = 0.0f;
        for (int i = 0; i < 256; i++) {
            result += input[idx + i] * convWeights[i]; // broadcast access
        }
        output[idx] = result;
    }
}

Warp Optimization

Eliminating Warp Divergence

A warp is a group of 32 threads that execute in lockstep under SIMT. When a conditional branch splits threads within a warp onto different paths, both paths execute serially — this is called warp divergence.

// Bad: divergence within a warp (threads 0,2,4,... vs 1,3,5,...)
__global__ void badKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0) {
        data[idx] = data[idx] * 2.0f;
    } else {
        data[idx] = data[idx] + 1.0f;
    }
}

// Good: branch aligned to warp boundary — no divergence
__global__ void goodKernel(float* data, int n) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    if (warpId % 2 == 0) {
        data[idx] = data[idx] * 2.0f;
    } else {
        data[idx] = data[idx] + 1.0f;
    }
}

Coalesced Memory Access

Global memory is accessed in 128-byte transactions. If consecutive threads read consecutive addresses (coalesced), a single transaction serves the entire warp. Strided or random access can require 32 separate transactions.
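
As a rough model (assuming 128-byte transactions and 4-byte words), the number of transactions a warp needs can be counted by bucketing each thread's byte address into 128-byte segments. `warp_transactions` is a hypothetical helper for illustration, not a CUDA API:

```python
def warp_transactions(addresses, segment=128):
    """Distinct 128-byte segments touched by one warp's memory addresses."""
    return len({addr // segment for addr in addresses})

# 32 threads reading consecutive 4-byte floats: one transaction for the warp
print(warp_transactions([tid * 4 for tid in range(32)]))         # 1
# 32 threads each striding by a 1024-float row: one transaction per thread
print(warp_transactions([tid * 1024 * 4 for tid in range(32)]))  # 32
```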

// Bad: column-major strided access (stride = N bytes between threads)
__global__ void columnAccess(float* matrix, int M, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int row = 0; row < M; row++) {
        sum += matrix[row * N + col]; // stride-N access
    }
}

// Good: row-major coalesced access
__global__ void rowAccess(float* matrix, int M, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int col = 0; col < N; col++) {
        sum += matrix[row * N + col]; // sequential access
    }
}

Shared Memory Bank Conflicts

Shared memory is divided into 32 banks (4 bytes each). When two or more threads in a warp access different addresses within the same bank, they serialize — this is a bank conflict.
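
The serialization factor of an access pattern can be modeled by mapping each 4-byte word to bank (addr / 4) % 32 and counting collisions; `max_bank_conflict` is an illustrative helper, not a CUDA API:

```python
from collections import Counter

def max_bank_conflict(addresses, banks=32, word=4):
    """Worst-case serialization factor for one warp's shared-memory access.

    Distinct words in the same bank serialize; identical addresses are
    broadcast for free (hence the set())."""
    counts = Counter((addr // word) % banks for addr in set(addresses))
    return max(counts.values())

# Column read of a float tile[32][32]: every lane lands in bank 0
print(max_bank_conflict([lane * 32 * 4 for lane in range(32)]))  # 32
# Same read of a padded tile[32][33]: each lane hits a distinct bank
print(max_bank_conflict([lane * 33 * 4 for lane in range(32)]))  # 1
```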

// Bad: 32-way bank conflict during transpose (all threads hit bank 0)
__shared__ float smem[32][32];
// smem[0][0], smem[1][0], smem[2][0]... all map to bank 0

// Good: +1 column padding shifts each row to a different bank
__global__ void matTransposeNoBankConflict(float* in, float* out, int N) {
    __shared__ float tile[TILE_SIZE][TILE_SIZE + 1]; // +1 padding

    int x = blockIdx.x * TILE_SIZE + threadIdx.x;
    int y = blockIdx.y * TILE_SIZE + threadIdx.y;

    if (x < N && y < N)
        tile[threadIdx.y][threadIdx.x] = in[y * N + x];

    __syncthreads();

    x = blockIdx.y * TILE_SIZE + threadIdx.x;
    y = blockIdx.x * TILE_SIZE + threadIdx.y;

    if (x < N && y < N)
        out[y * N + x] = tile[threadIdx.x][threadIdx.y];
}

Parallel Patterns

Parallel Reduction with Warp Shuffle

// Warp-level sum using __shfl_down_sync (no shared memory needed)
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}

__global__ void blockReduceSum(float* input, float* output, int n) {
    __shared__ float shared[32]; // one partial per warp (assumes blockDim.x <= 1024)

    int tid   = threadIdx.x;
    int idx   = blockIdx.x * blockDim.x + tid;
    float val = (idx < n) ? input[idx] : 0.0f;

    // Step 1: reduce within each warp
    val = warpReduceSum(val);

    // Step 2: write each warp's result to shared memory
    if (tid % 32 == 0) shared[tid / 32] = val;
    __syncthreads();

    // Step 3: final reduction in first warp
    if (tid < 32) {
        val = (tid < blockDim.x / 32) ? shared[tid] : 0.0f;
        val = warpReduceSum(val);
        if (tid == 0) atomicAdd(output, val);
    }
}
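
The offset 16/8/4/2/1 shuffle tree can be modeled on a plain 32-element list, one entry per lane. This is a sketch of the reduction pattern, not the full hardware semantics of `__shfl_down_sync`:

```python
def warp_reduce_sum(lanes):
    """Model the offset 16/8/4/2/1 shuffle-down reduction over 32 lanes."""
    vals = list(lanes)
    offset = 16
    while offset > 0:
        for lane in range(32):      # src > lane, so in-place update is safe
            src = lane + offset
            if src < 32:            # out-of-range lanes keep their own value
                vals[lane] += vals[src]
        offset >>= 1
    return vals[0]                  # lane 0 ends up with the warp total

print(warp_reduce_sum(range(32)))   # 496 == sum(range(32))
```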

Exclusive Prefix Sum (Scan)

The Blelloch (work-efficient) scan uses an up-sweep and down-sweep phase, each costing O(N) work.

// Single-block Blelloch scan: assumes n is a power of two and n <= 2048
__global__ void exclusiveScan(int* input, int* output, int n) {
    __shared__ int temp[1024 * 2];
    int tid    = threadIdx.x;
    int offset = 1;

    temp[2 * tid]     = (2 * tid < n)     ? input[2 * tid]     : 0;
    temp[2 * tid + 1] = (2 * tid + 1 < n) ? input[2 * tid + 1] : 0;

    // Up-sweep (reduce)
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    if (tid == 0) temp[n - 1] = 0;

    // Down-sweep
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t  = temp[ai];
            temp[ai]  = temp[bi];
            temp[bi] += t;
        }
    }

    __syncthreads();
    if (2 * tid < n)     output[2 * tid]     = temp[2 * tid];
    if (2 * tid + 1 < n) output[2 * tid + 1] = temp[2 * tid + 1];
}
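
The same two phases can be prototyped in pure Python with the kernel's exact index arithmetic and checked against a serial exclusive scan (like the kernel, this assumes the length is a power of two):

```python
def blelloch_exclusive_scan(data):
    """Up-sweep then down-sweep, mirroring the kernel's index arithmetic.

    Assumes len(data) is a power of two, like the single-block kernel."""
    n = len(data)
    temp = list(data)
    # Up-sweep (reduce): build a binary tree of partial sums in place
    offset, d = 1, n >> 1
    while d > 0:
        for tid in range(d):
            ai = offset * (2 * tid + 1) - 1
            bi = offset * (2 * tid + 2) - 1
            temp[bi] += temp[ai]
        offset <<= 1
        d >>= 1
    temp[n - 1] = 0  # clear the root before the down-sweep
    # Down-sweep: traverse back down, swapping and accumulating prefixes
    d = 1
    while d < n:
        offset >>= 1
        for tid in range(d):
            ai = offset * (2 * tid + 1) - 1
            bi = offset * (2 * tid + 2) - 1
            temp[ai], temp[bi] = temp[bi], temp[bi] + temp[ai]
        d <<= 1
    return temp

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```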

Tensor Cores and the WMMA API

Tensor Cores are specialized hardware units, introduced in Volta (2017), that each perform a 4×4×4 matrix fused multiply-add (D = A·B + C on 4×4 tiles) per clock cycle. At the warp level, the WMMA API exposes a 16×16×16 matrix multiply-accumulate. On A100, FP16 Tensor Cores deliver 312 TFLOPS, roughly 16x the 19.5 TFLOPS of the FP32 CUDA Cores.
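
The 16x figure follows directly from the two peak rates quoted above (A100 datasheet numbers):

```python
fp16_tensor_core_tflops = 312.0  # A100 peak FP16 Tensor Core throughput
fp32_cuda_core_tflops = 19.5     # A100 peak FP32 CUDA Core throughput

print(fp16_tensor_core_tflops / fp32_cuda_core_tflops)  # 16.0
```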

Mixed-Precision WMMA Matrix Multiply

#include <mma.h>
using namespace nvcuda;

#define WMMA_M 16
#define WMMA_N 16
#define WMMA_K 16

// Simplified: assumes M, N, K are multiples of the 16x16x16 tile shape
__global__ void tensorCoreMatMul(
    half* A, half* B, float* C,
    int M, int N, int K
) {
    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K,
                   half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K,
                   half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K,
                   float> cFrag;

    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

    wmma::fill_fragment(cFrag, 0.0f);

    for (int i = 0; i < K; i += WMMA_K) {
        int aRow = warpM * WMMA_M;
        int aCol = i;
        int bRow = i;
        int bCol = warpN * WMMA_N;

        if (aRow < M && aCol < K && bRow < K && bCol < N) {
            wmma::load_matrix_sync(aFrag, A + aRow * K + aCol, K);
            wmma::load_matrix_sync(bFrag, B + bRow * N + bCol, N);
            wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
        }
    }

    int cRow = warpM * WMMA_M;
    int cCol = warpN * WMMA_N;
    if (cRow < M && cCol < N) {
        wmma::store_matrix_sync(C + cRow * N + cCol, cFrag, N,
                                wmma::mem_row_major);
    }
}

PTX Inline Assembly

For performance-critical inner loops, PTX inline assembly unlocks hardware instructions not exposed through the CUDA C++ API.

__device__ float fastSqrt(float x) {
    float result;
    asm("sqrt.approx.f32 %0, %1;" : "=f"(result) : "f"(x));
    return result;
}

__device__ int ldgInt(const int* ptr) {
    int val;
    // Cache-hint global load: prefer L2, bypass L1
    asm volatile("ld.global.cg.s32 %0, [%1];"
                 : "=r"(val) : "l"(ptr));
    return val;
}

Flash Attention: How It Works

Standard Attention requires O(N²) HBM memory to store the intermediate attention matrix S. Flash Attention tiles Q, K, V into SRAM and uses an online softmax trick to accumulate the output without ever materializing the full N×N matrix in HBM.

Key Idea: Online Softmax

Standard Attention HBM I/O:
  Read Q, K, V:     O(Nd)
  Write S = QK^T:   O(N^2)      <-- bottleneck
  Read S, write P:  O(N^2)
  Read P, write O:  O(Nd)
  Total: O(Nd + N^2)

Flash Attention HBM I/O:
  Read Q, K, V:     O(Nd)
  Write O:          O(Nd)
  Total: O(Nd)      <-- no N^2 term

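
The recurrence the kernel uses below (running max `mi`, running denominator `li`, running output `oi`) can be verified against an ordinary two-pass softmax in a few lines of Python; `online_softmax_weighted_sum` is an illustrative name:

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One pass over (score, value) pairs using the (mi, li, oi) recurrence
    from the Flash Attention kernel."""
    mi, li, oi = -math.inf, 0.0, 0.0
    for s, v in zip(scores, values):
        mi_new = max(mi, s)
        li_new = li * math.exp(mi - mi_new) + math.exp(s - mi_new)
        oi = (oi * (li / li_new) * math.exp(mi - mi_new)
              + (math.exp(s - mi_new) / li_new) * v)
        mi, li = mi_new, li_new
    return oi

scores = [0.5, 2.0, -1.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0]

# Two-pass reference: full softmax, then the weighted sum
m = max(scores)
w = [math.exp(s - m) for s in scores]
reference = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

assert abs(online_softmax_weighted_sum(scores, values) - reference) < 1e-12
```

Because the recurrence rescales the accumulator whenever a new maximum appears, the one-pass result matches the two-pass softmax to floating-point precision without ever holding all scores at once.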
// Simplified Flash Attention forward pass
__global__ void flashAttentionKernel(
    const float* Q, const float* K, const float* V,
    float* O, float* l, float* m,
    int seqLen, int headDim, float scale
) {
    int tileSize = blockDim.x;
    int qStart   = blockIdx.x * tileSize;

    extern __shared__ float smem[];
    float* sQ = smem;
    float* sK = smem + tileSize * headDim;
    float* sV = smem + 2 * tileSize * headDim;

    // Load Q tile into SRAM
    for (int d = 0; d < headDim; d++) {
        sQ[threadIdx.x * headDim + d] =
            Q[(qStart + threadIdx.x) * headDim + d];
    }

    float mi  = -INFINITY;   // running row max for numerical stability
    float li  = 0.0f;        // running softmax denominator
    float oi[64] = {};       // output accumulator (assumes headDim <= 64)

    for (int kStart = 0; kStart < seqLen; kStart += tileSize) {
        // Load K and V tiles into SRAM
        for (int d = 0; d < headDim; d++) {
            sK[threadIdx.x * headDim + d] =
                K[(kStart + threadIdx.x) * headDim + d];
            sV[threadIdx.x * headDim + d] =
                V[(kStart + threadIdx.x) * headDim + d];
        }
        __syncthreads();

        for (int j = 0; j < tileSize; j++) {
            float sij = 0.0f;
            for (int d = 0; d < headDim; d++) {
                sij += sQ[threadIdx.x * headDim + d]
                     * sK[j * headDim + d];
            }
            sij *= scale;

            // Online softmax update
            float miNew = fmaxf(mi, sij);
            float liNew = li * expf(mi - miNew) + expf(sij - miNew);

            for (int d = 0; d < headDim; d++) {
                oi[d] = oi[d] * (li / liNew) * expf(mi - miNew)
                      + (expf(sij - miNew) / liNew) * sV[j * headDim + d];
            }
            mi = miNew;
            li = liNew;
        }
        __syncthreads();
    }

    for (int d = 0; d < headDim; d++) {
        O[(qStart + threadIdx.x) * headDim + d] = oi[d];
    }
    l[qStart + threadIdx.x] = li;
    m[qStart + threadIdx.x] = mi;
}

CUDA Streams and Async Execution

CUDA streams enable overlapping of memory transfers and kernel execution, maximizing GPU utilization through pipelining. For cudaMemcpyAsync to actually overlap with kernels, the host buffers must be page-locked (allocated with cudaMallocHost or cudaHostAlloc); with pageable memory the copies fall back to synchronous behavior.

void pipelinedProcessing(float* h_input, float* h_output,
                         int totalSize, int chunkSize) {
    const int numStreams = 4;
    cudaStream_t streams[numStreams];
    float *d_input[numStreams], *d_output[numStreams];

    for (int i = 0; i < numStreams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_input[i],  chunkSize * sizeof(float));
        cudaMalloc(&d_output[i], chunkSize * sizeof(float));
    }

    // Pipeline: HtoD copy, kernel, DtoH copy interleaved across streams
    for (int chunk = 0; chunk < totalSize / chunkSize; chunk++) {
        int s = chunk % numStreams;
        float* hIn  = h_input  + chunk * chunkSize;
        float* hOut = h_output + chunk * chunkSize;

        cudaMemcpyAsync(d_input[s], hIn,
                        chunkSize * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);

        dim3 block(256);
        dim3 grid((chunkSize + 255) / 256);
        processKernel<<<grid, block, 0, streams[s]>>>(
            d_input[s], d_output[s], chunkSize);

        cudaMemcpyAsync(hOut, d_output[s],
                        chunkSize * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int i = 0; i < numStreams; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(d_input[i]);
        cudaFree(d_output[i]);
    }
}

Inter-Stream Synchronization with Events

cudaEvent_t startEvent, stopEvent;
cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

// stream2 must wait until stream1's kernel finishes
kernel1<<<grid, block, 0, stream1>>>(d_data);
cudaEventRecord(startEvent, stream1); // event completes once kernel1 is done
cudaStreamWaitEvent(stream2, startEvent, 0);
kernel2<<<grid, block, 0, stream2>>>(d_data);

cudaEventRecord(stopEvent, stream2);
cudaEventSynchronize(stopEvent);

float ms = 0;
cudaEventElapsedTime(&ms, startEvent, stopEvent);
printf("Elapsed: %.3f ms\n", ms);

Multi-GPU and NCCL

Peer-to-Peer Direct Access

// Enable direct GPU-to-GPU memory copy (NVLink or PCIe P2P)
int canAccess;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    float* d_src; // on GPU 0
    float* d_dst; // on GPU 1
    cudaSetDevice(0);
    cudaMalloc(&d_src, size);

    cudaSetDevice(1);
    cudaMalloc(&d_dst, size);

    // Direct copy without staging through host
    cudaMemcpyPeer(d_dst, 1, d_src, 0, size);
}

NCCL Ring-AllReduce

#include <nccl.h>

void ncclAllReduceExample(int numGPUs, float** d_buffers, size_t count) {
    ncclComm_t comms[numGPUs];
    int devList[numGPUs];
    for (int i = 0; i < numGPUs; i++) devList[i] = i;

    ncclCommInitAll(comms, numGPUs, devList);

    cudaStream_t streams[numGPUs];
    for (int i = 0; i < numGPUs; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }

    // Sum all buffers and distribute the result to every GPU
    ncclGroupStart();
    for (int i = 0; i < numGPUs; i++) {
        cudaSetDevice(i);
        ncclAllReduce(d_buffers[i], d_buffers[i], count,
                      ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < numGPUs; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
}

Ring-AllReduce transfers 2 * (N-1) / N * dataSize bytes per GPU. Each of the N GPUs sends a chunk of size dataSize / N in each of N-1 steps across two phases (Reduce-Scatter + AllGather), giving 2 * (N-1) * (dataSize / N) total bytes sent per GPU.
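
Both the result and the per-GPU traffic count can be checked with a toy simulation of the two phases (element counts stand in for bytes; `ring_allreduce` is an illustrative model, not NCCL's implementation):

```python
def ring_allreduce(buffers):
    """Toy Ring-AllReduce: Reduce-Scatter then AllGather over a logical ring.

    Returns (reduced buffers, elements sent per GPU). Assumes the buffer
    length is divisible by the number of GPUs."""
    n = len(buffers)
    chunk = len(buffers[0]) // n
    bufs = [list(b) for b in buffers]
    sent_per_gpu = 0

    # Phase 1: Reduce-Scatter. After N-1 rounds, GPU i holds the fully
    # reduced chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            c, dst = (i - step) % n, (i + 1) % n
            for e in range(c * chunk, (c + 1) * chunk):
                bufs[dst][e] += bufs[i][e]
        sent_per_gpu += chunk

    # Phase 2: AllGather. Fully reduced chunks circulate for N-1 rounds.
    for step in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - step) % n, (i + 1) % n
            for e in range(c * chunk, (c + 1) * chunk):
                bufs[dst][e] = bufs[i][e]
        sent_per_gpu += chunk

    return bufs, sent_per_gpu

gpus, size = 4, 8
buffers = [[float(g * size + e) for e in range(size)] for g in range(gpus)]
result, sent = ring_allreduce(buffers)

expected = [sum(b[e] for b in buffers) for e in range(size)]
assert all(row == expected for row in result)
assert sent == 2 * (gpus - 1) * size // gpus  # 2*(N-1)/N * data_size
```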


OpenAI Triton Custom Kernels

Triton lets you write GPU kernels in Python at near-CUDA performance. Triton's compiler handles tiling, vectorization, and shared memory allocation automatically.

Triton Matrix Multiply Kernel

import triton
import triton.language as tl
import torch

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
    b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs,
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] < K - k),
                    other=0.0)
        b = tl.load(b_ptrs,
                    mask=(offs_k[:, None] < K - k) & (offs_n[None, :] < N),
                    other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :]
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = lambda meta: (
        triton.cdiv(M, meta['BLOCK_M']),
        triton.cdiv(N, meta['BLOCK_N']),
    )
    matmul_kernel[grid](
        a, b, c,
        M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=128, BLOCK_N=256, BLOCK_K=64,
    )
    return c

Flash Attention in Triton

@triton.jit
def flash_attn_fwd_kernel(
    Q_ptr, K_ptr, V_ptr, O_ptr,
    L_ptr, M_ptr,
    seqlen, scale,
    headdim:  tl.constexpr,  # tl.arange below needs a compile-time constant
    BLOCK_Q:  tl.constexpr,
    BLOCK_KV: tl.constexpr,
):
    start_q = tl.program_id(0) * BLOCK_Q
    offs_q  = start_q + tl.arange(0, BLOCK_Q)
    offs_d  = tl.arange(0, headdim)

    q = tl.load(Q_ptr + offs_q[:, None] * headdim + offs_d[None, :],
                mask=(offs_q[:, None] < seqlen), other=0.0)

    mi  = tl.full((BLOCK_Q,), value=-float('inf'), dtype=tl.float32)
    li  = tl.zeros((BLOCK_Q,), dtype=tl.float32)
    acc = tl.zeros((BLOCK_Q, headdim), dtype=tl.float32)

    for start_kv in range(0, seqlen, BLOCK_KV):
        offs_kv = start_kv + tl.arange(0, BLOCK_KV)

        k = tl.load(K_ptr + offs_kv[:, None] * headdim + offs_d[None, :],
                    mask=(offs_kv[:, None] < seqlen), other=0.0)
        v = tl.load(V_ptr + offs_kv[:, None] * headdim + offs_d[None, :],
                    mask=(offs_kv[:, None] < seqlen), other=0.0)

        s      = tl.dot(q, tl.trans(k)) * scale
        mi_new = tl.maximum(mi, tl.max(s, axis=1))
        p      = tl.exp(s - mi_new[:, None])
        li_new = li * tl.exp(mi - mi_new) + tl.sum(p, axis=1)

        acc = acc * (li * tl.exp(mi - mi_new))[:, None] / li_new[:, None]
        acc += tl.dot(p, v) / li_new[:, None]

        mi = mi_new
        li = li_new

    tl.store(O_ptr + offs_q[:, None] * headdim + offs_d[None, :],
             acc, mask=(offs_q[:, None] < seqlen))
    tl.store(L_ptr + offs_q, li, mask=(offs_q < seqlen))
    tl.store(M_ptr + offs_q, mi, mask=(offs_q < seqlen))

Profiling with Nsight Compute and Systems

Nsight Compute: Kernel-Level Analysis

# Full profile for all metrics
ncu --set full -o profile_report ./my_cuda_app

# Profile only a specific kernel, 3 launches
ncu --kernel-name tiledMatMul --launch-count 3 ./my_cuda_app

# Collect memory bandwidth metrics for Roofline analysis
ncu --metrics \
    l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,\
    l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum,\
    sm__cycles_elapsed.avg,\
    sm__cycles_elapsed.avg.per_second \
    ./my_cuda_app

# DRAM bandwidth utilization as % of peak
ncu --metrics \
    gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed \
    ./my_cuda_app

Nsight Systems: Application-Level Timeline

# Capture full application trace
nsys profile --trace=cuda,nvtx,osrt \
             --output=timeline_report \
             ./my_cuda_app

# Print summary statistics
nsys stats timeline_report.nsys-rep

# Print the per-kernel GPU trace
nsys stats --report cuda_gpu_trace timeline_report.nsys-rep

NVTX Markers for Custom Ranges

#include <nvtx3/nvToolsExt.h>

void profiledFunction() {
    nvtxRangePush("DataPreprocessing");
    preprocessData();
    nvtxRangePop();

    nvtxRangePush("ForwardPass");
    forwardPass();
    nvtxRangePop();
}

Roofline Model

The Roofline model determines whether a kernel is compute-bound or memory-bandwidth-bound.

Arithmetic Intensity (AI) = FLOP / Bytes accessed

Example for A100 80GB:
  Peak FP16 Tensor Core throughput: 312 TFLOPS
  HBM bandwidth:                      2 TB/s
  Ridge point: 312 / 2 = 156 FLOP/Byte

  AI < 156  → memory-bandwidth bound
  AI > 156  → compute bound

# Measure arithmetic intensity
ncu --metrics \
    sm__ops_path_tensor_src_fp16.sum,\
    dram__bytes.sum \
    ./matmul_app
# AI = (tensor_ops * 2) / dram_bytes
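
As a worked instance of the formula, the arithmetic intensity of an FP16 GEMM can be computed by counting each matrix as moved over HBM exactly once (an idealized lower bound on traffic). `gemm_arithmetic_intensity` is a hypothetical helper; the ridge point uses the A100 figures quoted above:

```python
def gemm_arithmetic_intensity(M, N, K, in_bytes=2, out_bytes=4):
    """FLOP/byte for C = A @ B with FP16 inputs and an FP32 output,
    counting each matrix as moved over HBM exactly once (a lower bound)."""
    flops = 2 * M * N * K                                 # one FMA = 2 FLOPs
    hbm_bytes = in_bytes * (M * K + K * N) + out_bytes * M * N
    return flops / hbm_bytes

ridge = 312e12 / 2e12   # A100 ridge point: 156 FLOP/byte

for n in (256, 1024, 4096):
    ai = gemm_arithmetic_intensity(n, n, n)
    bound = "compute" if ai > ridge else "memory"
    print(f"{n}^3 GEMM: AI = {ai:.0f} FLOP/byte -> {bound}-bound")
```

For square matrices this reduces to AI = n/4 FLOP/byte, so small GEMMs sit below the ridge point and are bandwidth-bound even on Tensor Cores.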

Quiz

Q1. What causes shared memory bank conflicts in CUDA, and how do you fix them?

Answer: A bank conflict occurs when two or more threads in the same warp access different addresses that map to the same shared memory bank, forcing the accesses to serialize.

Explanation: Shared memory has 32 banks, each 4 bytes wide. Thread i naturally accesses bank i % 32. A classic example is matrix transposition: reading tile[threadIdx.x][threadIdx.y] causes all 32 threads to hit bank 0 — a 32-way conflict. The fix is to pad the inner dimension by 1: __shared__ float tile[32][33]. This shifts every row's base address by one bank, so threads fan out across distinct banks with zero conflicts.

Q2. How does warp divergence affect performance, and what patterns minimize it?

Answer: Warp divergence serializes both branches of a conditional within a warp, potentially doubling execution time for that warp. With k distinct paths, execution time scales up to k times the single-path cost.

Explanation: Under SIMT, all 32 threads must execute the same instruction. When a branch is taken by only some threads, the hardware masks the others and reruns the warp for each divergent path. Mitigation strategies: (1) align branch conditions to warp boundaries using warp ID instead of thread ID, so every thread in a warp takes the same path; (2) use warp-vote intrinsics like __all_sync and __any_sync to check uniform conditions; (3) sort or rearrange input data so work of the same type is grouped into the same warps.

Q3. What hardware principle makes Tensor Cores faster than FP32 CUDA Cores for matrix multiply?

Answer: Tensor Cores implement a dedicated matrix-FMA array in silicon that computes a 4x4x4 matrix multiply-accumulate per clock and a full 16x16x16 warp-level MMA in a handful of clocks, operating on lower-precision inputs (FP16/BF16/INT8) to maximize throughput.

Explanation: A standard CUDA Core executes one scalar FP32 FMA per clock. A warp-level mma operation fuses 4,096 multiply-adds (a 16x16 grid of length-16 dot products) into a single instruction, sharing instruction overhead across the whole warp. On A100, each SM has 4 Tensor Core units that together sustain roughly 2,048 FP16 FLOPs/clock/SM. Additionally, Tensor Cores accumulate in FP32 while computing in FP16 (mixed precision), preserving numerical stability. This is why torch.cuda.amp and cuBLAS both route GEMM operations through Tensor Cores when available.

Q4. Why is Flash Attention more memory-efficient than standard Attention from an HBM vs. SRAM perspective?

Answer: Standard Attention materializes the full N×N attention matrix in HBM (O(N²) storage), while Flash Attention tiles Q, K, V into on-chip SRAM and uses an online softmax to accumulate the output in SRAM, reducing HBM I/O from O(N² + Nd) to O(Nd).

Explanation: In standard Attention the bottleneck is writing and re-reading the N×N score matrix S and probability matrix P to HBM. Flash Attention divides Q into row tiles and K, V into column tiles that fit in SRAM. For each tile pair it computes local scores and incrementally updates (mi, li, oi) using the online softmax recurrence: mi_new = max(mi, s_ij), li_new = li * exp(mi - mi_new) + exp(s_ij - mi_new). Only the final output O is written back to HBM. SRAM bandwidth is approximately 10x higher than HBM, so even though more arithmetic is performed, wall-clock time is lower because HBM traffic drops from O(N²) to O(N).

Q5. Why does NCCL Ring-AllReduce transfer exactly 2*(N-1)/N * data_size bytes per GPU regardless of N?

Answer: Ring-AllReduce consists of a Reduce-Scatter phase and an AllGather phase, each requiring N-1 steps in which every GPU sends a chunk of size data_size / N. The total is 2 * (N-1) * (data_size / N).

Explanation: Arrange N GPUs in a logical ring. In Reduce-Scatter, each GPU holds one data_size / N chunk that is the partial sum of the corresponding slice from all N GPUs after N-1 rounds of send-and-accumulate. Each round moves exactly data_size / N bytes. In AllGather, each GPU broadcasts its reduced chunk around the ring in another N-1 rounds. Total bytes per GPU = 2 * (N-1) * (data_size / N) = 2 * (N-1) / N * data_size. As N grows this approaches 2 * data_size, which is a constant — far better than the naive broadcast cost of (N-1) * data_size that grows linearly with N.