CUDA Programming Complete Guide: GPU Parallel Computing Zero to Hero

Overview

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables developers to harness the thousands of cores in a GPU to achieve computational throughput far beyond what a CPU alone can deliver. CUDA has become a foundational technology across deep learning, scientific simulation, image processing, and many other domains.

This guide targets both developers new to CUDA and engineers seeking advanced optimization techniques, walking through GPU architecture, kernel development, memory hierarchy, and real-world custom deep learning operators step by step.


1. GPU Architecture

1.1 CPU vs GPU Design Philosophy

A CPU is a general-purpose processor designed for complex control flow and low latency. Modern CPUs have 16–64 cores, each optimized for high-performance single-threaded execution. A GPU, by contrast, has thousands of simpler cores and operates under the SIMT (Single Instruction, Multiple Threads) model — applying the same operation to large volumes of data simultaneously.

For an NVIDIA A100 GPU:

  • 6912 CUDA Cores (FP32 operations)
  • 432 Tensor Cores (matrix operation acceleration)
  • 80 GB HBM2e memory (2 TB/s bandwidth)
  • 312 TFLOPS (FP16 Tensor Core peak)

1.2 NVIDIA GPU Architecture

Streaming Multiprocessor (SM)

The SM is the fundamental execution unit of a GPU. The A100 has 108 SMs, each containing:

  • 64 CUDA Cores (FP32)
  • 4 Tensor Cores (3rd generation)
  • 256 KB L1 Cache / Shared Memory (configurable)
  • 32 Load/Store Units
  • 4 Special Function Units (SFUs)
  • Up to 2048 concurrent threads

CUDA Core vs Tensor Core

A CUDA Core executes a single FP32 or INT32 operation per cycle. A Tensor Core accelerates matrix multiply-accumulate (MMA) operations in hardware: each 3rd-generation Tensor Core in the A100 performs the equivalent of 256 FP16 FMA operations per clock, an order of magnitude more throughput than a CUDA Core.

1.3 Ampere Architecture (A100)

A100 GPU Structure
┌─────────────────────────────────────────┐
│ GPC (Graphics Processing Cluster) x 8   │
│  ┌─────────────────────────────────┐    │
│  │ TPC (Texture Processing Cluster)│    │
│  │  ┌──────────┐  ┌──────────┐     │    │
│  │  │    SM    │  │    SM    │     │    │
│  │  │ 64 CUDA  │  │ 64 CUDA  │     │    │
│  │  │  4 TC    │  │  4 TC    │     │    │
│  │  └──────────┘  └──────────┘     │    │
│  └─────────────────────────────────┘    │
│                                         │
│ L2 Cache: 40 MB                         │
│ HBM2e: 80 GB @ 2 TB/s                   │
└─────────────────────────────────────────┘

A100 highlights:

  • MIG (Multi-Instance GPU): Partition a single GPU into up to 7 independent instances
  • NVLink 3.0: 600 GB/s bidirectional GPU-to-GPU bandwidth
  • TF32: New numeric format combining FP32 range with FP16 precision
  • Structural Sparsity: 2:4 sparsity pattern doubles Tensor Core throughput

1.4 Hopper Architecture (H100)

H100, released in 2022, is NVIDIA's Hopper-generation data-center GPU:

  • 132 SMs, each with 128 CUDA Cores
  • 4th-gen Tensor Cores: FP8 support, up to 6x A100 performance
  • Transformer Engine: Per-layer automatic FP8/FP16 switching for transformer models
  • NVLink 4.0: 900 GB/s bidirectional bandwidth
  • HBM3: 80 GB @ 3.35 TB/s

1.5 Thread Hierarchy

CUDA uses a three-level thread hierarchy:

Grid  (entire kernel launch)
  └── Block (runs on one SM; threads share shared memory)
        └── Thread (individual execution unit)

Each level is three-dimensional:

// Example: 2D Grid of 2D Blocks
dim3 gridDim(32, 32, 1);    // 32 x 32 = 1024 blocks
dim3 blockDim(16, 16, 1);   // 16 x 16 = 256 threads per block
// Total threads: 1024 * 256 = 262,144

Warp

A warp is the SM's execution unit comprising 32 threads. All threads within a warp execute the same instruction simultaneously (SIMT). Much of CUDA optimization stems from understanding warp-level behavior.
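As a concrete sketch of warp membership, the warp and lane of a thread follow directly from its linear index (plain C++, function names are illustrative):

```cpp
#include <cassert>

// Warp size is fixed at 32 on all current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Linear thread index within a block -> warp id and lane id.
// Threads sharing a warp_id execute in lockstep under SIMT.
int warp_id(int tid) { return tid / kWarpSize; }
int lane_id(int tid) { return tid % kWarpSize; }
```

Threads 0–31 form warp 0, threads 32–63 form warp 1, and so on; divergence and coalescing behavior are analyzed at this granularity.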

1.6 Memory Hierarchy

Latency (low to high)
Register File   : ~1 cycle,    size: 64 KB/SM, up to 255 regs/thread
Shared Memory   : ~5 cycles,   size: up to 164 KB/SM (Ampere)
L1 Cache        : ~30 cycles,  size: unified with shared memory
L2 Cache        : ~200 cycles, size: 40 MB (A100)
Global Memory   : ~600 cycles, size: 80 GB (HBM2e)

Registers: Exclusively owned by each thread. Fastest. Excess register usage causes spilling to Local Memory (physically global memory).

Shared Memory: Shared among threads in the same block. Must be managed explicitly; physically shares space with L1 Cache.

Global Memory: Accessible by all threads. High latency makes coalesced access patterns essential.


2. CUDA Development Environment

2.1 Installing CUDA Toolkit

Download the appropriate package from https://developer.nvidia.com/cuda-toolkit.

On Ubuntu 22.04:

# Install CUDA keyring
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Update and install
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

# Environment variables (add to ~/.bashrc or ~/.zshrc)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Verify installation:

nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver
# Cuda compilation tools, release 12.3, V12.3.107

nvidia-smi
# Shows driver version, CUDA version, and GPU status

2.2 nvcc Compiler

nvcc is NVIDIA's CUDA C/C++ compiler. It compiles host code with g++/clang++ and device code to PTX, then to SASS (actual GPU instructions).

# Basic compilation
nvcc -o hello_cuda hello_cuda.cu

# With optimization flags
nvcc -O3 -arch=sm_80 -o matmul matmul.cu

# Debug build
nvcc -G -g -o debug_kernel debug_kernel.cu

# Multi-architecture support
nvcc -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_86,code=sm_86 \
     -o multi_arch multi_arch.cu

# Key flags
# -arch=sm_XX : target architecture (sm_80 = Ampere A100)
# -O3         : optimization level
# -G          : device-side debug info
# -lineinfo   : line info for profiling
# --ptxas-options=-v : print register/memory usage

2.3 nvidia-smi

# Real-time GPU monitoring
nvidia-smi dmon -s u   # GPU utilization
nvidia-smi dmon -s m   # memory utilization

# Detailed info for a specific GPU
nvidia-smi -q -d MEMORY,UTILIZATION

# Lock GPU clocks for benchmarking
sudo nvidia-smi -pm 1              # Enable persistence mode
sudo nvidia-smi --lock-gpu-clocks=1410

3. CUDA Programming Fundamentals

3.1 Hello CUDA World

#include <cuda_runtime.h>
#include <stdio.h>

// __global__: runs on GPU, called from CPU
__global__ void helloKernel() {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Hello from Thread %d (Block %d, Thread %d)\n",
           tid, blockIdx.x, threadIdx.x);
}

int main() {
    // 2 blocks, 4 threads per block = 8 total threads
    helloKernel<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}

CUDA function qualifiers:

  • __global__: runs on GPU, called from CPU (kernel)
  • __device__: runs on GPU, called from GPU only
  • __host__: runs on CPU, called from CPU (default)
  • __host__ __device__: can run on both
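A minimal sketch of a dual-qualified function; the guard below is only so the same snippet also builds under a plain C++ compiler for illustration (nvcc defines __CUDACC__):

```cpp
#include <cassert>

// __CUDACC__ is defined only when compiling with nvcc; under a plain
// C++ compiler the qualifiers expand to nothing.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Callable from both host code and device kernels.
__host__ __device__ float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}
```

This pattern is common for small math helpers shared between CPU reference code and GPU kernels.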

3.2 Thread Indexing

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void indexingDemo() {
    // 1D indexing
    int global_tid_1d = threadIdx.x + blockIdx.x * blockDim.x;

    // 2D indexing
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int global_tid_2d = row * (gridDim.x * blockDim.x) + col;

    // 3D indexing
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int z = threadIdx.z + blockIdx.z * blockDim.z;
    int width  = gridDim.x * blockDim.x;
    int height = gridDim.y * blockDim.y;
    int global_tid_3d = z * (width * height) + y * width + x;

    if (global_tid_1d == 0) {
        printf("Block dims: (%d, %d, %d)\n",
               blockDim.x, blockDim.y, blockDim.z);
        printf("Grid dims:  (%d, %d, %d)\n",
               gridDim.x, gridDim.y, gridDim.z);
    }
}

int main() {
    dim3 block(4, 4, 1);
    dim3 grid(2, 2, 1);
    indexingDemo<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}

3.3 Vector Addition

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define CUDA_CHECK(err) \
    do { \
        cudaError_t e = (err); \
        if (e != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(e)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements
    const size_t bytes = N * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    for (int i = 0; i < N; i++) {
        h_A[i] = (float)rand() / RAND_MAX;
        h_B[i] = (float)rand() / RAND_MAX;
    }

    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_C, bytes));

    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);

    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));

    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = fmaxf(maxError, fabsf(h_C[i] - (h_A[i] + h_B[i])));
    }
    printf("Max error: %e\n", maxError);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

3.4 Matrix Multiplication (Naive)

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

__global__ void matMulNaive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

int main() {
    int n = N;
    size_t bytes = n * n * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    for (int i = 0; i < n * n; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 blockDim(16, 16);
    dim3 gridDim((n + blockDim.x - 1) / blockDim.x,
                 (n + blockDim.y - 1) / blockDim.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    matMulNaive<<<gridDim, blockDim>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    printf("Matrix multiplication (%dx%d): %.3f ms\n", n, n, milliseconds);

    double flops = 2.0 * n * n * n;
    double gflops = flops / (milliseconds * 1e6);
    printf("Performance: %.2f GFLOPS\n", gflops);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0][0] = %.1f (expected %.1f)\n", h_C[0], (float)n);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

4. CUDA Memory Management

4.1 cudaMalloc / cudaFree / cudaMemcpy

// Basic memory operations
float *d_ptr;
cudaMalloc(&d_ptr, size);
cudaMemset(d_ptr, 0, size);
cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice);
cudaMemcpy(h_ptr, d_ptr, size, cudaMemcpyDeviceToHost);
cudaMemcpy(d_dst, d_src, size, cudaMemcpyDeviceToDevice);
cudaFree(d_ptr);

// 2D memory (useful for matrices)
float *d_matrix;
size_t pitch;
cudaMallocPitch(&d_matrix, &pitch, width * sizeof(float), height);
cudaMemcpy2D(d_matrix, pitch,
             h_matrix, width * sizeof(float),
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);

4.2 Unified Memory

Unified Memory lets CPU and GPU share the same virtual address space.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void incrementKernel(int* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] += 1;
    }
}

int main() {
    const int N = 1 << 20;
    int *data;

    // Unified Memory allocation — accessible from both CPU and GPU
    cudaMallocManaged(&data, N * sizeof(int));

    // Initialize on CPU (no cudaMemcpy needed)
    for (int i = 0; i < N; i++) {
        data[i] = i;
    }

    // Process on GPU
    incrementKernel<<<(N + 255) / 256, 256>>>(data, N);
    cudaDeviceSynchronize();

    // Check results on CPU (no cudaMemcpy needed)
    printf("data[0] = %d (expected 1)\n", data[0]);
    printf("data[N-1] = %d (expected %d)\n", data[N-1], N);

    cudaFree(data);
    return 0;
}

On Pascal and later GPUs (which support on-demand page migration), use prefetch hints to improve performance:

// Move data to GPU ahead of kernel launch
cudaMemPrefetchAsync(data, N * sizeof(int), deviceId);
// Move data back to CPU
cudaMemPrefetchAsync(data, N * sizeof(int), cudaCpuDeviceId);
// Memory access hint
cudaMemAdvise(data, N * sizeof(int),
              cudaMemAdviseSetPreferredLocation, deviceId);

4.3 Tiled Matrix Multiplication with Shared Memory

Using shared memory dramatically reduces global memory accesses.

#include <cuda_runtime.h>

#define TILE_SIZE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE_SIZE - 1) / TILE_SIZE; t++) {
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;

        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n)
            ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n)
            ? B[bRow * n + col] : 0.0f;

        __syncthreads();

        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();
    }

    if (row < n && col < n) {
        C[row * n + col] = sum;
    }
}

Performance improvement:

  • Naive: 2N global memory accesses per thread
  • Tiled: tiles loaded N/TILE_SIZE times, each reused TILE_SIZE times
  • Result: TILE_SIZE-fold reduction in global memory traffic (16x for TILE_SIZE=16)
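The reduction in the bullets above can be checked with simple arithmetic (a host-side sketch; function names are illustrative):

```cpp
#include <cassert>

// Global-memory reads per output element of an n x n matmul.
// Naive: each thread reads a full row of A and column of B.
long long naive_reads(long long n) { return 2 * n; }

// Tiled: each tile element loaded from global memory once is
// reused TILE_SIZE times from shared memory.
long long tiled_reads(long long n, int tile) { return 2 * n / tile; }
```

For n = 1024 and TILE_SIZE = 16 this gives 2048 vs 128 reads per output element, the 16x traffic reduction stated above.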

4.4 Pinned Memory

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

void compareBandwidth(int N) {
    size_t bytes = N * sizeof(float);
    float *h_pageable, *h_pinned;
    float *d_data;

    h_pageable = (float*)malloc(bytes);
    cudaMallocHost(&h_pinned, bytes);  // Page-locked allocation
    cudaMalloc(&d_data, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Measure pageable bandwidth
    cudaEventRecord(start);
    for (int i = 0; i < 10; i++) {
        cudaMemcpy(d_data, h_pageable, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float pageable_ms;
    cudaEventElapsedTime(&pageable_ms, start, stop);

    // Measure pinned bandwidth
    cudaEventRecord(start);
    for (int i = 0; i < 10; i++) {
        cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float pinned_ms;
    cudaEventElapsedTime(&pinned_ms, start, stop);

    printf("Pageable: %.2f GB/s\n",
           (bytes * 10.0) / (pageable_ms * 1e6));
    printf("Pinned:   %.2f GB/s\n",
           (bytes * 10.0) / (pinned_ms * 1e6));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_data);
}

Pinned memory is locked in physical RAM, preventing OS page-outs. This enables direct DMA transfers without CPU involvement, yielding 2–3x better transfer rates. Excessive use can impact overall system memory performance.


5. Advanced CUDA Optimization

5.1 Memory Coalescing

When all 32 threads in a warp access contiguous memory addresses, the hardware merges them into a single transaction. Non-contiguous (strided) access requires multiple transactions.
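The cost of strided access can be estimated by counting how many 32-byte memory sectors one warp touches; a rough host-side model (assumes 4-byte elements; the sector size matches current NVIDIA hardware, but this is a simplification of the real memory system):

```cpp
#include <cassert>
#include <set>

// Number of distinct 32-byte sectors touched by one warp of 32 threads
// when lane i accesses element (i * stride) of a float array.
int sectors_touched(int stride) {
    std::set<long long> sectors;
    for (int lane = 0; lane < 32; lane++) {
        long long byte_addr = (long long)lane * stride * 4;  // 4-byte floats
        sectors.insert(byte_addr / 32);                      // 32-byte sector
    }
    return (int)sectors.size();
}
```

Contiguous access (stride 1) needs only 4 sectors for the whole warp; at stride 8 every lane lands in its own sector, an 8x increase in memory traffic for the same 128 bytes of useful data.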

// Bad: strided access (non-coalesced)
__global__ void badAccess(float* data, int N, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N / stride) {
        data[idx * stride] *= 2.0f;
    }
}

// Good: sequential access (coalesced)
__global__ void goodAccess(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] *= 2.0f;
    }
}

// Coalesced matrix transpose using shared memory
#define TILE 32
#define PADDING 1  // Prevent bank conflicts

__global__ void transposeCoalesced(const float* in, float* out,
                                    int rows, int cols) {
    __shared__ float tile[TILE][TILE + PADDING];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read from input
    if (x < cols && y < rows) {
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    }
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    // Coalesced write to output
    if (x < rows && y < cols) {
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }
}

5.2 Shared Memory Bank Conflicts

Shared memory is divided into 32 banks, interleaved at 4-byte granularity. When multiple threads in a warp access the same bank, accesses are serialized.
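Conflict degree can be modeled by mapping each lane's word index to a bank (word_index % 32) and counting worst-case collisions; a host-side sketch:

```cpp
#include <cassert>
#include <map>
#include <algorithm>

// Worst-case number of lanes hitting the same 4-byte bank when lane i
// reads word (i * stride) of a shared-memory float array.
int conflict_degree(int stride) {
    std::map<int, int> bank_count;
    for (int lane = 0; lane < 32; lane++) {
        bank_count[(lane * stride) % 32]++;  // 32 banks, 4-byte words
    }
    int worst = 0;
    for (auto& kv : bank_count) worst = std::max(worst, kv.second);
    return worst;  // 1 = conflict-free, k = k-way conflict
}
```

Note that stride 33 is conflict-free, which is exactly why padding a 32-wide tile to 33 columns (as in the kernels below) eliminates conflicts.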

// Bank conflicts: stride-1 access is conflict-free; stride-2 causes a 2-way conflict
__global__ void bankConflictDemo() {
    __shared__ float shared[32];

    int tid = threadIdx.x;

    // No conflict: thread i accesses shared[i] (different banks)
    float val1 = shared[tid];

    // 2-way bank conflict: threads 0 and 16 both access bank 0
    float val2 = shared[tid * 2 % 32];

    // Broadcast (no conflict despite all threads reading shared[0])
    float val3 = shared[0];

    (void)val1; (void)val2; (void)val3;
}

// Solve bank conflicts with padding
__global__ void noBankConflict(float* data) {
    __shared__ float tile[32][33];  // 33 = 32 + 1 padding

    int tid = threadIdx.x;
    int bid = threadIdx.y;

    tile[bid][tid] = data[bid * 32 + tid];
    __syncthreads();

    data[tid * 32 + bid] = tile[bid][tid];
}

5.3 Minimizing Warp Divergence

// Bad: divergence within a warp
__global__ void warpDivergenceBad(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        if (idx % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
        // Both branches execute sequentially — 2x slower
    }
}

// Good: branch-free
__global__ void noWarpDivergence(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float mask = (float)(idx % 2);
        data[idx] = data[idx] * (2.0f - mask) + mask;
    }
}

// Align branch boundaries to warp boundaries
__global__ void warpAlignedBranch(float* even_data, float* odd_data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = idx / 32;

    if (warp_id % 2 == 0) {
        // All threads in this warp take the same branch — no divergence
        if (idx < N) even_data[idx] *= 2.0f;
    } else {
        if (idx < N) odd_data[idx] += 1.0f;
    }
}

5.4 Occupancy Optimization

Occupancy is the ratio of active warps on an SM to the maximum number of warps that SM supports.
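The ratio can be computed by hand; for example, with 2048 maximum threads per SM (64 warps) and 256-thread blocks limited to 4 resident blocks, occupancy is 50% (host-side sketch):

```cpp
#include <cassert>

// Theoretical occupancy: active warps / maximum warps per SM.
double occupancy(int active_blocks, int block_size,
                 int max_threads_per_sm, int warp_size = 32) {
    int active_warps = active_blocks * block_size / warp_size;
    int max_warps    = max_threads_per_sm / warp_size;
    return (double)active_warps / max_warps;
}
```

Register and shared-memory usage per block are what typically cap active_blocks; the occupancy API below computes the same ratio from those limits automatically.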

#include <cuda_runtime.h>
#include <stdio.h>

void checkOccupancy() {
    // matMulTiled is the tiled kernel from section 4.3
    void* kernelFunc = (void*)matMulTiled;

    int blockSize;
    int minGridSize;

    // Automatically compute optimal block size
    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize,
        kernelFunc, 0, 0
    );

    printf("Optimal block size: %d\n", blockSize);
    printf("Minimum grid size:  %d\n", minGridSize);

    int maxActiveBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActiveBlocks, kernelFunc, blockSize, 0);

    int device;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    float occupancy =
        (maxActiveBlocks * blockSize / prop.warpSize)
        / (float)(prop.maxThreadsPerMultiProcessor / prop.warpSize);

    printf("Theoretical occupancy: %.1f%%\n", occupancy * 100);
}

// Use __launch_bounds__ to guide compiler register allocation
__global__ __launch_bounds__(256, 4)
void optimizedKernel(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] *= 2.0f;
    }
}

5.5 CUDA Streams and Asynchronous Execution

#include <cuda_runtime.h>

#define NUM_STREAMS 4
#define CHUNK_SIZE (1 << 20)

// h_in / h_out should be pinned (cudaMallocHost) for the async copies
// to actually overlap with kernel execution
void asyncProcessing(float* h_in, float* h_out, int N) {
    cudaStream_t streams[NUM_STREAMS];
    float *d_in[NUM_STREAMS], *d_out[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_in[i],  CHUNK_SIZE * sizeof(float));
        cudaMalloc(&d_out[i], CHUNK_SIZE * sizeof(float));
    }

    int numChunks = (N + CHUNK_SIZE - 1) / CHUNK_SIZE;

    for (int chunk = 0; chunk < numChunks; chunk++) {
        int streamId = chunk % NUM_STREAMS;
        int offset = chunk * CHUNK_SIZE;
        int size = min(CHUNK_SIZE, N - offset);

        // Overlapping transfers and kernel execution across streams
        cudaMemcpyAsync(d_in[streamId], h_in + offset,
                        size * sizeof(float),
                        cudaMemcpyHostToDevice, streams[streamId]);

        // Kernel launched asynchronously in stream
        int blockSize = 256;
        int gridSize = (size + blockSize - 1) / blockSize;
        // myKernel<<<gridSize, blockSize, 0, streams[streamId]>>>(...);

        cudaMemcpyAsync(h_out + offset, d_out[streamId],
                        size * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[streamId]);
    }

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(d_in[i]);
        cudaFree(d_out[i]);
    }
}

Cross-stream synchronization:

cudaEvent_t event;
cudaEventCreate(&event);

// Record event in stream0
cudaEventRecord(event, stream0);

// stream1 waits for the event
cudaStreamWaitEvent(stream1, event, 0);

5.6 Performance Measurement with cudaEvent

// 'launch' is a host-side wrapper that performs the <<<...>>> kernel
// launch; a __global__ kernel cannot be called through a plain function
// pointer.
float measureKernelTime(void (*launch)(float*, int),
                        float* d_data, int N, int runs) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up
    launch(d_data, N);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < runs; i++) {
        launch(d_data, N);
    }
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float totalMs = 0;
    cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return totalMs / runs;
}

6. cuBLAS — Linear Algebra Library

cuBLAS is NVIDIA's high-performance BLAS implementation. It automatically leverages Tensor Cores for far superior performance compared to hand-written kernels.

Reference: https://docs.nvidia.com/cuda/cublas/

6.1 Basic Setup

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define CUBLAS_CHECK(err) \
    do { \
        cublasStatus_t e = (err); \
        if (e != CUBLAS_STATUS_SUCCESS) { \
            fprintf(stderr, "cuBLAS error: %d at %s:%d\n", \
                    e, __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

int main() {
    cublasHandle_t handle;
    CUBLAS_CHECK(cublasCreate(&handle));

    // cuBLAS uses column-major order (Fortran-style)
    // Be careful when working with C++ row-major matrices

    CUBLAS_CHECK(cublasDestroy(handle));
    return 0;
}

6.2 SGEMM — Single-Precision Matrix Multiplication

cuBLAS uses column-major storage. For row-major C = A * B, compute C^T = B^T * A^T in column-major terms.
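The trick works because a row-major buffer reinterpreted as column-major is the transpose; a small sketch verifying the index identity:

```cpp
#include <cassert>

// Linear offset of element (i, j) in an m x n matrix:
//   row-major:    i * n + j
//   column-major: j * m + i
// Reading a row-major buffer with column-major indexing therefore
// yields A^T, so cuBLAS can produce row-major C = A*B by computing
// C^T = B^T * A^T without any explicit transposes.
int row_major(int i, int j, int /*m*/, int n) { return i * n + j; }
int col_major(int i, int j, int m, int /*n*/) { return j * m + i; }
```

The identity row_major(i, j, m, n) == col_major(j, i, n, m) is what lets the wrapper below simply swap the A/B arguments and the m/n dimensions.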

// C = alpha * A * B + beta * C
// A: m x k,  B: k x n,  C: m x n
void cublasMatMul(cublasHandle_t handle,
                   int m, int n, int k,
                   const float* d_A, const float* d_B, float* d_C) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasSgemm(handle,
                CUBLAS_OP_N,
                CUBLAS_OP_N,
                n, m, k,
                &alpha,
                d_B, n,
                d_A, k,
                &beta,
                d_C, n);
}

int main() {
    int m = 1024, n = 1024, k = 1024;
    size_t bytesA = m * k * sizeof(float);
    size_t bytesB = k * n * sizeof(float);
    size_t bytesC = m * n * sizeof(float);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytesA);
    cudaMalloc(&d_B, bytesB);
    cudaMalloc(&d_C, bytesC);
    cudaMemset(d_A, 0, bytesA);
    cudaMemset(d_B, 0, bytesB);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasMatMul(handle, m, n, k, d_A, d_B, d_C);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * m * n * k / (ms * 1e9);
    printf("cuBLAS SGEMM: %.2f ms, %.2f TFLOPS\n", ms, tflops);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

6.3 Batched GEMM

Perform multiple small matrix multiplications in a single call:

void batchedMatMul(cublasHandle_t handle,
                    int m, int n, int k, int batchCount,
                    float** d_Aarray, float** d_Barray, float** d_Carray) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasSgemmBatched(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,
                       n, m, k,
                       &alpha,
                       (const float**)d_Barray, n,
                       (const float**)d_Aarray, k,
                       &beta,
                       d_Carray, n,
                       batchCount);
}

// Strided batched (contiguous memory layout)
void stridedBatchedMatMul(cublasHandle_t handle,
                           int m, int n, int k, int batchCount,
                           float* d_A, float* d_B, float* d_C) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    long long strideA = (long long)m * k;
    long long strideB = (long long)k * n;
    long long strideC = (long long)m * n;

    cublasSgemmStridedBatched(handle,
                               CUBLAS_OP_N, CUBLAS_OP_N,
                               n, m, k,
                               &alpha,
                               d_B, n, strideB,
                               d_A, k, strideA,
                               &beta,
                               d_C, n, strideC,
                               batchCount);
}

7. cuDNN — Deep Learning Primitives

cuDNN provides GPU-accelerated implementations of convolution, pooling, normalization, and other deep learning operations. PyTorch, TensorFlow, and most deep learning frameworks use cuDNN as their backend.

Reference: https://docs.nvidia.com/deeplearning/cudnn/

7.1 cuDNN Convolution

#include <cudnn.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define CUDNN_CHECK(err) \
    do { \
        cudnnStatus_t e = (err); \
        if (e != CUDNN_STATUS_SUCCESS) { \
            fprintf(stderr, "cuDNN error: %s at %s:%d\n", \
                    cudnnGetErrorString(e), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

int main() {
    cudnnHandle_t handle;
    CUDNN_CHECK(cudnnCreate(&handle));

    // Input: N=1, C=3, H=224, W=224 (NCHW format)
    int N = 1, C = 3, H = 224, W = 224;
    // Filter: K=64, C=3, R=3, S=3
    int K = 64, R = 3, S = 3;
    int pad_h = 1, pad_w = 1;
    int stride_h = 1, stride_w = 1;
    int dilation_h = 1, dilation_w = 1;

    cudnnTensorDescriptor_t inputDesc, outputDesc;
    cudnnFilterDescriptor_t filterDesc;
    cudnnConvolutionDescriptor_t convDesc;

    CUDNN_CHECK(cudnnCreateTensorDescriptor(&inputDesc));
    CUDNN_CHECK(cudnnCreateTensorDescriptor(&outputDesc));
    CUDNN_CHECK(cudnnCreateFilterDescriptor(&filterDesc));
    CUDNN_CHECK(cudnnCreateConvolutionDescriptor(&convDesc));

    CUDNN_CHECK(cudnnSetTensor4dDescriptor(
        inputDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, C, H, W));

    CUDNN_CHECK(cudnnSetFilter4dDescriptor(
        filterDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, K, C, R, S));

    CUDNN_CHECK(cudnnSetConvolution2dDescriptor(
        convDesc, pad_h, pad_w, stride_h, stride_w,
        dilation_h, dilation_w,
        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

    CUDNN_CHECK(cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH));

    // Compute output dimensions
    int out_N, out_C, out_H, out_W;
    CUDNN_CHECK(cudnnGetConvolution2dForwardOutputDim(
        convDesc, inputDesc, filterDesc,
        &out_N, &out_C, &out_H, &out_W));
    printf("Output: %d x %d x %d x %d\n", out_N, out_C, out_H, out_W);

    CUDNN_CHECK(cudnnSetTensor4dDescriptor(
        outputDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
        out_N, out_C, out_H, out_W));

    // Find the best algorithm
    cudnnConvolutionFwdAlgoPerf_t perfResults[10];
    int returnedCount;
    CUDNN_CHECK(cudnnFindConvolutionForwardAlgorithm(
        handle, inputDesc, filterDesc, convDesc, outputDesc,
        10, &returnedCount, perfResults));

    cudnnConvolutionFwdAlgo_t algo = perfResults[0].algo;
    printf("Best algorithm: %d (%.3f ms)\n", algo, perfResults[0].time);

    // Allocate workspace
    size_t workspaceSize;
    CUDNN_CHECK(cudnnGetConvolutionForwardWorkspaceSize(
        handle, inputDesc, filterDesc, convDesc, outputDesc,
        algo, &workspaceSize));

    void* workspace;
    cudaMalloc(&workspace, workspaceSize);

    float *d_input, *d_filter, *d_output;
    cudaMalloc(&d_input,  N * C * H * W * sizeof(float));
    cudaMalloc(&d_filter, K * C * R * S * sizeof(float));
    cudaMalloc(&d_output, out_N * out_C * out_H * out_W * sizeof(float));

    float alpha = 1.0f, beta = 0.0f;
    CUDNN_CHECK(cudnnConvolutionForward(
        handle, &alpha,
        inputDesc, d_input,
        filterDesc, d_filter,
        convDesc, algo, workspace, workspaceSize,
        &beta,
        outputDesc, d_output));

    printf("Convolution completed successfully!\n");

    cudaFree(d_input); cudaFree(d_filter);
    cudaFree(d_output); cudaFree(workspace);
    cudnnDestroyTensorDescriptor(inputDesc);
    cudnnDestroyTensorDescriptor(outputDesc);
    cudnnDestroyFilterDescriptor(filterDesc);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroy(handle);

    return 0;
}

8. Mixed Precision Training

8.1 Numeric Format Comparison

Format  Bits  Exponent  Mantissa  Dynamic Range  Precision
FP32    32    8         23        ~1e38          ~7 digits
FP16    16    5         10        ~65504         ~3 digits
BF16    16    8         7         ~1e38          ~2 digits
TF32    19    8         10        ~1e38          ~3 digits
FP8     8     5/4       2/3       limited        low

FP16 halves memory usage and enables Tensor Cores but has a narrow range prone to overflow/underflow. BF16 matches FP32's exponent range for better numerical stability.
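BF16's tradeoff can be illustrated in plain C++ by truncating a float's lower 16 bits, which keeps the sign, the full 8-bit exponent, and a 7-bit mantissa (a simplified sketch: real hardware rounds to nearest even rather than truncating):

```cpp
#include <cassert>
#include <cstring>
#include <cstdint>

// Simulate BF16 by keeping only the top 16 bits of an IEEE-754 float
// (sign + 8-bit exponent + 7-bit mantissa). Truncation only.
float to_bf16_trunc(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;  // drop the low 16 mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}
```

Values like 1.0 survive exactly, most values lose low-order mantissa bits, yet even 1e38 remains representable because the exponent field is untouched.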

8.2 FP16 in CUDA

#include <cuda_fp16.h>
#include <cuda_runtime.h>

// FP16 vector addition
__global__ void fp16VectorAdd(const __half* A, const __half* B,
                               __half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = __hadd(A[idx], B[idx]);
    }
}

// Vectorized FP16 (process 2 elements at once)
__global__ void fp16VectorAddVec2(const __half2* A, const __half2* B,
                                   __half2* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N / 2) {
        C[idx] = __hadd2(A[idx], B[idx]);
    }
}

// FP32 <-> FP16 conversion
__global__ void convertF32toF16(const float* f32, __half* f16, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        f16[idx] = __float2half(f32[idx]);
    }
}

__global__ void convertF16toF32(const __half* f16, float* f32, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        f32[idx] = __half2float(f16[idx]);
    }
}

8.3 PyTorch AMP (Automatic Mixed Precision)

import torch
import torch.nn as nn
from torch.amp import autocast, GradScaler  # torch.amp supports device_type=

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10)
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# GradScaler prevents FP16 gradient underflow via loss scaling
scaler = GradScaler()

def train_step(x, y):
    optimizer.zero_grad()

    # autocast automatically casts to FP16/BF16
    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(x)
        loss = criterion(output, y)

    # Scaled backward pass
    scaler.scale(loss).backward()

    # Unscale before gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # step() only updates if gradients are finite
    scaler.step(optimizer)
    scaler.update()

    return loss.item()

# BF16 variant (Ampere and later — no loss scaling needed)
def train_step_bf16(x, y):
    optimizer.zero_grad()
    with autocast(device_type='cuda', dtype=torch.bfloat16):
        output = model(x)
        loss = criterion(output, y)
    loss.backward()
    optimizer.step()
    return loss.item()

batch_size = 128
for epoch in range(10):
    x = torch.randn(batch_size, 1024).cuda()
    y = torch.randint(0, 10, (batch_size,)).cuda()
    loss = train_step(x, y)
    if epoch % 2 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}, "
              f"scale = {scaler.get_scale():.0f}")

8.4 Loss Scaling

FP16's limited exponent range means small gradients can flush to zero (underflow): the smallest FP16 subnormal is about 6e-8. Loss scaling multiplies the loss by a large factor before the backward pass so gradients land in representable range, then divides them back before the optimizer step.

# Manual loss scaling (conceptual illustration)
scale_factor = 2**15  # 32768

with autocast(device_type='cuda', dtype=torch.float16):
    loss = criterion(output, y)

(loss * scale_factor).backward()

for param in model.parameters():
    if param.grad is not None:
        param.grad /= scale_factor

valid = all(
    torch.isfinite(p.grad).all()
    for p in model.parameters()
    if p.grad is not None
)

if valid:
    optimizer.step()
else:
    print("Skipping step: invalid gradients detected")

GradScaler automates this and adjusts the scale dynamically: it halves the scale whenever non-finite gradients are detected, and doubles it after a run of successful steps (growth_interval, 2000 by default).
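The update rule can be sketched as a toy simulation (this is not the real torch API; the class is ours, with parameter names mirroring GradScaler's defaults):

```python
class DynamicLossScaler:
    """Toy sketch of GradScaler's dynamic scale update rule."""
    def __init__(self, init_scale=2**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale *= self.backoff_factor  # halve on overflow
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor  # double after a clean run
                self._good_steps = 0

scaler = DynamicLossScaler(growth_interval=3)
for found_inf in [False, False, False, True, False]:
    scaler.update(found_inf)
print(scaler.scale)  # 65536.0: grew once to 131072, then halved on overflow
```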


9. CUDA Profiling and Debugging

9.1 Nsight Systems

Nsight Systems provides system-level profiling:

# Command-line profiling
nsys profile --trace=cuda,nvtx,osrt \
             --output=profile_output \
             ./my_application

# Open in GUI
nsys-ui profile_output.nsys-rep

# Generate text report
nsys stats profile_output.nsys-rep

9.2 Nsight Compute

Kernel-level deep analysis:

# Profile a specific kernel
ncu --set full \
    --kernel-name matMulTiled \
    --launch-count 1 \
    --output kernel_profile \
    ./my_application

# Collect selected metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__t_sector_hit_rate.pct,\
l2cache__hit_rate.pct \
    ./my_application

9.3 NVTX Markers for Profiling Ranges

#include <nvtx3/nvToolsExt.h>

// myKernel: any device kernel taking (float*, int)
void profiledFunction(float* h_data, float* h_output,
                      float* d_data, int N) {
    nvtxRangePushA("Data Transfer H2D");
    cudaMemcpy(d_data, h_data, N * sizeof(float),
               cudaMemcpyHostToDevice);
    nvtxRangePop();

    nvtxRangePushA("Kernel Execution");
    int block = 256;
    int grid = (N + block - 1) / block;
    myKernel<<<grid, block>>>(d_data, N);
    cudaDeviceSynchronize();
    nvtxRangePop();

    nvtxRangePushA("Data Transfer D2H");
    cudaMemcpy(h_output, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);
    nvtxRangePop();
}

9.4 CUDA Memory Checking

# Detect out-of-bounds and use-after-free
compute-sanitizer --tool memcheck ./my_application

# Detect race conditions
compute-sanitizer --tool racecheck ./my_application

# Detect uninitialized memory access
compute-sanitizer --tool initcheck ./my_application

# Detect synchronization errors
compute-sanitizer --tool synccheck ./my_application

10. Custom CUDA Deep Learning Operators

10.1 PyTorch Custom CUDA Extension

Writing fused kernels reduces memory round-trips and can outperform compositions of standard library ops. Reference: https://pytorch.org/tutorials/advanced/cpp_extension.html

Project layout:

custom_op/
  setup.py
  fused_ops.cpp
  fused_ops_kernel.cu

fused_ops_kernel.cu:

#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Fused Bias + ReLU
__global__ void fusedBiasReluKernel(
    float* output,
    const float* input,
    const float* bias,
    int batch_size,
    int features
) {
    int b = blockIdx.x;
    int f = blockIdx.y * blockDim.x + threadIdx.x;

    if (b < batch_size && f < features) {
        float val = input[b * features + f] + bias[f];
        output[b * features + f] = fmaxf(val, 0.0f);
    }
}

// GELU approximation
// GELU(x) = x * Phi(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
__device__ float gelu_approx(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    const float coeff = 0.044715f;
    float x3 = x * x * x;
    float inner = sqrt_2_over_pi * (x + coeff * x3);
    return 0.5f * x * (1.0f + tanhf(inner));
}

// Fused Bias + GELU
__global__ void fusedBiasGeluKernel(
    float* output,
    const float* input,
    const float* bias,
    int batch_size,
    int features
) {
    int b = blockIdx.x;
    int f = blockIdx.y * blockDim.x + threadIdx.x;

    if (b < batch_size && f < features) {
        float val = input[b * features + f] + bias[f];
        output[b * features + f] = gelu_approx(val);
    }
}

// Layer Normalization kernel
__global__ void layerNormKernel(
    float* output,
    const float* input,
    const float* gamma,
    const float* beta,
    int batch_size,
    int features,
    float eps
) {
    int b = blockIdx.x;
    if (b >= batch_size) return;

    __shared__ float shared_mean;
    __shared__ float shared_var;
    if (threadIdx.x == 0) {
        shared_mean = 0.0f;  // must be zeroed before atomicAdd accumulation
        shared_var = 0.0f;
    }
    __syncthreads();

    // Step 1: Compute mean (reduction)
    float sum = 0.0f;
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        sum += input[b * features + f];
    }
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    if (threadIdx.x % warpSize == 0) {
        atomicAdd(&shared_mean, sum);
    }
    __syncthreads();

    float mean = shared_mean / features;

    // Step 2: Compute variance
    float var_sum = 0.0f;
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        float diff = input[b * features + f] - mean;
        var_sum += diff * diff;
    }
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        var_sum += __shfl_down_sync(0xffffffff, var_sum, offset);
    }
    if (threadIdx.x % warpSize == 0) {
        atomicAdd(&shared_var, var_sum);
    }
    __syncthreads();

    float inv_std = rsqrtf(shared_var / features + eps);

    // Step 3: Normalize and apply scale/shift
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        float normalized = (input[b * features + f] - mean) * inv_std;
        output[b * features + f] = gamma[f] * normalized + beta[f];
    }
}
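Before wiring the GELU kernel into PyTorch, it is worth checking the tanh approximation against the exact erf-based definition. A small Python comparison (function names are ours):

```python
import math

def gelu_exact(x: float) -> float:
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Same tanh approximation as gelu_approx in the kernel above
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

max_err = max(abs(gelu_exact(x / 10) - gelu_tanh(x / 10))
              for x in range(-50, 51))
print(f"max abs error on [-5, 5]: {max_err:.2e}")  # stays below 1e-3
```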

Launch wrappers go at the bottom of fused_ops_kernel.cu, because the <<<...>>> launch syntax is only understood by nvcc, not by the host C++ compiler that builds fused_ops.cpp:

void launchFusedBiasRelu(float* output, const float* input, const float* bias,
                         int batch_size, int features) {
    dim3 block(256);
    dim3 grid(batch_size, (features + block.x - 1) / block.x);
    fusedBiasReluKernel<<<grid, block>>>(output, input, bias,
                                         batch_size, features);
}

void launchFusedBiasGelu(float* output, const float* input, const float* bias,
                         int batch_size, int features) {
    dim3 block(256);
    dim3 grid(batch_size, (features + block.x - 1) / block.x);
    fusedBiasGeluKernel<<<grid, block>>>(output, input, bias,
                                         batch_size, features);
}

fused_ops.cpp:

#include <torch/extension.h>

// Launch wrappers defined in fused_ops_kernel.cu
void launchFusedBiasRelu(float*, const float*, const float*, int, int);
void launchFusedBiasGelu(float*, const float*, const float*, int, int);

torch::Tensor fused_bias_relu(torch::Tensor input, torch::Tensor bias) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(bias.is_cuda(), "bias must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");

    int batch_size = input.size(0);
    int features = input.size(1);
    auto output = torch::empty_like(input);

    launchFusedBiasRelu(
        output.data_ptr<float>(),
        input.data_ptr<float>(),
        bias.data_ptr<float>(),
        batch_size, features
    );

    return output;
}

torch::Tensor fused_bias_gelu(torch::Tensor input, torch::Tensor bias) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(bias.is_cuda(), "bias must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");

    int batch_size = input.size(0);
    int features = input.size(1);
    auto output = torch::empty_like(input);

    launchFusedBiasGelu(
        output.data_ptr<float>(),
        input.data_ptr<float>(),
        bias.data_ptr<float>(),
        batch_size, features
    );

    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("fused_bias_relu", &fused_bias_relu, "Fused Bias + ReLU (CUDA)");
    m.def("fused_bias_gelu", &fused_bias_gelu, "Fused Bias + GELU (CUDA)");
}

setup.py:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='fused_ops',
    ext_modules=[
        CUDAExtension(
            name='fused_ops',
            sources=[
                'fused_ops.cpp',
                'fused_ops_kernel.cu',
            ],
            extra_compile_args={
                'cxx': ['-O3'],
                'nvcc': [
                    '-O3',
                    '-arch=sm_80',
                    '--use_fast_math',
                    '-Xptxas', '-v',
                ],
            }
        )
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)

Build and use:

cd custom_op
python setup.py install

Then, from Python:
import torch
import fused_ops

batch_size, features = 128, 4096
x = torch.randn(batch_size, features, device='cuda')
bias = torch.randn(features, device='cuda')

# Reference: separate ops
y_ref = torch.relu(x + bias)

# Custom fused op
y_custom = fused_ops.fused_bias_relu(x, bias)

print(f"Max diff: {(y_ref - y_custom).abs().max():.6f}")

import time

def benchmark(fn, *args, warmup=10, runs=100):
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000

ref_time    = benchmark(lambda x, b: torch.relu(x + b), x, bias)
custom_time = benchmark(fused_ops.fused_bias_relu, x, bias)

print(f"PyTorch baseline:  {ref_time:.3f} ms")
print(f"Custom fused op:   {custom_time:.3f} ms")
print(f"Speedup:           {ref_time / custom_time:.2f}x")
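The speedup comes from memory traffic: the unfused version writes a full intermediate tensor and reads it back. A back-of-envelope count (our sketch; counts only ideal global-memory traffic and ignores caches):

```python
def traffic_mb(batch: int, features: int, fused: bool,
               bytes_per: int = 4) -> float:
    """Approximate global-memory bytes moved for relu(x + bias)."""
    n = batch * features
    if fused:
        # one kernel: read x, read bias, write y
        moved = n + features + n
    else:
        # add kernel: read x + bias, write tmp; relu kernel: read tmp, write y
        moved = (n + features + n) + (n + n)
    return moved * bytes_per / 1e6

print(traffic_mb(128, 4096, fused=False))  # ~8.4 MB
print(traffic_mb(128, 4096, fused=True))   # ~4.2 MB -> roughly 2x less traffic
```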

10.2 Flash Attention Concept

Flash Attention reduces attention memory usage from O(N²) to O(N): instead of materializing the full N×N attention matrix in global memory, it processes Q, K, and V in tiles that fit in shared memory, maintaining running softmax statistics so the result stays exact:
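The core trick is the online softmax recurrence, shown here as a scalar Python sketch (our illustration, for a single query row and scalar values):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax(scores) . values via the Flash Attention recurrence:
    keep only a running max m, denominator l, and accumulator acc."""
    m = float('-inf')
    l = 0.0
    acc = 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        corr = math.exp(m - m_new)        # rescale old state to the new max
        l = l * corr + math.exp(s - m_new)
        acc = acc * corr + math.exp(s - m_new) * v
        m = m_new
    return acc / l

scores = [1.0, 3.0, 2.0]
values = [10.0, 20.0, 30.0]

# Two-pass reference for comparison
w = [math.exp(s - max(scores)) for s in scores]
ref = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

print(online_softmax_weighted_sum(scores, values), ref)  # the two agree
```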

// Simplified Flash Attention Forward (illustrative)
__global__ void flashAttentionForward(
    float* output,
    const float* Q,
    const float* K,
    const float* V,
    float* L,
    int B, int H, int N, int D,
    int BLOCK_N
) {
    int b = blockIdx.z;
    int h = blockIdx.y;
    int q_block = blockIdx.x;
    int q_start = q_block * BLOCK_N;

    extern __shared__ float smem[];
    float* q_tile = smem;
    float* k_tile = smem + BLOCK_N * D;
    float* v_tile = k_tile + BLOCK_N * D;

    // Load Q tile
    for (int i = threadIdx.x; i < BLOCK_N * D; i += blockDim.x) {
        int qi = i / D, di = i % D;
        int q_idx = q_start + qi;
        q_tile[i] = (q_idx < N)
            ? Q[((b * H + h) * N + q_idx) * D + di] : 0.0f;
    }

    float m_prev = -INFINITY;  // running max for this thread's query row
    float l_prev = 0.0f;       // running softmax denominator

    // Iterate over K/V tiles
    for (int kv_block = 0; kv_block < (N + BLOCK_N - 1) / BLOCK_N; kv_block++) {
        int kv_start = kv_block * BLOCK_N;

        for (int i = threadIdx.x; i < BLOCK_N * D; i += blockDim.x) {
            int ki = i / D, di = i % D;
            int k_idx = kv_start + ki;
            k_tile[i] = (k_idx < N)
                ? K[((b * H + h) * N + k_idx) * D + di] : 0.0f;
        }
        __syncthreads();

        int tid = threadIdx.x;
        if (tid < BLOCK_N && (q_start + tid) < N) {
            float m_new = m_prev;
            for (int k = 0; k < BLOCK_N && (kv_start + k) < N; k++) {
                float dot = 0.0f;
                for (int d = 0; d < D; d++) {
                    dot += q_tile[tid * D + d] * k_tile[k * D + d];
                }
                dot /= sqrtf((float)D);
                m_new = fmaxf(m_new, dot);
            }
            // Numerically stable running update: rescale l_prev and the
            // output accumulator by expf(m_prev - m_new), then fold in
            // this tile's contributions
            m_prev = m_new;
        }
        __syncthreads();
    }
        __syncthreads();
    }
}

For production use, see the open-source Flash Attention implementation at https://github.com/Dao-AILab/flash-attention.


11. Optimization Checklist

11.1 Memory

  1. Verify global memory coalescing: warp threads should access contiguous addresses
  2. Use shared memory: cache frequently reused data
  3. Eliminate bank conflicts: add padding to shared memory arrays
  4. Prevent register spilling: check with --ptxas-options=-v
  5. Batch H2D/D2H transfers: overlap with computation using streams

11.2 Execution

  1. Maximize occupancy: use cudaOccupancyMaxPotentialBlockSize
  2. Eliminate warp divergence: align branch boundaries with warp boundaries
  3. Stream pipelining: overlap data transfers with kernel execution
  4. Kernel fusion: merge multi-step operations to reduce memory round-trips
  5. Mixed precision: use FP16/BF16 to activate Tensor Cores

11.3 Library Usage

  1. cuBLAS: always prefer cuBLAS for matrix operations over manual kernels
  2. cuDNN: use for all standard deep learning primitives
  3. cuSPARSE: sparse matrix operations
  4. Thrust: parallel algorithms (sort, reduce, scan)
  5. CUB: block/warp-level collective operation primitives

12. Reference Benchmarks

A100 80 GB GPU, approximate figures:

| Operation           | Size        | Time   | Performance |
|---------------------|-------------|--------|-------------|
| Vector add (FP32)   | 1M elements | 0.1 ms | 80 GB/s     |
| Naive matmul        | 1024x1024   | 8 ms   | 2.6 TFLOPS  |
| Tiled matmul        | 1024x1024   | 2 ms   | 10.4 TFLOPS |
| cuBLAS SGEMM        | 1024x1024   | 0.3 ms | 72 TFLOPS   |
| cuBLAS HGEMM        | 4096x4096   | 1.5 ms | 183 TFLOPS  |
| Convolution (cuDNN) | 224x224x64  | 0.5 ms | --          |
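GEMM throughput figures derive from the standard 2·n³ operation count (one multiply plus one add per output MAC). A quick conversion helper, applied to a hypothetical timing:

```python
def gemm_tflops(n: int, time_ms: float) -> float:
    """Achieved TFLOPS for an n x n x n matrix multiply (2*n^3 FLOPs)."""
    return 2 * n ** 3 / (time_ms * 1e-3) / 1e12

# e.g. a hypothetical 4096^3 FP16 GEMM measured at 2.0 ms:
print(f"{gemm_tflops(4096, 2.0):.1f} TFLOPS")  # 68.7 TFLOPS
```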

References