CUDA Programming Fundamentals: Complete Guide to GPU Parallel Computing


1. GPU vs CPU Architecture Differences

To understand CUDA programming, you must first grasp the fundamental architectural differences between GPUs and CPUs.

1.1 CPU: A Processor Optimized for Sequential Processing

The CPU (Central Processing Unit) is equipped with complex control flow, branch prediction, and large caches, making it optimized for sequential tasks. A typical high-performance CPU has 8 to 64 cores, each capable of independently executing complex instructions at high speed. Most of the CPU's transistors are dedicated to control logic and cache, focusing on maximizing single-thread execution speed.

1.2 GPU: A Processor Optimized for Massive Parallel Processing

In contrast, the GPU (Graphics Processing Unit) is equipped with thousands of small cores, specializing in massively parallel computation. NVIDIA GPUs are organized into units called Streaming Multiprocessors (SMs), each containing dozens to hundreds of CUDA Cores. Most of the GPU's transistors are dedicated to arithmetic logic units (ALUs), enabling simultaneous execution of thousands of threads.

Characteristic       | CPU                          | GPU
Number of Cores      | 8-64 (high performance)      | Thousands to tens of thousands
Core Characteristics | Complex and powerful         | Simple and lightweight
Cache Size           | Large (tens of MB)           | Relatively small
Optimal Tasks        | Sequential, complex branches | Massive data parallel processing
Memory Bandwidth     | Relatively low               | Very high (HBM)

1.3 SIMT Execution Model

NVIDIA GPUs use the SIMT (Single Instruction, Multiple Threads) execution model. SIMT is similar to SIMD (Single Instruction, Multiple Data), but with a key difference. In SIMD, the vector width is exposed to software, whereas SIMT specifies the execution and branching behavior of individual threads. Each thread has its own program counter and register state, and can logically follow independent execution paths.

The core of SIMT is warp-based execution. The GPU groups 32 threads into a single warp and executes the same instruction simultaneously. Maximum performance is achieved when all threads in a warp follow the same code path. When threads take different branches, performance degrades (known as warp divergence, discussed in detail later).


2. CUDA Programming Model: Grid, Block, Thread Hierarchy

The most fundamental concept in the CUDA programming model is the thread hierarchy. When a kernel function is invoked in CUDA, numerous threads are created and executed in parallel, organized into a three-level hierarchy: Grid, Block, and Thread.

2.1 Thread

A thread is the most basic unit of CUDA execution. Each thread executes one instance of the kernel code and has its own registers and local memory. Each thread determines which data it should process through its unique ID.

2.2 Thread Block (Block)

A Thread Block is a group of threads. Threads within the same block share the following characteristics:

  • Can share data through Shared Memory
  • Can synchronize via __syncthreads()
  • Execute on a single SM and do not migrate to another SM during execution
  • Can contain a maximum of 1024 threads (may vary by Compute Capability)

Thread Blocks can be organized in 1D, 2D, or 3D, allowing natural indexing for vector, matrix, and volume data.

2.3 Grid

A Grid is a collection of Thread Blocks. A single kernel invocation creates one Grid. Grids can also be organized in 1D, 2D, or 3D. Direct data sharing via Shared Memory between different blocks is not possible, and synchronization is limited (special APIs such as Cooperative Groups must be used).

2.4 Thread Block Cluster (Compute Capability 9.0 and above)

Starting with the NVIDIA Hopper architecture (Compute Capability 9.0), an optional layer called Thread Block Cluster was added. A cluster consists of multiple Thread Blocks, and blocks within the same cluster execute on the same GPC (GPU Processing Cluster), enabling access to each other's Shared Memory through Distributed Shared Memory.

Grid
 +-- Thread Block Cluster (optional, CC 9.0+)
      +-- Thread Block (up to 1024 threads)
           +-- Thread (individual execution unit)
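
As an illustrative sketch, a cluster shape can be requested at compile time with the __cluster_dims__ kernel attribute (this requires building for Compute Capability 9.0 or newer, e.g. nvcc -arch=sm_90; the kernel and launch shape here are hypothetical):

```cuda
// Sketch: a kernel always launched as clusters of 2 x 1 x 1 Thread Blocks.
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(float *data) {
    // Blocks in the same cluster are co-scheduled on one GPC and can
    // exchange data via Distributed Shared Memory.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;
}

// Launched like any other kernel; the grid size must be a multiple of
// the cluster size in each dimension:
// clusterKernel<<<8, 256>>>(d_data);
```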

3. CUDA Kernel Development and Execution Configuration

3.1 Kernel Function Definition

In CUDA, functions that execute on the GPU are called kernels. Kernel functions are defined using the __global__ qualifier, and the return type must be void.

__global__ void myKernel(int *data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] = data[idx] * 2;
    }
}

CUDA has three function qualifiers:

Qualifier  | Execution Location | Call Location
__global__ | GPU (Device)       | CPU (Host) or GPU
__device__ | GPU (Device)       | GPU (Device)
__host__   | CPU (Host)         | CPU (Host)

Using __host__ and __device__ together compiles the function for both host and device.
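
For example, a minimal sketch of a utility compiled for both sides:

```cuda
// Compiled for both host and device: callable from main() and from kernels.
__host__ __device__ inline float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void clampKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = clampf(data[idx], 0.0f, 1.0f);
}
```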

3.2 Execution Configuration

When calling a kernel, the Grid and Block dimensions are specified using the <<<gridDim, blockDim>>> syntax.

// 1D configuration
int N = 1024;
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, N);

// 2D configuration
dim3 blockDim(16, 16);      // 16x16 = 256 threads per block
dim3 gridDim(64, 64);       // 64x64 blocks
matMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

// Shared Memory size and Stream specification
myKernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args);

The full form of the <<<>>> syntax is <<<gridDim, blockDim, sharedMemBytes, stream>>>. The third argument is the size in bytes of dynamically allocated Shared Memory, and the fourth is the CUDA Stream. If omitted, they default to 0 and the default stream, respectively.

Important: There is a hardware limit on the number of threads per block. On all current NVIDIA GPUs this limit is 1024 threads per block. Exceeding it causes the kernel launch to fail with an invalid-configuration error rather than running with fewer threads.


4. Thread Indexing: threadIdx, blockIdx, blockDim, gridDim

Inside a kernel, each thread uses built-in variables to determine its position. threadIdx and blockIdx are of type uint3, while blockDim and gridDim are of type dim3; all have .x, .y, .z members.

4.1 Built-in Variables

Variable  | Description
threadIdx | Thread index within the block (starting from 0)
blockIdx  | Block index within the grid (starting from 0)
blockDim  | Block dimensions (number of threads in the block)
gridDim   | Grid dimensions (number of blocks in the grid)
warpSize  | Warp size (currently always 32)

4.2 Global Thread ID Computation

For a 1D grid with 1D blocks:

int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

For a 2D grid with 2D blocks:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int globalIdx = row * width + col;

The same pattern extends to 3D:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;

Boundary checks are essential. When the data size is not an exact multiple of the block size, some threads in the last block may fall outside the valid range, so boundary checking must always be performed:

if (globalIdx < N) {
    // Perform computation only for valid indices
    output[globalIdx] = input[globalIdx] * 2;
}
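
A common, robust extension of this pattern is the grid-stride loop, which lets a fixed-size grid process arbitrarily large N while keeping the boundary check (a minimal sketch):

```cuda
__global__ void scaleKernel(float *data, int n) {
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    // Each thread handles elements i, i + stride, i + 2*stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= 2.0f;
    }
}
```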

5. GPU Memory Types

CUDA GPUs provide multiple memory spaces, each with different access speeds, sizes, and visibility (scope). Choosing the right memory is the key to CUDA program optimization.

5.1 Register

  • Location: On-chip (inside the SM)
  • Visibility: Private to each thread
  • Speed: Fastest (1 cycle latency)
  • Size: 64K 32-bit registers per SM, up to 255 per thread

Local variables within a kernel are allocated to registers by default. High register usage reduces the number of threads that can simultaneously reside on an SM, decreasing occupancy.

5.2 Local Memory

  • Location: Off-chip (Device Memory, same physical location as Global Memory)
  • Visibility: Private to each thread
  • Speed: Same as Global Memory (slow, hundreds of cycles)
  • Purpose: Local variables that don't fit in registers (register spill), large arrays

Despite the name "Local," it is actually located off-chip, so access speed is slow. The compiler automatically spills to Local Memory when registers are insufficient.

5.3 Shared Memory

  • Location: On-chip (inside the SM, shares physical space with L1 Cache)
  • Visibility: All threads within the same Thread Block
  • Speed: Close to registers when there are no bank conflicts
  • Size: 48KB-228KB per SM (varies by architecture)

__global__ void sharedMemExample(float *data) {
    __shared__ float sharedData[256];  // Static allocation

    int tid = threadIdx.x;
    sharedData[tid] = data[blockIdx.x * blockDim.x + tid];

    __syncthreads();  // Synchronize all threads in the block

    // Operations using sharedData
    float result = sharedData[tid] + sharedData[255 - tid];
    data[blockIdx.x * blockDim.x + tid] = result;
}

Shared Memory is divided into banks (typically 32), and when multiple threads access the same bank simultaneously, a bank conflict occurs, causing serialization. Designing access patterns that avoid bank conflicts is important.
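
A classic illustration is a shared-memory tile used for a matrix transpose: with a 32x32 tile, reading a column hits the same bank 32 times, and padding each row by one element shifts successive rows into different banks (a sketch):

```cuda
#define TILE 32

__global__ void transposeTile(float *out, const float *in, int n) {
    // +1 padding: elements in the same column now fall in different banks,
    // eliminating the 32-way bank conflict on column reads.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // transposed block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```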

5.4 Global Memory

  • Location: Off-chip (Device DRAM, i.e., HBM or GDDR)
  • Visibility: All threads + host
  • Speed: Slowest (hundreds of cycles latency)
  • Size: Largest (several GB to tens of GB)

Memory allocated with cudaMalloc() is Global Memory. Access speed can be improved through L1/L2 caches, and bandwidth should be maximized through coalesced access (adjacent threads accessing contiguous memory addresses).

5.5 Constant Memory

  • Location: Off-chip (Global Memory region), cached via dedicated cache
  • Visibility: All threads (read-only)
  • Speed: Very fast on cache hit
  • Size: 64KB
__constant__ float constData[256];

// Set values from host
cudaMemcpyToSymbol(constData, hostData, sizeof(float) * 256);

Delivers best performance when all threads in a warp read the same address. Through a broadcast mechanism, a single memory read delivers the value to all threads in the warp.

5.6 Texture Memory

  • Location: Off-chip, cached via dedicated Texture Cache
  • Visibility: All threads (read-only)
  • Features: Optimized for 2D spatial locality, hardware interpolation support

Texture Memory is advantageous for access patterns with 2D spatial locality, such as neighboring-pixel reads in image processing on 2D/3D datasets. In modern CUDA the legacy texture reference API is deprecated; reads go through Texture Objects, with Surface Objects used for writable access.
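
A minimal sketch of the Texture Object API (the names h_image, width, and height are illustrative): the image is copied into a cudaArray, described by resource and texture descriptors, then sampled in a kernel with tex2D.

```cuda
// Sketch: set up hardware-interpolated sampling of a width x height float image.
cudaChannelFormatDesc chanDesc = cudaCreateChannelDesc<float>();
cudaArray_t cuArray;
cudaMallocArray(&cuArray, &chanDesc, width, height);
cudaMemcpy2DToArray(cuArray, 0, 0, h_image, width * sizeof(float),
                    width * sizeof(float), height, cudaMemcpyHostToDevice);

cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = cuArray;

cudaTextureDesc texDesc = {};
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.addressMode[1] = cudaAddressModeClamp;
texDesc.filterMode     = cudaFilterModeLinear;   // hardware bilinear interpolation
texDesc.readMode       = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
// Inside a kernel: float v = tex2D<float>(tex, x + 0.5f, y + 0.5f);
```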

5.7 Memory Hierarchy Summary

Fast <--------------------------------------------> Slow
Register > Shared Memory > L1/L2 Cache > Constant/Texture Cache > Global Memory
(On-chip)   (On-chip)      (On-chip)     (Cached)                 (Off-chip)

6. Memory Management API

6.1 Explicit Memory Management

The traditional CUDA memory management approach explicitly separates and manages host and device memory.

cudaMalloc: Device Memory Allocation

float *d_array;
cudaMalloc((void **)&d_array, N * sizeof(float));

cudaMalloc allocates memory in the GPU's Global Memory. The allocated pointer (d_array) is only valid on the device and cannot be directly dereferenced on the host.

cudaMemcpy: Host-Device Data Transfer

// Host -> Device
cudaMemcpy(d_array, h_array, N * sizeof(float), cudaMemcpyHostToDevice);

// Device -> Host
cudaMemcpy(h_result, d_result, N * sizeof(float), cudaMemcpyDeviceToHost);

// Device -> Device
cudaMemcpy(d_dest, d_src, N * sizeof(float), cudaMemcpyDeviceToDevice);

cudaMemcpy is a synchronous function that blocks the host thread until the transfer completes. For asynchronous transfers, use cudaMemcpyAsync with a CUDA Stream.
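
A sketch of the asynchronous pattern: cudaMemcpyAsync only overlaps with other work when the host buffer is pinned (page-locked), so the host side is allocated with cudaMallocHost (buffer names and sizes here are illustrative):

```cuda
float *h_pinned, *d_buf;
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMallocHost(&h_pinned, N * sizeof(float));   // pinned host allocation
cudaMalloc(&d_buf, N * sizeof(float));

// Transfer, kernel, and transfer back are all queued on the same stream.
cudaMemcpyAsync(d_buf, h_pinned, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);
myKernel<<<gridSize, blockSize, 0, stream>>>(d_buf, N);
cudaMemcpyAsync(h_pinned, d_buf, N * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);                  // wait for the whole pipeline

cudaFreeHost(h_pinned);
cudaFree(d_buf);
cudaStreamDestroy(stream);
```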

cudaFree: Device Memory Deallocation

cudaFree(d_array);

Typical CUDA Program Flow

// 1. Allocate host memory and initialize data
float *h_input = (float *)malloc(N * sizeof(float));
float *h_output = (float *)malloc(N * sizeof(float));
initializeData(h_input, N);

// 2. Allocate device memory
float *d_input, *d_output;
cudaMalloc(&d_input, N * sizeof(float));
cudaMalloc(&d_output, N * sizeof(float));

// 3. Transfer data Host -> Device
cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);

// 4. Launch kernel
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(d_input, d_output, N);

// 5. Transfer results Device -> Host
cudaMemcpy(h_output, d_output, N * sizeof(float), cudaMemcpyDeviceToHost);

// 6. Free memory
cudaFree(d_input);
cudaFree(d_output);
free(h_input);
free(h_output);

6.2 Unified Memory

Unified Memory, introduced in CUDA 6.0, allows the host and device to share a single address space. Data movement is managed automatically by the CUDA runtime.

float *data;
cudaMallocManaged(&data, N * sizeof(float));

// Accessible from host
for (int i = 0; i < N; i++) {
    data[i] = (float)i;
}

// Same pointer used on device
myKernel<<<gridSize, blockSize>>>(data, N);
cudaDeviceSynchronize();

// Access results from host (after synchronization)
printf("Result: %f\n", data[0]);

cudaFree(data);  // memory from cudaMallocManaged is freed with cudaFree as well

Advantages of Unified Memory:

  • No cudaMemcpy calls needed, resulting in more concise code
  • Same pointer used on both host and device
  • Runtime handles data migration on-demand
  • Can process data exceeding device memory capacity (oversubscription)

In terms of raw performance, however, explicitly managed programs that overlap kernel execution and data transfer using Streams and cudaMemcpyAsync can outperform programs that rely solely on Unified Memory's on-demand migration. cudaMemPrefetchAsync can be used to narrow this gap by moving data before it is needed.
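
A minimal sketch of prefetching managed memory to the GPU before the kernel runs, and back to the host before the CPU reads the results:

```cuda
int device;
cudaGetDevice(&device);

// Migrate the managed allocation to the GPU ahead of the kernel.
cudaMemPrefetchAsync(data, N * sizeof(float), device, 0);
myKernel<<<gridSize, blockSize>>>(data, N);

// Migrate it back so host reads do not fault page-by-page.
cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId, 0);
cudaDeviceSynchronize();
```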


7. Warp Execution and Warp Divergence

7.1 The Concept of a Warp

A warp is the basic scheduling unit of execution in CUDA. The SM's warp scheduler groups 32 consecutive threads into a warp and issues instructions at the warp level. When the threads of a warp are converged, all 32 execute the same instruction simultaneously (before Volta, a warp shared a single program counter; since Volta each thread has its own PC, but instructions are still issued one warp at a time).

Within a Thread Block, warps are formed in order of thread ID:

  • Warp 0: Threads 0-31
  • Warp 1: Threads 32-63
  • Warp 2: Threads 64-95
  • ... and so on

7.2 Warp Divergence

Warp divergence occurs when threads within a warp follow different paths at conditional branches (if, switch, for, etc.).

__global__ void divergentKernel(int *data, int *result) {
    int tid = threadIdx.x;

    // Warp Divergence occurs!
    if (tid % 2 == 0) {
        result[tid] = data[tid] * 2;      // Even threads
    } else {
        result[tid] = data[tid] + 10;     // Odd threads
    }
}

In the above code, even-numbered and odd-numbered threads within a warp take different branches. The GPU handles this by executing each branch path sequentially, disabling threads that don't belong to the current path. As a result, the execution times of both branches are summed, degrading performance.

Strategies for Minimizing Warp Divergence

// BAD: Branching within the same warp
if (threadIdx.x % 2 == 0) { ... }

// BETTER: Design branching to occur at warp boundaries
if ((threadIdx.x / 32) % 2 == 0) { ... }
// or
if (blockIdx.x % 2 == 0) { ... }

The core principle is to design code so that threads within the same warp follow the same code path. If the branching condition aligns with warp boundaries, each entire warp executes only one path, preventing divergence.


8. Occupancy Concept and Optimization

8.1 What is Occupancy?

Occupancy is the ratio of active warps to the maximum number of warps that can be simultaneously active on an SM.

Occupancy = Active Warps / Maximum Warps per SM

For example, if an SM supports a maximum of 64 warps and 32 warps are actually active, the occupancy is 50%.

8.2 Factors Affecting Occupancy

Occupancy is determined by three resources:

  1. Register usage: More registers per thread means fewer total threads that can reside on the SM. With 64K 32-bit registers per SM, using 128 registers per thread allows only 512 threads (16 warps) on the SM.

  2. Shared Memory usage: More Shared Memory per block means fewer blocks that can be simultaneously placed on the SM.

  3. Block size (number of threads): There is a maximum block count limit per SM (e.g., 32 for Ampere CC 8.0, 16 for CC 8.6), so blocks that are too small may hit this limit and fail to fill the SM's thread capacity.

8.3 Occupancy Optimization Methods

  • Choose block sizes of 128, 256, or 512: Empirically, 256 is a good starting point. Too small (32, 64) hits the per-SM block count limit, while too large (1024) increases resource requirements and may lower occupancy.

  • Control register usage: Use the __launch_bounds__ qualifier or the compiler option -maxrregcount to limit registers per kernel.

__global__ void __launch_bounds__(256, 4)  // maxThreadsPerBlock, minBlocksPerSM
myKernel(float *data) {
    // ...
}
  • Use the CUDA Occupancy Calculator: Use the NVIDIA-provided Occupancy Calculator spreadsheet or the cudaOccupancyMaxPotentialBlockSize() API to automatically determine the optimal block size.
int blockSize;
int minGridSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

Note: High occupancy does not always guarantee high performance. Memory access patterns, instruction-level parallelism (ILP), Shared Memory utilization, and other factors collectively affect performance.


9. Practical Examples

9.1 Vector Addition

The most basic CUDA example: vector addition. It adds corresponding elements of two arrays and stores the result.

#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20;  // Approximately 1 million elements
    size_t size = N * sizeof(float);

    // Allocate host memory
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // Initialize data
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Transfer Host -> Device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Transfer Device -> Host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verification
    for (int i = 0; i < N; i++) {
        if (fabs(h_C[i] - 3.0f) > 1e-5) {
            fprintf(stderr, "Verification failed at index %d!\n", i);
            return -1;
        }
    }
    printf("Vector addition successful!\n");

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

Key Points:

  • Each thread processes one element (1:1 mapping)
  • (N + threadsPerBlock - 1) / threadsPerBlock performs ceiling division to cover all elements
  • Boundary check (if (idx < N)) prevents out-of-bounds access

9.2 Matrix Multiplication

Matrix multiplication is a representative CUDA optimization example. We first look at the naive version, then cover the tiled version using Shared Memory.

Naive Version

__global__ void matMulNaive(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Launch
dim3 blockDim(16, 16);
dim3 gridDim((N + 15) / 16, (N + 15) / 16);
matMulNaive<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

The naive version has each thread computing one element of the result matrix while reading an entire row of matrix A and an entire column of matrix B from Global Memory. This results in O(N^3) Global Memory accesses, which is highly inefficient.

Tiled Version (Using Shared Memory)

#define TILE_SIZE 16

__global__ void matMulTiled(float *A, float *B, float *C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Iterate over tiles
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load tile into Shared Memory
        if (row < N && (t * TILE_SIZE + threadIdx.x) < N)
            tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if ((t * TILE_SIZE + threadIdx.y) < N && col < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();

        // Multiply-add within the tile
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

Key ideas behind tiled matrix multiplication:

  1. Data from Global Memory is divided into TILE_SIZE x TILE_SIZE tiles and loaded into Shared Memory
  2. Reads from Shared Memory are approximately 100x faster than from Global Memory
  3. Multiple threads within the block reuse the same data, greatly reducing Global Memory accesses
  4. __syncthreads() ensures all threads have completed tile loading before computation begins
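
The host-side launch mirrors the naive version, with the block shape fixed to the tile size:

```cuda
dim3 blockDim(TILE_SIZE, TILE_SIZE);   // one thread per element of a tile
dim3 gridDim((N + TILE_SIZE - 1) / TILE_SIZE,
             (N + TILE_SIZE - 1) / TILE_SIZE);
matMulTiled<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
```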

10. NVIDIA GPU Generation Characteristics (Compute Capability)

Compute Capability (CC) is a version number indicating a GPU's hardware features and specifications. The major version represents the architecture generation, and the minor version represents improvements within a generation.

CC   | Architecture | Representative GPU | Key Features
3.x  | Kepler       | GTX 680, K40       | Dynamic Parallelism, Hyper-Q
5.x  | Maxwell      | GTX 980, M40       | Energy efficiency improvements, SM redesign
6.x  | Pascal       | GTX 1080, P100     | HBM2, NVLink, FP16 support
7.0  | Volta        | V100               | 1st gen Tensor Core, Independent Thread Scheduling
7.5  | Turing       | RTX 2080, T4       | RT Core, INT8/INT4 Tensor Core
8.0  | Ampere       | A100               | 3rd gen Tensor Core, TF32, BF16, Sparsity
8.6  | Ampere       | RTX 3090           | Max 48 warps per SM (vs 64 for 8.0)
8.9  | Ada Lovelace | RTX 4090, L40      | 4th gen Tensor Core, FP8, Shader Execution Reordering
9.0  | Hopper       | H100               | Thread Block Cluster, Transformer Engine, DPX, FP8
10.0 | Blackwell    | B200, GB200        | 5th gen Tensor Core, FP4/FP6, 208B transistors, HBM3e

Volta (CC 7.0): Independent Thread Scheduling was introduced, allowing threads within a warp to branch more flexibly. Earlier architectures relied on implicit lock-step synchronization within warps; starting from Volta, explicit synchronization (e.g., __syncwarp()) is required wherever code previously relied on that implicit warp-synchronous behavior.
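
As a sketch of this explicit style, modern warp primitives take a mask of participating threads instead of assuming lock-step execution:

```cuda
// Warp-level sum reduction using explicit-mask shuffle intrinsics.
// 0xffffffff means all 32 lanes of the warp participate.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // after the loop, lane 0 holds the warp's sum
}
```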

Hopper (CC 9.0): The Thread Block Cluster concept was added, enabling multiple Thread Blocks to cooperate through Distributed Shared Memory. The TMA (Tensor Memory Accelerator) unit for asynchronous data movement was also introduced.

Blackwell (CC 10.0): The 5th generation Tensor Core natively supports FP4 and FP6 precision, using micro-tensor formats and dynamic range scaling. AI computation performance is significantly improved over Hopper, delivering up to 20 PFLOPS of AI compute performance.


11. Debugging and Profiling Tools

11.1 cuda-gdb

cuda-gdb is a CUDA extension of GNU GDB that can debug CUDA applications running on actual GPU hardware.

Key features:

  • Setting breakpoints within GPU kernels
  • Querying state at the thread, block, and warp level
  • Inspecting device memory contents
  • Simultaneously debugging host and device code
# Compile with debug information
nvcc -g -G -o myapp myapp.cu

# Start debugging with cuda-gdb
cuda-gdb ./myapp

# Inside cuda-gdb
(cuda-gdb) break myKernel
(cuda-gdb) run
(cuda-gdb) cuda thread       # Current thread info
(cuda-gdb) cuda block        # Current block info
(cuda-gdb) info cuda threads # List all CUDA threads

11.2 Compute Sanitizer

Compute Sanitizer is a suite of tools for checking the functional correctness of CUDA programs, included in the CUDA Toolkit. It provides four sub-tools:

Tool      | Function
memcheck  | Detects memory access errors (out-of-bounds, misaligned)
racecheck | Detects Shared Memory data races
initcheck | Detects uninitialized Global Memory accesses
synccheck | Detects thread synchronization errors (misuse of __syncthreads(), etc.)

# Run memcheck
compute-sanitizer --tool memcheck ./myapp

# Run racecheck
compute-sanitizer --tool racecheck ./myapp

# Run initcheck
compute-sanitizer --tool initcheck ./myapp

11.3 NVIDIA Nsight

NVIDIA Nsight is a suite of tools providing integrated development, debugging, and profiling environments.

  • Nsight Systems: Analyzes system-wide performance. Visually displays CPU-GPU timelines, kernel execution times, memory transfers, and API calls. Used as the first step to identify overall bottlenecks.

  • Nsight Compute: Analyzes detailed performance metrics of individual CUDA kernels. Examines occupancy, memory bandwidth utilization, instruction throughput, warp state, and more in fine detail. Used for optimizing specific kernels.

  • Nsight Visual Studio Edition / VS Code Extension: Provides CUDA debugging and profiling support integrated into the IDE.

# Profile with Nsight Systems
nsys profile --stats=true ./myapp

# Analyze kernels with Nsight Compute
ncu --set full ./myapp

Debugging/Profiling Workflow

  1. Verify functional correctness: Use Compute Sanitizer (memcheck, racecheck) to catch memory errors and race conditions first
  2. System-level analysis: Use Nsight Systems to identify overall bottlenecks (CPU-GPU synchronization, memory transfers, etc.)
  3. Kernel-level optimization: Use Nsight Compute to analyze detailed performance metrics of bottleneck kernels and optimize

12. Error Handling Best Practices

Most CUDA API calls return an error code of type cudaError_t. In production code, errors must always be checked.

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = call;                                             \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
                    __FILE__, __LINE__, cudaGetErrorString(err));            \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage examples
CUDA_CHECK(cudaMalloc(&d_array, size));
CUDA_CHECK(cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice));

// Error checking after kernel launch
myKernel<<<gridSize, blockSize>>>(d_array, N);
CUDA_CHECK(cudaGetLastError());       // Kernel launch error
CUDA_CHECK(cudaDeviceSynchronize());  // Error during kernel execution

Since kernel calls are asynchronous, cudaGetLastError() checks for errors at launch time, and cudaDeviceSynchronize() checks for errors that occurred during kernel execution.


Conclusion

CUDA programming is a core technology for leveraging the massive parallel processing capabilities of GPUs. Here is a summary of what we covered:

  • GPU Architecture: Thousands of lightweight cores operate in the SIMT model, optimized for data parallel processing
  • Thread Hierarchy: Parallel work is organized into a three-level hierarchy of Grid, Block, and Thread
  • Memory Hierarchy: You must understand and appropriately utilize the characteristics of various memory spaces including Register, Shared, and Global
  • Warp and Occupancy: Minimize warp divergence and consider occupancy to optimize execution configuration
  • Debugging Tools: Use Compute Sanitizer and Nsight Systems/Compute for systematic debugging and optimization

CUDA programming goes beyond simply writing kernels; it involves deeply understanding GPU hardware characteristics and optimizing code accordingly. We recommend continuously referencing NVIDIA's official CUDA Programming Guide while applying these concepts to real projects.

