- 1. GPU vs CPU Architecture Differences
- 2. CUDA Programming Model: Grid, Block, Thread Hierarchy
- 3. CUDA Kernel Development and Execution Configuration
- 4. Thread Indexing: threadIdx, blockIdx, blockDim, gridDim
- 5. GPU Memory Types
- 6. Memory Management API
- 7. Warp Execution and Warp Divergence
- 8. Occupancy Concept and Optimization
- 9. Practical Examples
- 10. NVIDIA GPU Generation Characteristics (Compute Capability)
- 11. Debugging and Profiling Tools
- 12. Error Handling Best Practices
- Conclusion
- References
1. GPU vs CPU Architecture Differences
To understand CUDA programming, you must first grasp the fundamental architectural differences between GPUs and CPUs.
1.1 CPU: A Processor Optimized for Sequential Processing
The CPU (Central Processing Unit) is equipped with complex control flow, branch prediction, and large caches, making it optimized for sequential tasks. A typical high-performance CPU has 8 to 64 cores, each capable of independently executing complex instructions at high speed. Most of the CPU's transistors are dedicated to control logic and cache, focusing on maximizing single-thread execution speed.
1.2 GPU: A Processor Optimized for Massive Parallel Processing
In contrast, the GPU (Graphics Processing Unit) is equipped with thousands of small cores, specializing in massively parallel computation. NVIDIA GPUs are organized into units called Streaming Multiprocessors (SMs), each containing dozens to hundreds of CUDA Cores. Most of the GPU's transistors are dedicated to arithmetic logic units (ALUs), enabling simultaneous execution of thousands of threads.
| Characteristic | CPU | GPU |
|---|---|---|
| Number of Cores | 8-64 (high performance) | Thousands to tens of thousands |
| Core Characteristics | Complex and powerful | Simple and lightweight |
| Cache Size | Large (tens of MB) | Relatively small |
| Optimal Tasks | Sequential, complex branches | Massive data parallel processing |
| Memory Bandwidth | Relatively low | Very high (HBM) |
1.3 SIMT Execution Model
NVIDIA GPUs use the SIMT (Single Instruction, Multiple Threads) execution model. SIMT is similar to SIMD (Single Instruction, Multiple Data), but with a key difference. In SIMD, the vector width is exposed to software, whereas SIMT specifies the execution and branching behavior of individual threads. Each thread has its own program counter and register state, and can logically follow independent execution paths.
The core of SIMT is warp-based execution. The GPU groups 32 threads into a single warp and executes the same instruction simultaneously. Maximum performance is achieved when all threads in a warp follow the same code path. When threads take different branches, performance degrades (known as warp divergence, discussed in detail later).
2. CUDA Programming Model: Grid, Block, Thread Hierarchy
The most fundamental concept in the CUDA programming model is the thread hierarchy. When a kernel function is invoked in CUDA, numerous threads are created and executed in parallel, organized into a three-level hierarchy: Grid, Block, and Thread.
2.1 Thread
A thread is the most basic unit of CUDA execution. Each thread executes one instance of the kernel code and has its own registers and local memory. Each thread determines which data it should process through its unique ID.
2.2 Thread Block (Block)
A Thread Block is a group of threads. Threads within the same block share the following characteristics:
- Can share data through Shared Memory
- Can synchronize via __syncthreads()
- Execute on a single SM and do not migrate to another SM during execution
- Can contain a maximum of 1024 threads (may vary by Compute Capability)
Thread Blocks can be organized in 1D, 2D, or 3D, allowing natural indexing for vector, matrix, and volume data.
2.3 Grid
A Grid is a collection of Thread Blocks. A single kernel invocation creates one Grid. Grids can also be organized in 1D, 2D, or 3D. Direct data sharing via Shared Memory between different blocks is not possible, and synchronization is limited (special APIs such as Cooperative Groups must be used).
2.4 Thread Block Cluster (Compute Capability 9.0 and above)
Starting with the NVIDIA Hopper architecture (Compute Capability 9.0), an optional layer called Thread Block Cluster was added. A cluster consists of multiple Thread Blocks, and blocks within the same cluster execute on the same GPC (GPU Processing Cluster), enabling access to each other's Shared Memory through Distributed Shared Memory.
Grid
+-- Thread Block Cluster (optional, CC 9.0+)
    +-- Thread Block (up to 1024 Threads)
        +-- Thread (individual execution unit)
3. CUDA Kernel Development and Execution Configuration
3.1 Kernel Function Definition
In CUDA, functions that execute on the GPU are called kernels. Kernel functions are defined using the __global__ qualifier, and the return type must be void.
__global__ void myKernel(int *data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
data[idx] = data[idx] * 2;
}
}
CUDA has three function qualifiers:
| Qualifier | Execution Location | Call Location |
|---|---|---|
| __global__ | GPU (Device) | CPU (Host) or GPU |
| __device__ | GPU (Device) | GPU (Device) |
| __host__ | CPU (Host) | CPU (Host) |
Using __host__ and __device__ together compiles the function for both host and device.
3.2 Execution Configuration
When calling a kernel, the Grid and Block dimensions are specified using the <<<gridDim, blockDim>>> syntax.
// 1D configuration
int N = 1024;
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, N);
// 2D configuration
dim3 blockDim(16, 16); // 16x16 = 256 threads per block
dim3 gridDim(64, 64); // 64x64 blocks
matMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
// Shared Memory size and Stream specification
myKernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args);
The full form of the <<<>>> syntax is <<<gridDim, blockDim, sharedMemBytes, stream>>>. The third argument is the size in bytes of dynamically allocated Shared Memory, and the fourth is the CUDA Stream. If omitted, they default to 0 and the default stream, respectively.
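Inside the kernel, the dynamically allocated Shared Memory from the third argument is accessed through an unsized extern __shared__ array. A minimal sketch (kernel and variable names here are illustrative, not from the examples above):

```cuda
// Dynamic Shared Memory: the size is supplied at launch time, not compile time.
__global__ void dynSharedKernel(float *data, int n) {
    extern __shared__ float tile[];   // sized by the 3rd <<<>>> argument
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) tile[threadIdx.x] = data[idx];
    __syncthreads();
    // ... operate on tile ...
}

// Host side: request blockDim.x floats of Shared Memory per block
// dynSharedKernel<<<gridSize, blockSize, blockSize * sizeof(float)>>>(d_data, n);
```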
Important: There is a hardware limit on the number of threads per block. On all current NVIDIA GPUs (Compute Capability 2.0 and later), the maximum is 1024 threads per block; exceeding this value causes the kernel launch to fail.
4. Thread Indexing: threadIdx, blockIdx, blockDim, gridDim
Inside a kernel, each thread uses built-in variables to determine its position. These variables are of type uint3 or dim3, with .x, .y, .z members.
4.1 Built-in Variables
| Variable | Description |
|---|---|
| threadIdx | Thread index within the block (starting from 0) |
| blockIdx | Block index within the grid (starting from 0) |
| blockDim | Block dimensions (number of threads in the block) |
| gridDim | Grid dimensions (number of blocks in the grid) |
| warpSize | Warp size (currently always 32) |
4.2 Global Thread ID Computation
For a 1D grid with 1D blocks:
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
For a 2D grid with 2D blocks:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int globalIdx = row * width + col;
The same pattern extends to 3D:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
Boundary checks are essential. When the data size is not an exact multiple of the block size, some threads in the last block may fall outside the valid range, so boundary checking must always be performed:
if (globalIdx < N) {
// Perform computation only for valid indices
output[globalIdx] = input[globalIdx] * 2;
}
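Putting the 2D indexing pattern and boundary check together, a complete kernel that scales every element of a width x height image might look like this (the kernel name and launch values are illustrative):

```cuda
__global__ void scale2D(float *img, int width, int height, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {   // guard BOTH dimensions
        img[row * width + col] *= factor;
    }
}

// Launch: ceiling division in each dimension covers the whole image
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale2D<<<grid, block>>>(d_img, width, height, 2.0f);
```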
5. GPU Memory Types
CUDA GPUs provide multiple memory spaces, each with different access speeds, sizes, and visibility (scope). Choosing the right memory is the key to CUDA program optimization.
5.1 Register
- Location: On-chip (inside the SM)
- Visibility: Private to each thread
- Speed: Fastest (1 cycle latency)
- Size: 64K 32-bit registers per SM, up to 255 per thread
Local variables within a kernel are allocated to registers by default. High register usage reduces the number of threads that can simultaneously reside on an SM, decreasing occupancy.
5.2 Local Memory
- Location: Off-chip (Device Memory, same physical location as Global Memory)
- Visibility: Private to each thread
- Speed: Same as Global Memory (slow, hundreds of cycles)
- Purpose: Local variables that don't fit in registers (register spill), large arrays
Despite the name "Local," it is actually located off-chip, so access speed is slow. The compiler automatically spills to Local Memory when registers are insufficient.
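A common trigger for Local Memory is a per-thread array indexed with a value the compiler cannot resolve at compile time, which prevents it from being kept in registers. A sketch of this situation (assumed example, not from a real workload):

```cuda
__global__ void spillExample(const int *in, int *out, int n) {
    // A sizable per-thread array with a runtime-dependent index typically
    // lands in Local Memory (off-chip) rather than registers.
    float scratch[64];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        for (int i = 0; i < 64; i++)
            scratch[i] = in[idx] * (float)i;
        out[idx] = (int)scratch[in[idx] % 64];  // runtime index: not register-friendly
    }
}
```

Compiling with nvcc -Xptxas -v prints per-kernel register and local memory ("lmem") usage, which makes such spills visible.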
5.3 Shared Memory
- Location: On-chip (inside the SM, shares physical space with L1 Cache)
- Visibility: All threads within the same Thread Block
- Speed: Close to registers (~5 cycles without bank conflicts)
- Size: Typically 48KB-164KB per SM (varies by architecture)
__global__ void sharedMemExample(float *data) {
__shared__ float sharedData[256]; // Static allocation
int tid = threadIdx.x;
sharedData[tid] = data[blockIdx.x * blockDim.x + tid];
__syncthreads(); // Synchronize all threads in the block
// Operations using sharedData
float result = sharedData[tid] + sharedData[255 - tid];
data[blockIdx.x * blockDim.x + tid] = result;
}
Shared Memory is divided into banks (typically 32), and when multiple threads access the same bank simultaneously, a bank conflict occurs, causing serialization. Designing access patterns that avoid bank conflicts is important.
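A classic fix is padding a 2D Shared Memory array by one element per row, so that column-wise accesses no longer map to the same bank. The widely used matrix transpose pattern illustrates this (a sketch for a square n x n matrix):

```cuda
#define TILE 32

__global__ void transposeNoConflict(float *out, const float *in, int n) {
    // The +1 padding shifts each row by one bank, so reading a COLUMN of
    // `tile` touches 32 different banks instead of hitting one bank 32 times.
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;   // swap block indices for the write
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```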
5.4 Global Memory
- Location: Off-chip (Device DRAM, i.e., HBM or GDDR)
- Visibility: All threads + host
- Speed: Slowest (hundreds of cycles latency)
- Size: Largest (several GB to tens of GB)
Memory allocated with cudaMalloc() is Global Memory. Access speed can be improved through L1/L2 caches, and bandwidth should be maximized through coalesced access (adjacent threads accessing contiguous memory addresses).
5.5 Constant Memory
- Location: Off-chip (Global Memory region), cached via dedicated cache
- Visibility: All threads (read-only)
- Speed: Very fast on cache hit
- Size: 64KB
__constant__ float constData[256];
// Set values from host
cudaMemcpyToSymbol(constData, hostData, sizeof(float) * 256);
Delivers best performance when all threads in a warp read the same address. Through a broadcast mechanism, a single memory read delivers the value to all threads in the warp.
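Inside a kernel, a __constant__ array is read like a normal array; when every thread in a warp reads the same element, the broadcast path applies. An illustrative kernel (the coeffs array and polyEval name are examples, set from the host with cudaMemcpyToSymbol as shown above):

```cuda
__constant__ float coeffs[3];   // e.g., polynomial coefficients

__global__ void polyEval(const float *x, float *y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Every thread in the warp reads the same coeffs[i]
        // -> served by a single broadcast from the constant cache
        y[idx] = coeffs[0] + coeffs[1] * x[idx] + coeffs[2] * x[idx] * x[idx];
    }
}
```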
5.6 Texture Memory
- Location: Off-chip, cached via dedicated Texture Cache
- Visibility: All threads (read-only)
- Features: Optimized for 2D spatial locality, hardware interpolation support
Texture Memory is advantageous for access patterns involving spatially adjacent data in 2D/3D datasets such as image processing. In modern CUDA, it is used alongside Surface Objects.
5.7 Memory Hierarchy Summary
Fast <--------------------------------------------> Slow
Register > Shared Memory > L1/L2 Cache > Constant/Texture Cache > Global Memory
(On-chip) (On-chip) (On-chip) (Cached) (Off-chip)
6. Memory Management API
6.1 Explicit Memory Management
The traditional CUDA memory management approach explicitly separates and manages host and device memory.
cudaMalloc: Device Memory Allocation
float *d_array;
cudaMalloc((void **)&d_array, N * sizeof(float));
cudaMalloc allocates memory in the GPU's Global Memory. The allocated pointer (d_array) is only valid on the device and cannot be directly dereferenced on the host.
cudaMemcpy: Host-Device Data Transfer
// Host -> Device
cudaMemcpy(d_array, h_array, N * sizeof(float), cudaMemcpyHostToDevice);
// Device -> Host
cudaMemcpy(h_result, d_result, N * sizeof(float), cudaMemcpyDeviceToHost);
// Device -> Device
cudaMemcpy(d_dest, d_src, N * sizeof(float), cudaMemcpyDeviceToDevice);
cudaMemcpy is a synchronous function that blocks the host thread until the transfer completes. For asynchronous transfers, use cudaMemcpyAsync with a CUDA Stream.
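A sketch of the asynchronous pattern: with pinned (page-locked) host memory and a stream, the copy call returns immediately, and operations issued to the same stream execute in order (myKernel and N are carried over from the earlier examples):

```cuda
float *h_pinned, *d_data;
cudaStream_t stream;
cudaStreamCreate(&stream);

// Pinned host memory is required for truly asynchronous transfers
cudaMallocHost(&h_pinned, N * sizeof(float));
cudaMalloc(&d_data, N * sizeof(float));

cudaMemcpyAsync(d_data, h_pinned, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);
myKernel<<<gridSize, blockSize, 0, stream>>>(d_data, N);  // same stream: ordered
cudaMemcpyAsync(h_pinned, d_data, N * sizeof(float),
                cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);  // block the host until this stream is done
cudaStreamDestroy(stream);
cudaFreeHost(h_pinned);
cudaFree(d_data);
```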
cudaFree: Device Memory Deallocation
cudaFree(d_array);
Typical CUDA Program Flow
// 1. Allocate host memory and initialize data
float *h_input = (float *)malloc(N * sizeof(float));
float *h_output = (float *)malloc(N * sizeof(float));
initializeData(h_input, N);
// 2. Allocate device memory
float *d_input, *d_output;
cudaMalloc(&d_input, N * sizeof(float));
cudaMalloc(&d_output, N * sizeof(float));
// 3. Transfer data Host -> Device
cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);
// 4. Launch kernel
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(d_input, d_output, N);
// 5. Transfer results Device -> Host
cudaMemcpy(h_output, d_output, N * sizeof(float), cudaMemcpyDeviceToHost);
// 6. Free memory
cudaFree(d_input);
cudaFree(d_output);
free(h_input);
free(h_output);
6.2 Unified Memory
Unified Memory, introduced in CUDA 6.0, allows the host and device to share a single address space. Data movement is managed automatically by the CUDA runtime.
float *data;
cudaMallocManaged(&data, N * sizeof(float));
// Accessible from host
for (int i = 0; i < N; i++) {
data[i] = (float)i;
}
// Same pointer used on device
myKernel<<<gridSize, blockSize>>>(data, N);
cudaDeviceSynchronize();
// Access results from host (after synchronization)
printf("Result: %f\n", data[0]);
cudaFree(data); // cudaMallocManaged also freed with cudaFree
Advantages of Unified Memory:
- No cudaMemcpy calls needed, resulting in more concise code
- Same pointer used on both host and device
- Runtime handles data migration on-demand
- Can process data exceeding device memory capacity (oversubscription)
However, in terms of performance, programs optimized to overlap kernel execution and data transfer using Streams and cudaMemcpyAsync can achieve higher performance than programs using only Unified Memory. cudaMemPrefetchAsync can be used to improve Unified Memory performance.
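A sketch of prefetching managed memory before a kernel launch, so the pages migrate in bulk instead of through on-demand page faults (device 0 and the default stream are assumed; myKernel and N carry over from above):

```cuda
float *data;
cudaMallocManaged(&data, N * sizeof(float));
for (int i = 0; i < N; i++) data[i] = (float)i;   // pages touched on host first

int device = 0;
cudaGetDevice(&device);
// Migrate the pages to the GPU ahead of the kernel launch
cudaMemPrefetchAsync(data, N * sizeof(float), device, 0);
myKernel<<<gridSize, blockSize>>>(data, N);

// Prefetch back to the host before CPU access
cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId, 0);
cudaDeviceSynchronize();
cudaFree(data);
```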
7. Warp Execution and Warp Divergence
7.1 The Concept of a Warp
A warp is the basic scheduling unit of execution in CUDA. The SM's warp scheduler groups threads into groups of 32 to form a warp and issues instructions at the warp level. The 32 threads in the same warp share the same program counter (PC) and execute the same instruction simultaneously.
Within a Thread Block, warps are formed in order of thread ID:
- Warp 0: Threads 0-31
- Warp 1: Threads 32-63
- Warp 2: Threads 64-95
- ... and so on
7.2 Warp Divergence
Warp divergence occurs when threads within a warp follow different paths at conditional branches (if, switch, for, etc.).
__global__ void divergentKernel(int *data, int *result) {
int tid = threadIdx.x;
// Warp Divergence occurs!
if (tid % 2 == 0) {
result[tid] = data[tid] * 2; // Even threads
} else {
result[tid] = data[tid] + 10; // Odd threads
}
}
In the above code, even-numbered and odd-numbered threads within a warp take different branches. The GPU handles this by executing each branch path sequentially, disabling threads that don't belong to the current path. As a result, the execution times of both branches are summed, degrading performance.
Strategies for Minimizing Warp Divergence
// BAD: Branching within the same warp
if (threadIdx.x % 2 == 0) { ... }
// BETTER: Design branching to occur at warp boundaries
if (threadIdx.x / 32 % 2 == 0) { ... }
// or
if (blockIdx.x % 2 == 0) { ... }
The core principle is to design code so that threads within the same warp follow the same code path. If the branching condition aligns with warp boundaries, each entire warp executes only one path, preventing divergence.
8. Occupancy Concept and Optimization
8.1 What is Occupancy?
Occupancy is the ratio of active warps to the maximum number of warps that can be simultaneously active on an SM.
Occupancy = Active Warps / Maximum Warps per SM
For example, if an SM supports a maximum of 64 warps and 32 warps are actually active, the occupancy is 50%.
8.2 Factors Affecting Occupancy
Occupancy is determined by three resources:
Register usage: More registers per thread means fewer total threads that can reside on the SM. With 64K 32-bit registers per SM, using 128 registers per thread allows only 512 threads (16 warps) on the SM.
Shared Memory usage: More Shared Memory per block means fewer blocks that can be simultaneously placed on the SM.
Block size (number of threads): There is a maximum block count limit per SM (e.g., 32 for Ampere CC 8.0, 16 for CC 8.6), so blocks that are too small may hit this limit and fail to fill the SM's thread capacity.
8.3 Occupancy Optimization Methods
Choose block sizes of 128, 256, or 512: Empirically, 256 is a good starting point. Too small (32, 64) hits the per-SM block count limit, while too large (1024) increases resource requirements and may lower occupancy.
Control register usage: Use the __launch_bounds__ qualifier or the compiler option -maxrregcount to limit registers per kernel.
__global__ void __launch_bounds__(256, 4) // maxThreadsPerBlock, minBlocksPerSM
myKernel(float *data) {
// ...
}
Use the CUDA Occupancy Calculator: Use the NVIDIA-provided Occupancy Calculator spreadsheet or the cudaOccupancyMaxPotentialBlockSize() API to automatically determine the optimal block size.
int blockSize;
int minGridSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
Note: High occupancy does not always guarantee high performance. Memory access patterns, instruction-level parallelism (ILP), Shared Memory utilization, and other factors collectively affect performance.
9. Practical Examples
9.1 Vector Addition
The most basic CUDA example: vector addition. It adds corresponding elements of two arrays and stores the result.
#include <stdio.h>
#include <math.h>   // for fabs() used in verification
#include <cuda_runtime.h>
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
C[idx] = A[idx] + B[idx];
}
}
int main() {
int N = 1 << 20; // Approximately 1 million elements
size_t size = N * sizeof(float);
// Allocate host memory
float *h_A = (float *)malloc(size);
float *h_B = (float *)malloc(size);
float *h_C = (float *)malloc(size);
// Initialize data
for (int i = 0; i < N; i++) {
h_A[i] = 1.0f;
h_B[i] = 2.0f;
}
// Allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
// Transfer Host -> Device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Launch kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Transfer Device -> Host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Verification
for (int i = 0; i < N; i++) {
if (fabs(h_C[i] - 3.0f) > 1e-5) {
fprintf(stderr, "Verification failed at index %d!\n", i);
return -1;
}
}
printf("Vector addition successful!\n");
// Cleanup
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);
return 0;
}
Key Points:
- Each thread processes one element (1:1 mapping)
- (N + threadsPerBlock - 1) / threadsPerBlock performs ceiling division to cover all elements
- Boundary check (if (idx < N)) prevents out-of-bounds access
9.2 Matrix Multiplication
Matrix multiplication is a representative CUDA optimization example. We first look at the naive version, then cover the tiled version using Shared Memory.
Naive Version
__global__ void matMulNaive(float *A, float *B, float *C, int N) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < N && col < N) {
float sum = 0.0f;
for (int k = 0; k < N; k++) {
sum += A[row * N + k] * B[k * N + col];
}
C[row * N + col] = sum;
}
}
// Launch
dim3 blockDim(16, 16);
dim3 gridDim((N + 15) / 16, (N + 15) / 16);
matMulNaive<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
The naive version has each thread computing one element of the result matrix while reading an entire row of matrix A and an entire column of matrix B from Global Memory. This results in O(N^3) Global Memory accesses, which is highly inefficient.
Tiled Version (Using Shared Memory)
#define TILE_SIZE 16
__global__ void matMulTiled(float *A, float *B, float *C, int N) {
__shared__ float tileA[TILE_SIZE][TILE_SIZE];
__shared__ float tileB[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f;
// Iterate over tiles
for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
// Load tile into Shared Memory
if (row < N && (t * TILE_SIZE + threadIdx.x) < N)
tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
else
tileA[threadIdx.y][threadIdx.x] = 0.0f;
if ((t * TILE_SIZE + threadIdx.y) < N && col < N)
tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
else
tileB[threadIdx.y][threadIdx.x] = 0.0f;
__syncthreads();
// Multiply-add within the tile
for (int k = 0; k < TILE_SIZE; k++) {
sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
}
__syncthreads();
}
if (row < N && col < N) {
C[row * N + col] = sum;
}
}
Key ideas behind tiled matrix multiplication:
- Data from Global Memory is divided into TILE_SIZE x TILE_SIZE tiles and loaded into Shared Memory
- Reads from Shared Memory are approximately 100x faster than from Global Memory
- Multiple threads within the block reuse the same data, greatly reducing Global Memory accesses
- __syncthreads() ensures all threads have completed tile loading before computation begins
10. NVIDIA GPU Generation Characteristics (Compute Capability)
Compute Capability (CC) is a version number indicating a GPU's hardware features and specifications. The major version represents the architecture generation, and the minor version represents improvements within a generation.
| CC | Architecture | Representative GPU | Key Features |
|---|---|---|---|
| 3.x | Kepler | GTX 680, K40 | Dynamic Parallelism, Hyper-Q |
| 5.x | Maxwell | GTX 980, M40 | Energy efficiency improvements, SM redesign |
| 6.x | Pascal | GTX 1080, P100 | HBM2, NVLink, FP16 support |
| 7.0 | Volta | V100 | 1st gen Tensor Core, Independent Thread Scheduling |
| 7.5 | Turing | RTX 2080, T4 | RT Core, INT8/INT4 Tensor Core |
| 8.0 | Ampere | A100 | 3rd gen Tensor Core, TF32, BF16, Sparsity |
| 8.6 | Ampere | RTX 3090 | Max 48 Warps per SM (vs 64 for 8.0) |
| 8.9 | Ada Lovelace | RTX 4090, L40 | 4th gen Tensor Core, FP8, Shader Execution Reordering |
| 9.0 | Hopper | H100 | Thread Block Cluster, Transformer Engine, DPX, FP8 |
| 10.0 | Blackwell | B200, GB200 | 5th gen Tensor Core, FP4/FP6, 208B transistors, HBM3e |
Key Programming-Related Changes by Architecture
Volta (CC 7.0): Independent Thread Scheduling was introduced, allowing threads within a warp to branch more flexibly. Earlier architectures executed warps in implicit lock-step, and code often relied on that implicit synchronization; from Volta onward, such code must synchronize explicitly with __syncwarp().
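For example, an intra-warp shared-memory reduction that was implicitly safe pre-Volta now needs an explicit __syncwarp() between steps. A sketch (assuming a block of 64 threads, i.e., two warps, each reducing its own 32 values):

```cuda
__global__ void warpSumSmem(float *data) {
    __shared__ float s[64];
    int tid = threadIdx.x;              // assume blockDim.x == 64
    s[tid] = data[tid];
    // Intra-warp tree reduction: pre-Volta this often "worked" without
    // synchronization; from Volta on, each step needs __syncwarp().
    for (int offset = 16; offset > 0; offset /= 2) {
        __syncwarp();                   // make prior writes visible to the warp
        if (tid % 32 < offset)
            s[tid] += s[tid + offset];
    }
    if (tid % 32 == 0)
        data[tid / 32] = s[tid];        // lane 0 of each warp writes its sum
}
```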
Hopper (CC 9.0): The Thread Block Cluster concept was added, enabling multiple Thread Blocks to cooperate through Distributed Shared Memory. The TMA (Tensor Memory Accelerator) unit for asynchronous data movement was also introduced.
Blackwell (CC 10.0): The 5th generation Tensor Core natively supports FP4 and FP6 precision, using micro-tensor formats and dynamic range scaling. AI computation performance is significantly improved over Hopper, delivering up to 20 PFLOPS of AI compute performance.
11. Debugging and Profiling Tools
11.1 cuda-gdb
cuda-gdb is a CUDA extension of GNU GDB that can debug CUDA applications running on actual GPU hardware.
Key features:
- Setting breakpoints within GPU kernels
- Querying state at the thread, block, and warp level
- Inspecting device memory contents
- Simultaneously debugging host and device code
# Compile with debug information
nvcc -g -G -o myapp myapp.cu
# Start debugging with cuda-gdb
cuda-gdb ./myapp
# Inside cuda-gdb
(cuda-gdb) break myKernel
(cuda-gdb) run
(cuda-gdb) cuda thread # Current thread info
(cuda-gdb) cuda block # Current block info
(cuda-gdb) info cuda threads # List all CUDA threads
11.2 Compute Sanitizer
Compute Sanitizer is a suite of tools for checking the functional correctness of CUDA programs, included in the CUDA Toolkit. It provides four sub-tools:
| Tool | Function |
|---|---|
| memcheck | Detects memory access errors (out-of-bounds, misaligned) |
| racecheck | Detects Shared Memory data races |
| initcheck | Detects uninitialized Global Memory accesses |
| synccheck | Detects thread synchronization errors (misuse of __syncthreads(), etc.) |
# Run memcheck
compute-sanitizer --tool memcheck ./myapp
# Run racecheck
compute-sanitizer --tool racecheck ./myapp
# Run initcheck
compute-sanitizer --tool initcheck ./myapp
11.3 NVIDIA Nsight
NVIDIA Nsight is a suite of tools providing integrated development, debugging, and profiling environments.
Nsight Systems: Analyzes system-wide performance. Visually displays CPU-GPU timelines, kernel execution times, memory transfers, and API calls. Used as the first step to identify overall bottlenecks.
Nsight Compute: Analyzes detailed performance metrics of individual CUDA kernels. Examines occupancy, memory bandwidth utilization, instruction throughput, warp state, and more in fine detail. Used for optimizing specific kernels.
Nsight Visual Studio Edition / VS Code Extension: Provides CUDA debugging and profiling support integrated into the IDE.
# Profile with Nsight Systems
nsys profile --stats=true ./myapp
# Analyze kernels with Nsight Compute
ncu --set full ./myapp
Debugging/Profiling Workflow
- Verify functional correctness: Use Compute Sanitizer (memcheck, racecheck) to catch memory errors and race conditions first
- System-level analysis: Use Nsight Systems to identify overall bottlenecks (CPU-GPU synchronization, memory transfers, etc.)
- Kernel-level optimization: Use Nsight Compute to analyze detailed performance metrics of bottleneck kernels and optimize
12. Error Handling Best Practices
Most CUDA API calls return an error code of type cudaError_t. In production code, errors must always be checked.
#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
// Usage examples
CUDA_CHECK(cudaMalloc(&d_array, size));
CUDA_CHECK(cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice));
// Error checking after kernel launch
myKernel<<<gridSize, blockSize>>>(d_array, N);
CUDA_CHECK(cudaGetLastError()); // Kernel launch error
CUDA_CHECK(cudaDeviceSynchronize()); // Error during kernel execution
Since kernel calls are asynchronous, cudaGetLastError() checks for errors at launch time, and cudaDeviceSynchronize() checks for errors that occurred during kernel execution.
Conclusion
CUDA programming is a core technology for leveraging the massive parallel processing capabilities of GPUs. Here is a summary of what we covered:
- GPU Architecture: Thousands of lightweight cores operate in the SIMT model, optimized for data parallel processing
- Thread Hierarchy: Parallel work is organized into a three-level hierarchy of Grid, Block, and Thread
- Memory Hierarchy: You must understand and appropriately utilize the characteristics of various memory spaces including Register, Shared, and Global
- Warp and Occupancy: Minimize warp divergence and consider occupancy to optimize execution configuration
- Debugging Tools: Use Compute Sanitizer and Nsight Systems/Compute for systematic debugging and optimization
CUDA programming goes beyond simply writing kernels; it involves deeply understanding GPU hardware characteristics and optimizing code accordingly. We recommend continuously referencing NVIDIA's official CUDA Programming Guide while applying these concepts to real projects.
References
- CUDA Programming Guide - NVIDIA Official Documentation
- CUDA C++ Programming Guide (Legacy)
- CUDA Programming Guide - Programming Model
- CUDA Programming Guide - Writing CUDA SIMT Kernels
- CUDA Programming Guide - Advanced Kernel Programming
- CUDA Programming Guide - Unified and System Memory
- CUDA Programming Guide - Unified Memory
- CUDA Programming Guide - Compute Capabilities
- CUDA Runtime API - Memory Management
- CUDA GPU Compute Capability - NVIDIA Developer
- CUDA Refresher: The CUDA Programming Model - NVIDIA Technical Blog
- Using Shared Memory in CUDA C/C++ - NVIDIA Technical Blog
- Using CUDA Warp-Level Primitives - NVIDIA Technical Blog
- CUDA-GDB - NVIDIA Developer
- Compute Sanitizer - NVIDIA Documentation
- Nsight Developer Tools - NVIDIA Developer
- NVIDIA Blackwell Tuning Guide
- NVIDIA Ampere GPU Architecture Tuning Guide
- NVIDIA Ada GPU Architecture Tuning Guide
- CUDA Samples - Matrix Multiplication (GitHub)