- 1. Introduction: What is CUDA
- 2. GPU Architecture Fundamentals
- 3. CUDA Development Environment Setup
- 4. CUDA Programming Basics (C/C++)
- 5. CUDA Memory Management In-Depth
- 6. CUDA Streams and Asynchronous Execution
- 7. CUDA Optimization Techniques
- 8. CUDA in Python
- 9. Hands-on Examples
- 10. CUDA Tools and Libraries
- 11. CUDA Troubleshooting
- 12. CUDA Alternative Technologies Comparison
- 13. References
- Conclusion
1. Introduction: What is CUDA
1.1 Definition of CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables general-purpose computing (GPGPU) by leveraging thousands of GPU cores. Since its initial release in 2007, it has become the de facto standard in various fields including AI, scientific simulation, image processing, and financial modeling.
As of 2025, the latest CUDA Toolkit is version 13.1, which introduces CUDA Tile, a tile-based programming model, and cuTile Python DSL, providing Tensor Core abstraction and forward compatibility for next-generation Blackwell GPUs.
1.2 History of GPU Computing and the Emergence of CUDA
The evolution of GPU computing can be summarized by era as follows.
| Period | Event | Significance |
|---|---|---|
| 2001 | GPGPU research begins | General-purpose computing attempts using shaders |
| 2006 | NVIDIA Tesla architecture announced | First unified shader architecture |
| 2007 | CUDA 1.0 released | Birth of GPU general-purpose computing |
| 2012 | AlexNet (ImageNet) | Dawn of GPU deep learning era |
| 2017 | Volta + Tensor Core | Mixed precision computation acceleration |
| 2020 | Ampere (A100) | TF32, Sparsity support |
| 2022 | Hopper (H100) | Transformer Engine, FP8 support |
| 2024 | Blackwell (B200) | FP4, 5th gen Tensor Core |
| 2025 | CUDA 13.0/13.1 released | CUDA Tile, cuTile DSL introduction |
1.3 CPU vs GPU Architecture Comparison
CPUs and GPUs have fundamentally different design philosophies.
| Attribute | CPU | GPU |
|---|---|---|
| Design goal | Low latency (serial processing) | High throughput (parallel processing) |
| Core count | A few to dozens | Thousands to tens of thousands |
| Clock speed | High (4-6 GHz) | Relatively low (1-2 GHz) |
| Cache size | Large (tens of MB) | Small (per core) |
| Control logic | Complex (branch prediction, OoO execution) | Simple (latency hiding through many threads) |
| Parallelism model | SIMD (Single Instruction Multiple Data) | SIMT (Single Instruction Multiple Threads) |
| Optimal workload | Complex branching, sequential logic | Large-scale data-parallel computation |
Key difference between SIMD and SIMT: SIMD processes multiple data elements simultaneously with a single vector instruction. SIMT goes a step further: each thread has its own program counter and registers while a group of threads executes the same instruction in lockstep. When a branch occurs, threads within a Warp can follow different paths, but this causes Warp Divergence, where the two paths are executed serially and performance degrades.
1.4 Why CUDA is Essential for AI/ML/Deep Learning
The reasons CUDA is central to modern AI/ML workloads are clear.
- Matrix computation acceleration: Dramatically accelerates matrix multiplication (GEMM), the core of deep learning, with Tensor Cores
- Software ecosystem: Optimized library ecosystem including cuDNN, cuBLAS, NCCL, TensorRT
- Framework support: All major frameworks (PyTorch, TensorFlow, JAX) use CUDA as their default backend
- Mixed precision: Maximizes training/inference speed with low-precision computation down to FP16, BF16, FP8, and FP4
- Multi-GPU scaling: Distributed training across hundreds of GPUs via NVLink and NVSwitch
2. GPU Architecture Fundamentals
2.1 NVIDIA GPU Internal Structure
Let us examine the core components of an NVIDIA GPU hierarchically.
GPU (organized into GPCs - Graphics Processing Clusters)
├── SM (Streaming Multiprocessor) x N
│ ├── CUDA Core (FP32/INT32) x 128 (Hopper)
│ ├── Tensor Core x 4 (4th gen, Hopper)
│ ├── RT Core (Ray Tracing) x 1 (consumer GPUs; absent on data center parts like H100)
│ ├── Warp Scheduler x 4
│ ├── Register File (256 KB)
│ ├── L1 Cache / Shared Memory (shared, up to 228 KB)
│ └── SFU (Special Function Unit)
├── L2 Cache (shared)
├── Memory Controller
└── HBM (High Bandwidth Memory)
Key core types:
- CUDA Core: The basic processing unit for general-purpose floating-point/integer operations. Handles FP32, FP64, INT32 operations
- Tensor Core: A core specialized for Matrix Multiply-Accumulate (MMA) operations. The key accelerator for deep learning training/inference
- RT Core: Dedicated hardware for Ray Tracing acceleration. Primarily used for graphics workloads
2.2 Memory Hierarchy
GPU memory is organized hierarchically by speed and size.
| Memory Type | Location | Size (H100) | Bandwidth | Access Scope | Characteristics |
|---|---|---|---|---|---|
| Register | Inside SM | 256 KB/SM | Fastest | Thread-private | Fastest but limited |
| Shared Memory | Inside SM | Up to 228 KB/SM | Very fast | Shared within Block | Programmer-managed cache |
| L1 Cache | Inside SM | Shared with Shared | Very fast | SM-private | HW-managed |
| L2 Cache | GPU-wide | 50 MB | Fast | Shared across all SMs | HW-managed |
| Global Memory (HBM) | On-package (off-chip) | 80 GB | 3.35 TB/s | Globally accessible | Largest but highest latency |
| Constant Memory | GPU-wide | 64 KB | Fast when cached | Read-only | Broadcast-optimized |
| Texture Memory | GPU-wide | Shared with Global | Fast when cached | Read-only | 2D spatial locality optimized |
Speed: Register > Shared/L1 > L2 > Global (HBM)
Size: Global (HBM) > L2 > Shared/L1 > Register
2.3 Warp, Block, Grid Concepts
CUDA's execution model has a 3-level hierarchical structure.
Grid (kernel execution unit)
├── Block (0,0) Block (1,0) Block (2,0)
│ ├── Warp 0 ├── Warp 0 ├── Warp 0
│ │ ├── Thread 0 │ ├── Thread 0 │ ├── Thread 0
│ │ ├── Thread 1 │ ├── Thread 1 │ ├── Thread 1
│ │ ├── ... │ ├── ... │ ├── ...
│ │ └── Thread 31 │ └── Thread 31 │ └── Thread 31
│ ├── Warp 1 ├── Warp 1 ├── Warp 1
│ └── ... └── ... └── ...
├── Block (0,1) Block (1,1) ...
└── ...
- Thread: The smallest execution unit of GPU computation
- Warp: An execution group of 32 Threads. The unit that executes the same instruction simultaneously on an SM (the core of SIMT)
- Block (Thread Block): Composed of multiple Warps. Executes on the same SM and shares Shared Memory
- Grid: The collection of Blocks that make up the entire kernel execution
Key constraints:
| Item | Limit (Compute Capability 9.0) |
|---|---|
| Max Threads per Block | 1,024 |
| Warp size | 32 (fixed) |
| Max Block dimensions | (1024, 1024, 64) |
| Max Grid dimensions | (2^31-1, 65535, 65535) |
| Max Blocks per SM | 32 |
| Max Warps per SM | 64 |
2.4 Compute Capability Version Differences
Compute Capability (CC) defines the feature set of GPU hardware.
| CC | Architecture | Representative GPU | Key Features |
|---|---|---|---|
| 7.0 | Volta | V100 | 1st gen Tensor Core, independent thread scheduling |
| 7.5 | Turing | RTX 2080 | INT8/INT4 Tensor Core, RT Core |
| 8.0 | Ampere | A100 | 3rd gen Tensor Core, TF32, Sparsity |
| 8.6 | Ampere | RTX 3090 | Consumer Ampere |
| 8.9 | Ada Lovelace | RTX 4090 | 4th gen Tensor Core, FP8, DLSS 3 |
| 9.0 | Hopper | H100 | 4th gen Tensor Core, Transformer Engine, DPX |
| 10.0 | Blackwell | B200 | 5th gen Tensor Core, FP4, TMEM |
| 12.0 | Blackwell Ultra | B300 | Enhanced 5th gen Tensor Core, 288 GB HBM3E |
Note that starting with CUDA 13.0, support for Maxwell (CC 5.x), Pascal (CC 6.x), and Volta (CC 7.0) has been removed.
2.5 Latest GPU Generation Comparison
Comparing three generations of data center GPUs.
| Spec | A100 (SXM) | H100 (SXM5) | B200 (SXM) | B300 (SXM) |
|---|---|---|---|---|
| Architecture | Ampere | Hopper | Blackwell | Blackwell Ultra |
| CUDA Cores | 6,912 | 16,896 | 18,432 | 18,432+ |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) | 5th gen | 5th gen |
| Memory | 80 GB HBM2e | 80 GB HBM3 | 180 GB HBM3E | 288 GB HBM3E |
| Memory BW | 2.0 TB/s | 3.35 TB/s | 7.7 TB/s | 8.0 TB/s |
| FP32 Perf | 19.5 TFLOPS | 60 TFLOPS | Undisclosed | Undisclosed |
| FP16 Tensor | 312 TFLOPS | 990 TFLOPS | Undisclosed | Undisclosed |
| FP4 Tensor | Not supported | Not supported | 9.0 PFLOPS | 14.0 PFLOPS |
| NVLink | 3rd gen (600 GB/s) | 4th gen (900 GB/s) | 5th gen (1.8 TB/s) | 5th gen (1.8 TB/s) |
| TDP | 400W | 700W | 1,000W | 1,400W |
| Cooling | Air/Liquid | Air/Liquid | Liquid recommended | Liquid required (DLC) |
The B200 achieves 3x training performance and 15x inference performance over the A100. The B300 (Blackwell Ultra) provides 14 PFLOPS in FP4 operations, 55.6% faster than the B200.
3. CUDA Development Environment Setup
3.1 CUDA Toolkit Installation
Linux (Ubuntu/Debian)
# 1. Check NVIDIA driver
nvidia-smi
# 2. Add CUDA Keyring (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# 3. Install CUDA Toolkit
sudo apt-get install cuda-toolkit-13-1
# 4. Set environment variables
export PATH=/usr/local/cuda-13.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.1/lib64:$LD_LIBRARY_PATH
# 5. Verify installation
nvcc --version
Windows
# 1. Download CUDA Toolkit from the NVIDIA official site
# https://developer.nvidia.com/cuda-downloads
# 2. Check environment variables after installation
echo %CUDA_PATH%
# Typically C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1
# 3. Verify
nvcc --version
nvidia-smi
3.2 nvidia-smi Command Usage
nvidia-smi is the essential tool for monitoring GPU status.
# Basic info output
nvidia-smi
# Output example:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 13.1 |
# |-----------------------------------------+------------------------+----------------------|
# | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
# | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
# |=========================================+========================+======================|
# | 0 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
# | N/A 32C P0 72W / 700W | 1234MiB / 81559MiB | 0% Default |
# +-----------------------------------------+------------------------+----------------------+
# Continuous monitoring (1 second interval)
nvidia-smi -l 1
# dmon - GPU metrics monitoring
nvidia-smi dmon -s pucvmet -d 1
# Specific GPU info
nvidia-smi -i 0 -q
# GPU process monitoring
nvidia-smi pmon -i 0
# GPU clock and memory info
nvidia-smi -q -d CLOCK,MEMORY
# CSV format output (for scripting)
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
--format=csv,noheader,nounits
3.3 nvcc Compiler Usage
nvcc is the NVIDIA CUDA Compiler for compiling CUDA source code.
# Basic compilation
nvcc -o hello hello.cu
# Target specific architecture
nvcc -arch=sm_90 -o kernel kernel.cu # Hopper H100
nvcc -arch=sm_80 -o kernel kernel.cu # Ampere A100
# Debug mode
nvcc -g -G -o debug_kernel kernel.cu
# Optimization level
nvcc -O3 -o optimized kernel.cu
# Generate PTX code (intermediate representation)
nvcc -ptx kernel.cu
# Support multiple architectures simultaneously (Fat Binary)
nvcc -gencode arch=compute_80,code=sm_80 \
-gencode arch=compute_90,code=sm_90 \
-o multi_arch kernel.cu
# Library linking
nvcc -lcublas -lcurand -o linked_kernel kernel.cu
# Detailed compilation info
nvcc --resource-usage -o kernel kernel.cu
3.4 CUDA Version Check and Compatibility
There are two types of CUDA versions, and it is important not to confuse them.
# 1. Driver-supported CUDA version (Driver API)
nvidia-smi
# "CUDA Version: 13.1" shown in upper right
# This is the maximum CUDA runtime version the driver supports
# 2. CUDA Toolkit version (Runtime API)
nvcc --version
# Shows "release 13.1"
# The actually installed CUDA Toolkit version
# 3. Check at runtime
python3 -c "import torch; print(torch.version.cuda)"
Key compatibility rules:
- The driver CUDA version must be greater than or equal to the Toolkit CUDA version
- Example: If the driver supports CUDA 13.1, you can also use CUDA 12.x Toolkit (backward compatible)
- According to CUDA Toolkit's Minor Version Compatibility policy, binary compatibility is maintained within the same Major version
3.5 Using CUDA in Docker
The NVIDIA Container Toolkit allows you to leverage GPUs in Docker containers.
# 1. Install NVIDIA Container Toolkit (Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# 2. Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 3. Run GPU container
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi
# Use specific GPUs only
docker run --rm --gpus '"device=0,1"' nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi
# GPU usage in docker compose
# docker-compose.yml:
services:
gpu-app:
image: nvidia/cuda:13.1.0-devel-ubuntu22.04
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
command: nvidia-smi
3.6 Installing PyTorch/TensorFlow CUDA via conda/pip
# PyTorch (CUDA 12.4 build example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Using conda
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
# TensorFlow (auto GPU detection)
pip install tensorflow[and-cuda]
# Check CUDA availability (Python)
python3 -c "
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA version: {torch.version.cuda}')
print(f'GPU count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
print(f'GPU name: {torch.cuda.get_device_name(0)}')
"
4. CUDA Programming Basics (C/C++)
4.1 Host (CPU) vs Device (GPU) Code
In CUDA programming, CPU-side code is called Host and GPU-side code is called Device.
┌─────────────────────────────────────────────────────────┐
│ Host (CPU) │
│ ┌─────────────────────────────────────────────────────┐│
│ │ 1. Prepare data (Host Memory) ││
│ │ 2. Allocate GPU memory (cudaMalloc) ││
│ │ 3. Transfer data: Host -> Device (cudaMemcpy) ││
│ │ 4. Launch kernel (<<<grid, block>>>) ││
│ │ 5. Transfer results: Device -> Host (cudaMemcpy) ││
│ │ 6. Free GPU memory (cudaFree) ││
│ └─────────────────────────────────────────────────────┘│
│ ↕ PCIe / NVLink │
│ ┌─────────────────────────────────────────────────────┐│
│ │ Device (GPU) ││
│ │ - Parallel execution of kernel functions ││
│ │ - Thousands of threads processing simultaneously ││
│ └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘
4.2 Function Qualifiers
CUDA provides three function qualifiers.
| Qualifier | Execution Location | Callable From | Description |
|---|---|---|---|
| `__global__` | Device (GPU) | Host (CPU) | Kernel function. Return type must be void |
| `__device__` | Device (GPU) | Device (GPU) | Helper function callable only within GPU |
| `__host__` | Host (CPU) | Host (CPU) | Normal CPU function (default, can be omitted) |
| `__host__ __device__` | Both | Both | Usable on both CPU and GPU |
// __global__: Kernel function - called from Host, executed on Device
__global__ void myKernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
data[idx] *= 2.0f;
}
}
// __device__: Device-only helper function
__device__ float square(float x) {
return x * x;
}
// __host__ __device__: CPU/GPU compatible function
__host__ __device__ float add(float a, float b) {
return a + b;
}
4.3 Kernel Launch Syntax
Kernel functions are launched using special triple angle bracket syntax.
// Kernel launch syntax
// kernel<<<gridDim, blockDim, sharedMemSize, stream>>>(args...);
//
// gridDim: Number of Blocks in Grid (dim3)
// blockDim: Number of Threads in Block (dim3)
// sharedMemSize: Dynamic Shared Memory size (bytes, optional)
// stream: Execution stream (optional)
// 1D example: 256 threads, 1 block
myKernel<<<1, 256>>>(d_data, n);
// 1D example: Process N data elements with 256-thread blocks
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
// 2D example: Image processing
dim3 blockDim(16, 16); // 16x16 = 256 threads/block
dim3 gridDim(
(width + 15) / 16,
(height + 15) / 16
);
imageKernel<<<gridDim, blockDim>>>(d_image, width, height);
4.4 Thread Hierarchy
Each thread can determine its position through built-in variables.
// Built-in Variables
threadIdx.x, threadIdx.y, threadIdx.z // Thread index within Block
blockIdx.x, blockIdx.y, blockIdx.z // Block index within Grid
blockDim.x, blockDim.y, blockDim.z // Block dimension sizes
gridDim.x, gridDim.y, gridDim.z // Grid dimension sizes
// Global Thread ID calculation (1D)
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
// Global Thread ID calculation (2D)
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int globalIdx = row * width + col;
// Total thread count (for Grid-stride loop)
int totalThreads = gridDim.x * blockDim.x;
4.5 Memory Management Functions
// GPU memory allocation
float* d_data;
cudaMalloc((void**)&d_data, n * sizeof(float));
// Host -> Device copy
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
// Device -> Host copy
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
// GPU memory free
cudaFree(d_data);
// GPU memory initialization
cudaMemset(d_data, 0, n * sizeof(float));
// Error check macro (essential!)
#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA Error at %s:%d - %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while(0)
// Usage example
CUDA_CHECK(cudaMalloc((void**)&d_data, n * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice));
4.6 Hello World Example
// hello_cuda.cu
#include <stdio.h>
__global__ void helloKernel() {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
printf("Hello from GPU thread %d (block %d, thread %d)\n",
tid, blockIdx.x, threadIdx.x);
}
int main() {
// 2 blocks, 4 threads each = 8 threads total
helloKernel<<<2, 4>>>();
// Wait for GPU work to complete
cudaDeviceSynchronize();
printf("Hello from CPU!\n");
return 0;
}
# Compile and run
nvcc -o hello hello_cuda.cu
./hello
4.7 Vector Addition Example
Vector addition is often called the "Hello World" of CUDA computation; a complete, verified example follows.
// vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA Error: %s at %s:%d\n", \
cudaGetErrorString(err), __FILE__, __LINE__); \
exit(EXIT_FAILURE); \
} \
} while(0)
// GPU kernel: vector addition
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
C[idx] = A[idx] + B[idx];
}
}
int main() {
const int N = 1 << 20; // 1M elements
size_t bytes = N * sizeof(float);
// Host memory allocation and initialization
float *h_A = (float*)malloc(bytes);
float *h_B = (float*)malloc(bytes);
float *h_C = (float*)malloc(bytes);
for (int i = 0; i < N; i++) {
h_A[i] = (float)i;
h_B[i] = (float)(i * 2);
}
// Device memory allocation
float *d_A, *d_B, *d_C;
CUDA_CHECK(cudaMalloc(&d_A, bytes));
CUDA_CHECK(cudaMalloc(&d_B, bytes));
CUDA_CHECK(cudaMalloc(&d_C, bytes));
// Host -> Device copy
CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));
// Kernel launch
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Error check
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
// Device -> Host copy
CUDA_CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));
// Result verification
for (int i = 0; i < N; i++) {
if (fabs(h_C[i] - (h_A[i] + h_B[i])) > 1e-5) {
fprintf(stderr, "Verification failed at index %d!\n", i);
exit(EXIT_FAILURE);
}
}
printf("Vector addition of %d elements: SUCCESS\n", N);
// Memory cleanup
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);
return 0;
}
4.8 Matrix Multiplication Example
// matmul.cu
#include <stdio.h>
#include <cuda_runtime.h>
#define TILE_SIZE 16
// Basic matrix multiplication kernel
__global__ void matMulBasic(const float* A, const float* B, float* C,
int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < N) {
float sum = 0.0f;
for (int k = 0; k < K; k++) {
sum += A[row * K + k] * B[k * N + col];
}
C[row * N + col] = sum;
}
}
// Tiled matrix multiplication using Shared Memory (optimized version)
__global__ void matMulTiled(const float* A, const float* B, float* C,
int M, int N, int K) {
__shared__ float tileA[TILE_SIZE][TILE_SIZE];
__shared__ float tileB[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f;
// Iterate over tiles
for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
// Load tiles into Shared Memory
if (row < M && (t * TILE_SIZE + threadIdx.x) < K)
tileA[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
else
tileA[threadIdx.y][threadIdx.x] = 0.0f;
if ((t * TILE_SIZE + threadIdx.y) < K && col < N)
tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
else
tileB[threadIdx.y][threadIdx.x] = 0.0f;
__syncthreads(); // Wait for tile load to complete
// Multiply-accumulate within tile
for (int k = 0; k < TILE_SIZE; k++) {
sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
}
__syncthreads(); // Synchronize before loading next tile
}
if (row < M && col < N) {
C[row * N + col] = sum;
}
}
int main() {
const int M = 1024, N = 1024, K = 1024;
// ... (memory allocation, initialization, data transfer omitted)
dim3 blockDim(TILE_SIZE, TILE_SIZE);
dim3 gridDim((N + TILE_SIZE - 1) / TILE_SIZE,
(M + TILE_SIZE - 1) / TILE_SIZE);
matMulTiled<<<gridDim, blockDim>>>(d_A, d_B, d_C, M, N, K);
// ... (result copy, verification, memory cleanup omitted)
return 0;
}
5. CUDA Memory Management In-Depth
5.1 Global Memory and Coalesced Access
Coalesced Access is critical to performance when accessing Global Memory. When 32 threads in a Warp access contiguous memory addresses, the GPU combines them into a single transaction.
// Coalesced Access (good pattern)
// Threads in a Warp access contiguous memory
__global__ void coalesced(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
data[idx] = data[idx] * 2.0f; // Contiguous access
}
}
// Non-coalesced Access (bad pattern)
// Threads access non-contiguously with a stride
__global__ void strided(float* data, int n, int stride) {
int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
if (idx < n) {
data[idx] = data[idx] * 2.0f; // Strided access
}
}
// AoS (Array of Structures) vs SoA (Structure of Arrays)
// AoS: Non-coalesced (bad)
struct ParticleAoS {
float x, y, z;
float vx, vy, vz;
};
// Thread 0: particle[0].x, Thread 1: particle[1].x -> strided access
// SoA: Coalesced (good)
struct ParticlesSoA {
float* x; float* y; float* z;
float* vx; float* vy; float* vz;
};
// Thread 0: x[0], Thread 1: x[1] -> contiguous access
5.2 Shared Memory and Bank Conflict
Shared Memory is high-speed memory shared within a thread block on an SM. It is divided into 32 banks; when multiple threads in a Warp simultaneously access different addresses that fall in the same bank, a Bank Conflict occurs and the accesses are serialized (reading the same address is broadcast and causes no conflict).
// Shared Memory usage example
__global__ void sharedMemExample(float* input, float* output, int n) {
// Static allocation
__shared__ float sharedData[256];
int tid = threadIdx.x;
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// Load from Global -> Shared
if (gid < n) {
sharedData[tid] = input[gid];
}
__syncthreads(); // Wait until all threads have loaded
// Compute on Shared Memory (neighbor element sum example)
if (tid > 0 && tid < blockDim.x - 1 && gid < n) {
output[gid] = sharedData[tid - 1] + sharedData[tid] + sharedData[tid + 1];
}
}
// Bank Conflict avoidance: padding technique
// Problem: When accessing columns of a 32x32 matrix, all threads map to the same bank
__shared__ float tile[32][32]; // Bank Conflict!
__shared__ float tile[32][32 + 1]; // Padding avoids Bank Conflict
5.3 Unified Memory
Unified Memory is an abstraction layer that allows both CPU and GPU to access memory through the same pointer.
// Unified Memory usage
float* data;
cudaMallocManaged(&data, N * sizeof(float));
// Initialize on CPU
for (int i = 0; i < N; i++) {
data[i] = (float)i;
}
// GPU kernel execution - no separate cudaMemcpy needed!
myKernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();
// Access results directly from CPU
printf("Result: %f\n", data[0]);
// Free
cudaFree(data);
// Memory prefetch hints (performance optimization)
cudaMemPrefetchAsync(data, N * sizeof(float), deviceId); // Prefetch to GPU
cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId); // Prefetch to CPU
5.4 Pinned Memory
Pinned Memory (Page-locked Memory) is host memory excluded from OS page swapping, which increases data transfer speed between GPU and host.
// Pinned Memory allocation
float* h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float)); // or cudaHostAlloc
// Transfer speed comparison vs regular memory
// Pageable: ~12 GB/s (PCIe Gen4)
// Pinned: ~25 GB/s (PCIe Gen4) - approximately 2x faster
// Required for asynchronous transfers
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
// Free
cudaFreeHost(h_pinned);
5.5 Memory Pool
Memory Pool, introduced in CUDA 11.2, reduces overhead from repetitive memory allocation/deallocation.
// Stream-ordered Memory Allocator (CUDA 11.2+)
float* d_data;
cudaMallocAsync(&d_data, bytes, stream);
// Asynchronous free after use
cudaFreeAsync(d_data, stream);
// Memory Pool configuration
cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, deviceId);
// Set pool size limit
uint64_t threshold = 1ULL << 30; // 1 GB
cudaMemPoolSetAttribute(mempool, cudaMemPoolAttrReleaseThreshold, &threshold);
5.6 Memory Bandwidth Optimization Summary
| Technique | Effect | Implementation Difficulty |
|---|---|---|
| Coalesced Access | Maximize Global Memory throughput | Low |
| SoA layout | Guarantee Coalesced Access | Medium |
| Shared Memory Tiling | Reduce Global Memory access count | Medium |
| Bank Conflict avoidance (padding) | Maximize Shared Memory throughput | Low |
| Pinned Memory | 2x faster Host-Device transfer | Low |
| Unified Memory + Prefetch | Programming convenience + performance | Low |
| Memory Pool | Eliminate allocation/deallocation overhead | Low |
| Texture Memory | Leverage 2D spatial locality | Medium |
6. CUDA Streams and Asynchronous Execution
6.1 CUDA Stream Concept
A CUDA Stream is a sequence of commands executed in order on the GPU. Operations on different Streams can execute concurrently.
Default Stream (Stream 0):
[MemcpyH2D] -> [Kernel A] -> [MemcpyD2H]
(sequential, no pipeline)
Multi-Stream:
Stream 1: [MemcpyH2D_1] -> [Kernel_1] -> [MemcpyD2H_1]
Stream 2: [MemcpyH2D_2] -> [Kernel_2] -> [MemcpyD2H_2]
Stream 3: [MemcpyH2D_3] -> [Kernel_3] -> [MemcpyD2H_3]
(overlapped execution, maximize GPU utilization)
6.2 Stream Creation and Usage
// Stream creation
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Asynchronous memory transfer (Pinned Memory required!)
cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream2);
// Kernel execution (on specific Stream)
kernelA<<<grid, block, 0, stream1>>>(d_A);
kernelB<<<grid, block, 0, stream2>>>(d_B);
// Asynchronous result copy
cudaMemcpyAsync(h_A, d_A, bytes, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_B, d_B, bytes, cudaMemcpyDeviceToHost, stream2);
// Stream synchronization
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
// Stream cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
6.3 Multi-Stream Pipeline Pattern
const int NUM_STREAMS = 4;
const int CHUNK_SIZE = N / NUM_STREAMS;
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
cudaStreamCreate(&streams[i]);
}
// Pinned Memory allocation
float *h_in, *h_out;
cudaMallocHost(&h_in, N * sizeof(float));
cudaMallocHost(&h_out, N * sizeof(float));
float *d_in, *d_out;
cudaMalloc(&d_in, N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));
// Pipeline execution
for (int i = 0; i < NUM_STREAMS; i++) {
int offset = i * CHUNK_SIZE;
size_t chunkBytes = CHUNK_SIZE * sizeof(float);
// 1. Host -> Device (async)
cudaMemcpyAsync(d_in + offset, h_in + offset,
chunkBytes, cudaMemcpyHostToDevice, streams[i]);
// 2. Kernel execution
int blocks = (CHUNK_SIZE + 255) / 256;
processKernel<<<blocks, 256, 0, streams[i]>>>(
d_in + offset, d_out + offset, CHUNK_SIZE);
// 3. Device -> Host (async)
cudaMemcpyAsync(h_out + offset, d_out + offset,
chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}
// Wait for all to complete
cudaDeviceSynchronize();
6.4 CUDA Events (Timing Measurement)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Start timing
cudaEventRecord(start);
// Kernel execution
myKernel<<<grid, block>>>(d_data, N);
// Stop timing
cudaEventRecord(stop);
cudaEventSynchronize(stop);
// Calculate elapsed time
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %.3f ms\n", milliseconds);
// Bandwidth calculation
float bandwidth = (bytes_read + bytes_written) / (milliseconds * 1e6);
printf("Effective bandwidth: %.2f GB/s\n", bandwidth);
cudaEventDestroy(start);
cudaEventDestroy(stop);
6.5 Synchronization Function Comparison
| Function | Scope | Purpose |
|---|---|---|
| `cudaDeviceSynchronize()` | Entire Device | Wait for all Streams to complete |
| `cudaStreamSynchronize(stream)` | Specific Stream | Wait for that Stream to complete |
| `cudaEventSynchronize(event)` | Specific Event | Wait for that Event to complete |
| `cudaStreamWaitEvent(stream, event)` | Between Streams | Stream waits for Event before proceeding |
| `__syncthreads()` | Within Block | Synchronize threads within Block (in kernel) |
7. CUDA Optimization Techniques
7.1 Occupancy Optimization
Occupancy is the ratio of actual active Warps to the maximum supported Warps on an SM. High Occupancy does not always mean peak performance, but it is critical for hiding memory latency.
// Occupancy calculation API
int blockSize;
int minGridSize;
// Automatically calculate optimal block size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
printf("Optimal block size: %d\n", blockSize);
printf("Minimum grid size: %d\n", minGridSize);
// Check Occupancy for a specific block size
int maxActiveBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&maxActiveBlocks, myKernel, blockSize, 0);
printf("Max active blocks per SM: %d\n", maxActiveBlocks);
Block size selection guidelines:
| Principle | Description |
|---|---|
| Multiple of 32 | Align with Warp size to avoid thread waste |
| 128 ~ 512 recommended | Generally optimal range |
| Consider registers/Shared Memory | Reduce block size if resource usage is high |
| Use `cudaOccupancyMaxPotentialBlockSize` | Automatic optimal calculation |
7.2 Minimizing Warp Divergence
When threads within a Warp take different branches, the divergent paths execute serially, idling the threads on the other path.
// Bad example: Warp Divergence
__global__ void divergentKernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
// Branches based on even/odd threadIdx.x
// -> Only half execute at a time within same Warp
if (threadIdx.x % 2 == 0) {
data[idx] = data[idx] * 2.0f;
} else {
data[idx] = data[idx] + 1.0f;
}
}
}
// Good example: Branch at Warp granularity
__global__ void convergentKernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int warpId = threadIdx.x / 32; // Branch by Warp ID
if (idx < n) {
// All threads in the same Warp take the same path
if (warpId % 2 == 0) {
data[idx] = data[idx] * 2.0f;
} else {
data[idx] = data[idx] + 1.0f;
}
}
}
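The cost difference between the two kernels comes down to how many warps contain threads on both sides of the branch. This CPU model (a hypothetical helper, not a CUDA API) counts divergent warps for each branching strategy:

```python
WARP_SIZE = 32

def divergent_warps(n_threads: int, takes_branch) -> int:
    """Count warps whose threads do not all take the same branch."""
    count = 0
    for start in range(0, n_threads, WARP_SIZE):
        decisions = {takes_branch(t) for t in range(start, min(start + WARP_SIZE, n_threads))}
        count += len(decisions) > 1   # mixed decisions within a warp => divergence
    return count

n = 1024  # 32 warps
print(divergent_warps(n, lambda t: t % 2 == 0))            # 32: every warp diverges
print(divergent_warps(n, lambda t: (t // 32) % 2 == 0))    # 0: branch uniform per warp
```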
7.3 Shared Memory Tiling (Matrix Multiplication Optimization)
Tiling is a key technique to reduce Global Memory access count. We already saw the implementation in section 4.8's matMulTiled, and the performance difference is summarized below.
| Method | Global Memory Accesses (MxNxK matrix) | Relative Performance |
|---|---|---|
| Naive | 2 * M * N * K | 1x |
| Tiled (Shared Memory) | 2 * M * N * K / TILE_SIZE | ~TILE_SIZE x |
With TILE_SIZE of 16, Global Memory accesses are reduced by 16x.
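The table's counts can be sanity-checked with simple arithmetic: each tile loaded from Global Memory is reused by TILE_SIZE threads, dividing the naive access count by TILE_SIZE. A sketch:

```python
def global_accesses(M: int, N: int, K: int, tile: int = 1) -> int:
    # tile=1 models the naive kernel; tile=T shares each load among T threads
    return 2 * M * N * K // tile

naive = global_accesses(1024, 1024, 1024)
tiled = global_accesses(1024, 1024, 1024, tile=16)
print(f"Reduction in global loads: {naive // tiled}x")  # 16x
```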
7.4 Loop Unrolling
// Manual Loop Unrolling
__global__ void unrolled(float* data, int n) {
int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;
// 4x unrolling: each thread handles 4 elements spaced blockDim.x apart
// (the original hardcoded offsets 256/512/768 assumed blockDim.x == 256)
if (idx < n) data[idx] *= 2.0f;
if (idx + blockDim.x < n) data[idx + blockDim.x] *= 2.0f;
if (idx + 2 * blockDim.x < n) data[idx + 2 * blockDim.x] *= 2.0f;
if (idx + 3 * blockDim.x < n) data[idx + 3 * blockDim.x] *= 2.0f;
}
// Compiler directive-based Unrolling
__global__ void pragmaUnrolled(float* A, float* B, float* C, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
#pragma unroll 8
for (int k = 0; k < N; k++) {
sum += A[idx * N + k] * B[k];
}
C[idx] = sum;
}
7.5 Grid-Stride Loop Pattern
A flexible pattern used when data size exceeds total thread count.
__global__ void gridStrideKernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
// Each thread processes multiple elements, striding by total thread count
for (int i = idx; i < n; i += stride) {
data[i] = data[i] * 2.0f;
}
}
// Advantages:
// 1. Works regardless of data size
// 2. Block/grid sizes can be freely adjusted
// 3. Increased work per thread reduces kernel launch overhead
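The same pattern can be modeled on the CPU to confirm that every element is processed exactly once no matter how few "threads" are launched (a sketch of the indexing logic, not real device code):

```python
def grid_stride_process(data, grid_dim: int, block_dim: int):
    """CPU model of the grid-stride loop: thread idx handles idx, idx+stride, ..."""
    stride = grid_dim * block_dim
    for idx in range(stride):                 # one pass per simulated thread
        for i in range(idx, len(data), stride):
            data[i] *= 2.0
    return data

out = grid_stride_process([1.0] * 1000, grid_dim=4, block_dim=64)  # only 256 "threads"
assert out == [2.0] * 1000   # every element touched exactly once
```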
7.6 NVIDIA Nsight Profiling
# Nsight Compute (kernel-level profiling)
ncu --set full -o profile_report ./my_cuda_app
# Profile specific kernel only
ncu --kernel-name myKernel --launch-skip 0 --launch-count 1 ./my_cuda_app
# Nsight Systems (system-level profiling)
nsys profile --trace=cuda,nvtx -o timeline_report ./my_cuda_app
# Key metrics to check
# - SM Throughput: SM utilization
# - Memory Throughput: Memory bandwidth utilization
# - Achieved Occupancy: Actual Occupancy
# - Warp Stall Reasons: Warp stall causes
# - L1/L2 Hit Rate: Cache hit rate
Profiling checklist:
| Metric | Target | Action if problematic |
|---|---|---|
| Achieved Occupancy | 50% or higher | Adjust block size, register usage |
| Memory Throughput | 60%+ of theoretical BW | Improve Coalesced Access, caching |
| Compute Throughput | 60%+ of theoretical ops | Improve ILP, remove unnecessary ops |
| Warp Divergence | Minimize | Restructure branching logic |
8. CUDA in Python
8.1 PyCUDA Basics
PyCUDA allows you to write and execute CUDA C kernels directly from Python.
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
# Define CUDA kernel (write C code as a string)
mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
dest[idx] = a[idx] * b[idx];
}
}
""")
multiply = mod.get_function("multiply")
# Prepare data
n = 1024 * 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
dest = np.zeros_like(a)
# Execute kernel
block_size = 256
grid_size = (n + block_size - 1) // block_size
multiply(
cuda.Out(dest), cuda.In(a), cuda.In(b), np.int32(n),
block=(block_size, 1, 1), grid=(grid_size, 1)
)
# Verify results
assert np.allclose(dest, a * b)
print("PyCUDA multiply: SUCCESS")
8.2 Numba CUDA JIT
Numba is the easiest way to JIT-compile Python functions into CUDA kernels.
from numba import cuda
import numpy as np
import math
# CUDA kernel definition
@cuda.jit
def vector_add(a, b, c):
idx = cuda.grid(1) # 1D global index
if idx < a.shape[0]:
c[idx] = a[idx] + b[idx]
# Prepare data
n = 1_000_000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)
# Transfer to Device
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)
# Execute kernel
threads_per_block = 256
blocks_per_grid = math.ceil(n / threads_per_block)
vector_add[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
# Get results
result = d_c.copy_to_host()
assert np.allclose(result, a + b)
print("Numba CUDA vector_add: SUCCESS")
# 2D kernel example: matrix multiplication
@cuda.jit
def matmul_kernel(A, B, C):
row, col = cuda.grid(2) # 2D global index
if row < C.shape[0] and col < C.shape[1]:
tmp = 0.0
for k in range(A.shape[1]):
tmp += A[row, k] * B[k, col]
C[row, col] = tmp
# Using Shared Memory
@cuda.jit
def matmul_shared(A, B, C):
TILE = 16
sA = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)
sB = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
row = cuda.blockIdx.y * TILE + ty
col = cuda.blockIdx.x * TILE + tx
tmp = 0.0
for t in range((A.shape[1] + TILE - 1) // TILE):
if row < A.shape[0] and (t * TILE + tx) < A.shape[1]:
sA[ty, tx] = A[row, t * TILE + tx]
else:
sA[ty, tx] = 0.0
if (t * TILE + ty) < B.shape[0] and col < B.shape[1]:
sB[ty, tx] = B[t * TILE + ty, col]
else:
sB[ty, tx] = 0.0
cuda.syncthreads()
for k in range(TILE):
tmp += sA[ty, k] * sB[k, tx]
cuda.syncthreads()
if row < C.shape[0] and col < C.shape[1]:
C[row, col] = tmp
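Kernels like matmul_shared are easiest to validate against a CPU reference that walks the K dimension in the same tile-sized steps. A NumPy sketch (the tile loop mirrors the kernel's phase loop; this is a verification aid, not GPU code):

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 16) -> np.ndarray:
    """CPU mirror of matmul_shared: accumulate C tile-by-tile along the K axis."""
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
    for t in range(0, A.shape[1], tile):      # same traversal as the kernel's tile loop
        C += A[:, t:t + tile] @ B[t:t + tile, :]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 70), dtype=np.float32)
B = rng.standard_normal((70, 30), dtype=np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```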
8.3 CuPy (NumPy Replacement)
CuPy is a library that provides the same API as NumPy but runs on the GPU.
import cupy as cp
import numpy as np
import time
# Same API as NumPy
a_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)
b_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)
# GPU matrix multiplication
start = time.time()
c_gpu = cp.dot(a_gpu, b_gpu)
cp.cuda.Stream.null.synchronize() # Wait for GPU work to complete
gpu_time = time.time() - start
# CPU comparison
a_cpu = cp.asnumpy(a_gpu)
b_cpu = cp.asnumpy(b_gpu)
start = time.time()
c_cpu = np.dot(a_cpu, b_cpu)
cpu_time = time.time() - start
print(f"GPU: {gpu_time:.4f}s, CPU: {cpu_time:.4f}s")
print(f"Speedup: {cpu_time / gpu_time:.1f}x")
# Custom CUDA kernel in CuPy
custom_kernel = cp.RawKernel(r'''
extern "C" __global__
void relu_kernel(const float* input, float* output, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
output[idx] = fmaxf(input[idx], 0.0f);
}
}
''', 'relu_kernel')
n = 1_000_000
x = cp.random.randn(n, dtype=cp.float32)
y = cp.zeros(n, dtype=cp.float32)
block_size = 256
grid_size = (n + block_size - 1) // block_size
custom_kernel((grid_size,), (block_size,), (x, y, n))
# Verification
assert cp.allclose(y, cp.maximum(x, 0))
8.4 PyTorch CUDA Operations
Here are practical patterns for leveraging CUDA in PyTorch.
import torch
import torch.nn as nn
# ===== Basic GPU Usage =====
# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")
# Move tensors to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Method 1: .to(device)
x = torch.randn(1000, 1000)
x_gpu = x.to(device)
# Method 2: .cuda()
x_gpu = x.cuda()
# Method 3: Create directly on GPU
x_gpu = torch.randn(1000, 1000, device='cuda')
# Specify a particular GPU
x_gpu1 = x.to('cuda:1') # Second GPU
# ===== Memory Management =====
# Check memory usage
print(torch.cuda.memory_summary())
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
# Clear cache
torch.cuda.empty_cache()
# Memory snapshot (for debugging)
torch.cuda.memory._record_memory_history()
# ... run code ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
# ===== Mixed Precision Training =====
from torch.amp import autocast, GradScaler
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler('cuda')
for data, target in dataloader:
data, target = data.cuda(), target.cuda()
optimizer.zero_grad()
# Automatic FP16 conversion with autocast
with autocast('cuda'):
output = model(data)
loss = nn.functional.mse_loss(output, target)
# FP16 gradient scaling with GradScaler
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
# Benefits of Mixed Precision:
# - ~50% reduction in memory usage
# - 1.5-3x faster training (leveraging Tensor Cores)
# - Minimal accuracy loss
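The ~50% memory figure follows directly from halving the bytes stored per element (4 bytes for FP32, 2 for FP16/BF16); the element count below is illustrative.

```python
def tensor_bytes(n_elements: int, bytes_per_element: int) -> int:
    return n_elements * bytes_per_element

n = 256 * 1024 * 1024                 # illustrative activation/parameter count
fp32 = tensor_bytes(n, 4)             # float32: 4 bytes/element
fp16 = tensor_bytes(n, 2)             # float16/bfloat16: 2 bytes/element
print(f"FP32: {fp32 / 2**30:.1f} GiB, FP16: {fp16 / 2**30:.1f} GiB, "
      f"saved: {100 * (fp32 - fp16) / fp32:.0f}%")
```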
8.5 TensorFlow GPU Configuration
import tensorflow as tf
# Check GPUs
print("GPUs available:", tf.config.list_physical_devices('GPU'))
# Incremental GPU memory allocation (recommended)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
# Or set memory limit
tf.config.set_logical_device_configuration(
gpus[0],
[tf.config.LogicalDeviceConfiguration(memory_limit=8192)] # 8GB
)
# Use specific GPU
with tf.device('/GPU:0'):
a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
c = tf.matmul(a, b)
# Mixed Precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')
8.6 RAPIDS (cuDF, cuML) Introduction
RAPIDS is a collection of GPU-accelerated data science libraries provided by NVIDIA. As of 2025, the latest version is 26.02, and cuML offers 5-175x faster performance compared to scikit-learn.
# cuDF: GPU DataFrame replacement for pandas
import cudf
# Same API as pandas
df = cudf.read_csv('large_data.csv')
result = df.groupby('category').agg({'value': 'mean', 'count': 'sum'})
filtered = df[df['value'] > 100]
# Interconversion with pandas
import pandas as pd
pdf = df.to_pandas() # cuDF -> pandas
gdf = cudf.from_pandas(pdf) # pandas -> cuDF
# cuML: GPU ML replacement for scikit-learn
import cuml
from cuml.ensemble import RandomForestClassifier
from cuml.cluster import KMeans
# Same API as scikit-learn
rf = RandomForestClassifier(n_estimators=100, max_depth=16)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
# Zero-code-change acceleration (cuml.accel)
# Run scikit-learn code on GPU without modification
import cuml.accel
cuml.accel.install()
from sklearn.ensemble import RandomForestClassifier # Automatically GPU-accelerated!
9. Hands-on Examples
9.1 Example 1: GPU Vector Operation Benchmark (Numba)
A practical example comparing CPU and GPU vector operation performance.
from numba import cuda, njit
import numpy as np
import time
import math
# CPU version (Numba JIT)
@njit
def vector_ops_cpu(a, b, c):
for i in range(a.shape[0]):
c[i] = math.sqrt(a[i] ** 2 + b[i] ** 2) * math.sin(a[i]) + math.log(abs(b[i]) + 1)
# GPU version (CUDA)
@cuda.jit
def vector_ops_gpu(a, b, c):
idx = cuda.grid(1)
if idx < a.shape[0]:
c[idx] = math.sqrt(a[idx] ** 2 + b[idx] ** 2) * math.sin(a[idx]) + math.log(abs(b[idx]) + 1)
# Benchmark
sizes = [100_000, 1_000_000, 10_000_000, 100_000_000]
for n in sizes:
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c_cpu = np.zeros(n, dtype=np.float32)
c_gpu = np.zeros(n, dtype=np.float32)
# CPU
start = time.time()
vector_ops_cpu(a, b, c_cpu)
cpu_time = time.time() - start
# GPU
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c_gpu)
threads = 256
blocks = math.ceil(n / threads)
# Warm-up
vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
cuda.synchronize()
start = time.time()
vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
cuda.synchronize()
gpu_time = time.time() - start
c_gpu = d_c.copy_to_host()
print(f"N={n:>12,}: CPU={cpu_time:.4f}s, GPU={gpu_time:.4f}s, "
f"Speedup={cpu_time/gpu_time:.1f}x")
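When benchmarking, it also helps to check both implementations against a vectorized NumPy reference computing the same expression (a verification sketch; vector_ops_ref is a name introduced here):

```python
import math
import numpy as np

def vector_ops_ref(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Vectorized NumPy reference for the element-wise expression above."""
    return np.sqrt(a**2 + b**2) * np.sin(a) + np.log(np.abs(b) + 1)

rng = np.random.default_rng(0)
a = rng.standard_normal(10_000, dtype=np.float32)
b = rng.standard_normal(10_000, dtype=np.float32)
ref = vector_ops_ref(a, b)
# Spot-check one element against the scalar formula used in the kernels
expected0 = math.sqrt(a[0]**2 + b[0]**2) * math.sin(a[0]) + math.log(abs(b[0]) + 1)
assert abs(ref[0] - expected0) < 1e-4
```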
9.2 Example 2: Image Processing Acceleration (CuPy)
An example of GPU-accelerated real-time image filtering.
import cupy as cp
import numpy as np
from PIL import Image
def gpu_gaussian_blur(image_array, kernel_size=5, sigma=1.0):
"""GPU Gaussian blur implementation"""
# Create Gaussian kernel
ax = cp.arange(-kernel_size // 2 + 1., kernel_size // 2 + 1.)
xx, yy = cp.meshgrid(ax, ax)
kernel = cp.exp(-(xx**2 + yy**2) / (2. * sigma**2))
kernel = kernel / kernel.sum()
img_gpu = cp.asarray(image_array, dtype=cp.float32)
result = cp.zeros_like(img_gpu)
# Process each channel with a 2D convolution on the GPU (reflect-padded borders)
from cupyx.scipy.ndimage import convolve
for c in range(img_gpu.shape[2]):
result[:, :, c] = convolve(img_gpu[:, :, c], kernel, mode='reflect')
return cp.asnumpy(result.clip(0, 255).astype(cp.uint8))
# Sobel Edge Detection implemented directly with CuPy RawKernel
sobel_kernel = cp.RawKernel(r'''
extern "C" __global__
void sobel_filter(const float* input, float* output,
int width, int height) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
// Sobel X
float gx = -1.0f * input[(y-1)*width + (x-1)]
+ 1.0f * input[(y-1)*width + (x+1)]
- 2.0f * input[y*width + (x-1)]
+ 2.0f * input[y*width + (x+1)]
- 1.0f * input[(y+1)*width + (x-1)]
+ 1.0f * input[(y+1)*width + (x+1)];
// Sobel Y
float gy = -1.0f * input[(y-1)*width + (x-1)]
- 2.0f * input[(y-1)*width + x]
- 1.0f * input[(y-1)*width + (x+1)]
+ 1.0f * input[(y+1)*width + (x-1)]
+ 2.0f * input[(y+1)*width + x]
+ 1.0f * input[(y+1)*width + (x+1)];
output[y * width + x] = sqrtf(gx * gx + gy * gy);
}
}
''', 'sobel_filter')
def gpu_sobel_edge_detection(gray_image):
"""GPU Sobel edge detection"""
h, w = gray_image.shape
img_gpu = cp.asarray(gray_image, dtype=cp.float32)
output_gpu = cp.zeros_like(img_gpu)
block = (16, 16)
grid = ((w + 15) // 16, (h + 15) // 16)
sobel_kernel(grid, block, (img_gpu, output_gpu, w, h))
return cp.asnumpy(output_gpu.clip(0, 255).astype(cp.uint8))
# Usage example
# img = np.array(Image.open('photo.jpg'))
# blurred = gpu_gaussian_blur(img, kernel_size=11, sigma=3.0)
# gray = np.mean(img, axis=2).astype(np.float32)
# edges = gpu_sobel_edge_detection(gray)
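The RawKernel can be cross-checked against a NumPy version of the same 3x3 stencil (a CPU verification sketch; sobel_ref is introduced here, and border pixels stay zero just as the kernel leaves them unwritten):

```python
import numpy as np

def sobel_ref(img: np.ndarray) -> np.ndarray:
    """NumPy mirror of the sobel_filter kernel; border pixels stay zero."""
    out = np.zeros_like(img)
    gx = (img[:-2, 2:] - img[:-2, :-2]
          + 2 * (img[1:-1, 2:] - img[1:-1, :-2])
          + img[2:, 2:] - img[2:, :-2])
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    out[1:-1, 1:-1] = np.sqrt(gx * gx + gy * gy)
    return out

img = np.zeros((5, 5), dtype=np.float32)
img[:, 3:] = 1.0                        # vertical edge between columns 2 and 3
edges = sobel_ref(img)
assert abs(edges[2, 2] - 4.0) < 1e-5    # full Sobel-X response at the edge
assert edges[0, 0] == 0.0               # borders untouched, like the kernel
```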
9.3 Example 3: PyTorch Model GPU Training
A complete PyTorch GPU training pipeline example.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.amp import autocast, GradScaler
import time
# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
print(f"GPU: {torch.cuda.get_device_name(0)}")
# Model definition
class ResidualBlock(nn.Module):
def __init__(self, dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, dim),
nn.LayerNorm(dim),
nn.GELU(),
nn.Linear(dim, dim),
nn.LayerNorm(dim),
)
def forward(self, x):
return x + self.net(x)
class DeepModel(nn.Module):
def __init__(self, input_dim=784, hidden_dim=512, num_classes=10, num_blocks=6):
super().__init__()
self.input_proj = nn.Linear(input_dim, hidden_dim)
self.blocks = nn.Sequential(*[ResidualBlock(hidden_dim) for _ in range(num_blocks)])
self.head = nn.Linear(hidden_dim, num_classes)
def forward(self, x):
x = self.input_proj(x)
x = self.blocks(x)
return self.head(x)
# Data generation (in practice, load a dataset)
X_train = torch.randn(50000, 784)
y_train = torch.randint(0, 10, (50000,))
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
num_workers=4, pin_memory=True) # pin_memory is important!
# Move model to GPU
model = DeepModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scaler = GradScaler('cuda')
# Training loop (Mixed Precision)
def train_epoch(model, loader, criterion, optimizer, scaler, device):
model.train()
total_loss = 0
correct = 0
total = 0
for batch_idx, (data, target) in enumerate(loader):
data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)
optimizer.zero_grad(set_to_none=True) # set_to_none=True is more efficient
with autocast('cuda'):
output = model(data)
loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
total_loss += loss.item()
_, predicted = output.max(1)
total += target.size(0)
correct += predicted.eq(target).sum().item()
return total_loss / len(loader), 100. * correct / total
# Run training
num_epochs = 10
for epoch in range(num_epochs):
start = time.time()
loss, acc = train_epoch(model, train_loader, criterion, optimizer, scaler, device)
elapsed = time.time() - start
print(f"Epoch {epoch+1}/{num_epochs}: Loss={loss:.4f}, "
f"Acc={acc:.2f}%, Time={elapsed:.2f}s")
# Print GPU memory status
if device.type == 'cuda':
alloc = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f" GPU Memory: {alloc:.2f} GB allocated, {reserved:.2f} GB reserved")
9.4 Example 4: Multi-GPU Training
DataParallel (Simple but Inefficient)
import torch
import torch.nn as nn
model = DeepModel().cuda()
# Apply DataParallel (one line!)
if torch.cuda.device_count() > 1:
print(f"Using {torch.cuda.device_count()} GPUs with DataParallel")
model = nn.DataParallel(model)
# Rest of training code is the same
# Drawback: Load imbalance with GPU 0 being the bottleneck
DistributedDataParallel (Recommended)
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
import os
def setup(rank, world_size):
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(rank)
def cleanup():
dist.destroy_process_group()
def train_ddp(rank, world_size):
setup(rank, world_size)
# Place model on the corresponding GPU
model = DeepModel().to(rank)
ddp_model = DDP(model, device_ids=[rank])
# Distributed data loader
train_dataset = TensorDataset(X_train, y_train)
sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
train_loader = DataLoader(train_dataset, batch_size=256,
sampler=sampler, num_workers=4, pin_memory=True)
optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler('cuda')
for epoch in range(10):
sampler.set_epoch(epoch) # Shuffle each epoch
for data, target in train_loader:
data = data.to(rank, non_blocking=True)
target = target.to(rank, non_blocking=True)
optimizer.zero_grad(set_to_none=True)
with autocast('cuda'):
output = ddp_model(data)
loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
if rank == 0:
print(f"Epoch {epoch+1} complete")
cleanup()
# Execution option 1: torchrun (preferred; launches one process per GPU)
# torchrun --nproc_per_node=4 train_ddp.py
# Execution option 2: spawn the processes manually
import torch.multiprocessing as mp
world_size = torch.cuda.device_count()
mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True)
9.5 Example 5: Writing CUDA Kernels Directly (PyCUDA)
Implementing a practical Reduction (summation) kernel with PyCUDA.
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
# Parallel Reduction kernel
mod = SourceModule("""
__global__ void parallel_reduce(float *input, float *output, int n) {
extern __shared__ float sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;
// First level reduction (performed during global load)
float sum = 0.0f;
if (i < n) sum += input[i];
if (i + blockDim.x < n) sum += input[i + blockDim.x];
sdata[tid] = sum;
__syncthreads();
// Reduction in Shared Memory
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// Warp-level reduction via volatile (classic pattern; on Volta and later GPUs,
// independent thread scheduling makes __syncwarp() or shuffle intrinsics safer)
if (tid < 32) {
volatile float *smem = sdata;
smem[tid] += smem[tid + 32];
smem[tid] += smem[tid + 16];
smem[tid] += smem[tid + 8];
smem[tid] += smem[tid + 4];
smem[tid] += smem[tid + 2];
smem[tid] += smem[tid + 1];
}
if (tid == 0) {
output[blockIdx.x] = sdata[0];
}
}
""")
reduce_kernel = mod.get_function("parallel_reduce")
# Execution
n = 1 << 20 # 1M elements
data = np.random.randn(n).astype(np.float32)
block_size = 256
grid_size = (n + block_size * 2 - 1) // (block_size * 2)
d_input = cuda.mem_alloc(data.nbytes)
d_output = cuda.mem_alloc(grid_size * 4) # float32
cuda.memcpy_htod(d_input, data)
reduce_kernel(d_input, d_output, np.int32(n),
block=(block_size, 1, 1), grid=(grid_size, 1),
shared=block_size * 4)
# Get partial sums
partial_sums = np.zeros(grid_size, dtype=np.float32)
cuda.memcpy_dtoh(partial_sums, d_output)
gpu_sum = partial_sums.sum()
# Verification
cpu_sum = data.sum()
print(f"GPU sum: {gpu_sum:.6f}")
print(f"CPU sum: {cpu_sum:.6f}")
print(f"Difference: {abs(gpu_sum - cpu_sum):.6f}")
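The kernel's halving loop is a tree reduction; the same pairwise strides can be replayed in plain Python to see why it matches a sequential sum (up to floating-point ordering, which is why GPU and CPU sums can differ slightly). Sketch:

```python
def tree_reduce(values):
    """Replay the kernel's s = n/2, n/4, ... pairwise reduction on the CPU."""
    data = list(values)
    size = 1
    while size < len(data):             # pad to a power of two, like idle threads
        size *= 2
    data += [0.0] * (size - len(data))
    s = size // 2
    while s >= 1:
        for tid in range(s):            # every "thread" tid < s adds its partner
            data[tid] += data[tid + s]
        s //= 2
    return data[0]

vals = [float(i) for i in range(1000)]
assert tree_reduce(vals) == sum(vals)   # 499500.0
```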
10. CUDA Tools and Libraries
10.1 Core Library Ecosystem
| Library | Purpose | Key Features |
|---|---|---|
| cuDNN | Deep Learning | Optimized DNN primitives: Convolution, RNN, Attention, BatchNorm |
| cuBLAS | Linear Algebra | BLAS Level 1/2/3 operations: GEMM, TRSM. FP8/BF16 Group GEMM support (Blackwell) |
| NCCL | Multi-GPU Comms | Collective communication: AllReduce, AllGather, Broadcast. NVLink/NVSwitch optimized |
| TensorRT | Inference Opt | Graph optimization, quantization (INT8/FP8), layer fusion, dynamic batching |
| Triton | GPU Programming | Python DSL for high-performance kernels. Developed by OpenAI |
| CUTLASS | Custom GEMM | Template-based CUDA matrix multiplication library |
| FlashAttention | Attention Accel | IO-aware algorithm for Transformer Attention acceleration |
| cuSPARSE | Sparse Matrices | Optimized sparse matrix operations |
| cuRAND | Random Numbers | GPU-accelerated pseudo/quasi-random number generation |
| cuFFT | FFT | GPU-accelerated Fast Fourier Transform |
10.2 cuDNN
# cuDNN is built into PyTorch/TensorFlow and used automatically
# Enable cuDNN benchmark mode in PyTorch
import torch
torch.backends.cudnn.benchmark = True # Auto-select optimal algorithm
torch.backends.cudnn.deterministic = False # Performance priority (when reproducibility is not needed)
# Check cuDNN version
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")
10.3 NCCL (Multi-GPU Communication)
# NCCL usage in PyTorch (internal to DistributedDataParallel)
import torch.distributed as dist
# Initialize NCCL backend
dist.init_process_group(backend='nccl')
# Key collective communication operations
# AllReduce: Sum tensors from all GPUs and distribute to each GPU
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
# AllGather: Gather tensors from all GPUs and copy the full set to each GPU
gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
dist.all_gather(gathered, tensor)
# Broadcast: Copy a tensor from one GPU to all GPUs
dist.broadcast(tensor, src=0)
10.4 TensorRT
# Inference optimization with TensorRT
import tensorrt as trt
import torch
# Convert PyTorch model to ONNX
model = MyModel().cuda().eval()
dummy_input = torch.randn(1, 3, 224, 224).cuda()
torch.onnx.export(model, dummy_input, "model.onnx",
input_names=['input'], output_names=['output'],
dynamic_axes={'input': {0: 'batch_size'}})
# Simple conversion with torch_tensorrt (PyTorch 2.x)
import torch_tensorrt
optimized_model = torch_tensorrt.compile(model,
inputs=[torch_tensorrt.Input(
min_shape=[1, 3, 224, 224],
opt_shape=[8, 3, 224, 224],
max_shape=[32, 3, 224, 224],
dtype=torch.float16
)],
enabled_precisions={torch.float16},
)
# Inference
with torch.no_grad():
output = optimized_model(input_tensor.half().cuda())
# Typical TensorRT performance gains:
# - FP32 -> FP16: 2-3x faster
# - FP32 -> INT8: 3-5x faster
# - Layer fusion + optimization: Additional 20-50% improvement
10.5 OpenAI Triton
Triton is a programming language for writing high-performance GPU kernels in Python.
import triton
import triton.language as tl
import torch
# Triton kernel: vector addition
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
pid = tl.program_id(axis=0)
block_start = pid * BLOCK_SIZE
offsets = block_start + tl.arange(0, BLOCK_SIZE)
mask = offsets < n_elements
x = tl.load(x_ptr + offsets, mask=mask)
y = tl.load(y_ptr + offsets, mask=mask)
output = x + y
tl.store(output_ptr + offsets, output, mask=mask)
# Usage
def triton_add(x: torch.Tensor, y: torch.Tensor):
output = torch.empty_like(x)
n_elements = output.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
return output
# Execution
x = torch.randn(1_000_000, device='cuda')
y = torch.randn(1_000_000, device='cuda')
result = triton_add(x, y)
assert torch.allclose(result, x + y)
# Advantages of Triton:
# - Higher-level abstraction than CUDA C
# - Automatic memory access optimization
# - Write high-performance kernels with Python syntax
# - Backend for PyTorch 2.0 torch.compile
10.6 FlashAttention
FlashAttention accelerates Transformer Self-Attention with an IO-aware algorithm.
# FlashAttention-3 (H100 optimized)
# Install: pip install flash-attn
from flash_attn import flash_attn_func
# Input: (batch, seqlen, nheads, headdim)
q = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
k = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
v = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
# FlashAttention execution
output = flash_attn_func(q, k, v, causal=True)
# Performance comparison (FlashAttention-3 vs standard Attention):
# - Memory: O(N^2) -> O(N) (sequence length N)
# - Speed: 2-4x faster (up to 740 TFLOPS on H100 FP16)
# - FP8: up to 1.2 PFLOPS (H100)
# - FlashAttention-3 is 1.5-2.0x faster than FlashAttention-2
# PyTorch 2.0+ built-in scaled_dot_product_attention
output = torch.nn.functional.scaled_dot_product_attention(
q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
is_causal=True
)
# PyTorch automatically selects FlashAttention or Memory-Efficient Attention
11. CUDA Troubleshooting
11.1 CUDA OOM (Out of Memory) Solutions
GPU memory shortage is the most common CUDA issue.
# Error: RuntimeError: CUDA out of memory.
# Solution 1: Reduce batch size
# batch_size = 64 -> 32 -> 16
# Solution 2: Use Mixed Precision (~50% memory savings)
from torch.amp import autocast, GradScaler
scaler = GradScaler('cuda')
# Solution 3: Gradient Accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(loader):
data, target = data.cuda(), target.cuda()
with autocast('cuda'):
output = model(data)
loss = criterion(output, target) / accumulation_steps
scaler.scale(loss).backward()
if (i + 1) % accumulation_steps == 0:
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
# Solution 4: Gradient Checkpointing (memory-compute tradeoff)
from torch.utils.checkpoint import checkpoint
class MemEfficientModel(nn.Module):
def forward(self, x):
# Don't store intermediate activations; recompute during backward
x = checkpoint(self.block1, x, use_reentrant=False)
x = checkpoint(self.block2, x, use_reentrant=False)
return x
# Solution 5: Clear cache
torch.cuda.empty_cache()
# Solution 6: Debug memory leaks
print(torch.cuda.memory_summary())
# Solution 7: Delete unnecessary tensors immediately
del large_tensor
torch.cuda.empty_cache()
# Solution 8: Disable gradients during inference
with torch.no_grad():
output = model(input_data)
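Solution 3 works because accumulating gradients of loss / accumulation_steps over the micro-batches reproduces the gradient of the mean loss over the full batch. A NumPy sketch with a least-squares loss (an illustrative model, not the PyTorch API):

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of the mean squared error 0.5 * mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
y = rng.normal(size=64)
w = rng.normal(size=8)

full_grad = grad_mse(w, X, y)                       # one large batch of 64
steps = 4
accum = np.zeros_like(w)
for Xb, yb in zip(np.split(X, steps), np.split(y, steps)):
    accum += grad_mse(w, Xb, yb) / steps            # loss scaled by 1/steps, as above
assert np.allclose(accum, full_grad)
```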
11.2 CUDA Driver vs Runtime Version Mismatch
# Problem: "CUDA driver version is insufficient for CUDA runtime version"
# How to check
nvidia-smi # Check driver-supported CUDA version
nvcc --version # Check installed CUDA Toolkit version
# Rule: Driver CUDA >= Toolkit CUDA
# Example: If driver supports CUDA 12.2, CUDA 13.0 Toolkit cannot be used
# Solution 1: Update driver
sudo apt-get update
sudo apt-get install nvidia-driver-560 # Install latest driver
# Solution 2: Downgrade to compatible CUDA Toolkit
sudo apt-get install cuda-toolkit-12-4
# Solution 3: Install matching CUDA build of PyTorch/TensorFlow
pip install torch --index-url https://download.pytorch.org/whl/cu124
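The "Driver CUDA >= Toolkit CUDA" rule reduces to a version comparison; a sketch (the version strings below are illustrative):

```python
def cuda_compatible(driver_cuda: str, toolkit_cuda: str) -> bool:
    """True if the driver's supported CUDA version covers the toolkit's version."""
    parse = lambda v: tuple(int(part) for part in v.split('.'))
    return parse(driver_cuda) >= parse(toolkit_cuda)

print(cuda_compatible("12.2", "13.0"))  # False: toolkit newer than the driver supports
print(cuda_compatible("12.4", "12.1"))  # True
```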
11.3 nvidia-smi Key Commands Summary
| Command | Purpose |
|---|---|
| nvidia-smi | GPU status summary |
| nvidia-smi -l 1 | Refresh every 1 second |
| nvidia-smi -q | Detailed info |
| nvidia-smi -q -d MEMORY | Memory details |
| nvidia-smi -q -d CLOCK | Clock speeds |
| nvidia-smi -q -d TEMPERATURE | Temperature info |
| nvidia-smi -q -d POWER | Power consumption |
| nvidia-smi -q -d PERFORMANCE | Performance state |
| nvidia-smi --query-gpu=... --format=csv | CSV output |
| nvidia-smi pmon -i 0 | Process monitoring |
| nvidia-smi dmon -d 1 | Device monitoring |
| nvidia-smi topo -m | GPU topology (NVLink, etc.) |
| nvidia-smi -r -i 0 | Reset GPU 0 |
| nvidia-smi -pm 1 | Enable Persistence Mode |
| nvidia-smi -pl 300 | Set power limit (W) |
11.4 cuda-gdb Debugging
# CUDA debug build
nvcc -g -G -O0 -o debug_app app.cu
# Run cuda-gdb
cuda-gdb ./debug_app
# Key commands
(cuda-gdb) break myKernel # Breakpoint on kernel
(cuda-gdb) run # Run
(cuda-gdb) cuda thread # Current thread info
(cuda-gdb) cuda block # Current block info
(cuda-gdb) cuda kernel # Current kernel info
(cuda-gdb) cuda thread (0,0,0) # Switch to specific thread
(cuda-gdb) cuda block (1,0,0) # Switch to specific block
(cuda-gdb) info cuda threads # List active threads
(cuda-gdb) print threadIdx.x # Print built-in variable
(cuda-gdb) print data[idx] # Check data
# compute-sanitizer (memory error detection)
compute-sanitizer --tool memcheck ./my_app
compute-sanitizer --tool racecheck ./my_app # Race condition detection
compute-sanitizer --tool initcheck ./my_app # Initialization check
11.5 Common Error Messages and Solutions
| Error Message | Cause | Solution |
|---|---|---|
| CUDA error: out of memory | GPU memory shortage | Reduce batch size, Mixed Precision, Gradient Checkpointing |
| CUDA error: device-side assert triggered | Assert failure or index out of bounds in kernel | Run with CUDA_LAUNCH_BLOCKING=1 to identify exact location |
| CUDA error: an illegal memory access | Invalid memory access | Debug with compute-sanitizer, check index bounds |
| CUDA error: no kernel image is available | Compute Capability mismatch | Recompile with correct -arch=sm_XX option |
| CUDA driver version is insufficient | Driver version too old | Update driver or downgrade CUDA Toolkit |
| CUDA error: invalid device function | Wrong architecture target | Check -gencode options, use Fat Binary |
| cuDNN error: CUDNN_STATUS_NOT_SUPPORTED | cuDNN version mismatch or unsupported op | Update cuDNN or check input format |
| NCCL error: unhandled system error | Multi-GPU communication failure | Check network config, enable NCCL_DEBUG=INFO |
# Debugging tip: Switch to synchronous execution to pinpoint exact error location
CUDA_LAUNCH_BLOCKING=1 python train.py
# NCCL debugging
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch train.py
# PyTorch CUDA memory debugging
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py
12. CUDA Alternative Technologies Comparison
12.1 Major GPU Computing Frameworks
| Technology | Developer | Target Hardware | Languages | Ecosystem Maturity | Primary Use |
|---|---|---|---|---|---|
| CUDA | NVIDIA | NVIDIA GPU | C/C++/Python | Very High | AI/ML, HPC, scientific computing |
| ROCm | AMD | AMD GPU (MI300X, etc.) | C/C++ (HIP) | Medium | AI training/inference (PyTorch supported) |
| OpenCL | Khronos | General (GPU/CPU/FPGA) | C/C++ | Medium | Cross-platform GPU computing |
| SYCL | Khronos | General | C++ | Growing | oneAPI (Intel), cross-platform |
| Metal | Apple | Apple Silicon | Swift/Obj-C/C++ | Apple ecosystem | macOS/iOS GPU computing |
| Vulkan Compute | Khronos | General GPU | C/GLSL/SPIR-V | Medium | Cross-platform GPU computation |
| Triton | OpenAI | NVIDIA/AMD GPU | Python | Growing | High-level GPU kernel programming |
| WebGPU | W3C | Browser GPU | WGSL/JS | Early | Web-based GPU computation |
12.2 CUDA vs ROCm Detailed Comparison
CUDA code:

```cpp
cudaMalloc(&d_ptr, size);
myKernel<<<grid, block>>>(d_ptr);
cudaDeviceSynchronize();
```

ROCm HIP code (nearly identical API):

```cpp
hipMalloc(&d_ptr, size);
myKernel<<<grid, block>>>(d_ptr);
hipDeviceSynchronize();
```
AMD ROCm's HIP (Heterogeneous-compute Interface for Portability) provides a very similar API to CUDA, and the hipify-perl tool can automatically convert CUDA code to HIP code. PyTorch officially supports ROCm, enabling training on AMD MI300X and similar hardware.
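The mechanical nature of this conversion can be illustrated with a toy Python version of what hipify-perl does. The real tool maps hundreds of API names and handles headers and types; the table here is a tiny illustrative subset, not the actual mapping file:

```python
# A small sample of CUDA-to-HIP API renames (illustrative subset only).
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpy": "hipMemcpy",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaError_t": "hipError_t",
}

def toy_hipify(src: str) -> str:
    """Apply the rename table to CUDA source text, hipify-style."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        src = src.replace(cuda_name, hip_name)
    return src
```

Because the HIP API mirrors CUDA's one-to-one for most runtime calls, this kind of textual translation is usually enough to get a first compiling HIP port, which is exactly why migrating existing CUDA codebases to AMD hardware is feasible.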
12.3 Selection Guide
| Scenario | Recommended Technology | Reason |
|---|---|---|
| AI/ML training and inference | CUDA | Optimized library ecosystem (cuDNN, TensorRT) |
| AMD GPU usage | ROCm (HIP) | CUDA-compatible API, PyTorch support |
| Cross-platform requirement | OpenCL or SYCL | Broad hardware support |
| Apple environment | Metal | Only option for macOS/iOS |
| Custom kernels (Python) | Triton | Higher productivity than CUDA C |
| Web browser GPU | WebGPU | Standard web API |
13. References
13.1 Official Documentation
- NVIDIA CUDA Programming Guide: Official reference for GPU architecture and programming model
- CUDA C++ Best Practices Guide: Collection of best practices for performance optimization
- CUDA Toolkit Release Notes: Change logs and compatibility information for each version
- NVIDIA Developer Blog: Latest technology trends and tutorials
13.2 Key Reference Links
| Resource | URL |
|---|---|
| CUDA Toolkit Download | https://developer.nvidia.com/cuda-downloads |
| CUDA Programming Guide | https://docs.nvidia.com/cuda/cuda-c-programming-guide/ |
| CUDA Best Practices | https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ |
| cuDNN Documentation | https://docs.nvidia.com/deeplearning/cudnn/ |
| TensorRT Documentation | https://docs.nvidia.com/deeplearning/tensorrt/ |
| NCCL Documentation | https://docs.nvidia.com/deeplearning/nccl/ |
| Nsight Compute | https://developer.nvidia.com/nsight-compute |
| Nsight Systems | https://developer.nvidia.com/nsight-systems |
| RAPIDS Official Site | https://rapids.ai/ |
| FlashAttention GitHub | https://github.com/Dao-AILab/flash-attention |
| OpenAI Triton GitHub | https://github.com/openai/triton |
| PyTorch CUDA Docs | https://pytorch.org/docs/stable/cuda.html |
13.3 Recommended Learning Path
Beginner:
1. Check GPU with nvidia-smi
2. Move tensors and models to the GPU with .to("cuda") or .cuda() in PyTorch
3. Apply Mixed Precision Training
4. Write simple kernels with Numba CUDA JIT
Intermediate:
5. Write kernels directly in CUDA C/C++
6. Optimize with Shared Memory, Coalesced Access
7. Asynchronous execution with CUDA Streams
8. Profile with Nsight Compute
Advanced:
9. Custom Attention kernels with Triton
10. Inference optimization with TensorRT
11. Multi-GPU DDP training
12. Custom GEMM implementation with CUTLASS
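Before writing real kernels (steps 4-5 above), the execution model itself can be grasped with a pure-Python emulation of a grid-stride SAXPY loop. `saxpy_emulated` is an illustrative CPU stand-in, not Numba or CUDA API; each loop iteration plays the role of one GPU thread:

```python
def saxpy_emulated(a, x, y, threads_per_block=4, blocks=2):
    """Emulate a grid-stride SAXPY kernel: out[i] = a * x[i] + y[i]."""
    out = list(y)
    grid_size = threads_per_block * blocks  # total threads in the grid
    for thread_id in range(grid_size):      # each iteration = one GPU thread
        # Grid-stride loop: thread t handles elements t, t+grid_size, ...
        for i in range(thread_id, len(x), grid_size):
            out[i] = a * x[i] + y[i]
    return out
```

On a GPU the outer loop runs in parallel across threads, but the indexing pattern is identical, which is why the grid-stride idiom appears in both Numba CUDA JIT and CUDA C kernels regardless of grid size.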
Conclusion
CUDA is more than a GPU programming tool; it forms the foundation of modern AI infrastructure. Alongside hardware evolution from Ampere to Hopper to Blackwell, the CUDA Toolkit has entered the 13.x era, beginning to offer higher-level abstractions such as CUDA Tile and cuTile DSL.
The key points for effectively leveraging CUDA in practice are as follows.
- Build solid foundations: Understanding the Warp, Block, Grid execution model and memory hierarchy is essential for optimization
- Actively use Python tools: Most GPU acceleration can be achieved with PyTorch's Mixed Precision, CuPy, Numba, and similar tools
- Profile first: Identify bottlenecks with Nsight Compute/Systems before investing in optimization
- Memory is key: Memory optimization techniques like Coalesced Access, Shared Memory Tiling, and Pinned Memory typically deliver the bulk of performance improvements
- Leverage the ecosystem: Maximize use of already-optimized libraries like cuDNN, TensorRT, and FlashAttention
The world of GPU computing continues to evolve. The emergence of high-level programming models like Triton, the expansion of low-precision operations such as FP4/FP6, and the introduction of tile-based programming represented by CUDA Tile are all moving GPU programming toward lowering the barrier to entry while maximizing hardware performance. We hope this guide serves as a starting point for that journey.