CUDA Hands-on Complete Guide: Everything About GPU Computing


1. Introduction: What is CUDA

1.1 Definition of CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables general-purpose computing (GPGPU) by leveraging thousands of GPU cores. Since its initial release in 2007, it has become the de facto standard in various fields including AI, scientific simulation, image processing, and financial modeling.

As of 2025, the latest CUDA Toolkit is version 13.1, which introduces CUDA Tile, a tile-based programming model, and the cuTile Python DSL, providing Tensor Core abstraction and forward compatibility with next-generation Blackwell GPUs.

1.2 History of GPU Computing and the Emergence of CUDA

The evolution of GPU computing can be summarized by era as follows.

| Period | Event | Significance |
| --- | --- | --- |
| 2001 | GPGPU research begins | General-purpose computing attempts using shaders |
| 2006 | NVIDIA Tesla architecture announced | First unified shader architecture |
| 2007 | CUDA 1.0 released | Birth of GPU general-purpose computing |
| 2012 | AlexNet (ImageNet) | Dawn of the GPU deep learning era |
| 2017 | Volta + Tensor Core | Mixed-precision computation acceleration |
| 2020 | Ampere (A100) | TF32, Sparsity support |
| 2022 | Hopper (H100) | Transformer Engine, FP8 support |
| 2024 | Blackwell (B200) | FP4, 5th-gen Tensor Core |
| 2025 | CUDA 13.0/13.1 released | CUDA Tile, cuTile DSL introduced |

1.3 CPU vs GPU Architecture Comparison

CPUs and GPUs have fundamentally different design philosophies.

| Attribute | CPU | GPU |
| --- | --- | --- |
| Design goal | Low latency (serial processing) | High throughput (parallel processing) |
| Core count | A few to dozens | Thousands to tens of thousands |
| Clock speed | High (4-6 GHz) | Relatively low (1-2 GHz) |
| Cache size | Large (tens of MB) | Small (per core) |
| Control logic | Complex (branch prediction, OoO execution) | Simple (latency hiding through many threads) |
| Parallelism model | SIMD (Single Instruction Multiple Data) | SIMT (Single Instruction Multiple Threads) |
| Optimal workload | Complex branching, sequential logic | Large-scale data-parallel computation |

SIMD vs SIMT key difference: SIMD is a vector operation method that processes multiple data simultaneously with a single instruction. SIMT goes a step further, where each thread has its own program counter while executing the same instruction simultaneously. When branching occurs, threads within a Warp can execute different paths, but this causes Warp Divergence, which degrades performance.

1.4 Why CUDA is Essential for AI/ML/Deep Learning

The reasons CUDA is central to modern AI/ML workloads are clear.

  • Matrix computation acceleration: Dramatically accelerates matrix multiplication (GEMM), the core of deep learning, with Tensor Cores
  • Software ecosystem: Optimized library ecosystem including cuDNN, cuBLAS, NCCL, TensorRT
  • Framework support: All major frameworks (PyTorch, TensorFlow, JAX) use CUDA as their default backend
  • Mixed precision: Maximizes training/inference speed with low-precision computation down to FP16, BF16, FP8, and FP4
  • Multi-GPU scaling: Distributed training across hundreds of GPUs via NVLink and NVSwitch

2. GPU Architecture Fundamentals

2.1 NVIDIA GPU Internal Structure

Let us examine the core components of an NVIDIA GPU hierarchically.

GPU (composed of GPCs - Graphics Processing Clusters)
├── SM (Streaming Multiprocessor) x N
│   ├── CUDA Core (FP32/INT32) x 128 (Hopper)
│   ├── Tensor Core x 4 (4th gen, Hopper)
│   ├── RT Core (Ray Tracing) x 1 (consumer GPUs; absent on data center H100)
│   ├── Warp Scheduler x 4
│   ├── Register File (256 KB)
│   ├── L1 Cache / Shared Memory (shared, up to 228 KB)
│   └── SFU (Special Function Unit)
├── L2 Cache (shared)
├── Memory Controller
└── HBM (High Bandwidth Memory)

Key core types:

  • CUDA Core: The basic processing unit for general-purpose floating-point/integer operations. Handles FP32, FP64, INT32 operations
  • Tensor Core: A core specialized for Matrix Multiply-Accumulate (MMA) operations. The key accelerator for deep learning training/inference
  • RT Core: Dedicated hardware for Ray Tracing acceleration. Primarily used for graphics workloads

2.2 Memory Hierarchy

GPU memory is organized hierarchically by speed and size.

| Memory Type | Location | Size (H100) | Bandwidth | Access Scope | Characteristics |
| --- | --- | --- | --- | --- | --- |
| Register | Inside SM | 256 KB/SM | Fastest | Thread-private | Fastest but limited |
| Shared Memory | Inside SM | Up to 228 KB/SM | Very fast | Shared within Block | Programmer-managed cache |
| L1 Cache | Inside SM | Shared with Shared Memory | Very fast | SM-private | HW-managed |
| L2 Cache | GPU-wide | 50 MB | Fast | Shared across all SMs | HW-managed |
| Global Memory (HBM) | On package, off-chip | 80 GB | 3.35 TB/s | Globally accessible | Largest but highest latency |
| Constant Memory | GPU-wide | 64 KB | Fast when cached | Read-only | Broadcast-optimized |
| Texture Memory | GPU-wide | Shared with Global | Fast when cached | Read-only | Optimized for 2D spatial locality |
Speed:  Register > Shared/L1 > L2 > Global (HBM)
Size:   Global (HBM) > L2 > Shared/L1 > Register

2.3 Warp, Block, Grid Concepts

CUDA's execution model has a 3-level hierarchical structure.

Grid (kernel execution unit)
├── Block (0,0)          Block (1,0)          Block (2,0)
│   ├── Warp 0           ├── Warp 0           ├── Warp 0
│   │   ├── Thread 0     │   ├── Thread 0     │   ├── Thread 0
│   │   ├── Thread 1     │   ├── Thread 1     │   ├── Thread 1
│   │   ├── ...          │   ├── ...          │   ├── ...
│   │   └── Thread 31    │   └── Thread 31    │   └── Thread 31
│   ├── Warp 1           ├── Warp 1           ├── Warp 1
│   └── ...              └── ...              └── ...
├── Block (0,1)          Block (1,1)          ...
└── ...
  • Thread: The smallest execution unit of GPU computation
  • Warp: An execution group of 32 Threads. The unit that executes the same instruction simultaneously on an SM (the core of SIMT)
  • Block (Thread Block): Composed of multiple Warps. Executes on the same SM and shares Shared Memory
  • Grid: The collection of Blocks that make up the entire kernel execution

Key constraints:

| Item | Limit (Compute Capability 9.0) |
| --- | --- |
| Max Threads per Block | 1,024 |
| Warp size | 32 (fixed) |
| Max Block dimensions | (1024, 1024, 64) |
| Max Grid dimensions | (2^31 - 1, 65535, 65535) |
| Max Blocks per SM | 32 |
| Max Warps per SM | 64 |

2.4 Compute Capability Version Differences

Compute Capability (CC) defines the feature set of GPU hardware.

| CC | Architecture | Representative GPU | Key Features |
| --- | --- | --- | --- |
| 7.0 | Volta | V100 | 1st-gen Tensor Core, independent thread scheduling |
| 7.5 | Turing | RTX 2080 | INT8/INT4 Tensor Core, RT Core |
| 8.0 | Ampere | A100 | 3rd-gen Tensor Core, TF32, Sparsity |
| 8.6 | Ampere | RTX 3090 | Consumer Ampere |
| 8.9 | Ada Lovelace | RTX 4090 | 4th-gen Tensor Core, FP8, DLSS 3 |
| 9.0 | Hopper | H100 | 4th-gen Tensor Core, Transformer Engine, DPX |
| 10.0 | Blackwell | B200 | 5th-gen Tensor Core, FP4, TMEM |
| 12.0 | Blackwell Ultra | B300 | Enhanced 5th-gen Tensor Core, 288 GB HBM3E |

Note that starting with CUDA 13.0, support for Maxwell (CC 5.x), Pascal (CC 6.x), and Volta (CC 7.0) has been removed.

2.5 Latest GPU Generation Comparison

Comparing three generations of data center GPUs.

| Spec | A100 (SXM) | H100 (SXM5) | B200 (SXM) | B300 (SXM) |
| --- | --- | --- | --- | --- |
| Architecture | Ampere | Hopper | Blackwell | Blackwell Ultra |
| CUDA Cores | 6,912 | 16,896 | 18,432 | 18,432+ |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) | 5th gen | 5th gen |
| Memory | 80 GB HBM2e | 80 GB HBM3 | 180 GB HBM3E | 288 GB HBM3E |
| Memory BW | 2.0 TB/s | 3.35 TB/s | 7.7 TB/s | 8.0 TB/s |
| FP32 Perf | 19.5 TFLOPS | 60 TFLOPS | Undisclosed | Undisclosed |
| FP16 Tensor | 312 TFLOPS | 990 TFLOPS | Undisclosed | Undisclosed |
| FP4 Tensor | Not supported | Not supported | 9.0 PFLOPS | 14.0 PFLOPS |
| NVLink | 3rd gen (600 GB/s) | 4th gen (900 GB/s) | 5th gen (1.8 TB/s) | 5th gen (1.8 TB/s) |
| TDP | 400 W | 700 W | 1,000 W | 1,400 W |
| Cooling | Air/Liquid | Air/Liquid | Liquid recommended | Liquid required (DLC) |

The B200 achieves 3x training performance and 15x inference performance over the A100. The B300 (Blackwell Ultra) provides 14 PFLOPS in FP4 operations, 55.6% faster than the B200.


3. CUDA Development Environment Setup

3.1 CUDA Toolkit Installation

Linux (Ubuntu/Debian)

# 1. Check NVIDIA driver
nvidia-smi

# 2. Add CUDA Keyring (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# 3. Install CUDA Toolkit
sudo apt-get install cuda-toolkit-13-1

# 4. Set environment variables
export PATH=/usr/local/cuda-13.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.1/lib64:$LD_LIBRARY_PATH

# 5. Verify installation
nvcc --version

Windows

# 1. Download CUDA Toolkit from the NVIDIA official site
# https://developer.nvidia.com/cuda-downloads

# 2. Check environment variables after installation
echo %CUDA_PATH%
# Typically C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1

# 3. Verify
nvcc --version
nvidia-smi

3.2 nvidia-smi Command Usage

nvidia-smi is the essential tool for monitoring GPU status.

# Basic info output
nvidia-smi

# Output example:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 560.35.03    Driver Version: 560.35.03    CUDA Version: 13.1                 |
# |-----------------------------------------+------------------------+----------------------|
# | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
# |=========================================+========================+======================|
# |   0  NVIDIA H100 80GB HBM3          On | 00000000:3B:00.0   Off |                    0 |
# | N/A   32C    P0              72W / 700W |    1234MiB / 81559MiB  |      0%      Default |
# +-----------------------------------------+------------------------+----------------------+

# Continuous monitoring (1 second interval)
nvidia-smi -l 1

# dmon - GPU metrics monitoring
nvidia-smi dmon -s pucvmet -d 1

# Specific GPU info
nvidia-smi -i 0 -q

# GPU process monitoring
nvidia-smi pmon -i 0

# GPU clock and memory info
nvidia-smi -q -d CLOCK,MEMORY

# CSV format output (for scripting)
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
  --format=csv,noheader,nounits

3.3 nvcc Compiler Usage

nvcc is the NVIDIA CUDA Compiler for compiling CUDA source code.

# Basic compilation
nvcc -o hello hello.cu

# Target specific architecture
nvcc -arch=sm_90 -o kernel kernel.cu   # Hopper H100
nvcc -arch=sm_80 -o kernel kernel.cu   # Ampere A100

# Debug mode
nvcc -g -G -o debug_kernel kernel.cu

# Optimization level
nvcc -O3 -o optimized kernel.cu

# Generate PTX code (intermediate representation)
nvcc -ptx kernel.cu

# Support multiple architectures simultaneously (Fat Binary)
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -o multi_arch kernel.cu

# Library linking
nvcc -lcublas -lcurand -o linked_kernel kernel.cu

# Detailed compilation info
nvcc --resource-usage -o kernel kernel.cu

3.4 CUDA Version Check and Compatibility

There are two types of CUDA versions, and it is important not to confuse them.

# 1. Driver-supported CUDA version (Driver API)
nvidia-smi
# "CUDA Version: 13.1" shown in upper right
# This is the maximum CUDA runtime version the driver supports

# 2. CUDA Toolkit version (Runtime API)
nvcc --version
# Shows "release 13.1"
# The actually installed CUDA Toolkit version

# 3. Check at runtime
python3 -c "import torch; print(torch.version.cuda)"

Key compatibility rules:

  • The driver CUDA version must be greater than or equal to the Toolkit CUDA version
  • Example: If the driver supports CUDA 13.1, you can also use CUDA 12.x Toolkit (backward compatible)
  • According to CUDA Toolkit's Minor Version Compatibility policy, binary compatibility is maintained within the same Major version

3.5 Using CUDA in Docker

The NVIDIA Container Toolkit allows you to leverage GPUs in Docker containers.

# 1. Install NVIDIA Container Toolkit (Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# 2. Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 3. Run GPU container
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

# Use specific GPUs only
docker run --rm --gpus '"device=0,1"' nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

# GPU usage in docker compose
# docker-compose.yml:
services:
  gpu-app:
    image: nvidia/cuda:13.1.0-devel-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: nvidia-smi

3.6 Installing PyTorch/TensorFlow CUDA via conda/pip

# PyTorch (CUDA 12.4 build example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Using conda
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

# TensorFlow (auto GPU detection)
pip install tensorflow[and-cuda]

# Check CUDA availability (Python)
python3 -c "
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA version: {torch.version.cuda}')
print(f'GPU count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
    print(f'GPU name: {torch.cuda.get_device_name(0)}')
"

4. CUDA Programming Basics (C/C++)

4.1 Host (CPU) vs Device (GPU) Code

In CUDA programming, CPU-side code is called Host and GPU-side code is called Device.

┌─────────────────────────────────────────────────────────┐
│  Host (CPU)                                             │
│  ┌─────────────────────────────────────────────────────┐│
│  │ 1. Prepare data (Host Memory)                       ││
│  │ 2. Allocate GPU memory (cudaMalloc)                 ││
│  │ 3. Transfer data: Host -> Device (cudaMemcpy)       ││
│  │ 4. Launch kernel (<<<grid, block>>>)                ││
│  │ 5. Transfer results: Device -> Host (cudaMemcpy)    ││
│  │ 6. Free GPU memory (cudaFree)                       ││
│  └─────────────────────────────────────────────────────┘│
│                     ↕ PCIe / NVLink                     │
│  ┌─────────────────────────────────────────────────────┐│
│  │  Device (GPU)                                       ││
│  │  - Parallel execution of kernel functions           ││
│  │  - Thousands of threads processing simultaneously   ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘

4.2 Function Qualifiers

CUDA provides three function qualifiers.

| Qualifier | Execution Location | Callable From | Description |
| --- | --- | --- | --- |
| __global__ | Device (GPU) | Host (CPU) | Kernel function. Return type must be void |
| __device__ | Device (GPU) | Device (GPU) | Helper function callable only within GPU code |
| __host__ | Host (CPU) | Host (CPU) | Normal CPU function (default, can be omitted) |
| __host__ __device__ | Both | Both | Usable on both CPU and GPU |

// __global__: Kernel function - called from Host, executed on Device
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2.0f;
    }
}

// __device__: Device-only helper function
__device__ float square(float x) {
    return x * x;
}

// __host__ __device__: CPU/GPU compatible function
__host__ __device__ float add(float a, float b) {
    return a + b;
}

4.3 Kernel Launch Syntax

Kernel functions are launched using special triple angle bracket syntax.

// Kernel launch syntax
// kernel<<<gridDim, blockDim, sharedMemSize, stream>>>(args...);
//
// gridDim:       Number of Blocks in Grid (dim3)
// blockDim:      Number of Threads in Block (dim3)
// sharedMemSize: Dynamic Shared Memory size (bytes, optional)
// stream:        Execution stream (optional)

// 1D example: 256 threads, 1 block
myKernel<<<1, 256>>>(d_data, n);

// 1D example: Process N data elements with 256-thread blocks
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);

// 2D example: Image processing
dim3 blockDim(16, 16);       // 16x16 = 256 threads/block
dim3 gridDim(
    (width + 15) / 16,
    (height + 15) / 16
);
imageKernel<<<gridDim, blockDim>>>(d_image, width, height);

4.4 Thread Hierarchy

Each thread can determine its position through built-in variables.

// Built-in Variables
threadIdx.x, threadIdx.y, threadIdx.z  // Thread index within Block
blockIdx.x,  blockIdx.y,  blockIdx.z   // Block index within Grid
blockDim.x,  blockDim.y,  blockDim.z   // Block dimension sizes
gridDim.x,   gridDim.y,   gridDim.z    // Grid dimension sizes

// Global Thread ID calculation (1D)
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

// Global Thread ID calculation (2D)
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int globalIdx = row * width + col;

// Total thread count (for Grid-stride loop)
int totalThreads = gridDim.x * blockDim.x;

4.5 Memory Management Functions

// GPU memory allocation
float* d_data;
cudaMalloc((void**)&d_data, n * sizeof(float));

// Host -> Device copy
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

// Device -> Host copy
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

// GPU memory free
cudaFree(d_data);

// GPU memory initialization
cudaMemset(d_data, 0, n * sizeof(float));

// Error check macro (essential!)
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error at %s:%d - %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Usage example
CUDA_CHECK(cudaMalloc((void**)&d_data, n * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice));

4.6 Hello World Example

// hello_cuda.cu
#include <stdio.h>

__global__ void helloKernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU thread %d (block %d, thread %d)\n",
           tid, blockIdx.x, threadIdx.x);
}

int main() {
    // 2 blocks, 4 threads each = 8 threads total
    helloKernel<<<2, 4>>>();

    // Wait for GPU work to complete
    cudaDeviceSynchronize();

    printf("Hello from CPU!\n");
    return 0;
}
# Compile and run
nvcc -o hello hello_cuda.cu
./hello

4.7 Vector Addition Example

This is the "Hello World" of CUDA - a vector addition example.

// vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s at %s:%d\n", \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// GPU kernel: vector addition
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements
    size_t bytes = N * sizeof(float);

    // Host memory allocation and initialization
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    for (int i = 0; i < N; i++) {
        h_A[i] = (float)i;
        h_B[i] = (float)(i * 2);
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_C, bytes));

    // Host -> Device copy
    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    // Kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Error check
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Device -> Host copy
    CUDA_CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));

    // Result verification
    for (int i = 0; i < N; i++) {
        if (fabs(h_C[i] - (h_A[i] + h_B[i])) > 1e-5) {
            fprintf(stderr, "Verification failed at index %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }
    printf("Vector addition of %d elements: SUCCESS\n", N);

    // Memory cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

4.8 Matrix Multiplication Example

// matmul.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define TILE_SIZE 16

// Basic matrix multiplication kernel
__global__ void matMulBasic(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Tiled matrix multiplication using Shared Memory (optimized version)
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Iterate over tiles
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load tiles into Shared Memory
        if (row < M && (t * TILE_SIZE + threadIdx.x) < K)
            tileA[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if ((t * TILE_SIZE + threadIdx.y) < K && col < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // Wait for tile load to complete

        // Multiply-accumulate within tile
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();  // Synchronize before loading next tile
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}

int main() {
    const int M = 1024, N = 1024, K = 1024;

    // ... (memory allocation, initialization, data transfer omitted)

    dim3 blockDim(TILE_SIZE, TILE_SIZE);
    dim3 gridDim((N + TILE_SIZE - 1) / TILE_SIZE,
                 (M + TILE_SIZE - 1) / TILE_SIZE);

    matMulTiled<<<gridDim, blockDim>>>(d_A, d_B, d_C, M, N, K);

    // ... (result copy, verification, memory cleanup omitted)
    return 0;
}

5. CUDA Memory Management In-Depth

5.1 Global Memory and Coalesced Access

Coalesced Access is critical to performance when accessing Global Memory. When 32 threads in a Warp access contiguous memory addresses, the GPU combines them into a single transaction.

// Coalesced Access (good pattern)
// Threads in a Warp access contiguous memory
__global__ void coalesced(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // Contiguous access
    }
}

// Non-coalesced Access (bad pattern)
// Threads access non-contiguously with a stride
__global__ void strided(float* data, int n, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // Strided access
    }
}

// AoS (Array of Structures) vs SoA (Structure of Arrays)
// AoS: Non-coalesced (bad)
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
};
// Thread 0: particle[0].x, Thread 1: particle[1].x -> strided access

// SoA: Coalesced (good)
struct ParticlesSoA {
    float* x;  float* y;  float* z;
    float* vx; float* vy; float* vz;
};
// Thread 0: x[0], Thread 1: x[1] -> contiguous access

5.2 Shared Memory and Bank Conflict

Shared Memory is high-speed memory shared within a thread block on an SM. It is divided into 32 banks, and a Bank Conflict occurs when threads in the same Warp simultaneously access different addresses that map to the same bank (accesses to the same address are broadcast and do not conflict).

// Shared Memory usage example
__global__ void sharedMemExample(float* input, float* output, int n) {
    // Static allocation
    __shared__ float sharedData[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from Global -> Shared
    if (gid < n) {
        sharedData[tid] = input[gid];
    }
    __syncthreads();  // Wait until all threads have loaded

    // Compute on Shared Memory (neighbor element sum example)
    if (tid > 0 && tid < blockDim.x - 1 && gid < n) {
        output[gid] = sharedData[tid - 1] + sharedData[tid] + sharedData[tid + 1];
    }
}

// Bank Conflict avoidance: padding technique
// Problem: When accessing columns of a 32x32 matrix, all threads map to the same bank
__shared__ float tile[32][32];       // Bank Conflict!
__shared__ float tile[32][32 + 1];   // Padding avoids Bank Conflict

5.3 Unified Memory

Unified Memory is an abstraction layer that allows both CPU and GPU to access memory through the same pointer.

// Unified Memory usage
float* data;
cudaMallocManaged(&data, N * sizeof(float));

// Initialize on CPU
for (int i = 0; i < N; i++) {
    data[i] = (float)i;
}

// GPU kernel execution - no separate cudaMemcpy needed!
myKernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();

// Access results directly from CPU
printf("Result: %f\n", data[0]);

// Free
cudaFree(data);

// Memory prefetch hints (performance optimization)
cudaMemPrefetchAsync(data, N * sizeof(float), deviceId);  // Prefetch to GPU
cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId);  // Prefetch to CPU

5.4 Pinned Memory

Pinned Memory (Page-locked Memory) is host memory excluded from OS page swapping, which increases data transfer speed between GPU and host.

// Pinned Memory allocation
float* h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));  // or cudaHostAlloc

// Transfer speed comparison vs regular memory
// Pageable:  ~12 GB/s (PCIe Gen4)
// Pinned:    ~25 GB/s (PCIe Gen4) - approximately 2x faster

// Required for asynchronous transfers
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

// Free
cudaFreeHost(h_pinned);

5.5 Memory Pool

Memory Pool, introduced in CUDA 11.2, reduces overhead from repetitive memory allocation/deallocation.

// Stream-ordered Memory Allocator (CUDA 11.2+)
float* d_data;
cudaMallocAsync(&d_data, bytes, stream);

// Asynchronous free after use
cudaFreeAsync(d_data, stream);

// Memory Pool configuration
cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, deviceId);

// Set pool size limit
uint64_t threshold = 1ULL << 30;  // 1 GB
cudaMemPoolSetAttribute(mempool, cudaMemPoolAttrReleaseThreshold, &threshold);

5.6 Memory Bandwidth Optimization Summary

| Technique | Effect | Implementation Difficulty |
| --- | --- | --- |
| Coalesced Access | Maximize Global Memory throughput | Low |
| SoA layout | Guarantee Coalesced Access | Medium |
| Shared Memory Tiling | Reduce Global Memory access count | Medium |
| Bank Conflict avoidance (padding) | Maximize Shared Memory throughput | Low |
| Pinned Memory | ~2x faster Host-Device transfer | Low |
| Unified Memory + Prefetch | Programming convenience + performance | Low |
| Memory Pool | Eliminate allocation/deallocation overhead | Low |
| Texture Memory | Leverage 2D spatial locality | Medium |

6. CUDA Streams and Asynchronous Execution

6.1 CUDA Stream Concept

A CUDA Stream is a sequence of commands executed in order on the GPU. Operations on different Streams can execute concurrently.

Default Stream (Stream 0):
[MemcpyH2D] -> [Kernel A] -> [MemcpyD2H]
                                           (sequential, no pipeline)

Multi-Stream:
Stream 1: [MemcpyH2D_1] -> [Kernel_1] -> [MemcpyD2H_1]
Stream 2:     [MemcpyH2D_2] -> [Kernel_2] -> [MemcpyD2H_2]
Stream 3:         [MemcpyH2D_3] -> [Kernel_3] -> [MemcpyD2H_3]
                                           (overlapped execution, maximize GPU utilization)

6.2 Stream Creation and Usage

// Stream creation
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Asynchronous memory transfer (Pinned Memory required!)
cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream2);

// Kernel execution (on specific Stream)
kernelA<<<grid, block, 0, stream1>>>(d_A);
kernelB<<<grid, block, 0, stream2>>>(d_B);

// Asynchronous result copy
cudaMemcpyAsync(h_A, d_A, bytes, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_B, d_B, bytes, cudaMemcpyDeviceToHost, stream2);

// Stream synchronization
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Stream cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

6.3 Multi-Stream Pipeline Pattern

const int NUM_STREAMS = 4;
const int CHUNK_SIZE = N / NUM_STREAMS;

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
    cudaStreamCreate(&streams[i]);
}

// Pinned Memory allocation
float *h_in, *h_out;
cudaMallocHost(&h_in, N * sizeof(float));
cudaMallocHost(&h_out, N * sizeof(float));

float *d_in, *d_out;
cudaMalloc(&d_in, N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));

// Pipeline execution
for (int i = 0; i < NUM_STREAMS; i++) {
    int offset = i * CHUNK_SIZE;
    size_t chunkBytes = CHUNK_SIZE * sizeof(float);

    // 1. Host -> Device (async)
    cudaMemcpyAsync(d_in + offset, h_in + offset,
                    chunkBytes, cudaMemcpyHostToDevice, streams[i]);

    // 2. Kernel execution
    int blocks = (CHUNK_SIZE + 255) / 256;
    processKernel<<<blocks, 256, 0, streams[i]>>>(
        d_in + offset, d_out + offset, CHUNK_SIZE);

    // 3. Device -> Host (async)
    cudaMemcpyAsync(h_out + offset, d_out + offset,
                    chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for all to complete
cudaDeviceSynchronize();

6.4 CUDA Events (Timing Measurement)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Start timing
cudaEventRecord(start);

// Kernel execution
myKernel<<<grid, block>>>(d_data, N);

// Stop timing
cudaEventRecord(stop);
cudaEventSynchronize(stop);

// Calculate elapsed time
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %.3f ms\n", milliseconds);

// Bandwidth calculation
float bandwidth = (bytes_read + bytes_written) / (milliseconds * 1e6);
printf("Effective bandwidth: %.2f GB/s\n", bandwidth);

cudaEventDestroy(start);
cudaEventDestroy(stop);

6.5 Synchronization Function Comparison

| Function | Scope | Purpose |
| --- | --- | --- |
| cudaDeviceSynchronize() | Entire Device | Wait for all Streams to complete |
| cudaStreamSynchronize(stream) | Specific Stream | Wait for that Stream to complete |
| cudaEventSynchronize(event) | Specific Event | Wait for that Event to complete |
| cudaStreamWaitEvent(stream, event) | Between Streams | Stream waits for the Event before proceeding |
| __syncthreads() | Within Block | Synchronize threads within a Block (in kernel) |

7. CUDA Optimization Techniques

7.1 Occupancy Optimization

Occupancy is the ratio of actual active Warps to the maximum supported Warps on an SM. High Occupancy does not always mean peak performance, but it is critical for hiding memory latency.

// Occupancy calculation API
int blockSize;
int minGridSize;

// Automatically calculate optimal block size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
printf("Optimal block size: %d\n", blockSize);
printf("Minimum grid size: %d\n", minGridSize);

// Check Occupancy for a specific block size
int maxActiveBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxActiveBlocks, myKernel, blockSize, 0);
printf("Max active blocks per SM: %d\n", maxActiveBlocks);
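The resource arithmetic behind occupancy can also be approximated host-side. A sketch under assumed per-SM limits (64 max warps, 65,536 registers, 48 KB shared memory, 32 max blocks — real limits vary by architecture, so treat this as a model, not a replacement for the API above):

```python
def occupancy(block_size, regs_per_thread, smem_per_block,
              max_warps=64, regs_per_sm=65536, smem_per_sm=48 * 1024,
              max_blocks=32):
    """Estimate occupancy as limited by warps, registers, and shared memory."""
    warps_per_block = (block_size + 31) // 32
    by_warps = max_warps // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * block_size)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_warps, by_regs, by_smem, max_blocks)
    return blocks * warps_per_block / max_warps

# 256-thread blocks, 32 regs/thread, no shared memory:
# registers allow 8 blocks (65536 / (32*256)), warps also allow 8 -> 100%
assert occupancy(256, 32, 0) == 1.0
# Doubling register pressure halves the resident blocks -> 50%
assert occupancy(256, 64, 0) == 0.5
```

The same trade-off is what the table below captures: the tightest of the three limits determines how many blocks an SM can host.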

Block size selection guidelines:

| Principle | Description |
|---|---|
| Multiple of 32 | Align with warp size to avoid idle threads |
| 128-512 recommended | Generally the optimal range |
| Consider registers / shared memory | Reduce block size if per-block resource usage is high |
| Use cudaOccupancyMaxPotentialBlockSize | Computes the optimal size automatically |

7.2 Minimizing Warp Divergence

When threads within a Warp take different branches, the paths are serialized: the Warp executes each taken path in turn, with non-participating threads masked off.

// Bad example: Warp Divergence
__global__ void divergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Branches based on even/odd threadIdx.x
        // -> Only half execute at a time within same Warp
        if (threadIdx.x % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
    }
}

// Good example: Branch at Warp granularity
__global__ void convergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / 32;  // Branch by Warp ID
    if (idx < n) {
        // All threads in the same Warp take the same path
        if (warpId % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
    }
}
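The cost difference between the two kernels can be modeled by counting execution passes per warp: a warp whose threads disagree must run every distinct path in sequence. A pure-Python sketch (the counting model is illustrative):

```python
def warp_passes(n_threads, branch, warp_size=32):
    """Count serialized execution passes summed over all warps.

    branch(tid) returns which path a thread takes; a warp whose
    threads take different paths executes each path one after the other.
    """
    total = 0
    for start in range(0, n_threads, warp_size):
        paths = {branch(t) for t in range(start, min(start + warp_size, n_threads))}
        total += len(paths)
    return total

# Even/odd branching: every warp contains both paths -> 2 passes per warp
assert warp_passes(256, lambda t: t % 2) == 16
# Warp-granularity branching: each warp takes one path -> 1 pass per warp
assert warp_passes(256, lambda t: (t // 32) % 2) == 8
```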

7.3 Shared Memory Tiling (Matrix Multiplication Optimization)

Tiling is a key technique to reduce Global Memory access count. We already saw the implementation in section 4.8's matMulTiled, and the performance difference is summarized below.

| Method | Global Memory Accesses (MxNxK matrix) | Relative Performance |
|---|---|---|
| Naive | 2 * M * N * K | 1x |
| Tiled (Shared Memory) | 2 * M * N * K / TILE_SIZE | ~TILE_SIZE x |

With TILE_SIZE of 16, Global Memory accesses are reduced by 16x.
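That reduction factor can be verified with a back-of-envelope calculator (the function is an illustrative model of the counts in the table, not GPU code):

```python
def global_accesses(M, N, K, tile=None):
    """Global-memory accesses for an MxK @ KxN matrix multiply.

    Naive: each of the M*N outputs reads K elements from A and K from B.
    Tiled: each loaded element is reused across a tile, cutting accesses
    by a factor of the tile size.
    """
    naive = 2 * M * N * K
    return naive if tile is None else naive // tile

# TILE_SIZE = 16 -> 16x fewer global memory accesses
assert global_accesses(1024, 1024, 1024) // global_accesses(1024, 1024, 1024, tile=16) == 16
```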

7.4 Loop Unrolling

// Manual Loop Unrolling
// (strides use blockDim.x so the kernel is correct for any block size;
//  hardcoding 256 would silently break for other block sizes)
__global__ void unrolled(float* data, int n) {
    int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;

    // 4x unrolling: each thread handles four elements, blockDim.x apart
    if (idx < n)                    data[idx] *= 2.0f;
    if (idx + blockDim.x < n)       data[idx + blockDim.x] *= 2.0f;
    if (idx + 2 * blockDim.x < n)   data[idx + 2 * blockDim.x] *= 2.0f;
    if (idx + 3 * blockDim.x < n)   data[idx + 3 * blockDim.x] *= 2.0f;
}

// Compiler directive-based Unrolling
__global__ void pragmaUnrolled(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    #pragma unroll 8
    for (int k = 0; k < N; k++) {
        sum += A[idx * N + k] * B[k];
    }
    C[idx] = sum;
}
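The 4x-unrolled indexing scheme can be checked for full, non-overlapping coverage with a host-side model (block and problem sizes are illustrative):

```python
def unrolled_indices(n, block_dim, unroll=4):
    """Indices touched by a 4x-unrolled grid, mirroring
    idx = blockIdx * blockDim * unroll + threadIdx, with offsets k * blockDim."""
    grid_dim = (n + block_dim * unroll - 1) // (block_dim * unroll)
    touched = []
    for block in range(grid_dim):
        for t in range(block_dim):
            base = block * block_dim * unroll + t
            for k in range(unroll):
                i = base + k * block_dim
                if i < n:  # same bounds checks as the guarded stores above
                    touched.append(i)
    return touched

# Every element in [0, n) is visited exactly once, even when n is not
# a multiple of block_dim * unroll
idx = unrolled_indices(1000, 256)
assert sorted(idx) == list(range(1000))
```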

7.5 Grid-Stride Loop Pattern

A flexible pattern used when data size exceeds total thread count.

__global__ void gridStrideKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Each thread processes multiple elements, striding by total thread count
    for (int i = idx; i < n; i += stride) {
        data[i] = data[i] * 2.0f;
    }
}

// Advantages:
// 1. Works regardless of data size
// 2. Block/grid sizes can be freely adjusted
// 3. Increased work per thread reduces kernel launch overhead
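The coverage property behind advantage 1 can be verified host-side: a fixed number of simulated threads covers an arbitrary n exactly once (grid/block sizes below are illustrative):

```python
def grid_stride_coverage(n, grid_dim, block_dim):
    """Elements each simulated thread touches under a grid-stride loop."""
    stride = grid_dim * block_dim
    touched = []
    for idx in range(stride):   # one entry per launched thread
        i = idx
        while i < n:            # same loop as the kernel: i += total threads
            touched.append(i)
            i += stride
    return touched

# 100,000 elements with only 48*256 = 12,288 threads:
# full coverage, no element visited twice
t = grid_stride_coverage(100_000, 48, 256)
assert len(t) == 100_000 and len(set(t)) == 100_000
```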

7.6 NVIDIA Nsight Profiling

# Nsight Compute (kernel-level profiling)
ncu --set full -o profile_report ./my_cuda_app

# Profile specific kernel only
ncu --kernel-name myKernel --launch-skip 0 --launch-count 1 ./my_cuda_app

# Nsight Systems (system-level profiling)
nsys profile --trace=cuda,nvtx -o timeline_report ./my_cuda_app

# Key metrics to check
# - SM Throughput: SM utilization
# - Memory Throughput: Memory bandwidth utilization
# - Achieved Occupancy: Actual Occupancy
# - Warp Stall Reasons: Warp stall causes
# - L1/L2 Hit Rate: Cache hit rate

Profiling checklist:

| Metric | Target | Action if problematic |
|---|---|---|
| Achieved Occupancy | 50% or higher | Adjust block size, register usage |
| Memory Throughput | 60%+ of theoretical BW | Improve coalesced access, caching |
| Compute Throughput | 60%+ of theoretical ops | Improve ILP, remove unnecessary ops |
| Warp Divergence | Minimize | Restructure branching logic |

8. CUDA in Python

8.1 PyCUDA Basics

PyCUDA allows you to write and execute CUDA C kernels directly from Python.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

# Define CUDA kernel (write C code as a string)
mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dest[idx] = a[idx] * b[idx];
    }
}
""")

multiply = mod.get_function("multiply")

# Prepare data
n = 1024 * 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
dest = np.zeros_like(a)

# Execute kernel
block_size = 256
grid_size = (n + block_size - 1) // block_size

multiply(
    cuda.Out(dest), cuda.In(a), cuda.In(b), np.int32(n),
    block=(block_size, 1, 1), grid=(grid_size, 1)
)

# Verify results
assert np.allclose(dest, a * b)
print("PyCUDA multiply: SUCCESS")

8.2 Numba CUDA JIT

Numba is the easiest way to JIT-compile Python functions into CUDA kernels.

from numba import cuda
import numpy as np
import math

# CUDA kernel definition
@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)  # 1D global index
    if idx < a.shape[0]:
        c[idx] = a[idx] + b[idx]

# Prepare data
n = 1_000_000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

# Transfer to Device
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

# Execute kernel
threads_per_block = 256
blocks_per_grid = math.ceil(n / threads_per_block)
vector_add[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

# Get results
result = d_c.copy_to_host()
assert np.allclose(result, a + b)
print("Numba CUDA vector_add: SUCCESS")


# 2D kernel example: matrix multiplication
@cuda.jit
def matmul_kernel(A, B, C):
    row, col = cuda.grid(2)  # 2D global index
    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.0
        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp

# Using Shared Memory
@cuda.jit
def matmul_shared(A, B, C):
    TILE = 16
    sA = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)
    sB = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    row = cuda.blockIdx.y * TILE + ty
    col = cuda.blockIdx.x * TILE + tx

    tmp = 0.0
    for t in range((A.shape[1] + TILE - 1) // TILE):
        if row < A.shape[0] and (t * TILE + tx) < A.shape[1]:
            sA[ty, tx] = A[row, t * TILE + tx]
        else:
            sA[ty, tx] = 0.0

        if (t * TILE + ty) < B.shape[0] and col < B.shape[1]:
            sB[ty, tx] = B[t * TILE + ty, col]
        else:
            sB[ty, tx] = 0.0

        cuda.syncthreads()
        for k in range(TILE):
            tmp += sA[ty, k] * sB[k, tx]
        cuda.syncthreads()

    if row < C.shape[0] and col < C.shape[1]:
        C[row, col] = tmp

8.3 CuPy (NumPy Replacement)

CuPy is a library that provides the same API as NumPy but runs on the GPU.

import cupy as cp
import numpy as np
import time

# Same API as NumPy
a_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)
b_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)

# GPU matrix multiplication
start = time.time()
c_gpu = cp.dot(a_gpu, b_gpu)
cp.cuda.Stream.null.synchronize()  # Wait for GPU work to complete
gpu_time = time.time() - start

# CPU comparison
a_cpu = cp.asnumpy(a_gpu)
b_cpu = cp.asnumpy(b_gpu)
start = time.time()
c_cpu = np.dot(a_cpu, b_cpu)
cpu_time = time.time() - start

print(f"GPU: {gpu_time:.4f}s, CPU: {cpu_time:.4f}s")
print(f"Speedup: {cpu_time / gpu_time:.1f}x")

# Custom CUDA kernel in CuPy
custom_kernel = cp.RawKernel(r'''
extern "C" __global__
void relu_kernel(const float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = fmaxf(input[idx], 0.0f);
    }
}
''', 'relu_kernel')

n = 1_000_000
x = cp.random.randn(n, dtype=cp.float32)
y = cp.zeros(n, dtype=cp.float32)

block_size = 256
grid_size = (n + block_size - 1) // block_size
custom_kernel((grid_size,), (block_size,), (x, y, n))

# Verification
assert cp.allclose(y, cp.maximum(x, 0))

8.4 PyTorch CUDA Operations

Here are practical patterns for leveraging CUDA in PyTorch.

import torch
import torch.nn as nn

# ===== Basic GPU Usage =====

# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")

# Move tensors to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Method 1: .to(device)
x = torch.randn(1000, 1000)
x_gpu = x.to(device)

# Method 2: .cuda()
x_gpu = x.cuda()

# Method 3: Create directly on GPU
x_gpu = torch.randn(1000, 1000, device='cuda')

# Specify a particular GPU
x_gpu1 = x.to('cuda:1')  # Second GPU

# ===== Memory Management =====

# Check memory usage
print(torch.cuda.memory_summary())
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Clear cache
torch.cuda.empty_cache()

# Memory snapshot (for debugging)
torch.cuda.memory._record_memory_history()
# ... run code ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")


# ===== Mixed Precision Training =====
from torch.amp import autocast, GradScaler

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler('cuda')

for data, target in dataloader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()

    # Automatic FP16 conversion with autocast
    with autocast('cuda'):
        output = model(data)
        loss = nn.functional.mse_loss(output, target)

    # FP16 gradient scaling with GradScaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Benefits of Mixed Precision:
# - ~50% reduction in memory usage
# - 1.5-3x faster training (leveraging Tensor Cores)
# - Minimal accuracy loss

8.5 TensorFlow GPU Configuration

import tensorflow as tf

# Check GPUs
print("GPUs available:", tf.config.list_physical_devices('GPU'))

# Incremental GPU memory allocation (recommended)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Or set memory limit
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=8192)]  # 8GB
)

# Use specific GPU
with tf.device('/GPU:0'):
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)

# Mixed Precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')

8.6 RAPIDS (cuDF, cuML) Introduction

RAPIDS is a collection of GPU-accelerated data science libraries provided by NVIDIA. As of 2025, the latest version is 26.02, and cuML offers 5-175x faster performance compared to scikit-learn.

# cuDF: GPU DataFrame replacement for pandas
import cudf

# Same API as pandas
df = cudf.read_csv('large_data.csv')
result = df.groupby('category').agg({'value': 'mean', 'count': 'sum'})
filtered = df[df['value'] > 100]

# Interconversion with pandas
import pandas as pd
pdf = df.to_pandas()   # cuDF -> pandas
gdf = cudf.from_pandas(pdf)  # pandas -> cuDF


# cuML: GPU ML replacement for scikit-learn
import cuml
from cuml.ensemble import RandomForestClassifier
from cuml.cluster import KMeans

# Same API as scikit-learn
rf = RandomForestClassifier(n_estimators=100, max_depth=16)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

# Zero-code-change acceleration (cuml.accel)
# Run scikit-learn code on GPU without modification
import cuml.accel
cuml.accel.install()

from sklearn.ensemble import RandomForestClassifier  # Automatically GPU-accelerated!

9. Hands-on Examples

9.1 Example 1: GPU Vector Operation Benchmark (Numba)

A practical example comparing CPU and GPU vector operation performance.

from numba import cuda, njit
import numpy as np
import time
import math

# CPU version (Numba JIT)
@njit
def vector_ops_cpu(a, b, c):
    for i in range(a.shape[0]):
        c[i] = math.sqrt(a[i] ** 2 + b[i] ** 2) * math.sin(a[i]) + math.log(abs(b[i]) + 1)

# GPU version (CUDA)
@cuda.jit
def vector_ops_gpu(a, b, c):
    idx = cuda.grid(1)
    if idx < a.shape[0]:
        c[idx] = math.sqrt(a[idx] ** 2 + b[idx] ** 2) * math.sin(a[idx]) + math.log(abs(b[idx]) + 1)

# Benchmark
sizes = [100_000, 1_000_000, 10_000_000, 100_000_000]

for n in sizes:
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    c_cpu = np.zeros(n, dtype=np.float32)
    c_gpu = np.zeros(n, dtype=np.float32)

    # CPU
    start = time.time()
    vector_ops_cpu(a, b, c_cpu)
    cpu_time = time.time() - start

    # GPU
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c_gpu)

    threads = 256
    blocks = math.ceil(n / threads)

    # Warm-up
    vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
    cuda.synchronize()

    start = time.time()
    vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
    cuda.synchronize()
    gpu_time = time.time() - start

    c_gpu = d_c.copy_to_host()

    print(f"N={n:>12,}: CPU={cpu_time:.4f}s, GPU={gpu_time:.4f}s, "
          f"Speedup={cpu_time/gpu_time:.1f}x")

9.2 Example 2: Image Processing Acceleration (CuPy)

An example of GPU-accelerated real-time image filtering.

import cupy as cp
import numpy as np
from PIL import Image

def gpu_gaussian_blur(image_array, kernel_size=5, sigma=1.0):
    """GPU Gaussian blur implementation"""
    # Create Gaussian kernel
    ax = cp.arange(-kernel_size // 2 + 1., kernel_size // 2 + 1.)
    xx, yy = cp.meshgrid(ax, ax)
    kernel = cp.exp(-(xx**2 + yy**2) / (2. * sigma**2))
    kernel = kernel / kernel.sum()

    img_gpu = cp.asarray(image_array, dtype=cp.float32)
    result = cp.zeros_like(img_gpu)

    # Process per channel with a GPU 2D convolution
    from cupyx.scipy.ndimage import convolve
    for c in range(img_gpu.shape[2]):
        # 'reflect' boundary handling replaces explicit padding
        result[:, :, c] = convolve(img_gpu[:, :, c], kernel, mode='reflect')

    return cp.asnumpy(result.clip(0, 255).astype(cp.uint8))


# Sobel Edge Detection implemented directly with CuPy RawKernel
sobel_kernel = cp.RawKernel(r'''
extern "C" __global__
void sobel_filter(const float* input, float* output,
                  int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
        // Sobel X
        float gx = -1.0f * input[(y-1)*width + (x-1)]
                   + 1.0f * input[(y-1)*width + (x+1)]
                   - 2.0f * input[y*width + (x-1)]
                   + 2.0f * input[y*width + (x+1)]
                   - 1.0f * input[(y+1)*width + (x-1)]
                   + 1.0f * input[(y+1)*width + (x+1)];

        // Sobel Y
        float gy = -1.0f * input[(y-1)*width + (x-1)]
                   - 2.0f * input[(y-1)*width + x]
                   - 1.0f * input[(y-1)*width + (x+1)]
                   + 1.0f * input[(y+1)*width + (x-1)]
                   + 2.0f * input[(y+1)*width + x]
                   + 1.0f * input[(y+1)*width + (x+1)];

        output[y * width + x] = sqrtf(gx * gx + gy * gy);
    }
}
''', 'sobel_filter')

def gpu_sobel_edge_detection(gray_image):
    """GPU Sobel edge detection"""
    h, w = gray_image.shape
    img_gpu = cp.asarray(gray_image, dtype=cp.float32)
    output_gpu = cp.zeros_like(img_gpu)

    block = (16, 16)
    grid = ((w + 15) // 16, (h + 15) // 16)

    sobel_kernel(grid, block, (img_gpu, output_gpu, w, h))

    return cp.asnumpy(output_gpu.clip(0, 255).astype(cp.uint8))

# Usage example
# img = np.array(Image.open('photo.jpg'))
# blurred = gpu_gaussian_blur(img, kernel_size=11, sigma=3.0)
# gray = np.mean(img, axis=2).astype(np.float32)
# edges = gpu_sobel_edge_detection(gray)

9.3 Example 3: PyTorch Model GPU Training

A complete PyTorch GPU training pipeline example.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.amp import autocast, GradScaler
import time

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Model definition
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
            nn.GELU(),
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
        )

    def forward(self, x):
        return x + self.net(x)

class DeepModel(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=512, num_classes=10, num_blocks=6):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden_dim) for _ in range(num_blocks)])
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.input_proj(x)
        x = self.blocks(x)
        return self.head(x)

# Data generation (in practice, load a dataset)
X_train = torch.randn(50000, 784)
y_train = torch.randint(0, 10, (50000,))
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=4, pin_memory=True)  # pin_memory is important!

# Move model to GPU
model = DeepModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scaler = GradScaler('cuda')

# Training loop (Mixed Precision)
def train_epoch(model, loader, criterion, optimizer, scaler, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)  # set_to_none=True is more efficient

        with autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

    return total_loss / len(loader), 100. * correct / total

# Run training
num_epochs = 10
for epoch in range(num_epochs):
    start = time.time()
    loss, acc = train_epoch(model, train_loader, criterion, optimizer, scaler, device)
    elapsed = time.time() - start

    print(f"Epoch {epoch+1}/{num_epochs}: Loss={loss:.4f}, "
          f"Acc={acc:.2f}%, Time={elapsed:.2f}s")

    # Print GPU memory status
    if device.type == 'cuda':
        alloc = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"  GPU Memory: {alloc:.2f} GB allocated, {reserved:.2f} GB reserved")

9.4 Example 4: Multi-GPU Training

DataParallel (Simple but Inefficient)

import torch
import torch.nn as nn

model = DeepModel().cuda()

# Apply DataParallel (one line!)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs with DataParallel")
    model = nn.DataParallel(model)

# Rest of training code is the same
# Drawback: Load imbalance with GPU 0 being the bottleneck

DistributedDataParallel (Recommended)

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size):
    setup(rank, world_size)

    # Place model on the corresponding GPU
    model = DeepModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Distributed data loader
    train_dataset = TensorDataset(X_train, y_train)
    sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(train_dataset, batch_size=256,
                              sampler=sampler, num_workers=4, pin_memory=True)

    optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    scaler = GradScaler('cuda')

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Shuffle each epoch
        for data, target in train_loader:
            data = data.to(rank, non_blocking=True)
            target = target.to(rank, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)
            with autocast('cuda'):
                output = ddp_model(data)
                loss = criterion(output, target)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        if rank == 0:
            print(f"Epoch {epoch+1} complete")

    cleanup()

# Execution option 1: torchrun (recommended)
#   torchrun --nproc_per_node=4 train_ddp.py
# Execution option 2: spawn processes manually (requires a main guard)
import torch.multiprocessing as mp

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True)

9.5 Example 5: Writing CUDA Kernels Directly (PyCUDA)

Implementing a practical Reduction (summation) kernel with PyCUDA.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

# Parallel Reduction kernel
mod = SourceModule("""
__global__ void parallel_reduce(float *input, float *output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // First level reduction (performed during global load)
    float sum = 0.0f;
    if (i < n) sum += input[i];
    if (i + blockDim.x < n) sum += input[i + blockDim.x];
    sdata[tid] = sum;

    __syncthreads();

    // Reduction in Shared Memory
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Warp-level reduction. Note: on Volta and later (independent thread
    // scheduling), the classic volatile trick alone is not safe, so
    // __syncwarp() is added between steps
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32]; __syncwarp();
        smem[tid] += smem[tid + 16]; __syncwarp();
        smem[tid] += smem[tid + 8];  __syncwarp();
        smem[tid] += smem[tid + 4];  __syncwarp();
        smem[tid] += smem[tid + 2];  __syncwarp();
        smem[tid] += smem[tid + 1];
    }

    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
""")

reduce_kernel = mod.get_function("parallel_reduce")

# Execution
n = 1 << 20  # 1M elements
data = np.random.randn(n).astype(np.float32)

block_size = 256
grid_size = (n + block_size * 2 - 1) // (block_size * 2)

d_input = cuda.mem_alloc(data.nbytes)
d_output = cuda.mem_alloc(grid_size * 4)  # float32

cuda.memcpy_htod(d_input, data)

reduce_kernel(d_input, d_output, np.int32(n),
              block=(block_size, 1, 1), grid=(grid_size, 1),
              shared=block_size * 4)

# Get partial sums
partial_sums = np.zeros(grid_size, dtype=np.float32)
cuda.memcpy_dtoh(partial_sums, d_output)
gpu_sum = partial_sums.sum()

# Verification
cpu_sum = data.sum()
print(f"GPU sum: {gpu_sum:.6f}")
print(f"CPU sum: {cpu_sum:.6f}")
print(f"Difference: {abs(gpu_sum - cpu_sum):.6f}")
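The two-level strategy (tree reduction per block, then a host-side sum of the partials) can be modeled in pure Python to see why it is correct (block size and padding policy are illustrative):

```python
def block_reduce(values):
    """Tree reduction of one block's values, mirroring the
    halving loop over shared memory."""
    vals = list(values)
    # pad to a power of two so the halving loop is exact
    while len(vals) & (len(vals) - 1):
        vals.append(0.0)
    s = len(vals) // 2
    while s >= 1:
        for tid in range(s):          # "threads" tid < s add their partner
            vals[tid] += vals[tid + s]
        s //= 2
    return vals[0]

def two_level_sum(data, block_size=256):
    """Block-level partial sums, then a final host-side pass."""
    partials = [block_reduce(data[i:i + block_size])
                for i in range(0, len(data), block_size)]
    return sum(partials)

data = [float(i) for i in range(1000)]
assert two_level_sum(data) == sum(data)
```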

10. CUDA Tools and Libraries

10.1 Core Library Ecosystem

| Library | Purpose | Key Features |
|---|---|---|
| cuDNN | Deep Learning | Optimized DNN primitives: convolution, RNN, attention, BatchNorm |
| cuBLAS | Linear Algebra | BLAS Level 1/2/3 operations: GEMM, TRSM; FP8/BF16 grouped GEMM support (Blackwell) |
| NCCL | Multi-GPU Comms | Collective communication: AllReduce, AllGather, Broadcast; NVLink/NVSwitch optimized |
| TensorRT | Inference Opt | Graph optimization, quantization (INT8/FP8), layer fusion, dynamic batching |
| Triton | GPU Programming | Python DSL for high-performance kernels; developed by OpenAI |
| CUTLASS | Custom GEMM | Template-based CUDA matrix multiplication library |
| FlashAttention | Attention Accel | IO-aware algorithm for Transformer attention acceleration |
| cuSPARSE | Sparse Matrices | Optimized sparse matrix operations |
| cuRAND | Random Numbers | GPU-accelerated pseudo/quasi-random number generation |
| cuFFT | FFT | GPU-accelerated Fast Fourier Transform |

10.2 cuDNN

# cuDNN is built into PyTorch/TensorFlow and used automatically

# Enable cuDNN benchmark mode in PyTorch
import torch
torch.backends.cudnn.benchmark = True   # Auto-select optimal algorithm
torch.backends.cudnn.deterministic = False  # Performance priority (when reproducibility is not needed)

# Check cuDNN version
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")

10.3 NCCL (Multi-GPU Communication)

# NCCL usage in PyTorch (internal to DistributedDataParallel)
import torch.distributed as dist

# Initialize NCCL backend
dist.init_process_group(backend='nccl')

# Key collective communication operations
# AllReduce: Sum tensors from all GPUs and distribute to each GPU
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

# AllGather: Gather tensors from all GPUs and copy the full set to each GPU
gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
dist.all_gather(gathered, tensor)

# Broadcast: Copy a tensor from one GPU to all GPUs
dist.broadcast(tensor, src=0)

10.4 TensorRT

# Inference optimization with TensorRT
import tensorrt as trt
import torch

# Convert PyTorch model to ONNX
model = MyModel().cuda().eval()
dummy_input = torch.randn(1, 3, 224, 224).cuda()
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=['input'], output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size'}})

# Simple conversion with torch_tensorrt (PyTorch 2.x)
import torch_tensorrt

optimized_model = torch_tensorrt.compile(model,
    inputs=[torch_tensorrt.Input(
        min_shape=[1, 3, 224, 224],
        opt_shape=[8, 3, 224, 224],
        max_shape=[32, 3, 224, 224],
        dtype=torch.float16
    )],
    enabled_precisions={torch.float16},
)

# Inference
with torch.no_grad():
    output = optimized_model(input_tensor.half().cuda())

# Typical TensorRT performance gains:
# - FP32 -> FP16: 2-3x faster
# - FP32 -> INT8: 3-5x faster
# - Layer fusion + optimization: Additional 20-50% improvement

10.5 OpenAI Triton

Triton is a programming language for writing high-performance GPU kernels in Python.

import triton
import triton.language as tl
import torch

# Triton kernel: vector addition
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y

    tl.store(output_ptr + offsets, output, mask=mask)

# Usage
def triton_add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

# Execution
x = torch.randn(1_000_000, device='cuda')
y = torch.randn(1_000_000, device='cuda')
result = triton_add(x, y)
assert torch.allclose(result, x + y)

# Advantages of Triton:
# - Higher-level abstraction than CUDA C
# - Automatic memory access optimization
# - Write high-performance kernels with Python syntax
# - Backend for PyTorch 2.0 torch.compile

10.6 FlashAttention

FlashAttention accelerates Transformer Self-Attention with an IO-aware algorithm.

# FlashAttention-3 (H100 optimized)
# Install: pip install flash-attn

from flash_attn import flash_attn_func

# Input: (batch, seqlen, nheads, headdim)
q = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
k = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
v = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)

# FlashAttention execution
output = flash_attn_func(q, k, v, causal=True)

# Performance comparison (FlashAttention-3 vs standard Attention):
# - Memory: O(N^2) -> O(N) (sequence length N)
# - Speed: 2-4x faster (up to 740 TFLOPS on H100 FP16)
# - FP8: up to 1.2 PFLOPS (H100)
# - FlashAttention-3 is 1.5-2.0x faster than FlashAttention-2

# PyTorch 2.0+ built-in scaled_dot_product_attention
output = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
    is_causal=True
)
# PyTorch automatically selects FlashAttention or Memory-Efficient Attention

11. CUDA Troubleshooting

11.1 CUDA OOM (Out of Memory) Solutions

GPU memory shortage is the most common CUDA issue.

# Error: RuntimeError: CUDA out of memory.

# Solution 1: Reduce batch size
# batch_size = 64 -> 32 -> 16

# Solution 2: Use Mixed Precision (~50% memory savings)
from torch.amp import autocast, GradScaler
scaler = GradScaler('cuda')

# Solution 3: Gradient Accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(loader):
    with autocast('cuda'):
        loss = model(data) / accumulation_steps
    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

# Solution 4: Gradient Checkpointing (memory-compute tradeoff)
from torch.utils.checkpoint import checkpoint
class MemEfficientModel(nn.Module):
    def forward(self, x):
        # Don't store intermediate activations; recompute during backward
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return x

# Solution 5: Clear cache
torch.cuda.empty_cache()

# Solution 6: Debug memory leaks
print(torch.cuda.memory_summary())

# Solution 7: Delete unnecessary tensors immediately
del large_tensor
torch.cuda.empty_cache()

# Solution 8: Disable gradients during inference
with torch.no_grad():
    output = model(input_data)

11.2 CUDA Driver vs Runtime Version Mismatch

# Problem: "CUDA driver version is insufficient for CUDA runtime version"

# How to check
nvidia-smi           # Check driver-supported CUDA version
nvcc --version       # Check installed CUDA Toolkit version

# Rule: Driver CUDA >= Toolkit CUDA
# Example: If driver supports CUDA 12.2, CUDA 13.0 Toolkit cannot be used

# Solution 1: Update driver
sudo apt-get update
sudo apt-get install nvidia-driver-560  # Install latest driver

# Solution 2: Downgrade to compatible CUDA Toolkit
sudo apt-get install cuda-toolkit-12-4

# Solution 3: Install matching CUDA build of PyTorch/TensorFlow
pip install torch --index-url https://download.pytorch.org/whl/cu124

11.3 nvidia-smi Key Commands Summary

| Command | Purpose |
|---|---|
| nvidia-smi | GPU status summary |
| nvidia-smi -l 1 | Refresh every 1 second |
| nvidia-smi -q | Detailed info |
| nvidia-smi -q -d MEMORY | Memory details |
| nvidia-smi -q -d CLOCK | Clock speeds |
| nvidia-smi -q -d TEMPERATURE | Temperature info |
| nvidia-smi -q -d POWER | Power consumption |
| nvidia-smi -q -d PERFORMANCE | Performance state |
| nvidia-smi --query-gpu=... --format=csv | CSV output |
| nvidia-smi pmon -i 0 | Process monitoring |
| nvidia-smi dmon -d 1 | Device monitoring |
| nvidia-smi topo -m | GPU topology (NVLink, etc.) |
| nvidia-smi -r -i 0 | Reset GPU 0 |
| nvidia-smi -pm 1 | Enable persistence mode |
| nvidia-smi -pl 300 | Set power limit (W) |

11.4 cuda-gdb Debugging

# CUDA debug build
nvcc -g -G -O0 -o debug_app app.cu

# Run cuda-gdb
cuda-gdb ./debug_app

# Key commands
(cuda-gdb) break myKernel            # Breakpoint on kernel
(cuda-gdb) run                        # Run
(cuda-gdb) cuda thread               # Current thread info
(cuda-gdb) cuda block                 # Current block info
(cuda-gdb) cuda kernel                # Current kernel info
(cuda-gdb) cuda thread (0,0,0)        # Switch to specific thread
(cuda-gdb) cuda block (1,0,0)         # Switch to specific block
(cuda-gdb) info cuda threads          # List active threads
(cuda-gdb) print threadIdx.x          # Print built-in variable
(cuda-gdb) print data[idx]            # Check data

# compute-sanitizer (memory error detection)
compute-sanitizer --tool memcheck ./my_app
compute-sanitizer --tool racecheck ./my_app   # Race condition detection
compute-sanitizer --tool initcheck ./my_app   # Initialization check

11.5 Common Error Messages and Solutions

| Error Message | Cause | Solution |
| --- | --- | --- |
| CUDA error: out of memory | GPU memory shortage | Reduce batch size, Mixed Precision, Gradient Checkpointing |
| CUDA error: device-side assert triggered | Assert failure or index out of bounds in kernel | Run with CUDA_LAUNCH_BLOCKING=1 to identify exact location |
| CUDA error: an illegal memory access | Invalid memory access | Debug with compute-sanitizer, check index bounds |
| CUDA error: no kernel image is available | Compute Capability mismatch | Recompile with correct -arch=sm_XX option |
| CUDA driver version is insufficient | Driver version too old | Update driver or downgrade CUDA Toolkit |
| CUDA error: invalid device function | Wrong architecture target | Check -gencode options, use Fat Binary |
| cuDNN error: CUDNN_STATUS_NOT_SUPPORTED | cuDNN version mismatch or unsupported op | Update cuDNN or check input format |
| NCCL error: unhandled system error | Multi-GPU communication failure | Check network config, enable NCCL_DEBUG=INFO |
# Debugging tip: Switch to synchronous execution to pinpoint exact error location
CUDA_LAUNCH_BLOCKING=1 python train.py

# NCCL debugging
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch train.py

# PyTorch CUDA memory debugging
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py
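These variables can also be set from inside a script rather than the shell, but they must be set before CUDA is initialized; a sketch using the same values as the commands above:

```python
import os

# Must run before the first CUDA call (in PyTorch: before CUDA is initialized),
# otherwise the settings are silently ignored.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"       # synchronous launches for exact error locations
os.environ["NCCL_DEBUG"] = "INFO"              # verbose NCCL logging
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # reduce fragmentation

# `import torch` and model setup would come after this point
```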

12. CUDA Alternative Technologies Comparison

12.1 Major GPU Computing Frameworks

| Technology | Developer | Target Hardware | Languages | Ecosystem Maturity | Primary Use |
| --- | --- | --- | --- | --- | --- |
| CUDA | NVIDIA | NVIDIA GPU | C/C++/Python | Very High | AI/ML, HPC, scientific computing |
| ROCm | AMD | AMD GPU (MI300X, etc.) | C/C++ (HIP) | Medium | AI training/inference (PyTorch supported) |
| OpenCL | Khronos | General (GPU/CPU/FPGA) | C/C++ | Medium | Cross-platform GPU computing |
| SYCL | Khronos | General | C++ | Growing | oneAPI (Intel), cross-platform |
| Metal | Apple | Apple Silicon | Swift/Obj-C/C++ | High (within Apple ecosystem) | macOS/iOS GPU computing |
| Vulkan Compute | Khronos | General GPU | C/GLSL/SPIR-V | Medium | Cross-platform GPU computation |
| Triton | OpenAI | NVIDIA/AMD GPU | Python | Growing | High-level GPU kernel programming |
| WebGPU | W3C | Browser GPU | WGSL/JS | Early | Web-based GPU computation |

12.2 CUDA vs ROCm Detailed Comparison

CUDA code:
  cudaMalloc(&d_ptr, size);
  myKernel<<<grid, block>>>(d_ptr);
  cudaDeviceSynchronize();

ROCm HIP code (nearly identical API):
  hipMalloc(&d_ptr, size);
  myKernel<<<grid, block>>>(d_ptr);
  hipDeviceSynchronize();

AMD ROCm's HIP (Heterogeneous-compute Interface for Portability) provides a very similar API to CUDA, and the hipify-perl tool can automatically convert CUDA code to HIP code. PyTorch officially supports ROCm, enabling training on AMD MI300X and similar hardware.
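This near one-to-one mapping is exactly what makes automatic conversion feasible. A toy sketch of the idea behind hipify (real hipify-perl and hipify-clang handle far more cases, including kernel launch syntax, headers, and library calls; the mapping table here covers only the calls shown above):

```python
import re

# Tiny illustrative subset of the CUDA -> HIP runtime API mapping
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpy": "hipMemcpy",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(src: str) -> str:
    """Replace known CUDA runtime calls with their HIP equivalents."""
    pattern = re.compile(r"\b(" + "|".join(CUDA_TO_HIP) + r")\b")
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(1)], src)

cuda_src = "cudaMalloc(&d_ptr, size);\nmyKernel<<<grid, block>>>(d_ptr);\ncudaDeviceSynchronize();"
print(toy_hipify(cuda_src))
```

Note that the kernel launch line passes through unchanged, mirroring the example above where the `<<<grid, block>>>` syntax is identical in HIP.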

12.3 Selection Guide

| Scenario | Recommended Technology | Reason |
| --- | --- | --- |
| AI/ML training and inference | CUDA | Optimized library ecosystem (cuDNN, TensorRT) |
| AMD GPU usage | ROCm (HIP) | CUDA-compatible API, PyTorch support |
| Cross-platform requirement | OpenCL or SYCL | Broad hardware support |
| Apple environment | Metal | Only option for macOS/iOS |
| Custom kernels (Python) | Triton | Higher productivity than CUDA C |
| Web browser GPU | WebGPU | Standard web API |

13. References

13.1 Official Documentation

  • NVIDIA CUDA Programming Guide: Official reference for GPU architecture and programming model
  • CUDA C++ Best Practices Guide: Collection of best practices for performance optimization
  • CUDA Toolkit Release Notes: Change logs and compatibility information for each version
  • NVIDIA Developer Blog: Latest technology trends and tutorials
| Resource | URL |
| --- | --- |
| CUDA Toolkit Download | https://developer.nvidia.com/cuda-downloads |
| CUDA Programming Guide | https://docs.nvidia.com/cuda/cuda-c-programming-guide/ |
| CUDA Best Practices | https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ |
| cuDNN Documentation | https://docs.nvidia.com/deeplearning/cudnn/ |
| TensorRT Documentation | https://docs.nvidia.com/deeplearning/tensorrt/ |
| NCCL Documentation | https://docs.nvidia.com/deeplearning/nccl/ |
| Nsight Compute | https://developer.nvidia.com/nsight-compute |
| Nsight Systems | https://developer.nvidia.com/nsight-systems |
| RAPIDS Official Site | https://rapids.ai/ |
| FlashAttention GitHub | https://github.com/Dao-AILab/flash-attention |
| OpenAI Triton GitHub | https://github.com/openai/triton |
| PyTorch CUDA Docs | https://pytorch.org/docs/stable/cuda.html |
13.2 Learning Roadmap

Beginner:
  1. Check GPU with nvidia-smi
  2. Use .cuda() in PyTorch
  3. Apply Mixed Precision Training
  4. Write simple kernels with Numba CUDA JIT

Intermediate:
  5. Write kernels directly in CUDA C/C++
  6. Optimize with Shared Memory, Coalesced Access
  7. Asynchronous execution with CUDA Streams
  8. Profile with Nsight Compute

Advanced:
  9. Custom Attention kernels with Triton
  10. Inference optimization with TensorRT
  11. Multi-GPU DDP training
  12. Custom GEMM implementation with CUTLASS

Conclusion

CUDA is more than a GPU programming tool -- it forms the foundation of modern AI infrastructure. Alongside hardware evolution from Ampere to Hopper to Blackwell, the CUDA Toolkit has entered the 13.x era, beginning to offer higher-level abstractions such as CUDA Tile and cuTile DSL.

The key points for effectively leveraging CUDA in practice are as follows.

  1. Build solid foundations: Understanding the Warp, Block, Grid execution model and memory hierarchy is essential for optimization
  2. Actively use Python tools: Most GPU acceleration can be achieved with PyTorch's Mixed Precision, CuPy, Numba, and similar tools
  3. Profile first: Identify bottlenecks with Nsight Compute/Systems before investing in optimization
  4. Memory is key: Memory optimization techniques like Coalesced Access, Shared Memory Tiling, and Pinned Memory typically account for the bulk of kernel-level performance improvements
  5. Leverage the ecosystem: Maximize use of already-optimized libraries like cuDNN, TensorRT, and FlashAttention

The world of GPU computing continues to evolve. The emergence of high-level programming models like Triton, the expansion of low-precision operations such as FP4/FP6, and the introduction of tile-based programming represented by CUDA Tile are all moving GPU programming toward lowering the barrier to entry while maximizing hardware performance. We hope this guide serves as a starting point for that journey.