CUDA Hands-on Complete Guide: Everything About GPU Computing


1. Introduction: What is CUDA

1.1 Definition of CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables general-purpose computing (GPGPU) by leveraging thousands of GPU cores. Since its initial release in 2007, it has become the de facto standard in various fields including AI, scientific simulation, image processing, and financial modeling.

As of 2025, the latest CUDA Toolkit is version 13.1, which introduces CUDA Tile, a tile-based programming model, and the cuTile Python DSL, providing Tensor Core abstraction and forward compatibility with next-generation Blackwell GPUs.

1.2 History of GPU Computing and the Emergence of CUDA

The evolution of GPU computing can be summarized by era as follows.

| Period | Event | Significance |
| --- | --- | --- |
| 2001 | GPGPU research begins | General-purpose computing attempts using shaders |
| 2006 | NVIDIA Tesla architecture announced | First unified shader architecture |
| 2007 | CUDA 1.0 released | Birth of GPU general-purpose computing |
| 2012 | AlexNet (ImageNet) | Dawn of the GPU deep learning era |
| 2017 | Volta + Tensor Core | Mixed-precision computation acceleration |
| 2020 | Ampere (A100) | TF32, Sparsity support |
| 2022 | Hopper (H100) | Transformer Engine, FP8 support |
| 2024 | Blackwell (B200) | FP4, 5th-gen Tensor Core |
| 2025 | CUDA 13.0/13.1 released | CUDA Tile, cuTile DSL introduced |

1.3 CPU vs GPU Architecture Comparison

CPUs and GPUs have fundamentally different design philosophies.

| Attribute | CPU | GPU |
| --- | --- | --- |
| Design goal | Low latency (serial processing) | High throughput (parallel processing) |
| Core count | A few to dozens | Thousands to tens of thousands |
| Clock speed | High (4-6 GHz) | Relatively low (1-2 GHz) |
| Cache size | Large (tens of MB) | Small (per core) |
| Control logic | Complex (branch prediction, OoO execution) | Simple (latency hiding through many threads) |
| Parallelism model | SIMD (Single Instruction Multiple Data) | SIMT (Single Instruction Multiple Threads) |
| Optimal workload | Complex branching, sequential logic | Large-scale data-parallel computation |

SIMD vs SIMT key difference: SIMD is a vector operation method that processes multiple data simultaneously with a single instruction. SIMT goes a step further, where each thread has its own program counter while executing the same instruction simultaneously. When branching occurs, threads within a Warp can execute different paths, but this causes Warp Divergence, which degrades performance.

1.4 Why CUDA is Essential for AI/ML/Deep Learning

The reasons CUDA is central to modern AI/ML workloads are clear.

  • Matrix computation acceleration: Dramatically accelerates matrix multiplication (GEMM), the core of deep learning, with Tensor Cores
  • Software ecosystem: Optimized library ecosystem including cuDNN, cuBLAS, NCCL, TensorRT
  • Framework support: All major frameworks (PyTorch, TensorFlow, JAX) use CUDA as their default backend
  • Mixed precision: Maximizes training/inference speed with low-precision computation down to FP16, BF16, FP8, and FP4
  • Multi-GPU scaling: Distributed training across hundreds of GPUs via NVLink and NVSwitch

2. GPU Architecture Fundamentals

2.1 NVIDIA GPU Internal Structure

Let us examine the core components of an NVIDIA GPU hierarchically.

GPU (composed of GPCs - Graphics Processing Clusters)
├── SM (Streaming Multiprocessor) x N
│   ├── CUDA Core (FP32/INT32) x 128 (Hopper)
│   ├── Tensor Core x 4 (4th gen, Hopper)
│   ├── RT Core (Ray Tracing) x 1 (consumer GPUs; absent on data center H100)
│   ├── Warp Scheduler x 4
│   ├── Register File (256 KB)
│   ├── L1 Cache / Shared Memory (shared, up to 228 KB)
│   └── SFU (Special Function Unit)
├── L2 Cache (shared)
├── Memory Controller
└── HBM (High Bandwidth Memory)

Key core types:

  • CUDA Core: The basic processing unit for general-purpose floating-point/integer operations. Handles FP32, FP64, INT32 operations
  • Tensor Core: A core specialized for Matrix Multiply-Accumulate (MMA) operations. The key accelerator for deep learning training/inference
  • RT Core: Dedicated hardware for Ray Tracing acceleration. Primarily used for graphics workloads

2.2 Memory Hierarchy

GPU memory is organized hierarchically by speed and size.

| Memory Type | Location | Size (H100) | Bandwidth | Access Scope | Characteristics |
| --- | --- | --- | --- | --- | --- |
| Register | Inside SM | 256 KB/SM | Fastest | Thread-private | Fastest but limited |
| Shared Memory | Inside SM | Up to 228 KB/SM | Very fast | Shared within Block | Programmer-managed cache |
| L1 Cache | Inside SM | Shared with Shared Memory | Very fast | SM-private | HW-managed |
| L2 Cache | GPU-wide | 50 MB | Fast | Shared across all SMs | HW-managed |
| Global Memory (HBM) | On package, off-chip | 80 GB | 3.35 TB/s | Globally accessible | Largest but highest latency |
| Constant Memory | GPU-wide | 64 KB | Fast when cached | Read-only | Broadcast-optimized |
| Texture Memory | GPU-wide | Shared with Global | Fast when cached | Read-only | Optimized for 2D spatial locality |
Speed:  Register > Shared/L1 > L2 > Global (HBM)
Size:   Global (HBM) > L2 > Shared/L1 > Register

2.3 Warp, Block, Grid Concepts

CUDA's execution model has a 3-level hierarchical structure.

Grid (kernel execution unit)
├── Block (0,0)          Block (1,0)          Block (2,0)
│   ├── Warp 0           ├── Warp 0           ├── Warp 0
│   │   ├── Thread 0     │   ├── Thread 0     │   ├── Thread 0
│   │   ├── Thread 1     │   ├── Thread 1     │   ├── Thread 1
│   │   ├── ...          │   ├── ...          │   ├── ...
│   │   └── Thread 31    │   └── Thread 31    │   └── Thread 31
│   ├── Warp 1           ├── Warp 1           ├── Warp 1
│   └── ...              └── ...              └── ...
├── Block (0,1)          Block (1,1)          ...
└── ...
  • Thread: The smallest execution unit of GPU computation
  • Warp: An execution group of 32 Threads. The unit that executes the same instruction simultaneously on an SM (the core of SIMT)
  • Block (Thread Block): Composed of multiple Warps. Executes on the same SM and shares Shared Memory
  • Grid: The collection of Blocks that make up the entire kernel execution

Key constraints:

| Item | Limit (Compute Capability 9.0) |
| --- | --- |
| Max Threads per Block | 1,024 |
| Warp size | 32 (fixed) |
| Max Block dimensions | (1024, 1024, 64) |
| Max Grid dimensions | (2^31 - 1, 65535, 65535) |
| Max Blocks per SM | 32 |
| Max Warps per SM | 64 |

2.4 Compute Capability Version Differences

Compute Capability (CC) defines the feature set of GPU hardware.

| CC | Architecture | Representative GPU | Key Features |
| --- | --- | --- | --- |
| 7.0 | Volta | V100 | 1st-gen Tensor Core, independent thread scheduling |
| 7.5 | Turing | RTX 2080 | INT8/INT4 Tensor Core, RT Core |
| 8.0 | Ampere | A100 | 3rd-gen Tensor Core, TF32, Sparsity |
| 8.6 | Ampere | RTX 3090 | Consumer Ampere |
| 8.9 | Ada Lovelace | RTX 4090 | 4th-gen Tensor Core, FP8, DLSS 3 |
| 9.0 | Hopper | H100 | 4th-gen Tensor Core, Transformer Engine, DPX |
| 10.0 | Blackwell | B200 | 5th-gen Tensor Core, FP4, TMEM |
| 12.0 | Blackwell Ultra | B300 | Enhanced 5th-gen Tensor Core, 288 GB HBM3E |

Note that starting with CUDA 13.0, support for Maxwell (CC 5.x), Pascal (CC 6.x), and Volta (CC 7.0) has been removed.

2.5 Latest GPU Generation Comparison

Comparing three generations of data center GPUs.

| Spec | A100 (SXM) | H100 (SXM5) | B200 (SXM) | B300 (SXM) |
| --- | --- | --- | --- | --- |
| Architecture | Ampere | Hopper | Blackwell | Blackwell Ultra |
| CUDA Cores | 6,912 | 16,896 | 18,432 | 18,432+ |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) | 5th gen | 5th gen |
| Memory | 80 GB HBM2e | 80 GB HBM3 | 180 GB HBM3E | 288 GB HBM3E |
| Memory BW | 2.0 TB/s | 3.35 TB/s | 7.7 TB/s | 8.0 TB/s |
| FP32 Perf | 19.5 TFLOPS | 60 TFLOPS | Undisclosed | Undisclosed |
| FP16 Tensor | 312 TFLOPS | 990 TFLOPS | Undisclosed | Undisclosed |
| FP4 Tensor | Not supported | Not supported | 9.0 PFLOPS | 14.0 PFLOPS |
| NVLink | 3rd gen (600 GB/s) | 4th gen (900 GB/s) | 5th gen (1.8 TB/s) | 5th gen (1.8 TB/s) |
| TDP | 400 W | 700 W | 1,000 W | 1,400 W |
| Cooling | Air/Liquid | Air/Liquid | Liquid recommended | Liquid required (DLC) |

The B200 achieves 3x training performance and 15x inference performance over the A100. The B300 (Blackwell Ultra) provides 14 PFLOPS in FP4 operations, 55.6% faster than the B200.


3. CUDA Development Environment Setup

3.1 CUDA Toolkit Installation

Linux (Ubuntu/Debian)

# 1. Check NVIDIA driver
nvidia-smi

# 2. Add CUDA Keyring (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# 3. Install CUDA Toolkit
sudo apt-get install cuda-toolkit-13-1

# 4. Set environment variables
export PATH=/usr/local/cuda-13.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.1/lib64:$LD_LIBRARY_PATH

# 5. Verify installation
nvcc --version

Windows

# 1. Download CUDA Toolkit from the NVIDIA official site
# https://developer.nvidia.com/cuda-downloads

# 2. Check environment variables after installation
echo %CUDA_PATH%
# Typically C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1

# 3. Verify
nvcc --version
nvidia-smi

3.2 nvidia-smi Command Usage

nvidia-smi is the essential tool for monitoring GPU status.

# Basic info output
nvidia-smi

# Output example:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 560.35.03    Driver Version: 560.35.03    CUDA Version: 13.1                 |
# |-----------------------------------------+------------------------+----------------------|
# | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
# |=========================================+========================+======================|
# |   0  NVIDIA H100 80GB HBM3          On | 00000000:3B:00.0   Off |                    0 |
# | N/A   32C    P0              72W / 700W |    1234MiB / 81559MiB  |      0%      Default |
# +-----------------------------------------+------------------------+----------------------+

# Continuous monitoring (1 second interval)
nvidia-smi -l 1

# dmon - GPU metrics monitoring
nvidia-smi dmon -s pucvmet -d 1

# Specific GPU info
nvidia-smi -i 0 -q

# GPU process monitoring
nvidia-smi pmon -i 0

# GPU clock and memory info
nvidia-smi -q -d CLOCK,MEMORY

# CSV format output (for scripting)
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
  --format=csv,noheader,nounits

3.3 nvcc Compiler Usage

nvcc is the NVIDIA CUDA Compiler for compiling CUDA source code.

# Basic compilation
nvcc -o hello hello.cu

# Target specific architecture
nvcc -arch=sm_90 -o kernel kernel.cu   # Hopper H100
nvcc -arch=sm_80 -o kernel kernel.cu   # Ampere A100

# Debug mode
nvcc -g -G -o debug_kernel kernel.cu

# Optimization level
nvcc -O3 -o optimized kernel.cu

# Generate PTX code (intermediate representation)
nvcc -ptx kernel.cu

# Support multiple architectures simultaneously (Fat Binary)
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -o multi_arch kernel.cu

# Library linking
nvcc -lcublas -lcurand -o linked_kernel kernel.cu

# Detailed compilation info
nvcc --resource-usage -o kernel kernel.cu

3.4 CUDA Version Check and Compatibility

There are two types of CUDA versions, and it is important not to confuse them.

# 1. Driver-supported CUDA version (Driver API)
nvidia-smi
# "CUDA Version: 13.1" shown in upper right
# This is the maximum CUDA runtime version the driver supports

# 2. CUDA Toolkit version (Runtime API)
nvcc --version
# Shows "release 13.1"
# The actually installed CUDA Toolkit version

# 3. Check at runtime
python3 -c "import torch; print(torch.version.cuda)"

Key compatibility rules:

  • The driver CUDA version must be greater than or equal to the Toolkit CUDA version
  • Example: If the driver supports CUDA 13.1, you can also use CUDA 12.x Toolkit (backward compatible)
  • According to CUDA Toolkit's Minor Version Compatibility policy, binary compatibility is maintained within the same Major version

3.5 Using CUDA in Docker

The NVIDIA Container Toolkit allows you to leverage GPUs in Docker containers.

# 1. Install NVIDIA Container Toolkit (Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# 2. Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 3. Run GPU container
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

# Use specific GPUs only
docker run --rm --gpus '"device=0,1"' nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

# GPU usage in docker compose
# docker-compose.yml:
services:
  gpu-app:
    image: nvidia/cuda:13.1.0-devel-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: nvidia-smi

3.6 Installing PyTorch/TensorFlow CUDA via conda/pip

# PyTorch (CUDA 12.4 build example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Using conda
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

# TensorFlow (auto GPU detection)
pip install tensorflow[and-cuda]

# Check CUDA availability (Python)
python3 -c "
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA version: {torch.version.cuda}')
print(f'GPU count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
    print(f'GPU name: {torch.cuda.get_device_name(0)}')
"

4. CUDA Programming Basics (C/C++)

4.1 Host (CPU) vs Device (GPU) Code

In CUDA programming, CPU-side code is called Host and GPU-side code is called Device.

┌─────────────────────────────────────────────────────────┐
│  Host (CPU)                                             │
│  ┌─────────────────────────────────────────────────────┐│
│  │ 1. Prepare data (Host Memory)                       ││
│  │ 2. Allocate GPU memory (cudaMalloc)                 ││
│  │ 3. Transfer data: Host -> Device (cudaMemcpy)       ││
│  │ 4. Launch kernel (<<<grid, block>>>)                ││
│  │ 5. Transfer results: Device -> Host (cudaMemcpy)    ││
│  │ 6. Free GPU memory (cudaFree)                       ││
│  └─────────────────────────────────────────────────────┘│
│                     ↕ PCIe / NVLink                     │
│  ┌─────────────────────────────────────────────────────┐│
│  │  Device (GPU)                                       ││
│  │  - Parallel execution of kernel functions           ││
│  │  - Thousands of threads processing simultaneously   ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘

4.2 Function Qualifiers

CUDA provides three function qualifiers.

| Qualifier | Execution Location | Callable From | Description |
| --- | --- | --- | --- |
| __global__ | Device (GPU) | Host (CPU) | Kernel function. Return type must be void |
| __device__ | Device (GPU) | Device (GPU) | Helper function callable only within GPU code |
| __host__ | Host (CPU) | Host (CPU) | Normal CPU function (default, can be omitted) |
| __host__ __device__ | Both | Both | Usable on both CPU and GPU |

// __global__: Kernel function - called from Host, executed on Device
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2.0f;
    }
}

// __device__: Device-only helper function
__device__ float square(float x) {
    return x * x;
}

// __host__ __device__: CPU/GPU compatible function
__host__ __device__ float add(float a, float b) {
    return a + b;
}

4.3 Kernel Launch Syntax

Kernel functions are launched using special triple angle bracket syntax.

// Kernel launch syntax
// kernel<<<gridDim, blockDim, sharedMemSize, stream>>>(args...);
//
// gridDim:       Number of Blocks in Grid (dim3)
// blockDim:      Number of Threads in Block (dim3)
// sharedMemSize: Dynamic Shared Memory size (bytes, optional)
// stream:        Execution stream (optional)

// 1D example: 256 threads, 1 block
myKernel<<<1, 256>>>(d_data, n);

// 1D example: Process N data elements with 256-thread blocks
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);

// 2D example: Image processing
dim3 blockDim(16, 16);       // 16x16 = 256 threads/block
dim3 gridDim(
    (width + 15) / 16,
    (height + 15) / 16
);
imageKernel<<<gridDim, blockDim>>>(d_image, width, height);

4.4 Thread Hierarchy

Each thread can determine its position through built-in variables.

// Built-in Variables
threadIdx.x, threadIdx.y, threadIdx.z  // Thread index within Block
blockIdx.x,  blockIdx.y,  blockIdx.z   // Block index within Grid
blockDim.x,  blockDim.y,  blockDim.z   // Block dimension sizes
gridDim.x,   gridDim.y,   gridDim.z    // Grid dimension sizes

// Global Thread ID calculation (1D)
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

// Global Thread ID calculation (2D)
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int globalIdx = row * width + col;

// Total thread count (for Grid-stride loop)
int totalThreads = gridDim.x * blockDim.x;

4.5 Memory Management Functions

// GPU memory allocation
float* d_data;
cudaMalloc((void**)&d_data, n * sizeof(float));

// Host -> Device copy
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

// Device -> Host copy
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

// GPU memory free
cudaFree(d_data);

// GPU memory initialization
cudaMemset(d_data, 0, n * sizeof(float));

// Error check macro (essential!)
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error at %s:%d - %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Usage example
CUDA_CHECK(cudaMalloc((void**)&d_data, n * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice));

4.6 Hello World Example

// hello_cuda.cu
#include <stdio.h>

__global__ void helloKernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU thread %d (block %d, thread %d)\n",
           tid, blockIdx.x, threadIdx.x);
}

int main() {
    // 2 blocks, 4 threads each = 8 threads total
    helloKernel<<<2, 4>>>();

    // Wait for GPU work to complete
    cudaDeviceSynchronize();

    printf("Hello from CPU!\n");
    return 0;
}
# Compile and run
nvcc -o hello hello_cuda.cu
./hello

4.7 Vector Addition Example

This is the "Hello World" of CUDA - a vector addition example.

// vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s at %s:%d\n", \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// GPU kernel: vector addition
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements
    size_t bytes = N * sizeof(float);

    // Host memory allocation and initialization
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    for (int i = 0; i < N; i++) {
        h_A[i] = (float)i;
        h_B[i] = (float)(i * 2);
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_C, bytes));

    // Host -> Device copy
    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    // Kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Error check
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Device -> Host copy
    CUDA_CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));

    // Result verification
    for (int i = 0; i < N; i++) {
        if (fabs(h_C[i] - (h_A[i] + h_B[i])) > 1e-5) {
            fprintf(stderr, "Verification failed at index %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }
    printf("Vector addition of %d elements: SUCCESS\n", N);

    // Memory cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

4.8 Matrix Multiplication Example

// matmul.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define TILE_SIZE 16

// Basic matrix multiplication kernel
__global__ void matMulBasic(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Tiled matrix multiplication using Shared Memory (optimized version)
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Iterate over tiles
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load tiles into Shared Memory
        if (row < M && (t * TILE_SIZE + threadIdx.x) < K)
            tileA[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if ((t * TILE_SIZE + threadIdx.y) < K && col < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // Wait for tile load to complete

        // Multiply-accumulate within tile
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();  // Synchronize before loading next tile
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}

int main() {
    const int M = 1024, N = 1024, K = 1024;

    // ... (memory allocation, initialization, data transfer omitted)

    dim3 blockDim(TILE_SIZE, TILE_SIZE);
    dim3 gridDim((N + TILE_SIZE - 1) / TILE_SIZE,
                 (M + TILE_SIZE - 1) / TILE_SIZE);

    matMulTiled<<<gridDim, blockDim>>>(d_A, d_B, d_C, M, N, K);

    // ... (result copy, verification, memory cleanup omitted)
    return 0;
}

5. CUDA Memory Management In-Depth

5.1 Global Memory and Coalesced Access

Coalesced Access is critical to performance when accessing Global Memory. When 32 threads in a Warp access contiguous memory addresses, the GPU combines them into a single transaction.

// Coalesced Access (good pattern)
// Threads in a Warp access contiguous memory
__global__ void coalesced(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // Contiguous access
    }
}

// Non-coalesced Access (bad pattern)
// Threads access non-contiguously with a stride
__global__ void strided(float* data, int n, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // Strided access
    }
}

// AoS (Array of Structures) vs SoA (Structure of Arrays)
// AoS: Non-coalesced (bad)
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
};
// Thread 0: particle[0].x, Thread 1: particle[1].x -> strided access

// SoA: Coalesced (good)
struct ParticlesSoA {
    float* x;  float* y;  float* z;
    float* vx; float* vy; float* vz;
};
// Thread 0: x[0], Thread 1: x[1] -> contiguous access

5.2 Shared Memory and Bank Conflict

Shared Memory is high-speed memory shared within a thread block on an SM. It is divided into 32 banks, and a Bank Conflict occurs when threads in the same Warp simultaneously access different addresses that map to the same bank (accesses to the same address are broadcast and do not conflict).

// Shared Memory usage example
__global__ void sharedMemExample(float* input, float* output, int n) {
    // Static allocation
    __shared__ float sharedData[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from Global -> Shared
    if (gid < n) {
        sharedData[tid] = input[gid];
    }
    __syncthreads();  // Wait until all threads have loaded

    // Compute on Shared Memory (neighbor element sum example)
    if (tid > 0 && tid < blockDim.x - 1 && gid < n) {
        output[gid] = sharedData[tid - 1] + sharedData[tid] + sharedData[tid + 1];
    }
}

// Bank Conflict avoidance: padding technique
// Problem: When accessing columns of a 32x32 matrix, all threads map to the same bank
__shared__ float tile[32][32];       // Bank Conflict!
__shared__ float tile[32][32 + 1];   // Padding avoids Bank Conflict

5.3 Unified Memory

Unified Memory is an abstraction layer that allows both CPU and GPU to access memory through the same pointer.

// Unified Memory usage
float* data;
cudaMallocManaged(&data, N * sizeof(float));

// Initialize on CPU
for (int i = 0; i < N; i++) {
    data[i] = (float)i;
}

// GPU kernel execution - no separate cudaMemcpy needed!
myKernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();

// Access results directly from CPU
printf("Result: %f\n", data[0]);

// Free
cudaFree(data);

// Memory prefetch hints (performance optimization)
cudaMemPrefetchAsync(data, N * sizeof(float), deviceId);  // Prefetch to GPU
cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId);  // Prefetch to CPU

5.4 Pinned Memory

Pinned Memory (Page-locked Memory) is host memory excluded from OS page swapping, which increases data transfer speed between GPU and host.

// Pinned Memory allocation
float* h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));  // or cudaHostAlloc

// Transfer speed comparison vs regular memory
// Pageable:  ~12 GB/s (PCIe Gen4)
// Pinned:    ~25 GB/s (PCIe Gen4) - approximately 2x faster

// Required for asynchronous transfers
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

// Free
cudaFreeHost(h_pinned);

5.5 Memory Pool

Memory Pool, introduced in CUDA 11.2, reduces overhead from repetitive memory allocation/deallocation.

// Stream-ordered Memory Allocator (CUDA 11.2+)
float* d_data;
cudaMallocAsync(&d_data, bytes, stream);

// Asynchronous free after use
cudaFreeAsync(d_data, stream);

// Memory Pool configuration
cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, deviceId);

// Set pool size limit
uint64_t threshold = 1ULL << 30;  // 1 GB
cudaMemPoolSetAttribute(mempool, cudaMemPoolAttrReleaseThreshold, &threshold);

5.6 Memory Bandwidth Optimization Summary

| Technique | Effect | Implementation Difficulty |
| --- | --- | --- |
| Coalesced Access | Maximize Global Memory throughput | Low |
| SoA layout | Guarantee Coalesced Access | Medium |
| Shared Memory Tiling | Reduce Global Memory access count | Medium |
| Bank Conflict avoidance (padding) | Maximize Shared Memory throughput | Low |
| Pinned Memory | ~2x faster Host-Device transfer | Low |
| Unified Memory + Prefetch | Programming convenience + performance | Low |
| Memory Pool | Eliminate allocation/deallocation overhead | Low |
| Texture Memory | Leverage 2D spatial locality | Medium |

6. CUDA Streams and Asynchronous Execution

6.1 CUDA Stream Concept

A CUDA Stream is a sequence of commands executed in order on the GPU. Operations on different Streams can execute concurrently.

Default Stream (Stream 0):
[MemcpyH2D] -> [Kernel A] -> [MemcpyD2H]
                                           (sequential, no pipeline)

Multi-Stream:
Stream 1: [MemcpyH2D_1] -> [Kernel_1] -> [MemcpyD2H_1]
Stream 2:     [MemcpyH2D_2] -> [Kernel_2] -> [MemcpyD2H_2]
Stream 3:         [MemcpyH2D_3] -> [Kernel_3] -> [MemcpyD2H_3]
                                           (overlapped execution, maximize GPU utilization)

6.2 Stream Creation and Usage

// Stream creation
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Asynchronous memory transfer (Pinned Memory required!)
cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream2);

// Kernel execution (on specific Stream)
kernelA<<<grid, block, 0, stream1>>>(d_A);
kernelB<<<grid, block, 0, stream2>>>(d_B);

// Asynchronous result copy
cudaMemcpyAsync(h_A, d_A, bytes, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_B, d_B, bytes, cudaMemcpyDeviceToHost, stream2);

// Stream synchronization
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Stream cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

6.3 Multi-Stream Pipeline Pattern

const int NUM_STREAMS = 4;
const int CHUNK_SIZE = N / NUM_STREAMS;

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
    cudaStreamCreate(&streams[i]);
}

// Pinned Memory allocation
float *h_in, *h_out;
cudaMallocHost(&h_in, N * sizeof(float));
cudaMallocHost(&h_out, N * sizeof(float));

float *d_in, *d_out;
cudaMalloc(&d_in, N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));

// Pipeline execution
for (int i = 0; i < NUM_STREAMS; i++) {
    int offset = i * CHUNK_SIZE;
    size_t chunkBytes = CHUNK_SIZE * sizeof(float);

    // 1. Host -> Device (async)
    cudaMemcpyAsync(d_in + offset, h_in + offset,
                    chunkBytes, cudaMemcpyHostToDevice, streams[i]);

    // 2. Kernel execution
    int blocks = (CHUNK_SIZE + 255) / 256;
    processKernel<<<blocks, 256, 0, streams[i]>>>(
        d_in + offset, d_out + offset, CHUNK_SIZE);

    // 3. Device -> Host (async)
    cudaMemcpyAsync(h_out + offset, d_out + offset,
                    chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for all to complete
cudaDeviceSynchronize();

6.4 CUDA Events (Timing Measurement)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Start timing
cudaEventRecord(start);

// Kernel execution
myKernel<<<grid, block>>>(d_data, N);

// Stop timing
cudaEventRecord(stop);
cudaEventSynchronize(stop);

// Calculate elapsed time
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %.3f ms\n", milliseconds);

// Bandwidth calculation
float bandwidth = (bytes_read + bytes_written) / (milliseconds * 1e6);
printf("Effective bandwidth: %.2f GB/s\n", bandwidth);

cudaEventDestroy(start);
cudaEventDestroy(stop);

6.5 Synchronization Function Comparison

| Function | Scope | Purpose |
| --- | --- | --- |
| cudaDeviceSynchronize() | Entire Device | Wait for all Streams to complete |
| cudaStreamSynchronize(stream) | Specific Stream | Wait for that Stream to complete |
| cudaEventSynchronize(event) | Specific Event | Wait for that Event to complete |
| cudaStreamWaitEvent(stream, event) | Between Streams | Stream waits for the Event before proceeding |
| __syncthreads() | Within Block | Synchronize threads within a Block (in kernel) |

7. CUDA Optimization Techniques

7.1 Occupancy Optimization

Occupancy is the ratio of actual active Warps to the maximum supported Warps on an SM. High Occupancy does not always mean peak performance, but it is critical for hiding memory latency.

// Occupancy calculation API
int blockSize;
int minGridSize;

// Automatically calculate optimal block size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
printf("Optimal block size: %d\n", blockSize);
printf("Minimum grid size: %d\n", minGridSize);

// Check Occupancy for a specific block size
int maxActiveBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxActiveBlocks, myKernel, blockSize, 0);
printf("Max active blocks per SM: %d\n", maxActiveBlocks);
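The resource arithmetic behind occupancy can also be approximated host-side. A sketch under assumed per-SM limits (64 max warps, 65,536 registers, 48 KB shared memory, 32 max blocks — real limits vary by architecture, so treat this as a model, not a replacement for the API above):

```python
def occupancy(block_size, regs_per_thread, smem_per_block,
              max_warps=64, regs_per_sm=65536, smem_per_sm=48 * 1024,
              max_blocks=32):
    """Estimate occupancy as limited by warps, registers, and shared memory."""
    warps_per_block = (block_size + 31) // 32
    by_warps = max_warps // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * block_size)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_warps, by_regs, by_smem, max_blocks)
    return blocks * warps_per_block / max_warps

# 256-thread blocks, 32 regs/thread, no shared memory:
# registers allow 8 blocks (65536 / (32*256)), warps also allow 8 -> 100%
assert occupancy(256, 32, 0) == 1.0
# Doubling register pressure halves the resident blocks -> 50%
assert occupancy(256, 64, 0) == 0.5
```

The same trade-off is what the table below captures: the tightest of the three limits determines how many blocks an SM can host.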

Block size selection guidelines:

| Principle | Description |
|---|---|
| Multiple of 32 | Align with warp size to avoid idle threads |
| 128-512 recommended | Generally the optimal range |
| Consider registers / shared memory | Reduce block size if per-block resource usage is high |
| Use cudaOccupancyMaxPotentialBlockSize | Computes the optimal size automatically |

7.2 Minimizing Warp Divergence

When threads within a Warp take different branches, the paths are serialized: the Warp executes each taken path in turn, with non-participating threads masked off.

// Bad example: Warp Divergence
__global__ void divergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Branches based on even/odd threadIdx.x
        // -> Only half execute at a time within same Warp
        if (threadIdx.x % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
    }
}

// Good example: Branch at Warp granularity
__global__ void convergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / 32;  // Branch by Warp ID
    if (idx < n) {
        // All threads in the same Warp take the same path
        if (warpId % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
    }
}
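The cost difference between the two kernels can be modeled by counting execution passes per warp: a warp whose threads disagree must run every distinct path in sequence. A pure-Python sketch (the counting model is illustrative):

```python
def warp_passes(n_threads, branch, warp_size=32):
    """Count serialized execution passes summed over all warps.

    branch(tid) returns which path a thread takes; a warp whose
    threads take different paths executes each path one after the other.
    """
    total = 0
    for start in range(0, n_threads, warp_size):
        paths = {branch(t) for t in range(start, min(start + warp_size, n_threads))}
        total += len(paths)
    return total

# Even/odd branching: every warp contains both paths -> 2 passes per warp
assert warp_passes(256, lambda t: t % 2) == 16
# Warp-granularity branching: each warp takes one path -> 1 pass per warp
assert warp_passes(256, lambda t: (t // 32) % 2) == 8
```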

7.3 Shared Memory Tiling (Matrix Multiplication Optimization)

Tiling is a key technique to reduce Global Memory access count. We already saw the implementation in section 4.8's matMulTiled, and the performance difference is summarized below.

| Method | Global Memory Accesses (MxNxK matrix) | Relative Performance |
|---|---|---|
| Naive | 2 * M * N * K | 1x |
| Tiled (Shared Memory) | 2 * M * N * K / TILE_SIZE | ~TILE_SIZE x |

With TILE_SIZE of 16, Global Memory accesses are reduced by 16x.
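That reduction factor can be verified with a back-of-envelope calculator (the function is an illustrative model of the counts in the table, not GPU code):

```python
def global_accesses(M, N, K, tile=None):
    """Global-memory accesses for an MxK @ KxN matrix multiply.

    Naive: each of the M*N outputs reads K elements from A and K from B.
    Tiled: each loaded element is reused across a tile, cutting accesses
    by a factor of the tile size.
    """
    naive = 2 * M * N * K
    return naive if tile is None else naive // tile

# TILE_SIZE = 16 -> 16x fewer global memory accesses
assert global_accesses(1024, 1024, 1024) // global_accesses(1024, 1024, 1024, tile=16) == 16
```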

7.4 Loop Unrolling

// Manual Loop Unrolling
// (strides use blockDim.x so the kernel is correct for any block size;
//  hardcoding 256 would silently break for other block sizes)
__global__ void unrolled(float* data, int n) {
    int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;

    // 4x unrolling: each thread handles four elements, blockDim.x apart
    if (idx < n)                    data[idx] *= 2.0f;
    if (idx + blockDim.x < n)       data[idx + blockDim.x] *= 2.0f;
    if (idx + 2 * blockDim.x < n)   data[idx + 2 * blockDim.x] *= 2.0f;
    if (idx + 3 * blockDim.x < n)   data[idx + 3 * blockDim.x] *= 2.0f;
}

// Compiler directive-based Unrolling
__global__ void pragmaUnrolled(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    #pragma unroll 8
    for (int k = 0; k < N; k++) {
        sum += A[idx * N + k] * B[k];
    }
    C[idx] = sum;
}
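The 4x-unrolled indexing scheme can be checked for full, non-overlapping coverage with a host-side model (block and problem sizes are illustrative):

```python
def unrolled_indices(n, block_dim, unroll=4):
    """Indices touched by a 4x-unrolled grid, mirroring
    idx = blockIdx * blockDim * unroll + threadIdx, with offsets k * blockDim."""
    grid_dim = (n + block_dim * unroll - 1) // (block_dim * unroll)
    touched = []
    for block in range(grid_dim):
        for t in range(block_dim):
            base = block * block_dim * unroll + t
            for k in range(unroll):
                i = base + k * block_dim
                if i < n:  # same bounds checks as the guarded stores above
                    touched.append(i)
    return touched

# Every element in [0, n) is visited exactly once, even when n is not
# a multiple of block_dim * unroll
idx = unrolled_indices(1000, 256)
assert sorted(idx) == list(range(1000))
```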

7.5 Grid-Stride Loop Pattern

A flexible pattern used when data size exceeds total thread count.

__global__ void gridStrideKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Each thread processes multiple elements, striding by total thread count
    for (int i = idx; i < n; i += stride) {
        data[i] = data[i] * 2.0f;
    }
}

// Advantages:
// 1. Works regardless of data size
// 2. Block/grid sizes can be freely adjusted
// 3. Increased work per thread reduces kernel launch overhead
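The coverage property behind advantage 1 can be verified host-side: a fixed number of simulated threads covers an arbitrary n exactly once (grid/block sizes below are illustrative):

```python
def grid_stride_coverage(n, grid_dim, block_dim):
    """Elements each simulated thread touches under a grid-stride loop."""
    stride = grid_dim * block_dim
    touched = []
    for idx in range(stride):   # one entry per launched thread
        i = idx
        while i < n:            # same loop as the kernel: i += total threads
            touched.append(i)
            i += stride
    return touched

# 100,000 elements with only 48*256 = 12,288 threads:
# full coverage, no element visited twice
t = grid_stride_coverage(100_000, 48, 256)
assert len(t) == 100_000 and len(set(t)) == 100_000
```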

7.6 NVIDIA Nsight Profiling

# Nsight Compute (kernel-level profiling)
ncu --set full -o profile_report ./my_cuda_app

# Profile specific kernel only
ncu --kernel-name myKernel --launch-skip 0 --launch-count 1 ./my_cuda_app

# Nsight Systems (system-level profiling)
nsys profile --trace=cuda,nvtx -o timeline_report ./my_cuda_app

# Key metrics to check
# - SM Throughput: SM utilization
# - Memory Throughput: Memory bandwidth utilization
# - Achieved Occupancy: Actual Occupancy
# - Warp Stall Reasons: Warp stall causes
# - L1/L2 Hit Rate: Cache hit rate

Profiling checklist:

| Metric | Target | Action if problematic |
|---|---|---|
| Achieved Occupancy | 50% or higher | Adjust block size, register usage |
| Memory Throughput | 60%+ of theoretical BW | Improve coalesced access, caching |
| Compute Throughput | 60%+ of theoretical ops | Improve ILP, remove unnecessary ops |
| Warp Divergence | Minimize | Restructure branching logic |

8. CUDA in Python

8.1 PyCUDA Basics

PyCUDA allows you to write and execute CUDA C kernels directly from Python.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

# Define CUDA kernel (write C code as a string)
mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dest[idx] = a[idx] * b[idx];
    }
}
""")

multiply = mod.get_function("multiply")

# Prepare data
n = 1024 * 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
dest = np.zeros_like(a)

# Execute kernel
block_size = 256
grid_size = (n + block_size - 1) // block_size

multiply(
    cuda.Out(dest), cuda.In(a), cuda.In(b), np.int32(n),
    block=(block_size, 1, 1), grid=(grid_size, 1)
)

# Verify results
assert np.allclose(dest, a * b)
print("PyCUDA multiply: SUCCESS")

8.2 Numba CUDA JIT

Numba is the easiest way to JIT-compile Python functions into CUDA kernels.

from numba import cuda
import numpy as np
import math

# CUDA kernel definition
@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)  # 1D global index
    if idx < a.shape[0]:
        c[idx] = a[idx] + b[idx]

# Prepare data
n = 1_000_000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

# Transfer to Device
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

# Execute kernel
threads_per_block = 256
blocks_per_grid = math.ceil(n / threads_per_block)
vector_add[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

# Get results
result = d_c.copy_to_host()
assert np.allclose(result, a + b)
print("Numba CUDA vector_add: SUCCESS")


# 2D kernel example: matrix multiplication
@cuda.jit
def matmul_kernel(A, B, C):
    row, col = cuda.grid(2)  # 2D global index
    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.0
        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp

# Using Shared Memory
@cuda.jit
def matmul_shared(A, B, C):
    TILE = 16
    sA = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)
    sB = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    row = cuda.blockIdx.y * TILE + ty
    col = cuda.blockIdx.x * TILE + tx

    tmp = 0.0
    for t in range((A.shape[1] + TILE - 1) // TILE):
        if row < A.shape[0] and (t * TILE + tx) < A.shape[1]:
            sA[ty, tx] = A[row, t * TILE + tx]
        else:
            sA[ty, tx] = 0.0

        if (t * TILE + ty) < B.shape[0] and col < B.shape[1]:
            sB[ty, tx] = B[t * TILE + ty, col]
        else:
            sB[ty, tx] = 0.0

        cuda.syncthreads()
        for k in range(TILE):
            tmp += sA[ty, k] * sB[k, tx]
        cuda.syncthreads()

    if row < C.shape[0] and col < C.shape[1]:
        C[row, col] = tmp

8.3 CuPy (NumPy Replacement)

CuPy is a library that provides the same API as NumPy but runs on the GPU.

import cupy as cp
import numpy as np
import time

# Same API as NumPy
a_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)
b_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)

# GPU matrix multiplication
start = time.time()
c_gpu = cp.dot(a_gpu, b_gpu)
cp.cuda.Stream.null.synchronize()  # Wait for GPU work to complete
gpu_time = time.time() - start

# CPU comparison
a_cpu = cp.asnumpy(a_gpu)
b_cpu = cp.asnumpy(b_gpu)
start = time.time()
c_cpu = np.dot(a_cpu, b_cpu)
cpu_time = time.time() - start

print(f"GPU: {gpu_time:.4f}s, CPU: {cpu_time:.4f}s")
print(f"Speedup: {cpu_time / gpu_time:.1f}x")

# Custom CUDA kernel in CuPy
custom_kernel = cp.RawKernel(r'''
extern "C" __global__
void relu_kernel(const float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = fmaxf(input[idx], 0.0f);
    }
}
''', 'relu_kernel')

n = 1_000_000
x = cp.random.randn(n, dtype=cp.float32)
y = cp.zeros(n, dtype=cp.float32)

block_size = 256
grid_size = (n + block_size - 1) // block_size
custom_kernel((grid_size,), (block_size,), (x, y, n))

# Verification
assert cp.allclose(y, cp.maximum(x, 0))

8.4 PyTorch CUDA Operations

Here are practical patterns for leveraging CUDA in PyTorch.

import torch
import torch.nn as nn

# ===== Basic GPU Usage =====

# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")

# Move tensors to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Method 1: .to(device)
x = torch.randn(1000, 1000)
x_gpu = x.to(device)

# Method 2: .cuda()
x_gpu = x.cuda()

# Method 3: Create directly on GPU
x_gpu = torch.randn(1000, 1000, device='cuda')

# Specify a particular GPU
x_gpu1 = x.to('cuda:1')  # Second GPU

# ===== Memory Management =====

# Check memory usage
print(torch.cuda.memory_summary())
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Clear cache
torch.cuda.empty_cache()

# Memory snapshot (for debugging)
torch.cuda.memory._record_memory_history()
# ... run code ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")


# ===== Mixed Precision Training =====
from torch.amp import autocast, GradScaler

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler('cuda')

for data, target in dataloader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()

    # Automatic FP16 conversion with autocast
    with autocast('cuda'):
        output = model(data)
        loss = nn.functional.mse_loss(output, target)

    # FP16 gradient scaling with GradScaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Benefits of Mixed Precision:
# - ~50% reduction in memory usage
# - 1.5-3x faster training (leveraging Tensor Cores)
# - Minimal accuracy loss

8.5 TensorFlow GPU Configuration

import tensorflow as tf

# Check GPUs
print("GPUs available:", tf.config.list_physical_devices('GPU'))

# Incremental GPU memory allocation (recommended)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Or set memory limit
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=8192)]  # 8GB
)

# Use specific GPU
with tf.device('/GPU:0'):
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)

# Mixed Precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')

8.6 RAPIDS (cuDF, cuML) Introduction

RAPIDS is a collection of GPU-accelerated data science libraries provided by NVIDIA. As of 2025, the latest version is 26.02, and cuML offers 5-175x faster performance compared to scikit-learn.

# cuDF: GPU DataFrame replacement for pandas
import cudf

# Same API as pandas
df = cudf.read_csv('large_data.csv')
result = df.groupby('category').agg({'value': 'mean', 'count': 'sum'})
filtered = df[df['value'] > 100]

# Interconversion with pandas
import pandas as pd
pdf = df.to_pandas()   # cuDF -> pandas
gdf = cudf.from_pandas(pdf)  # pandas -> cuDF


# cuML: GPU ML replacement for scikit-learn
import cuml
from cuml.ensemble import RandomForestClassifier
from cuml.cluster import KMeans

# Same API as scikit-learn
rf = RandomForestClassifier(n_estimators=100, max_depth=16)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

# Zero-code-change acceleration (cuml.accel)
# Run scikit-learn code on GPU without modification
import cuml.accel
cuml.accel.install()

from sklearn.ensemble import RandomForestClassifier  # Automatically GPU-accelerated!

9. Hands-on Examples

9.1 Example 1: GPU Vector Operation Benchmark (Numba)

A practical example comparing CPU and GPU vector operation performance.

from numba import cuda, njit
import numpy as np
import time
import math

# CPU version (Numba JIT)
@njit
def vector_ops_cpu(a, b, c):
    for i in range(a.shape[0]):
        c[i] = math.sqrt(a[i] ** 2 + b[i] ** 2) * math.sin(a[i]) + math.log(abs(b[i]) + 1)

# GPU version (CUDA)
@cuda.jit
def vector_ops_gpu(a, b, c):
    idx = cuda.grid(1)
    if idx < a.shape[0]:
        c[idx] = math.sqrt(a[idx] ** 2 + b[idx] ** 2) * math.sin(a[idx]) + math.log(abs(b[idx]) + 1)

# Benchmark
sizes = [100_000, 1_000_000, 10_000_000, 100_000_000]

for n in sizes:
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    c_cpu = np.zeros(n, dtype=np.float32)
    c_gpu = np.zeros(n, dtype=np.float32)

    # CPU
    start = time.time()
    vector_ops_cpu(a, b, c_cpu)
    cpu_time = time.time() - start

    # GPU
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c_gpu)

    threads = 256
    blocks = math.ceil(n / threads)

    # Warm-up
    vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
    cuda.synchronize()

    start = time.time()
    vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
    cuda.synchronize()
    gpu_time = time.time() - start

    c_gpu = d_c.copy_to_host()

    print(f"N={n:>12,}: CPU={cpu_time:.4f}s, GPU={gpu_time:.4f}s, "
          f"Speedup={cpu_time/gpu_time:.1f}x")

9.2 Example 2: Image Processing Acceleration (CuPy)

An example of GPU-accelerated real-time image filtering.

import cupy as cp
import numpy as np
from PIL import Image

def gpu_gaussian_blur(image_array, kernel_size=5, sigma=1.0):
    """GPU Gaussian blur implementation"""
    # Create Gaussian kernel
    ax = cp.arange(-kernel_size // 2 + 1., kernel_size // 2 + 1.)
    xx, yy = cp.meshgrid(ax, ax)
    kernel = cp.exp(-(xx**2 + yy**2) / (2. * sigma**2))
    kernel = kernel / kernel.sum()

    img_gpu = cp.asarray(image_array, dtype=cp.float32)
    result = cp.zeros_like(img_gpu)

    # Process per channel with a GPU 2D convolution
    from cupyx.scipy.ndimage import convolve
    for c in range(img_gpu.shape[2]):
        # 'reflect' boundary handling replaces explicit padding
        result[:, :, c] = convolve(img_gpu[:, :, c], kernel, mode='reflect')

    return cp.asnumpy(result.clip(0, 255).astype(cp.uint8))


# Sobel Edge Detection implemented directly with CuPy RawKernel
sobel_kernel = cp.RawKernel(r'''
extern "C" __global__
void sobel_filter(const float* input, float* output,
                  int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
        // Sobel X
        float gx = -1.0f * input[(y-1)*width + (x-1)]
                   + 1.0f * input[(y-1)*width + (x+1)]
                   - 2.0f * input[y*width + (x-1)]
                   + 2.0f * input[y*width + (x+1)]
                   - 1.0f * input[(y+1)*width + (x-1)]
                   + 1.0f * input[(y+1)*width + (x+1)];

        // Sobel Y
        float gy = -1.0f * input[(y-1)*width + (x-1)]
                   - 2.0f * input[(y-1)*width + x]
                   - 1.0f * input[(y-1)*width + (x+1)]
                   + 1.0f * input[(y+1)*width + (x-1)]
                   + 2.0f * input[(y+1)*width + x]
                   + 1.0f * input[(y+1)*width + (x+1)];

        output[y * width + x] = sqrtf(gx * gx + gy * gy);
    }
}
''', 'sobel_filter')

def gpu_sobel_edge_detection(gray_image):
    """GPU Sobel edge detection"""
    h, w = gray_image.shape
    img_gpu = cp.asarray(gray_image, dtype=cp.float32)
    output_gpu = cp.zeros_like(img_gpu)

    block = (16, 16)
    grid = ((w + 15) // 16, (h + 15) // 16)

    sobel_kernel(grid, block, (img_gpu, output_gpu, w, h))

    return cp.asnumpy(output_gpu.clip(0, 255).astype(cp.uint8))

# Usage example
# img = np.array(Image.open('photo.jpg'))
# blurred = gpu_gaussian_blur(img, kernel_size=11, sigma=3.0)
# gray = np.mean(img, axis=2).astype(np.float32)
# edges = gpu_sobel_edge_detection(gray)

9.3 Example 3: PyTorch Model GPU Training

A complete PyTorch GPU training pipeline example.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.amp import autocast, GradScaler
import time

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Model definition
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
            nn.GELU(),
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
        )

    def forward(self, x):
        return x + self.net(x)

class DeepModel(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=512, num_classes=10, num_blocks=6):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden_dim) for _ in range(num_blocks)])
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.input_proj(x)
        x = self.blocks(x)
        return self.head(x)

# Data generation (in practice, load a dataset)
X_train = torch.randn(50000, 784)
y_train = torch.randint(0, 10, (50000,))
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=4, pin_memory=True)  # pin_memory is important!

# Move model to GPU
model = DeepModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scaler = GradScaler('cuda')

# Training loop (Mixed Precision)
def train_epoch(model, loader, criterion, optimizer, scaler, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)  # set_to_none=True is more efficient

        with autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

    return total_loss / len(loader), 100. * correct / total

# Run training
num_epochs = 10
for epoch in range(num_epochs):
    start = time.time()
    loss, acc = train_epoch(model, train_loader, criterion, optimizer, scaler, device)
    elapsed = time.time() - start

    print(f"Epoch {epoch+1}/{num_epochs}: Loss={loss:.4f}, "
          f"Acc={acc:.2f}%, Time={elapsed:.2f}s")

    # Print GPU memory status
    if device.type == 'cuda':
        alloc = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"  GPU Memory: {alloc:.2f} GB allocated, {reserved:.2f} GB reserved")

9.4 Example 4: Multi-GPU Training

DataParallel (Simple but Inefficient)

import torch
import torch.nn as nn

model = DeepModel().cuda()

# Apply DataParallel (one line!)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs with DataParallel")
    model = nn.DataParallel(model)

# Rest of training code is the same
# Drawback: Load imbalance with GPU 0 being the bottleneck

DistributedDataParallel (Recommended)

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size):
    setup(rank, world_size)

    # Place model on the corresponding GPU
    model = DeepModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Distributed data loader
    train_dataset = TensorDataset(X_train, y_train)
    sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(train_dataset, batch_size=256,
                              sampler=sampler, num_workers=4, pin_memory=True)

    optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    scaler = GradScaler('cuda')

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Shuffle each epoch
        for data, target in train_loader:
            data = data.to(rank, non_blocking=True)
            target = target.to(rank, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)
            with autocast('cuda'):
                output = ddp_model(data)
                loss = criterion(output, target)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        if rank == 0:
            print(f"Epoch {epoch+1} complete")

    cleanup()

# Execution option 1: torchrun (recommended)
#   torchrun --nproc_per_node=4 train_ddp.py
# Execution option 2: spawn processes manually (requires a main guard)
import torch.multiprocessing as mp

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True)

9.5 Example 5: Writing CUDA Kernels Directly (PyCUDA)

Implementing a practical Reduction (summation) kernel with PyCUDA.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

# Parallel Reduction kernel
mod = SourceModule("""
__global__ void parallel_reduce(float *input, float *output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // First level reduction (performed during global load)
    float sum = 0.0f;
    if (i < n) sum += input[i];
    if (i + blockDim.x < n) sum += input[i + blockDim.x];
    sdata[tid] = sum;

    __syncthreads();

    // Reduction in Shared Memory
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Warp-level reduction. Note: on Volta and later (independent thread
    // scheduling), the classic volatile trick alone is not safe, so
    // __syncwarp() is added between steps
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32]; __syncwarp();
        smem[tid] += smem[tid + 16]; __syncwarp();
        smem[tid] += smem[tid + 8];  __syncwarp();
        smem[tid] += smem[tid + 4];  __syncwarp();
        smem[tid] += smem[tid + 2];  __syncwarp();
        smem[tid] += smem[tid + 1];
    }

    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
""")

reduce_kernel = mod.get_function("parallel_reduce")

# Execution
n = 1 << 20  # 1M elements
data = np.random.randn(n).astype(np.float32)

block_size = 256
grid_size = (n + block_size * 2 - 1) // (block_size * 2)

d_input = cuda.mem_alloc(data.nbytes)
d_output = cuda.mem_alloc(grid_size * 4)  # float32

cuda.memcpy_htod(d_input, data)

reduce_kernel(d_input, d_output, np.int32(n),
              block=(block_size, 1, 1), grid=(grid_size, 1),
              shared=block_size * 4)

# Get partial sums
partial_sums = np.zeros(grid_size, dtype=np.float32)
cuda.memcpy_dtoh(partial_sums, d_output)
gpu_sum = partial_sums.sum()

# Verification
cpu_sum = data.sum()
print(f"GPU sum: {gpu_sum:.6f}")
print(f"CPU sum: {cpu_sum:.6f}")
print(f"Difference: {abs(gpu_sum - cpu_sum):.6f}")
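The two-level strategy (tree reduction per block, then a host-side sum of the partials) can be modeled in pure Python to see why it is correct (block size and padding policy are illustrative):

```python
def block_reduce(values):
    """Tree reduction of one block's values, mirroring the
    halving loop over shared memory."""
    vals = list(values)
    # pad to a power of two so the halving loop is exact
    while len(vals) & (len(vals) - 1):
        vals.append(0.0)
    s = len(vals) // 2
    while s >= 1:
        for tid in range(s):          # "threads" tid < s add their partner
            vals[tid] += vals[tid + s]
        s //= 2
    return vals[0]

def two_level_sum(data, block_size=256):
    """Block-level partial sums, then a final host-side pass."""
    partials = [block_reduce(data[i:i + block_size])
                for i in range(0, len(data), block_size)]
    return sum(partials)

data = [float(i) for i in range(1000)]
assert two_level_sum(data) == sum(data)
```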

10. CUDA Tools and Libraries

10.1 Core Library Ecosystem

| Library | Purpose | Key Features |
|---|---|---|
| cuDNN | Deep Learning | Optimized DNN primitives: convolution, RNN, attention, BatchNorm |
| cuBLAS | Linear Algebra | BLAS Level 1/2/3 operations: GEMM, TRSM; FP8/BF16 grouped GEMM support (Blackwell) |
| NCCL | Multi-GPU Comms | Collective communication: AllReduce, AllGather, Broadcast; NVLink/NVSwitch optimized |
| TensorRT | Inference Opt | Graph optimization, quantization (INT8/FP8), layer fusion, dynamic batching |
| Triton | GPU Programming | Python DSL for high-performance kernels; developed by OpenAI |
| CUTLASS | Custom GEMM | Template-based CUDA matrix multiplication library |
| FlashAttention | Attention Accel | IO-aware algorithm for Transformer attention acceleration |
| cuSPARSE | Sparse Matrices | Optimized sparse matrix operations |
| cuRAND | Random Numbers | GPU-accelerated pseudo/quasi-random number generation |
| cuFFT | FFT | GPU-accelerated Fast Fourier Transform |

10.2 cuDNN

# cuDNN is built into PyTorch/TensorFlow and used automatically

# Enable cuDNN benchmark mode in PyTorch
import torch
torch.backends.cudnn.benchmark = True   # Auto-select optimal algorithm
torch.backends.cudnn.deterministic = False  # Performance priority (when reproducibility is not needed)

# Check cuDNN version
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")

10.3 NCCL (Multi-GPU Communication)

# NCCL usage in PyTorch (internal to DistributedDataParallel)
import torch.distributed as dist

# Initialize NCCL backend
dist.init_process_group(backend='nccl')

# Key collective communication operations
# AllReduce: Sum tensors from all GPUs and distribute to each GPU
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

# AllGather: Gather tensors from all GPUs and copy the full set to each GPU
gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
dist.all_gather(gathered, tensor)

# Broadcast: Copy a tensor from one GPU to all GPUs
dist.broadcast(tensor, src=0)

10.4 TensorRT

# Inference optimization with TensorRT
import tensorrt as trt
import torch

# Convert PyTorch model to ONNX
model = MyModel().cuda().eval()
dummy_input = torch.randn(1, 3, 224, 224).cuda()
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=['input'], output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size'}})

# Simple conversion with torch_tensorrt (PyTorch 2.x)
import torch_tensorrt

optimized_model = torch_tensorrt.compile(model,
    inputs=[torch_tensorrt.Input(
        min_shape=[1, 3, 224, 224],
        opt_shape=[8, 3, 224, 224],
        max_shape=[32, 3, 224, 224],
        dtype=torch.float16
    )],
    enabled_precisions={torch.float16},
)

# Inference
with torch.no_grad():
    output = optimized_model(input_tensor.half().cuda())

# Typical TensorRT performance gains:
# - FP32 -> FP16: 2-3x faster
# - FP32 -> INT8: 3-5x faster
# - Layer fusion + optimization: Additional 20-50% improvement

10.5 OpenAI Triton

Triton is a programming language for writing high-performance GPU kernels in Python.

import triton
import triton.language as tl
import torch

# Triton kernel: vector addition
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y

    tl.store(output_ptr + offsets, output, mask=mask)

# Usage
def triton_add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

# Execution
x = torch.randn(1_000_000, device='cuda')
y = torch.randn(1_000_000, device='cuda')
result = triton_add(x, y)
assert torch.allclose(result, x + y)

# Advantages of Triton:
# - Higher-level abstraction than CUDA C
# - Automatic memory access optimization
# - Write high-performance kernels with Python syntax
# - Backend for PyTorch 2.0 torch.compile

10.6 FlashAttention

FlashAttention accelerates Transformer Self-Attention with an IO-aware algorithm.

# FlashAttention-3 (H100 optimized)
# Install: pip install flash-attn

from flash_attn import flash_attn_func

# Input: (batch, seqlen, nheads, headdim)
q = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
k = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
v = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)

# FlashAttention execution
output = flash_attn_func(q, k, v, causal=True)

# Performance comparison (FlashAttention-3 vs standard Attention):
# - Memory: O(N^2) -> O(N) (sequence length N)
# - Speed: 2-4x faster (up to 740 TFLOPS on H100 FP16)
# - FP8: up to 1.2 PFLOPS (H100)
# - FlashAttention-3 is 1.5-2.0x faster than FlashAttention-2

# PyTorch 2.0+ built-in scaled_dot_product_attention
output = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
    is_causal=True
)
# PyTorch automatically selects FlashAttention or Memory-Efficient Attention

11. CUDA Troubleshooting

11.1 CUDA OOM (Out of Memory) Solutions

GPU memory shortage is the most common CUDA issue.

# Error: RuntimeError: CUDA out of memory.

# Solution 1: Reduce batch size
# batch_size = 64 -> 32 -> 16

# Solution 2: Use Mixed Precision (~50% memory savings)
from torch.amp import autocast, GradScaler
scaler = GradScaler('cuda')

# Solution 3: Gradient Accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(loader):
    with autocast('cuda'):
        loss = model(data) / accumulation_steps
    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

# Solution 4: Gradient Checkpointing (memory-compute tradeoff)
from torch.utils.checkpoint import checkpoint
class MemEfficientModel(nn.Module):
    def forward(self, x):
        # Don't store intermediate activations; recompute during backward
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return x

# Solution 5: Clear cache
torch.cuda.empty_cache()

# Solution 6: Debug memory leaks
print(torch.cuda.memory_summary())

# Solution 7: Delete unnecessary tensors immediately
del large_tensor
torch.cuda.empty_cache()

# Solution 8: Disable gradients during inference
with torch.no_grad():
    output = model(input_data)

11.2 CUDA Driver vs Runtime Version Mismatch

# Problem: "CUDA driver version is insufficient for CUDA runtime version"

# How to check
nvidia-smi           # Check driver-supported CUDA version
nvcc --version       # Check installed CUDA Toolkit version

# Rule: Driver CUDA >= Toolkit CUDA
# Example: If driver supports CUDA 12.2, CUDA 13.0 Toolkit cannot be used

# Solution 1: Update driver
sudo apt-get update
sudo apt-get install nvidia-driver-560  # Install latest driver

# Solution 2: Downgrade to compatible CUDA Toolkit
sudo apt-get install cuda-toolkit-12-4

# Solution 3: Install matching CUDA build of PyTorch/TensorFlow
pip install torch --index-url https://download.pytorch.org/whl/cu124

11.3 nvidia-smi Key Commands Summary

| Command | Purpose |
|---|---|
| nvidia-smi | GPU status summary |
| nvidia-smi -l 1 | Refresh every 1 second |
| nvidia-smi -q | Detailed info |
| nvidia-smi -q -d MEMORY | Memory details |
| nvidia-smi -q -d CLOCK | Clock speeds |
| nvidia-smi -q -d TEMPERATURE | Temperature info |
| nvidia-smi -q -d POWER | Power consumption |
| nvidia-smi -q -d PERFORMANCE | Performance state |
| nvidia-smi --query-gpu=... --format=csv | CSV output |
| nvidia-smi pmon -i 0 | Process monitoring |
| nvidia-smi dmon -d 1 | Device monitoring |
| nvidia-smi topo -m | GPU topology (NVLink, etc.) |
| nvidia-smi -r -i 0 | Reset GPU 0 |
| nvidia-smi -pm 1 | Enable persistence mode |
| nvidia-smi -pl 300 | Set power limit (W) |

11.4 cuda-gdb Debugging

# CUDA debug build
nvcc -g -G -O0 -o debug_app app.cu

# Run cuda-gdb
cuda-gdb ./debug_app

# Key commands
(cuda-gdb) break myKernel            # Breakpoint on kernel
(cuda-gdb) run                        # Run
(cuda-gdb) cuda thread               # Current thread info
(cuda-gdb) cuda block                 # Current block info
(cuda-gdb) cuda kernel                # Current kernel info
(cuda-gdb) cuda thread (0,0,0)        # Switch to specific thread
(cuda-gdb) cuda block (1,0,0)         # Switch to specific block
(cuda-gdb) info cuda threads          # List active threads
(cuda-gdb) print threadIdx.x          # Print built-in variable
(cuda-gdb) print data[idx]            # Check data

# compute-sanitizer (memory error detection)
compute-sanitizer --tool memcheck ./my_app
compute-sanitizer --tool racecheck ./my_app   # Race condition detection
compute-sanitizer --tool initcheck ./my_app   # Initialization check

11.5 Common Error Messages and Solutions

| Error Message | Cause | Solution |
| --- | --- | --- |
| CUDA error: out of memory | GPU memory shortage | Reduce batch size, Mixed Precision, Gradient Checkpointing |
| CUDA error: device-side assert triggered | Assert failure or index out of bounds in kernel | Run with CUDA_LAUNCH_BLOCKING=1 to identify exact location |
| CUDA error: an illegal memory access | Invalid memory access | Debug with compute-sanitizer, check index bounds |
| CUDA error: no kernel image is available | Compute Capability mismatch | Recompile with correct -arch=sm_XX option |
| CUDA driver version is insufficient | Driver version too old | Update driver or downgrade CUDA Toolkit |
| CUDA error: invalid device function | Wrong architecture target | Check -gencode options, use Fat Binary |
| cuDNN error: CUDNN_STATUS_NOT_SUPPORTED | cuDNN version mismatch or unsupported op | Update cuDNN or check input format |
| NCCL error: unhandled system error | Multi-GPU communication failure | Check network config, enable NCCL_DEBUG=INFO |
# Debugging tip: Switch to synchronous execution to pinpoint exact error location
CUDA_LAUNCH_BLOCKING=1 python train.py

# NCCL debugging
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch train.py

# PyTorch CUDA memory debugging
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py
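These variables can also be set from inside a script rather than the shell, but they must be set before CUDA is initialized; a sketch using the same values as the commands above:

```python
import os

# Must run before the first CUDA call (in PyTorch: before CUDA is initialized),
# otherwise the settings are silently ignored.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"       # synchronous launches for exact error locations
os.environ["NCCL_DEBUG"] = "INFO"              # verbose NCCL logging
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # reduce fragmentation

# `import torch` and model setup would come after this point
```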

12. CUDA Alternative Technologies Comparison

12.1 Major GPU Computing Frameworks

| Technology | Developer | Target Hardware | Languages | Ecosystem Maturity | Primary Use |
| --- | --- | --- | --- | --- | --- |
| CUDA | NVIDIA | NVIDIA GPU | C/C++/Python | Very High | AI/ML, HPC, scientific computing |
| ROCm | AMD | AMD GPU (MI300X, etc.) | C/C++ (HIP) | Medium | AI training/inference (PyTorch supported) |
| OpenCL | Khronos | General (GPU/CPU/FPGA) | C/C++ | Medium | Cross-platform GPU computing |
| SYCL | Khronos | General | C++ | Growing | oneAPI (Intel), cross-platform |
| Metal | Apple | Apple Silicon | Swift/Obj-C/C++ | High (within Apple ecosystem) | macOS/iOS GPU computing |
| Vulkan Compute | Khronos | General GPU | C/GLSL/SPIR-V | Medium | Cross-platform GPU computation |
| Triton | OpenAI | NVIDIA/AMD GPU | Python | Growing | High-level GPU kernel programming |
| WebGPU | W3C | Browser GPU | WGSL/JS | Early | Web-based GPU computation |

12.2 CUDA vs ROCm Detailed Comparison

CUDA code:
  cudaMalloc(&d_ptr, size);
  myKernel<<<grid, block>>>(d_ptr);
  cudaDeviceSynchronize();

ROCm HIP code (nearly identical API):
  hipMalloc(&d_ptr, size);
  myKernel<<<grid, block>>>(d_ptr);
  hipDeviceSynchronize();

AMD ROCm's HIP (Heterogeneous-compute Interface for Portability) provides a very similar API to CUDA, and the hipify-perl tool can automatically convert CUDA code to HIP code. PyTorch officially supports ROCm, enabling training on AMD MI300X and similar hardware.
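This near one-to-one mapping is exactly what makes automatic conversion feasible. A toy sketch of the idea behind hipify (real hipify-perl and hipify-clang handle far more cases, including kernel launch syntax, headers, and library calls; the mapping table here covers only the calls shown above):

```python
import re

# Tiny illustrative subset of the CUDA -> HIP runtime API mapping
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpy": "hipMemcpy",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(src: str) -> str:
    """Replace known CUDA runtime calls with their HIP equivalents."""
    pattern = re.compile(r"\b(" + "|".join(CUDA_TO_HIP) + r")\b")
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(1)], src)

cuda_src = "cudaMalloc(&d_ptr, size);\nmyKernel<<<grid, block>>>(d_ptr);\ncudaDeviceSynchronize();"
print(toy_hipify(cuda_src))
```

Note that the kernel launch line passes through unchanged, mirroring the example above where the `<<<grid, block>>>` syntax is identical in HIP.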

12.3 Selection Guide

| Scenario | Recommended Technology | Reason |
| --- | --- | --- |
| AI/ML training and inference | CUDA | Optimized library ecosystem (cuDNN, TensorRT) |
| AMD GPU usage | ROCm (HIP) | CUDA-compatible API, PyTorch support |
| Cross-platform requirement | OpenCL or SYCL | Broad hardware support |
| Apple environment | Metal | Only option for macOS/iOS |
| Custom kernels (Python) | Triton | Higher productivity than CUDA C |
| Web browser GPU | WebGPU | Standard web API |

13. References

13.1 Official Documentation

  • NVIDIA CUDA Programming Guide: Official reference for GPU architecture and programming model
  • CUDA C++ Best Practices Guide: Collection of best practices for performance optimization
  • CUDA Toolkit Release Notes: Change logs and compatibility information for each version
  • NVIDIA Developer Blog: Latest technology trends and tutorials
| Resource | URL |
| --- | --- |
| CUDA Toolkit Download | https://developer.nvidia.com/cuda-downloads |
| CUDA Programming Guide | https://docs.nvidia.com/cuda/cuda-c-programming-guide/ |
| CUDA Best Practices | https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ |
| cuDNN Documentation | https://docs.nvidia.com/deeplearning/cudnn/ |
| TensorRT Documentation | https://docs.nvidia.com/deeplearning/tensorrt/ |
| NCCL Documentation | https://docs.nvidia.com/deeplearning/nccl/ |
| Nsight Compute | https://developer.nvidia.com/nsight-compute |
| Nsight Systems | https://developer.nvidia.com/nsight-systems |
| RAPIDS Official Site | https://rapids.ai/ |
| FlashAttention GitHub | https://github.com/Dao-AILab/flash-attention |
| OpenAI Triton GitHub | https://github.com/openai/triton |
| PyTorch CUDA Docs | https://pytorch.org/docs/stable/cuda.html |
13.2 Learning Roadmap

Beginner:
  1. Check GPU with nvidia-smi
  2. Use .cuda() in PyTorch
  3. Apply Mixed Precision Training
  4. Write simple kernels with Numba CUDA JIT

Intermediate:
  5. Write kernels directly in CUDA C/C++
  6. Optimize with Shared Memory, Coalesced Access
  7. Asynchronous execution with CUDA Streams
  8. Profile with Nsight Compute

Advanced:
  9. Custom Attention kernels with Triton
  10. Inference optimization with TensorRT
  11. Multi-GPU DDP training
  12. Custom GEMM implementation with CUTLASS

Conclusion

CUDA is more than a GPU programming tool -- it forms the foundation of modern AI infrastructure. Alongside hardware evolution from Ampere to Hopper to Blackwell, the CUDA Toolkit has entered the 13.x era, beginning to offer higher-level abstractions such as CUDA Tile and cuTile DSL.

The key points for effectively leveraging CUDA in practice are as follows.

  1. Build solid foundations: Understanding the Warp, Block, Grid execution model and memory hierarchy is essential for optimization
  2. Actively use Python tools: Most GPU acceleration can be achieved with PyTorch's Mixed Precision, CuPy, Numba, and similar tools
  3. Profile first: Identify bottlenecks with Nsight Compute/Systems before investing in optimization
  4. Memory is key: Memory optimization techniques like Coalesced Access, Shared Memory Tiling, and Pinned Memory typically account for the bulk of kernel-level performance improvements
  5. Leverage the ecosystem: Maximize use of already-optimized libraries like cuDNN, TensorRT, and FlashAttention

The world of GPU computing continues to evolve. The emergence of high-level programming models like Triton, the expansion of low-precision operations such as FP4/FP6, and the introduction of tile-based programming represented by CUDA Tile are all moving GPU programming toward lowering the barrier to entry while maximizing hardware performance. We hope this guide serves as a starting point for that journey.