
The Complete CUDA Hands-on Guide: Everything About GPU Computing


1. Introduction: What Is CUDA?

1.1 Definition of CUDA

**CUDA (Compute Unified Device Architecture)** is a parallel computing platform and programming model developed by NVIDIA. It enables general-purpose computation (GPGPU) on the thousands of cores in a GPU, and since its first release in 2007 it has become the de facto standard across AI, scientific simulation, image processing, financial modeling, and more.

As of 2025, the latest CUDA Toolkit release is version 13.1. CUDA 13.1 introduces a tile-based programming model called CUDA Tile and the cuTile Python DSL, providing a Tensor Core abstraction and forward compatibility with next-generation Blackwell GPUs.

1.2 A Brief History of GPU Computing and the Origins of CUDA

The evolution of GPU computing, by era:

| Year | Event | Significance |
|---|---|---|
| 2001 | GPGPU research begins | First attempts at general-purpose computation via shaders |
| 2006 | NVIDIA Tesla architecture announced | First unified shader architecture |
| 2007 | CUDA 1.0 released | The start of general-purpose GPU computing |
| 2012 | AlexNet (ImageNet) | The GPU deep learning era begins |
| 2017 | Volta + Tensor Cores | Mixed-precision compute acceleration |
| 2020 | Ampere (A100) | TF32 and sparsity support |
| 2022 | Hopper (H100) | Transformer Engine, FP8 support |
| 2024 | Blackwell (B200) | FP4, 5th-generation Tensor Cores |
| 2025 | CUDA 13.0/13.1 released | CUDA Tile and the cuTile DSL introduced |

1.3 CPU vs. GPU Architecture

CPUs and GPUs embody fundamentally different design philosophies.

| Characteristic | CPU | GPU |
|---|---|---|
| Design goal | Low latency (serial processing) | High throughput (parallel processing) |
| Core count | A few to a few dozen | Thousands to tens of thousands |
| Clock speed | High (4-6 GHz) | Relatively low (1-2 GHz) |
| Cache size | Large (tens of MB) | Small (per core) |
| Control logic | Complex (branch prediction, out-of-order execution) | Simple (latency hidden by many threads) |
| Parallelism model | SIMD (Single Instruction Multiple Data) | SIMT (Single Instruction Multiple Threads) |
| Ideal workload | Complex branching, sequential logic | Large-scale data-parallel computation |

Key difference between SIMD and SIMT: SIMD is a vector-processing scheme in which one instruction operates on multiple data elements at once. SIMT goes one step further: each thread has its own program counter yet executes the same instruction in lockstep. When a branch occurs, threads within a warp may follow different paths, but this causes Warp Divergence and degrades performance.
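The cost of divergence can be seen with a toy host-side model (plain Python, not a CUDA API; "one serialized pass per distinct path" is a simplification of real hardware):

```python
# Toy model of SIMT branching: a warp serializes one execution pass
# per distinct path taken by its 32 threads.
def divergence_passes(path_per_thread):
    """Number of serialized passes for one warp given each thread's path."""
    return len(set(path_per_thread))

uniform = ["A"] * 32                                          # all threads agree
divergent = ["A" if t % 2 == 0 else "B" for t in range(32)]   # even/odd split

print(divergence_passes(uniform))    # 1 pass: no divergence
print(divergence_passes(divergent))  # 2 passes: each path runs in turn
```

With the even/odd branch, half the warp sits idle during each pass, which is exactly the slowdown the text describes.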

1.4 Why CUDA Is Essential for AI/ML/Deep Learning

The reasons CUDA is central to modern AI/ML workloads are clear.

  • Accelerated matrix math: Tensor Cores dramatically speed up matrix multiplication (GEMM), the core operation of deep learning
  • Software ecosystem: a suite of optimized libraries such as cuDNN, cuBLAS, NCCL, and TensorRT
  • Framework support: all major frameworks (PyTorch, TensorFlow, JAX) use CUDA as their default backend
  • Mixed precision: low-precision arithmetic from FP16 and BF16 down to FP8 and FP4 maximizes training and inference speed
  • Multi-GPU scaling: distributed training across hundreds of GPUs via NVLink and NVSwitch

2. GPU Architecture Fundamentals

2.1 Inside an NVIDIA GPU

Let's look at the core building blocks of an NVIDIA GPU, hierarchically.

GPU (GPC - Graphics Processing Cluster)
├── SM (Streaming Multiprocessor) × N
│   ├── CUDA Core (FP32/INT32) × 128 (Hopper)
│   ├── Tensor Core × 4 (4th gen, Hopper)
│   ├── RT Core (Ray Tracing) × 1
│   ├── Warp Scheduler × 4
│   ├── Register File (256 KB)
│   ├── L1 Cache / Shared Memory (shared, up to 228 KB)
│   └── SFU (Special Function Unit)
├── L2 Cache (shared)
├── Memory Controller
└── HBM (High Bandwidth Memory)

Core types, explained:

  • CUDA Core: the basic processing unit for general-purpose floating-point/integer math; handles FP32, FP64, and INT32 operations
  • Tensor Core: a core specialized for matrix multiply-accumulate (MMA) operations; the key accelerator for deep learning training and inference
  • RT Core: dedicated hardware for accelerating ray tracing; used mainly in graphics workloads

2.2 Memory Hierarchy

GPU memory is organized hierarchically by speed and capacity.

| Memory type | Location | Size (H100) | Bandwidth | Scope | Characteristics |
|---|---|---|---|---|---|
| Register | Inside SM | 256 KB/SM | Fastest | Per thread | Fastest but scarce |
| Shared Memory | Inside SM | Up to 228 KB/SM | Very fast | Shared within block | Programmer-managed cache |
| L1 Cache | Inside SM | Shared with Shared Memory | Very fast | Per SM | HW-managed |
| L2 Cache | GPU-wide | 50 MB | Fast | Shared by all SMs | HW-managed |
| Global Memory (HBM) | Off-chip | 80 GB | 3.35 TB/s | All threads | Largest, but highest latency |
| Constant Memory | GPU-wide | 64 KB | Fast when cached | Read-only | Optimized for broadcast |
| Texture Memory | GPU-wide | Shares Global | Fast when cached | Read-only | Optimized for 2D spatial locality |

Speed: Register > Shared/L1 > L2 > Global (HBM)
Size:  Global (HBM) > L2 > Shared/L1 > Register

2.3 Warps, Blocks, and Grids

The CUDA execution model has a three-level hierarchy.

Grid (one kernel launch)
├── Block (0,0)          Block (1,0)          Block (2,0)
│   ├── Warp 0           ├── Warp 0           ├── Warp 0
│   │   ├── Thread 0     │   ├── Thread 0     │   ├── Thread 0
│   │   ├── Thread 1     │   ├── Thread 1     │   ├── Thread 1
│   │   ├── ...          │   ├── ...          │   ├── ...
│   │   └── Thread 31    │   └── Thread 31    │   └── Thread 31
│   ├── Warp 1           ├── Warp 1           ├── Warp 1
│   └── ...              └── ...              └── ...
├── Block (0,1)          Block (1,1)          ...
└── ...
  • Thread: the smallest unit of execution on the GPU
  • Warp: a group of 32 threads that execute the same instruction together on an SM (the heart of SIMT)
  • Block (Thread Block): a set of warps; runs on a single SM, whose Shared Memory its threads share
  • Grid: the collection of blocks that makes up one kernel launch

Key limits:

| Item | Limit (Compute Capability 9.0) |
|---|---|
| Max threads per block | 1,024 |
| Warp size | 32 (fixed) |
| Max block dimensions | (1024, 1024, 64) |
| Max grid dimensions | (2^31-1, 65535, 65535) |
| Max blocks per SM | 32 |
| Max warps per SM | 64 |
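These limits are easy to sanity-check on the host. A small Python sketch (the `launch_config` helper is hypothetical, not a CUDA API) derives a 1D launch configuration under the CC 9.0 limits:

```python
import math

# Limits from the CC 9.0 table above.
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32

def launch_config(n, threads_per_block=256):
    """Blocks needed to cover n elements, plus warps per block."""
    if threads_per_block % WARP_SIZE or threads_per_block > MAX_THREADS_PER_BLOCK:
        raise ValueError("block size must be a multiple of 32 and <= 1024")
    blocks_per_grid = math.ceil(n / threads_per_block)
    warps_per_block = threads_per_block // WARP_SIZE
    return blocks_per_grid, warps_per_block

print(launch_config(1 << 20))  # (4096, 8): 1M elements, 256-thread blocks
```

The same ceiling-division idiom appears throughout the CUDA C examples later as `(n + threadsPerBlock - 1) / threadsPerBlock`.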

2.4 Differences Across Compute Capability Versions

Compute Capability (CC) defines the feature set of the GPU hardware.

| CC | Architecture | Representative GPU | Key features |
|---|---|---|---|
| 7.0 | Volta | V100 | 1st-gen Tensor Cores, independent thread scheduling |
| 7.5 | Turing | RTX 2080 | INT8/INT4 Tensor Cores, RT Cores |
| 8.0 | Ampere | A100 | 3rd-gen Tensor Cores, TF32, sparsity |
| 8.6 | Ampere | RTX 3090 | Consumer Ampere |
| 8.9 | Ada Lovelace | RTX 4090 | 4th-gen Tensor Cores, FP8, DLSS 3 |
| 9.0 | Hopper | H100 | 4th-gen Tensor Cores, Transformer Engine, DPX |
| 10.0 | Blackwell | B200 | 5th-gen Tensor Cores, FP4, TMEM |
| 12.0 | Blackwell Ultra | B300 | Enhanced 5th-gen Tensor Cores, 288 GB HBM3E |

Note that starting with CUDA 13.0, support for Maxwell (CC 5.x), Pascal (CC 6.x), and Volta (CC 7.0) has been removed.

2.5 Comparing Recent GPU Generations

A comparison across generations of data-center GPUs:

| Spec | A100 (SXM) | H100 (SXM5) | B200 (SXM) | B300 (SXM) |
|---|---|---|---|---|
| Architecture | Ampere | Hopper | Blackwell | Blackwell Ultra |
| CUDA Cores | 6,912 | 16,896 | 18,432 | 18,432+ |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) | 5th gen | 5th gen |
| Memory | 80 GB HBM2e | 80 GB HBM3 | 180 GB HBM3E | 288 GB HBM3E |
| Memory bandwidth | 2.0 TB/s | 3.35 TB/s | 7.7 TB/s | 8.0 TB/s |
| FP32 performance | 19.5 TFLOPS | 60 TFLOPS | Undisclosed | Undisclosed |
| FP16 Tensor | 312 TFLOPS | 990 TFLOPS | Undisclosed | Undisclosed |
| FP4 Tensor | Not supported | Not supported | 9.0 PFLOPS | 14.0 PFLOPS |
| NVLink | 3rd gen (600 GB/s) | 4th gen (900 GB/s) | 5th gen (1.8 TB/s) | 5th gen (1.8 TB/s) |
| TDP | 400 W | 700 W | 1,000 W | 1,400 W |
| Cooling | Air/liquid | Air/liquid | Liquid recommended | Liquid required (DLC) |

The B200 achieves 3x the training performance and 15x the inference performance of the A100. The B300 (Blackwell Ultra) delivers 14 PFLOPS of FP4 compute, about 55.6% faster than the B200.


3. Setting Up the CUDA Development Environment

3.1 Installing the CUDA Toolkit

Linux (Ubuntu/Debian)

# 1. Verify the NVIDIA driver
nvidia-smi

# 2. Add the CUDA keyring (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# 3. Install the CUDA Toolkit
sudo apt-get install cuda-toolkit-13-1

# 4. Set environment variables
export PATH=/usr/local/cuda-13.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.1/lib64:$LD_LIBRARY_PATH

# 5. Verify the installation
nvcc --version

Windows

# 1. Download the CUDA Toolkit from NVIDIA's official site
# https://developer.nvidia.com/cuda-downloads

# 2. After installing, check the environment variable
echo %CUDA_PATH%
# Typically C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1

# 3. Verify
nvcc --version
nvidia-smi

3.2 Using the nvidia-smi Command

nvidia-smi is the primary tool for monitoring GPU state.

# Print basic info
nvidia-smi

# Example output:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 560.35.03    Driver Version: 560.35.03    CUDA Version: 13.1                 |
# |-----------------------------------------+------------------------+----------------------|
# | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
# |=========================================+========================+======================|
# |   0  NVIDIA H100 80GB HBM3          On | 00000000:3B:00.0   Off |                    0 |
# | N/A   32C    P0              72W / 700W |    1234MiB / 81559MiB  |      0%      Default |
# +-----------------------------------------+------------------------+----------------------+

# Continuous monitoring (1-second interval)
nvidia-smi -l 1

# dmon - monitor GPU metrics
nvidia-smi dmon -s pucvmet -d 1

# Info for a specific GPU
nvidia-smi -i 0 -q

# Check GPU processes
nvidia-smi pmon -i 0

# GPU clock and memory info
nvidia-smi -q -d CLOCK,MEMORY

# CSV output (for scripting)
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
  --format=csv,noheader,nounits

3.3 Using the nvcc Compiler

nvcc is the NVIDIA CUDA Compiler, which compiles CUDA source code.

# Basic compilation
nvcc -o hello hello.cu

# Target a specific architecture
nvcc -arch=sm_90 -o kernel kernel.cu   # Hopper H100
nvcc -arch=sm_80 -o kernel kernel.cu   # Ampere A100

# Debug build
nvcc -g -G -o debug_kernel kernel.cu

# Optimization level
nvcc -O3 -o optimized kernel.cu

# Generate PTX (intermediate representation)
nvcc -ptx kernel.cu

# Target multiple architectures at once (fat binary)
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -o multi_arch kernel.cu

# Link libraries
nvcc -lcublas -lcurand -o linked_kernel kernel.cu

# Detailed compilation info
nvcc --resource-usage -o kernel kernel.cu

3.4 Checking CUDA Versions and Compatibility

There are two distinct CUDA versions, and it is important not to confuse them.

# 1. CUDA version supported by the driver (Driver API)
nvidia-smi
# Shows "CUDA Version: 13.1" in the top right
# This is the maximum CUDA runtime version the driver supports

# 2. CUDA Toolkit version (Runtime API)
nvcc --version
# Shows "release 13.1"
# The CUDA Toolkit version actually installed

# 3. Check at runtime
python3 -c "import torch; print(torch.version.cuda)"

Key compatibility rules:

  • The driver's CUDA version must be greater than or equal to the Toolkit's CUDA version
  • Example: a driver that supports CUDA 13.1 can also run a CUDA 12.x Toolkit (backward compatible)
  • Under the CUDA Toolkit's Minor Version Compatibility policy, binaries are compatible within the same major version
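The first rule boils down to a version comparison. A minimal sketch (the `driver_supports_toolkit` helper is illustrative, not an NVIDIA API):

```python
def driver_supports_toolkit(driver_version, toolkit_version):
    """Backward-compat rule: the driver's reported CUDA version must be
    >= the toolkit's CUDA version for the toolkit's binaries to run."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(driver_version) >= parse(toolkit_version)

print(driver_supports_toolkit("13.1", "12.4"))  # True  (backward compatible)
print(driver_supports_toolkit("12.2", "13.1"))  # False (driver too old)
```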

3.5 Using CUDA with Docker

With the NVIDIA Container Toolkit, Docker containers can use the GPU.

# 1. Install the NVIDIA Container Toolkit (Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# 2. Configure the Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 3. Run a GPU container
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

# Use specific GPUs only
docker run --rm --gpus '"device=0,1"' nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

# Using GPUs with docker compose
# docker-compose.yml:
services:
  gpu-app:
    image: nvidia/cuda:13.1.0-devel-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: nvidia-smi

3.6 Installing CUDA-enabled PyTorch/TensorFlow with conda/pip

# PyTorch (CUDA 12.4 build example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# With conda
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

# TensorFlow (auto-detects the GPU)
pip install tensorflow[and-cuda]

# Check CUDA availability (Python)
python3 -c "
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA version: {torch.version.cuda}')
print(f'GPU count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
    print(f'GPU name: {torch.cuda.get_device_name(0)}')
"

4. CUDA Programming Basics (C/C++)

4.1 Host (CPU) vs. Device (GPU) Code

In CUDA programming, CPU-side code is called the Host and GPU-side code the Device.

┌─────────────────────────────────────────────────────────┐
│  Host (CPU)                                             │
│  ┌─────────────────────────────────────────────────────┐│
│  │ 1. Prepare data (host memory)                       ││
│  │ 2. Allocate GPU memory (cudaMalloc)                 ││
│  │ 3. Copy data: Host -> Device (cudaMemcpy)           ││
│  │ 4. Launch kernel (<<<grid, block>>>)                ││
│  │ 5. Copy results: Device -> Host (cudaMemcpy)        ││
│  │ 6. Free GPU memory (cudaFree)                       ││
│  └─────────────────────────────────────────────────────┘│
│                    ↕ PCIe / NVLink                       │
│  ┌─────────────────────────────────────────────────────┐│
│  │  Device (GPU)                                       ││
│  │  - Kernel functions run in parallel                 ││
│  │  - Thousands of threads process data concurrently   ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘

4.2 Function Qualifiers

CUDA provides three function qualifiers.

| Qualifier | Runs on | Callable from | Description |
|---|---|---|---|
| __global__ | Device (GPU) | Host (CPU) | Kernel function; return type must be void |
| __device__ | Device (GPU) | Device (GPU) | Helper function callable only from GPU code |
| __host__ | Host (CPU) | Host (CPU) | Ordinary CPU function (the default; may be omitted) |
| __host__ __device__ | Both | Both | Usable on both CPU and GPU |

// __global__: kernel function - called from the host, runs on the device
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2.0f;
    }
}

// __device__: device-only helper function
__device__ float square(float x) {
    return x * x;
}

// __host__ __device__: usable from both CPU and GPU
__host__ __device__ float add(float a, float b) {
    return a + b;
}

4.3 Kernel Launch Syntax

Kernel functions are launched with the special triple-angle-bracket syntax.

// Kernel launch syntax
// kernel<<<gridDim, blockDim, sharedMemSize, stream>>>(args...);
//
// gridDim:       number of blocks in the grid (dim3)
// blockDim:      number of threads per block (dim3)
// sharedMemSize: dynamic shared memory size in bytes (optional)
// stream:        stream to run in (optional)

// 1D example: 1 block of 256 threads
myKernel<<<1, 256>>>(d_data, n);

// 1D example: process N elements with 256-thread blocks
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);

// 2D example: image processing
dim3 blockDim(16, 16);       // 16x16 = 256 threads per block
dim3 gridDim(
    (width + 15) / 16,
    (height + 15) / 16
);
imageKernel<<<gridDim, blockDim>>>(d_image, width, height);

4.4 Thread Hierarchy

Each thread can determine its own position via built-in variables.

// Built-in variables
threadIdx.x, threadIdx.y, threadIdx.z  // thread index within the block
blockIdx.x,  blockIdx.y,  blockIdx.z   // block index within the grid
blockDim.x,  blockDim.y,  blockDim.z   // dimensions of the block
gridDim.x,   gridDim.y,   gridDim.z    // dimensions of the grid

// Global thread ID (1D)
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

// Global thread ID (2D)
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int globalIdx = row * width + col;

// Total thread count (for grid-stride loops)
int totalThreads = gridDim.x * blockDim.x;
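The 2D formula can be verified on the host. A quick Python sketch mirroring the built-in variables confirms that a 16x16 block layout assigns every pixel of a 64x32 image exactly one global index:

```python
# Host-side check of the 2D global-index formula above.
def global_idx(block_idx, block_dim, thread_idx, width):
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    return row * width + col

width, height = 64, 32
bdim = (16, 16)
indices = {
    global_idx((bx, by), bdim, (tx, ty), width)
    for by in range(height // bdim[1]) for bx in range(width // bdim[0])
    for ty in range(bdim[1]) for tx in range(bdim[0])
}
print(len(indices) == width * height)  # True: unique, gap-free coverage
```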

4.5 Memory Management Functions

// Allocate GPU memory
float* d_data;
cudaMalloc((void**)&d_data, n * sizeof(float));

// Copy Host -> Device
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

// Copy Device -> Host
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

// Free GPU memory
cudaFree(d_data);

// Zero GPU memory
cudaMemset(d_data, 0, n * sizeof(float));

// Error-checking macro (essential!)
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error at %s:%d - %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Usage
CUDA_CHECK(cudaMalloc((void**)&d_data, n * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice));

4.6 Hello World Example

// hello_cuda.cu
#include <stdio.h>

__global__ void helloKernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU thread %d (block %d, thread %d)\n",
           tid, blockIdx.x, threadIdx.x);
}

int main() {
    // 2 blocks of 4 threads each = 8 threads total
    helloKernel<<<2, 4>>>();

    // Wait for the GPU to finish
    cudaDeviceSynchronize();

    printf("Hello from CPU!\n");
    return 0;
}
# Compile and run
nvcc -o hello hello_cuda.cu
./hello

4.7 Vector Addition Example

Vector addition is the "Hello World" of CUDA kernels.

// vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s at %s:%d\n", \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// GPU kernel: vector addition
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements
    size_t bytes = N * sizeof(float);

    // Allocate and initialize host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    for (int i = 0; i < N; i++) {
        h_A[i] = (float)i;
        h_B[i] = (float)(i * 2);
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_C, bytes));

    // Copy Host -> Device
    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Check for errors
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Copy Device -> Host
    CUDA_CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));

    // Verify the result
    for (int i = 0; i < N; i++) {
        if (fabs(h_C[i] - (h_A[i] + h_B[i])) > 1e-5) {
            fprintf(stderr, "Verification failed at index %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }
    printf("Vector addition of %d elements: SUCCESS\n", N);

    // Free memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

4.8 Matrix Multiplication Example

// matmul.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define TILE_SIZE 16

// Basic matrix multiplication kernel
__global__ void matMulBasic(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Tiled matrix multiplication using shared memory (optimized version)
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Loop over tiles
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load a tile into shared memory
        if (row < M && (t * TILE_SIZE + threadIdx.x) < K)
            tileA[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if ((t * TILE_SIZE + threadIdx.y) < K && col < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // wait for the tile load to complete

        // Multiply-accumulate within the tile
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();  // synchronize before loading the next tile
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}

int main() {
    const int M = 1024, N = 1024, K = 1024;

    // ... (memory allocation, initialization, and data transfer omitted)

    dim3 blockDim(TILE_SIZE, TILE_SIZE);
    dim3 gridDim((N + TILE_SIZE - 1) / TILE_SIZE,
                 (M + TILE_SIZE - 1) / TILE_SIZE);

    matMulTiled<<<gridDim, blockDim>>>(d_A, d_B, d_C, M, N, K);

    // ... (result copy, verification, and cleanup omitted)
    return 0;
}

5. CUDA Memory Management in Depth

5.1 Global Memory and Coalesced Access

**Coalesced access** to global memory is central to performance. When the 32 threads of a warp access consecutive memory addresses, the GPU combines those accesses into a single transaction.

// Coalesced access (good pattern)
// Threads in a warp access consecutive memory
__global__ void coalesced(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // consecutive access
    }
}

// Non-coalesced access (bad pattern)
// Threads access memory non-contiguously with a stride
__global__ void strided(float* data, int n, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // strided access
    }
}

// AoS (Array of Structures) vs. SoA (Structure of Arrays)
// AoS: non-coalesced (bad)
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
};
// Thread 0: particle[0].x, thread 1: particle[1].x -> strided access

// SoA: coalesced (good)
struct ParticlesSoA {
    float* x;  float* y;  float* z;
    float* vx; float* vy; float* vz;
};
// Thread 0: x[0], thread 1: x[1] -> consecutive access
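The AoS-vs-SoA stride difference can be observed directly with NumPy structured arrays, which lay records out in memory the same way a C struct array would (an illustrative host-side sketch):

```python
import numpy as np

# AoS: one 24-byte record per particle; reading field x jumps 24 bytes
# between consecutive particles (a strided, non-coalesced pattern).
aos = np.zeros(1024, dtype=[("x", "f4"), ("y", "f4"), ("z", "f4"),
                            ("vx", "f4"), ("vy", "f4"), ("vz", "f4")])
print(aos["x"].strides)  # (24,)

# SoA: each field is its own contiguous array; consecutive x values are
# 4 bytes apart (the unit-stride pattern that coalesces on the GPU).
soa_x = np.zeros(1024, dtype=np.float32)
print(soa_x.strides)     # (4,)
```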

5.2 Shared Memory and Bank Conflicts

Shared Memory is fast memory shared by the threads of a block on an SM. It is organized into 32 banks; when multiple threads access different addresses in the same bank in the same cycle, a bank conflict occurs and the accesses are serialized.

// Shared memory usage example
__global__ void sharedMemExample(float* input, float* output, int n) {
    // Static allocation
    __shared__ float sharedData[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load Global -> Shared
    if (gid < n) {
        sharedData[tid] = input[gid];
    }
    __syncthreads();  // wait until all threads have finished loading

    // Compute in shared memory (neighbor-sum example)
    if (tid > 0 && tid < blockDim.x - 1 && gid < n) {
        output[gid] = sharedData[tid - 1] + sharedData[tid] + sharedData[tid + 1];
    }
}

// Avoiding bank conflicts: padding
// Problem: column accesses into a 32x32 matrix map every thread to the same bank
__shared__ float tile[32][32];       // bank conflicts!
__shared__ float tile[32][32 + 1];   // padding avoids the conflicts
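Why the padding works can be checked with a simplified bank model (a Python sketch assuming 4-byte words and `bank = word_index % 32`; real hardware has additional access modes this ignores):

```python
# Simplified shared-memory bank model: 32 banks of 4-byte words.
def column_banks(rows, row_pitch_words, col):
    """Banks hit when `rows` threads each read one row of a column."""
    return {(r * row_pitch_words + col) % 32 for r in range(rows)}

print(len(column_banks(32, 32, 0)))  # 1: tile[32][32] -> 32-way conflict
print(len(column_banks(32, 33, 0)))  # 32: tile[32][33] -> conflict-free
```

With a pitch of 33 words, row r of any column lands in bank (r + col) mod 32, so the 32 threads spread across all 32 banks.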

5.3 Unified Memory

Unified Memory is an abstraction layer that lets the CPU and GPU access memory through the same pointer.

// Using Unified Memory
float* data;
cudaMallocManaged(&data, N * sizeof(float));

// Initialize on the CPU
for (int i = 0; i < N; i++) {
    data[i] = (float)i;
}

// Launch the GPU kernel - no separate cudaMemcpy needed!
myKernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();

// Access the result directly from the CPU
printf("Result: %f\n", data[0]);

// Free
cudaFree(data);

// Prefetch hints (performance optimization)
cudaMemPrefetchAsync(data, N * sizeof(float), deviceId);  // prefetch to the GPU
cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId);  // prefetch to the CPU

5.4 Pinned Memory

Pinned memory (page-locked memory) is host memory excluded from OS paging, which speeds up transfers to and from the GPU.

// Allocate pinned memory
float* h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));  // or cudaHostAlloc

// Transfer speed vs. ordinary memory
// Pageable:  ~12 GB/s (PCIe Gen4)
// Pinned:    ~25 GB/s (PCIe Gen4) - roughly 2x faster

// Required for asynchronous transfers
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

// Free
cudaFreeHost(h_pinned);

5.5 Memory Pools

Memory pools, introduced in CUDA 11.2, reduce the overhead of repeated allocation and deallocation.

// Stream-ordered memory allocator (CUDA 11.2+)
float* d_data;
cudaMallocAsync(&d_data, bytes, stream);

// Free asynchronously after use
cudaFreeAsync(d_data, stream);

// Configure the memory pool
cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, deviceId);

// Set the pool's release threshold
uint64_t threshold = 1ULL << 30;  // 1 GB
cudaMemPoolSetAttribute(mempool, cudaMemPoolAttrReleaseThreshold, &threshold);

5.6 Memory Bandwidth Optimization Summary

| Technique | Effect | Difficulty |
|---|---|---|
| Coalesced access | Maximizes global memory throughput | Low |
| SoA layout | Guarantees coalesced access | Medium |
| Shared memory tiling | Reduces global memory accesses | Medium |
| Bank-conflict avoidance (padding) | Maximizes shared memory throughput | Low |
| Pinned memory | ~2x faster host-device transfers | Low |
| Unified Memory + prefetch | Programming convenience plus performance | Low |
| Memory pools | Removes allocation/free overhead | Low |
| Texture memory | Exploits 2D spatial locality | Medium |

6. CUDA Streams and Asynchronous Execution

6.1 The CUDA Stream Concept

A CUDA stream is a sequence of commands that execute in order on the GPU. Work in different streams can run concurrently.

Default Stream (Stream 0):
[MemcpyH2D] -> [Kernel A] -> [MemcpyD2H]
                                           (sequential; no pipelining)

Multi-Stream:
Stream 1: [MemcpyH2D_1] -> [Kernel_1] -> [MemcpyD2H_1]
Stream 2:     [MemcpyH2D_2] -> [Kernel_2] -> [MemcpyD2H_2]
Stream 3:         [MemcpyH2D_3] -> [Kernel_3] -> [MemcpyD2H_3]
                                           (overlapped; maximizes GPU utilization)

6.2 Creating and Using Streams

// Create streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Asynchronous memory transfers (pinned memory required!)
cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream2);

// Launch kernels (in specific streams)
kernelA<<<grid, block, 0, stream1>>>(d_A);
kernelB<<<grid, block, 0, stream2>>>(d_B);

// Copy results back asynchronously
cudaMemcpyAsync(h_A, d_A, bytes, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_B, d_B, bytes, cudaMemcpyDeviceToHost, stream2);

// Synchronize the streams
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Destroy the streams
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

6.3 Multi-Stream Pipeline Pattern

const int NUM_STREAMS = 4;
const int CHUNK_SIZE = N / NUM_STREAMS;

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
    cudaStreamCreate(&streams[i]);
}

// Allocate pinned memory
float *h_in, *h_out;
cudaMallocHost(&h_in, N * sizeof(float));
cudaMallocHost(&h_out, N * sizeof(float));

float *d_in, *d_out;
cudaMalloc(&d_in, N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));

// Run the pipeline
for (int i = 0; i < NUM_STREAMS; i++) {
    int offset = i * CHUNK_SIZE;
    size_t chunkBytes = CHUNK_SIZE * sizeof(float);

    // 1. Host -> Device (async)
    cudaMemcpyAsync(d_in + offset, h_in + offset,
                    chunkBytes, cudaMemcpyHostToDevice, streams[i]);

    // 2. Launch the kernel
    int blocks = (CHUNK_SIZE + 255) / 256;
    processKernel<<<blocks, 256, 0, streams[i]>>>(
        d_in + offset, d_out + offset, CHUNK_SIZE);

    // 3. Device -> Host (async)
    cudaMemcpyAsync(h_out + offset, d_out + offset,
                    chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for everything to finish
cudaDeviceSynchronize();

6.4 CUDA Events (Timing)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Start timing
cudaEventRecord(start);

// Launch the kernel
myKernel<<<grid, block>>>(d_data, N);

// Stop timing
cudaEventRecord(stop);
cudaEventSynchronize(stop);

// Compute elapsed time
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %.3f ms\n", milliseconds);

// Compute bandwidth
float bandwidth = (bytes_read + bytes_written) / (milliseconds * 1e6);
printf("Effective bandwidth: %.2f GB/s\n", bandwidth);

cudaEventDestroy(start);
cudaEventDestroy(stop);

6.5 Synchronization Functions Compared

| Function | Scope | Purpose |
|---|---|---|
| cudaDeviceSynchronize() | Whole device | Wait for all streams to finish |
| cudaStreamSynchronize(stream) | One stream | Wait for that stream to finish |
| cudaEventSynchronize(event) | One event | Wait for that event to complete |
| cudaStreamWaitEvent(stream, event) | Across streams | Stream proceeds only after the event completes |
| __syncthreads() | Within a block | Synchronize the threads of a block (inside kernels) |

7. CUDA Optimization Techniques

7.1 Optimizing Occupancy

Occupancy is the ratio of active warps on an SM to the maximum number of warps it supports. High occupancy does not always mean peak performance, but it is key to hiding memory latency.
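A toy occupancy model makes the resource trade-off concrete (a Python sketch using the CC 9.0 per-SM limits from section 2.3; real occupancy also depends on shared-memory usage and allocation granularity, which this ignores):

```python
import math

# Per-SM limits (CC 9.0-style; registers are 64K 32-bit entries per SM).
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 32
REGISTERS_PER_SM = 64 * 1024

def occupancy(threads_per_block, regs_per_thread):
    """Active warps / max warps, limited by warps, registers, and blocks."""
    warps_per_block = math.ceil(threads_per_block / 32)
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
    blocks_by_regs = REGISTERS_PER_SM // (threads_per_block * regs_per_thread)
    blocks = min(blocks_by_warps, blocks_by_regs, MAX_BLOCKS_PER_SM)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

print(occupancy(256, 32))  # 1.0: 8 blocks x 8 warps fill the SM
print(occupancy(256, 64))  # 0.5: register pressure halves the active warps
```

Doubling per-thread register usage halves the resident blocks, which is exactly the trade-off the guidelines below warn about.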

// Occupancy calculation API
int blockSize;
int minGridSize;

// Automatically compute the optimal block size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
printf("Optimal block size: %d\n", blockSize);
printf("Minimum grid size: %d\n", minGridSize);

// Check occupancy for a given block size
int maxActiveBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxActiveBlocks, myKernel, blockSize, 0);
printf("Max active blocks per SM: %d\n", maxActiveBlocks);

Block-size selection guidelines:

| Principle | Explanation |
|---|---|
| Multiple of 32 | Match the warp size to avoid wasted threads |
| 128-512 recommended | Usually the sweet spot |
| Mind registers/shared memory | Heavy resource use calls for smaller blocks |
| Use cudaOccupancyMaxPotentialBlockSize | Automatic optimal calculation |

7.2 Minimizing Warp Divergence

When threads within a warp take different branches, execution is serialized.

// Bad: causes warp divergence
__global__ void divergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Branch depends on whether threadIdx.x is even or odd
        // -> only half the warp executes at a time
        if (threadIdx.x % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
    }
}

// Good: branch at warp granularity
__global__ void convergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / 32;  // branch on warp ID
    if (idx < n) {
        // All threads in a warp take the same path
        if (warpId % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
    }
}

7.3 Shared Memory Tiling (Matrix Multiplication Optimization)

Tiling is the key technique for reducing global memory accesses. Section 4.8's matMulTiled already showed the implementation; the performance difference is summarized below.

| Approach | Global memory loads (MxNxK matrices) | Relative performance |
|---|---|---|
| Naive | 2 * M * N * K | 1x |
| Tiled (shared memory) | 2 * M * N * K / TILE_SIZE | ~TILE_SIZE x |

With TILE_SIZE = 16, global memory accesses drop by a factor of 16.
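The factor can be checked with simple arithmetic (a Python sketch of the access counts from the table):

```python
M = N = K = 1024
TILE_SIZE = 16

# Naive kernel: every output element re-reads a row of A and a column of B.
naive_loads = 2 * M * N * K
# Tiled kernel: each A/B element is loaded once per tile pass instead of
# once per output element, cutting loads by the tile width.
tiled_loads = 2 * M * N * K // TILE_SIZE

print(naive_loads // tiled_loads)  # 16: the TILE_SIZE-fold reduction
```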

7.4 Loop Unrolling

// Manual loop unrolling
__global__ void unrolled(float* data, int n) {
    int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;

    // 4x unrolling (the offsets assume blockDim.x == 256)
    if (idx < n)         data[idx] *= 2.0f;
    if (idx + 256 < n)   data[idx + 256] *= 2.0f;
    if (idx + 512 < n)   data[idx + 512] *= 2.0f;
    if (idx + 768 < n)   data[idx + 768] *= 2.0f;
}

// Unrolling via compiler directive
__global__ void pragmaUnrolled(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    #pragma unroll 8
    for (int k = 0; k < N; k++) {
        sum += A[idx * N + k] * B[k];
    }
    C[idx] = sum;
}

7.5 The Grid-Stride Loop Pattern

A flexible pattern for when the data size exceeds the total number of threads.

__global__ void gridStrideKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Each thread strides through the array, processing multiple elements
    for (int i = idx; i < n; i += stride) {
        data[i] = data[i] * 2.0f;
    }
}

// Advantages:
// 1. Works for any data size
// 2. Block/grid sizes can be tuned freely
// 3. More work per thread amortizes kernel launch overhead

7.6 Profiling with NVIDIA Nsight

# Nsight Compute (kernel-level profiling)
ncu --set full -o profile_report ./my_cuda_app

# Profile a specific kernel only
ncu --kernel-name myKernel --launch-skip 0 --launch-count 1 ./my_cuda_app

# Nsight Systems (system-level profiling)
nsys profile --trace=cuda,nvtx -o timeline_report ./my_cuda_app

# Key metrics to check
# - SM Throughput: SM utilization
# - Memory Throughput: memory bandwidth utilization
# - Achieved Occupancy: actual occupancy
# - Warp Stall Reasons: why warps stall
# - L1/L2 Hit Rate: cache hit rates

Profiling checklist:

| Metric | Target | Remedy when off target |
|---|---|---|
| Achieved Occupancy | 50%+ | Adjust block size or register usage |
| Memory Throughput | 60%+ of theoretical bandwidth | Improve coalescing and caching |
| Compute Throughput | 60%+ of theoretical FLOPs | Increase ILP, remove redundant work |
| Warp Divergence | Minimize | Restructure branch logic |

8. CUDA from Python

8.1 PyCUDA Basics

PyCUDA lets you write and launch CUDA C kernels directly from Python.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

# Define the CUDA kernel (C code in a string)
mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dest[idx] = a[idx] * b[idx];
    }
}
""")

multiply = mod.get_function("multiply")

# Prepare data
n = 1024 * 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
dest = np.zeros_like(a)

# Launch the kernel
block_size = 256
grid_size = (n + block_size - 1) // block_size

multiply(
    cuda.Out(dest), cuda.In(a), cuda.In(b), np.int32(n),
    block=(block_size, 1, 1), grid=(grid_size, 1)
)

# Verify the result
assert np.allclose(dest, a * b)
print("PyCUDA multiply: SUCCESS")

8.2 Numba CUDA JIT

Numba is the easiest way to JIT-compile Python functions into CUDA kernels.

from numba import cuda
import numpy as np
import math

# Define the CUDA kernel
@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)  # 1D global index
    if idx < a.shape[0]:
        c[idx] = a[idx] + b[idx]

# Prepare data
n = 1_000_000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

# Transfer to the device
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

# Launch the kernel
threads_per_block = 256
blocks_per_grid = math.ceil(n / threads_per_block)
vector_add[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

# Copy the result back
result = d_c.copy_to_host()
assert np.allclose(result, a + b)
print("Numba CUDA vector_add: SUCCESS")


# 2D kernel example: matrix multiplication
@cuda.jit
def matmul_kernel(A, B, C):
    row, col = cuda.grid(2)  # 2D global index
    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.0
        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp

# Using shared memory
@cuda.jit
def matmul_shared(A, B, C):
    TILE = 16
    sA = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)
    sB = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    row = cuda.blockIdx.y * TILE + ty
    col = cuda.blockIdx.x * TILE + tx

    tmp = 0.0
    for t in range((A.shape[1] + TILE - 1) // TILE):
        if row < A.shape[0] and (t * TILE + tx) < A.shape[1]:
            sA[ty, tx] = A[row, t * TILE + tx]
        else:
            sA[ty, tx] = 0.0

        if (t * TILE + ty) < B.shape[0] and col < B.shape[1]:
            sB[ty, tx] = B[t * TILE + ty, col]
        else:
            sB[ty, tx] = 0.0

        cuda.syncthreads()
        for k in range(TILE):
            tmp += sA[ty, k] * sB[k, tx]
        cuda.syncthreads()

    if row < C.shape[0] and col < C.shape[1]:
        C[row, col] = tmp
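타일 커널은 GPU가 있어야 실행해 볼 수 있지만, 핵심인 타일 분할 산술 자체는 CPU에서 먼저 검증할 수 있다. 아래는 K 차원을 타일 단위로 순회하며 부분합을 누적하는 동일한 구조의 NumPy 참조 스케치다 (`tiled_matmul_cpu`는 설명용으로 붙인 이름).

```python
import numpy as np

def tiled_matmul_cpu(A, B, tile=16):
    """GPU 타일링 루프를 CPU에서 그대로 흉내 낸 검증용 참조 구현."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    # 커널의 t 루프와 동일: K 차원을 tile 크기로 잘라 부분합 누적
    for t in range((K + tile - 1) // tile):
        k0, k1 = t * tile, min((t + 1) * tile, K)
        # 경계 밖 원소를 0으로 채우는 커널의 else 분기와 동일한 효과
        C += A[:, k0:k1] @ B[k0:k1, :]
    return C

A = np.random.randn(100, 70).astype(np.float32)
B = np.random.randn(70, 50).astype(np.float32)
assert np.allclose(tiled_matmul_cpu(A, B), A @ B, atol=1e-4)
```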

8.3 CuPy (NumPy 대체)

CuPy는 NumPy와 동일한 API를 제공하면서 GPU에서 실행되는 라이브러리이다.

import cupy as cp
import numpy as np
import time

# NumPy와 동일한 API
a_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)
b_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)

# GPU 행렬 곱셈 (첫 호출은 커널/핸들 초기화 비용이 포함되므로 워밍업 후 측정)
c_gpu = cp.dot(a_gpu, b_gpu)
cp.cuda.Stream.null.synchronize()

start = time.time()
c_gpu = cp.dot(a_gpu, b_gpu)
cp.cuda.Stream.null.synchronize()  # GPU 작업 완료 대기
gpu_time = time.time() - start

# CPU 비교
a_cpu = cp.asnumpy(a_gpu)
b_cpu = cp.asnumpy(b_gpu)
start = time.time()
c_cpu = np.dot(a_cpu, b_cpu)
cpu_time = time.time() - start

print(f"GPU: {gpu_time:.4f}s, CPU: {cpu_time:.4f}s")
print(f"Speedup: {cpu_time / gpu_time:.1f}x")

# CuPy에서 커스텀 CUDA 커널 사용
custom_kernel = cp.RawKernel(r'''
extern "C" __global__
void relu_kernel(const float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = fmaxf(input[idx], 0.0f);
    }
}
''', 'relu_kernel')

n = 1_000_000
x = cp.random.randn(n, dtype=cp.float32)
y = cp.zeros(n, dtype=cp.float32)

block_size = 256
grid_size = (n + block_size - 1) // block_size
custom_kernel((grid_size,), (block_size,), (x, y, np.int32(n)))  # 스칼라 인자는 np.int32로 전달

# 검증
assert cp.allclose(y, cp.maximum(x, 0))

8.4 PyTorch CUDA 연산

PyTorch에서 CUDA를 활용하는 실전 패턴을 정리한다.

import torch
import torch.nn as nn

# ===== 기본 GPU 사용 =====

# GPU 사용 가능 확인
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")

# 텐서를 GPU로 이동
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 방법 1: .to(device)
x = torch.randn(1000, 1000)
x_gpu = x.to(device)

# 방법 2: .cuda()
x_gpu = x.cuda()

# 방법 3: 직접 GPU에서 생성
x_gpu = torch.randn(1000, 1000, device='cuda')

# 특정 GPU 지정
x_gpu1 = x.to('cuda:1')  # 두 번째 GPU

# ===== 메모리 관리 =====

# 메모리 사용량 확인
print(torch.cuda.memory_summary())
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# 캐시 비우기
torch.cuda.empty_cache()

# 메모리 스냅샷 (디버깅용)
torch.cuda.memory._record_memory_history()
# ... 코드 실행 ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")


# ===== Mixed Precision Training =====
from torch.amp import autocast, GradScaler

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler('cuda')

for data, target in dataloader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()

    # autocast로 FP16 자동 전환
    with autocast('cuda'):
        output = model(data)
        loss = nn.functional.mse_loss(output, target)

    # GradScaler로 FP16 그래디언트 스케일링
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Mixed Precision의 이점:
# - 메모리 사용량 약 50% 감소
# - 학습 속도 1.5~3배 향상 (Tensor Core 활용)
# - 정확도 손실 최소

8.5 TensorFlow GPU 설정

import tensorflow as tf

# GPU 확인
print("GPUs available:", tf.config.list_physical_devices('GPU'))

# GPU 메모리 점진적 할당 (권장)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# 또는 메모리 제한 설정
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=8192)]  # 8GB
)

# 특정 GPU 사용
with tf.device('/GPU:0'):
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)

# Mixed Precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')

8.6 RAPIDS (cuDF, cuML) 소개

RAPIDS는 NVIDIA가 제공하는 GPU 가속 데이터 사이언스 라이브러리 모음이며, YY.MM 형식의 버전 체계로 릴리스된다. cuML은 scikit-learn 대비 워크로드에 따라 5~175배 빠른 성능을 제공한다.

# cuDF: pandas 대체 GPU DataFrame
import cudf

# pandas와 동일한 API
df = cudf.read_csv('large_data.csv')
result = df.groupby('category').agg({'value': 'mean', 'count': 'sum'})
filtered = df[df['value'] > 100]

# pandas와 상호 변환
import pandas as pd
pdf = df.to_pandas()   # cuDF -> pandas
gdf = cudf.from_pandas(pdf)  # pandas -> cuDF


# cuML: scikit-learn 대체 GPU ML
import cuml
from cuml.ensemble import RandomForestClassifier
from cuml.cluster import KMeans

# scikit-learn과 동일한 API
rf = RandomForestClassifier(n_estimators=100, max_depth=16)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

# Zero-code-change 가속 (cuml.accel)
# scikit-learn 코드를 수정 없이 GPU로 실행
import cuml.accel
cuml.accel.install()

from sklearn.ensemble import RandomForestClassifier  # 자동으로 GPU 가속!

9. 실전 Hands-on 예제

9.1 예제 1: GPU 벡터 연산 벤치마크 (Numba)

CPU와 GPU의 벡터 연산 성능을 비교하는 실전 예제이다.

from numba import cuda, njit
import numpy as np
import time
import math

# CPU 버전 (Numba JIT)
@njit
def vector_ops_cpu(a, b, c):
    for i in range(a.shape[0]):
        c[i] = math.sqrt(a[i] ** 2 + b[i] ** 2) * math.sin(a[i]) + math.log(abs(b[i]) + 1)

# GPU 버전 (CUDA)
@cuda.jit
def vector_ops_gpu(a, b, c):
    idx = cuda.grid(1)
    if idx < a.shape[0]:
        c[idx] = math.sqrt(a[idx] ** 2 + b[idx] ** 2) * math.sin(a[idx]) + math.log(abs(b[idx]) + 1)

# 벤치마크
sizes = [100_000, 1_000_000, 10_000_000, 100_000_000]

for n in sizes:
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    c_cpu = np.zeros(n, dtype=np.float32)
    c_gpu = np.zeros(n, dtype=np.float32)

    # CPU
    start = time.time()
    vector_ops_cpu(a, b, c_cpu)
    cpu_time = time.time() - start

    # GPU
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c_gpu)

    threads = 256
    blocks = math.ceil(n / threads)

    # Warm-up
    vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
    cuda.synchronize()

    start = time.time()
    vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
    cuda.synchronize()
    gpu_time = time.time() - start

    c_gpu = d_c.copy_to_host()

    print(f"N={n:>12,}: CPU={cpu_time:.4f}s, GPU={gpu_time:.4f}s, "
          f"Speedup={cpu_time/gpu_time:.1f}x")

9.2 예제 2: 이미지 처리 가속 (CuPy)

GPU를 활용한 실시간 이미지 필터링 예제이다.

import cupy as cp
import numpy as np
from PIL import Image

def gpu_gaussian_blur(image_array, kernel_size=5, sigma=1.0):
    """GPU 가우시안 블러 구현"""
    # 가우시안 커널 생성
    ax = cp.arange(-kernel_size // 2 + 1., kernel_size // 2 + 1.)
    xx, yy = cp.meshgrid(ax, ax)
    kernel = cp.exp(-(xx**2 + yy**2) / (2. * sigma**2))
    kernel = kernel / kernel.sum()

    img_gpu = cp.asarray(image_array, dtype=cp.float32)
    result = cp.zeros_like(img_gpu)

    # 채널별 2D 직접 컨볼루션 (경계는 reflect 모드로 처리)
    from cupyx.scipy.ndimage import convolve
    for c in range(img_gpu.shape[2]):
        result[:, :, c] = convolve(img_gpu[:, :, c], kernel, mode='reflect')

    return cp.asnumpy(result.clip(0, 255).astype(cp.uint8))


# CuPy RawKernel로 직접 구현한 Sobel Edge Detection
sobel_kernel = cp.RawKernel(r'''
extern "C" __global__
void sobel_filter(const float* input, float* output,
                  int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
        // Sobel X
        float gx = -1.0f * input[(y-1)*width + (x-1)]
                   + 1.0f * input[(y-1)*width + (x+1)]
                   - 2.0f * input[y*width + (x-1)]
                   + 2.0f * input[y*width + (x+1)]
                   - 1.0f * input[(y+1)*width + (x-1)]
                   + 1.0f * input[(y+1)*width + (x+1)];

        // Sobel Y
        float gy = -1.0f * input[(y-1)*width + (x-1)]
                   - 2.0f * input[(y-1)*width + x]
                   - 1.0f * input[(y-1)*width + (x+1)]
                   + 1.0f * input[(y+1)*width + (x-1)]
                   + 2.0f * input[(y+1)*width + x]
                   + 1.0f * input[(y+1)*width + (x+1)];

        output[y * width + x] = sqrtf(gx * gx + gy * gy);
    }
}
''', 'sobel_filter')

def gpu_sobel_edge_detection(gray_image):
    """GPU Sobel 엣지 검출"""
    h, w = gray_image.shape
    img_gpu = cp.asarray(gray_image, dtype=cp.float32)
    output_gpu = cp.zeros_like(img_gpu)

    block = (16, 16)
    grid = ((w + 15) // 16, (h + 15) // 16)

    sobel_kernel(grid, block, (img_gpu, output_gpu, np.int32(w), np.int32(h)))

    return cp.asnumpy(output_gpu.clip(0, 255).astype(cp.uint8))

# 사용 예
# img = np.array(Image.open('photo.jpg'))
# blurred = gpu_gaussian_blur(img, kernel_size=11, sigma=3.0)
# gray = np.mean(img, axis=2).astype(np.float32)
# edges = gpu_sobel_edge_detection(gray)
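RawKernel이 맞게 동작하는지 확인하려면 같은 수식을 NumPy로 옮긴 CPU 참조 구현과 비교하는 것이 편하다. 경계 픽셀을 0으로 두는 커널의 동작을 그대로 따른 스케치다 (`sobel_cpu`는 설명용 이름).

```python
import numpy as np

def sobel_cpu(img):
    """RawKernel의 Sobel 계산식을 그대로 옮긴 NumPy 참조 구현 (경계 픽셀은 0)."""
    # 슬라이싱으로 (y±1, x±1) 이웃을 한 번에 참조
    gx = (-img[:-2, :-2] + img[:-2, 2:]
          - 2 * img[1:-1, :-2] + 2 * img[1:-1, 2:]
          - img[2:, :-2] + img[2:, 2:])
    gy = (-img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:]
          + img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:])
    out = np.zeros_like(img)
    out[1:-1, 1:-1] = np.sqrt(gx ** 2 + gy ** 2)
    return out
```

GPU 결과와 `np.allclose(sobel_cpu(gray), gpu_result, atol=1e-3)` 식으로 비교하면 커널 인덱싱 실수를 빠르게 잡을 수 있다.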

9.3 예제 3: PyTorch 모델 GPU 학습

완전한 PyTorch GPU 학습 파이프라인 예제이다.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.amp import autocast, GradScaler
import time

# 디바이스 설정
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# 모델 정의
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
            nn.GELU(),
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
        )

    def forward(self, x):
        return x + self.net(x)

class DeepModel(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=512, num_classes=10, num_blocks=6):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden_dim) for _ in range(num_blocks)])
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.input_proj(x)
        x = self.blocks(x)
        return self.head(x)

# 데이터 생성 (실제로는 데이터셋 로드)
X_train = torch.randn(50000, 784)
y_train = torch.randint(0, 10, (50000,))
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=4, pin_memory=True)  # pin_memory 중요!

# 모델을 GPU로 이동
model = DeepModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scaler = GradScaler('cuda')

# 학습 루프 (Mixed Precision)
def train_epoch(model, loader, criterion, optimizer, scaler, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)  # set_to_none=True가 더 효율적

        with autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

    return total_loss / len(loader), 100. * correct / total

# 학습 실행
num_epochs = 10
for epoch in range(num_epochs):
    start = time.time()
    loss, acc = train_epoch(model, train_loader, criterion, optimizer, scaler, device)
    elapsed = time.time() - start

    print(f"Epoch {epoch+1}/{num_epochs}: Loss={loss:.4f}, "
          f"Acc={acc:.2f}%, Time={elapsed:.2f}s")

    # GPU 메모리 상태 출력
    if device.type == 'cuda':
        alloc = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"  GPU Memory: {alloc:.2f} GB allocated, {reserved:.2f} GB reserved")

9.4 예제 4: Multi-GPU 학습

DataParallel (간단하지만 비효율적)

import torch
import torch.nn as nn

model = DeepModel().cuda()

# DataParallel 적용 (한 줄로 끝!)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs with DataParallel")
    model = nn.DataParallel(model)

# 이후 학습 코드는 동일
# 단점: GPU 0에 부하가 집중되는 불균형 문제

DistributedDataParallel (권장)

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size):
    setup(rank, world_size)

    # 모델을 해당 GPU에 배치
    model = DeepModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # 분산 데이터 로더
    train_dataset = TensorDataset(X_train, y_train)
    sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(train_dataset, batch_size=256,
                              sampler=sampler, num_workers=4, pin_memory=True)

    optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    scaler = GradScaler('cuda')

    for epoch in range(10):
        sampler.set_epoch(epoch)  # 매 에폭마다 셔플링
        for data, target in train_loader:
            data = data.to(rank, non_blocking=True)
            target = target.to(rank, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)
            with autocast('cuda'):
                output = ddp_model(data)
                loss = criterion(output, target)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        if rank == 0:
            print(f"Epoch {epoch+1} complete")

    cleanup()

# 실행 방법 1: torchrun --nproc_per_node=4 train_ddp.py
# 실행 방법 2: mp.spawn으로 프로세스 직접 생성 (spawn은 __main__ 가드 필요)
import torch.multiprocessing as mp

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True)
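DistributedSampler가 데이터를 어떻게 나누는지는 간단한 인덱스 산술로 확인할 수 있다. 실제 구현을 단순화한 스케치이며 (`shard_indices`는 설명용 이름), 셔플은 생략했다.

```python
import math

def shard_indices(n, world_size, rank):
    """DistributedSampler의 분배 방식을 단순화한 스케치 (셔플 생략).

    실제 구현처럼 랭크별 길이를 맞추기 위해 앞부분 인덱스로 순환 패딩한 뒤,
    rank부터 world_size 간격으로 인덱스를 가져간다.
    """
    per_rank = math.ceil(n / world_size)
    padding = per_rank * world_size - n
    padded = list(range(n)) + list(range(padding))
    return padded[rank::world_size]

shards = [shard_indices(10, 3, r) for r in range(3)]
assert all(len(s) == 4 for s in shards)                   # 랭크별 길이 동일
assert {i for s in shards for i in s} == set(range(10))   # 전체 데이터 커버
```

랭크별 길이가 같아야 각 프로세스의 스텝 수가 일치하고, 패딩 때문에 일부 샘플이 중복될 수 있다는 점도 이 스케치에서 드러난다.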

9.5 예제 5: CUDA 커널 직접 작성 (PyCUDA)

실전에서 유용한 Reduction(합산) 커널을 PyCUDA로 구현한다.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

# Parallel Reduction 커널
mod = SourceModule("""
__global__ void parallel_reduce(float *input, float *output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // 첫 번째 레벨 리덕션 (글로벌 로드 시 수행)
    float sum = 0.0f;
    if (i < n) sum += input[i];
    if (i + blockDim.x < n) sum += input[i + blockDim.x];
    sdata[tid] = sum;

    __syncthreads();

    // Shared Memory에서 리덕션
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Warp 단위 리덕션: Volta(CC 7.0) 이후에는 Warp 내 lock-step 실행이
    // 보장되지 않으므로 단계마다 __syncwarp()로 동기화해야 안전하다
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32]; __syncwarp();
        smem[tid] += smem[tid + 16]; __syncwarp();
        smem[tid] += smem[tid + 8];  __syncwarp();
        smem[tid] += smem[tid + 4];  __syncwarp();
        smem[tid] += smem[tid + 2];  __syncwarp();
        smem[tid] += smem[tid + 1];  __syncwarp();
    }

    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
""")

reduce_kernel = mod.get_function("parallel_reduce")

# 실행
n = 1 << 20  # 1M 원소
data = np.random.randn(n).astype(np.float32)

block_size = 256
grid_size = (n + block_size * 2 - 1) // (block_size * 2)

d_input = cuda.mem_alloc(data.nbytes)
d_output = cuda.mem_alloc(grid_size * 4)  # float32

cuda.memcpy_htod(d_input, data)

reduce_kernel(d_input, d_output, np.int32(n),
              block=(block_size, 1, 1), grid=(grid_size, 1),
              shared=block_size * 4)

# 부분 합계 가져오기
partial_sums = np.zeros(grid_size, dtype=np.float32)
cuda.memcpy_dtoh(partial_sums, d_output)
gpu_sum = partial_sums.sum()

# 검증
cpu_sum = data.sum()
print(f"GPU sum: {gpu_sum:.6f}")
print(f"CPU sum: {cpu_sum:.6f}")
print(f"Difference: {abs(gpu_sum - cpu_sum):.6f}")
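커널의 `s >>= 1` 루프가 왜 전체 합을 내는지는 CPU에서 같은 패턴을 돌려 보면 분명해진다. 2의 거듭제곱 길이를 가정한 검증용 스케치다 (`tree_reduce_cpu`는 설명용 이름).

```python
import numpy as np

def tree_reduce_cpu(x):
    """커널의 s = blockDim/2, s >>= 1 패턴을 그대로 옮긴 트리 리덕션.

    매 단계에서 배열의 뒤쪽 절반을 앞쪽 절반에 더해 길이를 반으로 줄인다.
    길이가 2의 거듭제곱이라고 가정한다.
    """
    data = x.astype(np.float64).copy()
    s = data.size // 2
    while s >= 1:
        data[:s] += data[s:2 * s]
        s //= 2
    return data[0]

x = np.random.randn(1 << 10)
assert np.isclose(tree_reduce_cpu(x), x.sum())
```

log2(N)번의 단계로 합이 완성되므로, GPU에서는 단계 수만큼만 동기화하면 된다는 점이 이 패턴의 핵심이다.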

10. CUDA 관련 도구 및 라이브러리

10.1 핵심 라이브러리 생태계

| 라이브러리 | 용도 | 핵심 기능 |
|---|---|---|
| cuDNN | 딥러닝 | Convolution, RNN, Attention, BatchNorm 등 DNN 기본 연산 최적화 |
| cuBLAS | 선형대수 | GEMM, TRSM 등 BLAS 레벨 1/2/3 연산. FP8/BF16 Group GEMM 지원 (Blackwell) |
| NCCL | Multi-GPU 통신 | AllReduce, AllGather, Broadcast 등 집합 통신. NVLink/NVSwitch 최적화 |
| TensorRT | 추론 최적화 | 그래프 최적화, 양자화 (INT8/FP8), 레이어 퓨전, 동적 배치 |
| Triton | GPU 프로그래밍 | Python DSL로 고성능 커널 작성. OpenAI 개발 |
| CUTLASS | 커스텀 GEMM | 템플릿 기반 CUDA 행렬 곱셈 라이브러리 |
| FlashAttention | Attention 가속 | IO-aware 알고리즘으로 Transformer Attention 가속 |
| cuSPARSE | 희소 행렬 | 희소 행렬 연산 최적화 |
| cuRAND | 난수 생성 | GPU 가속 의사/준난수 생성 |
| cuFFT | FFT | GPU 가속 고속 푸리에 변환 |

10.2 cuDNN

# cuDNN은 PyTorch/TensorFlow에 내장되어 자동 사용됨

# PyTorch에서 cuDNN 벤치마크 모드 활성화
import torch
torch.backends.cudnn.benchmark = True   # 최적 알고리즘 자동 선택
torch.backends.cudnn.deterministic = False  # 성능 우선 (재현성 불필요 시)

# cuDNN 버전 확인
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")

10.3 NCCL (Multi-GPU 통신)

# PyTorch에서 NCCL 사용 (DistributedDataParallel 내부)
import torch.distributed as dist

# NCCL 백엔드 초기화
dist.init_process_group(backend='nccl')

# 주요 집합 통신 연산
# AllReduce: 모든 GPU의 텐서를 합산하여 각 GPU에 분배
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

# AllGather: 모든 GPU의 텐서를 모아서 각 GPU에 전체 복사
gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
dist.all_gather(gathered, tensor)

# Broadcast: 한 GPU의 텐서를 모든 GPU에 복사
dist.broadcast(tensor, src=0)
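AllReduce의 의미(모든 랭크가 동일한 합산 결과를 받는다)를 NumPy로 표현하면 다음과 같다. 통신 없이 의미만 보여 주는 스케치이며, 함수 이름은 설명용이다.

```python
import numpy as np

def all_reduce_sum(rank_tensors):
    """AllReduce(SUM)의 의미 스케치: 모든 랭크의 텐서를 합산해 각 랭크에 분배."""
    total = np.sum(rank_tensors, axis=0)
    # 실제 NCCL은 링/트리 알고리즘으로 통신하지만, 결과는 전 랭크 동일한 합
    return [total.copy() for _ in rank_tensors]

tensors = [np.full(4, r, dtype=np.float32) for r in range(3)]  # rank 0, 1, 2
out = all_reduce_sum(tensors)
assert all(np.array_equal(t, np.array([3, 3, 3, 3], dtype=np.float32)) for t in out)
```

DDP가 backward 중 각 랭크의 그래디언트에 바로 이 연산을 적용해, 모든 랭크가 같은 파라미터 업데이트를 수행하도록 만든다.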

10.4 TensorRT

# TensorRT를 활용한 추론 최적화
import tensorrt as trt
import torch

# PyTorch 모델을 ONNX로 변환
model = MyModel().cuda().eval()
dummy_input = torch.randn(1, 3, 224, 224).cuda()
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=['input'], output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size'}})

# torch_tensorrt를 사용한 간편 변환 (PyTorch 2.x)
import torch_tensorrt

optimized_model = torch_tensorrt.compile(model,
    inputs=[torch_tensorrt.Input(
        min_shape=[1, 3, 224, 224],
        opt_shape=[8, 3, 224, 224],
        max_shape=[32, 3, 224, 224],
        dtype=torch.float16
    )],
    enabled_precisions={torch.float16},
)

# 추론
with torch.no_grad():
    output = optimized_model(input_tensor.half().cuda())

# TensorRT 성능 향상 (일반적):
# - FP32 -> FP16: 2~3배 빠름
# - FP32 -> INT8: 3~5배 빠름
# - 레이어 퓨전 + 최적화: 추가 20~50% 향상

10.5 OpenAI Triton

Triton은 Python으로 고성능 GPU 커널을 작성할 수 있는 프로그래밍 언어이다.

import triton
import triton.language as tl
import torch

# Triton 커널: 벡터 덧셈
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y

    tl.store(output_ptr + offsets, output, mask=mask)

# 사용
def triton_add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

# 실행
x = torch.randn(1_000_000, device='cuda')
y = torch.randn(1_000_000, device='cuda')
result = triton_add(x, y)
assert torch.allclose(result, x + y)

# Triton의 장점:
# - CUDA C보다 높은 수준의 추상화
# - 자동 메모리 접근 최적화
# - Python 문법으로 고성능 커널 작성
# - PyTorch 2.0 torch.compile의 백엔드
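grid 람다에 쓰인 `triton.cdiv`는 남는 원소까지 덮도록 올림 나눗셈으로 프로그램 인스턴스 수를 정한다. 동작을 분명히 하기 위한 순수 파이썬 스케치다.

```python
def cdiv(n: int, d: int) -> int:
    """triton.cdiv와 같은 올림 나눗셈: 마지막 블록이 나머지 원소를 처리하도록 그리드를 잡는다."""
    return (n + d - 1) // d

# BLOCK_SIZE=1024로 1,000,000개 원소를 덮으려면 977개의 프로그램 인스턴스가 필요
assert cdiv(1_000_000, 1024) == 977
```

마지막 인스턴스는 원소 수가 BLOCK_SIZE에 못 미치므로, 커널 내부의 `mask = offsets < n_elements`가 범위 밖 접근을 걸러 준다.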

10.6 FlashAttention

FlashAttention은 Transformer의 Self-Attention 연산을 IO-aware 알고리즘으로 가속한다.

# FlashAttention-3 (H100 최적화)
# 설치: pip install flash-attn

from flash_attn import flash_attn_func

# 입력: (batch, seqlen, nheads, headdim)
q = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
k = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
v = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)

# FlashAttention 실행
output = flash_attn_func(q, k, v, causal=True)

# 성능 비교 (FlashAttention-3 vs 표준 Attention):
# - 메모리: O(N^2) -> O(N) (시퀀스 길이 N)
# - 속도: 2~4x 빠름 (H100 FP16 기준 최대 740 TFLOPS)
# - FP8: 최대 1.2 PFLOPS (H100)
# - FlashAttention-3는 FlashAttention-2 대비 1.5~2.0x 추가 향상

# PyTorch 2.0+ 내장 scaled_dot_product_attention
output = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
    is_causal=True
)
# PyTorch가 자동으로 FlashAttention 또는 Memory-Efficient Attention 선택
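FlashAttention이 최적화하는 대상인 scaled dot-product attention의 수학적 정의를 NumPy로 옮겨 보면 causal 마스크의 의미가 분명해진다. 단일 헤드 기준의 검증용 스케치다 (`sdpa_cpu`는 설명용 이름).

```python
import numpy as np

def sdpa_cpu(q, k, v, causal=True):
    """scaled dot-product attention의 수식을 그대로 옮긴 단일 헤드 NumPy 참조.

    입력 shape: (seqlen, headdim). IO 최적화 없이 의미만 보여 주는 스케치다.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        seq = q.shape[0]
        mask = np.tril(np.ones((seq, seq), dtype=bool))  # 미래 토큰 차단
        scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # 수치 안정화
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out = sdpa_cpu(q, k, v, causal=True)
assert np.allclose(out[0], v[0])  # 첫 토큰은 자기 자신만 볼 수 있다
```

FlashAttention은 이 수식의 softmax를 타일 단위로 온라인 계산해, N×N attention 행렬을 HBM에 쓰지 않는 것이 핵심이다.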

11. CUDA 트러블슈팅

11.1 CUDA OOM (Out of Memory) 해결

GPU 메모리 부족은 가장 흔한 CUDA 문제이다.

# 에러: RuntimeError: CUDA out of memory.

# 해결법 1: 배치 크기 줄이기
# batch_size = 64 -> 32 -> 16

# 해결법 2: Mixed Precision 사용 (메모리 약 50% 절약)
from torch.amp import autocast, GradScaler
scaler = GradScaler('cuda')

# 해결법 3: Gradient Accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(loader):
    with autocast('cuda'):
        output = model(data)
        loss = criterion(output, target) / accumulation_steps
    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)

# 해결법 4: Gradient Checkpointing (메모리-연산 트레이드오프)
from torch.utils.checkpoint import checkpoint
class MemEfficientModel(nn.Module):
    def forward(self, x):
        # 중간 활성화를 저장하지 않고 역전파 시 재계산
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return x

# 해결법 5: 캐시 비우기
torch.cuda.empty_cache()

# 해결법 6: 메모리 누수 디버깅
print(torch.cuda.memory_summary())

# 해결법 7: 불필요한 텐서 즉시 삭제
del large_tensor
torch.cuda.empty_cache()

# 해결법 8: 추론 시 그래디언트 비활성화
with torch.no_grad():
    output = model(input_data)

11.2 CUDA 드라이버 vs 런타임 버전 불일치

# 문제: "CUDA driver version is insufficient for CUDA runtime version"

# 확인 방법
nvidia-smi           # 드라이버 지원 CUDA 버전 확인
nvcc --version       # 설치된 CUDA Toolkit 버전 확인

# 규칙: 드라이버 CUDA >= Toolkit CUDA 여야 함
# 예: 드라이버가 CUDA 12.2를 지원하면 CUDA 13.0 Toolkit 사용 불가

# 해결법 1: 드라이버 업데이트
sudo apt-get update
sudo apt-get install nvidia-driver-560  # 최신 드라이버 설치

# 해결법 2: 호환되는 CUDA Toolkit으로 다운그레이드
sudo apt-get install cuda-toolkit-12-4

# 해결법 3: PyTorch/TensorFlow에서 맞는 CUDA 빌드 설치
pip install torch --index-url https://download.pytorch.org/whl/cu124
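'드라이버 CUDA >= Toolkit CUDA' 규칙은 버전 문자열을 숫자 튜플로 바꿔 비교하면 코드로도 간단히 표현된다 (함수 이름은 설명용 가정).

```python
def driver_supports_toolkit(driver_cuda: str, toolkit_cuda: str) -> bool:
    """'드라이버 CUDA >= Toolkit CUDA' 규칙을 버전 튜플 비교로 표현한 스케치."""
    # "12.10" > "12.9" 같은 케이스를 위해 문자열이 아닌 숫자 튜플로 비교
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return to_tuple(driver_cuda) >= to_tuple(toolkit_cuda)

assert driver_supports_toolkit("12.2", "12.2")      # 동일 버전: 사용 가능
assert not driver_supports_toolkit("12.2", "13.0")  # 본문 예시: 사용 불가
```

nvidia-smi가 보여 주는 값이 드라이버 쪽, nvcc --version이 보여 주는 값이 Toolkit 쪽에 해당한다.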

11.3 nvidia-smi 주요 명령어 정리

| 명령어 | 용도 |
|---|---|
| `nvidia-smi` | GPU 상태 요약 |
| `nvidia-smi -l 1` | 1초 간격 갱신 |
| `nvidia-smi -q` | 상세 정보 |
| `nvidia-smi -q -d MEMORY` | 메모리 상세 |
| `nvidia-smi -q -d CLOCK` | 클럭 속도 |
| `nvidia-smi -q -d TEMPERATURE` | 온도 정보 |
| `nvidia-smi -q -d POWER` | 전력 소비 |
| `nvidia-smi -q -d PERFORMANCE` | 성능 상태 |
| `nvidia-smi --query-gpu=... --format=csv` | CSV 출력 |
| `nvidia-smi pmon -i 0` | 프로세스 모니터링 |
| `nvidia-smi dmon -d 1` | 디바이스 모니터링 |
| `nvidia-smi topo -m` | GPU 토폴로지 (NVLink 등) |
| `nvidia-smi -r -i 0` | GPU 0 리셋 |
| `nvidia-smi -pm 1` | Persistence Mode 활성화 |
| `nvidia-smi -pl 300` | 전력 제한 설정 (W) |

11.4 cuda-gdb 디버깅

# CUDA 디버그 빌드
nvcc -g -G -O0 -o debug_app app.cu

# cuda-gdb 실행
cuda-gdb ./debug_app

# 주요 명령어
(cuda-gdb) break myKernel            # 커널에 브레이크포인트
(cuda-gdb) run                        # 실행
(cuda-gdb) cuda thread               # 현재 스레드 정보
(cuda-gdb) cuda block                 # 현재 블록 정보
(cuda-gdb) cuda kernel                # 현재 커널 정보
(cuda-gdb) cuda thread (0,0,0)        # 특정 스레드로 전환
(cuda-gdb) cuda block (1,0,0)         # 특정 블록으로 전환
(cuda-gdb) info cuda threads          # 활성 스레드 목록
(cuda-gdb) print threadIdx.x          # 내장 변수 출력
(cuda-gdb) print data[idx]            # 데이터 확인

# compute-sanitizer (메모리 에러 검출)
compute-sanitizer --tool memcheck ./my_app
compute-sanitizer --tool racecheck ./my_app   # 경쟁 조건 검출
compute-sanitizer --tool initcheck ./my_app   # 초기화 검사

11.5 일반적인 에러 메시지와 해결법

| 에러 메시지 | 원인 | 해결법 |
|---|---|---|
| CUDA error: out of memory | GPU 메모리 부족 | 배치 크기 줄이기, Mixed Precision, Gradient Checkpointing |
| CUDA error: device-side assert triggered | 커널 내 assert 실패 또는 인덱스 초과 | CUDA_LAUNCH_BLOCKING=1로 실행하여 정확한 위치 확인 |
| CUDA error: an illegal memory access | 잘못된 메모리 접근 | compute-sanitizer로 디버깅, 인덱스 범위 확인 |
| CUDA error: no kernel image is available | Compute Capability 불일치 | 올바른 -arch=sm_XX 옵션으로 재컴파일 |
| CUDA driver version is insufficient | 드라이버 버전 낮음 | 드라이버 업데이트 또는 CUDA Toolkit 다운그레이드 |
| CUDA error: invalid device function | 잘못된 아키텍처 타겟 | -gencode 옵션 확인, Fat Binary 사용 |
| cuDNN error: CUDNN_STATUS_NOT_SUPPORTED | cuDNN 버전 불일치 또는 미지원 연산 | cuDNN 업데이트 또는 입력 형식 확인 |
| NCCL error: unhandled system error | Multi-GPU 통신 실패 | 네트워크 설정 확인, NCCL_DEBUG=INFO 활성화 |
# 디버깅 팁: 동기 실행 모드로 전환하여 정확한 에러 위치 파악
CUDA_LAUNCH_BLOCKING=1 python train.py

# NCCL 디버깅
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch train.py

# PyTorch CUDA 메모리 디버깅
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py

12. CUDA 대안 기술 비교

12.1 주요 GPU 컴퓨팅 프레임워크

| 기술 | 개발사 | 대상 하드웨어 | 언어 | 생태계 성숙도 | 주요 용도 |
|---|---|---|---|---|---|
| CUDA | NVIDIA | NVIDIA GPU | C/C++/Python | 매우 높음 | AI/ML, HPC, 과학 컴퓨팅 |
| ROCm | AMD | AMD GPU (MI300X 등) | C/C++ (HIP) | 중간 | AI 학습/추론 (PyTorch 지원) |
| OpenCL | Khronos | 범용 (GPU/CPU/FPGA) | C/C++ | 중간 | 크로스 플랫폼 GPU 컴퓨팅 |
| SYCL | Khronos | 범용 | C++ | 성장 중 | oneAPI (Intel), 크로스 플랫폼 |
| Metal | Apple | Apple Silicon | Swift/Obj-C/C++ | Apple 생태계 | macOS/iOS GPU 컴퓨팅 |
| Vulkan Compute | Khronos | 범용 GPU | C/GLSL/SPIR-V | 중간 | 크로스 플랫폼 GPU 연산 |
| Triton | OpenAI | NVIDIA/AMD GPU | Python | 성장 중 | 고수준 GPU 커널 프로그래밍 |
| WebGPU | W3C | 브라우저 GPU | WGSL/JS | 초기 | 웹 기반 GPU 연산 |

12.2 CUDA vs ROCm 상세 비교

CUDA 코드:
  cudaMalloc(&d_ptr, size);
  myKernel<<<grid, block>>>(d_ptr);
  cudaDeviceSynchronize();

ROCm HIP 코드 (거의 동일한 API):
  hipMalloc(&d_ptr, size);
  myKernel<<<grid, block>>>(d_ptr);
  hipDeviceSynchronize();

AMD ROCm의 HIP(Heterogeneous-compute Interface for Portability)는 CUDA와 매우 유사한 API를 제공하며, hipify-perl 도구를 사용하면 CUDA 코드를 HIP 코드로 자동 변환할 수 있다. PyTorch는 ROCm을 공식 지원하여 AMD MI300X 등에서 학습이 가능하다.

12.3 선택 가이드

| 상황 | 권장 기술 | 이유 |
|---|---|---|
| AI/ML 학습 및 추론 | CUDA | 최적화된 라이브러리 생태계 (cuDNN, TensorRT) |
| AMD GPU 사용 | ROCm (HIP) | CUDA 호환 API, PyTorch 지원 |
| 크로스 플랫폼 요구 | OpenCL 또는 SYCL | 다양한 하드웨어 지원 |
| Apple 환경 | Metal | macOS/iOS 유일한 선택지 |
| 커스텀 커널 (Python) | Triton | CUDA C보다 높은 생산성 |
| 웹 브라우저 GPU | WebGPU | 표준 웹 API |

13. 참고 자료

13.1 공식 문서

  • NVIDIA CUDA Programming Guide: GPU 아키텍처와 프로그래밍 모델의 공식 레퍼런스
  • CUDA C++ Best Practices Guide: 성능 최적화를 위한 모범 사례 모음
  • CUDA Toolkit Release Notes: 각 버전별 변경사항 및 호환성 정보
  • NVIDIA Developer Blog: 최신 기술 동향과 튜토리얼

13.2 주요 참고 링크

| 자료 | URL |
|---|---|
| CUDA Toolkit 다운로드 | https://developer.nvidia.com/cuda-downloads |
| CUDA Programming Guide | https://docs.nvidia.com/cuda/cuda-c-programming-guide/ |
| CUDA Best Practices | https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ |
| cuDNN 문서 | https://docs.nvidia.com/deeplearning/cudnn/ |
| TensorRT 문서 | https://docs.nvidia.com/deeplearning/tensorrt/ |
| NCCL 문서 | https://docs.nvidia.com/deeplearning/nccl/ |
| Nsight Compute | https://developer.nvidia.com/nsight-compute |
| Nsight Systems | https://developer.nvidia.com/nsight-systems |
| RAPIDS 공식 사이트 | https://rapids.ai/ |
| FlashAttention GitHub | https://github.com/Dao-AILab/flash-attention |
| OpenAI Triton GitHub | https://github.com/openai/triton |
| PyTorch CUDA 문서 | https://pytorch.org/docs/stable/cuda.html |

13.3 추천 학습 경로

입문자:
  1. nvidia-smi로 GPU 확인
  2. PyTorch에서 .cuda() 사용
  3. Mixed Precision Training 적용
  4. Numba CUDA JIT으로 간단한 커널 작성

중급자:
  5. CUDA C/C++로 커널 직접 작성
  6. Shared Memory, Coalesced Access 최적화
  7. CUDA Stream으로 비동기 실행
  8. Nsight Compute로 프로파일링

고급자:
  9. Triton으로 커스텀 Attention 커널
  10. TensorRT로 추론 최적화
  11. Multi-GPU DDP 학습
  12. CUTLASS로 커스텀 GEMM 구현

마무리

CUDA는 단순한 GPU 프로그래밍 도구를 넘어 현대 AI 인프라의 근간을 이루는 플랫폼이다. Ampere에서 Hopper, 그리고 Blackwell로 이어지는 하드웨어 진화와 함께 CUDA Toolkit도 13.x 시대에 진입하며 CUDA Tile, cuTile DSL 같은 높은 수준의 추상화를 제공하기 시작했다.

실무에서 CUDA를 효과적으로 활용하기 위한 핵심 포인트를 정리하면 다음과 같다.

  1. 기초를 탄탄히: Warp, Block, Grid의 실행 모델과 메모리 계층을 정확히 이해해야 최적화가 가능하다
  2. Python 도구 적극 활용: PyTorch의 Mixed Precision, CuPy, Numba 등으로 대부분의 GPU 가속을 달성할 수 있다
  3. 프로파일링 우선: Nsight Compute/Systems로 병목 지점을 먼저 파악하고 최적화에 투자하라
  4. 메모리가 핵심: Coalesced Access, Shared Memory Tiling, Pinned Memory 등 메모리 최적화가 성능 향상의 80%를 차지한다
  5. 에코시스템 활용: cuDNN, TensorRT, FlashAttention 등 이미 최적화된 라이브러리를 최대한 활용하라

GPU 컴퓨팅의 세계는 계속 진화하고 있다. Triton 같은 고수준 프로그래밍 모델의 등장, FP4/FP6 같은 저정밀도 연산의 확대, 그리고 CUDA Tile로 대표되는 타일 기반 프로그래밍의 도입은 GPU 프로그래밍의 진입 장벽을 낮추면서도 하드웨어 성능을 최대한 끌어낼 수 있는 방향으로 나아가고 있다. 이 가이드가 그 여정의 출발점이 되기를 바란다.

CUDA Hands-on Complete Guide: Everything About GPU Computing


1. Introduction: What is CUDA

1.1 Definition of CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables general-purpose computing (GPGPU) by leveraging thousands of GPU cores. Since its initial release in 2007, it has become the de facto standard in various fields including AI, scientific simulation, image processing, and financial modeling.

As of 2025, the latest CUDA Toolkit is version 13.1, which introduces CUDA Tile, a tile-based programming model, and cuTile Python DSL, providing Tensor Core abstraction and forward compatibility for next-generation Blackwell GPUs.

1.2 History of GPU Computing and the Emergence of CUDA

The evolution of GPU computing can be summarized by era as follows.

| Period | Event | Significance |
|---|---|---|
| 2001 | GPGPU research begins | General-purpose computing attempts using shaders |
| 2006 | NVIDIA Tesla architecture announced | First unified shader architecture |
| 2007 | CUDA 1.0 released | Birth of GPU general-purpose computing |
| 2012 | AlexNet (ImageNet) | Dawn of GPU deep learning era |
| 2017 | Volta + Tensor Core | Mixed precision computation acceleration |
| 2020 | Ampere (A100) | TF32, Sparsity support |
| 2022 | Hopper (H100) | Transformer Engine, FP8 support |
| 2024 | Blackwell (B200) | FP4, 5th gen Tensor Core |
| 2025 | CUDA 13.0/13.1 released | CUDA Tile, cuTile DSL introduction |

1.3 CPU vs GPU Architecture Comparison

CPUs and GPUs have fundamentally different design philosophies.

| Attribute | CPU | GPU |
|---|---|---|
| Design goal | Low latency (serial processing) | High throughput (parallel processing) |
| Core count | A few to dozens | Thousands to tens of thousands |
| Clock speed | High (4-6 GHz) | Relatively low (1-2 GHz) |
| Cache size | Large (tens of MB) | Small (per core) |
| Control logic | Complex (branch prediction, OoO execution) | Simple (latency hiding through many threads) |
| Parallelism model | SIMD (Single Instruction Multiple Data) | SIMT (Single Instruction Multiple Threads) |
| Optimal workload | Complex branching, sequential logic | Large-scale data-parallel computation |

SIMD vs SIMT key difference: SIMD is a vector operation method that processes multiple data simultaneously with a single instruction. SIMT goes a step further, where each thread has its own program counter while executing the same instruction simultaneously. When branching occurs, threads within a Warp can execute different paths, but this causes Warp Divergence, which degrades performance.

1.4 Why CUDA is Essential for AI/ML/Deep Learning

The reasons CUDA is central to modern AI/ML workloads are clear.

  • Matrix computation acceleration: Dramatically accelerates matrix multiplication (GEMM), the core of deep learning, with Tensor Cores
  • Software ecosystem: Optimized library ecosystem including cuDNN, cuBLAS, NCCL, TensorRT
  • Framework support: All major frameworks (PyTorch, TensorFlow, JAX) use CUDA as their default backend
  • Mixed precision: Maximizes training/inference speed with low-precision computation down to FP16, BF16, FP8, and FP4
  • Multi-GPU scaling: Distributed training across hundreds of GPUs via NVLink and NVSwitch

2. GPU Architecture Fundamentals

2.1 NVIDIA GPU Internal Structure

Let us examine the core components of an NVIDIA GPU hierarchically.

GPU (GPC - Graphics Processing Cluster)
├── SM (Streaming Multiprocessor) x N
│   ├── CUDA Core (FP32/INT32) x 128 (Hopper)
│   ├── Tensor Core x 4 (4th gen, Hopper)
│   ├── RT Core (Ray Tracing) x 1
│   ├── Warp Scheduler x 4
│   ├── Register File (256 KB)
│   ├── L1 Cache / Shared Memory (shared, up to 228 KB)
│   └── SFU (Special Function Unit)
├── L2 Cache (shared)
├── Memory Controller
└── HBM (High Bandwidth Memory)

Key core types:

  • CUDA Core: The basic processing unit for general-purpose floating-point/integer operations. Handles FP32, FP64, INT32 operations
  • Tensor Core: A core specialized for Matrix Multiply-Accumulate (MMA) operations. The key accelerator for deep learning training/inference
  • RT Core: Dedicated hardware for Ray Tracing acceleration. Primarily used for graphics workloads

2.2 Memory Hierarchy

GPU memory is organized hierarchically by speed and size.

| Memory Type | Location | Size (H100) | Bandwidth | Access Scope | Characteristics |
|---|---|---|---|---|---|
| Register | Inside SM | 256 KB/SM | Fastest | Thread-private | Fastest but limited |
| Shared Memory | Inside SM | Up to 228 KB/SM | Very fast | Shared within Block | Programmer-managed cache |
| L1 Cache | Inside SM | Shared with Shared Memory | Very fast | SM-private | HW-managed |
| L2 Cache | GPU-wide | 50 MB | Fast | Shared across all SMs | HW-managed |
| Global Memory (HBM) | Off-GPU | 80 GB | 3.35 TB/s | Globally accessible | Largest but highest latency |
| Constant Memory | GPU-wide | 64 KB | Fast when cached | Read-only | Broadcast-optimized |
| Texture Memory | GPU-wide | Shared with Global | Fast when cached | Read-only | 2D spatial locality optimized |
Speed:  Register > Shared/L1 > L2 > Global (HBM)
Size:   Global (HBM) > L2 > Shared/L1 > Register

2.3 Warp, Block, Grid Concepts

CUDA's execution model has a 3-level hierarchical structure.

Grid (kernel execution unit)
├── Block (0,0)          Block (1,0)          Block (2,0)
│   ├── Warp 0           ├── Warp 0           ├── Warp 0
│   │   ├── Thread 0     │   ├── Thread 0     │   ├── Thread 0
│   │   ├── Thread 1     │   ├── Thread 1     │   ├── Thread 1
│   │   ├── ...          │   ├── ...          │   ├── ...
│   │   └── Thread 31    │   └── Thread 31    │   └── Thread 31
│   ├── Warp 1           ├── Warp 1           ├── Warp 1
│   └── ...              └── ...              └── ...
├── Block (0,1)          Block (1,1)          ...
└── ...
  • Thread: The smallest execution unit of GPU computation
  • Warp: An execution group of 32 Threads. The unit that executes the same instruction simultaneously on an SM (the core of SIMT)
  • Block (Thread Block): Composed of multiple Warps. Executes on the same SM and shares Shared Memory
  • Grid: The collection of Blocks that make up the entire kernel execution

Key constraints:

| Item | Limit (Compute Capability 9.0) |
|---|---|
| Max Threads per Block | 1,024 |
| Warp size | 32 (fixed) |
| Max Block dimensions | (1024, 1024, 64) |
| Max Grid dimensions | (2^31-1, 65535, 65535) |
| Max Blocks per SM | 32 |
| Max Warps per SM | 64 |
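The constraints above translate directly into launch-configuration arithmetic. A minimal Python sketch (limit values hardcoded here as assumptions matching the table):

```python
# Launch-configuration math checked against CC 9.0-style limits (assumed values).
MAX_THREADS_PER_BLOCK = 1024
MAX_GRID_X = 2**31 - 1

def launch_config(n, threads_per_block=256):
    """Return (blocks, threads) covering n elements, validating hardware limits."""
    assert 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK
    assert threads_per_block % 32 == 0, "use a multiple of the warp size"
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    assert blocks <= MAX_GRID_X
    return blocks, threads_per_block

print(launch_config(1 << 20))  # (4096, 256): 1M elements in 256-thread blocks
```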

2.4 Compute Capability Version Differences

Compute Capability (CC) defines the feature set of GPU hardware.

| CC | Architecture | Representative GPU | Key Features |
|---|---|---|---|
| 7.0 | Volta | V100 | 1st gen Tensor Core, independent thread scheduling |
| 7.5 | Turing | RTX 2080 | INT8/INT4 Tensor Core, RT Core |
| 8.0 | Ampere | A100 | 3rd gen Tensor Core, TF32, Sparsity |
| 8.6 | Ampere | RTX 3090 | Consumer Ampere |
| 8.9 | Ada Lovelace | RTX 4090 | 4th gen Tensor Core, FP8, DLSS 3 |
| 9.0 | Hopper | H100 | 4th gen Tensor Core, Transformer Engine, DPX |
| 10.0 | Blackwell | B200 | 5th gen Tensor Core, FP4, TMEM |
| 12.0 | Blackwell Ultra | B300 | Enhanced 5th gen Tensor Core, 288 GB HBM3E |

Note that starting with CUDA 13.0, support for Maxwell (CC 5.x), Pascal (CC 6.x), and Volta (CC 7.0) has been removed.

2.5 Latest GPU Generation Comparison

Comparing three generations of data center GPUs.

| Spec | A100 (SXM) | H100 (SXM5) | B200 (SXM) | B300 (SXM) |
|---|---|---|---|---|
| Architecture | Ampere | Hopper | Blackwell | Blackwell Ultra |
| CUDA Cores | 6,912 | 16,896 | 18,432 | 18,432+ |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) | 5th gen | 5th gen |
| Memory | 80 GB HBM2e | 80 GB HBM3 | 180 GB HBM3E | 288 GB HBM3E |
| Memory BW | 2.0 TB/s | 3.35 TB/s | 7.7 TB/s | 8.0 TB/s |
| FP32 Perf | 19.5 TFLOPS | 60 TFLOPS | Undisclosed | Undisclosed |
| FP16 Tensor | 312 TFLOPS | 990 TFLOPS | Undisclosed | Undisclosed |
| FP4 Tensor | Not supported | Not supported | 9.0 PFLOPS | 14.0 PFLOPS |
| NVLink | 3rd gen (600 GB/s) | 4th gen (900 GB/s) | 5th gen (1.8 TB/s) | 5th gen (1.8 TB/s) |
| TDP | 400W | 700W | 1,000W | 1,400W |
| Cooling | Air/Liquid | Air/Liquid | Liquid recommended | Liquid required (DLC) |

The B200 achieves 3x training performance and 15x inference performance over the A100. The B300 (Blackwell Ultra) provides 14 PFLOPS in FP4 operations, 55.6% faster than the B200.


3. CUDA Development Environment Setup

3.1 CUDA Toolkit Installation

Linux (Ubuntu/Debian)

# 1. Check NVIDIA driver
nvidia-smi

# 2. Add CUDA Keyring (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# 3. Install CUDA Toolkit
sudo apt-get install cuda-toolkit-13-1

# 4. Set environment variables
export PATH=/usr/local/cuda-13.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.1/lib64:$LD_LIBRARY_PATH

# 5. Verify installation
nvcc --version

Windows

# 1. Download CUDA Toolkit from the NVIDIA official site
# https://developer.nvidia.com/cuda-downloads

# 2. Check environment variables after installation
echo %CUDA_PATH%
# Typically C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1

# 3. Verify
nvcc --version
nvidia-smi

3.2 nvidia-smi Command Usage

nvidia-smi is the essential tool for monitoring GPU status.

# Basic info output
nvidia-smi

# Output example:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 560.35.03    Driver Version: 560.35.03    CUDA Version: 13.1                 |
# |-----------------------------------------+------------------------+----------------------|
# | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
# |=========================================+========================+======================|
# |   0  NVIDIA H100 80GB HBM3          On | 00000000:3B:00.0   Off |                    0 |
# | N/A   32C    P0              72W / 700W |    1234MiB / 81559MiB  |      0%      Default |
# +-----------------------------------------+------------------------+----------------------+

# Continuous monitoring (1 second interval)
nvidia-smi -l 1

# dmon - GPU metrics monitoring
nvidia-smi dmon -s pucvmet -d 1

# Specific GPU info
nvidia-smi -i 0 -q

# GPU process monitoring
nvidia-smi pmon -i 0

# GPU clock and memory info
nvidia-smi -q -d CLOCK,MEMORY

# CSV format output (for scripting)
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
  --format=csv,noheader,nounits

3.3 nvcc Compiler Usage

nvcc is the NVIDIA CUDA Compiler for compiling CUDA source code.

# Basic compilation
nvcc -o hello hello.cu

# Target specific architecture
nvcc -arch=sm_90 -o kernel kernel.cu   # Hopper H100
nvcc -arch=sm_80 -o kernel kernel.cu   # Ampere A100

# Debug mode
nvcc -g -G -o debug_kernel kernel.cu

# Optimization level
nvcc -O3 -o optimized kernel.cu

# Generate PTX code (intermediate representation)
nvcc -ptx kernel.cu

# Support multiple architectures simultaneously (Fat Binary)
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -o multi_arch kernel.cu

# Library linking
nvcc -lcublas -lcurand -o linked_kernel kernel.cu

# Detailed compilation info
nvcc --resource-usage -o kernel kernel.cu

3.4 CUDA Version Check and Compatibility

There are two types of CUDA versions, and it is important not to confuse them.

# 1. Driver-supported CUDA version (Driver API)
nvidia-smi
# "CUDA Version: 13.1" shown in upper right
# This is the maximum CUDA runtime version the driver supports

# 2. CUDA Toolkit version (Runtime API)
nvcc --version
# Shows "release 13.1"
# The actually installed CUDA Toolkit version

# 3. Check at runtime
python3 -c "import torch; print(torch.version.cuda)"

Key compatibility rules:

  • The driver CUDA version must be greater than or equal to the Toolkit CUDA version
  • Example: If the driver supports CUDA 13.1, you can also use CUDA 12.x Toolkit (backward compatible)
  • According to CUDA Toolkit's Minor Version Compatibility policy, binary compatibility is maintained within the same Major version
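The two rules above can be expressed as simple version arithmetic. A hedged Python sketch (version strings are illustrative; in practice you would query the driver and runtime APIs):

```python
# Sketch of the compatibility rules (illustrative; not an official NVIDIA check).
def is_compatible(driver_cuda, toolkit_cuda):
    """Driver-supported CUDA version must be >= the Toolkit version."""
    parse = lambda v: tuple(map(int, v.split(".")))
    return parse(driver_cuda) >= parse(toolkit_cuda)

def minor_version_compatible(built_with, running_on):
    """Binary compatibility holds within the same major version (CUDA 11+ policy)."""
    major = lambda v: int(v.split(".")[0])
    return major(built_with) == major(running_on)

print(is_compatible("13.1", "12.4"))             # True: newer driver, older toolkit
print(is_compatible("12.4", "13.1"))             # False
print(minor_version_compatible("12.0", "12.4"))  # True: same major version
```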

3.5 Using CUDA in Docker

The NVIDIA Container Toolkit allows you to leverage GPUs in Docker containers.

# 1. Install NVIDIA Container Toolkit (Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# 2. Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 3. Run GPU container
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

# Use specific GPUs only
docker run --rm --gpus '"device=0,1"' nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

# GPU usage in docker compose
# docker-compose.yml:
services:
  gpu-app:
    image: nvidia/cuda:13.1.0-devel-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: nvidia-smi

3.6 Installing PyTorch/TensorFlow CUDA via conda/pip

# PyTorch (CUDA 12.4 build example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Using conda
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

# TensorFlow (auto GPU detection)
pip install tensorflow[and-cuda]

# Check CUDA availability (Python)
python3 -c "
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA version: {torch.version.cuda}')
print(f'GPU count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
    print(f'GPU name: {torch.cuda.get_device_name(0)}')
"

4. CUDA Programming Basics (C/C++)

4.1 Host (CPU) vs Device (GPU) Code

In CUDA programming, CPU-side code is called Host and GPU-side code is called Device.

┌─────────────────────────────────────────────────────────┐
│  Host (CPU)                                             │
│  ┌─────────────────────────────────────────────────────┐│
│  │ 1. Prepare data (Host Memory)                       ││
│  │ 2. Allocate GPU memory (cudaMalloc)                 ││
│  │ 3. Transfer data: Host -> Device (cudaMemcpy)       ││
│  │ 4. Launch kernel (<<<grid, block>>>)                ││
│  │ 5. Transfer results: Device -> Host (cudaMemcpy)    ││
│  │ 6. Free GPU memory (cudaFree)                       ││
│  └─────────────────────────────────────────────────────┘│
│                      ↕ PCIe / NVLink                    │
│  ┌─────────────────────────────────────────────────────┐│
│  │  Device (GPU)                                       ││
│  │  - Parallel execution of kernel functions           ││
│  │  - Thousands of threads processing simultaneously   ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘

4.2 Function Qualifiers

CUDA provides three function qualifiers, which can also be combined.

| Qualifier | Execution Location | Callable From | Description |
|---|---|---|---|
| __global__ | Device (GPU) | Host (CPU) | Kernel function. Return type must be void |
| __device__ | Device (GPU) | Device (GPU) | Helper function callable only within GPU |
| __host__ | Host (CPU) | Host (CPU) | Normal CPU function (default, can be omitted) |
| __host__ __device__ | Both | Both | Usable on both CPU and GPU |

// __global__: Kernel function - called from Host, executed on Device
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2.0f;
    }
}

// __device__: Device-only helper function
__device__ float square(float x) {
    return x * x;
}

// __host__ __device__: CPU/GPU compatible function
__host__ __device__ float add(float a, float b) {
    return a + b;
}

4.3 Kernel Launch Syntax

Kernel functions are launched using special triple angle bracket syntax.

// Kernel launch syntax
// kernel<<<gridDim, blockDim, sharedMemSize, stream>>>(args...);
//
// gridDim:       Number of Blocks in Grid (dim3)
// blockDim:      Number of Threads in Block (dim3)
// sharedMemSize: Dynamic Shared Memory size (bytes, optional)
// stream:        Execution stream (optional)

// 1D example: 256 threads, 1 block
myKernel<<<1, 256>>>(d_data, n);

// 1D example: Process N data elements with 256-thread blocks
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);

// 2D example: Image processing
dim3 blockDim(16, 16);       // 16x16 = 256 threads/block
dim3 gridDim(
    (width + 15) / 16,
    (height + 15) / 16
);
imageKernel<<<gridDim, blockDim>>>(d_image, width, height);

4.4 Thread Hierarchy

Each thread can determine its position through built-in variables.

// Built-in Variables
threadIdx.x, threadIdx.y, threadIdx.z  // Thread index within Block
blockIdx.x,  blockIdx.y,  blockIdx.z   // Block index within Grid
blockDim.x,  blockDim.y,  blockDim.z   // Block dimension sizes
gridDim.x,   gridDim.y,   gridDim.z    // Grid dimension sizes

// Global Thread ID calculation (1D)
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

// Global Thread ID calculation (2D)
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int globalIdx = row * width + col;

// Total thread count (for Grid-stride loop)
int totalThreads = gridDim.x * blockDim.x;
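As a quick sanity check, the 1D global-index formula can be verified on the CPU. A Python sketch enumerating a small grid (illustrative, not GPU code):

```python
# Verify that blockIdx.x * blockDim.x + threadIdx.x assigns every element
# exactly one thread, with unique contiguous indices.
def global_ids(grid_dim, block_dim):
    ids = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            ids.append(block_idx * block_dim + thread_idx)
    return ids

ids = global_ids(grid_dim=4, block_dim=256)
assert ids == list(range(4 * 256))  # unique, contiguous, covers the whole grid
print(len(ids))  # 1024
```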

4.5 Memory Management Functions

// GPU memory allocation
float* d_data;
cudaMalloc((void**)&d_data, n * sizeof(float));

// Host -> Device copy
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

// Device -> Host copy
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

// GPU memory free
cudaFree(d_data);

// GPU memory initialization
cudaMemset(d_data, 0, n * sizeof(float));

// Error check macro (essential!)
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error at %s:%d - %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Usage example
CUDA_CHECK(cudaMalloc((void**)&d_data, n * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice));

4.6 Hello World Example

// hello_cuda.cu
#include <stdio.h>

__global__ void helloKernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU thread %d (block %d, thread %d)\n",
           tid, blockIdx.x, threadIdx.x);
}

int main() {
    // 2 blocks, 4 threads each = 8 threads total
    helloKernel<<<2, 4>>>();

    // Wait for GPU work to complete
    cudaDeviceSynchronize();

    printf("Hello from CPU!\n");
    return 0;
}
# Compile and run
nvcc -o hello hello_cuda.cu
./hello

4.7 Vector Addition Example

This is the "Hello World" of CUDA - a vector addition example.

// vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // fabs
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s at %s:%d\n", \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// GPU kernel: vector addition
__global__ void vectorAdd(const float* A, const float* B, float* C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements
    size_t bytes = N * sizeof(float);

    // Host memory allocation and initialization
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    for (int i = 0; i < N; i++) {
        h_A[i] = (float)i;
        h_B[i] = (float)(i * 2);
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_C, bytes));

    // Host -> Device copy
    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    // Kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Error check
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Device -> Host copy
    CUDA_CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));

    // Result verification
    for (int i = 0; i < N; i++) {
        if (fabs(h_C[i] - (h_A[i] + h_B[i])) > 1e-5) {
            fprintf(stderr, "Verification failed at index %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }
    printf("Vector addition of %d elements: SUCCESS\n", N);

    // Memory cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

4.8 Matrix Multiplication Example

// matmul.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define TILE_SIZE 16

// Basic matrix multiplication kernel
__global__ void matMulBasic(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Tiled matrix multiplication using Shared Memory (optimized version)
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Iterate over tiles
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load tiles into Shared Memory
        if (row < M && (t * TILE_SIZE + threadIdx.x) < K)
            tileA[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if ((t * TILE_SIZE + threadIdx.y) < K && col < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // Wait for tile load to complete

        // Multiply-accumulate within tile
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();  // Synchronize before loading next tile
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}

int main() {
    const int M = 1024, N = 1024, K = 1024;

    // ... (memory allocation, initialization, data transfer omitted)

    dim3 blockDim(TILE_SIZE, TILE_SIZE);
    dim3 gridDim((N + TILE_SIZE - 1) / TILE_SIZE,
                 (M + TILE_SIZE - 1) / TILE_SIZE);

    matMulTiled<<<gridDim, blockDim>>>(d_A, d_B, d_C, M, N, K);

    // ... (result copy, verification, memory cleanup omitted)
    return 0;
}

5. CUDA Memory Management In-Depth

5.1 Global Memory and Coalesced Access

Coalesced Access is critical to performance when accessing Global Memory. When 32 threads in a Warp access contiguous memory addresses, the GPU combines them into a single transaction.
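The transaction arithmetic can be sketched on the CPU. Below, a Python model counts the 128-byte segments one warp's loads touch (the 128 B granularity is an assumption about L1 transaction size, used here for illustration):

```python
# Count distinct 128-byte memory segments touched by one warp's 32 loads.
def segments_touched(byte_addresses, segment=128):
    return len({addr // segment for addr in byte_addresses})

FLOAT = 4
# Coalesced: 32 consecutive floats fit in one 128-byte segment
coalesced = [tid * FLOAT for tid in range(32)]
# Strided by 32 floats: every thread lands in a different segment
strided = [tid * 32 * FLOAT for tid in range(32)]

print(segments_touched(coalesced))  # 1 transaction
print(segments_touched(strided))    # 32 transactions
```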

// Coalesced Access (good pattern)
// Threads in a Warp access contiguous memory
__global__ void coalesced(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // Contiguous access
    }
}

// Non-coalesced Access (bad pattern)
// Threads access non-contiguously with a stride
__global__ void strided(float* data, int n, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // Strided access
    }
}

// AoS (Array of Structures) vs SoA (Structure of Arrays)
// AoS: Non-coalesced (bad)
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
};
// Thread 0: particle[0].x, Thread 1: particle[1].x -> strided access

// SoA: Coalesced (good)
struct ParticlesSoA {
    float* x;  float* y;  float* z;
    float* vx; float* vy; float* vz;
};
// Thread 0: x[0], Thread 1: x[1] -> contiguous access

5.2 Shared Memory and Bank Conflict

Shared Memory is high-speed memory shared within a thread block on an SM. It consists of 32 banks, and when different threads simultaneously access the same bank, a Bank Conflict occurs.
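The bank-mapping arithmetic is easy to model: with 32 banks of 4-byte words, word w maps to bank w % 32, and the conflict degree is the maximum number of threads hitting one bank. A Python sketch (illustrative model of the mapping, not actual hardware behavior):

```python
from collections import Counter

def conflict_degree(word_indices):
    """Max number of threads in a warp mapping to the same 4-byte bank."""
    return max(Counter(w % 32 for w in word_indices).values())

# Column access of a float tile[32][32]: thread t reads word t*32
col_unpadded = [t * 32 for t in range(32)]  # all map to bank 0
# With tile[32][33] padding: thread t reads word t*33, one bank each
col_padded = [t * 33 for t in range(32)]

print(conflict_degree(col_unpadded))  # 32 (32-way conflict)
print(conflict_degree(col_padded))    # 1  (conflict-free)
```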

// Shared Memory usage example
__global__ void sharedMemExample(float* input, float* output, int n) {
    // Static allocation
    __shared__ float sharedData[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from Global -> Shared
    if (gid < n) {
        sharedData[tid] = input[gid];
    }
    __syncthreads();  // Wait until all threads have loaded

    // Compute on Shared Memory (neighbor element sum example)
    if (tid > 0 && tid < blockDim.x - 1 && gid < n) {
        output[gid] = sharedData[tid - 1] + sharedData[tid] + sharedData[tid + 1];
    }
}

// Bank Conflict avoidance: padding technique
// Problem: when a Warp reads a column of a 32x32 float tile,
// all 32 accesses map to the same bank (choose one declaration or the other)
__shared__ float tile[32][32];       // Bank Conflict!
__shared__ float tile[32][32 + 1];   // Padding avoids Bank Conflict

5.3 Unified Memory

Unified Memory is an abstraction layer that allows both CPU and GPU to access memory through the same pointer.

// Unified Memory usage
float* data;
cudaMallocManaged(&data, N * sizeof(float));

// Initialize on CPU
for (int i = 0; i < N; i++) {
    data[i] = (float)i;
}

// GPU kernel execution - no separate cudaMemcpy needed!
myKernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();

// Access results directly from CPU
printf("Result: %f\n", data[0]);

// Free
cudaFree(data);

// Memory prefetch hints (performance optimization)
cudaMemPrefetchAsync(data, N * sizeof(float), deviceId);  // Prefetch to GPU
cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId);  // Prefetch to CPU

5.4 Pinned Memory

Pinned Memory (Page-locked Memory) is host memory excluded from OS page swapping, which increases data transfer speed between GPU and host.

// Pinned Memory allocation
float* h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));  // or cudaHostAlloc

// Transfer speed comparison vs regular memory
// Pageable:  ~12 GB/s (PCIe Gen4)
// Pinned:    ~25 GB/s (PCIe Gen4) - approximately 2x faster

// Required for asynchronous transfers
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

// Free
cudaFreeHost(h_pinned);

5.5 Memory Pool

Memory Pool, introduced in CUDA 11.2, reduces overhead from repetitive memory allocation/deallocation.

// Stream-ordered Memory Allocator (CUDA 11.2+)
float* d_data;
cudaMallocAsync(&d_data, bytes, stream);

// Asynchronous free after use
cudaFreeAsync(d_data, stream);

// Memory Pool configuration
cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, deviceId);

// Set pool size limit
uint64_t threshold = 1ULL << 30;  // 1 GB
cudaMemPoolSetAttribute(mempool, cudaMemPoolAttrReleaseThreshold, &threshold);

5.6 Memory Bandwidth Optimization Summary

| Technique | Effect | Implementation Difficulty |
|---|---|---|
| Coalesced Access | Maximize Global Memory throughput | Low |
| SoA layout | Guarantee Coalesced Access | Medium |
| Shared Memory Tiling | Reduce Global Memory access count | Medium |
| Bank Conflict avoidance (padding) | Maximize Shared Memory throughput | Low |
| Pinned Memory | 2x faster Host-Device transfer | Low |
| Unified Memory + Prefetch | Programming convenience + performance | Low |
| Memory Pool | Eliminate allocation/deallocation overhead | Low |
| Texture Memory | Leverage 2D spatial locality | Medium |

6. CUDA Streams and Asynchronous Execution

6.1 CUDA Stream Concept

A CUDA Stream is a sequence of commands executed in order on the GPU. Operations on different Streams can execute concurrently.

Default Stream (Stream 0):
[MemcpyH2D] -> [Kernel A] -> [MemcpyD2H]
                                           (sequential, no pipeline)

Multi-Stream:
Stream 1: [MemcpyH2D_1] -> [Kernel_1] -> [MemcpyD2H_1]
Stream 2:     [MemcpyH2D_2] -> [Kernel_2] -> [MemcpyD2H_2]
Stream 3:         [MemcpyH2D_3] -> [Kernel_3] -> [MemcpyD2H_3]
                                           (overlapped execution, maximize GPU utilization)
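The win from overlapping can be estimated with back-of-envelope arithmetic. A Python sketch under idealized assumptions (copy engines and SMs overlap fully, zero launch overhead):

```python
# Idealized pipeline timing model for n equal chunks across 3 stages.
def serial_time(h2d, kernel, d2h):
    return h2d + kernel + d2h

def pipelined_time(h2d, kernel, d2h, n_chunks):
    # First chunk fills the pipeline; afterward the bottleneck stage dominates.
    bottleneck_chunk = max(h2d, kernel, d2h) / n_chunks
    return (h2d + kernel + d2h) / n_chunks + (n_chunks - 1) * bottleneck_chunk

# 4 streams over equal 3 ms stages: 9 ms serial vs 4.5 ms overlapped
print(serial_time(3, 3, 3))        # 9
print(pipelined_time(3, 3, 3, 4))  # 4.5
```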

6.2 Stream Creation and Usage

// Stream creation
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Asynchronous memory transfer (Pinned Memory required!)
cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream2);

// Kernel execution (on specific Stream)
kernelA<<<grid, block, 0, stream1>>>(d_A);
kernelB<<<grid, block, 0, stream2>>>(d_B);

// Asynchronous result copy
cudaMemcpyAsync(h_A, d_A, bytes, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_B, d_B, bytes, cudaMemcpyDeviceToHost, stream2);

// Stream synchronization
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Stream cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

6.3 Multi-Stream Pipeline Pattern

const int NUM_STREAMS = 4;
const int CHUNK_SIZE = N / NUM_STREAMS;

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
    cudaStreamCreate(&streams[i]);
}

// Pinned Memory allocation
float *h_in, *h_out;
cudaMallocHost(&h_in, N * sizeof(float));
cudaMallocHost(&h_out, N * sizeof(float));

float *d_in, *d_out;
cudaMalloc(&d_in, N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));

// Pipeline execution
for (int i = 0; i < NUM_STREAMS; i++) {
    int offset = i * CHUNK_SIZE;
    size_t chunkBytes = CHUNK_SIZE * sizeof(float);

    // 1. Host -> Device (async)
    cudaMemcpyAsync(d_in + offset, h_in + offset,
                    chunkBytes, cudaMemcpyHostToDevice, streams[i]);

    // 2. Kernel execution
    int blocks = (CHUNK_SIZE + 255) / 256;
    processKernel<<<blocks, 256, 0, streams[i]>>>(
        d_in + offset, d_out + offset, CHUNK_SIZE);

    // 3. Device -> Host (async)
    cudaMemcpyAsync(h_out + offset, d_out + offset,
                    chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for all to complete
cudaDeviceSynchronize();

6.4 CUDA Events (Timing Measurement)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Start timing
cudaEventRecord(start);

// Kernel execution
myKernel<<<grid, block>>>(d_data, N);

// Stop timing
cudaEventRecord(stop);
cudaEventSynchronize(stop);

// Calculate elapsed time
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %.3f ms\n", milliseconds);

// Bandwidth calculation
float bandwidth = (bytes_read + bytes_written) / (milliseconds * 1e6);
printf("Effective bandwidth: %.2f GB/s\n", bandwidth);

cudaEventDestroy(start);
cudaEventDestroy(stop);

6.5 Synchronization Function Comparison

| Function | Scope | Purpose |
|---|---|---|
| cudaDeviceSynchronize() | Entire Device | Wait for all Streams to complete |
| cudaStreamSynchronize(stream) | Specific Stream | Wait for that Stream to complete |
| cudaEventSynchronize(event) | Specific Event | Wait for that Event to complete |
| cudaStreamWaitEvent(stream, event) | Between Streams | Stream waits for Event before proceeding |
| __syncthreads() | Within Block | Synchronize threads within Block (in kernel) |

7. CUDA Optimization Techniques

7.1 Occupancy Optimization

Occupancy is the ratio of actual active Warps to the maximum supported Warps on an SM. High Occupancy does not always mean peak performance, but it is critical for hiding memory latency.
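The occupancy calculation itself is simple resource arithmetic. Below is a hedged Python sketch using CC 9.0-style per-SM limits (the constants are assumptions; real numbers come from the occupancy API shown afterward):

```python
# Simplified occupancy model (ignores allocation granularity; illustrative only).
MAX_WARPS_PER_SM  = 64
MAX_BLOCKS_PER_SM = 32
REGS_PER_SM       = 65536
SMEM_PER_SM       = 228 * 1024

def occupancy(block_size, regs_per_thread, smem_per_block):
    warps_per_block = (block_size + 31) // 32
    limits = [
        MAX_BLOCKS_PER_SM,                                 # hardware block limit
        MAX_WARPS_PER_SM // warps_per_block,               # warp limit
        REGS_PER_SM // (regs_per_thread * block_size),     # register limit
        SMEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM,
    ]
    active_blocks = min(limits)
    return active_blocks * warps_per_block / MAX_WARPS_PER_SM

print(occupancy(256, 32, 0))   # 1.0  (8 blocks x 8 warps fill the SM)
print(occupancy(256, 128, 0))  # 0.25 (register-limited to 2 blocks)
```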

// Occupancy calculation API
int blockSize;
int minGridSize;

// Automatically calculate optimal block size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
printf("Optimal block size: %d\n", blockSize);
printf("Minimum grid size: %d\n", minGridSize);

// Check Occupancy for a specific block size
int maxActiveBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxActiveBlocks, myKernel, blockSize, 0);
printf("Max active blocks per SM: %d\n", maxActiveBlocks);

Block size selection guidelines:

| Principle | Description |
|---|---|
| Multiple of 32 | Align with Warp size to avoid thread waste |
| 128 ~ 512 recommended | Generally optimal range |
| Consider registers/Shared Memory | Reduce block size if resource usage is high |
| Use cudaOccupancyMaxPotentialBlockSize | Automatic optimal calculation |

7.2 Minimizing Warp Divergence

When threads within a Warp take different branches, sequential execution occurs.

// Bad example: Warp Divergence
__global__ void divergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Branches based on even/odd threadIdx.x
        // -> Only half execute at a time within same Warp
        if (threadIdx.x % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
    }
}

// Good example: Branch at Warp granularity
__global__ void convergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / 32;  // Branch by Warp ID
    if (idx < n) {
        // All threads in the same Warp take the same path
        if (warpId % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
    }
}

7.3 Shared Memory Tiling (Matrix Multiplication Optimization)

Tiling is a key technique to reduce Global Memory access count. We already saw the implementation in section 4.8's matMulTiled, and the performance difference is summarized below.

| Method | Global Memory Accesses (MxNxK matrix) | Relative Performance |
|---|---|---|
| Naive | 2 * M * N * K | 1x |
| Tiled (Shared Memory) | 2 * M * N * K / TILE_SIZE | ~TILE_SIZE x |

With TILE_SIZE of 16, Global Memory accesses are reduced by 16x.
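The access counts in the table above reduce to straightforward arithmetic (idealized: boundary tiles ignored). A Python sketch:

```python
# Global Memory access counts for MxK @ KxN matrix multiplication.
def naive_accesses(M, N, K):
    # Each of the M*N outputs reads a full row of A and column of B.
    return 2 * M * N * K

def tiled_accesses(M, N, K, tile):
    # Each tile load is amortized across `tile` threads.
    return 2 * M * N * K // tile

M = N = K = 1024
print(naive_accesses(M, N, K) // tiled_accesses(M, N, K, 16))  # 16x fewer
```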

7.4 Loop Unrolling

// Manual Loop Unrolling
__global__ void unrolled(float* data, int n) {
    // Each thread handles 4 elements, spaced blockDim.x apart
    // (grid must be sized as ceil(n / (4 * blockDim.x)))
    int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;

    // 4x Unrolling
    if (idx < n)                    data[idx] *= 2.0f;
    if (idx + blockDim.x < n)       data[idx + blockDim.x] *= 2.0f;
    if (idx + 2 * blockDim.x < n)   data[idx + 2 * blockDim.x] *= 2.0f;
    if (idx + 3 * blockDim.x < n)   data[idx + 3 * blockDim.x] *= 2.0f;
}

// Compiler directive-based Unrolling
__global__ void pragmaUnrolled(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    #pragma unroll 8
    for (int k = 0; k < N; k++) {
        sum += A[idx * N + k] * B[k];
    }
    C[idx] = sum;
}

7.5 Grid-Stride Loop Pattern

A flexible pattern used when data size exceeds total thread count.

__global__ void gridStrideKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Each thread processes multiple elements, striding by total thread count
    for (int i = idx; i < n; i += stride) {
        data[i] = data[i] * 2.0f;
    }
}

// Advantages:
// 1. Works regardless of data size
// 2. Block/grid sizes can be freely adjusted
// 3. Increased work per thread reduces kernel launch overhead
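Advantage 1 can be checked by emulating the loop on the CPU: however small the grid, every element is visited exactly once. A Python sketch:

```python
# CPU emulation of the grid-stride loop: count visits per element.
def grid_stride_coverage(n, grid_dim, block_dim):
    visits = [0] * n
    total_threads = grid_dim * block_dim
    for idx in range(total_threads):            # one pass per thread
        for i in range(idx, n, total_threads):  # the kernel's stride loop
            visits[i] += 1
    return visits

# 10,000 elements, only 4x256 = 1,024 threads: still full, exactly-once coverage
assert all(v == 1 for v in grid_stride_coverage(n=10_000, grid_dim=4, block_dim=256))
print("covered")
```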

7.6 NVIDIA Nsight Profiling

# Nsight Compute (kernel-level profiling)
ncu --set full -o profile_report ./my_cuda_app

# Profile specific kernel only
ncu --kernel-name myKernel --launch-skip 0 --launch-count 1 ./my_cuda_app

# Nsight Systems (system-level profiling)
nsys profile --trace=cuda,nvtx -o timeline_report ./my_cuda_app

# Key metrics to check
# - SM Throughput: SM utilization
# - Memory Throughput: Memory bandwidth utilization
# - Achieved Occupancy: Actual Occupancy
# - Warp Stall Reasons: Warp stall causes
# - L1/L2 Hit Rate: Cache hit rate

Profiling checklist:

| Metric | Target | Action if problematic |
|---|---|---|
| Achieved Occupancy | 50% or higher | Adjust block size, register usage |
| Memory Throughput | 60%+ of theoretical BW | Improve Coalesced Access, caching |
| Compute Throughput | 60%+ of theoretical ops | Improve ILP, remove unnecessary ops |
| Warp Divergence | Minimize | Restructure branching logic |

8. CUDA in Python

8.1 PyCUDA Basics

PyCUDA allows you to write and execute CUDA C kernels directly from Python.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

# Define CUDA kernel (write C code as a string)
mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dest[idx] = a[idx] * b[idx];
    }
}
""")

multiply = mod.get_function("multiply")

# Prepare data
n = 1024 * 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
dest = np.zeros_like(a)

# Execute kernel
block_size = 256
grid_size = (n + block_size - 1) // block_size

multiply(
    cuda.Out(dest), cuda.In(a), cuda.In(b), np.int32(n),
    block=(block_size, 1, 1), grid=(grid_size, 1)
)

# Verify results
assert np.allclose(dest, a * b)
print("PyCUDA multiply: SUCCESS")

8.2 Numba CUDA JIT

Numba is the easiest way to JIT-compile Python functions into CUDA kernels.

from numba import cuda
import numpy as np
import math

# CUDA kernel definition
@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)  # 1D global index
    if idx < a.shape[0]:
        c[idx] = a[idx] + b[idx]

# Prepare data
n = 1_000_000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

# Transfer to Device
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

# Execute kernel
threads_per_block = 256
blocks_per_grid = math.ceil(n / threads_per_block)
vector_add[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

# Get results
result = d_c.copy_to_host()
assert np.allclose(result, a + b)
print("Numba CUDA vector_add: SUCCESS")


# 2D kernel example: matrix multiplication
@cuda.jit
def matmul_kernel(A, B, C):
    row, col = cuda.grid(2)  # 2D global index
    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.0
        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp

# Using Shared Memory
TILE = 16  # tile width; cuda.shared.array needs a compile-time constant

@cuda.jit
def matmul_shared(A, B, C):
    sA = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)
    sB = cuda.shared.array(shape=(TILE, TILE), dtype=np.float32)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    row = cuda.blockIdx.y * TILE + ty
    col = cuda.blockIdx.x * TILE + tx

    tmp = 0.0
    for t in range((A.shape[1] + TILE - 1) // TILE):
        if row < A.shape[0] and (t * TILE + tx) < A.shape[1]:
            sA[ty, tx] = A[row, t * TILE + tx]
        else:
            sA[ty, tx] = 0.0

        if (t * TILE + ty) < B.shape[0] and col < B.shape[1]:
            sB[ty, tx] = B[t * TILE + ty, col]
        else:
            sB[ty, tx] = 0.0

        cuda.syncthreads()
        for k in range(TILE):
            tmp += sA[ty, k] * sB[k, tx]
        cuda.syncthreads()

    if row < C.shape[0] and col < C.shape[1]:
        C[row, col] = tmp

8.3 CuPy (NumPy Replacement)

CuPy is a library that provides the same API as NumPy but runs on the GPU.

import cupy as cp
import numpy as np
import time

# Same API as NumPy
a_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)
b_gpu = cp.random.randn(10000, 10000, dtype=cp.float32)

# GPU matrix multiplication
start = time.time()
c_gpu = cp.dot(a_gpu, b_gpu)
cp.cuda.Stream.null.synchronize()  # Wait for GPU work to complete
gpu_time = time.time() - start

# CPU comparison
a_cpu = cp.asnumpy(a_gpu)
b_cpu = cp.asnumpy(b_gpu)
start = time.time()
c_cpu = np.dot(a_cpu, b_cpu)
cpu_time = time.time() - start

print(f"GPU: {gpu_time:.4f}s, CPU: {cpu_time:.4f}s")
print(f"Speedup: {cpu_time / gpu_time:.1f}x")

# Custom CUDA kernel in CuPy
custom_kernel = cp.RawKernel(r'''
extern "C" __global__
void relu_kernel(const float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = fmaxf(input[idx], 0.0f);
    }
}
''', 'relu_kernel')

n = 1_000_000
x = cp.random.randn(n, dtype=cp.float32)
y = cp.zeros(n, dtype=cp.float32)

block_size = 256
grid_size = (n + block_size - 1) // block_size
custom_kernel((grid_size,), (block_size,), (x, y, cp.int32(n)))  # scalars as typed values

# Verification
assert cp.allclose(y, cp.maximum(x, 0))

8.4 PyTorch CUDA Operations

Here are practical patterns for leveraging CUDA in PyTorch.

import torch
import torch.nn as nn

# ===== Basic GPU Usage =====

# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")

# Move tensors to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Method 1: .to(device)
x = torch.randn(1000, 1000)
x_gpu = x.to(device)

# Method 2: .cuda()
x_gpu = x.cuda()

# Method 3: Create directly on GPU
x_gpu = torch.randn(1000, 1000, device='cuda')

# Specify a particular GPU
x_gpu1 = x.to('cuda:1')  # Second GPU

# ===== Memory Management =====

# Check memory usage
print(torch.cuda.memory_summary())
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Clear cache
torch.cuda.empty_cache()

# Memory snapshot (for debugging)
torch.cuda.memory._record_memory_history()
# ... run code ...
snapshot = torch.cuda.memory._snapshot()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")


# ===== Mixed Precision Training =====
from torch.amp import autocast, GradScaler

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler('cuda')

for data, target in dataloader:  # any existing DataLoader
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()

    # Automatic FP16 conversion with autocast
    with autocast('cuda'):
        output = model(data)
        loss = nn.functional.mse_loss(output, target)

    # FP16 gradient scaling with GradScaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Benefits of Mixed Precision:
# - ~50% reduction in memory usage
# - 1.5-3x faster training (leveraging Tensor Cores)
# - Minimal accuracy loss
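Why GradScaler is needed can be seen with plain NumPy: gradients below FP16's representable range underflow to zero, while the same values survive if scaled up before the half-precision step and unscaled in FP32 (a toy illustration of loss scaling, not the actual autograd mechanics):

```python
import numpy as np

# A gradient this small is below FP16's smallest subnormal (~6e-8),
# so casting to half precision destroys it entirely.
tiny_grad = 1e-8
assert np.float16(tiny_grad) == 0.0

# GradScaler's trick: multiply the loss (hence all gradients) by a large
# factor before the FP16 backward pass, then unscale in FP32.
scale = 2.0 ** 16
scaled = np.float16(tiny_grad * scale)      # ~6.55e-4: comfortably in range
recovered = np.float32(scaled) / scale

assert scaled != 0.0
# The unscaled FP32 gradient is close to the true value again.
assert abs(recovered - tiny_grad) / tiny_grad < 0.01
```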

8.5 TensorFlow GPU Configuration

import tensorflow as tf

# Check GPUs
print("GPUs available:", tf.config.list_physical_devices('GPU'))

# Incremental GPU memory allocation (recommended)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Or set memory limit
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=8192)]  # in MB -> 8 GB
)

# Use specific GPU
with tf.device('/GPU:0'):
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)

# Mixed Precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')

8.6 RAPIDS (cuDF, cuML) Introduction

RAPIDS is a collection of GPU-accelerated data science libraries provided by NVIDIA. As of 2025, the latest version is 26.02, and cuML offers 5-175x faster performance compared to scikit-learn.

# cuDF: GPU DataFrame replacement for pandas
import cudf

# Same API as pandas
df = cudf.read_csv('large_data.csv')
result = df.groupby('category').agg({'value': 'mean', 'count': 'sum'})
filtered = df[df['value'] > 100]

# Interconversion with pandas
import pandas as pd
pdf = df.to_pandas()   # cuDF -> pandas
gdf = cudf.from_pandas(pdf)  # pandas -> cuDF


# cuML: GPU ML replacement for scikit-learn
import cuml
from cuml.ensemble import RandomForestClassifier
from cuml.cluster import KMeans

# Same API as scikit-learn
rf = RandomForestClassifier(n_estimators=100, max_depth=16)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

# Zero-code-change acceleration (cuml.accel)
# Run scikit-learn code on GPU without modification
import cuml.accel
cuml.accel.install()

from sklearn.ensemble import RandomForestClassifier  # Automatically GPU-accelerated!

9. Hands-on Examples

9.1 Example 1: GPU Vector Operation Benchmark (Numba)

A practical example comparing CPU and GPU vector operation performance.

from numba import cuda, njit
import numpy as np
import time
import math

# CPU version (Numba JIT)
@njit
def vector_ops_cpu(a, b, c):
    for i in range(a.shape[0]):
        c[i] = math.sqrt(a[i] ** 2 + b[i] ** 2) * math.sin(a[i]) + math.log(abs(b[i]) + 1)

# GPU version (CUDA)
@cuda.jit
def vector_ops_gpu(a, b, c):
    idx = cuda.grid(1)
    if idx < a.shape[0]:
        c[idx] = math.sqrt(a[idx] ** 2 + b[idx] ** 2) * math.sin(a[idx]) + math.log(abs(b[idx]) + 1)

# Benchmark
sizes = [100_000, 1_000_000, 10_000_000, 100_000_000]

for n in sizes:
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    c_cpu = np.zeros(n, dtype=np.float32)
    c_gpu = np.zeros(n, dtype=np.float32)

    # CPU
    start = time.time()
    vector_ops_cpu(a, b, c_cpu)
    cpu_time = time.time() - start

    # GPU
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c_gpu)

    threads = 256
    blocks = math.ceil(n / threads)

    # Warm-up
    vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
    cuda.synchronize()

    start = time.time()
    vector_ops_gpu[blocks, threads](d_a, d_b, d_c)
    cuda.synchronize()
    gpu_time = time.time() - start

    c_gpu = d_c.copy_to_host()

    print(f"N={n:>12,}: CPU={cpu_time:.4f}s, GPU={gpu_time:.4f}s, "
          f"Speedup={cpu_time/gpu_time:.1f}x")

9.2 Example 2: Image Processing Acceleration (CuPy)

An example of GPU-accelerated real-time image filtering.

import cupy as cp
import numpy as np
from PIL import Image

def gpu_gaussian_blur(image_array, kernel_size=5, sigma=1.0):
    """GPU Gaussian blur implementation"""
    # Create Gaussian kernel
    ax = cp.arange(-kernel_size // 2 + 1., kernel_size // 2 + 1.)
    xx, yy = cp.meshgrid(ax, ax)
    kernel = cp.exp(-(xx**2 + yy**2) / (2. * sigma**2))
    kernel = kernel / kernel.sum()

    img_gpu = cp.asarray(image_array, dtype=cp.float32)
    result = cp.zeros_like(img_gpu)

    # Process each channel with a 2D convolution; cupyx.scipy.ndimage
    # handles boundary padding internally via `mode`
    from cupyx.scipy.ndimage import convolve
    for c in range(img_gpu.shape[2]):
        result[:, :, c] = convolve(img_gpu[:, :, c], kernel, mode='reflect')

    return cp.asnumpy(result.clip(0, 255).astype(cp.uint8))


# Sobel Edge Detection implemented directly with CuPy RawKernel
sobel_kernel = cp.RawKernel(r'''
extern "C" __global__
void sobel_filter(const float* input, float* output,
                  int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
        // Sobel X
        float gx = -1.0f * input[(y-1)*width + (x-1)]
                   + 1.0f * input[(y-1)*width + (x+1)]
                   - 2.0f * input[y*width + (x-1)]
                   + 2.0f * input[y*width + (x+1)]
                   - 1.0f * input[(y+1)*width + (x-1)]
                   + 1.0f * input[(y+1)*width + (x+1)];

        // Sobel Y
        float gy = -1.0f * input[(y-1)*width + (x-1)]
                   - 2.0f * input[(y-1)*width + x]
                   - 1.0f * input[(y-1)*width + (x+1)]
                   + 1.0f * input[(y+1)*width + (x-1)]
                   + 2.0f * input[(y+1)*width + x]
                   + 1.0f * input[(y+1)*width + (x+1)];

        output[y * width + x] = sqrtf(gx * gx + gy * gy);
    }
}
''', 'sobel_filter')

def gpu_sobel_edge_detection(gray_image):
    """GPU Sobel edge detection"""
    h, w = gray_image.shape
    img_gpu = cp.asarray(gray_image, dtype=cp.float32)
    output_gpu = cp.zeros_like(img_gpu)

    block = (16, 16)
    grid = ((w + 15) // 16, (h + 15) // 16)

    sobel_kernel(grid, block, (img_gpu, output_gpu, w, h))

    return cp.asnumpy(output_gpu.clip(0, 255).astype(cp.uint8))

# Usage example
# img = np.array(Image.open('photo.jpg'))
# blurred = gpu_gaussian_blur(img, kernel_size=11, sigma=3.0)
# gray = np.mean(img, axis=2).astype(np.float32)
# edges = gpu_sobel_edge_detection(gray)

9.3 Example 3: PyTorch Model GPU Training

A complete PyTorch GPU training pipeline example.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.amp import autocast, GradScaler
import time

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Model definition
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
            nn.GELU(),
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
        )

    def forward(self, x):
        return x + self.net(x)

class DeepModel(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=512, num_classes=10, num_blocks=6):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden_dim) for _ in range(num_blocks)])
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.input_proj(x)
        x = self.blocks(x)
        return self.head(x)

# Data generation (in practice, load a dataset)
X_train = torch.randn(50000, 784)
y_train = torch.randint(0, 10, (50000,))
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=4, pin_memory=True)  # pin_memory is important!

# Move model to GPU
model = DeepModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scaler = GradScaler('cuda')

# Training loop (Mixed Precision)
def train_epoch(model, loader, criterion, optimizer, scaler, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)  # set_to_none=True is more efficient

        with autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

    return total_loss / len(loader), 100. * correct / total

# Run training
num_epochs = 10
for epoch in range(num_epochs):
    start = time.time()
    loss, acc = train_epoch(model, train_loader, criterion, optimizer, scaler, device)
    elapsed = time.time() - start

    print(f"Epoch {epoch+1}/{num_epochs}: Loss={loss:.4f}, "
          f"Acc={acc:.2f}%, Time={elapsed:.2f}s")

    # Print GPU memory status
    if device.type == 'cuda':
        alloc = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"  GPU Memory: {alloc:.2f} GB allocated, {reserved:.2f} GB reserved")

9.4 Example 4: Multi-GPU Training

DataParallel (Simple but Inefficient)

import torch
import torch.nn as nn

model = DeepModel().cuda()

# Apply DataParallel (one line!)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs with DataParallel")
    model = nn.DataParallel(model)

# Rest of training code is the same
# Drawback: Load imbalance with GPU 0 being the bottleneck
DistributedDataParallel (DDP, Recommended)

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size):
    setup(rank, world_size)

    # Place model on the corresponding GPU
    model = DeepModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Distributed data loader
    train_dataset = TensorDataset(X_train, y_train)
    sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(train_dataset, batch_size=256,
                              sampler=sampler, num_workers=4, pin_memory=True)

    optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    scaler = GradScaler('cuda')

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Shuffle each epoch
        for data, target in train_loader:
            data = data.to(rank, non_blocking=True)
            target = target.to(rank, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)
            with autocast('cuda'):
                output = ddp_model(data)
                loss = criterion(output, target)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        if rank == 0:
            print(f"Epoch {epoch+1} complete")

    cleanup()

# Execution option 1: launch with torchrun (recommended)
#   torchrun --nproc_per_node=4 train_ddp.py

# Execution option 2: spawn one process per GPU manually
import torch.multiprocessing as mp

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True)

9.5 Example 5: Writing CUDA Kernels Directly (PyCUDA)

Implementing a practical Reduction (summation) kernel with PyCUDA.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

# Parallel Reduction kernel
mod = SourceModule("""
__global__ void parallel_reduce(float *input, float *output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // First level reduction (performed during global load)
    float sum = 0.0f;
    if (i < n) sum += input[i];
    if (i + blockDim.x < n) sum += input[i + blockDim.x];
    sdata[tid] = sum;

    __syncthreads();

    // Reduction in Shared Memory
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Warp-level reduction: the last 32 threads proceed without __syncthreads().
    // Note: this volatile idiom relies on implicit warp-synchronous execution,
    // which is no longer guaranteed on Volta and later -- prefer __syncwarp()
    // or warp shuffle intrinsics (__shfl_down_sync) on modern GPUs.
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid + 8];
        smem[tid] += smem[tid + 4];
        smem[tid] += smem[tid + 2];
        smem[tid] += smem[tid + 1];
    }

    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
""")

reduce_kernel = mod.get_function("parallel_reduce")

# Execution
n = 1 << 20  # 1M elements
data = np.random.randn(n).astype(np.float32)

block_size = 256
grid_size = (n + block_size * 2 - 1) // (block_size * 2)

d_input = cuda.mem_alloc(data.nbytes)
d_output = cuda.mem_alloc(grid_size * 4)  # float32

cuda.memcpy_htod(d_input, data)

reduce_kernel(d_input, d_output, np.int32(n),
              block=(block_size, 1, 1), grid=(grid_size, 1),
              shared=block_size * 4)

# Get partial sums
partial_sums = np.zeros(grid_size, dtype=np.float32)
cuda.memcpy_dtoh(partial_sums, d_output)
gpu_sum = partial_sums.sum()

# Verification
cpu_sum = data.sum()
print(f"GPU sum: {gpu_sum:.6f}")
print(f"CPU sum: {cpu_sum:.6f}")
print(f"Difference: {abs(gpu_sum - cpu_sum):.6f}")
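The halving loop in the kernel is easier to reason about as a CPU reference: at each step the first half of the array absorbs the second half, exactly like the shared-memory version above. A NumPy sketch:

```python
import numpy as np

def tree_reduce_sum(x):
    """CPU reference for the kernel's halving loop: at each step the
    first half absorbs the second half (sdata[tid] += sdata[tid + s])."""
    x = x.astype(np.float64).copy()
    n = len(x)
    assert n & (n - 1) == 0, "power-of-two length, like one thread block"
    s = n // 2
    while s >= 1:
        x[:s] += x[s:2 * s]
        s //= 2
    return x[0]

rng = np.random.default_rng(42)
data = rng.standard_normal(256).astype(np.float32)
assert np.isclose(tree_reduce_sum(data), data.sum(), rtol=1e-4, atol=1e-4)
```

The tree order also explains why the GPU sum differs slightly from the sequential CPU sum: floating-point addition is not associative.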

10. CUDA Tools and Libraries

10.1 Core Library Ecosystem

| Library | Purpose | Key Features |
|---|---|---|
| cuDNN | Deep Learning | Optimized DNN primitives: Convolution, RNN, Attention, BatchNorm |
| cuBLAS | Linear Algebra | BLAS Level 1/2/3 operations: GEMM, TRSM. FP8/BF16 Group GEMM support (Blackwell) |
| NCCL | Multi-GPU Comms | Collective communication: AllReduce, AllGather, Broadcast. NVLink/NVSwitch optimized |
| TensorRT | Inference Opt | Graph optimization, quantization (INT8/FP8), layer fusion, dynamic batching |
| Triton | GPU Programming | Python DSL for high-performance kernels. Developed by OpenAI |
| CUTLASS | Custom GEMM | Template-based CUDA matrix multiplication library |
| FlashAttention | Attention Accel | IO-aware algorithm for Transformer Attention acceleration |
| cuSPARSE | Sparse Matrices | Optimized sparse matrix operations |
| cuRAND | Random Numbers | GPU-accelerated pseudo/quasi-random number generation |
| cuFFT | FFT | GPU-accelerated Fast Fourier Transform |

10.2 cuDNN

# cuDNN is built into PyTorch/TensorFlow and used automatically

# Enable cuDNN benchmark mode in PyTorch
import torch
torch.backends.cudnn.benchmark = True   # Auto-select optimal algorithm
torch.backends.cudnn.deterministic = False  # Performance priority (when reproducibility is not needed)

# Check cuDNN version
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")

10.3 NCCL (Multi-GPU Communication)

# NCCL usage in PyTorch (internal to DistributedDataParallel)
import torch.distributed as dist

# Initialize NCCL backend
dist.init_process_group(backend='nccl')

# Key collective communication operations
# AllReduce: Sum tensors from all GPUs and distribute to each GPU
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

# AllGather: Gather tensors from all GPUs and copy the full set to each GPU
gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
dist.all_gather(gathered, tensor)

# Broadcast: Copy a tensor from one GPU to all GPUs
dist.broadcast(tensor, src=0)
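What AllReduce computes can be illustrated with a toy ring all-reduce in pure Python: after a reduce-scatter phase and an all-gather phase, every rank holds the elementwise sum of all ranks' buffers. This is a conceptual sketch of the ring algorithm; NCCL's real implementation pipelines these phases over NVLink/NVSwitch:

```python
import copy

def ring_allreduce(tensors):
    """Toy ring all-reduce: `tensors` is one equal-length list per rank.
    Returns the per-rank buffers, all equal to the elementwise sum."""
    world = len(tensors)
    bufs = copy.deepcopy(tensors)
    n = len(bufs[0])
    assert n % world == 0
    chunk = n // world
    sl = lambda c: slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r receives chunk (r-1-s)
    # from its left neighbor and accumulates it into its own copy.
    for s in range(world - 1):
        incoming = [bufs[(r - 1) % world][sl((r - 1 - s) % world)]
                    for r in range(world)]          # snapshot before writes
        for r in range(world):
            c = sl((r - 1 - s) % world)
            bufs[r][c] = [a + b for a, b in zip(bufs[r][c], incoming[r])]

    # Phase 2: all-gather. Each rank forwards chunks it fully owns.
    for s in range(world - 1):
        incoming = [bufs[(r - 1) % world][sl((r - s) % world)]
                    for r in range(world)]
        for r in range(world):
            bufs[r][sl((r - s) % world)] = incoming[r]

    return bufs

ranks = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [0, 0, 0, 1]]
out = ring_allreduce(ranks)
expected = [sum(col) for col in zip(*ranks)]
assert all(buf == expected for buf in out)
```

Each rank sends and receives only 2(W-1)/W of its buffer per direction, which is why the ring algorithm scales well in bandwidth regardless of rank count.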

10.4 TensorRT

# Inference optimization with TensorRT
import tensorrt as trt
import torch

# Convert PyTorch model to ONNX
model = MyModel().cuda().eval()
dummy_input = torch.randn(1, 3, 224, 224).cuda()
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=['input'], output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size'}})

# Simple conversion with torch_tensorrt (PyTorch 2.x)
import torch_tensorrt

optimized_model = torch_tensorrt.compile(model,
    inputs=[torch_tensorrt.Input(
        min_shape=[1, 3, 224, 224],
        opt_shape=[8, 3, 224, 224],
        max_shape=[32, 3, 224, 224],
        dtype=torch.float16
    )],
    enabled_precisions={torch.float16},
)

# Inference
with torch.no_grad():
    output = optimized_model(input_tensor.half().cuda())

# Typical TensorRT performance gains:
# - FP32 -> FP16: 2-3x faster
# - FP32 -> INT8: 3-5x faster
# - Layer fusion + optimization: Additional 20-50% improvement

10.5 OpenAI Triton

Triton is a programming language for writing high-performance GPU kernels in Python.

import triton
import triton.language as tl
import torch

# Triton kernel: vector addition
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y

    tl.store(output_ptr + offsets, output, mask=mask)

# Usage
def triton_add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

# Execution
x = torch.randn(1_000_000, device='cuda')
y = torch.randn(1_000_000, device='cuda')
result = triton_add(x, y)
assert torch.allclose(result, x + y)

# Advantages of Triton:
# - Higher-level abstraction than CUDA C
# - Automatic memory access optimization
# - Write high-performance kernels with Python syntax
# - Backend for PyTorch 2.0 torch.compile

10.6 FlashAttention

FlashAttention accelerates Transformer Self-Attention with an IO-aware algorithm.

# FlashAttention-3 (H100 optimized)
# Install: pip install flash-attn

from flash_attn import flash_attn_func

# Input: (batch, seqlen, nheads, headdim)
q = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
k = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)
v = torch.randn(2, 4096, 32, 128, device='cuda', dtype=torch.float16)

# FlashAttention execution
output = flash_attn_func(q, k, v, causal=True)

# Performance comparison (FlashAttention-3 vs standard Attention):
# - Memory: O(N^2) -> O(N) (sequence length N)
# - Speed: 2-4x faster (up to 740 TFLOPS on H100 FP16)
# - FP8: up to 1.2 PFLOPS (H100)
# - FlashAttention-3 is 1.5-2.0x faster than FlashAttention-2

# PyTorch 2.0+ built-in scaled_dot_product_attention
output = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
    is_causal=True
)
# PyTorch automatically selects FlashAttention or Memory-Efficient Attention
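FlashAttention's O(N) memory claim comes from the online-softmax recurrence: K/V are streamed tile by tile while a running max, normalizer, and unnormalized output accumulator are maintained and rescaled. A NumPy sketch for a single query row (a conceptual illustration of the algorithm, not the CUDA implementation):

```python
import numpy as np

def naive_attention(q, K, V):
    """Reference: materializes the full score vector."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def tiled_attention(q, K, V, tile=4):
    """Online softmax over K/V tiles: keep running max m, normalizer l,
    and an unnormalized accumulator, rescaling old state when m grows."""
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for t0 in range(0, K.shape[0], tile):
        s = K[t0:t0 + tile] @ q                 # scores for this tile only
        m_new = max(m, s.max())
        p = np.exp(s - m_new)
        correction = np.exp(m - m_new)          # rescale previous state
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[t0:t0 + tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
assert np.allclose(tiled_attention(q, K, V), naive_attention(q, K, V))
```

Because only one tile of scores exists at a time, memory stays O(tile) per query row instead of O(N), which is the heart of the IO-aware design.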

11. CUDA Troubleshooting

11.1 CUDA OOM (Out of Memory) Solutions

GPU memory shortage is the most common CUDA issue.

# Error: RuntimeError: CUDA out of memory.

# Solution 1: Reduce batch size
# batch_size = 64 -> 32 -> 16

# Solution 2: Use Mixed Precision (~50% memory savings)
from torch.amp import autocast, GradScaler
scaler = GradScaler('cuda')

# Solution 3: Gradient Accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(loader):
    with autocast('cuda'):
        # Divide by accumulation_steps so the accumulated gradient
        # matches one large-batch step
        loss = criterion(model(data), target) / accumulation_steps
    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

# Solution 4: Gradient Checkpointing (memory-compute tradeoff)
from torch.utils.checkpoint import checkpoint
class MemEfficientModel(nn.Module):
    def forward(self, x):
        # Don't store intermediate activations; recompute during backward
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return x

# Solution 5: Clear cache
torch.cuda.empty_cache()

# Solution 6: Debug memory leaks
print(torch.cuda.memory_summary())

# Solution 7: Delete unnecessary tensors immediately
del large_tensor
torch.cuda.empty_cache()

# Solution 8: Disable gradients during inference
with torch.no_grad():
    output = model(input_data)

11.2 CUDA Driver vs Runtime Version Mismatch

# Problem: "CUDA driver version is insufficient for CUDA runtime version"

# How to check
nvidia-smi           # Check driver-supported CUDA version
nvcc --version       # Check installed CUDA Toolkit version

# Rule: Driver CUDA >= Toolkit CUDA
# Example: If driver supports CUDA 12.2, CUDA 13.0 Toolkit cannot be used

# Solution 1: Update driver
sudo apt-get update
sudo apt-get install nvidia-driver-560  # Install latest driver

# Solution 2: Downgrade to compatible CUDA Toolkit
sudo apt-get install cuda-toolkit-12-4

# Solution 3: Install matching CUDA build of PyTorch/TensorFlow
pip install torch --index-url https://download.pytorch.org/whl/cu124
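The compatibility rule above ("driver CUDA >= toolkit CUDA") is just a major/minor tuple comparison; a tiny helper (version strings here are illustrative, and forward-compatibility packages are ignored):

```python
def parse_cuda_version(v):
    """'12.4' -> (12, 4) for tuple comparison."""
    major, minor = v.split(".")[:2]
    return int(major), int(minor)

def toolkit_is_supported(driver_cuda, toolkit_cuda):
    """True when the driver's reported CUDA version can run binaries
    built with the given toolkit version."""
    return parse_cuda_version(driver_cuda) >= parse_cuda_version(toolkit_cuda)

# Matches the example in the text: a 12.2 driver cannot run a 13.0 toolkit.
assert not toolkit_is_supported("12.2", "13.0")
assert toolkit_is_supported("12.4", "12.4")
assert toolkit_is_supported("13.1", "12.8")
```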

11.3 nvidia-smi Key Commands Summary

| Command | Purpose |
|---|---|
| nvidia-smi | GPU status summary |
| nvidia-smi -l 1 | Refresh every 1 second |
| nvidia-smi -q | Detailed info |
| nvidia-smi -q -d MEMORY | Memory details |
| nvidia-smi -q -d CLOCK | Clock speeds |
| nvidia-smi -q -d TEMPERATURE | Temperature info |
| nvidia-smi -q -d POWER | Power consumption |
| nvidia-smi -q -d PERFORMANCE | Performance state |
| nvidia-smi --query-gpu=... --format=csv | CSV output |
| nvidia-smi pmon -i 0 | Process monitoring |
| nvidia-smi dmon -d 1 | Device monitoring |
| nvidia-smi topo -m | GPU topology (NVLink, etc.) |
| nvidia-smi -r -i 0 | Reset GPU 0 |
| nvidia-smi -pm 1 | Enable Persistence Mode |
| nvidia-smi -pl 300 | Set power limit (W) |

11.4 cuda-gdb Debugging

# CUDA debug build
nvcc -g -G -O0 -o debug_app app.cu

# Run cuda-gdb
cuda-gdb ./debug_app

# Key commands
(cuda-gdb) break myKernel            # Breakpoint on kernel
(cuda-gdb) run                        # Run
(cuda-gdb) cuda thread               # Current thread info
(cuda-gdb) cuda block                 # Current block info
(cuda-gdb) cuda kernel                # Current kernel info
(cuda-gdb) cuda thread (0,0,0)        # Switch to specific thread
(cuda-gdb) cuda block (1,0,0)         # Switch to specific block
(cuda-gdb) info cuda threads          # List active threads
(cuda-gdb) print threadIdx.x          # Print built-in variable
(cuda-gdb) print data[idx]            # Check data

# compute-sanitizer (memory error detection)
compute-sanitizer --tool memcheck ./my_app
compute-sanitizer --tool racecheck ./my_app   # Race condition detection
compute-sanitizer --tool initcheck ./my_app   # Initialization check

11.5 Common Error Messages and Solutions

| Error Message | Cause | Solution |
|---|---|---|
| CUDA error: out of memory | GPU memory shortage | Reduce batch size, Mixed Precision, Gradient Checkpointing |
| CUDA error: device-side assert triggered | Assert failure or index out of bounds in kernel | Run with CUDA_LAUNCH_BLOCKING=1 to identify exact location |
| CUDA error: an illegal memory access | Invalid memory access | Debug with compute-sanitizer, check index bounds |
| CUDA error: no kernel image is available | Compute Capability mismatch | Recompile with correct -arch=sm_XX option |
| CUDA driver version is insufficient | Driver version too old | Update driver or downgrade CUDA Toolkit |
| CUDA error: invalid device function | Wrong architecture target | Check -gencode options, use Fat Binary |
| cuDNN error: CUDNN_STATUS_NOT_SUPPORTED | cuDNN version mismatch or unsupported op | Update cuDNN or check input format |
| NCCL error: unhandled system error | Multi-GPU communication failure | Check network config, enable NCCL_DEBUG=INFO |

# Debugging tip: Switch to synchronous execution to pinpoint exact error location
CUDA_LAUNCH_BLOCKING=1 python train.py

# NCCL debugging
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch train.py

# PyTorch CUDA memory debugging
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py

12. CUDA Alternative Technologies Comparison

12.1 Major GPU Computing Frameworks

| Technology | Developer | Target Hardware | Languages | Ecosystem Maturity | Primary Use |
|---|---|---|---|---|---|
| CUDA | NVIDIA | NVIDIA GPU | C/C++/Python | Very High | AI/ML, HPC, scientific computing |
| ROCm | AMD | AMD GPU (MI300X, etc.) | C/C++ (HIP) | Medium | AI training/inference (PyTorch supported) |
| OpenCL | Khronos | General (GPU/CPU/FPGA) | C/C++ | Medium | Cross-platform GPU computing |
| SYCL | Khronos | General | C++ | Growing | oneAPI (Intel), cross-platform |
| Metal | Apple | Apple Silicon | Swift/Obj-C/C++ | High (Apple ecosystem) | macOS/iOS GPU computing |
| Vulkan Compute | Khronos | General GPU | C/GLSL/SPIR-V | Medium | Cross-platform GPU computation |
| Triton | OpenAI | NVIDIA/AMD GPU | Python | Growing | High-level GPU kernel programming |
| WebGPU | W3C | Browser GPU | WGSL/JS | Early | Web-based GPU computation |

12.2 CUDA vs ROCm Detailed Comparison

CUDA code:
  cudaMalloc(&d_ptr, size);
  myKernel<<<grid, block>>>(d_ptr);
  cudaDeviceSynchronize();

ROCm HIP code (nearly identical API):
  hipMalloc(&d_ptr, size);
  myKernel<<<grid, block>>>(d_ptr);
  hipDeviceSynchronize();

AMD ROCm's HIP (Heterogeneous-compute Interface for Portability) provides a very similar API to CUDA, and the hipify-perl tool can automatically convert CUDA code to HIP code. PyTorch officially supports ROCm, enabling training on AMD MI300X and similar hardware.
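The mechanical nature of the CUDA-to-HIP port is easy to see: hipify-perl is, at its core, a large set of textual rewrites of API names. A toy Python illustration of that idea (the real tool handles many more symbols, headers, and edge cases):

```python
import re

# A tiny, illustrative subset of the CUDA -> HIP renames hipify performs.
RENAMES = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(src):
    # Word boundaries keep cudaMemcpy from clobbering cudaMemcpyHostToDevice.
    pattern = re.compile(r"\b(" + "|".join(RENAMES) + r")\b")
    return pattern.sub(lambda m: RENAMES[m.group(1)], src)

cuda_src = "cudaMalloc(&d_ptr, size); myKernel<<<grid, block>>>(d_ptr); cudaDeviceSynchronize();"
print(toy_hipify(cuda_src))
# hipMalloc(&d_ptr, size); myKernel<<<grid, block>>>(d_ptr); hipDeviceSynchronize();
```

The kernel launch syntax itself is unchanged, which is why the HIP snippet above looks nearly identical to the CUDA one.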

12.3 Selection Guide

| Scenario | Recommended Technology | Reason |
|---|---|---|
| AI/ML training and inference | CUDA | Optimized library ecosystem (cuDNN, TensorRT) |
| AMD GPU usage | ROCm (HIP) | CUDA-compatible API, PyTorch support |
| Cross-platform requirement | OpenCL or SYCL | Broad hardware support |
| Apple environment | Metal | Only option for macOS/iOS |
| Custom kernels (Python) | Triton | Higher productivity than CUDA C |
| Web browser GPU | WebGPU | Standard web API |

13. References

13.1 Official Documentation

  • NVIDIA CUDA Programming Guide: Official reference for GPU architecture and programming model
  • CUDA C++ Best Practices Guide: Collection of best practices for performance optimization
  • CUDA Toolkit Release Notes: Change logs and compatibility information for each version
  • NVIDIA Developer Blog: Latest technology trends and tutorials
| Resource | URL |
|---|---|
| CUDA Toolkit Download | https://developer.nvidia.com/cuda-downloads |
| CUDA Programming Guide | https://docs.nvidia.com/cuda/cuda-c-programming-guide/ |
| CUDA Best Practices | https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ |
| cuDNN Documentation | https://docs.nvidia.com/deeplearning/cudnn/ |
| TensorRT Documentation | https://docs.nvidia.com/deeplearning/tensorrt/ |
| NCCL Documentation | https://docs.nvidia.com/deeplearning/nccl/ |
| Nsight Compute | https://developer.nvidia.com/nsight-compute |
| Nsight Systems | https://developer.nvidia.com/nsight-systems |
| RAPIDS Official Site | https://rapids.ai/ |
| FlashAttention GitHub | https://github.com/Dao-AILab/flash-attention |
| OpenAI Triton GitHub | https://github.com/openai/triton |
| PyTorch CUDA Docs | https://pytorch.org/docs/stable/cuda.html |

13.2 Learning Roadmap

Beginner:
  1. Check GPU with nvidia-smi
  2. Use .cuda() in PyTorch
  3. Apply Mixed Precision Training
  4. Write simple kernels with Numba CUDA JIT

Intermediate:
  5. Write kernels directly in CUDA C/C++
  6. Optimize with Shared Memory, Coalesced Access
  7. Asynchronous execution with CUDA Streams
  8. Profile with Nsight Compute

Advanced:
  9. Custom Attention kernels with Triton
  10. Inference optimization with TensorRT
  11. Multi-GPU DDP training
  12. Custom GEMM implementation with CUTLASS
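As a minimal sketch of beginner steps 1-3 (assuming PyTorch is installed), the following moves a model and data to the GPU when one is visible and enables mixed precision via autocast; it falls back to CPU so it runs either way:

```python
import torch

# Steps 1-2: pick the GPU if one is visible, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 64).to(device)
x = torch.randn(32, 128, device=device)

# Step 3: mixed precision via autocast (float16 on CUDA, bfloat16 on CPU)
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    y = model(x)

print(y.shape)  # torch.Size([32, 64])
```

For real training loops on CUDA, this is typically paired with `torch.cuda.amp.GradScaler` to avoid underflow in float16 gradients.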

Conclusion

CUDA is more than a GPU programming tool -- it forms the foundation of modern AI infrastructure. Alongside hardware evolution from Ampere to Hopper to Blackwell, the CUDA Toolkit has entered the 13.x era, beginning to offer higher-level abstractions such as CUDA Tile and cuTile DSL.

The key points for effectively leveraging CUDA in practice are as follows.

  1. Build solid foundations: Understanding the Warp, Block, Grid execution model and memory hierarchy is essential for optimization
  2. Actively use Python tools: Most GPU acceleration can be achieved with PyTorch's Mixed Precision, CuPy, Numba, and similar tools
  3. Profile first: Identify bottlenecks with Nsight Compute/Systems before investing in optimization
  4. Memory is key: Memory optimization techniques like Coalesced Access, Shared Memory Tiling, and Pinned Memory typically deliver the largest share of performance gains
  5. Leverage the ecosystem: Maximize use of already-optimized libraries like cuDNN, TensorRT, and FlashAttention

The world of GPU computing continues to evolve. The emergence of high-level programming models like Triton, the expansion of low-precision operations such as FP4/FP6, and the introduction of tile-based programming represented by CUDA Tile are all moving GPU programming toward lowering the barrier to entry while maximizing hardware performance. We hope this guide serves as a starting point for that journey.

Quiz

Q1: What is the main topic covered in "CUDA Hands-on Complete Guide: Everything About GPU Computing"?

A comprehensive guide covering NVIDIA CUDA from fundamental concepts to GPU architecture, C/C++ kernel programming, memory optimization, Python integration (PyTorch, Numba, CuPy), Multi-GPU training, profiling, and troubleshooting with a hands-on, practical approach.

Q2: What core components make up an NVIDIA GPU, according to the GPU architecture fundamentals section?

The section examines the internal structure of an NVIDIA GPU hierarchically. Among the key core types, the CUDA Core is the basic processing unit for general-purpose floating-point and integer operations.

Q3: What are the key steps for setting up a CUDA development environment?

Install the CUDA Toolkit (covered for both Linux Ubuntu/Debian and Windows), monitor GPU status with nvidia-smi, the essential tool for that purpose, and compile CUDA source code with nvcc, the NVIDIA CUDA compiler.

Q4: What are the key aspects of CUDA programming basics in C/C++?

In CUDA programming, CPU-side code is called Host code and GPU-side code is called Device code. CUDA provides three function qualifiers to mark where a function executes and from where it may be called.

Q5: How does Coalesced Access work in CUDA memory management?

Coalesced Access is critical to performance when accessing Global Memory: when the 32 threads in a Warp access contiguous memory addresses, the GPU combines those accesses into a single memory transaction.