Split View: CUDA 프로그래밍 기초: GPU 병렬 컴퓨팅 완전 가이드

CUDA 프로그래밍 기초: GPU 병렬 컴퓨팅 완전 가이드

1. GPU vs CPU 아키텍처 차이
2. CUDA 프로그래밍 모델: Grid, Block, Thread 계층 구조
3. CUDA Kernel 작성법과 실행 구성
- 3.1 Kernel 함수 정의
- 3.2 실행 구성 (Execution Configuration)
4. Thread 인덱싱: threadIdx, blockIdx, blockDim, gridDim
- 4.1 Built-in 변수
- 4.2 전역 Thread ID 계산
5. GPU 메모리 종류
6. 메모리 관리 API
- 6.1 명시적 메모리 관리
- 6.2 Unified Memory (통합 메모리)
7. Warp 실행과 Warp Divergence
- 7.1 Warp의 개념
- 7.2 Warp Divergence
  - Warp Divergence 최소화 전략
8. Occupancy 개념과 최적화
9. 실전 예제
- 9.1 Vector Addition
- 9.2 Matrix Multiplication
  - Naive 버전
  - Tiled 버전 (Shared Memory 활용)
10. NVIDIA GPU 세대별 특성 (Compute Capability)
- 주요 아키텍처별 프로그래밍 관련 변경사항
11. 디버깅 및 프로파일링 도구
12. 에러 처리 Best Practice
마무리
References

1. GPU vs CPU 아키텍처 차이

CUDA 프로그래밍을 이해하기 위해서는 먼저 GPU와 CPU의 근본적인 아키텍처 차이를 파악해야 한다.

1.1 CPU: 순차 처리에 최적화된 프로세서

CPU(Central Processing Unit)는 복잡한 제어 흐름과 분기 예측, 대용량 캐시를 갖추고 있어 순차적(sequential) 작업에 최적화되어 있다. 일반적으로 고성능 CPU는 8~64개의 코어를 가지며, 각 코어는 독립적으로 복잡한 명령어를 빠르게 처리할 수 있다. CPU의 트랜지스터 대부분은 제어 로직(Control Logic)과 캐시(Cache)에 할당되어, 단일 Thread의 실행 속도를 극대화하는 데 초점이 맞춰져 있다.

1.2 GPU: 대규모 병렬 처리에 최적화된 프로세서

반면 GPU(Graphics Processing Unit)는 수천 개의 작은 코어를 탑재하여 대규모 병렬(massively parallel) 연산에 특화되어 있다. NVIDIA GPU는 Streaming Multiprocessor(SM)라는 단위로 구성되며, 각 SM 안에는 수십~수백 개의 CUDA Core가 존재한다. GPU의 트랜지스터 대부분은 연산 유닛(ALU)에 할당되어 있어, 동시에 수천 개의 Thread를 실행할 수 있다.

특성	CPU	GPU
코어 수	8~64개 (고성능)	수천~수만 개
코어 특성	복잡하고 강력함	단순하고 경량
캐시 크기	대용량 (수십 MB)	상대적으로 소용량
최적 작업	순차 처리, 분기 복잡	대규모 데이터 병렬 처리
메모리 대역폭	상대적으로 낮음	매우 높음 (HBM)

1.3 SIMT 실행 모델

NVIDIA GPU는 SIMT(Single Instruction, Multiple Threads) 실행 모델을 사용한다. SIMT는 SIMD(Single Instruction, Multiple Data)와 유사하지만, 핵심적인 차이가 있다. SIMD에서는 벡터 폭(vector width)이 소프트웨어에 노출되지만, SIMT에서는 개별 Thread의 실행과 분기 동작을 명세한다. 즉, 각 Thread가 독립적인 프로그램 카운터와 레지스터 상태를 가지며, 논리적으로는 독립적인 실행 경로를 따를 수 있다.

SIMT의 핵심은 Warp 단위 실행이다. GPU는 32개의 Thread를 하나의 Warp로 묶어 동일한 명령어를 동시에 실행한다. Warp 내 모든 Thread가 같은 코드 경로를 따를 때 최고의 성능을 발휘하며, Thread들이 서로 다른 분기를 탈 경우 성능 저하가 발생한다 (이를 Warp Divergence라 하며, 뒤에서 자세히 다룬다).

2. CUDA 프로그래밍 모델: Grid, Block, Thread 계층 구조

CUDA 프로그래밍 모델의 가장 핵심적인 개념은 Thread 계층 구조(Thread Hierarchy) 이다. CUDA에서는 Kernel 함수를 호출하면 수많은 Thread가 생성되어 병렬로 실행되며, 이 Thread들은 Grid > Block > Thread 의 3단계 계층 구조로 조직된다.

2.1 Thread

Thread는 CUDA 실행의 가장 기본적인 단위이다. 각 Thread는 Kernel 코드의 한 인스턴스를 실행하며, 자신만의 레지스터와 로컬 메모리를 가진다. 각 Thread는 고유한 ID를 통해 자신이 처리해야 할 데이터를 결정한다.

2.2 Thread Block (Block)

Thread Block은 Thread들의 그룹이다. 같은 Block 내의 Thread들은 다음과 같은 특성을 공유한다:

Shared Memory 를 통해 데이터를 공유할 수 있다
__syncthreads()를 통해 동기화(Synchronization) 가 가능하다
하나의 SM에서 실행되며, 실행 중 다른 SM으로 이동하지 않는다
최대 1024개의 Thread 를 포함할 수 있다 (Compute Capability에 따라 다를 수 있음)

Thread Block은 1차원, 2차원, 또는 3차원으로 구성할 수 있어, 벡터, 행렬, 볼륨 데이터에 대한 인덱싱이 자연스럽게 이루어진다.

2.3 Grid

Grid는 Thread Block들의 집합이다. 하나의 Kernel 호출은 하나의 Grid를 생성한다. Grid 역시 1차원, 2차원, 또는 3차원으로 구성할 수 있다. 서로 다른 Block 간에는 Shared Memory를 통한 직접적인 데이터 공유가 불가능하며, 동기화도 제한적이다 (Cooperative Groups 등 특수 API를 사용해야 한다).

2.4 Thread Block Cluster (Compute Capability 9.0 이상)

NVIDIA Hopper 아키텍처(Compute Capability 9.0)부터는 Thread Block Cluster 라는 선택적 계층이 추가되었다. Cluster는 여러 Thread Block으로 구성되며, 같은 Cluster 내의 Block들은 같은 GPC(GPU Processing Cluster)에서 실행되어 Distributed Shared Memory 를 통해 서로의 Shared Memory에 접근할 수 있다.

Grid
 +-- Block Cluster (optional, CC 9.0+)
      +-- Thread Block (최대 1024 Threads)
           +-- Thread (개별 실행 단위)

3. CUDA Kernel 작성법과 실행 구성

3.1 Kernel 함수 정의

CUDA에서 GPU에서 실행되는 함수를 Kernel 이라 한다. Kernel 함수는 __global__ 한정자를 사용하여 정의하며, 반환 타입은 반드시 void여야 한다.

__global__ void myKernel(int *data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] = data[idx] * 2;
    }
}

CUDA 함수 한정자는 세 가지가 있다:

한정자	실행 위치	호출 위치
`__global__`	GPU (Device)	CPU (Host) 또는 GPU
`__device__`	GPU (Device)	GPU (Device)
`__host__`	CPU (Host)	CPU (Host)

__host__와 __device__를 동시에 사용하면 Host와 Device 양쪽 모두에서 컴파일된다.

3.2 실행 구성 (Execution Configuration)

Kernel을 호출할 때는 <<<gridDim, blockDim>>> 구문으로 Grid와 Block의 차원을 지정한다.

// 1차원 구성
int N = 1024;
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, N);

// 2차원 구성
dim3 blockDim(16, 16);      // 16x16 = 256 threads per block
dim3 gridDim(64, 64);       // 64x64 blocks
matMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

// Shared Memory 크기와 Stream 지정
myKernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args);

<<<>>> 구문의 전체 형태는 <<<gridDim, blockDim, sharedMemBytes, stream>>> 이다. 세 번째 인자는 동적으로 할당할 Shared Memory의 바이트 크기이고, 네 번째 인자는 CUDA Stream이다. 생략하면 각각 0과 default stream이 사용된다.

중요: Block 당 Thread 수는 하드웨어 제한이 있다. 현재 모든 NVIDIA GPU에서 Block 당 최대 1024개의 Thread 를 지원한다. 이 값을 초과하면 Kernel이 실행되지 않는다.

4. Thread 인덱싱: threadIdx, blockIdx, blockDim, gridDim

Kernel 내부에서 각 Thread는 Built-in 변수를 사용하여 자신의 위치를 파악한다. 이 변수들은 uint3 또는 dim3 타입으로, .x, .y, .z 멤버를 가진다.

4.1 Built-in 변수

변수	설명
`threadIdx`	Block 내에서의 Thread 인덱스 (0부터 시작)
`blockIdx`	Grid 내에서의 Block 인덱스 (0부터 시작)
`blockDim`	Block의 차원 (Block 내 Thread 수)
`gridDim`	Grid의 차원 (Grid 내 Block 수)
`warpSize`	Warp 크기 (현재 항상 32)

4.2 전역 Thread ID 계산

1차원 Grid, 1차원 Block의 경우:

int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

2차원 Grid, 2차원 Block의 경우:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int globalIdx = row * width + col;

3차원의 경우에도 동일한 패턴으로 확장된다:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;

Boundary Check는 필수이다. 데이터 크기가 Block 크기의 정확한 배수가 아닌 경우, 마지막 Block의 일부 Thread가 유효 범위를 벗어날 수 있으므로 반드시 경계 검사를 수행해야 한다:

if (globalIdx < N) {
    // 유효한 인덱스에 대해서만 연산 수행
    output[globalIdx] = input[globalIdx] * 2;
}

5. GPU 메모리 종류

CUDA GPU는 여러 종류의 메모리 공간을 제공하며, 각각 접근 속도, 크기, 가시성(scope)이 다르다. 올바른 메모리를 선택하는 것이 CUDA 프로그램 최적화의 핵심이다.

5.1 Register (레지스터)

위치: On-chip (SM 내부)
가시성: 개별 Thread 전용
속도: 가장 빠름 (1 cycle latency)
크기: SM당 64K개의 32-bit 레지스터, Thread당 최대 255개

Kernel 내의 지역 변수는 기본적으로 Register에 할당된다. Register 사용량이 많으면 SM에 동시에 올릴 수 있는 Thread 수가 줄어 Occupancy가 감소한다.

5.2 Local Memory (로컬 메모리)

위치: Off-chip (Device Memory, Global Memory와 같은 물리적 위치)
가시성: 개별 Thread 전용
속도: Global Memory와 동일 (느림, 수백 cycle)
용도: Register에 담기지 않는 지역 변수 (Register Spill), 큰 배열

이름은 "Local"이지만 실제로는 Off-chip에 위치 하므로 접근 속도가 느리다. 컴파일러가 Register가 부족할 때 자동으로 Local Memory로 Spill한다.

5.3 Shared Memory

위치: On-chip (SM 내부, L1 Cache와 물리적 공간 공유)
가시성: 같은 Thread Block 내의 모든 Thread
속도: Register에 근접 (Bank Conflict 없을 때 ~5 cycle)
크기: SM당 일반적으로 48KB~164KB (아키텍처에 따라 다름)

__global__ void sharedMemExample(float *data) {
    __shared__ float sharedData[256];  // 정적 할당

    int tid = threadIdx.x;
    sharedData[tid] = data[blockIdx.x * blockDim.x + tid];

    __syncthreads();  // Block 내 모든 Thread 동기화

    // sharedData를 활용한 연산
    float result = sharedData[tid] + sharedData[255 - tid];
    data[blockIdx.x * blockDim.x + tid] = result;
}

Shared Memory는 Bank 로 나뉘어 있으며 (일반적으로 32개), 여러 Thread가 같은 Bank에 동시에 접근하면 Bank Conflict 가 발생하여 직렬화된다. Bank Conflict를 피하는 접근 패턴 설계가 중요하다.

5.4 Global Memory

위치: Off-chip (Device DRAM, 즉 HBM 또는 GDDR)
가시성: 모든 Thread + Host
속도: 가장 느림 (수백 cycle latency)
크기: 가장 큼 (수 GB ~ 수십 GB)

cudaMalloc()으로 할당하는 메모리가 바로 Global Memory이다. L1/L2 Cache를 통해 접근 속도를 향상시킬 수 있으며, Coalesced Access(인접 Thread가 연속된 메모리 주소를 접근)를 통해 대역폭을 극대화해야 한다.

5.5 Constant Memory

위치: Off-chip (Global Memory 영역), 전용 Cache로 캐싱
가시성: 모든 Thread (Read-only)
속도: 캐시 히트 시 매우 빠름
크기: 64KB

__constant__ float constData[256];

// Host에서 값 설정
cudaMemcpyToSymbol(constData, hostData, sizeof(float) * 256);

Warp 내 모든 Thread가 동일한 주소를 읽을 때 최고의 성능을 발휘한다. Broadcast 메커니즘을 통해 한 번의 메모리 읽기로 Warp 내 모든 Thread에 값을 전달한다.

5.6 Texture Memory

위치: Off-chip, 전용 Texture Cache로 캐싱
가시성: 모든 Thread (Read-only)
특징: 2D Spatial Locality에 최적화, 하드웨어 보간(Interpolation) 지원

Texture Memory는 이미지 처리나 2D/3D 데이터에서 공간적으로 인접한 데이터를 접근하는 패턴에 유리하다. 최신 CUDA에서는 Surface Object와 함께 활용된다.

5.7 메모리 계층 요약

빠름 <--------------------------------------------> 느림
Register > Shared Memory > L1/L2 Cache > Constant/Texture Cache > Global Memory
(On-chip)   (On-chip)      (On-chip)     (Cached)                 (Off-chip)

6. 메모리 관리 API

6.1 명시적 메모리 관리

CUDA의 전통적인 메모리 관리 방식은 Host와 Device 메모리를 명시적으로 분리하여 관리하는 것이다.

cudaMalloc: Device 메모리 할당

float *d_array;
cudaMalloc((void **)&d_array, N * sizeof(float));

cudaMalloc은 GPU의 Global Memory에 메모리를 할당한다. 할당된 포인터(d_array)는 Device에서만 유효하며, Host에서 직접 역참조(dereference)할 수 없다.

cudaMemcpy: Host-Device 간 데이터 전송

// Host -> Device
cudaMemcpy(d_array, h_array, N * sizeof(float), cudaMemcpyHostToDevice);

// Device -> Host
cudaMemcpy(h_result, d_result, N * sizeof(float), cudaMemcpyDeviceToHost);

// Device -> Device
cudaMemcpy(d_dest, d_src, N * sizeof(float), cudaMemcpyDeviceToDevice);

cudaMemcpy는 동기(Synchronous) 함수로, 전송이 완료될 때까지 Host Thread가 블로킹된다. 비동기 전송이 필요하면 cudaMemcpyAsync를 사용하고 CUDA Stream을 지정한다.

cudaFree: Device 메모리 해제

cudaFree(d_array);

전형적인 CUDA 프로그램 흐름

// 1. Host 메모리 할당 및 데이터 초기화
float *h_input = (float *)malloc(N * sizeof(float));
float *h_output = (float *)malloc(N * sizeof(float));
initializeData(h_input, N);

// 2. Device 메모리 할당
float *d_input, *d_output;
cudaMalloc(&d_input, N * sizeof(float));
cudaMalloc(&d_output, N * sizeof(float));

// 3. Host -> Device 데이터 전송
cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);

// 4. Kernel 실행
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(d_input, d_output, N);

// 5. Device -> Host 결과 전송
cudaMemcpy(h_output, d_output, N * sizeof(float), cudaMemcpyDeviceToHost);

// 6. 메모리 해제
cudaFree(d_input);
cudaFree(d_output);
free(h_input);
free(h_output);

6.2 Unified Memory (통합 메모리)

CUDA 6.0부터 도입된 Unified Memory 는 Host와 Device가 단일 주소 공간(Single Address Space) 을 공유하도록 한다. 데이터 이동은 CUDA 런타임이 자동으로 관리한다.

float *data;
cudaMallocManaged(&data, N * sizeof(float));

// Host에서 접근 가능
for (int i = 0; i < N; i++) {
    data[i] = (float)i;
}

// Device에서도 동일 포인터 사용
myKernel<<<gridSize, blockSize>>>(data, N);
cudaDeviceSynchronize();

// Host에서 결과 접근 (동기화 후)
printf("Result: %f\n", data[0]);

cudaFree(data);  // cudaMallocManaged도 cudaFree로 해제

Unified Memory의 장점:

cudaMemcpy 호출이 불필요하여 코드가 간결해진다
Host와 Device에서 동일한 포인터를 사용한다
데이터 마이그레이션을 런타임이 On-demand로 처리한다
Device 메모리 용량을 초과하는 데이터도 처리 가능하다 (Oversubscription)

다만 성능 측면에서는, Stream과 cudaMemcpyAsync를 사용하여 Kernel 실행과 데이터 전송을 겹치도록(overlap) 최적화한 프로그램이 Unified Memory만 사용한 프로그램보다 더 높은 성능을 낼 수 있다. cudaMemPrefetchAsync를 활용하면 Unified Memory의 성능을 개선할 수 있다.

7. Warp 실행과 Warp Divergence

7.1 Warp의 개념

Warp 는 CUDA에서 실행의 기본 스케줄링 단위이다. SM의 Warp Scheduler가 Thread를 32개씩 묶어 하나의 Warp를 구성하고, Warp 단위로 명령어를 발행(issue)한다. 같은 Warp 내의 32개 Thread는 동일한 프로그램 카운터(PC)를 공유하며 같은 명령어를 동시에 실행 한다.

Thread Block 내에서 Warp는 Thread ID의 순서대로 구성된다:

Warp 0: Thread 0~31
Warp 1: Thread 32~63
Warp 2: Thread 64~95
... 이하 동일

7.2 Warp Divergence

Warp 내의 Thread들이 조건 분기(if, switch, for 등)에서 서로 다른 경로 를 따를 때 Warp Divergence 가 발생한다.

__global__ void divergentKernel(int *data, int *result) {
    int tid = threadIdx.x;

    // Warp Divergence 발생!
    if (tid % 2 == 0) {
        result[tid] = data[tid] * 2;      // 짝수 Thread
    } else {
        result[tid] = data[tid] + 10;     // 홀수 Thread
    }
}

위 코드에서 Warp 내 짝수 번째 Thread와 홀수 번째 Thread가 서로 다른 분기를 탄다. GPU는 이를 처리하기 위해 각 분기 경로를 순차적으로 실행 하고, 해당 경로에 속하지 않는 Thread는 비활성화(disable)한다. 결과적으로 두 분기의 실행 시간이 합산되어 성능이 저하된다.

Warp Divergence 최소화 전략

// BAD: 같은 Warp 내에서 분기
if (threadIdx.x % 2 == 0) { ... }

// BETTER: Warp 단위로 분기가 일어나도록 설계
if (threadIdx.x / 32 % 2 == 0) { ... }
// 또는
if (blockIdx.x % 2 == 0) { ... }

핵심 원칙은, 같은 Warp 내의 Thread들이 동일한 코드 경로를 따르도록 코드를 설계하는 것이다. 분기 조건이 Warp 경계에 맞춰져 있으면, 각 Warp 전체가 하나의 경로만 실행하므로 Divergence가 발생하지 않는다.

8. Occupancy 개념과 최적화

8.1 Occupancy란?

Occupancy 는 SM에서 동시에 활성화될 수 있는 Warp 수 대비 실제로 활성화된 Warp 수의 비율이다.

Occupancy = Active Warps / Maximum Warps per SM

예를 들어 SM이 최대 64개의 Warp를 지원하고, 실제로 32개의 Warp가 활성화되어 있다면 Occupancy는 50%이다.

8.2 Occupancy에 영향을 미치는 요인

Occupancy는 다음 세 가지 리소스에 의해 결정된다:

Register 사용량: Thread당 Register 수가 많으면, SM에 올릴 수 있는 총 Thread 수가 줄어든다. SM당 64K개의 32-bit Register가 있으므로, Thread당 128개의 Register를 사용하면 SM에 최대 512개 Thread(16 Warp)만 올릴 수 있다.
Shared Memory 사용량: Block당 Shared Memory 사용량이 크면, SM에 동시에 배치할 수 있는 Block 수가 줄어든다.
Block 크기 (Thread 수): SM당 최대 Block 수 제한이 있으므로 (예: Ampere에서 CC 8.0은 32개, CC 8.6은 16개), Block이 너무 작으면 이 제한에 걸려 SM의 Thread 용량을 다 채우지 못할 수 있다.

8.3 Occupancy 최적화 방법

Block 크기는 128, 256, 512 중 선택: 경험적으로 256이 좋은 출발점이다. 너무 작으면(32, 64) SM당 Block 수 제한에 걸리고, 너무 크면(1024) 리소스 요구가 커져 Occupancy가 낮아질 수 있다.
Register 사용량 제어: __launch_bounds__ 한정자나 컴파일러 옵션 -maxrregcount를 사용하여 Kernel당 Register 수를 제한할 수 있다.

__global__ void __launch_bounds__(256, 4)  // maxThreadsPerBlock, minBlocksPerSM
myKernel(float *data) {
    // ...
}

CUDA Occupancy Calculator 활용: NVIDIA에서 제공하는 Occupancy Calculator 스프레드시트나 cudaOccupancyMaxPotentialBlockSize() API를 사용하여 최적의 Block 크기를 자동으로 결정할 수 있다.

int blockSize;
int minGridSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

주의: 높은 Occupancy가 항상 높은 성능을 보장하지는 않는다. 메모리 접근 패턴, 명령어 수준 병렬성(ILP), Shared Memory 활용도 등이 종합적으로 성능에 영향을 미친다.

9. 실전 예제

9.1 Vector Addition

가장 기본적인 CUDA 예제인 벡터 덧셈이다. 두 배열의 같은 인덱스 원소를 더하여 결과 배열에 저장한다.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20;  // 약 100만 개 원소
    size_t size = N * sizeof(float);

    // Host 메모리 할당
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // 데이터 초기화
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Device 메모리 할당
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Host -> Device 전송
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel 실행
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Device -> Host 전송
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // 검증
    for (int i = 0; i < N; i++) {
        if (fabs(h_C[i] - 3.0f) > 1e-5) {
            fprintf(stderr, "Verification failed at index %d!\n", i);
            return -1;
        }
    }
    printf("Vector addition successful!\n");

    // 정리
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

핵심 포인트:

각 Thread가 하나의 원소를 처리한다 (1:1 매핑)
(N + threadsPerBlock - 1) / threadsPerBlock 로 올림 나눗셈을 수행하여 모든 원소를 커버한다
Boundary Check (if (idx < N))로 범위 초과 접근을 방지한다

9.2 Matrix Multiplication

행렬 곱셈은 CUDA의 대표적인 최적화 예제이다. 먼저 단순(Naive) 버전을 살펴본 후, Shared Memory를 활용한 Tiled 버전을 다룬다.

Naive 버전

__global__ void matMulNaive(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// 호출
dim3 blockDim(16, 16);
dim3 gridDim((N + 15) / 16, (N + 15) / 16);
matMulNaive<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

Naive 버전은 각 Thread가 결과 행렬의 한 원소를 계산하면서, 행렬 A의 한 행과 행렬 B의 한 열 전체를 Global Memory에서 읽는다. 이는 Global Memory 접근 횟수가 O(N^3)으로 매우 비효율적이다.

Tiled 버전 (Shared Memory 활용)

#define TILE_SIZE 16

__global__ void matMulTiled(float *A, float *B, float *C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Tile 단위로 순회
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Tile을 Shared Memory로 로드
        if (row < N && (t * TILE_SIZE + threadIdx.x) < N)
            tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if ((t * TILE_SIZE + threadIdx.y) < N && col < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();

        // Tile 내부에서 곱셈-덧셈
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

Tiled 행렬 곱셈의 핵심 아이디어:

Global Memory의 데이터를 TILE_SIZE x TILE_SIZE 크기의 타일로 나누어 Shared Memory에 로드한다
Shared Memory에서의 읽기는 Global Memory보다 ~100배 빠르다
동일 데이터를 Block 내 여러 Thread가 재사용하므로 Global Memory 접근이 크게 줄어든다
__syncthreads()로 모든 Thread가 타일 로드를 완료한 후에 연산을 수행한다

10. NVIDIA GPU 세대별 특성 (Compute Capability)

Compute Capability(CC)는 GPU의 하드웨어 기능과 사양을 나타내는 버전 번호이다. Major 버전은 아키텍처 세대를, Minor 버전은 세대 내 개선을 나타낸다.

CC	아키텍처	대표 GPU	주요 특징
3.x	Kepler	GTX 680, K40	Dynamic Parallelism, Hyper-Q
5.x	Maxwell	GTX 980, M40	에너지 효율 개선, SM 재설계
6.x	Pascal	GTX 1080, P100	HBM2, NVLink, FP16 지원
7.0	Volta	V100	Tensor Core 1세대, Independent Thread Scheduling
7.5	Turing	RTX 2080, T4	RT Core, INT8/INT4 Tensor Core
8.0	Ampere	A100	3세대 Tensor Core, TF32, BF16, Sparsity
8.6	Ampere	RTX 3090	SM당 최대 Warp 48개 (8.0은 64개)
8.9	Ada Lovelace	RTX 4090, L40	4세대 Tensor Core, FP8, Shader Execution Reordering
9.0	Hopper	H100	Thread Block Cluster, Transformer Engine, DPX, FP8
10.0	Blackwell	B200, GB200	5세대 Tensor Core, FP4/FP6, 208B 트랜지스터, HBM3e

주요 아키텍처별 프로그래밍 관련 변경사항

Volta (CC 7.0): Independent Thread Scheduling이 도입되어 Warp 내 Thread들이 더 유연하게 분기할 수 있게 되었다. 이전 아키텍처에서는 Warp 내 Thread들이 암묵적으로 동기화되는 Lock-step 동작에 의존하는 코드가 있었으나, Volta부터는 명시적인 동기화(__syncwarp())가 필요하다.

Hopper (CC 9.0): Thread Block Cluster 개념이 추가되어, 여러 Thread Block이 Distributed Shared Memory를 통해 협력할 수 있게 되었다. 또한 비동기 데이터 이동을 위한 TMA(Tensor Memory Accelerator) 유닛이 도입되었다.

Blackwell (CC 10.0): 5세대 Tensor Core가 FP4, FP6 정밀도를 네이티브로 지원하며, 마이크로-텐서 포맷과 동적 범위 스케일링을 사용한다. AI 연산 성능이 Hopper 대비 크게 향상되어 최대 20 PFLOPS의 AI 연산 성능을 제공한다.

11. 디버깅 및 프로파일링 도구

11.1 cuda-gdb

cuda-gdb 는 GNU GDB의 CUDA 확장 버전으로, 실제 GPU 하드웨어에서 실행되는 CUDA 애플리케이션을 디버깅할 수 있다.

주요 기능:

GPU Kernel 내에서 Breakpoint 설정
특정 Thread, Block, Warp 단위로 상태 조회
Device 메모리 내용 검사
Host와 Device 코드를 동시에 디버깅

# 디버그 정보 포함 컴파일
nvcc -g -G -o myapp myapp.cu

# cuda-gdb로 디버깅 시작
cuda-gdb ./myapp

# cuda-gdb 내에서
(cuda-gdb) break myKernel
(cuda-gdb) run
(cuda-gdb) cuda thread       # 현재 Thread 정보
(cuda-gdb) cuda block        # 현재 Block 정보
(cuda-gdb) info cuda threads # 모든 CUDA Thread 목록

11.2 Compute Sanitizer

Compute Sanitizer 는 CUDA 프로그램의 기능적 정확성을 검사하는 도구 모음으로, CUDA Toolkit에 포함되어 있다. 네 가지 하위 도구를 제공한다:

도구	기능
memcheck	메모리 접근 오류 (out-of-bounds, misaligned) 검출
racecheck	Shared Memory 데이터 경쟁 (Data Race) 검출
initcheck	초기화되지 않은 Global Memory 접근 검출
synccheck	Thread 동기화 오류 (`__syncthreads()` 잘못 사용 등) 검출

# memcheck 실행
compute-sanitizer --tool memcheck ./myapp

# racecheck 실행
compute-sanitizer --tool racecheck ./myapp

# initcheck 실행
compute-sanitizer --tool initcheck ./myapp

11.3 NVIDIA Nsight

NVIDIA Nsight은 통합 개발, 디버깅, 프로파일링 환경을 제공하는 도구 제품군이다.

Nsight Systems: 시스템 전체의 성능을 분석한다. CPU-GPU 간 타임라인, Kernel 실행 시간, 메모리 전송, API 호출 등을 시각적으로 보여준다. 전체적인 병목 지점을 파악하는 첫 단계로 사용한다.
Nsight Compute: 개별 CUDA Kernel의 상세 성능 메트릭을 분석한다. Occupancy, 메모리 대역폭 활용률, 명령어 처리량, Warp 상태 등을 세밀하게 조사할 수 있다. 특정 Kernel의 최적화에 사용한다.
Nsight Visual Studio Edition / VS Code Extension: IDE에 통합된 CUDA 디버깅 및 프로파일링 지원을 제공한다.

# Nsight Systems로 프로파일링
nsys profile --stats=true ./myapp

# Nsight Compute로 Kernel 분석
ncu --set full ./myapp

디버깅/프로파일링 워크플로우

기능 정확성 검증: Compute Sanitizer(memcheck, racecheck)로 메모리 오류와 Race Condition을 먼저 잡는다
시스템 레벨 분석: Nsight Systems로 전체적인 병목 지점 (CPU-GPU 동기화, 메모리 전송 등)을 파악한다
Kernel 레벨 최적화: Nsight Compute로 병목이 되는 Kernel의 세부 성능 메트릭을 분석하고 최적화한다

12. 에러 처리 Best Practice

CUDA API 호출은 대부분 cudaError_t 타입의 에러 코드를 반환한다. 프로덕션 코드에서는 반드시 에러를 체크해야 한다.

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = call;                                             \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
                    __FILE__, __LINE__, cudaGetErrorString(err));            \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// 사용 예시
CUDA_CHECK(cudaMalloc(&d_array, size));
CUDA_CHECK(cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice));

// Kernel 실행 후 에러 체크
myKernel<<<gridSize, blockSize>>>(d_array, N);
CUDA_CHECK(cudaGetLastError());       // Kernel launch 에러
CUDA_CHECK(cudaDeviceSynchronize());  // Kernel 실행 중 에러

Kernel 호출은 비동기적이므로, cudaGetLastError()로 Launch 시점의 에러를 확인하고, cudaDeviceSynchronize() 후에 Kernel 실행 중 발생한 에러를 확인해야 한다.

마무리

CUDA 프로그래밍은 GPU의 대규모 병렬 처리 능력을 활용하기 위한 핵심 기술이다. 이 글에서 다룬 내용을 정리하면:

GPU 아키텍처: 수천 개의 경량 코어가 SIMT 모델로 동작하며, 데이터 병렬 처리에 최적화되어 있다
Thread 계층 구조: Grid > Block > Thread의 3단계 계층으로 병렬 작업을 조직한다
메모리 계층: Register, Shared, Global 등 다양한 메모리 공간의 특성을 이해하고 적절히 활용해야 한다
Warp와 Occupancy: Warp Divergence를 최소화하고, Occupancy를 고려하여 실행 구성을 최적화해야 한다
디버깅 도구: Compute Sanitizer, Nsight Systems/Compute를 활용하여 체계적으로 디버깅하고 최적화한다

CUDA 프로그래밍은 단순히 Kernel을 작성하는 것을 넘어, GPU 하드웨어의 특성을 깊이 이해하고 이에 맞게 코드를 최적화하는 과정이다. NVIDIA의 공식 CUDA Programming Guide를 지속적으로 참고하면서, 실제 프로젝트에 적용해 보기를 권한다.

References

CUDA Programming Fundamentals: Complete Guide to GPU Parallel Computing

1. GPU vs CPU Architecture Differences
2. CUDA Programming Model: Grid, Block, Thread Hierarchy
3. CUDA Kernel Development and Execution Configuration
- 3.1 Kernel Function Definition
- 3.2 Execution Configuration
4. Thread Indexing: threadIdx, blockIdx, blockDim, gridDim
- 4.1 Built-in Variables
- 4.2 Global Thread ID Computation
5. GPU Memory Types
6. Memory Management API
- 6.1 Explicit Memory Management
- 6.2 Unified Memory
7. Warp Execution and Warp Divergence
- 7.1 The Concept of a Warp
- 7.2 Warp Divergence
  - Strategies for Minimizing Warp Divergence
8. Occupancy Concept and Optimization
9. Practical Examples
- 9.1 Vector Addition
- 9.2 Matrix Multiplication
  - Naive Version
  - Tiled Version (Using Shared Memory)
10. NVIDIA GPU Generation Characteristics (Compute Capability)
- Key Programming-Related Changes by Architecture
11. Debugging and Profiling Tools
12. Error Handling Best Practices
Conclusion
References
- Quiz

1. GPU vs CPU Architecture Differences

To understand CUDA programming, you must first grasp the fundamental architectural differences between GPUs and CPUs.

1.1 CPU: A Processor Optimized for Sequential Processing

The CPU (Central Processing Unit) is equipped with complex control flow, branch prediction, and large caches, making it optimized for sequential tasks. A typical high-performance CPU has 8 to 64 cores, each capable of independently executing complex instructions at high speed. Most of the CPU's transistors are dedicated to control logic and cache, focusing on maximizing single-thread execution speed.

1.2 GPU: A Processor Optimized for Massive Parallel Processing

In contrast, the GPU (Graphics Processing Unit) is equipped with thousands of small cores, specializing in massively parallel computation. NVIDIA GPUs are organized into units called Streaming Multiprocessors (SMs), each containing dozens to hundreds of CUDA Cores. Most of the GPU's transistors are dedicated to arithmetic logic units (ALUs), enabling simultaneous execution of thousands of threads.

Characteristic	CPU	GPU
Number of Cores	8-64 (high performance)	Thousands to tens of thousands
Core Characteristics	Complex and powerful	Simple and lightweight
Cache Size	Large (tens of MB)	Relatively small
Optimal Tasks	Sequential, complex branches	Massive data parallel processing
Memory Bandwidth	Relatively low	Very high (HBM)

1.3 SIMT Execution Model

NVIDIA GPUs use the SIMT (Single Instruction, Multiple Threads) execution model. SIMT is similar to SIMD (Single Instruction, Multiple Data), but with a key difference. In SIMD, the vector width is exposed to software, whereas SIMT specifies the execution and branching behavior of individual threads. Each thread has its own program counter and register state, and can logically follow independent execution paths.

The core of SIMT is warp-based execution. The GPU groups 32 threads into a single warp and executes the same instruction simultaneously. Maximum performance is achieved when all threads in a warp follow the same code path. When threads take different branches, performance degrades (known as warp divergence, discussed in detail later).

2. CUDA Programming Model: Grid, Block, Thread Hierarchy

The most fundamental concept in the CUDA programming model is the thread hierarchy. When a kernel function is invoked in CUDA, numerous threads are created and executed in parallel, organized into a three-level hierarchy: Grid, Block, and Thread.

2.1 Thread

A thread is the most basic unit of CUDA execution. Each thread executes one instance of the kernel code and has its own registers and local memory. Each thread determines which data it should process through its unique ID.

2.2 Thread Block (Block)

A Thread Block is a group of threads. Threads within the same block share the following characteristics:

Can share data through Shared Memory
Can synchronize via __syncthreads()
Execute on a single SM and do not migrate to another SM during execution
Can contain a maximum of 1024 threads (may vary by Compute Capability)

Thread Blocks can be organized in 1D, 2D, or 3D, allowing natural indexing for vector, matrix, and volume data.

2.3 Grid

A Grid is a collection of Thread Blocks. A single kernel invocation creates one Grid. Grids can also be organized in 1D, 2D, or 3D. Direct data sharing via Shared Memory between different blocks is not possible, and synchronization is limited (special APIs such as Cooperative Groups must be used).

2.4 Thread Block Cluster (Compute Capability 9.0 and above)

Starting with the NVIDIA Hopper architecture (Compute Capability 9.0), an optional layer called Thread Block Cluster was added. A cluster consists of multiple Thread Blocks, and blocks within the same cluster execute on the same GPC (GPU Processing Cluster), enabling access to each other's Shared Memory through Distributed Shared Memory.

Grid
 +-- Block Cluster (optional, CC 9.0+)
      +-- Thread Block (up to 1024 Threads)
           +-- Thread (individual execution unit)

3. CUDA Kernel Development and Execution Configuration

3.1 Kernel Function Definition

In CUDA, functions that execute on the GPU are called kernels. Kernel functions are defined using the __global__ qualifier, and the return type must be void.

__global__ void myKernel(int *data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] = data[idx] * 2;
    }
}

CUDA has three function qualifiers:

Qualifier	Execution Location	Call Location
`__global__`	GPU (Device)	CPU (Host) or GPU
`__device__`	GPU (Device)	GPU (Device)
`__host__`	CPU (Host)	CPU (Host)

Using __host__ and __device__ together compiles the function for both host and device.

3.2 Execution Configuration

When calling a kernel, the Grid and Block dimensions are specified using the <<<gridDim, blockDim>>> syntax.

// 1D configuration
int N = 1024;
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, N);

// 2D configuration
dim3 blockDim(16, 16);      // 16x16 = 256 threads per block
dim3 gridDim(64, 64);       // 64x64 blocks
matMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

// Shared Memory size and Stream specification
myKernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args);

The full form of the <<<>>> syntax is <<<gridDim, blockDim, sharedMemBytes, stream>>>. The third argument is the size in bytes of dynamically allocated Shared Memory, and the fourth is the CUDA Stream. If omitted, they default to 0 and the default stream, respectively.

Important: There is a hardware limit on the number of threads per block. Currently, all NVIDIA GPUs support a maximum of 1024 threads per block. Exceeding this value will prevent the kernel from launching.

4. Thread Indexing: threadIdx, blockIdx, blockDim, gridDim

Inside a kernel, each thread uses built-in variables to determine its position. These variables are of type uint3 or dim3, with .x, .y, .z members.

4.1 Built-in Variables

Variable	Description
`threadIdx`	Thread index within the block (starting from 0)
`blockIdx`	Block index within the grid (starting from 0)
`blockDim`	Block dimensions (number of threads in the block)
`gridDim`	Grid dimensions (number of blocks in the grid)
`warpSize`	Warp size (currently always 32)

4.2 Global Thread ID Computation

For a 1D grid with 1D blocks:

int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

For a 2D grid with 2D blocks:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int globalIdx = row * width + col;

The same pattern extends to 3D:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;

Boundary checks are essential. When the data size is not an exact multiple of the block size, some threads in the last block may fall outside the valid range, so boundary checking must always be performed:

if (globalIdx < N) {
    // Perform computation only for valid indices
    output[globalIdx] = input[globalIdx] * 2;
}

5. GPU Memory Types

CUDA GPUs provide multiple memory spaces, each with different access speeds, sizes, and visibility (scope). Choosing the right memory is the key to CUDA program optimization.

5.1 Register

Location: On-chip (inside the SM)
Visibility: Private to each thread
Speed: Fastest (1 cycle latency)
Size: 64K 32-bit registers per SM, up to 255 per thread

Local variables within a kernel are allocated to registers by default. High register usage reduces the number of threads that can simultaneously reside on an SM, decreasing occupancy.

5.2 Local Memory

Location: Off-chip (Device Memory, same physical location as Global Memory)
Visibility: Private to each thread
Speed: Same as Global Memory (slow, hundreds of cycles)
Purpose: Local variables that don't fit in registers (register spill), large arrays

Despite the name "Local," it is actually located off-chip, so access speed is slow. The compiler automatically spills to Local Memory when registers are insufficient.

5.3 Shared Memory

Location: On-chip (inside the SM, shares physical space with L1 Cache)
Visibility: All threads within the same Thread Block
Speed: Close to registers (~5 cycles without bank conflicts)
Size: Typically 48KB-164KB per SM (varies by architecture)

__global__ void sharedMemExample(float *data) {
    __shared__ float sharedData[256];  // Static allocation

    int tid = threadIdx.x;
    sharedData[tid] = data[blockIdx.x * blockDim.x + tid];

    __syncthreads();  // Synchronize all threads in the block

    // Operations using sharedData
    float result = sharedData[tid] + sharedData[255 - tid];
    data[blockIdx.x * blockDim.x + tid] = result;
}

Shared Memory is divided into banks (typically 32), and when multiple threads access the same bank simultaneously, a bank conflict occurs, causing serialization. Designing access patterns that avoid bank conflicts is important.

5.4 Global Memory

Location: Off-chip (Device DRAM, i.e., HBM or GDDR)
Visibility: All threads + host
Speed: Slowest (hundreds of cycles latency)
Size: Largest (several GB to tens of GB)

Memory allocated with cudaMalloc() is Global Memory. Access speed can be improved through L1/L2 caches, and bandwidth should be maximized through coalesced access (adjacent threads accessing contiguous memory addresses).

5.5 Constant Memory

Location: Off-chip (Global Memory region), cached via dedicated cache
Visibility: All threads (read-only)
Speed: Very fast on cache hit
Size: 64KB

__constant__ float constData[256];

// Set values from host
cudaMemcpyToSymbol(constData, hostData, sizeof(float) * 256);

Delivers best performance when all threads in a warp read the same address. Through a broadcast mechanism, a single memory read delivers the value to all threads in the warp.

5.6 Texture Memory

Location: Off-chip, cached via dedicated Texture Cache
Visibility: All threads (read-only)
Features: Optimized for 2D spatial locality, hardware interpolation support

Texture Memory is advantageous for access patterns involving spatially adjacent data in 2D/3D datasets such as image processing. In modern CUDA, it is used alongside Surface Objects.

5.7 Memory Hierarchy Summary

Fast <--------------------------------------------> Slow
Register > Shared Memory > L1/L2 Cache > Constant/Texture Cache > Global Memory
(On-chip)   (On-chip)      (On-chip)     (Cached)                 (Off-chip)

6. Memory Management API

6.1 Explicit Memory Management

The traditional CUDA memory management approach explicitly separates and manages host and device memory.

cudaMalloc: Device Memory Allocation

float *d_array;
cudaMalloc((void **)&d_array, N * sizeof(float));

cudaMalloc allocates memory in the GPU's Global Memory. The allocated pointer (d_array) is only valid on the device and cannot be directly dereferenced on the host.

cudaMemcpy: Host-Device Data Transfer

// Host -> Device
cudaMemcpy(d_array, h_array, N * sizeof(float), cudaMemcpyHostToDevice);

// Device -> Host
cudaMemcpy(h_result, d_result, N * sizeof(float), cudaMemcpyDeviceToHost);

// Device -> Device
cudaMemcpy(d_dest, d_src, N * sizeof(float), cudaMemcpyDeviceToDevice);

cudaMemcpy is a synchronous function that blocks the host thread until the transfer completes. For asynchronous transfers, use cudaMemcpyAsync with a CUDA Stream.

cudaFree: Device Memory Deallocation

cudaFree(d_array);

Typical CUDA Program Flow

// 1. Allocate host memory and initialize data
float *h_input = (float *)malloc(N * sizeof(float));
float *h_output = (float *)malloc(N * sizeof(float));
initializeData(h_input, N);

// 2. Allocate device memory
float *d_input, *d_output;
cudaMalloc(&d_input, N * sizeof(float));
cudaMalloc(&d_output, N * sizeof(float));

// 3. Transfer data Host -> Device
cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);

// 4. Launch kernel
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(d_input, d_output, N);

// 5. Transfer results Device -> Host
cudaMemcpy(h_output, d_output, N * sizeof(float), cudaMemcpyDeviceToHost);

// 6. Free memory
cudaFree(d_input);
cudaFree(d_output);
free(h_input);
free(h_output);

6.2 Unified Memory

Unified Memory, introduced in CUDA 6.0, allows the host and device to share a single address space. Data movement is managed automatically by the CUDA runtime.

float *data;
cudaMallocManaged(&data, N * sizeof(float));

// Accessible from host
for (int i = 0; i < N; i++) {
    data[i] = (float)i;
}

// Same pointer used on device
myKernel<<<gridSize, blockSize>>>(data, N);
cudaDeviceSynchronize();

// Access results from host (after synchronization)
printf("Result: %f\n", data[0]);

cudaFree(data);  // cudaMallocManaged also freed with cudaFree

Advantages of Unified Memory:

No cudaMemcpy calls needed, resulting in more concise code
Same pointer used on both host and device
Runtime handles data migration on-demand
Can process data exceeding device memory capacity (oversubscription)

However, in terms of performance, programs optimized to overlap kernel execution and data transfer using Streams and cudaMemcpyAsync can achieve higher performance than programs using only Unified Memory. cudaMemPrefetchAsync can be used to improve Unified Memory performance.

7. Warp Execution and Warp Divergence

7.1 The Concept of a Warp

A warp is the basic scheduling unit of execution in CUDA. The SM's warp scheduler groups threads into groups of 32 to form a warp and issues instructions at the warp level. The 32 threads in the same warp share the same program counter (PC) and execute the same instruction simultaneously.

Within a Thread Block, warps are formed in order of thread ID:

Warp 0: Threads 0-31
Warp 1: Threads 32-63
Warp 2: Threads 64-95
... and so on

7.2 Warp Divergence

Warp divergence occurs when threads within a warp follow different paths at conditional branches (if, switch, for, etc.).

__global__ void divergentKernel(int *data, int *result) {
    int tid = threadIdx.x;

    // Warp Divergence occurs!
    if (tid % 2 == 0) {
        result[tid] = data[tid] * 2;      // Even threads
    } else {
        result[tid] = data[tid] + 10;     // Odd threads
    }
}

In the above code, even-numbered and odd-numbered threads within a warp take different branches. The GPU handles this by executing each branch path sequentially, disabling threads that don't belong to the current path. As a result, the execution times of both branches are summed, degrading performance.

Strategies for Minimizing Warp Divergence

// BAD: Branching within the same warp
if (threadIdx.x % 2 == 0) { ... }

// BETTER: Design branching to occur at warp boundaries
if (threadIdx.x / 32 % 2 == 0) { ... }
// or
if (blockIdx.x % 2 == 0) { ... }

The core principle is to design code so that threads within the same warp follow the same code path. If the branching condition aligns with warp boundaries, each entire warp executes only one path, preventing divergence.

8. Occupancy Concept and Optimization

8.1 What is Occupancy?

Occupancy is the ratio of active warps to the maximum number of warps that can be simultaneously active on an SM.

Occupancy = Active Warps / Maximum Warps per SM

For example, if an SM supports a maximum of 64 warps and 32 warps are actually active, the occupancy is 50%.

8.2 Factors Affecting Occupancy

Occupancy is determined by three resources:

Register usage: More registers per thread means fewer total threads that can reside on the SM. With 64K 32-bit registers per SM, using 128 registers per thread allows only 512 threads (16 warps) on the SM.
Shared Memory usage: More Shared Memory per block means fewer blocks that can be simultaneously placed on the SM.
Block size (number of threads): There is a maximum block count limit per SM (e.g., 32 for Ampere CC 8.0, 16 for CC 8.6), so blocks that are too small may hit this limit and fail to fill the SM's thread capacity.

8.3 Occupancy Optimization Methods

Choose block sizes of 128, 256, or 512: Empirically, 256 is a good starting point. Too small (32, 64) hits the per-SM block count limit, while too large (1024) increases resource requirements and may lower occupancy.
Control register usage: Use the __launch_bounds__ qualifier or the compiler option -maxrregcount to limit registers per kernel.

__global__ void __launch_bounds__(256, 4)  // maxThreadsPerBlock, minBlocksPerSM
myKernel(float *data) {
    // ...
}

Use the CUDA Occupancy Calculator: Use the NVIDIA-provided Occupancy Calculator spreadsheet or the cudaOccupancyMaxPotentialBlockSize() API to automatically determine the optimal block size.

int blockSize;
int minGridSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

Note: High occupancy does not always guarantee high performance. Memory access patterns, instruction-level parallelism (ILP), Shared Memory utilization, and other factors collectively affect performance.

9. Practical Examples

9.1 Vector Addition

The most basic CUDA example: vector addition. It adds corresponding elements of two arrays and stores the result.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20;  // Approximately 1 million elements
    size_t size = N * sizeof(float);

    // Allocate host memory
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // Initialize data
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Transfer Host -> Device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Transfer Device -> Host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verification
    for (int i = 0; i < N; i++) {
        if (fabs(h_C[i] - 3.0f) > 1e-5) {
            fprintf(stderr, "Verification failed at index %d!\n", i);
            return -1;
        }
    }
    printf("Vector addition successful!\n");

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

Key Points:

Each thread processes one element (1:1 mapping)
(N + threadsPerBlock - 1) / threadsPerBlock performs ceiling division to cover all elements
Boundary check (if (idx < N)) prevents out-of-bounds access

9.2 Matrix Multiplication

Matrix multiplication is a representative CUDA optimization example. We first look at the naive version, then cover the tiled version using Shared Memory.

Naive Version

__global__ void matMulNaive(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Launch
dim3 blockDim(16, 16);
dim3 gridDim((N + 15) / 16, (N + 15) / 16);
matMulNaive<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

The naive version has each thread computing one element of the result matrix while reading an entire row of matrix A and an entire column of matrix B from Global Memory. This results in O(N^3) Global Memory accesses, which is highly inefficient.

Tiled Version (Using Shared Memory)

#define TILE_SIZE 16

__global__ void matMulTiled(float *A, float *B, float *C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Iterate over tiles
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load tile into Shared Memory
        if (row < N && (t * TILE_SIZE + threadIdx.x) < N)
            tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        else
            tileA[threadIdx.y][threadIdx.x] = 0.0f;

        if ((t * TILE_SIZE + threadIdx.y) < N && col < N)
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tileB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();

        // Multiply-add within the tile
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

Key ideas behind tiled matrix multiplication:

Data from Global Memory is divided into TILE_SIZE x TILE_SIZE tiles and loaded into Shared Memory
Reads from Shared Memory are approximately 100x faster than from Global Memory
Multiple threads within the block reuse the same data, greatly reducing Global Memory accesses
__syncthreads() ensures all threads have completed tile loading before computation begins

10. NVIDIA GPU Generation Characteristics (Compute Capability)

Compute Capability (CC) is a version number indicating a GPU's hardware features and specifications. The major version represents the architecture generation, and the minor version represents improvements within a generation.

CC	Architecture	Representative GPU	Key Features
3.x	Kepler	GTX 680, K40	Dynamic Parallelism, Hyper-Q
5.x	Maxwell	GTX 980, M40	Energy efficiency improvements, SM redesign
6.x	Pascal	GTX 1080, P100	HBM2, NVLink, FP16 support
7.0	Volta	V100	1st gen Tensor Core, Independent Thread Scheduling
7.5	Turing	RTX 2080, T4	RT Core, INT8/INT4 Tensor Core
8.0	Ampere	A100	3rd gen Tensor Core, TF32, BF16, Sparsity
8.6	Ampere	RTX 3090	Max 48 Warps per SM (vs 64 for 8.0)
8.9	Ada Lovelace	RTX 4090, L40	4th gen Tensor Core, FP8, Shader Execution Reordering
9.0	Hopper	H100	Thread Block Cluster, Transformer Engine, DPX, FP8
10.0	Blackwell	B200, GB200	5th gen Tensor Core, FP4/FP6, 208B transistors, HBM3e

Volta (CC 7.0): Independent Thread Scheduling was introduced, allowing threads within a warp to branch more flexibly. Earlier architectures relied on implicit lock-step synchronization within warps, but starting from Volta, explicit synchronization (__syncwarp()) is required.

Hopper (CC 9.0): The Thread Block Cluster concept was added, enabling multiple Thread Blocks to cooperate through Distributed Shared Memory. The TMA (Tensor Memory Accelerator) unit for asynchronous data movement was also introduced.

Blackwell (CC 10.0): The 5th generation Tensor Core natively supports FP4 and FP6 precision, using micro-tensor formats and dynamic range scaling. AI computation performance is significantly improved over Hopper, delivering up to 20 PFLOPS of AI compute performance.

11. Debugging and Profiling Tools

11.1 cuda-gdb

cuda-gdb is a CUDA extension of GNU GDB that can debug CUDA applications running on actual GPU hardware.

Key features:

Setting breakpoints within GPU kernels
Querying state at the thread, block, and warp level
Inspecting device memory contents
Simultaneously debugging host and device code

# Compile with debug information
nvcc -g -G -o myapp myapp.cu

# Start debugging with cuda-gdb
cuda-gdb ./myapp

# Inside cuda-gdb
(cuda-gdb) break myKernel
(cuda-gdb) run
(cuda-gdb) cuda thread       # Current thread info
(cuda-gdb) cuda block        # Current block info
(cuda-gdb) info cuda threads # List all CUDA threads

11.2 Compute Sanitizer

Compute Sanitizer is a suite of tools for checking the functional correctness of CUDA programs, included in the CUDA Toolkit. It provides four sub-tools:

Tool	Function
memcheck	Detects memory access errors (out-of-bounds, misaligned)
racecheck	Detects Shared Memory data races
initcheck	Detects uninitialized Global Memory accesses
synccheck	Detects thread synchronization errors (misuse of `__syncthreads()`, etc.)

# Run memcheck
compute-sanitizer --tool memcheck ./myapp

# Run racecheck
compute-sanitizer --tool racecheck ./myapp

# Run initcheck
compute-sanitizer --tool initcheck ./myapp

11.3 NVIDIA Nsight

NVIDIA Nsight is a suite of tools providing integrated development, debugging, and profiling environments.

Nsight Systems: Analyzes system-wide performance. Visually displays CPU-GPU timelines, kernel execution times, memory transfers, and API calls. Used as the first step to identify overall bottlenecks.
Nsight Compute: Analyzes detailed performance metrics of individual CUDA kernels. Examines occupancy, memory bandwidth utilization, instruction throughput, warp state, and more in fine detail. Used for optimizing specific kernels.
Nsight Visual Studio Edition / VS Code Extension: Provides CUDA debugging and profiling support integrated into the IDE.

# Profile with Nsight Systems
nsys profile --stats=true ./myapp

# Analyze kernels with Nsight Compute
ncu --set full ./myapp

Debugging/Profiling Workflow

Verify functional correctness: Use Compute Sanitizer (memcheck, racecheck) to catch memory errors and race conditions first
System-level analysis: Use Nsight Systems to identify overall bottlenecks (CPU-GPU synchronization, memory transfers, etc.)
Kernel-level optimization: Use Nsight Compute to analyze detailed performance metrics of bottleneck kernels and optimize

12. Error Handling Best Practices

Most CUDA API calls return an error code of type cudaError_t. In production code, errors must always be checked.

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = call;                                             \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
                    __FILE__, __LINE__, cudaGetErrorString(err));            \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage examples
CUDA_CHECK(cudaMalloc(&d_array, size));
CUDA_CHECK(cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice));

// Error checking after kernel launch
myKernel<<<gridSize, blockSize>>>(d_array, N);
CUDA_CHECK(cudaGetLastError());       // Kernel launch error
CUDA_CHECK(cudaDeviceSynchronize());  // Error during kernel execution

Since kernel calls are asynchronous, cudaGetLastError() checks for errors at launch time, and cudaDeviceSynchronize() checks for errors that occurred during kernel execution.

Conclusion

CUDA programming is a core technology for leveraging the massive parallel processing capabilities of GPUs. Here is a summary of what we covered:

GPU Architecture: Thousands of lightweight cores operate in the SIMT model, optimized for data parallel processing
Thread Hierarchy: Parallel work is organized into a three-level hierarchy of Grid, Block, and Thread
Memory Hierarchy: You must understand and appropriately utilize the characteristics of various memory spaces including Register, Shared, and Global
Warp and Occupancy: Minimize warp divergence and consider occupancy to optimize execution configuration
Debugging Tools: Use Compute Sanitizer and Nsight Systems/Compute for systematic debugging and optimization

CUDA programming goes beyond simply writing kernels; it involves deeply understanding GPU hardware characteristics and optimizing code accordingly. We recommend continuously referencing NVIDIA's official CUDA Programming Guide while applying these concepts to real projects.

References

Quiz

Q1: What is the main topic covered in "CUDA Programming Fundamentals: Complete Guide to GPU Parallel Computing"?

Systematically analyze the core concepts of CUDA programming including thread hierarchy, memory model, and kernel development based on NVIDIA official documentation.

Q2: What is 1 CPU: A Processor Optimized for Sequential Processing?

The CPU (Central Processing Unit) is equipped with complex control flow, branch prediction, and large caches, making it optimized for sequential tasks. A typical high-performance CPU has 8 to 64 cores, each capable of independently executing complex instructions at high speed.

Q3: Explain the core concept of 2 GPU: A Processor Optimized for Massive Parallel Processing.

In contrast, the GPU (Graphics Processing Unit) is equipped with thousands of small cores, specializing in massively parallel computation. NVIDIA GPUs are organized into units called Streaming Multiprocessors (SMs), each containing dozens to hundreds of CUDA Cores.

Q4: What are the key aspects of 3 SIMT Execution Model?

NVIDIA GPUs use the SIMT (Single Instruction, Multiple Threads) execution model. SIMT is similar to SIMD (Single Instruction, Multiple Data), but with a key difference.

Q5: How does 1 Thread work?